Monitor and optimize Fibre Channel SAN performance

Tune Up

Optimizing Array Performance

For a storage array, the load on the front-end processors, the cache write pending rate, and the response times of all LUNs presented to the servers are important values to monitor. For the LUN response times, however, you need to differentiate between random and sequential access, because the block sizes of the two access types differ considerably. For example, a sequential I/O operation within a storage array often takes far longer than a random one because of its larger block size, and this difference is reflected in the response time.
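The effect of block size on response time can be illustrated with a rough calculation. The following is a minimal sketch, not a measurement: the 8KiB and 1MiB block sizes and the 800MBps link rate (roughly a single 8Gbps Fibre Channel link) are assumptions chosen for illustration, and the function `transfer_time_ms` is hypothetical. It considers only the time to move the data and ignores seek, cache, and queuing effects.

```python
def transfer_time_ms(block_size_bytes: int, throughput_mb_per_s: float) -> float:
    """Time in milliseconds to move one block at the given rate (1 MB = 10**6 bytes)."""
    return block_size_bytes / (throughput_mb_per_s * 1_000_000) * 1000


# Assumed values for illustration only
LINK_MB_PER_S = 800.0            # roughly one 8Gbps Fibre Channel link
RANDOM_BLOCK = 8 * 1024          # 8KiB, a typical small random I/O
SEQUENTIAL_BLOCK = 1024 * 1024   # 1MiB, a typical large sequential I/O

random_ms = transfer_time_ms(RANDOM_BLOCK, LINK_MB_PER_S)
sequential_ms = transfer_time_ms(SEQUENTIAL_BLOCK, LINK_MB_PER_S)

print(f"random:     {random_ms:.5f} ms per I/O")
print(f"sequential: {sequential_ms:.5f} ms per I/O")
print(f"ratio:      {sequential_ms / random_ms:.0f}x")
```

Under these assumptions, the pure transfer time of the large block is 128 times that of the small one, which is why a single response-time limit for both access types can be misleading.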

Many of the values in the storage array differ depending on the system architecture and cannot be set across the board; you will need to contact the vendor to find out at which utilization level a component's performance is likely to be impaired and inquire about further critical measuring points, as well. Various vendor tools offer preset limits based on best practices, which can also be adapted to your own requirements.

Additionally, when planning the growth of your environment, make sure that if a central component (e.g., an HBA on the server, a SAN switch, or a cache or processor board) fails, the storage array can continue to work without problems and the failure does not lead to a massive impairment of operations or even to outages of individual applications.

Equipped for Emergencies

Even if you are familiar with the SAN infrastructure and have set up appropriate monitoring at key points (Table 1), performance bottlenecks cannot be completely ruled out. A component failure, a driver problem, or a faulty Fibre Channel cable can cause sudden problems. If such an incident occurs and important applications are affected, it is important to gain a quick overview of the essential performance parameters of the infrastructure. Therefore, it is very helpful if you have the relevant values from unrestricted normal operation as a baseline to compare with the current values of the problem situation.

Table 1

Key Fibre Channel SAN Performance Parameters

Parameter	Measuring Point	Recommended Value
SAN-ISL port buffer-to-buffer zero counter	ISL ports on SAN switch or director	<1,000,000 within 5 minutes
Server LUN queue depth	HBA driver	1-32, depending on the number of LUNs at the front-end port
SAN-ISL data throughput	ISL ports on SAN switch or director	<80% of maximum data throughput
Server I/O response time (average)	Server operating system, volume manager	<10ms
Storage system processor utilization	Storage system	<70%-80%
Storage system front-end port data throughput	Storage system	<80%-90%
Storage system cache write pending rate	Storage system	<30%
Storage system LUN I/O service time (average)	Storage system	<10ms
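
The limits in Table 1 lend themselves to a simple automated check. The following minimal sketch encodes them in a dictionary and reports every measured value that reaches its limit. The key names, the `Limit` class, the `find_violations` function, and the sample values are hypothetical; where Table 1 gives a range, the sketch assumes the lower, more conservative bound. The server LUN queue depth is left out because it is a setting rather than a measured upper limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    label: str      # parameter name as used in Table 1
    maximum: float  # measured value should stay below this
    unit: str


# Limits taken from Table 1; key names are illustrative assumptions
LIMITS = {
    "isl_bb_zero_per_5min": Limit("SAN-ISL port buffer-to-buffer zero counter", 1_000_000, "per 5 min"),
    "isl_throughput_pct": Limit("SAN-ISL data throughput", 80, "%"),
    "server_io_response_ms": Limit("Server I/O response time (average)", 10, "ms"),
    "array_processor_pct": Limit("Storage system processors", 70, "%"),
    "frontend_port_throughput_pct": Limit("Storage system front-end port data throughput", 80, "%"),
    "cache_write_pending_pct": Limit("Storage system cache write pending rate", 30, "%"),
    "lun_service_time_ms": Limit("Storage system LUN I/O service time (average)", 10, "ms"),
}


def find_violations(sample: dict) -> list:
    """Return one message per measured value that reaches or exceeds its limit."""
    violations = []
    for key, value in sample.items():
        limit = LIMITS.get(key)  # unknown keys are ignored
        if limit is not None and value >= limit.maximum:
            violations.append(
                f"{limit.label}: {value} {limit.unit} "
                f"(limit <{limit.maximum} {limit.unit})"
            )
    return violations


# Invented sample values for illustration
sample = {
    "isl_throughput_pct": 85.0,
    "cache_write_pending_pct": 12.0,
    "server_io_response_ms": 4.2,
}
for message in find_violations(sample):
    print(message)
```

Static limits of this kind complement, rather than replace, the comparison of current values against a baseline from normal operation.
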

A comparison of the current values with the baseline from normal operation would reveal, for example, whether performance-hungry servers or applications are suddenly generating 30 percent more I/O operations after software updates and affecting other servers in the same environment as noisy neighbors, or whether I/O operations can no longer be processed by individual connections because of defective components or cables. However, you need to gain experience in handling and interpreting the performance indicators from your monitoring tools to be sufficiently prepared for genuine problems. Storage is often mistakenly suspected of being the source of performance problems.
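
A baseline comparison of this kind can be sketched in a few lines. The example below is a minimal illustration, assuming that baseline and current values are available as dictionaries keyed by metric name; the function names, the metric names, the I/O figures, and the 25 percent tolerance are all hypothetical.

```python
def relative_change(baseline: float, current: float) -> float:
    """Relative deviation of the current value from the baseline (0.3 = +30 percent)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def flag_deviations(baseline: dict, current: dict, tolerance: float = 0.25) -> dict:
    """Return metrics whose relative change exceeds the tolerance in either direction."""
    flagged = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # no current measurement for this metric
        change = relative_change(base_value, current[name])
        if abs(change) > tolerance:
            flagged[name] = change
    return flagged


# Invented I/O rates (operations per second) for illustration
baseline_iops = {"server_a": 10_000, "server_b": 4_000}
current_iops = {"server_a": 13_000, "server_b": 4_100}

for name, change in flag_deviations(baseline_iops, current_iops).items():
    print(f"{name}: {change:+.0%} compared with baseline")
```

In this invented data set, only `server_a` is flagged, because its I/O rate has risen by 30 percent, whereas `server_b` stays within the tolerance. A sharp drop is flagged in the same way, which helps to spot connections that no longer carry I/O because of a defective component or cable.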

If you can make a well-founded and verifiable statement about the load situation of your SAN environment within a few minutes and precisely put your finger on the overload situation and its causes – or provide contrary evidence, backed up with well-founded figures that help to discover where the problem is arising – you will leave observers with a positive impression.

Conclusions

Given compliance with a few important rules and monitoring in the right places, even large Fibre Channel storage networks can be operated with great performance and stability. If you give priority to the most important applications at a suitable point, you can keep them available even in the event of a problem. If you are also trained in the use of performance tools and have the values from normal operation as a reference, the causes of performance problems can often be identified very quickly.
