When I/O Workloads Don’t Perform
Isolating and diagnosing the root causes of your performance troubles.
Every now and then, you find yourself in a situation where you expect better performance from your data storage drives. Either they once performed very well and one day just stopped, or they came straight out of the box underperforming. I explore a few of the reasons why this might happen.
Sometimes, the easiest and quickest way to determine the root cause of a slow drive is to check its local logging data. The method by which this log data is stored differs by drive type, but in the end, the results are generally the same. For instance, a SCSI-based drive such as a Serial Attached SCSI (SAS) drive collects drive log data and general metrics in the SCSI log pages (plural because each page separates the collected data into its respective category). The easiest way to access this data is with the sg3_utils package available for Linux. To find out which pages the drive supports, execute the sg_logs binary against the SAS drive or SCSI generic device you are interested in (Listing 1).
Listing 1: sg_logs
$ sudo sg_logs /dev/sdc
    SEAGATE   ST14000NM0001   K001
Supported log pages  [0x0]:
    0x00        Supported log pages [sp]
    0x02        Write error [we]
    0x03        Read error [re]
    0x05        Verify error [ve]
    0x06        Non medium [nm]
    0x08        Format status [fs]
    0x0d        Temperature [temp]
    0x0e        Start-stop cycle counter [sscc]
    0x0f        Application client [ac]
    0x10        Self test results [str]
    0x15        Background scan results [bsr]
    0x18        Protocol specific port [psp]
    0x1a        Power condition transitions [pct]
    0x2f        Informational exceptions [ie]
    0x37        Cache (seagate) [c_se]
    0x38
    0x3e        Factory (seagate) [f_se]
As you can see, you can observe write error counts, read error counts, drive temperature, and more. To view a specific page, use the -p parameter followed by the page number. For instance, look at the log page for write errors (i.e., 0x2; Listing 2).
Listing 2: Log Page for Write Errors
$ sudo sg_logs -p 0x2 /dev/sdc
    SEAGATE   ST14000NM0001   K001
Write error counter page  [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 3951500537856
  Total uncorrected errors = 0
This drive does not appear to have any write errors (either corrected or uncorrected by the drive firmware), so it looks to be in good shape. Typically, if you see errors, especially of the uncorrected type, the printout will include failing logical block addresses (LBAs). If failed LBA regions (i.e., sectors) were listed under the read error category, they would likely be in a pending reallocation state (waiting for a future write to the same address). A sector pending reallocation is a sector that can no longer be read from or written to reliably and must be reallocated elsewhere on the disk drive. This reallocation will only happen on the next write operation to that failed sector, and only if the drive still has spare sectors it can use to relocate the data. A failing sector or a sector pending reallocation by the drive's firmware will affect overall drive performance, and if enough of them accumulate, it is highly recommended that you replace the drive as soon as possible.
Also understand that if a log page starts to list a significant count of corrected read or write errors, chances are the disk drive's surrounding environment is at fault. For instance, vibration will often cause a disk drive's head to misread or miswrite a length of sectors on a drive track, which results in the firmware taking action to correct it. This correction process alone introduces unwanted I/O latencies (reducing drive performance).
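If you suspect the environment is at fault, one simple check is to sample the error counter pages repeatedly while the workload runs and watch whether the corrected counts keep climbing; a minimal sketch using the read error page (the device name and 60-second interval are only examples):

$ sudo watch -n 60 "sg_logs -p 0x3 /dev/sdc | grep -i corrected"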
If you’d like to list all of the log pages at once, use the -a parameter (Listing 3). (Warning: You will get a lot of information.)
Listing 3: List All Log Pages
$ sudo sg_logs -a /dev/sdc
    SEAGATE   ST14000NM0001   K001
Supported log pages  [0x0]:
    0x00        Supported log pages [sp]
    0x02        Write error [we]
    0x03        Read error [re]
    0x05        Verify error [ve]
    0x06        Non medium [nm]
    0x08        Format status [fs]
    0x0d        Temperature [temp]
    0x0e        Start-stop cycle counter [sscc]
    0x0f        Application client [ac]
    0x10        Self test results [str]
    0x15        Background scan results [bsr]
    0x18        Protocol specific port [psp]
    0x1a        Power condition transitions [pct]
    0x2f        Informational exceptions [ie]
    0x37        Cache (seagate) [c_se]
    0x38
    0x3e        Factory (seagate) [f_se]
Write error counter page  [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 3951500537856
  Total uncorrected errors = 0
  Reserved or vendor specific [0xf800] = 0
  Reserved or vendor specific [0xf801] = 0
  Reserved or vendor specific [0xf802] = 0
  Reserved or vendor specific [0xf803] = 0
  Reserved or vendor specific [0xf804] = 0
  Reserved or vendor specific [0xf805] = 0
  Reserved or vendor specific [0xf806] = 0
  Reserved or vendor specific [0xf807] = 0
  Reserved or vendor specific [0xf810] = 0
  Reserved or vendor specific [0xf811] = 0
  Reserved or vendor specific [0xf812] = 0
  Reserved or vendor specific [0xf813] = 0
  Reserved or vendor specific [0xf814] = 0
  Reserved or vendor specific [0xf815] = 0
  Reserved or vendor specific [0xf816] = 0
  Reserved or vendor specific [0xf817] = 0
  Reserved or vendor specific [0xf820] = 0
Read error counter page  [0x3]
  Errors corrected without substantial delay = 0
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 35801804845056
  Total uncorrected errors = 0
Verify error counter page  [0x5]
  Errors corrected without substantial delay = 0
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 0
  Total uncorrected errors = 0
Non-medium error page  [0x6]
  Non-medium error count = 0
Format status page  [0x8]
  Format data out:
  Grown defects during certification
  Total blocks reassigned during format
  Total new blocks reassigned
  Power on minutes since format
Temperature page  [0xd]
  Current temperature = 28 C
  Reference temperature = 60 C
Start-stop cycle counter page  [0xe]
  Date of manufacture, year: 2019, week: 26
  Accounting date, year: , week:
  Specified cycle count over device lifetime = 50000
  Accumulated start-stop cycles = 498
  Specified load-unload count over device lifetime = 600000
  Accumulated load-unload cycles = 553
[ … ]
Other tools (e.g., smartctl) can extract similar, and sometimes identical, data from a SAS drive. If a drive supports the industry-standard Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.), you can use the smartmontools package and, more specifically, the smartctl binary (Listing 4).
Listing 4: smartctl on SAS Drive
$ sudo smartctl -a /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-66-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST14000NM0001
Revision:             K001
Compliance:           SPC-5
User Capacity:        7,000,259,821,568 bytes [7.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x6000c500a7b3ceeb0000000000000000
Serial number:        ZKL00CYG0000G925020A
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sun Mar 21 15:00:00 2021 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification
Total blocks reassigned during format
Total new blocks reassigned
Power on minutes since format
Current Drive Temperature:     28 C
Drive Trip Temperature:        60 C

Manufactured in week 26 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  498
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  553
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 150743545
  Blocks received from initiator = 964465354
  Blocks read from cache and sent to initiator = 1014080851
  Number of read and write commands whose size <= segment size = 8318611
  Number of read and write commands whose size > segment size = 12

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 264.67
  number of minutes until next internal SMART test = 48

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/     errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      35801.805           0
write:         0        0         0         0          0       3951.501           0

Non-medium error count:        0

No Self-tests have been logged
The smartmontools package is most beneficial for Serial ATA (SATA) drives, because most SATA drives support the feature out of the box. Note that the S.M.A.R.T. output on SATA drives, both the type of data and the way it is formatted, differs from that of its SAS counterpart (Listing 5).
Listing 5: smartctl on SATA Drive
$ sudo smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-66-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES (SATA 6Gb/s)
Device Model:     ST500NM0011
Serial Number:    Z1M11WAJ
LU WWN Device Id: 5 000c50 04edcb79a
Add. Product Id:  DELL(tm)
Firmware Version: PA08
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Mar 21 15:00:38 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  609) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  75) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   077   063   ---    Pre-fail  Always       -       56770409
  3 Spin_Up_Time            0x0003   096   092   ---    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   ---    Old_age   Always       -       137
  5 Reallocated_Sector_Ct   0x0033   100   100   ---    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   ---    Pre-fail  Always       -       5578572
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       400
 10 Spin_Retry_Count        0x0013   100   099   ---    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       135
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   ---    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   058   ---    Old_age   Always       -       28 (Min/Max 22/28)
191 G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   ---    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   100   100   ---    Old_age   Always       -       487
194 Temperature_Celsius     0x0022   028   042   ---    Old_age   Always       -       28 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   113   099   ---    Old_age   Always       -       56770409
197 Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       344 (218 109 0)
241 Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       1389095282
242 Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       619165492

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         2         -
# 2  Extended offline    Completed without error       00%         2         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
For the most part, the information is the same. For instance, when you look at the drive attributes category, attribute 197, Current_Pending_Sector, is the same sector-pending-reallocation counter discussed earlier. Again, you can gather drive temperature information, lifetime hours, and more.
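The smartctl binary can also kick off a drive's built-in self-tests, which is a quick way to confirm a suspicion before pulling a drive; for example (the device name is a placeholder):

$ sudo smartctl -t short /dev/sda
$ sudo smartctl -l selftest /dev/sda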
How About CPU and Drive Utilization?
Now you have checked all your drives, but for some reason, they are still not performing as expected. The next step should be to determine whether drive utilization is too high or the CPU is too busy and is having a difficult time keeping up with I/O requests. The sysstat package provides a nice little utility called iostat that gathers both sets of data. In the example in Listing 6, iostat is showing an extended set of metrics and the CPU utilization at two-second intervals.
Listing 6: iostat Output
$ iostat -x -d 2 -c
Linux 5.4.12-050412-generic (dev-machine)  03/14/2021  _x86_64_  (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.79    0.07    1.19    2.89    0.00   95.06

Device     r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda      10.91    6.97    768.20    584.64     4.87    18.20  30.85  72.31    13.16    20.40    0.26     70.44     83.89   1.97   3.52
nvme0n1  58.80   12.22  17720.47     48.71   230.91     0.01  79.70   0.08     0.42     0.03    0.00    301.34      3.98   1.02   7.24
sdb       0.31   55.97      4.13  17676.32     0.00   231.64   0.00  80.54     2.50     8.47    0.32     13.45    315.84   1.30   7.32
sdc       0.24    0.00      3.76      0.00     0.00     0.00   0.00   0.00     2.47     0.00    0.00     15.64      0.00   1.03   0.02
sde       2.47    0.00     62.57      0.00     0.00     0.00   0.00   0.00     0.63     0.00    0.00     25.34      0.00   0.29   0.07
sdf       1.51    0.00     32.42      0.00     0.00     0.00   0.00   0.00     0.69     0.00    0.00     21.40      0.00   0.31   0.05
sdd       1.42    0.00     50.96      0.00     0.00     0.00   0.00   0.00     0.44     0.00    0.00     35.83      0.00   0.38   0.05
md0      12.43   12.17     54.39     48.68     0.00     0.00   0.00   0.00     0.00     0.00    0.00      4.37      4.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.76    0.00    3.03    1.26    0.00   94.95

Device     r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda       0.00    9.00      0.00     88.00     0.00     8.00   0.00  47.06     0.00    30.83    0.26      0.00      9.78   0.67   0.60
nvme0n1 2769.50 2682.00  29592.00  10723.25   241.00     0.00   8.01   0.00     0.11     0.02    0.01     10.68      4.00   0.14  77.60
sdb       0.00 2731.00      0.00  27814.00     0.00   241.00   0.00   8.11     0.00    12.20   30.13      0.00     10.18   0.29  79.40
sdc       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sde       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdf       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdd       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
md0     2717.50 2679.00  10870.00  10716.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      4.00      4.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.51    0.00    2.42    0.00    0.00   97.07

Device     r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
nvme0n1 2739.00 2747.50  27336.00  10988.50   210.00     0.00   7.12   0.00     0.12     0.02    0.00      9.98      4.00   0.14  77.20
sdb       0.00 2797.50      0.00  28270.00     0.00   210.00   0.00   6.98     0.00    11.75   29.38      0.00     10.11   0.28  78.80
sdc       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sde       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdf       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdd       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
md0     2688.00 2746.50  10752.00  10986.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      4.00      4.00   0.00   0.00
The first interval should be ignored because it reports averages accumulated since boot rather than current activity, so the numbers can look a bit off. From the second interval onward, you will see a more accurate picture of each disk drive: reads per second (r/s), writes per second (w/s), the average time for read (r_await) and write (w_await) requests to complete, the percentage of time the drive was busy servicing requests (%util), and more. The higher the %util number, the busier the drive is completing I/O requests. If that number is consistently high, you might need to either throttle the amount of I/O sent to the drive or balance the same I/O across multiple drives (e.g., in a RAID 0, 5, or 6 configuration).
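If you want to focus on a single suspect device and skip the misleading first sample altogether, iostat can do both; a quick sketch (sdb is a placeholder, and -y omits the initial since-boot report):

$ iostat -y -x -d 2 sdb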
Also, notice the average CPU metrics at the top of each interval. Here, you will find a breakdown of how much CPU time is spent running user and system tasks, waiting on the completion of I/O requests (%iowait), sitting idle, and so on. The less idle time in the system, the more likely it is that your drive performance will be affected.
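The sysstat package also includes mpstat, which prints the same breakdown per processor at fixed intervals if you prefer a non-interactive view (the two-second interval is only an example):

$ mpstat -P ALL 2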
You can view a real-time breakdown of these CPU cores with the top utility. After opening the top application at the command line, press the 1 key (Listing 7).
Listing 7: top Output
top - 19:44:01 up 15 min,  3 users,  load average: 1.08, 0.68, 0.42
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.7 us,  1.4 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu1  :  0.3 us,  3.1 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu2  :  0.3 us,  1.7 sy,  0.0 ni, 97.6 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7951.2 total,   6269.5 free,    210.9 used,   1470.8 buff/cache
MiB Swap:   3934.0 total,   3934.0 free,      0.0 used.   7079.6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3294 root      20   0  748016   4760    988 S  13.3   0.1   0:01.19 fio
 3155 root      20   0       0      0      0 D   2.3   0.0   0:14.55 md0_resync
   18 root      20   0       0      0      0 S   0.3   0.0   0:00.05 ksoftirqd/1
 3152 root      20   0       0      0      0 S   0.3   0.0   0:05.38 md0_raid1
 3284 petros    20   0    9496   4080   3356 R   0.3   0.1   0:00.04 top
 3286 root      20   0  813548 428684 424916 S   0.3   5.3   0:00.37 fio
Enough Free Memory?
If the CPU is not the problem and the drives are being underutilized, do you have a constraint on memory resources? The easiest and quickest way to check is with free:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7951         201        7037           1         712        7493
Swap:          4095           0        4095
The free utility dumps the amount of total, used, and free memory on the system, but it also shows how much is used as a buffer or temporary cache (buff/cache) and how much is available and reclaimable (available). The output here is in megabytes (the -g argument reports gigabytes and -k kilobytes) and will match the data found in /proc/meminfo. Note that /proc/meminfo only looks different from the free output because it reports in kilobytes instead of megabytes:
$ cat /proc/meminfo | grep -e "^Mem"
MemTotal:        8142012 kB
MemFree:         7204048 kB
MemAvailable:    7672148 kB
When free memory drops drastically while available memory remains comparatively high, a lot of memory is being consumed by the system and its applications, a percentage of which can be reclaimed from temporary caches when the operating system comes under memory pressure. If the system begins to reclaim memory, it will affect overall system performance. You will also observe kswapd or swapper activity in the dmesg or syslog output, indicating that the kernel is hard at work freeing up reclaimable memory. If both free and available memory decrease, the system has less memory left to reclaim, so when an application asks for more memory, it can fault on the page allocation, likely forcing the application to exit early. This condition is typically accompanied by an entry in dmesg or syslog notifying the system administrator or user that a page allocation failure has occurred.
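To watch memory pressure and swap activity as they happen, the vmstat utility is a quick check: if the si (swap in) and so (swap out) columns are consistently non-zero while wa (I/O wait) climbs, the system is actively swapping (the two-second interval is only an example):

$ vmstat 2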
If you find yourself in a situation in which the system is struggling to find memory resources to serve applications and I/O requests, you might have to figure out the greatest offender(s) and attempt to find a resolution to the problem. In a worst case scenario, an application may contain a memory leak and not properly free unused memory, consuming more system memory that cannot be reclaimed. This scenario will need to be addressed in the application’s code. A simple way to observe the greatest consumers of your memory resources is with the ps utility (Listing 8).
Listing 8: ps Output
$ ps aux | head -1; ps aux | sort -rnk 4 | head -9
USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
root      5088  5.0  5.2 815852 429928 pts/2    Sl+  15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5097  2.2  0.4 783092  38248 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5096  2.1  0.4 783088  38168 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5095  2.0  0.4 783084  38204 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5094  2.1  0.4 783080  38236 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      1421  0.2  0.3 1295556 29472 ?        Ssl  14:51   0:05 /usr/lib/snapd/snapd
root       990  0.0  0.2 107888  16912 ?        Ssl  14:50   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root       844  0.0  0.2 280452  18256 ?        SLsl 14:50   0:00 /sbin/multipathd -d -s
systemd+   912  0.0  0.1  24092  10612 ?        Ss   14:50   0:00 /lib/systemd/systemd-resolved
The fourth column (%MEM) is the column on which you should focus. In this example, the fio utility is consuming 5.2% of the system memory, and as soon as it exits, it will free that memory back into the larger pool of available memory for future use.
Another thing worth considering is tuning the kernel's virtual memory subsystem with the sysctl utility. A guide to the parameters that can be tuned is found in the Documentation/admin-guide/sysctl/vm.rst file of the Linux kernel source. Tunables include filesystem buffering and caching thresholds, memory swapping behavior, and others.
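For example, you can query a few of these tunables and adjust one at run time with sysctl (the parameters shown are common examples, not recommendations for your workload; persistent changes belong in /etc/sysctl.conf or /etc/sysctl.d):

$ sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio
$ sudo sysctl -w vm.swappiness=10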
Is Other Traffic Hitting the Drive?
Now is the time to dive a little deeper by looking at the traffic hitting your drive. Are you seeing unexpected I/O requests? The blktrace and blkparse applications are built for exactly this kind of investigation. Start by creating a small shell script to generate “outside” traffic. The script will rely on the sg_inq binary provided by the sg3_utils package mentioned earlier:
#!/bin/sh

while [ 1 ]; do
        sg_inq /dev/sdf
        sleep 2
done
To execute the script, enter:
$ sh send_sg_inq.sh
If you were to use blktrace to capture the traffic hitting the drive, you would see sg_inq or a SCSI INQUIRY request being sent to the drive. This operation is considered a read (outside of the general data path) because it is requesting information back from the drive. Therefore, the seventh column of the lines associated with sg_inq is labeled with an R (Listing 9).
Listing 9: blktrace Output
$ sudo blktrace -d /dev/sdf -o - | blkparse -i -
  8,80   1        1     0.000000000  4693  D   R 36 [sg_inq]
  8,80   1        2     0.000590071     0  C   R [0]
  8,80   1        3     0.000624346  4693  D   R 144 [sg_inq]
  8,80   1        4     0.001093812     0  C   R [0]
  8,80   1        5     0.001121290  4693  D   R 252 [sg_inq]
  8,80   1        6     0.001577685     0  C   R [0]
  8,80   1        7     0.001593793  4693  D   R 252 [sg_inq]
  8,80   1        8     0.002063534     0  C   R [0]
  8,80   0        1     2.004478167  4695  D   R 36 [sg_inq]
  8,80   0        2     2.005054640     0  C   R [0]
  8,80   0        3     2.005078936  4695  D   R 144 [sg_inq]
What will this look like when you send more I/O? You can use the dd utility to write a stream of all-zero data to the drive in 1MB blocks:
$ sudo dd if=/dev/zero of=/dev/sdf1 bs=1M
The blktrace utility will dump the captured output (Listing 10); the seventh column says that a write (W) was sent, and the columns that follow show the starting block and size of the transfer (1342768 + 8). In one line, you see that a kernel worker thread is tasked with sending the write, and in the following line you see the logical block addressing (LBA) location relative to that drive or drive partition’s starting point (e.g., 1326392).
Listing 10: blktrace Output
  8,80   1     4398     0.027331017   218  M   W 1342768 + 8 [kworker/u16:3]
  8,80   1     4399     0.027332069   218  A   W 1342776 + 8 <- (8,81) 1326392
  8,80   1     4400     0.027332354   218  Q   W 1342776 + 8 [kworker/u16:3]
  8,80   1     4401     0.027332675   218  M   W 1342776 + 8 [kworker/u16:3]
  8,80   1     4402     0.027333775   218  A   W 1342784 + 8 <- (8,81) 1326400
  8,80   1     4403     0.027334066   218  Q   W 1342784 + 8 [kworker/u16:3]
  8,80   1     4404     0.027334387   218  M   W 1342784 + 8 [kworker/u16:3]
  8,80   1     4405     0.027335587   218  A   W 1342792 + 8 <- (8,81) 1326408
  8,80   1     4406     0.027335872   218  Q   W 1342792 + 8 [kworker/u16:3]
  8,80   1     4407     0.027336186   218  M   W 1342792 + 8 [kworker/u16:3]
  8,80   1     4408     0.027337256   218  A   W 1342800 + 8 <- (8,81) 1326416
  8,80   1     4409     0.027337542   218  Q   W 1342800 + 8 [kworker/u16:3]
  8,80   1     4410     0.027337900   218  M   W 1342800 + 8 [kworker/u16:3]
  8,80   1     4411     0.027339749   218  A   W 1342808 + 8 <- (8,81) 1326424
This output can and will get large and ugly, and it might take some time to parse through the content manually. Be sure not to let the utility capture for too long.
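One way to keep a capture short is to let blktrace stop itself after a fixed run time; a sketch that records 10 seconds of traffic to trace files and parses them afterward (the device and output name are placeholders):

$ sudo blktrace -w 10 -d /dev/sdf -o sdf_trace
$ blkparse -i sdf_trace | less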
Similar to blktrace, you can view traffic hitting SAS drives with the SCSI logging level facility of the Linux kernel. The simplest way to observe and manage the kernel’s logging level settings is with the scsi_logging_level tool. For instance, to list the current logging level settings, invoke the command with the -g option:
$ sudo scsi_logging_level -g
Current scsi logging level:
/proc/sys/dev/scsi/logging_level = 0
SCSI_LOG_ERROR=0
SCSI_LOG_TIMEOUT=0
SCSI_LOG_SCAN=0
SCSI_LOG_MLQUEUE=0
SCSI_LOG_MLCOMPLETE=0
SCSI_LOG_LLQUEUE=0
SCSI_LOG_LLCOMPLETE=0
SCSI_LOG_HLQUEUE=0
SCSI_LOG_HLCOMPLETE=0
SCSI_LOG_IOCTL=0
The Linux SCSI subsystem is separated into three main levels: the upper layer comprises entry point drivers that allow you to connect with the underlying device (e.g., sd, sg, sr), and the mid layer connects the upper level to the interface driver of the lower level (i.e., Fibre Channel, SAS, iSCSI, etc.). Aside from connecting both upper and lower levels, the mid level is responsible for failure handling and error correction of failed requests propagated up from the underlying device or the host bus adaptor (HBA) in the low level connecting to that underlying device (Figure 1).
To enable the highest level of logging for all SCSI errors and traffic in the low and mid layers, type:
$ sudo scsi_logging_level -s -E 7 -L 7 -M 7
New scsi logging level:
/proc/sys/dev/scsi/logging_level = 2096647
SCSI_LOG_ERROR=7
SCSI_LOG_TIMEOUT=0
SCSI_LOG_SCAN=0
SCSI_LOG_MLQUEUE=7
SCSI_LOG_MLCOMPLETE=7
SCSI_LOG_LLQUEUE=7
SCSI_LOG_LLCOMPLETE=7
SCSI_LOG_HLQUEUE=0
SCSI_LOG_HLCOMPLETE=0
SCSI_LOG_IOCTL=0
When logging is enabled, it will print SCSI-level traces in both dmesg and syslog (the kernel log). For instance, in the example shown in Listing 11, you can see the SCSI Command Descriptor Blocks (CDBs) and traces printed for write, cache synchronization, and inquiry commands.
Listing 11: Kernel Log
Mar 27 15:21:24 dev-machine kernel: [  387.117129] sd 0:0:0:0: [sda] tag#3 Send: scmd 0x000000001842ae5b
Mar 27 15:21:24 dev-machine kernel: [  387.117134] sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 1d 04 5d 88 00 00 20 00
Mar 27 15:21:24 dev-machine kernel: [  387.117468] sd 0:0:0:0: [sda] tag#3 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Mar 27 15:21:24 dev-machine kernel: [  387.117485] sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 1d 04 5d 88 00 00 20 00
Mar 27 15:21:24 dev-machine kernel: [  387.117491] sd 0:0:0:0: [sda] tag#3 scsi host busy 1 failed 0
Mar 27 15:21:24 dev-machine kernel: [  387.117495] sd 0:0:0:0: Notifying upper driver of completion (result 0)
Mar 27 15:21:24 dev-machine kernel: [  387.117585] sd 0:0:0:0: [sda] tag#24 Send: scmd 0x00000000fa2f2e1a
Mar 27 15:21:24 dev-machine kernel: [  387.117590] sd 0:0:0:0: [sda] tag#24 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Mar 27 15:21:24 dev-machine kernel: [  387.130198] sd 0:0:0:0: [sda] tag#24 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Mar 27 15:21:24 dev-machine kernel: [  387.130202] sd 0:0:0:0: [sda] tag#24 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Mar 27 15:21:24 dev-machine kernel: [  387.130205] sd 0:0:0:0: [sda] tag#24 scsi host busy 1 failed 0
Mar 27 15:21:24 dev-machine kernel: [  387.130207] sd 0:0:0:0: Notifying upper driver of completion (result 0)
[ … ]
Mar 27 15:21:25 dev-machine kernel: [  387.655119] sd 2:0:1:1: [sdf] tag#3669 Send: scmd 0x00000000b06d61c4
Mar 27 15:21:25 dev-machine kernel: [  387.655122] sd 2:0:1:1: [sdf] tag#3669 CDB: Inquiry 12 00 00 00 90 00
Mar 27 15:21:25 dev-machine kernel: [  387.655587] sd 2:0:1:1: [sdf] tag#3669 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Mar 27 15:21:25 dev-machine kernel: [  387.655589] sd 2:0:1:1: [sdf] tag#3669 CDB: Inquiry 12 00 00 00 90 00
Mar 27 15:21:25 dev-machine kernel: [  387.655591] sd 2:0:1:1: [sdf] tag#3669 scsi host busy 1 failed 0
Mar 27 15:21:25 dev-machine kernel: [  387.655593] sd 2:0:1:1: Notifying upper driver of completion (result 0)
Much like blktrace, this log can get pretty large and pretty ugly and should be enabled for brief periods at a time to avoid overwhelming the system. Another thing worth noting is that, with capture tools such as these, latencies are introduced and may alter the original I/O profile being chased, so proceed with caution.
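If you only need a quick look while the logging level is raised, you can follow the kernel log live and filter it down to the device in question (sdf here is a placeholder):

$ sudo dmesg -wT | grep sdf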
To disable all logging, type:
$ sudo scsi_logging_level -s -E 0 -L 0 -M 0
New scsi logging level:
/proc/sys/dev/scsi/logging_level = 7
SCSI_LOG_ERROR=7
SCSI_LOG_TIMEOUT=0
SCSI_LOG_SCAN=0
SCSI_LOG_MLQUEUE=0
SCSI_LOG_MLCOMPLETE=0
SCSI_LOG_LLQUEUE=0
SCSI_LOG_LLCOMPLETE=0
SCSI_LOG_HLQUEUE=0
SCSI_LOG_HLCOMPLETE=0
SCSI_LOG_IOCTL=0
Perf Tracing
Another tool captures system traces, but at the Linux kernel level and in the context of CPU-executed operations: perf, which can be obtained from the Linux Kernel Archives. Some Linux distributions offer it in their package repositories, but it might be best to download the latest and greatest from the website (Listing 12).
Listing 12: Get perf from kernel.org
$ wget https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.9.0/perf-5.9.0.tar.xz
--2021-03-21 15:40:30--  https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.9.0/perf-5.9.0.tar.xz
Resolving mirrors.edge.kernel.org (mirrors.edge.kernel.org)... 2604:1380:1:3600::1, 147.75.197.195
Connecting to mirrors.edge.kernel.org (mirrors.edge.kernel.org)|2604:1380:1:3600::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1857268 (1.8M) [application/x-xz]
Saving to: ‘perf-5.9.0.tar.xz’

perf-5.9.0.tar.xz      100%[=============================>]   1.77M  8.85MB/s    in 0.2s

2021-03-21 15:40:30 (8.85 MB/s) - ‘perf-5.9.0.tar.xz’ saved [1857268/1857268]
Extract the file into a local working directory
$ tar xJf perf-5.9.0.tar.xz
and change into the perf source code subdirectory:
$ cd perf-5.9.0/tools/perf/
Resolve any and all dependencies, then run make and make install.
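If you would rather not build from source, most distributions package perf; for example, on Debian or Ubuntu systems and on Fedora or RHEL systems, respectively, something like the following should work (package names vary by distribution and kernel version):

$ sudo apt install linux-tools-common linux-tools-$(uname -r)
$ sudo dnf install perf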
Now that perf is installed, you can generate I/O to a disk drive with fio,
$ sudo fio --bs=1K --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=1 --name=test
and in another terminal window, invoke perf to record kernel operations executed at the CPU level for 30 seconds:
$ sudo ./perf record -a -g -- sleep 30
[ perf record: Woken up 8 times to write data ]
[ perf record: Captured and wrote 2.476 MB perf.data (12373 samples) ]
A perf.data file is generated in the local working directory. You can use tools to open this file directly for analysis or to generate other output from that file. For instance, to generate a report summarizing the 30 seconds, enter:
$ sudo ./perf script > out.script
$ sudo ./perf report > out.report
You will need both files just generated:
$ ls out.*
out.report  out.script
If you open the report file (i.e., out.report), you will see a summary of all the kernel operations and the amount of CPU time they consumed during that 30-second snapshot, along with the subroutines tied to the top-level functions (Listing 13).
Listing 13: Summary of Kernel Operations
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 12K of event 'cycles'
# Event count (approx.): 1354753461
#
# Children      Self  Command          Shared Object       Symbol
# ........  ........  ...............  ..................  ........................................
#
    57.94%     0.00%  swapper          [kernel.kallsyms]   [k] secondary_startup_64
            |
            ---secondary_startup_64
               |
               |--48.28%--start_secondary
               |          |
               |           --48.25%--cpu_startup_entry
               |                     |
               |                      --47.68%--do_idle
               |                                |
               |                                |--36.89%--call_cpuidle
               |                                |          |
               |                                |           --36.81%--cpuidle_enter
               |                                |                     |
               |                                |                      --35.93%--cpuidle_enter_state
               |                                |                                |
               |                                |                                |--16.95%--intel_idle
               |                                |                                |
               |                                |                                |--10.25%--ret_from_intr
               |                                |                                |          |
               |                                |                                |           --10.21%--do_IRQ
               |                                |                                |                     |
               |                                |                                |                     |--7.62%--irq_exit
               |                                |                                |                     |          |
               |                                |                                |                     |           --7.47%--__softirqentry_text_start
               |                                |                                |                     |                     |
               |                                |                                |                     |                      --7.17%--blk_done_softirq
[ … ]
     0.51%     0.04%  fio              [kernel.kallsyms]   [k] mpage_alloc.isra.0
     0.50%     0.00%  fio              libc-2.31.so        [.] 0x00007f0576ab611a
            |
            ---0x7f0576ab611a
     0.50%     0.24%  swapper          [kernel.kallsyms]   [k] sched_clock_cpu
     0.49%     0.00%  perf             perf                [.] 0x00005570000237de
     0.48%     0.18%  swapper          [kernel.kallsyms]   [k] ttwu_do_wakeup
     0.46%     0.19%  swapper          [kernel.kallsyms]   [k] set_next_entity
     0.46%     0.00%  rcu_sched        [kernel.kallsyms]   [k] ret_from_fork
     0.46%     0.00%  rcu_sched        [kernel.kallsyms]   [k] kthread
     0.46%     0.10%  swapper          [kernel.kallsyms]   [k] load_balance
[ … ]
The perf utility is useful when you need to see where the CPU spends most of its time; it provides insight into what could be holding up the completion of a function and can shed light on poorly written functions in kernel space.
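Because the subject here is I/O, it can also be useful to record just the block layer tracepoints rather than sampling every CPU cycle; a minimal sketch (the 10-second window and the follow-up script command are only examples):

$ sudo ./perf record -e block:block_rq_issue,block:block_rq_complete -a -- sleep 10
$ sudo ./perf script | less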
Sometimes it can be difficult to parse this data, which is why you can convert it into a visual chart called a flame graph. To do this, clone the git repo and convert the original output script file into an SVG (Listing 14).
Listing 14: Generating a Flame Graph
$ git clone https://github.com/brendangregg/FlameGraph.git
Cloning into 'FlameGraph'...
remote: Enumerating objects: 1067, done.
remote: Total 1067 (delta 0), reused 0 (delta 0), pack-reused 1067
Receiving objects: 100% (1067/1067), 1.87 MiB | 6.67 MiB/s, done.
Resolving deltas: 100% (612/612), done.
$ ./FlameGraph/stackcollapse-perf.pl out.script > out.folded
$ ./FlameGraph/flamegraph.pl out.folded > out.svg
When you open the file, you will see a visual representation of the 30-second capture (Figure 2). You can graphically navigate this graph with the click of your mouse. As you begin to click into the kernel functions, you will drill deeper and deeper into each function and the subfunctions it might be waiting on (Figures 3 and 4). You will also observe the total CPU time of the functions at the bottom of the SVG page.
Other Things to Consider
You will want to know whether you are looking at a solid state drive (SSD). If so, performance can drop drastically once the SSD reaches its write cliff, the point at which the drive has exhausted its first writes to each NAND memory cell. Once you reach the drive’s write cliff, every write operation exercises a program/erase (P/E) cycle, a multistep process that involves erasing an entire cell before writing to it. Each NAND cell endures only a finite number of P/E cycles, and that number varies with the type of NAND technology employed. Vendors resort to something called over-provisioning to prolong or delay this write cliff.
Often you will get double the NAND memory when purchasing a drive. For instance, a 3TB PCIe SSD might actually contain 6TB of NAND, of which you are only able to use 3TB. One method by which you can restore an SSD closer to its former performing glory is a technique called TRIM (the equivalent SCSI command is UNMAP, exposed in Linux as discard). In this background process, the software tells the drive to prep unused data blocks for a future write operation. Tools such as fstrim trigger this function on demand, and many filesystems (e.g., ext4) and volume managers (e.g., logical volume management, LVM) offer discard support as well.
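As a rough sketch, a mounted filesystem can be trimmed on demand with fstrim, and most systemd-based distributions ship a timer that runs it periodically (the mountpoint below is a placeholder, and the timer’s availability depends on your distribution):

$ sudo fstrim -v /mnt/data
$ sudo systemctl enable --now fstrim.timer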
Conclusion
When something does not seem quite right, you have many ways to peek into a system; a few of those mechanisms were explored here. This guide does not cover everything, but it provides enough of a starting point to get you moving in the right direction toward isolating and diagnosing the root cause of your performance troubles.
The Author
Petros Koutoupis is currently a senior performance software engineer at Cray (now HPE) for its Lustre High Performance File System division. He is also the creator and maintainer of the RapidDisk Project. Koutoupis has worked in the data storage industry for well over a decade and has helped pioneer the many technologies unleashed in the wild today.