When I/O Workloads Don’t Perform
Isolating and diagnosing the root causes of your performance troubles.
Every now and then, you find yourself in a situation where you expect better performance from your data storage drives. Either they once performed very well and one day just stopped, or they came straight out of the box underperforming. I explore a few of the reasons why this might happen.
Sometimes, the easiest and quickest way to determine the root cause of a slow drive is to check its local logging data. The method by which this log data is stored differs by drive type, but in the end, the results are generally the same. For instance, a SCSI-based drive such as a Serial Attached SCSI (SAS) drive collects drive log data and general metrics in the SCSI log pages (plural because each page separates the collected data into its respective category). The easiest way to access this data is with the sg3_utils package available for Linux. To find out which pages the drive supports, execute the sg_logs binary against the SAS drive or SCSI generic device you are interested in (Listing 1).
Listing 1: sg_logs
$ sudo sg_logs /dev/sdc
    SEAGATE   ST14000NM0001   K001
Supported log pages  [0x0]:
    0x00        Supported log pages [sp]
    0x02        Write error [we]
    0x03        Read error [re]
    0x05        Verify error [ve]
    0x06        Non medium [nm]
    0x08        Format status [fs]
    0x0d        Temperature [temp]
    0x0e        Start-stop cycle counter [sscc]
    0x0f        Application client [ac]
    0x10        Self test results [str]
    0x15        Background scan results [bsr]
    0x18        Protocol specific port [psp]
    0x1a        Power condition transitions [pct]
    0x2f        Informational exceptions [ie]
    0x37        Cache (seagate) [c_se]
    0x38
    0x3e        Factory (seagate) [f_se]
As you can see, you can observe write error counts, read error counts, drive temperature, and more. To view a specific page, use the -p parameter followed by the page number. For instance, look at the log page for write errors (i.e., 0x2; Listing 2).
Listing 2: Log Page for Write Errors
$ sudo sg_logs -p 0x2 /dev/sdc
    SEAGATE   ST14000NM0001   K001
Write error counter page  [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 3951500537856
  Total uncorrected errors = 0
This drive does not appear to have any write errors (either corrected or uncorrected by the drive firmware), so it looks to be in good shape. Typically, if you see errors, especially of the uncorrected type, the printout will include failing logical block addresses (LBAs). If failed LBA regions (i.e., sectors) were listed under the read error category, they would likely be in a pending reallocation state (waiting for a future write to the same address). A sector pending reallocation is a sector that can no longer be read from or written to reliably and must be reallocated elsewhere on the disk drive. This reallocation will only happen on the next write operation to that failed sector, and only if the drive still has spare sectors it can use to relocate the data. A failing sector or a sector pending reallocation by the drive's firmware will affect overall drive performance, and if enough of them accumulate, it is highly recommended that you replace the drive as soon as possible.
Also understand that if a log page starts to list a significant count of corrected read or write errors, chances are the disk drive's surrounding environment is at fault. For instance, vibration will often cause a disk drive's head to misread or miswrite a length of sectors on a drive track, which results in the firmware taking action to correct it. This correction process alone introduces unwanted I/O latencies (reducing drive performance).
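If you suspect the environment is at fault, one simple check is to sample the error counter pages repeatedly while the workload runs and watch whether the corrected counts keep climbing; a minimal sketch using the read error page (the device name and 60-second interval are only examples):

$ sudo watch -n 60 "sg_logs -p 0x3 /dev/sdc | grep -i corrected"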
If you’d like to list all of the log pages at once, use the -a parameter (Listing 3). (Warning: You will get a lot of information.)
Listing 3: List All Log Pages
$ sudo sg_logs -a /dev/sdc
    SEAGATE   ST14000NM0001   K001
Supported log pages  [0x0]:
    0x00        Supported log pages [sp]
    0x02        Write error [we]
    0x03        Read error [re]
    0x05        Verify error [ve]
    0x06        Non medium [nm]
    0x08        Format status [fs]
    0x0d        Temperature [temp]
    0x0e        Start-stop cycle counter [sscc]
    0x0f        Application client [ac]
    0x10        Self test results [str]
    0x15        Background scan results [bsr]
    0x18        Protocol specific port [psp]
    0x1a        Power condition transitions [pct]
    0x2f        Informational exceptions [ie]
    0x37        Cache (seagate) [c_se]
    0x38
    0x3e        Factory (seagate) [f_se]
Write error counter page  [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 3951500537856
  Total uncorrected errors = 0
  Reserved or vendor specific [0xf800] = 0
  Reserved or vendor specific [0xf801] = 0
  Reserved or vendor specific [0xf802] = 0
  Reserved or vendor specific [0xf803] = 0
  Reserved or vendor specific [0xf804] = 0
  Reserved or vendor specific [0xf805] = 0
  Reserved or vendor specific [0xf806] = 0
  Reserved or vendor specific [0xf807] = 0
  Reserved or vendor specific [0xf810] = 0
  Reserved or vendor specific [0xf811] = 0
  Reserved or vendor specific [0xf812] = 0
  Reserved or vendor specific [0xf813] = 0
  Reserved or vendor specific [0xf814] = 0
  Reserved or vendor specific [0xf815] = 0
  Reserved or vendor specific [0xf816] = 0
  Reserved or vendor specific [0xf817] = 0
  Reserved or vendor specific [0xf820] = 0
Read error counter page  [0x3]
  Errors corrected without substantial delay = 0
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 35801804845056
  Total uncorrected errors = 0
Verify error counter page  [0x5]
  Errors corrected without substantial delay = 0
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 0
  Total uncorrected errors = 0
Non-medium error page  [0x6]
  Non-medium error count = 0
Format status page  [0x8]
  Format data out:
  Grown defects during certification
  Total blocks reassigned during format
  Total new blocks reassigned
  Power on minutes since format
Temperature page  [0xd]
  Current temperature = 28 C
  Reference temperature = 60 C
Start-stop cycle counter page  [0xe]
  Date of manufacture, year: 2019, week: 26
  Accounting date, year: , week:
  Specified cycle count over device lifetime = 50000
  Accumulated start-stop cycles = 498
  Specified load-unload count over device lifetime = 600000
  Accumulated load-unload cycles = 553
[ … ]
Other tools (e.g., smartctl) can extract similar, and sometimes identical, data from a SAS drive. If a drive supports the industry-standard Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.), you can use the smartmontools package and, more specifically, the smartctl binary (Listing 4).
Listing 4: smartctl on SAS Drive
$ sudo smartctl -a /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-66-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST14000NM0001
Revision:             K001
Compliance:           SPC-5
User Capacity:        7,000,259,821,568 bytes [7.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x6000c500a7b3ceeb0000000000000000
Serial number:        ZKL00CYG0000G925020A
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sun Mar 21 15:00:00 2021 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification
Total blocks reassigned during format
Total new blocks reassigned
Power on minutes since format
Current Drive Temperature:     28 C
Drive Trip Temperature:        60 C

Manufactured in week 26 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  498
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  553
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 150743545
  Blocks received from initiator = 964465354
  Blocks read from cache and sent to initiator = 1014080851
  Number of read and write commands whose size <= segment size = 8318611
  Number of read and write commands whose size > segment size = 12

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 264.67
  number of minutes until next internal SMART test = 48

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/     errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      35801.805           0
write:         0        0         0         0          0       3951.501           0

Non-medium error count:        0

No Self-tests have been logged
The smartmontools package is most beneficial for Serial ATA (SATA) drives, because most SATA drives support the feature out of the box. Note that the S.M.A.R.T. output on SATA drives, both the type of data and the way it is formatted, differs from that of its SAS counterpart (Listing 5).
Listing 5: smartctl on SATA Drive
$ sudo smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-66-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES (SATA 6Gb/s)
Device Model:     ST500NM0011
Serial Number:    Z1M11WAJ
LU WWN Device Id: 5 000c50 04edcb79a
Add. Product Id:  DELL(tm)
Firmware Version: PA08
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Mar 21 15:00:38 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  609) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  75) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   077   063   ---    Pre-fail  Always       -       56770409
  3 Spin_Up_Time            0x0003   096   092   ---    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   ---    Old_age   Always       -       137
  5 Reallocated_Sector_Ct   0x0033   100   100   ---    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   ---    Pre-fail  Always       -       5578572
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       400
 10 Spin_Retry_Count        0x0013   100   099   ---    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       135
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   ---    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   058   ---    Old_age   Always       -       28 (Min/Max 22/28)
191 G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   ---    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   100   100   ---    Old_age   Always       -       487
194 Temperature_Celsius     0x0022   028   042   ---    Old_age   Always       -       28 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   113   099   ---    Old_age   Always       -       56770409
197 Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       344 (218 109 0)
241 Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       1389095282
242 Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       619165492

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         2         -
# 2  Extended offline    Completed without error       00%         2         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
For the most part, the information is the same. For instance, when you look at the drive attributes category, attribute 197, Current_Pending_Sector, is the same sector-pending-reallocation counter discussed earlier. Again, you can gather drive temperature information, lifetime hours, and more.
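The smartctl binary can also kick off a drive's built-in self-tests, which is a quick way to confirm a suspicion before pulling a drive; for example (the device name is a placeholder):

$ sudo smartctl -t short /dev/sda
$ sudo smartctl -l selftest /dev/sda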
How About CPU and Drive Utilization?
Now you have checked all your drives, but for some reason, they are still not performing as expected. The next step should be to determine whether drive utilization is too high or the CPU is too busy and is having a difficult time keeping up with I/O requests. The sysstat package provides a nice little utility called iostat that gathers both sets of data. In the example in Listing 6, iostat is showing an extended set of metrics and the CPU utilization at two-second intervals.
Listing 6: iostat Output
$ iostat -x -d 2 -c
Linux 5.4.12-050412-generic (dev-machine)  03/14/2021  _x86_64_  (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.79    0.07    1.19    2.89    0.00   95.06

Device     r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda      10.91    6.97    768.20    584.64     4.87    18.20  30.85  72.31    13.16    20.40    0.26     70.44     83.89   1.97   3.52
nvme0n1  58.80   12.22  17720.47     48.71   230.91     0.01  79.70   0.08     0.42     0.03    0.00    301.34      3.98   1.02   7.24
sdb       0.31   55.97      4.13  17676.32     0.00   231.64   0.00  80.54     2.50     8.47    0.32     13.45    315.84   1.30   7.32
sdc       0.24    0.00      3.76      0.00     0.00     0.00   0.00   0.00     2.47     0.00    0.00     15.64      0.00   1.03   0.02
sde       2.47    0.00     62.57      0.00     0.00     0.00   0.00   0.00     0.63     0.00    0.00     25.34      0.00   0.29   0.07
sdf       1.51    0.00     32.42      0.00     0.00     0.00   0.00   0.00     0.69     0.00    0.00     21.40      0.00   0.31   0.05
sdd       1.42    0.00     50.96      0.00     0.00     0.00   0.00   0.00     0.44     0.00    0.00     35.83      0.00   0.38   0.05
md0      12.43   12.17     54.39     48.68     0.00     0.00   0.00   0.00     0.00     0.00    0.00      4.37      4.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.76    0.00    3.03    1.26    0.00   94.95

Device     r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda       0.00    9.00      0.00     88.00     0.00     8.00   0.00  47.06     0.00    30.83    0.26      0.00      9.78   0.67   0.60
nvme0n1 2769.50 2682.00  29592.00  10723.25   241.00     0.00   8.01   0.00     0.11     0.02    0.01     10.68      4.00   0.14  77.60
sdb       0.00 2731.00      0.00  27814.00     0.00   241.00   0.00   8.11     0.00    12.20   30.13      0.00     10.18   0.29  79.40
sdc       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sde       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdf       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdd       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
md0     2717.50 2679.00  10870.00  10716.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      4.00      4.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.51    0.00    2.42    0.00    0.00   97.07

Device     r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
nvme0n1 2739.00 2747.50  27336.00  10988.50   210.00     0.00   7.12   0.00     0.12     0.02    0.00      9.98      4.00   0.14  77.20
sdb       0.00 2797.50      0.00  28270.00     0.00   210.00   0.00   6.98     0.00    11.75   29.38      0.00     10.11   0.28  78.80
sdc       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sde       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdf       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdd       0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
md0     2688.00 2746.50  10752.00  10986.00     0.00     0.00   0.00   0.00     0.00     0.00    0.00      4.00      4.00   0.00   0.00
The first interval should be ignored because it reports averages accumulated since boot rather than current activity, so the numbers can look a bit off. From the second interval onward, you will see a more accurate picture of each disk drive: reads per second (r/s), writes per second (w/s), the average time for read (r_await) and write (w_await) requests to complete, the percentage of time the drive was busy servicing requests (%util), and more. The higher the %util number, the busier the drive is completing I/O requests. If that number is consistently high, you might need to either throttle the amount of I/O sent to the drive or balance the same I/O across multiple drives (e.g., in a RAID 0, 5, or 6 configuration).
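If you want to focus on a single suspect device and skip the misleading first sample altogether, iostat can do both; a quick sketch (sdb is a placeholder, and -y omits the initial since-boot report):

$ iostat -y -x -d 2 sdb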
Also, notice the average CPU metrics at the top of each interval. Here, you will find a breakdown of how much CPU time is spent running user and system tasks, waiting on the completion of I/O requests (%iowait), sitting idle, and so on. The less idle time in the system, the more likely it is that your drive performance will be affected.
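The sysstat package also includes mpstat, which prints the same breakdown per processor at fixed intervals if you prefer a non-interactive view (the two-second interval is only an example):

$ mpstat -P ALL 2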
You can view a real-time breakdown of these CPU cores with the top utility. After opening the top application at the command line, press the 1 key (Listing 7).
Listing 7: top Output
top - 19:44:01 up 15 min,  3 users,  load average: 1.08, 0.68, 0.42
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.7 us,  1.4 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu1  :  0.3 us,  3.1 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu2  :  0.3 us,  1.7 sy,  0.0 ni, 97.6 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7951.2 total,   6269.5 free,    210.9 used,   1470.8 buff/cache
MiB Swap:   3934.0 total,   3934.0 free,      0.0 used.   7079.6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3294 root      20   0  748016   4760    988 S  13.3   0.1   0:01.19 fio
 3155 root      20   0       0      0      0 D   2.3   0.0   0:14.55 md0_resync
   18 root      20   0       0      0      0 S   0.3   0.0   0:00.05 ksoftirqd/1
 3152 root      20   0       0      0      0 S   0.3   0.0   0:05.38 md0_raid1
 3284 petros    20   0    9496   4080   3356 R   0.3   0.1   0:00.04 top
 3286 root      20   0  813548 428684 424916 S   0.3   5.3   0:00.37 fio
Enough Free Memory?
If the CPU is not the problem and the drives are being underutilized, do you have a constraint on memory resources? The easiest and quickest way to check is with free:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7951         201        7037           1         712        7493
Swap:          4095           0        4095
The free utility dumps the amount of total, used, and free memory on the system, but it also shows how much is used as a buffer or temporary cache (buff/cache) and how much is available and reclaimable (available). The output here is in megabytes (the -g argument reports gigabytes and -k kilobytes) and will match the data found in /proc/meminfo. Note that /proc/meminfo only looks different from the free output because it reports in kilobytes instead of megabytes:
$ cat /proc/meminfo | grep -e "^Mem"
MemTotal:        8142012 kB
MemFree:         7204048 kB
MemAvailable:    7672148 kB
When free memory drops drastically while available memory remains comparatively high, a lot of memory is being consumed by the system and its applications, a percentage of which can be reclaimed from temporary caches when the operating system comes under memory pressure. If the system begins to reclaim memory, it will affect overall system performance. You will also observe kswapd or swapper activity in the dmesg or syslog output, indicating that the kernel is hard at work freeing up reclaimable memory. If both free and available memory decrease, the system has less memory left to reclaim, so when an application asks for more memory, it can fault on the page allocation, likely forcing the application to exit early. This condition is typically accompanied by an entry in dmesg or syslog notifying the system administrator or user that a page allocation failure has occurred.
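To watch memory pressure and swap activity as they happen, the vmstat utility is a quick check: if the si (swap in) and so (swap out) columns are consistently non-zero while wa (I/O wait) climbs, the system is actively swapping (the two-second interval is only an example):

$ vmstat 2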
If you find yourself in a situation in which the system is struggling to find memory resources to serve applications and I/O requests, you might have to figure out the greatest offender(s) and attempt to find a resolution to the problem. In a worst case scenario, an application may contain a memory leak and not properly free unused memory, consuming more system memory that cannot be reclaimed. This scenario will need to be addressed in the application’s code. A simple way to observe the greatest consumers of your memory resources is with the ps utility (Listing 8).
Listing 8: ps Output
$ ps aux | head -1; ps aux | sort -rnk 4 | head -9
USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
root      5088  5.0  5.2 815852 429928 pts/2    Sl+  15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5097  2.2  0.4 783092  38248 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5096  2.1  0.4 783088  38168 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5095  2.0  0.4 783084  38204 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5094  2.1  0.4 783080  38236 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      1421  0.2  0.3 1295556 29472 ?        Ssl  14:51   0:05 /usr/lib/snapd/snapd
root       990  0.0  0.2 107888  16912 ?        Ssl  14:50   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root       844  0.0  0.2 280452  18256 ?        SLsl 14:50   0:00 /sbin/multipathd -d -s
systemd+   912  0.0  0.1  24092  10612 ?        Ss   14:50   0:00 /lib/systemd/systemd-resolved
The fourth column (%MEM) is the column on which you should focus. In this example, the fio utility is consuming 5.2% of the system memory, and as soon as it exits, it will free that memory back into the larger pool of available memory for future use.
Another thing worth considering is tuning the kernel's virtual memory subsystem with the sysctl utility. A guide to the parameters that can be tuned is found in the Documentation/admin-guide/sysctl/vm.rst file of the Linux kernel source. Tunables include filesystem buffering and caching thresholds, memory swapping behavior, and others.
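For example, you can query a few of these tunables and adjust one at run time with sysctl (the parameters shown are common examples, not recommendations for your workload; persistent changes belong in /etc/sysctl.conf or /etc/sysctl.d):

$ sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio
$ sudo sysctl -w vm.swappiness=10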
Is Other Traffic Hitting the Drive?
Now is the time to dive a little deeper by looking at the traffic hitting your drive. Are you seeing unexpected I/O requests? The blktrace and blkparse applications are built for exactly this kind of investigation. Start by creating a small shell script to generate “outside” traffic. The script will rely on the sg_inq binary provided by the sg3_utils package mentioned earlier:
#!/bin/sh

while [ 1 ]; do
        sg_inq /dev/sdf
        sleep 2
done
To execute the script, enter:
$ sh send_sg_inq.sh
If you were to use blktrace to capture the traffic hitting the drive, you would see sg_inq or a SCSI INQUIRY request being sent to the drive. This operation is considered a read (outside of the general data path) because it is requesting information back from the drive. Therefore, the seventh column of the lines associated with sg_inq is labeled with an R (Listing 9).
Listing 9: blktrace Output
$ sudo blktrace -d /dev/sdf -o - | blkparse -i -
  8,80   1        1     0.000000000  4693  D   R 36 [sg_inq]
  8,80   1        2     0.000590071     0  C   R [0]
  8,80   1        3     0.000624346  4693  D   R 144 [sg_inq]
  8,80   1        4     0.001093812     0  C   R [0]
  8,80   1        5     0.001121290  4693  D   R 252 [sg_inq]
  8,80   1        6     0.001577685     0  C   R [0]
  8,80   1        7     0.001593793  4693  D   R 252 [sg_inq]
  8,80   1        8     0.002063534     0  C   R [0]
  8,80   0        1     2.004478167  4695  D   R 36 [sg_inq]
  8,80   0        2     2.005054640     0  C   R [0]
  8,80   0        3     2.005078936  4695  D   R 144 [sg_inq]
What will this look like when you send more I/O? You can use the dd utility to write a stream of all-zero data to the drive in 1MB blocks:
$ sudo dd if=/dev/zero of=/dev/sdf1 bs=1M
The blktrace utility will dump the captured output (Listing 10); the seventh column says that a write (W) was sent, and the columns that follow show the starting block and size of the transfer (1342768 + 8). In one line, you see that a kernel worker thread is tasked with sending the write, and in the following line you see the logical block addressing (LBA) location relative to that drive or drive partition’s starting point (e.g., 1326392).
Listing 10: blktrace Output
  8,80   1     4398     0.027331017   218  M   W 1342768 + 8 [kworker/u16:3]
  8,80   1     4399     0.027332069   218  A   W 1342776 + 8 <- (8,81) 1326392
  8,80   1     4400     0.027332354   218  Q   W 1342776 + 8 [kworker/u16:3]
  8,80   1     4401     0.027332675   218  M   W 1342776 + 8 [kworker/u16:3]
  8,80   1     4402     0.027333775   218  A   W 1342784 + 8 <- (8,81) 1326400
  8,80   1     4403     0.027334066   218  Q   W 1342784 + 8 [kworker/u16:3]
  8,80   1     4404     0.027334387   218  M   W 1342784 + 8 [kworker/u16:3]
  8,80   1     4405     0.027335587   218  A   W 1342792 + 8 <- (8,81) 1326408
  8,80   1     4406     0.027335872   218  Q   W 1342792 + 8 [kworker/u16:3]
  8,80   1     4407     0.027336186   218  M   W 1342792 + 8 [kworker/u16:3]
  8,80   1     4408     0.027337256   218  A   W 1342800 + 8 <- (8,81) 1326416
  8,80   1     4409     0.027337542   218  Q   W 1342800 + 8 [kworker/u16:3]
  8,80   1     4410     0.027337900   218  M   W 1342800 + 8 [kworker/u16:3]
  8,80   1     4411     0.027339749   218  A   W 1342808 + 8 <- (8,81) 1326424
This output can and will get large and ugly, and it might take some time to parse through the content manually. Be sure not to let the utility capture for too long.
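One way to keep a capture short is to let blktrace stop itself after a fixed run time; a sketch that records 10 seconds of traffic to trace files and parses them afterward (the device and output name are placeholders):

$ sudo blktrace -w 10 -d /dev/sdf -o sdf_trace
$ blkparse -i sdf_trace | less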
Similar to blktrace, you can view traffic hitting SAS drives with the SCSI logging level facility of the Linux kernel. The simplest way to observe and manage the kernel’s logging level settings is with the scsi_logging_level tool. For instance, to list the current logging level settings, invoke the command with the -g option:
$ sudo scsi_logging_level -g
Current scsi logging level:
/proc/sys/dev/scsi/logging_level = 0
SCSI_LOG_ERROR=0
SCSI_LOG_TIMEOUT=0
SCSI_LOG_SCAN=0
SCSI_LOG_MLQUEUE=0
SCSI_LOG_MLCOMPLETE=0
SCSI_LOG_LLQUEUE=0
SCSI_LOG_LLCOMPLETE=0
SCSI_LOG_HLQUEUE=0
SCSI_LOG_HLCOMPLETE=0
SCSI_LOG_IOCTL=0
The Linux SCSI subsystem is separated into three main levels: the upper layer comprises entry point drivers that allow you to connect with the underlying device (e.g., sd, sg, sr), and the mid layer connects the upper level to the interface driver of the lower level (i.e., Fibre Channel, SAS, iSCSI, etc.). Aside from connecting both upper and lower levels, the mid level is responsible for failure handling and error correction of failed requests propagated up from the underlying device or the host bus adaptor (HBA) in the low level connecting to that underlying device (Figure 1).
To enable the highest level of logging for all SCSI errors and traffic in the low and mid layers, type:
$ sudo scsi_logging_level -s -E 7 -L 7 -M 7
New scsi logging level:
/proc/sys/dev/scsi/logging_level = 2096647
SCSI_LOG_ERROR=7
SCSI_LOG_TIMEOUT=0
SCSI_LOG_SCAN=0
SCSI_LOG_MLQUEUE=7
SCSI_LOG_MLCOMPLETE=7
SCSI_LOG_LLQUEUE=7
SCSI_LOG_LLCOMPLETE=7
SCSI_LOG_HLQUEUE=0
SCSI_LOG_HLCOMPLETE=0
SCSI_LOG_IOCTL=0
When logging is enabled, it will print SCSI-level traces in both dmesg and syslog (the kernel log). For instance, in the example shown in Listing 11, you can see the SCSI Command Descriptor Blocks (CDBs) and traces printed for write, cache synchronization, and inquiry commands.
Listing 11: Kernel Log
Mar 27 15:21:24 dev-machine kernel: [  387.117129] sd 0:0:0:0: [sda] tag#3 Send: scmd 0x000000001842ae5b
Mar 27 15:21:24 dev-machine kernel: [  387.117134] sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 1d 04 5d 88 00 00 20 00
Mar 27 15:21:24 dev-machine kernel: [  387.117468] sd 0:0:0:0: [sda] tag#3 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Mar 27 15:21:24 dev-machine kernel: [  387.117485] sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 1d 04 5d 88 00 00 20 00
Mar 27 15:21:24 dev-machine kernel: [  387.117491] sd 0:0:0:0: [sda] tag#3 scsi host busy 1 failed 0
Mar 27 15:21:24 dev-machine kernel: [  387.117495] sd 0:0:0:0: Notifying upper driver of completion (result 0)
Mar 27 15:21:24 dev-machine kernel: [  387.117585] sd 0:0:0:0: [sda] tag#24 Send: scmd 0x00000000fa2f2e1a
Mar 27 15:21:24 dev-machine kernel: [  387.117590] sd 0:0:0:0: [sda] tag#24 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Mar 27 15:21:24 dev-machine kernel: [  387.130198] sd 0:0:0:0: [sda] tag#24 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Mar 27 15:21:24 dev-machine kernel: [  387.130202] sd 0:0:0:0: [sda] tag#24 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Mar 27 15:21:24 dev-machine kernel: [  387.130205] sd 0:0:0:0: [sda] tag#24 scsi host busy 1 failed 0
Mar 27 15:21:24 dev-machine kernel: [  387.130207] sd 0:0:0:0: Notifying upper driver of completion (result 0)
[ … ]
Mar 27 15:21:25 dev-machine kernel: [  387.655119] sd 2:0:1:1: [sdf] tag#3669 Send: scmd 0x00000000b06d61c4
Mar 27 15:21:25 dev-machine kernel: [  387.655122] sd 2:0:1:1: [sdf] tag#3669 CDB: Inquiry 12 00 00 00 90 00
Mar 27 15:21:25 dev-machine kernel: [  387.655587] sd 2:0:1:1: [sdf] tag#3669 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Mar 27 15:21:25 dev-machine kernel: [  387.655589] sd 2:0:1:1: [sdf] tag#3669 CDB: Inquiry 12 00 00 00 90 00
Mar 27 15:21:25 dev-machine kernel: [  387.655591] sd 2:0:1:1: [sdf] tag#3669 scsi host busy 1 failed 0
Mar 27 15:21:25 dev-machine kernel: [  387.655593] sd 2:0:1:1: Notifying upper driver of completion (result 0)
Much like blktrace, this log can get pretty large and pretty ugly and should be enabled for brief periods at a time to avoid overwhelming the system. Another thing worth noting is that, with capture tools such as these, latencies are introduced and may alter the original I/O profile being chased, so proceed with caution.
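If you only need a quick look while the logging level is raised, you can follow the kernel log live and filter it down to the device in question (sdf here is a placeholder):

$ sudo dmesg -wT | grep sdf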
To disable all logging, type:
$ sudo scsi_logging_level -s -E 0 -L 0 -M 0
New scsi logging level:
/proc/sys/dev/scsi/logging_level = 7
SCSI_LOG_ERROR=7
SCSI_LOG_TIMEOUT=0
SCSI_LOG_SCAN=0
SCSI_LOG_MLQUEUE=0
SCSI_LOG_MLCOMPLETE=0
SCSI_LOG_LLQUEUE=0
SCSI_LOG_LLCOMPLETE=0
SCSI_LOG_HLQUEUE=0
SCSI_LOG_HLCOMPLETE=0
SCSI_LOG_IOCTL=0
Perf Tracing
Another tool captures system traces, but at the Linux kernel level and in the context of CPU-executed operations: perf, which can be obtained from the Linux Kernel Archives. Some Linux distributions offer it in their package repositories, but it might be best to download the latest and greatest from the website (Listing 12).
Listing 12: Get perf from kernel.org
$ wget https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.9.0/perf-5.9.0.tar.xz
--2021-03-21 15:40:30--  https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.9.0/perf-5.9.0.tar.xz
Resolving mirrors.edge.kernel.org (mirrors.edge.kernel.org)... 2604:1380:1:3600::1, 147.75.197.195
Connecting to mirrors.edge.kernel.org (mirrors.edge.kernel.org)|2604:1380:1:3600::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1857268 (1.8M) [application/x-xz]
Saving to: ‘perf-5.9.0.tar.xz’

perf-5.9.0.tar.xz      100%[=============================>]   1.77M  8.85MB/s    in 0.2s

2021-03-21 15:40:30 (8.85 MB/s) - ‘perf-5.9.0.tar.xz’ saved [1857268/1857268]
Extract the file into a local working directory
$ tar xJf perf-5.9.0.tar.xz
and change into the perf source code subdirectory:
$ cd perf-5.9.0/tools/perf/
Resolve any and all dependencies, then run make and make install.
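If you would rather not build from source, most distributions package perf; for example, on Debian or Ubuntu systems and on Fedora or RHEL systems, respectively, something like the following should work (package names vary by distribution and kernel version):

$ sudo apt install linux-tools-common linux-tools-$(uname -r)
$ sudo dnf install perf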
Now that perf is installed, you can generate I/O to a disk drive with fio,
$ sudo fio --bs=1K --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=1 --name=test
and in another terminal window, invoke perf to record kernel operations executed at the CPU level for 30 seconds:
$ sudo ./perf record -a -g -- sleep 30
[ perf record: Woken up 8 times to write data ]
[ perf record: Captured and wrote 2.476 MB perf.data (12373 samples) ]
A perf.data file is generated in the local working directory. You can use tools to open this file directly for analysis or to generate other output from that file. For instance, to generate a report summarizing the 30 seconds, enter:
$ sudo ./perf script > out.script
$ sudo ./perf report > out.report
You will need both files just generated:
$ ls out.*
out.report  out.script
If you open the report file (i.e., out.report), you will see a summary of all the kernel operations and the amount of CPU time they consumed during that 30-second snapshot, along with the subroutines tied to the top-level functions (Listing 13).
Listing 13: Summary of Kernel Operations
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 12K of event 'cycles'
# Event count (approx.): 1354753461
#
# Children      Self  Command          Shared Object       Symbol
# ........  ........  ...............  ..................  ........................................
#
    57.94%     0.00%  swapper          [kernel.kallsyms]   [k] secondary_startup_64
            |
            ---secondary_startup_64
               |
               |--48.28%--start_secondary
               |          |
               |           --48.25%--cpu_startup_entry
               |                     |
               |                      --47.68%--do_idle
               |                                |
               |                                |--36.89%--call_cpuidle
               |                                |          |
               |                                |           --36.81%--cpuidle_enter
               |                                |                     |
               |                                |                      --35.93%--cpuidle_enter_state
               |                                |                                |
               |                                |                                |--16.95%--intel_idle
               |                                |                                |
               |                                |                                |--10.25%--ret_from_intr
               |                                |                                |          |
               |                                |                                |           --10.21%--do_IRQ
               |                                |                                |                     |
               |                                |                                |                     |--7.62%--irq_exit
               |                                |                                |                     |          |
               |                                |                                |                     |           --7.47%--__softirqentry_text_start
               |                                |                                |                     |                     |
               |                                |                                |                     |                      --7.17%--blk_done_softirq
[ … ]
     0.51%     0.04%  fio              [kernel.kallsyms]   [k] mpage_alloc.isra.0
     0.50%     0.00%  fio              libc-2.31.so        [.] 0x00007f0576ab611a
            |
            ---0x7f0576ab611a
     0.50%     0.24%  swapper          [kernel.kallsyms]   [k] sched_clock_cpu
     0.49%     0.00%  perf             perf                [.] 0x00005570000237de
     0.48%     0.18%  swapper          [kernel.kallsyms]   [k] ttwu_do_wakeup
     0.46%     0.19%  swapper          [kernel.kallsyms]   [k] set_next_entity
     0.46%     0.00%  rcu_sched        [kernel.kallsyms]   [k] ret_from_fork
     0.46%     0.00%  rcu_sched        [kernel.kallsyms]   [k] kthread
     0.46%     0.10%  swapper          [kernel.kallsyms]   [k] load_balance
[ … ]
The perf utility is useful when you need to see where the CPU spends most of its time; it provides insight into what could be holding up the completion of a function and can shed light on poorly written functions in kernel space.
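Because the subject here is I/O, it can also be useful to record just the block layer tracepoints rather than sampling every CPU cycle; a minimal sketch (the 10-second window and the follow-up script command are only examples):

$ sudo ./perf record -e block:block_rq_issue,block:block_rq_complete -a -- sleep 10
$ sudo ./perf script | less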
Sometimes it can be difficult to parse this data, which is why you can convert it into a visual chart called a flame graph. To do this, clone the git repo and convert the original output script file into an SVG (Listing 14).
Listing 14: Generating a Flame Graph
$ git clone https://github.com/brendangregg/FlameGraph.git
Cloning into 'FlameGraph'...
remote: Enumerating objects: 1067, done.
remote: Total 1067 (delta 0), reused 0 (delta 0), pack-reused 1067
Receiving objects: 100% (1067/1067), 1.87 MiB | 6.67 MiB/s, done.
Resolving deltas: 100% (612/612), done.
$ ./FlameGraph/stackcollapse-perf.pl out.script > out.folded
$ ./FlameGraph/flamegraph.pl out.folded > out.svg
When you open the file, you will see a visual representation of the 30-second capture (Figure 2). You can graphically navigate this graph with the click of your mouse. As you begin to click into the kernel functions, you will drill deeper and deeper into each function and the subfunctions it might be waiting on (Figures 3 and 4). You will also observe the total CPU time of the functions at the bottom of the SVG page.
Other Things to Consider
You will want to know whether you are looking at a solid state drive (SSD). If so, performance can drop drastically once the SSD reaches its write cliff, the point at which the drive has exhausted its first writes to each NAND memory cell. Once you reach the drive’s write cliff, every write operation exercises a program/erase (P/E) cycle, a multistep process that involves erasing an entire cell before writing to it. Each NAND cell endures only a finite number of P/E cycles, and that number varies with the type of NAND technology employed. Vendors resort to something called over-provisioning to prolong or delay this write cliff.
Often you will get double the NAND memory when purchasing a drive. For instance, a 3TB PCIe SSD might actually contain 6TB of NAND, of which you are only able to use 3TB. One method by which you can restore an SSD closer to its former performing glory is a technique called TRIM (the equivalent SCSI command is UNMAP, exposed in Linux as discard). In this background process, the software tells the drive to prep unused data blocks for a future write operation. Tools such as fstrim trigger this function on demand, and many filesystems (e.g., ext4) and volume managers (e.g., logical volume management, LVM) offer discard support as well.
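As a rough sketch, a mounted filesystem can be trimmed on demand with fstrim, and most systemd-based distributions ship a timer that runs it periodically (the mountpoint below is a placeholder, and the timer’s availability depends on your distribution):

$ sudo fstrim -v /mnt/data
$ sudo systemctl enable --now fstrim.timer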
Conclusion
When something does not seem quite right, you have many ways to peek into a system; a few of those mechanisms were explored here. This guide does not cover everything, but it provides enough of a starting point to get you moving in the right direction toward isolating and diagnosing the root cause of your performance troubles.
The Author
Petros Koutoupis is currently a senior performance software engineer at Cray (now HPE) for its Lustre High Performance File System division. He is also the creator and maintainer of the RapidDisk Project. Koutoupis has worked in the data storage industry for well over a decade and has helped pioneer the many technologies unleashed in the wild today.