When I/O Workloads Don’t Perform
Every now and then, you find yourself in a situation where you expect better performance from your data storage drives. Either they once performed very well and one day just stopped, or they came straight out of the box underperforming. I explore a few of the reasons why this might happen.
Sometimes, the easiest and quickest way to determine the root cause of a slow drive is to check its local logging data. The method by which this log data is stored differs by drive type, but in the end, the results are generally the same. For instance, a SCSI-based drive, such as a Serial Attached SCSI (SAS) drive, collects drive log data and general metrics in the SCSI log pages (plural, because each page separates the collected data into its respective category). The easiest way to access this data is with the sg3_utils package available for Linux. To find out which log pages the drive supports, execute the sg_logs binary against the block device or SCSI generic (sg) node of the SAS drive in question (Listing 1).
Listing 1: sg_logs
$ sudo sg_logs /dev/sdc
    SEAGATE   ST14000NM0001   K001
Supported log pages  [0x0]:
    0x00        Supported log pages [sp]
    0x02        Write error [we]
    0x03        Read error [re]
    0x05        Verify error [ve]
    0x06        Non medium [nm]
    0x08        Format status [fs]
    0x0d        Temperature [temp]
    0x0e        Start-stop cycle counter [sscc]
    0x0f        Application client [ac]
    0x10        Self test results [str]
    0x15        Background scan results [bsr]
    0x18        Protocol specific port [psp]
    0x1a        Power condition transitions [pct]
    0x2f        Informational exceptions [ie]
    0x37        Cache (seagate) [c_se]
    0x38
    0x3e        Factory (seagate) [f_se]
As you can see, the drive exposes counters for write errors, read errors, drive temperature, and more. To view a specific page, use the -p parameter followed by the page number. For instance, look at the log page for write errors (i.e., 0x2; Listing 2).
Listing 2: Log Page for Write Errors
$ sudo sg_logs -p 0x2 /dev/sdc
    SEAGATE   ST14000NM0001   K001
Write error counter page  [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 3951500537856
  Total uncorrected errors = 0
Seemingly, this drive does not have any write errors (corrected or uncorrected by the drive firmware), so it looks to be in good shape. Typically, if you do see errors, especially of the uncorrected type, the printout will include the failing logical block addresses (LBAs). If the failed LBA regions (i.e., sectors) are listed under the read error category, they are likely in a pending reallocation state (waiting for a future write to the same address). A sector pending reallocation is a sector that can no longer be read or written reliably and must be remapped elsewhere on the disk drive. This reallocation only happens on the next write operation to that failed sector, and only if the drive still has spare sectors to which it can relocate the data. Failing sectors and sectors pending reallocation by the drive's firmware degrade overall drive performance, and if enough of them accumulate, you should replace the drive as soon as possible.
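To watch for this condition across several drives, you can script the check. The sketch below parses a captured copy of the write error page from Listing 2; on a live system, you would pipe in the output of `sudo sg_logs -p 0x2 /dev/sdX` instead:

```shell
# Sketch: extract the uncorrected error counter from a SCSI write error
# log page. The sample is captured output from Listing 2; on a live
# system, substitute: sudo sg_logs -p 0x2 /dev/sdX
sample='Write error counter page  [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total bytes processed = 3951500537856
  Total uncorrected errors = 0'

# Split each line on " = " and keep the value of the counter we care about
errors=$(printf '%s\n' "$sample" | awk -F' = ' '/Total uncorrected errors/ {print $2}')
if [ "$errors" -gt 0 ] 2>/dev/null; then
    echo "WARNING: $errors uncorrected write errors; consider replacing the drive"
else
    echo "No uncorrected write errors reported"
fi
```

Wrapped in a loop over /dev/sd?, the same filter gives a quick fleet-wide triage before digging into any single drive.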
Also understand that if a log page starts to list a significant count of corrected read or write errors, chances are that the disk drive's surrounding environment is at fault. For instance, vibration will often cause a disk drive's head to misread or miswrite a length of sectors on a drive track, which forces the firmware to take corrective action. This process alone introduces unwanted I/O latencies (reducing the performance of the drive).
If you’d like to list all of the log pages at once, use the -a parameter (Listing 3). (Warning: You will get a lot of information.)
Listing 3: List All Log Pages
$ sudo sg_logs -a /dev/sdc
    SEAGATE   ST14000NM0001   K001
Supported log pages  [0x0]:
    0x00        Supported log pages [sp]
    0x02        Write error [we]
    0x03        Read error [re]
    0x05        Verify error [ve]
    0x06        Non medium [nm]
    0x08        Format status [fs]
    0x0d        Temperature [temp]
    0x0e        Start-stop cycle counter [sscc]
    0x0f        Application client [ac]
    0x10        Self test results [str]
    0x15        Background scan results [bsr]
    0x18        Protocol specific port [psp]
    0x1a        Power condition transitions [pct]
    0x2f        Informational exceptions [ie]
    0x37        Cache (seagate) [c_se]
    0x38
    0x3e        Factory (seagate) [f_se]
Write error counter page  [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 3951500537856
  Total uncorrected errors = 0
  Reserved or vendor specific [0xf800] = 0
  Reserved or vendor specific [0xf801] = 0
  Reserved or vendor specific [0xf802] = 0
  Reserved or vendor specific [0xf803] = 0
  Reserved or vendor specific [0xf804] = 0
  Reserved or vendor specific [0xf805] = 0
  Reserved or vendor specific [0xf806] = 0
  Reserved or vendor specific [0xf807] = 0
  Reserved or vendor specific [0xf810] = 0
  Reserved or vendor specific [0xf811] = 0
  Reserved or vendor specific [0xf812] = 0
  Reserved or vendor specific [0xf813] = 0
  Reserved or vendor specific [0xf814] = 0
  Reserved or vendor specific [0xf815] = 0
  Reserved or vendor specific [0xf816] = 0
  Reserved or vendor specific [0xf817] = 0
  Reserved or vendor specific [0xf820] = 0
Read error counter page  [0x3]
  Errors corrected without substantial delay = 0
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 35801804845056
  Total uncorrected errors = 0
Verify error counter page  [0x5]
  Errors corrected without substantial delay = 0
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 0
  Total uncorrected errors = 0
Non-medium error page  [0x6]
  Non-medium error count = 0
Format status page  [0x8]
  Format data out: <not available>
  Grown defects during certification <not available>
  Total blocks reassigned during format <not available>
  Total new blocks reassigned <not available>
  Power on minutes since format <not available>
Temperature page  [0xd]
  Current temperature = 28 C
  Reference temperature = 60 C
Start-stop cycle counter page  [0xe]
  Date of manufacture, year: 2019, week: 26
  Accounting date, year: , week:
  Specified cycle count over device lifetime = 50000
  Accumulated start-stop cycles = 498
  Specified load-unload count over device lifetime = 600000
  Accumulated load-unload cycles = 553
[ ... ]
Other tools (e.g., smartctl) exist to extract similar, and sometimes identical, data from a SAS drive. If a drive supports the industry-standard Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.), you can use the smartmontools package, specifically the smartctl binary (Listing 4).
Listing 4: smartctl on SAS Drive
$ sudo smartctl -a /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-66-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST14000NM0001
Revision:             K001
Compliance:           SPC-5
User Capacity:        7,000,259,821,568 bytes [7.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x6000c500a7b3ceeb0000000000000000
Serial number:        ZKL00CYG0000G925020A
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sun Mar 21 15:00:00 2021 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     28 C
Drive Trip Temperature:        60 C

Manufactured in week 26 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  498
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  553
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 150743545
  Blocks received from initiator = 964465354
  Blocks read from cache and sent to initiator = 1014080851
  Number of read and write commands whose size <= segment size = 8318611
  Number of read and write commands whose size > segment size = 12

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 264.67
  number of minutes until next internal SMART test = 48

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      35801.805           0
write:         0        0         0         0          0       3951.501           0

Non-medium error count:        0

No Self-tests have been logged
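To put that health status line to work across many drives, you can key off its wording. The following sketch is a minimal example (the check_health wrapper function is hypothetical, but the status strings come straight from Listings 4 and 5); it checks captured samples, and on a live system you would capture the output of `sudo smartctl -H /dev/sdX` instead:

```shell
# Sketch: a health check that handles both the SAS and the SATA wording
# of a healthy smartctl status. check_health is an illustrative helper.
check_health() {
    if echo "$1" | grep -qE 'SMART Health Status: OK|self-assessment test result: PASSED'; then
        echo "healthy"
    else
        echo "check drive"
    fi
}

# Captured samples; live: sudo smartctl -H /dev/sdX
sas_sample='SMART Health Status: OK'
sata_sample='SMART overall-health self-assessment test result: PASSED'

check_health "$sas_sample"
check_health "$sata_sample"
```

Anything other than a healthy status is a cue to pull the full attribute and error logs for that drive.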
The smartmontools package is most beneficial for Serial ATA (SATA) drives, because most SATA drives support the feature out of the box. Note that the S.M.A.R.T. output on SATA drives, both the type of data and the way it is formatted, differs from that of their SAS counterparts (Listing 5).
Listing 5: smartctl on SATA Drive
$ sudo smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-66-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES (SATA 6Gb/s)
Device Model:     ST500NM0011
Serial Number:    Z1M11WAJ
LU WWN Device Id: 5 000c50 04edcb79a
Add. Product Id:  DELL(tm)
Firmware Version: PA08
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Mar 21 15:00:38 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  609) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  75) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   077   063   ---    Pre-fail  Always       -       56770409
  3 Spin_Up_Time            0x0003   096   092   ---    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   ---    Old_age   Always       -       137
  5 Reallocated_Sector_Ct   0x0033   100   100   ---    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   ---    Pre-fail  Always       -       5578572
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       400
 10 Spin_Retry_Count        0x0013   100   099   ---    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       135
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   ---    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   058   ---    Old_age   Always       -       28 (Min/Max 22/28)
191 G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   ---    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   100   100   ---    Old_age   Always       -       487
194 Temperature_Celsius     0x0022   028   042   ---    Old_age   Always       -       28 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   113   099   ---    Old_age   Always       -       56770409
197 Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       344 (218 109 0)
241 Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       1389095282
242 Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       619165492

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         2         -
# 2  Extended offline    Completed without error       00%         2         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
For the most part, the information is generally the same. For instance, in the drive attributes table, attribute 197, Current_Pending_Sector, counts the same sectors pending reallocation discussed earlier. Again, you can gather drive temperature information, lifetime hours, and more.
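If you monitor just one attribute, make it 197. This sketch extracts its raw value from a captured attribute table row (taken from Listing 5); on a live system, you would pipe in the output of `sudo smartctl -A /dev/sdX` instead:

```shell
# Sketch: pull the raw value of SMART attribute 197 (Current_Pending_Sector)
# from a captured attribute row; live: sudo smartctl -A /dev/sdX
sample='197 Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0'

# The attribute name is field 2; the raw value is the last field
pending=$(printf '%s\n' "$sample" | awk '$2 == "Current_Pending_Sector" {print $NF}')
if [ "$pending" -gt 0 ] 2>/dev/null; then
    echo "WARNING: $pending sectors pending reallocation"
else
    echo "No sectors pending reallocation"
fi
```

A nonzero and growing value here is one of the strongest early signals that a drive is on its way out.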
How About CPU and Drive Utilization?
Now you have checked all your drives, but for some reason, they are still not performing as expected. The next step should be to determine whether drive utilization is too high or the CPU is too busy to keep up with I/O requests. The sysstat package provides a nice little utility called iostat that gathers both sets of data. In the example in Listing 6, iostat shows an extended set of drive metrics and the CPU utilization at two-second intervals.
Listing 6: iostat Output
$ iostat -x -d 2 -c
Linux 5.4.12-050412-generic (dev-machine)  03/14/2021  _x86_64_  (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.79    0.07    1.19    2.89    0.00   95.06

Device    r/s     w/s     rkB/s     wkB/s   rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda      10.91    6.97    768.20    584.64    4.87   18.20  30.85  72.31    13.16    20.40    0.26     70.44     83.89   1.97   3.52
nvme0n1  58.80   12.22  17720.47     48.71  230.91    0.01  79.70   0.08     0.42     0.03    0.00    301.34      3.98   1.02   7.24
sdb       0.31   55.97      4.13  17676.32    0.00  231.64   0.00  80.54     2.50     8.47    0.32     13.45    315.84   1.30   7.32
sdc       0.24    0.00      3.76      0.00    0.00    0.00   0.00   0.00     2.47     0.00    0.00     15.64      0.00   1.03   0.02
sde       2.47    0.00     62.57      0.00    0.00    0.00   0.00   0.00     0.63     0.00    0.00     25.34      0.00   0.29   0.07
sdf       1.51    0.00     32.42      0.00    0.00    0.00   0.00   0.00     0.69     0.00    0.00     21.40      0.00   0.31   0.05
sdd       1.42    0.00     50.96      0.00    0.00    0.00   0.00   0.00     0.44     0.00    0.00     35.83      0.00   0.38   0.05
md0      12.43   12.17     54.39     48.68    0.00    0.00   0.00   0.00     0.00     0.00    0.00      4.37      4.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.76    0.00    3.03    1.26    0.00   94.95

Device    r/s     w/s     rkB/s     wkB/s   rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda       0.00    9.00      0.00     88.00    0.00    8.00   0.00  47.06     0.00    30.83    0.26      0.00      9.78   0.67   0.60
nvme0n1 2769.50 2682.00  29592.00  10723.25  241.00    0.00   8.01   0.00     0.11     0.02    0.01     10.68      4.00   0.14  77.60
sdb       0.00 2731.00      0.00  27814.00    0.00  241.00   0.00   8.11     0.00    12.20   30.13      0.00     10.18   0.29  79.40
sdc       0.00    0.00      0.00      0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sde       0.00    0.00      0.00      0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdf       0.00    0.00      0.00      0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdd       0.00    0.00      0.00      0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
md0     2717.50 2679.00  10870.00  10716.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      4.00      4.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.51    0.00    2.42    0.00    0.00   97.07

Device    r/s     w/s     rkB/s     wkB/s   rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda       0.00    0.00      0.00      0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
nvme0n1 2739.00 2747.50  27336.00  10988.50  210.00    0.00   7.12   0.00     0.12     0.02    0.00      9.98      4.00   0.14  77.20
sdb       0.00 2797.50      0.00  28270.00    0.00  210.00   0.00   6.98     0.00    11.75   29.38      0.00     10.11   0.28  78.80
sdc       0.00    0.00      0.00      0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sde       0.00    0.00      0.00      0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdf       0.00    0.00      0.00      0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
sdd       0.00    0.00      0.00      0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
md0     2688.00 2746.50  10752.00  10986.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00      4.00      4.00   0.00   0.00
The first interval should be ignored, because it reports averages accumulated since boot rather than activity within the interval itself, so the numbers look a bit off. Once the utility stabilizes by the second interval, you will see a more accurate picture of each disk drive: reads per second (r/s), writes per second (w/s), the average time for I/O requests to complete for reads (r_await) and writes (w_await), a calculation of how busy the drive is (%util), and more. The higher the %util number, the busier the drive is completing I/O requests. If that number is consistently high, you might need to throttle the amount of I/O sent to the drive or balance the same I/O across multiple drives (e.g., in a RAID 0, 5, or 6 configuration).
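To act on these numbers in a script, you can filter the %util column. The following minimal sketch uses sample rows taken from Listing 6 rather than a live iostat run, and the 75% threshold is an arbitrary illustration; live, you would pipe iostat -x device rows into the same awk filter:

```shell
# Sketch: flag devices whose %util (the last column of iostat -x device
# rows) exceeds a threshold. Sample rows captured from Listing 6.
sample='nvme0n1 2769.50 2682.00 29592.00 10723.25 241.00 0.00 8.01 0.00 0.11 0.02 0.01 10.68 4.00 0.14 77.60
sdb 0.00 2731.00 0.00 27814.00 0.00 241.00 0.00 8.11 0.00 12.20 30.13 0.00 10.18 0.29 79.40
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00'

# $NF is %util; +0 forces a numeric comparison
busy=$(printf '%s\n' "$sample" | awk '$NF+0 > 75 {print $1}')
echo "Devices above 75% utilization:"
echo "$busy"
```

Here, nvme0n1 and sdb would be flagged, matching what a visual scan of Listing 6 suggests: those two devices are doing nearly all of the work.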
Also, notice the average CPU metrics at the top of each interval. Here, you will find a breakdown of how much CPU time is spent performing tasks, waiting on I/O completion (%iowait), idling, and so on. The less idle time in the system, the more likely your drive performance will suffer.
You can view a real-time breakdown of these CPU cores with the top utility. After opening the top application at the command line, press the 1 key (Listing 7).
Listing 7: top Output
top - 19:44:01 up 15 min,  3 users,  load average: 1.08, 0.68, 0.42
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.7 us,  1.4 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu1  :  0.3 us,  3.1 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu2  :  0.3 us,  1.7 sy,  0.0 ni, 97.6 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7951.2 total,   6269.5 free,    210.9 used,   1470.8 buff/cache
MiB Swap:   3934.0 total,   3934.0 free,      0.0 used.   7079.6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3294 root      20   0  748016   4760    988 S  13.3   0.1   0:01.19 fio
 3155 root      20   0       0      0      0 D   2.3   0.0   0:14.55 md0_resync
   18 root      20   0       0      0      0 S   0.3   0.0   0:00.05 ksoftirqd/1
 3152 root      20   0       0      0      0 S   0.3   0.0   0:05.38 md0_raid1
 3284 petros    20   0    9496   4080   3356 R   0.3   0.1   0:00.04 top
 3286 root      20   0  813548 428684 424916 S   0.3   5.3   0:00.37 fio
Enough Free Memory?
If the CPU is not the problem and the drives are being underutilized, do you have a constraint on memory resources? The easiest and quickest way to check is with free:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7951         201        7037           1         712        7493
Swap:          4095           0        4095
The free utility reports the amount of total, used, and free memory on the system, and it also shows how much memory is used as buffer and temporary cache (buff/cache) and how much is available and reclaimable (available). The output here is in megabytes (the -g argument reports gigabytes and -k kilobytes) and matches the data found in /proc/meminfo; the /proc/meminfo output only looks different because it is reported in kilobytes:
$ cat /proc/meminfo | grep -e "^Mem"
MemTotal:        8142012 kB
MemFree:         7204048 kB
MemAvailable:    7672148 kB
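To illustrate the arithmetic, the following sketch computes the percentage of memory still available from /proc/meminfo figures. It parses the sample shown above so the result is reproducible; on a live system, you could pipe in the file itself:

```shell
# Sketch: compute MemAvailable as a percentage of MemTotal from
# captured /proc/meminfo figures; live: cat /proc/meminfo
sample='MemTotal:        8142012 kB
MemFree:         7204048 kB
MemAvailable:    7672148 kB'

pct=$(printf '%s\n' "$sample" | awk '
    /^MemTotal/     { total = $2 }
    /^MemAvailable/ { avail = $2 }
    END { printf "%.0f", (avail / total) * 100 }')
echo "${pct}% of memory is available"
```

For the sample figures, roughly 94% of memory remains available, which is why this mostly idle system shows no signs of memory pressure.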
When free memory drops drastically while available memory remains high, a lot of memory is being consumed by the system and its applications, but much of it sits in temporary caches that can be reclaimed when the operating system comes under memory pressure. If the system begins to reclaim memory, it will affect overall system performance, and you will observe kswapd or swapper activity in the dmesg or syslog output, indicating that the kernel is hard at work freeing up reclaimable memory. If both free and available memory decrease, the system has little memory left to reclaim, so when an application asks for more memory, the page allocation can fail, likely forcing the application to exit early. This condition is typically accompanied by an entry in dmesg or syslog notifying the system administrator or user that a page allocation failure has occurred.
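If you suspect memory pressure, the kernel log is the place to look. The following sketch greps for the relevant messages; note that the sample lines are illustrative stand-ins (the exact message text varies by kernel version), and on a live system you would feed in dmesg or /var/log/syslog output instead:

```shell
# Sketch: count kernel log lines that suggest memory pressure.
# The sample lines are illustrative stand-ins; live, you would use:
#   dmesg | grep -cE 'page allocation failure|kswapd'
sample='kswapd0: reclaim started
nginx: page allocation failure: order:4, mode:0x40cc0
systemd[1]: Started daily apt activities.'

count=$(printf '%s\n' "$sample" | grep -cE 'page allocation failure|kswapd')
echo "$count memory pressure indicators found"
```

A handful of such lines during a burst of load may be harmless; a steady stream of them means the system is regularly short on memory.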
If you find yourself in a situation in which the system is struggling to find memory resources to serve applications and I/O requests, you will have to identify the greatest offender(s) and resolve the problem. In a worst-case scenario, an application may contain a memory leak and not properly free unused memory, consuming ever more system memory that cannot be reclaimed; that has to be addressed in the application's code. A simple way to observe the greatest consumers of your memory resources is with the ps utility (Listing 8).
Listing 8: ps Output
$ ps aux | head -1; ps aux | sort -rnk 4 | head -9
USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START  TIME COMMAND
root      5088  5.0  5.2  815852 429928 pts/2 Sl+  15:35  0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5097  2.2  0.4  783092  38248 ?     Ds   15:35  0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5096  2.1  0.4  783088  38168 ?     Ds   15:35  0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5095  2.0  0.4  783084  38204 ?     Ds   15:35  0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      5094  2.1  0.4  783080  38236 ?     Ds   15:35  0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root      1421  0.2  0.3 1295556  29472 ?     Ssl  14:51  0:05 /usr/lib/snapd/snapd
root       990  0.0  0.2  107888  16912 ?     Ssl  14:50  0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root       844  0.0  0.2  280452  18256 ?     SLsl 14:50  0:00 /sbin/multipathd -d -s
systemd+   912  0.0  0.1   24092  10612 ?     Ss   14:50  0:00 /lib/systemd/systemd-resolved
The fourth column (%MEM) is the column on which you should focus. In this example, the fio utility is consuming 5.2% of the system memory, and as soon as it exits, it will free that memory back into the larger pool of available memory for future use.
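The sort-by-%MEM pipeline can be distilled as follows. This sketch runs against a captured, trimmed snippet of ps aux output so the steps are easy to follow; on a live system, you would replace the sample with real ps aux output:

```shell
# Sketch: find the largest memory consumer by sorting a captured ps aux
# snippet on the %MEM column (field 4); live, replace with: ps aux
sample='root   5088  5.0  5.2  815852 429928 pts/2 Sl+ 15:35 0:00 fio
root   1421  0.2  0.3 1295556  29472 ?     Ssl 14:51 0:05 snapd
root    990  0.0  0.2  107888  16912 ?     Ssl 14:50 0:00 unattended-upgrade-shutdown'

top_proc=$(printf '%s\n' "$sample" | sort -rnk 4 | head -1 | awk '{print $NF}')
echo "Largest memory consumer: $top_proc"
```

On systems with procps, `ps -eo pid,pmem,comm --sort=-pmem | head` produces a similar report directly, without the external sort.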
Also worth considering is tuning the kernel's virtual memory subsystem with the sysctl utility. A guide to the tunable parameters is found in the Documentation/admin-guide/sysctl/vm.rst file of the Linux kernel source. Tunables include filesystem buffering and caching thresholds, memory swapping behavior, and others.
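As a purely illustrative sketch (the values below are placeholders, not recommendations; appropriate settings depend on your workload and installed memory), a drop-in file such as a hypothetical /etc/sysctl.d/99-vm.conf might adjust a few of these tunables:

```
# Illustrative values only -- tune for your own workload.
# How aggressively the kernel swaps out anonymous memory (0-100)
vm.swappiness = 10
# Percentage of memory that can hold dirty pages before writes block
vm.dirty_ratio = 20
# Percentage at which background writeback kicks in
vm.dirty_background_ratio = 10
```

You can apply such a file with `sudo sysctl --system`, or experiment with a single value first with `sudo sysctl -w vm.swappiness=10`.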