When I/O workloads don't perform

A Peak Under the Hood

Perf Tracing

Another tool captures system traces, but at the Linux kernel level and in the context of CPU-executed operations. That tool is called perf and can be obtained from the Linux Kernel Archives [1]. Some Linux distributions will offer it in their downloadable package repositories, but it might be best to download the latest and greatest from the website (Listing 13).

Listing 13

Get perf from kernel.org

$ wget https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.9.0/perf-5.9.0.tar.xz
--2021-03-21 15:40:30--  https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.9.0/perf-5.9.0.tar.xz
Resolving mirrors.edge.kernel.org (mirrors.edge.kernel.org)... 2604:1380:1:3600::1, 147.75.197.195
Connecting to mirrors.edge.kernel.org (mirrors.edge.kernel.org)|2604:1380:1:3600::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1857268 (1.8M) [application/x-xz]
Saving to: 'perf-5.9.0.tar.xz'
perf-5.9.0.tar.xz                            100%[=============================================================================================>]   1.77M  8.85MB/s    in 0.2s
2021-03-21 15:40:30 (8.85 MB/s) - 'perf-5.9.0.tar.xz' saved [1857268/1857268]

Extract the file into a local working directory

$ tar xJf perf-5.9.0.tar.xz

and change into the perf source code subdirectory:

$ cd perf-5.9.0/tools/perf/

Resolve any and all dependencies and run make and make install utilities.

Now that perf is installed, you can generate I/O to a disk drive with fio,

$ sudo fio --bs=1K --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=1 --name=test

and in another terminal window, invoke perf to record kernel operations executed at the CPU level for 30 seconds:

$ sudo ./perf record -a -g -- sleep 30
[ perf record: Woken up 8 times to write data ]
[ perf record: Captured and wrote 2.476 MB perf.data (12373 samples) ]

A perf.data file is generated in the local working directory. You can use tools to open this file directly for analysis or to generate other output from that file. For instance, to generate a report summarizing the 30 seconds, enter:

$ sudo ./perf script > out.script
$ sudo ./perf report > out.report

You will need both files just generated:

$ ls out.*
out.report  out.script

If you open the report file (i.e., out.report), you will see a summary of all the kernel operations and the amount of CPU it utilized during that 30-second snapshot, and you will see the subroutines tied to the top-level functions (Listing 14).

Listing 14

Summary of Kernel Operations

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 12K of event 'cycles'
# Event count (approx.): 1354753461
#
# Children      Self  Command          Shared Object             Symbol
# ........  ........  ...............  ........................  ................................................
#
    57.94%     0.00%  swapper          [kernel.kallsyms]         [k] secondary_startup_64
            |
            ---secondary_startup_64
               |
               |--48.28%--start_secondary
               |          |
               |           --48.25%--cpu_startup_entry
               |                     |
               |                      --47.68%--do_idle
               |                                |
               |                                |--36.89%--call_cpuidle
               |                                |          |
               |                                |           --36.81%--cpuidle_enter
               |                                |                     |
               |                                |                      --35.93%--cpuidle_enter_state
               |                                |                                |
               |                                |                                |--16.95%--intel_idle
               |                                |                                |
               |                                |                                |--10.25%--ret_from_intr
               |                                |                                |          |
               |                                |                                |           --10.21%--do_IRQ
               |                                |                                |                     |
               |                                |                                |                     |--7.62%--irq_exit
               |                                |                                |                     |          |
               |                                |                                |                     |           --7.47%--__softirqentry_text_start
               |                                |                                |                     |                     |
               |                                |                                |                     |                      --7.17%--blk_done_softirq
[ ... ]
     0.51%     0.04%  fio              [kernel.kallsyms]         [k] mpage_alloc.isra.0
     0.50%     0.00%  fio              libc-2.31.so              [.] 0x00007f0576ab611a
            |
            ---0x7f0576ab611a
     0.50%     0.24%  swapper          [kernel.kallsyms]         [k] sched_clock_cpu
     0.49%     0.00%  perf             perf                      [.] 0x00005570000237de
     0.48%     0.18%  swapper          [kernel.kallsyms]         [k] ttwu_do_wakeup
     0.46%     0.19%  swapper          [kernel.kallsyms]         [k] set_next_entity
     0.46%     0.00%  rcu_sched        [kernel.kallsyms]         [k] ret_from_fork
     0.46%     0.00%  rcu_sched        [kernel.kallsyms]         [k] kthread
     0.46%     0.10%  swapper          [kernel.kallsyms]         [k] load_balance
[ ... ]

The perf utility is useful when you might need to see where the CPU spends most of its time and provides insight into what could be holding up the completion of a function or shed light on poorly developed functions in the kernel space.

Sometimes it can be difficult to parse this data, which is why you can convert it into a visual chart called a flame graph [2]. To do this, clone the git repo and convert the original output script file into an SVG (Listing 15).

Listing 15

Generating a Flame Graph

$ git clone https://github.com/brendangregg/FlameGraph.git
Cloning into 'FlameGraph'...
remote: Enumerating objects: 1067, done.
remote: Total 1067 (delta 0), reused 0 (delta 0), pack-reused 1067
Receiving objects: 100% (1067/1067), 1.87 MiB | 6.67 MiB/s, done.
Resolving deltas: 100% (612/612), done.
$ ./FlameGraph/stackcollapse-perf.pl out.script > out.folded
$ ./FlameGraph/flamegraph.pl out.folded > out.svg

When you open the file, you will see a visual representation of the 30-second capture (Figure 2). You can graphically navigate this graph with the click of your mouse. As you begin to click into the kernel functions, you will drill deeper and deeper into each function and the subfunctions it might be waiting on (Figures 3 and 4). You will also observe the total CPU time of the functions at the bottom of the SVG page.

Figure 2: Flame graph: a visual representation of the perf data.
Figure 3: Clicking a kernel function in a flame graph focuses on the function and any linked subfunctions.
Figure 4: Drilling deeper into the flame graph and subfunctions.

Other Things to Consider

You will want to know whether you are looking at a solid state drive (SSD). If so, performance can drop drastically once the SSD reaches its write cliff , or the point at which the drive exhausts its first writes to each NAND memory cell. Once you reach the drive's write cliff, you are now exercising a programmable erase (PE) cycle for each write operation. This multistep process involves erasing an entire cell before writing to it. Each NAND cell has a finite number of PE cycles. The PE cycles to a cell will vary on the type of NAND technology employed. Vendors resort to something called over-provisioning to prolong or delay this write cliff.

Often you will get double the NAND memory when purchasing a drive. For instance, on a 3TB PCIe SSD, you might be paying for 6TB of storage, of which you are only able to use 3TB. There are two methods by which someone can restore an SSD to its former performing glory. One is by employing a technique called TRIM (called Discard on SAS technologies). In this background process, the software tells the drive to prep unused data blocks for a future write operation. Tools such as fstrim will enable this function, but many filesystems (e.g., ext4) and volume managers (e.g., logical volume management, LVM) offer discard support.

Conclusion

When something does not seem quite right, you have many ways to peek into a system – a few of these mechanisms were explored here. This guide does not cover everything and just provides a starting point, but it is enough of a starting point to get you moving in the right direction of isolating and diagnosing the root cause of your performance troubles.

Infos

  1. Linux Kernel Archives: http://kernel.org/
  2. Flame graph: https://github.com/brendangregg/FlameGraph

The Author

Petros Koutoupis is currently a senior performance software engineer at Cray (now HPE) for its Lustre High Performance File System division. He is also the creator and maintainer of the RapidDisk Project (http://www.rapiddisk.org/). Koutoupis has worked in the data storage industry for well over a decade and has helped pioneer the many technologies unleashed in the wild today.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • When I/O Workloads Don’t Perform

    Isolating and diagnosing the root causes of your performance troubles.

  • SMART storage device monitoring
    Most storage devices have SMART capability, but can it help you predict failure? We look at ways to take advantage of this built-in monitoring technology with the smartctl utility from the Linux smartmontools package.
  • Predicting Drive Doom

    Modern drives use S.M.A.R.T. (self-monitoring, analysis, and reporting technology) to gather information and run self-tests. Smartmontools is a Linux tool for interacting with the S.M.A.R.T. features of drives. But is S.M.A.R.T. the best way to predict impending drive doom?

  • S.M.A.R.T., smartmontools, and drive monitoring
    Modern drives use S.M.A.R.T. (self-monitoring, analysis, and reporting technology) to gather information and run self-tests. Smartmontools is a Linux tool for interacting with the S.M.A.R.T. features of drives. But is S.M.A.R.T. the best way to predict impending drive doom?
  • NVMe-oF gears up to replace iSCSI
    NVMe over Fabrics promises high-redundancy centralized storage with simple maintenance and management and almost twice the speed of its antiquated iSCSI predecessor.
comments powered by Disqus