« Previous 1 2
When I/O workloads don't perform
A Peak Under the Hood
Perf Tracing
Another tool captures system traces, but at the Linux kernel level and in the context of CPU-executed operations. That tool is called perf
and can be obtained from the Linux Kernel Archives [1]. Some Linux distributions will offer it in their downloadable package repositories, but it might be best to download the latest and greatest from the website (Listing 13).
Listing 13
Get perf from kernel.org
$ wget https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.9.0/perf-5.9.0.tar.xz --2021-03-21 15:40:30-- https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.9.0/perf-5.9.0.tar.xz Resolving mirrors.edge.kernel.org (mirrors.edge.kernel.org)... 2604:1380:1:3600::1, 147.75.197.195 Connecting to mirrors.edge.kernel.org (mirrors.edge.kernel.org)|2604:1380:1:3600::1|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 1857268 (1.8M) [application/x-xz] Saving to: 'perf-5.9.0.tar.xz' perf-5.9.0.tar.xz 100%[=============================================================================================>] 1.77M 8.85MB/s in 0.2s 2021-03-21 15:40:30 (8.85 MB/s) - 'perf-5.9.0.tar.xz' saved [1857268/1857268]
Extract the file into a local working directory
$ tar xJf perf-5.9.0.tar.xz
and change into the perf
source code subdirectory:
$ cd perf-5.9.0/tools/perf/
Resolve any and all dependencies and run make
and make install
utilities.
Now that perf
is installed, you can generate I/O to a disk drive with fio
,
$ sudo fio --bs=1K --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=1 --name=test
and in another terminal window, invoke perf
to record kernel operations executed at the CPU level for 30 seconds:
$ sudo ./perf record -a -g -- sleep 30 [ perf record: Woken up 8 times to write data ] [ perf record: Captured and wrote 2.476 MB perf.data (12373 samples) ]
A perf.data
file is generated in the local working directory. You can use tools to open this file directly for analysis or to generate other output from that file. For instance, to generate a report summarizing the 30 seconds, enter:
$ sudo ./perf script > out.script $ sudo ./perf report > out.report
You will need both files just generated:
$ ls out.* out.report out.script
If you open the report file (i.e., out.report
), you will see a summary of all the kernel operations and the amount of CPU it utilized during that 30-second snapshot, and you will see the subroutines tied to the top-level functions (Listing 14).
Listing 14
Summary of Kernel Operations
# To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 12K of event 'cycles' # Event count (approx.): 1354753461 # # Children Self Command Shared Object Symbol # ........ ........ ............... ........................ ................................................ # 57.94% 0.00% swapper [kernel.kallsyms] [k] secondary_startup_64 | ---secondary_startup_64 | |--48.28%--start_secondary | | | --48.25%--cpu_startup_entry | | | --47.68%--do_idle | | | |--36.89%--call_cpuidle | | | | | --36.81%--cpuidle_enter | | | | | --35.93%--cpuidle_enter_state | | | | | |--16.95%--intel_idle | | | | | |--10.25%--ret_from_intr | | | | | | | --10.21%--do_IRQ | | | | | | | |--7.62%--irq_exit | | | | | | | | | --7.47%--__softirqentry_text_start | | | | | | | | | --7.17%--blk_done_softirq [ ... ] 0.51% 0.04% fio [kernel.kallsyms] [k] mpage_alloc.isra.0 0.50% 0.00% fio libc-2.31.so [.] 0x00007f0576ab611a | ---0x7f0576ab611a 0.50% 0.24% swapper [kernel.kallsyms] [k] sched_clock_cpu 0.49% 0.00% perf perf [.] 0x00005570000237de 0.48% 0.18% swapper [kernel.kallsyms] [k] ttwu_do_wakeup 0.46% 0.19% swapper [kernel.kallsyms] [k] set_next_entity 0.46% 0.00% rcu_sched [kernel.kallsyms] [k] ret_from_fork 0.46% 0.00% rcu_sched [kernel.kallsyms] [k] kthread 0.46% 0.10% swapper [kernel.kallsyms] [k] load_balance [ ... ]
The perf
utility is useful when you might need to see where the CPU spends most of its time and provides insight into what could be holding up the completion of a function or shed light on poorly developed functions in the kernel space.
Sometimes it can be difficult to parse this data, which is why you can convert it into a visual chart called a flame graph [2]. To do this, clone the git
repo and convert the original output script file into an SVG (Listing 15).
Listing 15
Generating a Flame Graph
$ git clone https://github.com/brendangregg/FlameGraph.git Cloning into 'FlameGraph'... remote: Enumerating objects: 1067, done. remote: Total 1067 (delta 0), reused 0 (delta 0), pack-reused 1067 Receiving objects: 100% (1067/1067), 1.87 MiB | 6.67 MiB/s, done. Resolving deltas: 100% (612/612), done. $ ./FlameGraph/stackcollapse-perf.pl out.script > out.folded $ ./FlameGraph/flamegraph.pl out.folded > out.svg
When you open the file, you will see a visual representation of the 30-second capture (Figure 2). You can graphically navigate this graph with the click of your mouse. As you begin to click into the kernel functions, you will drill deeper and deeper into each function and the subfunctions it might be waiting on (Figures 3 and 4). You will also observe the total CPU time of the functions at the bottom of the SVG page.
Other Things to Consider
You will want to know whether you are looking at a solid state drive (SSD). If so, performance can drop drastically once the SSD reaches its write cliff , or the point at which the drive exhausts its first writes to each NAND memory cell. Once you reach the drive's write cliff, you are now exercising a programmable erase (PE) cycle for each write operation. This multistep process involves erasing an entire cell before writing to it. Each NAND cell has a finite number of PE cycles. The PE cycles to a cell will vary on the type of NAND technology employed. Vendors resort to something called over-provisioning to prolong or delay this write cliff.
Often you will get double the NAND memory when purchasing a drive. For instance, on a 3TB PCIe SSD, you might be paying for 6TB of storage, of which you are only able to use 3TB. There are two methods by which someone can restore an SSD to its former performing glory. One is by employing a technique called TRIM (called Discard on SAS technologies). In this background process, the software tells the drive to prep unused data blocks for a future write operation. Tools such as fstrim
will enable this function, but many filesystems (e.g., ext4) and volume managers (e.g., logical volume management, LVM) offer discard
support.
Conclusion
When something does not seem quite right, you have many ways to peek into a system – a few of these mechanisms were explored here. This guide does not cover everything and just provides a starting point, but it is enough of a starting point to get you moving in the right direction of isolating and diagnosing the root cause of your performance troubles.
Infos
- Linux Kernel Archives: http://kernel.org/
- Flame graph: https://github.com/brendangregg/FlameGraph
« Previous 1 2
Buy this article as PDF
(incl. VAT)