Lead Image © Lucy Baldwin, 123RF.com

Favorite benchmarking tools

High Definition

Article from ADMIN 55/2020
By Federico Lucifredi
We take a look at three benchmarking tool favorites: time, hyperfine, and bench.

At the Dragon Propulsion Laboratory, we are partial to using the simplest tool that will do the job at hand – particularly when dealing with the inherent complexity that performance measurement (and tuning) brings to the table. Yet that same complexity often requires advanced tooling to resolve the riddles posed by performance questions. I will examine my current benchmarking tool favorites, from the simplest to the most sophisticated.

Tempus Fugit

The benchmark archetype is time: simple, easy to use, and well understood by most users. In its purest form, time takes a command as a parameter and times its execution in the real world (real), as well as how much CPU time was allocated in user (user) and kernel (sys) modes:

$ time sleep 1
real    0m1.004s
user    0m0.002s
sys     0m0.001s

What not everyone knows is that the default time command is not a separate binary at all: Bash provides it itself, as a shell keyword [1]:

$ type time
time is a shell keyword
$ which time
/usr/bin/time

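Should you want the full-featured binary without typing its absolute path, the shell keyword can be bypassed. A quick sketch; either form makes the shell fall back to the external binary found in your PATH:

$ command time -v sleep 1     # 'command' prevents keyword recognition
$ \time -v sleep 1            # escaping any character of the name has the same effect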
There is time, and then there is GNU time [2]. The standalone binary version sports a few additional capabilities, the most noteworthy being its ability to measure page faults and swapping activity by the tested binary:

$ /usr/bin/time gcc test.c -o test
0.03user 0.01system 0:00.05elapsed 98%CPU (0avgtext+0avgdata 20564maxresident)k
0inputs+40outputs (0major+4475minor)pagefaults 0swaps

The sixth Dojo was dedicated to GNU time's amazing capabilities, and I invite you to read up in your prized archive of ADMIN back issues [3]. Table 1 sums up the capabilities of this versatile tool, which include memory use, basic network benchmarks (packet counts), filesystem I/O operations (read and write counts), and page faults, both minor (MMU) and major (disk access).

Table 1: Format Specifiers*

Option Function
C Image name and command-line arguments (argv)
D Average size of the process's unshared data area (KB)
E Wall clock time used by the process ([hours:]minutes:seconds)
F Number of major page faults (requiring I/O) incurred
I Number of filesystem inputs
K Average total (data + stack + text) memory use of the process (KB)
M Maximum resident set size of the process (KB)
O Number of filesystem outputs by the process
P Percentage of the CPU that this job received (user + system divided by running time)
R Number of minor page faults (not requiring I/O, resolved in RAM)
S Kernel-mode CPU seconds allocated to the process
U User-mode CPU seconds allocated to the process
W Number of times the process was swapped out of main memory
X Average amount of shared text in the process (KB)
Z System's page size (bytes) – this is a system constant
c Number of times the process was context-switched involuntarily (time slice expired)
e Wall clock time used by the process (seconds)
k Number of signals delivered to the process
p Average unshared stack size of the process (KB)
r Number of socket messages received
s Number of socket messages sent
t Average resident set size of the process (KB)
w Number of times that the program was context-switched voluntarily
x Exit status of the command
* Available in GNU time, version 1.7 (adapted from the man page).
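To put a few of these specifiers to work, GNU time's -f option accepts a format string in which each specifier is prefixed with a percent sign. A minimal sketch, reusing the test binary compiled earlier (the numbers reported will of course vary from system to system):

$ /usr/bin/time -f "%e s elapsed, %P CPU, %M KB max RSS, %F major / %R minor page faults" ./test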

Going Deeper

Back in issue 12, I mentioned Martin Pool's promising tool judge [4]. Unfortunately, judge never made it past version 0.1, with its most recent release dating back to 2011. However, Martin's efforts have an impressive successor in David Peter's recent hyperfine [5], a step up from time because it runs the task a number of times and generates the relevant statistics.

Remarkably, the tool determines on its own how many runs are necessary to generate a statistically valid result. By default, it performs 10 runs of the specified command, but the minimum number of runs can be tuned manually (-m). Returning to the same sleep [6] example I used previously, Figure 1 shows a snapshot of the interactive progress report, and Figure 2 shows the final results of the run.
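Invocation is no more involved than with time; a minimal sketch, with -m raising the minimum number of samples:

$ hyperfine 'sleep 1'            # 10 runs by default
$ hyperfine -m 20 'sleep 1'      # insist on at least 20 runs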

Figure 1: Hyperfine displays progress in real time, which is really handy when managing longer tests.
Figure 2: A summary of results for a very simple example.

In this test, hyperfine sampled 10 runs of a simple delay command, determined an expected value (mean) 1.6ms greater than requested with a standard deviation (sigma) of 0.3ms [7], and reported the range between the minimum and maximum values encountered. It also flags questionable measurements, whether statistically insubstantial results or significant outliers; incongruous data will generate warnings like the following:

Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs.

Very short execution times will trigger warnings suggesting a better test and will likely inflate the run count: Try hyperfine 'ls' to see this happen in practice. Such a measurement can be dominated by startup time, or the hot-cache case might not be what you want to measure.

The tool can account for cache warmup runs (--warmup N), and if the opposite behavior is desired, it is just as easy to clear caches before every run by passing the appropriate command (--prepare COMMAND). Exporting the timing results for all runs is also conveniently supported in JSON, CSV, and even Markdown.
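A sketch of these options in action follows; the grep workload and the results.md file name are arbitrary stand-ins, and the cache-dropping prepare command assumes a Linux host with sudo rights:

$ hyperfine --warmup 3 --export-markdown results.md 'grep -rn TODO src'
$ hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' 'grep -rn TODO src'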

Bench Test

The bench [8] command is also geared toward statistical analysis of multiple sampling runs, but it provides much less feedback while measurements are being taken. Once samples are obtained, however, the analysis of the results is comprehensive (Figure 3). As with judge, multiple commands can be compared in a single run, but the pièce de résistance of bench is really its ability to generate beautiful charts from the data. Figure 4 makes the case plainly, displaying the report generated by the test shown in Figure 5. Everything is in one place and ready to share with others on your team.
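The command line is as terse as hyperfine's. A sketch, assuming the HTML report is requested with the --output option that bench passes through to the underlying criterion library; the compared commands here are arbitrary stand-ins:

$ bench 'sleep 1'                              # statistics for a single command
$ bench 'ls' 'ls -a' --output report.html      # compare two commands and chart the results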

Figure 3: Bench thoroughly weighs the presence of outliers in the data.
Figure 4: Bench uses a pure HTML canvas to visualize results interactively.
Figure 5: Generating the Figure 4 test report with Bench.

Lack of pre-built packages in the major Linux distributions and the broken builds for macOS [9] found in brew [10] make installing bench less than the ideal experience, but its usefulness more than makes up for the minor inconvenience of having to use Haskell's stack build system [11] for setup.
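For reference, a sketch of the stack-based setup, which compiles bench and its Haskell dependencies and, by default, drops the resulting binary into ~/.local/bin (make sure that directory is in your PATH):

$ stack install bench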

The Author

Federico Lucifredi (@0xf2) is the Product Management Director for Ceph Storage at Red Hat and was formerly the Ubuntu Server Project Manager at Canonical and the Linux "Systems Management Czar" at SUSE. He enjoys arcane hardware issues and shell-scripting mysteries and takes his McFlurry shaken, not stirred. You can read more from him in the new O'Reilly title AWS System Administration.
