Virtuous Benchmarks: Using Benchmarks to Your Advantage
Benchmarks have been misused by both users and vendors for many years, but they don’t have to be the evil creature we all think them to be.
Having worked as a user, customer, developer, and administrator on one side and as a vendor on the other, I can say that benchmarks have been one of the most contentious issues in the HPC industry.
As a user and customer, I used benchmarks to get an idea of performance and to compare product metrics, such as performance/price or performance/watt. However, benchmarks require time and effort, both to create and to interpret: the request for proposal is held up while the vendors run the benchmarks, which in turn delays the introduction of a new system. Moreover, after the system is installed, the requested benchmarks are often re-run to make sure the system meets the vendor’s guarantees, which again delays putting it into production.
On the vendor side, I used benchmarks to improve my understanding of how new systems performed, so I could make good recommendations to customers. They also helped me explain to customers how much work would be needed to port applications to these new systems. To do this, I ran standard benchmarks and commercial applications on the new systems and published the results in a series of articles and blog posts. On top of that, customer-specific benchmarks typically took a great deal of work.
Because of the enormous amount of effort required in this process, both sides – customer and vendor – view benchmarks as a necessary evil. Neither side really wants them; nonetheless, they use them. That said, perhaps I can find a way to use them that isn’t so evil. To begin this quest, I’ll examine the benchmarks typically run when installing a system.
Installation Benchmarks
During installation, the system is reconstructed on the customer site, which includes racking and cabling the hardware and installing or checking the system software. Once the system is up and running, benchmarks are run to determine two things: Are the nodes and network functioning correctly? Is system performance as promised?
In my experience, to accomplish these two goals, you should run a series of benchmarks that start with single-node runs and progress to groups of nodes of various sizes.
Single-Node Runs
I like to start with the individual nodes and then work up, so I begin by running the exact same tests on all of the nodes as close to the same time as possible. The tests should run fairly quickly yet stress various components of the system. For example, they should definitely stress the processor(s) and memory, especially the bandwidth. I would recommend running single-core tests and tests that use all of the cores (i.e., MPI or OpenMP).
A number of benchmarks are available for you to run. The ones I like are the NAS Parallel Benchmarks (NPB), a set of benchmarks that cover a wide range of applications, primarily from the field of computational fluid dynamics (CFD). I’ve found they really stress the CPU, memory bandwidth, and network in various ways. NPB comes in OpenMP and MPI versions, and its “classes” let you run different problem sizes. Plus, the benchmarks are very easy to build and run, and the output is easy to interpret.
The NASA website provides the following details on the NPB benchmarks.
- Five kernel benchmarks:
- IS – Sort small integers using the bucket sort. Typically uses random memory access.
- EP – Embarrassingly parallel application. Generates independent Gaussian random variates using the Marsaglia polar method.
- CG – Estimate the smallest eigenvalue of a large sparse symmetric positive-definite matrix using the inverse iteration with the conjugate gradient method as a subroutine for solving systems of linear equations. Uses irregular memory access and communication.
- MG – Approximate the solution of a three-dimensional discrete Poisson equation using the V-cycle multigrid method on a sequence of meshes. Exhibits both long- and short-distance communication and is memory intensive.
- FT – Solve a three-dimensional partial differential equation (PDE) using the fast Fourier transform. Uses a great deal of all-to-all communication.
- Three pseudo-applications:
- BT – Solves a synthetic system of nonlinear PDEs using a block tri-diagonal solver.
- SP – Solves a synthetic system of nonlinear PDEs using a scalar penta-diagonal solver.
- LU – Solves a synthetic system of nonlinear PDEs using symmetric successive over-relaxation (SSOR). Also referred to as a Lower-Upper Gauss–Seidel solver.
These tests have both OpenMP and MPI versions, and a “multizone” version of the pseudo-applications can be run in a hybrid mode (i.e., MPI/OpenMP).
The benchmark classes in Table 1 indicate the size of the problem being examined and correlate with the amount of memory used and the amount of time needed to complete.
Table 1: NPB Benchmark Classes
Class | Test Size | Description
S | Small | Quick tests
W | Workstation | Sized for workstations of the 1990s
A, B, C | Standard | ~4x size increase going from one class to the next
D, E, F | Large | ~16x size increase from each of the previous classes
NPB has had three major releases, each of which has gone through several versions as bugs were found or improvements were introduced. As of this writing, the latest version is 3.3.1 for both NPB and NPB-MZ (multizone).
Benchmark results are usually expressed in terms of how much (wall clock) time the run takes and in GFLOPS (10^9 floating point operations per second) or MFLOPS (10^6 floating point operations per second). For example, Listing 1 presents the output of the MG benchmark (NPB 3.3.1, GCC compilers, OpenMPI, a single socket with four physical cores and hyperthreading enabled for eight logical cores, Class C).
Listing 1: MG Benchmark Output
 NAS Parallel Benchmarks 3.3 -- MG Benchmark

 No input file. Using compiled defaults
 Size:  512x 512x 512  (class C)
 Iterations:   20
 Number of processes:      8

 Initialization time:       5.245 seconds

  iter    1
  iter    5
  iter   10
  iter   15
  iter   20

 Benchmark completed
 VERIFICATION SUCCESSFUL
 L2 Norm is  0.5706732285739E-06
 Error is    0.1345119360807E-12


 MG Benchmark Completed.
 Class           =                        C
 Size            =            512x 512x 512
 Iterations      =                       20
 Time in seconds =                    35.44
 Total processes =                        8
 Compiled procs  =                        8
 Mop/s total     =                  4393.44
 Mop/s/process   =                   549.18
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              28 Nov 2014
 ...
The output shows that the run took 35.44 seconds and achieved a total of 4,393.44 Mop/s (roughly 4.39 GFLOPS).
For testing (benchmarking), I select a subset of the NPB benchmarks and classes, execute single-node runs (either OpenMP or MPI) on all of the nodes roughly at the same time, and name the output files to match the node name. To collect the output from all of the runs, I use simple Bash or Python scripts.
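As an illustration, Listing 2 is a minimal Python sketch of the kind of collection script I mean. It assumes the per-node output files follow a <benchmark>.<class>.<node>.out naming scheme in a results/ directory (my own convention, nothing NPB imposes) and pulls the "Mop/s total" and "Time in seconds" lines out of each file.

Listing 2: Collecting Per-Node NPB Results (Python Sketch)

#!/usr/bin/env python
# collect_npb.py -- gather per-node NPB results into one CSV file.
# Assumes output files named <benchmark>.<class>.<node>.out in results/
# (e.g., mg.C.node042.out); adjust the glob pattern and filename parsing
# to match your own naming scheme.

import csv
import glob
import re

rows = []
for path in sorted(glob.glob("results/*.out")):
    # "results/mg.C.node042.out" -> benchmark "mg", class "C", node "node042"
    name = path.split("/")[-1].rsplit(".out", 1)[0]
    bench, klass, node = name.split(".", 2)

    text = open(path).read()
    mops = re.search(r"Mop/s total\s*=\s*([\d.]+)", text)
    secs = re.search(r"Time in seconds\s*=\s*([\d.]+)", text)

    rows.append({
        "node": node,
        "benchmark": bench,
        "class": klass,
        "mops_total": float(mops.group(1)) if mops else None,
        "seconds": float(secs.group(1)) if secs else None,
        "verified": "VERIFICATION SUCCESSFUL" in text,
    })

if rows:
    with open("npb_results.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)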
With this data in hand, I first look for performance outliers. To begin, I compute the average (arithmetic mean) and standard deviation of all of the results for each test. If the standard deviation is a significant percentage of the average, I then plot the data on a graph of performance versus node number, which I inspect visually for outliers.
From the plot, I can mark some nodes as outliers that need to be re-tested and possibly triaged. Next, I remove the data of the outlier nodes from the totals and recompute the average and standard deviation, repeating the outlier identification process. At some point, one hopes the standard deviation becomes a small percentage of the average, so I can stop the testing process with a set of good nodes and a set of outlier nodes.
For example, I might start with a target standard deviation of +/-5% of the average. (Note that 5% is an example, not a hard and fast number.) If the computed standard deviation is greater than 5% of the average, I plot the results and start flagging the nodes that fall outside this band. Next, I recompute the average and standard deviation of the reduced set and repeat until I reach the 5% target.
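Listing 3 is a rough sketch of that trimming loop in Python, assuming the per-node results have already been collected into a dictionary mapping node names to Mop/s. Exactly how you flag outliers on each pass (one standard deviation, the 5% band, or eyeballing a plot) is up to you.

Listing 3: Iterative Outlier Trimming (Python Sketch)

import statistics

def find_outliers(results, target=0.05):
    """Iteratively trim nodes until the standard deviation falls within
    `target` (a fraction, e.g., 0.05 for 5%) of the average.
    `results` maps node name -> Mop/s. Returns (good_nodes, outliers)."""
    good = dict(results)
    outliers = {}
    while len(good) > 2:
        avg = statistics.mean(good.values())
        dev = statistics.stdev(good.values())
        if dev <= target * avg:
            break
        # Flag everything more than one standard deviation from the average;
        # you could just as easily flag nodes outside the +/-5% band.
        flagged = {n: v for n, v in good.items() if abs(v - avg) > dev}
        if not flagged:
            break
        outliers.update(flagged)
        for n in flagged:
            del good[n]
    return good, outliers

# Example: good, bad = find_outliers({"node001": 4393.4, "node002": 4410.2})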
With the set of outlier nodes, I re-run the benchmarks one or two more times to see if the performance changes. If it does not, then I triage the nodes (up to and including replacement).
The last step is probably one of the most critical steps you can take, and it goes to the heart of this article. Be sure to store the single-node results somewhere you can easily retrieve them. Also store the source, and even the binaries, along with information on how you built the code, including software versions.
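Listing 4 shows one lightweight way to do that, and it is only a sketch: it writes a small JSON record per result into a db/ directory. The field names and directory layout are nothing more than suggestions to adapt to your site.

Listing 4: Saving a Result with Build Metadata (Python Sketch)

import json
import os
import platform
import time

def save_record(node, benchmark, klass, mops, seconds,
                compiler, mpi, npb_version="3.3.1"):
    """Write one benchmark result plus build/run metadata to a JSON file
    in a db/ directory (one small file per result)."""
    os.makedirs("db", exist_ok=True)
    record = {
        "date": time.strftime("%Y-%m-%d %H:%M:%S"),
        "node": node,
        "benchmark": benchmark,
        "class": klass,
        "mops_total": mops,
        "seconds": seconds,
        "kernel": platform.release(),   # running kernel version
        "compiler": compiler,           # e.g., the output of your compiler's --version
        "mpi": mpi,                     # MPI library and version used for the build
        "npb_version": npb_version,
    }
    fname = "db/%s_%s.%s_%s.json" % (node, benchmark, klass,
                                     time.strftime("%Y%m%d-%H%M%S"))
    with open(fname, "w") as f:
        json.dump(record, f, indent=2)
    return fname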
Small Node Groups
After the single-node runs are done, I test small groups of nodes. You can either arbitrarily pick the number of nodes per group to test, or you can group the nodes together so that they all belong to a single switch. Generally, I try to run four nodes per group to keep things simple. In these groups, I run tests with both a single core per node and all the cores per node, allowing me to stress the nodes in different ways. The goal of small-node-group testing is to start introducing network performance as an overall parameter. For these runs, you have to use the MPI version of the NPB tests, and I would run the same tests as used in the single-node runs.
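Listing 5 shows one way this could be scripted, again only as a sketch: it assumes a flat nodes.txt node list, an Open MPI-style mpirun and hostfile, and NPB MPI binaries built with the usual <benchmark>.<class>.<nprocs> names; adjust all of these to your site.

Listing 5: Running NPB on Four-Node Groups (Python Sketch)

#!/usr/bin/env python
# group_runs.py -- split a node list into groups of four, write an Open MPI
# style hostfile for each group, and launch an NPB MPI benchmark on it.

import subprocess

NODES_PER_GROUP = 4
PROCS_PER_NODE = 4        # set to 1 for the single-core-per-node runs

nodes = [line.strip() for line in open("nodes.txt") if line.strip()]
groups = [nodes[i:i + NODES_PER_GROUP]
          for i in range(0, len(nodes), NODES_PER_GROUP)]

for idx, group in enumerate(groups):
    # One hostfile per group, using Open MPI's "slots=" syntax.
    hostfile = "group%02d.hosts" % idx
    with open(hostfile, "w") as f:
        for node in group:
            f.write("%s slots=%d\n" % (node, PROCS_PER_NODE))

    nprocs = len(group) * PROCS_PER_NODE
    outfile = "results/mg.C.group%02d.out" % idx     # results/ must already exist
    with open(outfile, "w") as f:
        subprocess.call(["mpirun", "-np", str(nprocs), "--hostfile", hostfile,
                         "./bin/mg.C.%d" % nprocs], stdout=f)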
I recommend running two different classes for these small node groups, beginning with A or B, to stress the network by taking a small problem and spreading it across a number of processes. However, real systems are seldom run this way, because it is not an efficient use of the system. Therefore, I would also run the largest class problem possible to stress the memory, CPU, and network.
After running these tests, you again perform a statistical analysis on the results in the exact same manner as described for the single-node runs: compute the average and standard deviation of the tests, look for outliers in the data, run more tests on those groups, and perhaps triage the nodes if needed. I would also recommend comparing the nodes in these outlier groups with the outliers from the single-node tests to look for correlation.
As with the single-node tests, be sure to store the results somewhere you can easily retrieve them, along with the source and binaries and how you built the code, including versions.
Larger Node Groups
After running small groups of nodes you can run larger groups by combining the smaller groups. How many nodes you use in these larger groups is up to you. Regardless, you should follow the same process as used for the single-node and small node count tests. The most important thing to remember is to store the results once you are done.
If you want, you can repeat the process of testing larger and larger node groups until you reach the entire cluster. Sometimes this is useful if you are trying to do a TOP500 run, because you can leave out the slower nodes that would hurt the final result. However, what you have after you finish all of these tests – from the single-node benchmarks to the larger node count tests – is some very important and extremely useful information: a fairly extensive database of benchmark results that includes the results from standard benchmarks for all of the individual nodes and groups of nodes, as well as a history of outlier nodes relative to the others. This kind of information can be extremely valuable to HPC system administrators.
Database of Results
The most common problem I encountered as an admin was user applications that either do not run or run poorly. This type of problem can be difficult to tackle because of the myriad possible causes.
One of the first things to do in tackling these problems is test the nodes that seem to be causing the performance problems. To do this, you need to know what kind of performance to expect from the nodes. Don’t forget that you have a very nice database of test results you can use for this testing. Of course, these tests might not “tickle” the node(s) in the same way a user application does, but at least you have a starting point for debugging the node.
In addition to debugging the node itself, the database results can help track down network problems. With the node group tests from the database in hand, you can re-run the small node group tests across the set of nodes you suspect are not performing well and see how the results compare with the database.
Admins also update system software from time to time (e.g., a security update or a new version of a compiler or library). To determine whether the nodes are performing well after the update, you can simply re-run the tests and compare the results to the database. If the results are as good or better than the previous results, you are golden. If the results are worse, maybe you have some triage time ahead. Regardless, you should keep track of the tests after the system has been upgraded as a new baseline for comparison.
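Listing 6 sketches what that comparison might look like in Python, assuming you can load the old and new results into dictionaries mapping node names to Mop/s; the 5% tolerance is just a starting point, not a recommendation.

Listing 6: Comparing New Results to the Baseline (Python Sketch)

def compare_to_baseline(baseline, new, tolerance=0.05):
    """Flag nodes whose new result is more than `tolerance` (a fraction)
    slower than the stored baseline. Both arguments map node -> Mop/s."""
    flagged = {}
    for node, old_mops in baseline.items():
        new_mops = new.get(node)
        if new_mops is None:
            continue                  # node not re-tested yet
        change = (new_mops - old_mops) / old_mops
        if change < -tolerance:
            flagged[node] = change    # e.g., -0.12 means 12% slower
    return flagged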
After a firmware upgrade on nodes or switches, I would definitely recommend re-running all of the tests – from single-node runs to larger node groups. Be sure to compare these results with the database of results. If the new results are the same as or better than the previous results, life is good. Again, don’t forget to store the new results as a new baseline. If the results are worse and triaging does not turn up much, you might have to roll back the firmware version while you debug the updated firmware with the manufacturer(s). Without a database of benchmark results, though, it would be difficult to determine whether you need to roll back at all.
A great way to use these benchmark results is to re-run the tests on nodes periodically by creating some simple jobs, running them, and recording and comparing the results. A simple tool can parse the benchmark results and throw them into the database for comparison with the old results, and you can even use statistical methods in the comparison. If you start to see performance differences between these periodic runs or between the runs and the database, it might be time to take the node(s) out of production for triage.
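Listing 7 is a sketch of such a tool using SQLite; the schema and function names are only one possibility, but even something this small lets you accumulate history and compute statistics per node and per test.

Listing 7: A Simple SQLite Results Database (Python Sketch)

import sqlite3
import statistics

conn = sqlite3.connect("npb_history.db")
conn.execute("""CREATE TABLE IF NOT EXISTS results (
                    run_date   TEXT,
                    node       TEXT,
                    benchmark  TEXT,
                    class      TEXT,
                    mops_total REAL)""")

def add_result(run_date, node, benchmark, klass, mops):
    """Store one parsed benchmark result."""
    conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?)",
                 (run_date, node, benchmark, klass, mops))
    conn.commit()

def node_history(node, benchmark, klass):
    """Return (mean, stdev) of every stored run for one node and test,
    or None if there are not yet enough data points."""
    rows = conn.execute("""SELECT mops_total FROM results
                           WHERE node=? AND benchmark=? AND class=?""",
                        (node, benchmark, klass)).fetchall()
    values = [r[0] for r in rows]
    if len(values) < 2:
        return None
    return statistics.mean(values), statistics.stdev(values)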
I know of one site that, for a period of time, re-ran some simple single-node tests in the scheduler epilogue script. They would pull these results into a large flat file and constantly run statistics against that file. Although this example is a bit extreme (they were having problems at the time), it illustrates the usefulness of a performance database.
Summary
Benchmarks have been used nefariously by both vendors and customers, but they don’t have to be used for evil purposes; instead, they can be very useful to admins.
For example, debugging a user’s application when it isn’t running well is always difficult; however, you have an advantage if you have a set of baseline performance benchmarks in your back pocket. An excellent way to start checking for problems is to look at the nodes the application runs on. In particular, I would briefly take those nodes out of production, repeat the exact same tests used to create the database, and compare the results against it.
I hope I’ve convinced you that benchmarks can be amazingly useful for admins, and even users.