Using benchmarks to your advantage

Larger Node Groups

After running small groups of nodes you can run larger groups by combining the smaller groups. How many nodes you use in these larger groups is up to you. Regardless, you should follow the same process as used for the single-node and small node count tests. The most important thing to remember is to store the results once you are done.
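The bookkeeping does not need to be fancy. A minimal sketch of storing results as you go might use a small SQLite table (the schema, group names, and benchmark values here are hypothetical, not from any particular site):

```python
import sqlite3
import time

# Hypothetical schema: one row per benchmark run for a node or node group.
conn = sqlite3.connect("benchmarks.db")
conn.execute("""CREATE TABLE IF NOT EXISTS results (
                  ts REAL, group_name TEXT, nodes TEXT,
                  benchmark TEXT, value REAL)""")

def store_result(group_name, nodes, benchmark, value):
    """Record one benchmark result with a timestamp."""
    conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?)",
                 (time.time(), group_name, ",".join(nodes), benchmark, value))
    conn.commit()

# Example: record an HPL result for a hypothetical four-node group.
store_result("group-a", ["n001", "n002", "n003", "n004"], "hpl_gflops", 812.5)
```

A flat file works just as well; the important part is that every run – node names, benchmark, value, and timestamp – lands somewhere you can query later.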

If you want, you can repeat the process of testing larger and larger node groups until you reach the entire cluster. This is sometimes useful when you are attempting a TOP500 run, because you can leave out the slower nodes that would hurt the final result. However, what you have after you finish all of these tests – from the single-node benchmarks to the larger node count tests – is very important and extremely useful information: a fairly extensive database of benchmark results. It includes the results from standard benchmarks for all of the individual nodes and groups of nodes, as well as a history of which nodes were outliers relative to the others. This kind of information can be extremely valuable to HPC system administrators.

Database of Results

The most common problem I encountered as an admin was user applications that did not run at all or ran poorly. This type of problem can be difficult to tackle because of the myriad possible causes.

One of the first things to do in tackling these problems is test the nodes that seem to be causing the performance problems. To do this, you need to know what kind of performance to expect from the nodes. Don't forget that you have a very nice database of test results you can use for this testing. Of course, these tests might not "tickle" the node(s) in the same way a user application does, but at least you have a starting point for debugging the node.

In addition to debugging the node itself, the database results can help track down network problems. With the node group tests from the database in hand, you can re-run the small node group tests across the set of nodes you suspect are not performing well and see how the results compare with the database.

Admins also update system software from time to time (e.g., a security update or a new version of a compiler or library). To determine whether the nodes are performing well after the update, you can simply re-run the tests and compare the results to the database. If the results are as good or better than the previous results, you are golden. If the results are worse, maybe you have some triage time ahead. Regardless, you should keep track of the tests after the system has been upgraded as a new baseline for comparison.
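The comparison itself can be a one-liner. A sketch of the "as good or better" check, assuming a higher-is-better metric such as GFLOPS or bandwidth (the 5% tolerance is an arbitrary choice, not a recommendation):

```python
def check_against_baseline(new, baseline, tolerance=0.05):
    """Compare a new benchmark value to the stored baseline.

    Returns 'ok' if the new result is as good or better, within a
    hypothetical 5% tolerance; 'triage' if it is noticeably worse.
    Assumes higher values are better (GFLOPS, MB/s, etc.).
    """
    if new >= baseline * (1.0 - tolerance):
        return "ok"
    return "triage"

# After a compiler or library update, re-run the test and compare:
print(check_against_baseline(805.0, 812.5))  # -> ok
print(check_against_baseline(700.0, 812.5))  # -> triage
```

How tight to set the tolerance depends on how much run-to-run variation your benchmarks normally show; measure that first.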

After a firmware upgrade on nodes or switches, I definitely recommend re-running all of the tests – from single-nodes to larger node groups. Be sure to compare these results to the database of results. If the new results are the same or better than the previous results, life is good.

Again, don't forget to store the new results as a new baseline. If the results are worse and triaging does not turn up much, you might have to roll back the firmware version while you debug the updated firmware with the manufacturer(s). Without a database of benchmark results, though, determining whether you need to roll back or not would be difficult.

A great way to use these benchmark results is to re-run the tests on nodes periodically by creating some simple jobs, running them, and recording and comparing the results. A simple tool can parse the benchmark results and throw them into the database for comparison with the old results, and you can even use statistical methods in the comparison. If you start to see performance differences between these periodic runs or between the runs and the database, it might be time to take the node(s) out of production for triage.
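One simple statistical method for these periodic comparisons is to flag a node whose latest result falls well below the mean of its own history. A sketch, using Python's standard statistics module (the STREAM-style numbers and the two-standard-deviation threshold are illustrative assumptions):

```python
import statistics

def flag_outlier(history, new_value, k=2.0):
    """Flag a node whose latest result falls more than k standard
    deviations below the mean of its historical results.
    (k=2 is an arbitrary starting point; tune it for your site.)"""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return new_value < mean - k * stdev

# Hypothetical memory-bandwidth results (MB/s) from periodic runs on one node:
history = [11850, 11920, 11880, 11900, 11860]
print(flag_outlier(history, 11890))  # typical run -> False
print(flag_outlier(history, 10500))  # large drop  -> True
```

When a node trips the flag, that is your cue to pull it out of production for triage before users notice.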

I know of one site that, for a period of time, re-ran some simple single-node tests in the scheduler epilogue script. They pulled these results into a flat file and continually ran statistics against that file. Although this example is a bit extreme – they were having problems at the time – it illustrates the usefulness of a performance database.
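The flat-file approach is easy to replicate. A minimal sketch of what such an epilogue hook might do – append one line per job, then run statistics over a node's history (file name and numbers are hypothetical):

```python
import statistics
from pathlib import Path

RESULTS = Path("epilogue_results.txt")  # hypothetical flat file

def append_result(node, value):
    """What an epilogue hook might do: append one 'node value' line per job."""
    with RESULTS.open("a") as f:
        f.write(f"{node} {value}\n")

def node_stats(node):
    """Compute mean and standard deviation over one node's recorded results."""
    values = [float(line.split()[1])
              for line in RESULTS.read_text().splitlines()
              if line.startswith(node + " ")]
    return statistics.mean(values), statistics.stdev(values)

append_result("n001", 11850.0)
append_result("n001", 11910.0)
print(node_stats("n001"))
```

A real epilogue would be a shell script calling something like this; the point is only that the data collection can be a few lines, not a project.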

Summary

Benchmarks have been used nefariously by both vendors and customers, but they don't have to be used for evil purposes; instead, they can be very useful to admins. For example, debugging a user's application when it isn't running well is always difficult; however, you have an advantage if you have a set of baseline performance benchmarks in your back pocket.

An excellent way to start checking for problems is to examine the nodes the application runs on. In particular, I would briefly take the nodes out of production and check their performance by repeating the exact same tests used to create the database, then compare the results with the database.

I hope I've convinced you that benchmarks can be amazingly useful for admins, and even users.
