Using benchmarks to your advantage
Node Check
Larger Node Groups
After running the benchmarks on small groups of nodes, you can test larger groups by combining the smaller ones. How many nodes you use in these larger groups is up to you. Regardless, you should follow the same process used for the single-node and small node-count tests. The most important thing to remember is to store the results once you are done.
If you want, you can repeat the process with larger and larger node groups until you reach the entire cluster. This is sometimes useful if you are attempting a TOP500 run, because you can leave out slower nodes that would hurt the final result. However, what you have after you finish all of these tests – from the single-node benchmarks to the larger node-count tests – is some very important and extremely useful information: a fairly extensive database of benchmark results. It includes the results from standard benchmarks for all of the individual nodes and groups of nodes, as well as a history of which nodes are outliers relative to the others. This kind of information can be extremely valuable to HPC system administrators.
Database of Results
The most common problem I encountered as an admin was user applications that do not run or that run poorly. This type of problem can be difficult to tackle because of the myriad possible causes.
One of the first steps in tackling these problems is to test the nodes that seem to be causing the performance problems. To do this, you need to know what kind of performance to expect from those nodes – and don't forget that you have a very nice database of test results to draw on. Of course, these tests might not "tickle" the node(s) in the same way a user application does, but at least you have a starting point for debugging the node.
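As a concrete illustration, here is a minimal sketch of what such a results database might look like, written in Python with SQLite. The schema, field names, node names, and benchmark name are placeholders of my own – adapt them to the benchmarks and metadata your site actually records.

```python
#!/usr/bin/env python3
# Minimal sketch of a benchmark results database in SQLite.
# The schema, field names, and benchmark name are hypothetical --
# adapt them to the benchmarks and metadata your site records.

import sqlite3
import time

conn = sqlite3.connect("bench_results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        run_time  REAL,   -- Unix timestamp of the run
        nodes     TEXT,   -- sorted, comma-separated node list
        benchmark TEXT,   -- e.g., an NPB test such as 'ft.C'
        metric    TEXT,   -- what was measured, e.g., 'mflops'
        value     REAL,   -- the measured result
        label     TEXT    -- free-form tag, e.g., 'baseline'
    )
""")

def record_result(nodes, benchmark, metric, value, label="baseline"):
    """Store one benchmark result for a node or a group of nodes."""
    conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
                 (time.time(), ",".join(sorted(nodes)),
                  benchmark, metric, value, label))
    conn.commit()

# Example: a single-node result and a four-node group result.
record_result(["n001"], "ft.C", "mflops", 10234.5)
record_result(["n001", "n002", "n003", "n004"], "ft.C", "mflops", 39870.2)
```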
In addition to debugging the node itself, the database results can help track down network problems. With the node group tests from the database in hand, you can re-run the small node group tests across the set of nodes you suspect are not performing well and see how the results compare with the database.
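The comparison step can be just as simple. The sketch below assumes the hypothetical results table from the previous example and flags a re-run that falls more than a chosen percentage below the stored baseline (the 5 percent tolerance is an arbitrary placeholder):

```python
#!/usr/bin/env python3
# Sketch: compare a fresh re-run against the stored baseline.
# Assumes the hypothetical 'results' table from the previous sketch.

import sqlite3

TOLERANCE = 0.05  # flag runs more than 5% below baseline (site-specific)

def check_against_baseline(nodes, benchmark, new_value):
    conn = sqlite3.connect("bench_results.db")
    (baseline,) = conn.execute(
        "SELECT AVG(value) FROM results "
        "WHERE nodes = ? AND benchmark = ? AND label = 'baseline'",
        (",".join(sorted(nodes)), benchmark)).fetchone()
    if baseline is None:
        print("No baseline stored for this node group -- record one first.")
        return
    delta = (new_value - baseline) / baseline
    status = "OK" if delta >= -TOLERANCE else "SUSPECT"
    print(f"{benchmark} on {','.join(sorted(nodes))}: {new_value:.1f} "
          f"vs. baseline {baseline:.1f} ({delta:+.1%}) -> {status}")

check_against_baseline(["n001", "n002", "n003", "n004"], "ft.C", 37000.0)
```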
Admins also update system software from time to time (e.g., a security update or a new version of a compiler or library). To determine whether the nodes are performing well after the update, you can simply re-run the tests and compare the results to the database. If the results are as good as or better than the previous results, you are golden. If the results are worse, you might have some triage time ahead. Regardless, you should store the post-update test results as a new baseline for comparison.
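One simple way to keep a post-update baseline is to tag the new results rather than overwrite the old rows, so the history stays queryable. Again, this assumes the hypothetical table from the earlier sketch, and the label is just a placeholder:

```python
#!/usr/bin/env python3
# Sketch: tag post-update results as a dated baseline instead of
# overwriting the old one, so the history stays queryable.
# Assumes the hypothetical 'results' table from the earlier sketch.

import sqlite3
import time

conn = sqlite3.connect("bench_results.db")
conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
             (time.time(), "n001", "ft.C", "mflops", 10310.0,
              "baseline-compiler-update"))  # hypothetical tag
conn.commit()
```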
After a firmware upgrade on nodes or switches, I definitely recommend re-running all of the tests – from single nodes to larger node groups – and comparing the results to the database. If the new results are the same as or better than the previous results, life is good. Again, don't forget to store the new results as a new baseline. If the results are worse and triaging does not turn up much, you might have to roll back the firmware version while you debug the updated firmware with the manufacturer(s). Without a database of benchmark results, though, it would be difficult to determine whether you need to roll back at all.
A great way to use these benchmark results is to re-run the tests on nodes periodically by creating some simple jobs, running them, and recording and comparing the results. A simple tool can parse the benchmark results and throw them into the database for comparison with the old results, and you can even use statistical methods in the comparison. If you start to see performance differences between these periodic runs or between the runs and the database, it might be time to take the node(s) out of production for triage.
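A sketch of such a periodic check – again assuming the hypothetical results table from the earlier examples – might compute each node's historical mean and standard deviation and flag a latest run that falls too far below them (the three-sigma limit is an arbitrary choice):

```python
#!/usr/bin/env python3
# Sketch of a periodic statistical check: for each single node, compare its
# latest result against its own history and flag large negative outliers.
# Assumes the hypothetical 'results' table from the earlier sketches.

import sqlite3
import statistics

BENCHMARK = "ft.C"   # hypothetical benchmark name
SIGMA_LIMIT = 3.0    # deviations below the mean that count as an outlier

conn = sqlite3.connect("bench_results.db")
single_nodes = [r[0] for r in conn.execute(
    "SELECT DISTINCT nodes FROM results WHERE nodes NOT LIKE '%,%'")]

for node in single_nodes:
    values = [r[0] for r in conn.execute(
        "SELECT value FROM results WHERE nodes = ? AND benchmark = ? "
        "ORDER BY run_time", (node, BENCHMARK))]
    if len(values) < 5:   # too little history for meaningful statistics
        continue
    history, latest = values[:-1], values[-1]
    mean, stdev = statistics.mean(history), statistics.stdev(history)
    if stdev > 0 and (mean - latest) / stdev > SIGMA_LIMIT:
        print(f"{node}: latest {latest:.1f} is more than {SIGMA_LIMIT:g} "
              f"sigma below its mean of {mean:.1f} -- "
              f"consider pulling it for triage")
```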
I know of one site that, for a period of time, re-ran some simple single-node tests in the scheduler epilogue script. They would pull the results into one big flat file and constantly run statistics against it. Although this example is a bit extreme – they were having problems at the time – it illustrates the usefulness of a performance database.
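The flat-file variant is easy to sketch, too. Assuming a hypothetical one-line-per-run format of node, benchmark, and value, a few lines of Python can compute per-node statistics across the whole file:

```python
#!/usr/bin/env python3
# Sketch of the flat-file variant: an epilogue script appends one line per
# run ("node benchmark value" -- a hypothetical format), and this script
# computes per-node statistics over the whole file.

import statistics
from collections import defaultdict

per_node = defaultdict(list)
with open("epilogue_results.txt") as f:   # hypothetical file name
    for line in f:
        node, benchmark, value = line.split()
        per_node[node].append(float(value))

for node, values in sorted(per_node.items()):
    if len(values) < 2:
        continue
    print(f"{node}: n={len(values)} mean={statistics.mean(values):.1f} "
          f"stdev={statistics.stdev(values):.1f}")
```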
Summary
Benchmarks have been used nefariously by both vendors and customers, but they don't have to be used for evil purposes; instead, they can be very useful to admins. For example, debugging a user's application when it isn't running well is always difficult; however, you have an advantage if you have a set of baseline performance benchmarks in your back pocket.
An excellent way to start checking for problems is to test the nodes on which the application runs. In particular, I would briefly take those nodes out of production and check their performance by repeating the exact same tests used to create the database and comparing the new results against it.
I hope I've convinced you that benchmarks can be amazingly useful for admins, and even users.
Infos
- NAS Parallel Benchmarks: http://en.wikipedia.org/wiki/NAS_Parallel_Benchmarks
- NASA NPBs: http://www.nas.nasa.gov/publications/npb.html