HPC fundamentals
Quick on the Uptake
Summary
A tool that allows you to run commands on a range of nodes is probably the most fundamental tool an HPC admin can use. Even for experienced admins, such an easy-to-use tool can help you quickly understand the state of your system. Arguably, the most popular parallel shell is pdsh. It is easy to use and flexible, and it has very useful modules that extend its capabilities.
The pdsh tool can be used on a cluster in a number of ways. An extremely common use is to check the load on all of the nodes in the cluster with uptime, which tells you whether each node is up or down and reports its load. Other uses range from checking the version of software installed on the nodes to spot monitoring and installing packages.
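As a rough illustration of the uptime check: pdsh prefixes every output line with the name of the host that produced it (for example, when you run `pdsh -w 'node[01-03]' uptime`). The Python sketch below parses such output into per-host 1-minute load averages and lists targeted hosts that sent nothing back, which may mean they are down. The sample output, the host names, and the helper names `parse_loads` and `missing_hosts` are hypothetical.

```python
import re

# Hypothetical output captured from: pdsh -w 'node[01-03]' uptime
# pdsh prefixes every line with "<hostname>: "; node02 sent nothing back.
SAMPLE = """\
node01:  10:15:01 up 12 days,  3:02,  0 users,  load average: 0.15, 0.10, 0.05
node03:  10:15:01 up 40 days, 22:47,  1 user,  load average: 7.98, 8.01, 7.95
"""

# Matches the 1-, 5-, and 15-minute load averages at the end of uptime's output.
LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def parse_loads(text):
    """Return {host: 1-minute load average} from pdsh-prefixed uptime output."""
    loads = {}
    for line in text.splitlines():
        host, sep, rest = line.partition(": ")
        if not sep:
            continue  # not a "host: output" line
        match = LOAD_RE.search(rest)
        if match:
            loads[host.strip()] = float(match.group(1))
    return loads


def missing_hosts(expected, loads):
    """Return targeted hosts that produced no uptime line (possibly down)."""
    return sorted(set(expected) - set(loads))


if __name__ == "__main__":
    loads = parse_loads(SAMPLE)
    print(loads)  # {'node01': 0.15, 'node03': 7.98}
    print(missing_hosts(["node01", "node02", "node03"], loads))  # ['node02']
```

A host can also be missing because the connection failed or timed out; this sketch ignores any error messages pdsh prints in that case and only looks at which hosts answered.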
The pdsh command lets you define a list of target hosts to include or exclude, so you can operate on subgroups of a cluster or group hosts on the basis of function. Using modules, you can also select target hosts by SLURM_JOBID, so you can query only the nodes that are part of a single job.
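Target hosts are usually given as hostlist expressions such as `node[01-04,07]`, included with pdsh's `-w` option and excluded with `-x`. To make the notation concrete, here is a simplified Python sketch of how such an expression expands and how an exclude list is applied. It is not pdsh's own implementation: it handles only one bracketed group per comma-separated element, and the function names `split_top_level`, `expand`, and `select` are hypothetical.

```python
import re

# One bracketed group per element, e.g. "node[01-04,07]" or "rack[1-2]-ib".
HOSTLIST_RE = re.compile(r"^(?P<prefix>[^\[\]]*)\[(?P<ranges>[^\[\]]+)\](?P<suffix>[^\[\]]*)$")


def split_top_level(expr):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, ""
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand(expr):
    """Expand a simple hostlist expression into a list of host names."""
    hosts = []
    for part in split_top_level(expr):
        m = HOSTLIST_RE.match(part)
        if not m:
            hosts.append(part)  # plain host name, no brackets
            continue
        for item in m.group("ranges").split(","):
            lo, _, hi = item.partition("-")
            width = len(lo)  # keep zero padding such as "01"
            for n in range(int(lo), int(hi or lo) + 1):
                hosts.append(f"{m.group('prefix')}{n:0{width}d}{m.group('suffix')}")
    return hosts


def select(include, exclude=""):
    """Mimic combining an include list (-w) with an exclude list (-x)."""
    excluded = set(expand(exclude)) if exclude else set()
    return [h for h in expand(include) if h not in excluded]


if __name__ == "__main__":
    print(select("node[01-06]", "node[02,05]"))  # ['node01', 'node03', 'node04', 'node06']
```

Keeping the width of the lower bound mirrors the common convention that `node[01-04]` yields zero-padded names, while `node[9-10]` yields `node9` and `node10`.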
Finally, you can use pdsh in conjunction with scripts stored in a shared workspace, using the command to run those scripts on the target hosts. However, a word of caution: If possible, do not run commands or scripts that produce multiline output. Because pdsh prints each line, prefixed with its hostname, as it arrives, lines from different hosts can interleave, and you would have to reassemble them into the proper order.
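If multiline output cannot be avoided, pdsh ships with a companion filter, dshbak, that groups output by host (`dshbak -c` also merges hosts with identical output). The Python sketch below illustrates the same idea on a made-up sample: it collects each host's lines in arrival order and then merges hosts whose complete output matches. It assumes the default `host: line` prefix and that lines from any single host arrive in order; the sample data and the function names `collate` and `coalesce` are hypothetical.

```python
from collections import defaultdict

# Hypothetical interleaved output of a two-line command run with pdsh.
SAMPLE = """\
node02: kernel 5.14
node01: kernel 5.14
node01: gcc 11.4
node03: kernel 6.1
node02: gcc 11.4
node03: gcc 12.2
"""


def collate(text):
    """Group "host: line" output by host, keeping each host's lines in arrival order."""
    by_host = defaultdict(list)
    for line in text.splitlines():
        host, sep, rest = line.partition(": ")
        if sep:
            by_host[host].append(rest)
    return dict(by_host)


def coalesce(by_host):
    """Merge hosts whose complete output is identical (similar in spirit to dshbak -c)."""
    groups = defaultdict(list)
    for host, lines in by_host.items():
        groups[tuple(lines)].append(host)
    return {tuple(sorted(hosts)): list(lines) for lines, hosts in groups.items()}


if __name__ == "__main__":
    for hosts, lines in coalesce(collate(SAMPLE)).items():
        print(",".join(hosts))
        for line in lines:
            print("    " + line)
```

Grouping on the whole tuple of lines, rather than line by line, is what lets identical multiline reports from many nodes collapse into a single block.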
Whether you are just starting out in the cluster world or are an experienced administrator, pdsh is a go-to tool for managing and monitoring systems.
Infos
- pdsh tool: https://github.com/chaos/pdsh
- SSH: https://en.wikipedia.org/wiki/Secure_Shell
- hostlist expressions: https://code.google.com/p/pdsh/wiki/HostListExpressions
- "Monitoring HPC Systems: Processor and Memory Metrics" by Jeff Layton: http://www.admin-magazine.com/HPC/Articles/Processor-and-Memory-Metrics
- "Monitoring HPC Systems: Process, Network, and Disk Metrics" by Jeff Layton: http://www.admin-magazine.com/HPC/Articles/Process-Network-and-Disk-Metrics
- Modules page: https://github.com/chaos/pdsh/blob/master/README.modules
- munge: https://dun.github.io/munge/