HPC fundamentals

Quick on the Uptake

When I was preparing for my qualification exams in graduate school, I talked with fellow grad students about their experiences. I remember one fellow student said something like, "When in doubt, focus on the fundamentals"; that is, go back to first principles to solve problems. I have always remembered his comment, and I try to apply it when I can.

For HPC, one of the fundamentals is being able to run a command across multiple nodes in a cluster. A parallel shell is a simple but powerful tool that allows you to do just that on designated (or all) nodes, so you do not have to log in to each node and run the same command over and over again. This single tool is useful in more ways than I can list, but I like to use it when performing administrative tasks, such as:

quickly discovering the status of nodes in a cluster,
checking the versions of particular software packages on each node,
checking the OS version on all nodes,
checking the kernel version on all nodes,
searching the system logs on each node (if you do not store them centrally),
examining the CPU usage on each node,
examining local I/O (if the nodes do any local I/O),
checking whether any nodes are swapping,
spot-monitoring the compute nodes, and
debugging.
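
To make the idea concrete, here is a minimal sketch of what a parallel shell does under the hood: it fans the same command out to a set of nodes concurrently and collects the output per node. It is not how any particular parallel shell tool is implemented. It assumes passwordless SSH to each node, and the names `ssh_runner` and `run_on_nodes` are hypothetical helpers made up for this illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_runner(node, command, timeout=10):
    """Run `command` on `node` over ssh and return (exit_code, output).

    Assumes key-based (passwordless) SSH access to the node.
    """
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", node, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        # Prefer stdout; fall back to stderr so failures are still visible.
        return proc.returncode, (proc.stdout or proc.stderr).strip()
    except (subprocess.TimeoutExpired, OSError) as exc:
        # Unreachable node, missing ssh binary, or a hung command.
        return -1, f"error: {exc}"


def run_on_nodes(nodes, command, runner=ssh_runner, max_workers=32):
    """Run `command` on every node concurrently.

    Returns a dict mapping each node name to (exit_code, output),
    in the same order as `nodes`.
    """
    nodes = list(nodes)  # allow any iterable, and iterate it only once
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda node: runner(node, command), nodes)
        return dict(zip(nodes, results))
```

With a list of your own hostnames, a call such as `run_on_nodes(["node01", "node02"], "uname -r")` would gather the kernel version from each node in a single step (the node names here are placeholders). The `runner` parameter is there so the transport can be swapped out, for example with a stub when testing without a cluster.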

This list is just the short version; the complete list is extensive. Anything you want to do on a single node can be done on a large number of nodes using a parallel shell tool. However, for those who might be asking whether they can use parallel shells on their 50,000-node clusters, the answer is that you can, but the time skew in the results will be large enough that the results might not be useful (which is a completely different subject). Parallel shells are more practical when used on a smaller number of nodes, on specific nodes (e.g., those associated with a specific job in a resource manager), or for gathering

...
