Parallel Shells

The most fundamental tool needed to administer an HPC system is a parallel shell, which allows you to run the same command on a series of nodes. In this article, we look at pdsh.

One of the key tools you need to administer a cluster is a parallel shell. A parallel shell allows you to run the same command on designated nodes in the cluster, so you don’t have to log in to each node to run the command. This tool can be useful in many ways, but I like to use it when performing administrative tasks, such as:

  • checking the versions of particular software packages on each node
  • checking the OS version on all nodes
  • checking the kernel version on all nodes
  • searching the system logs on each node (if you don’t store them centrally)
  • examining the CPU usage on each node
  • examining local I/O (if the nodes are doing local I/O)
  • checking whether any nodes are swapping
  • spot-monitoring the compute nodes

The real list of possible tasks is extensive, but anything you want to do on a single node can be done on a large number of nodes using a parallel shell tool.

If you try to use a parallel shell on a 50,000-node cluster, however, the time skew could be large enough to make the results useless. Although certain techniques can allow the use of parallel commands on a large number of nodes, a better use would be on a modest number of nodes or for gathering information on slowly varying data. Parallel shells are even great for administering instances in the cloud on something like Amazon Web Services (AWS).

Many parallel shells are available, and each tool has its pros and cons. A short list includes:

A number of these tools are written in Python, which has become a very popular tool for devops. Although some of these tools might not be appropriate or useful for HPC, I included them for the sake of completeness. (Note: I have not tested all of these tools, so I can’t vouch for them.)

In this article I’m going to select one of the parallel shells to illustrate its possibilities. Other tools are fairly similar, with some syntactical differences and different features. The tool I’m going to talk about here is pdsh.

Introduction to pdsh

Pdsh is arguably one of the most popular parallel shell tools. The most recent version on SourceForge as of this writing is 2.26, dated 2011-05-01. Code development appears to have moved to Google Code, where the most recent version is 2.29, updated February 2013. I’ll be using that version in this article.

Pdsh is very interesting because it allows you to run commands on multiple nodes using only ssh. The client nodes only need ssh installed, which is pretty typical for HPC systems, so you don’t need to install any extra software on the compute nodes. However, you do need the ability to SSH to any node without a password (“passwordless SSH”).
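Before pointing pdsh at a set of nodes, it can help to confirm that passwordless SSH really works for each of them. The following is a minimal sketch and not part of pdsh: the function name check_ssh is made up for illustration, and it assumes an OpenSSH client, whose BatchMode option makes ssh fail rather than prompt for a password.

```shell
# check_ssh HOST...
# Hypothetical helper: report which hosts accept a non-interactive login.
check_ssh() {
    rc=0
    for host in "$@"; do
        # BatchMode=yes: never prompt for a password, just fail.
        # ConnectTimeout=5: give up on unreachable hosts after 5 seconds.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
               </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, check_ssh 192.168.1.4 192.168.1.250 prints one status line per host and returns a non-zero status if any host could not be reached without a password.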

Building and Installing pdsh

Building and installing pdsh is really simple if you’ve built code with a GNU Autoconf configure script before. The steps are quite easy:

./configure --with-ssh --without-rsh
make
make install

This puts the binaries into /usr/local/, which is fine for testing purposes. For production work, I would put it in /opt or something like that – just be sure it’s in your path.
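If you do install somewhere other than /usr/local, the shell has to be told where the binaries live. The snippet below is a sketch under assumptions: /opt/pdsh is a hypothetical prefix (with an Autoconf-style build it would normally be chosen with the standard --prefix option at configure time), and path_prepend is a made-up helper name.

```shell
# Hypothetical install prefix. With an Autoconf-style build it would be chosen
# at configure time, e.g.: ./configure --prefix=/opt/pdsh --with-ssh --without-rsh
PDSH_PREFIX=/opt/pdsh

# Prepend a directory to PATH only if it is not already there.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

path_prepend "$PDSH_PREFIX/bin"
export PATH
```

Placing these lines in .bashrc keeps the setting across logins, and the guard means that sourcing the file repeatedly does not keep growing PATH.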

You might notice that in the configure command I used the option --without-rsh. By default pdsh uses rsh, which is not really secure, so I chose to exclude it from the configuration. In the output below, you can see the pdsh rcmd modules (rcmd is the remote command used by pdsh).

[laytonjb@home4 ~]$ pdsh -v
pdsh: invalid option -- 'v'
Usage: pdsh [-options] command ...
-S                return largest of remote command return values
-h                output usage menu and quit
-V                output version information and quit
-q                list the option settings and quit
-b                disable ^C status feature (batch mode)
-d                enable extra debug information from ^C status
-l user           execute remote commands as user
-t seconds        set connect timeout (default is 10 sec)
-u seconds        set command timeout (no default)
-f n              use fanout of n nodes
-w host,host,...  set target node list on command line
-x host,host,...  set node exclusion list on command line
-R name           set rcmd module to name
-M name,...       select one or more misc modules to initialize first
-N                disable hostname: labels on output lines
-L                list info on all loaded modules and exit
available rcmd modules: ssh,exec (default: ssh)

Notice that the “available rcmd modules” line at the end of the output lists only ssh and exec. If I hadn’t excluded rsh, it would be listed here, too, and it would be the default. To make ssh the default instead of rsh, just add the following line to your .bashrc file:

export PDSH_RCMD_TYPE=ssh

Be sure to “source” your .bashrc file (i.e., source .bashrc) to set the environment variable. You can also log out and log back in.
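If you set this variable on several machines or from a setup script, a small idempotent helper avoids piling up duplicate lines in .bashrc. This is only a sketch; ensure_line is a hypothetical name, not something pdsh provides.

```shell
# ensure_line FILE LINE
# Append LINE to FILE unless FILE already contains exactly that line.
ensure_line() {
    # -q quiet, -x match the whole line, -F treat the pattern as a fixed string
    grep -qxF -e "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example (commented out so that sourcing this snippet changes nothing):
# ensure_line "$HOME/.bashrc" 'export PDSH_RCMD_TYPE=ssh'
```

For a one-off run, the -R option listed in the usage output above (for example, pdsh -R ssh ...) selects the rcmd module without touching the environment.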

If for some reason you see the following when you try running pdsh,

[laytonjb@home4 ~]$ pdsh -w 192.168.1.250 ls -s
pdsh@home4: 192.168.1.250: rcmd: socket: Permission denied

then you have built it with rsh. You can either rebuild pdsh without rsh, or you can use the environment variable in your .bashrc file, or you can do both.

First pdsh Commands

To begin, I’ll try to get the kernel version of a node by using its IP address:

[laytonjb@home4 ~]$ pdsh -w 192.168.1.250 uname -r
192.168.1.250: 2.6.32-431.11.2.el6.x86_64

The -w option means I am specifying the node(s) that will run the command. In this case, I specified the IP address of the node (192.168.1.250). After the list of nodes, I add the command I want to run, which is uname -r in this case. Notice that pdsh starts the output line by identifying the node name.

If you need to mix rcmd modules in a single command, you can specify which module to use on the pdsh command line:

[laytonjb@home4 ~]$ pdsh -w ssh:laytonjb@192.168.1.250 uname -r
192.168.1.250: 2.6.32-431.11.2.el6.x86_64

You just put the rcmd module before the node name. In this case, I used ssh and typical ssh syntax.

A very common way of using pdsh is to set the environment variable WCOLL to point to the file that contains the list of hosts you want to use in the pdsh command. For example, I created a subdirectory PDSH where I create a file hosts that lists the hosts I want to use:

[laytonjb@home4 ~]$ mkdir PDSH
[laytonjb@home4 ~]$ cd PDSH
[laytonjb@home4 PDSH]$ vi hosts
[laytonjb@home4 PDSH]$ more hosts
192.168.1.4
192.168.1.250

I’m only using two nodes: 192.168.1.4 and 192.168.1.250. The first node is my test system (like a cluster head node), and the second node is my test compute node. You can put hosts in the file as you would on the command line separated by commas. Just be sure not to put a blank line at the end of the file because pdsh will try to connect to it. You can put the environment variable WCOLL in your .bashrc file like this:

export WCOLL=/home/laytonjb/PDSH/hosts

As before, you can source your .bashrc file, or you can log out and log back in.
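Because a stray blank line in the hosts file causes trouble, it can be worth cleaning the file before pointing WCOLL at it. The helper below is a minimal sketch; clean_hosts is a hypothetical name, and it assumes one host per line.

```shell
# clean_hosts FILE
# Print FILE with leading/trailing whitespace trimmed and blank lines removed.
clean_hosts() {
    sed -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$1"
}
```

You could run something like clean_hosts ~/PDSH/hosts > ~/PDSH/hosts.clean and then set WCOLL to the cleaned file.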

Specifying Hosts

I won’t list all of the other ways to specify a list of nodes, because the pdsh website discusses virtually all of them; however, some of the methods are pretty handy. The simplest way to specify the nodes on the command line is to use the -w option:

[laytonjb@home4 ~]$ pdsh -w 192.168.1.4,192.168.1.250 uname -r
192.168.1.4: 2.6.32-431.17.1.el6.x86_64
192.168.1.250: 2.6.32-431.11.2.el6.x86_64

In this case, I specified the node names separated by commas. You can also use a range of hosts as follows:

pdsh -w host[1-11]
pdsh -w host[1-4,8-11]

In the first case, pdsh expands the host range to host1, host2, host3, …, host11.

In the second case, it expands the hosts similarly (host1, host2, host3, host4, host8, host9, host10, host11). You can go to the pdsh website for more information on hostlist expressions.
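To make the expansion rule concrete, here is a small shell function that mimics it for the simple cases shown above. It is an illustration only, not pdsh’s own parser: expand_hostlist is a made-up name, and the sketch assumes a single bracketed group of plain numeric ranges at the end of the name, with no zero padding.

```shell
# expand_hostlist 'host[1-4,8-11]'
# Illustrative only: expands one trailing [..] group of numeric ranges.
# The body runs in a subshell, so the IFS change does not leak out.
expand_hostlist() (
    case "$1" in
        *\[*\]) ;;                          # has a trailing bracket group
        *) printf '%s\n' "$1"; exit 0 ;;    # plain name, print unchanged
    esac
    prefix=${1%%\[*}        # text before the first "["
    ranges=${1#*\[}         # text after the first "["
    ranges=${ranges%\]*}    # drop the closing "]"
    IFS=,
    for r in $ranges; do
        lo=${r%-*}          # "1-4" -> 1 ; a single number "7" stays 7
        hi=${r#*-}          # "1-4" -> 4 ; a single number "7" stays 7
        i=$lo
        while [ "$i" -le "$hi" ]; do
            printf '%s%s\n' "$prefix" "$i"
            i=$((i + 1))
        done
    done
)
```

Running expand_hostlist 'host[1-4,8-11]' prints the eight names host1 through host4 and host8 through host11, one per line, which matches the expansion described above.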

Another option is to have pdsh read the hosts from a file other than the one to which WCOLL points:

[laytonjb@home4 ~]$ pdsh -w ^/tmp/hosts uptime
192.168.1.4:  15:51:39 up  8:35, 12 users,  load average: 0.64, 0.38, 0.20
192.168.1.250:  15:47:53 up 2 min,  0 users,  load average: 0.10, 0.10, 0.04
[laytonjb@home4 ~]$ more /tmp/hosts
192.168.1.4
192.168.1.250

This command tells pdsh to take the host names from the file /tmp/hosts, which is listed after -w ^ (no space between the “^” and the filename). You can also use several host files,

[laytonjb@home4 ~]$ more /tmp/hosts
192.168.1.4
[laytonjb@home4 ~]$ more /tmp/hosts2
192.168.1.250
[laytonjb@home4 ~]$ pdsh -w ^/tmp/hosts,^/tmp/hosts2 uname -r
192.168.1.4: 2.6.32-431.17.1.el6.x86_64
192.168.1.250: 2.6.32-431.11.2.el6.x86_64

or you can exclude hosts from a list:

[laytonjb@home4 ~]$ pdsh -w -192.168.1.250 uname -r
192.168.1.4: 2.6.32-431.17.1.el6.x86_64

The option -w -192.168.1.250 excluded node 192.168.1.250 from the list and only output the information for 192.168.1.4. You can also exclude nodes using a node file:

[laytonjb@home4 ~]$ pdsh -w -^/tmp/hosts2  uname -r
192.168.1.4: 2.6.32-431.17.1.el6.x86_64

In this case /tmp/hosts2 contains 192.168.1.250, which isn’t included in the output.

Using the -x option with a hostname or a list of hostnames to be excluded from running the command also works:

[laytonjb@home4 ~]$ pdsh -x 192.168.1.4 uname -r
192.168.1.250: 2.6.32-431.11.2.el6.x86_64
[laytonjb@home4 ~]$ pdsh -x ^/tmp/hosts uname -r
192.168.1.250: 2.6.32-431.11.2.el6.x86_64
[laytonjb@home4 ~]$ more /tmp/hosts
192.168.1.4
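When the include and exclude lists both live in files, it can be reassuring to preview which hosts will be left before running anything remotely. The following sketch does that locally with grep and never calls pdsh; hosts_minus is a hypothetical helper name, and it assumes one host per line in each file.

```shell
# hosts_minus ALL_FILE EXCLUDE_FILE
# Print the hosts in ALL_FILE that do not appear in EXCLUDE_FILE.
hosts_minus() {
    # -v invert the match, -x match whole lines only (so 192.168.1.25 does
    # not knock out 192.168.1.250), -F fixed strings, -f patterns from a file
    grep -vxF -f "$2" "$1"
}
```

With the two-node hosts file from earlier and an exclusion file that contains only 192.168.1.250, hosts_minus prints 192.168.1.4.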

More Useful pdsh Commands

Now I can shift into second gear and try some fancier pdsh tricks. First, I want to run a more complicated command on all of the nodes:

[laytonjb@home4 ~]$ pdsh 'cat /proc/cpuinfo | grep bogomips'
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23

In the output, the node precedes the command results, so you can tell what output is associated with which node.

Notice that I put the entire command in quotes. This means the entire command is run on each node, including the first (cat /proc/cpuinfo) and second (grep bogomips) parts. You should also notice that the BogoMIPS values are different on the two nodes, which is perfectly understandable because the systems are different. The first node has eight cores (four physical cores and four Hyper-Threading cores), and the second node has four cores. You can use this command across a homogeneous cluster to make sure all the nodes are reporting back the same BogoMIPS value. If the cluster is truly homogeneous, this value should be the same on every node. If it’s not, then I would take the offending node out of production and check it.
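On more than a handful of nodes, scanning this output by eye gets tedious. One possible way to condense it is a small filter that runs on the node where you issue pdsh. This is a sketch: bogomips_by_node is a made-up name, and it assumes the line format shown in the output above (node label first, value last).

```shell
# bogomips_by_node
# Read pdsh output lines such as "192.168.1.4: bogomips : 6997.39" on stdin
# and print each distinct "value node" pair once, sorted by value.
bogomips_by_node() {
    # Strip the trailing ":" from the node label, then print value and node.
    awk '{ sub(/:$/, "", $1); print $NF, $1 }' | sort -u
}

# Example use (requires pdsh and a host list, so it is left commented out):
# pdsh 'grep bogomips /proc/cpuinfo' | bogomips_by_node
```

On a homogeneous cluster, every line of the result should start with the same value, so a node reporting something different stands out immediately.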

A slightly different command,

[laytonjb@home4 ~]$ pdsh 'cat /proc/cpuinfo' | grep bogomips
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.4: bogomips   : 6997.39
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23
192.168.1.250: bogomips : 5624.23

runs the first part contained in quotes, cat /proc/cpuinfo, on each node and the second part of the command, grep bogomips, on the node where you issue the pdsh command. The point here is that you need to be careful on the command line. In this example, the differences are trivial, but other commands could have differences that might be difficult to notice.

One very important thing to note is that pdsh does not guarantee a return of output in any particular order. If you have a list of 20 nodes, the output does not necessarily start with node 1 and increase incrementally to node 20. For example, I ran vmstat on each node and got four lines of output from each node (two header lines and two samples):

[laytonjb@home4 ~]$ pdsh vmstat 1 2
192.168.1.4: procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
192.168.1.4:  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
192.168.1.4:  1  0      0 30198704 286340 751652    0    0     2     3   48   66  1  0 98  0  0
192.168.1.250: procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
192.168.1.250:  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
192.168.1.250:  0  0      0 7248836  25632  79268    0    0    14     2   22   21  0  0 99  0  0
192.168.1.4:  1  0      0 30198100 286340 751668    0    0     0     0  412  735  1  0 99  0  0
192.168.1.250:  0  0      0 7249076  25632  79284    0    0     0     0   90   39  0  0 100  0  0

At first, it looks like the output from the first node comes first, but then the second node creeps in with its output. You need to expect that the output from a command that returns more than one line per node could be mixed. My best advice is to grab the output, put it into an editor, and rearrange the lines, remembering that the lines for any specific node are in the correct order. Maybe someone with some serious pdsh-fu has a simple solution (please let me know if you have a technique). The other option is to issue only commands that return a single line of output. The results might not return in node order, but they are easier to sort.
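One possible way to do the rearranging without an editor is a small awk filter that collects the lines per node and prints them node by node, preserving the order within each node. This is a sketch, with group_by_node as a made-up name; it buffers all output in memory, so it suits modest amounts of data. The pdsh distribution also ships a companion utility, dshbak, intended for tidying up this kind of output; its man page is the place to check the details.

```shell
# group_by_node
# Read interleaved pdsh output ("node: text") on stdin and print it grouped
# by node, in the order each node first appeared. Line order within a node
# is preserved.
group_by_node() {
    awk -F': ' '
        !($1 in buf) { order[++n] = $1 }   # remember nodes in first-seen order
        { buf[$1] = buf[$1] $0 "\n" }      # append this line to its node
        END { for (i = 1; i <= n; i++) printf "%s", buf[order[i]] }
    '
}

# Example use (requires pdsh and a host list, so it is left commented out):
# pdsh vmstat 1 2 | group_by_node
```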

You can easily use pdsh to run scripts or commands on each node. For example, if you have read my past articles on processor and memory metrics or processes, networks, and disk metrics, you can use those scripts to gather metrics quickly and easily on each node. However, you might want to modify the scripts so you only get one line of output (or maybe add switches in the code so you can specify the output) to make it easier to sort the results.

pdsh Modules

Earlier I mentioned that pdsh uses rcmd modules to access nodes. The authors have extended this to create modules for various specific situations. The pdsh modules page lists other modules that can be built as part of pdsh, including:

  • machines
  • genders
  • nodeupdown
  • slurm
  • torque
  • dshgroup
  • netgroup

These modules extend the functionality of pdsh. For example, the SLURM module allows you to run the command only on nodes specified by currently running SLURM jobs. When pdsh is run with the SLURM module, it reads the list of nodes from the SLURM_JOBID environment variable. Running pdsh with the -j jobid option gets the list of hosts from the jobid specified.

Summary

A tool that allows you to run commands on a range of nodes simultaneously is probably the most fundamental tool an HPC admin can use. Even experienced admins use parallel shell tools to understand the states of their systems. These tools are easily scriptable, so you can store the data in a flat file or a database.

Although you have the choice of several parallel shells, arguably the most popular tool of this kind is pdsh, which I briefly showed how to build, install, and use in this article. Pdsh is not very difficult to use, and a range of command options gives it a tremendous amount of flexibility for almost any scenario you can imagine. Pdsh can be used in conjunction with commands or scripts to gather information about compute nodes, so you are just limited by your imagination or needs.