Command-line tools for the HPC administrator
Line Items
The HPC world has some amazing "big" tools that help administrators monitor their systems and keep them running, such as the Ganglia and Nagios cluster monitoring systems. Although they are extremely useful, sometimes it is the small tools that can help debug a user problem or find system issues. Here are a few favorites.
ldd
The introduction of sharable objects [1], or "dynamic libraries," has allowed for smaller binaries, less "skew" across binaries, and a reduction in memory usage, among other things. Users, myself included, tend to forget that when code compiles, we only see the size of the binary itself, not the "shared" objects.
For example, the following simple Hello World program, test1, uses PGI compilers (16.10):
PROGRAM HELLOWORLD
  write(*,*) "hello world"
END
Running the ldd command against the compiled program produces the output in Listing 1. If you look at the binary, which is very small, you might think it is the complete story, but after looking at the list of libraries linked to it, you can begin to appreciate what compilers and linkers do for users today.
Listing 1
Show Linked Libraries (ldd)
$ pgf90 test1.f90 -o test1
$ ldd test1
        linux-vdso.so.1 => (0x00007fff11dc8000)
        libpgf90rtl.so => /opt/pgi/linux86-64/16.10/lib/libpgf90rtl.so (0x00007f5bc6516000)
        libpgf90.so => /opt/pgi/linux86-64/16.10/lib/libpgf90.so (0x00007f5bc5f5f000)
        libpgf90_rpm1.so => /opt/pgi/linux86-64/16.10/lib/libpgf90_rpm1.so (0x00007f5bc5d5d000)
        libpgf902.so => /opt/pgi/linux86-64/16.10/lib/libpgf902.so (0x00007f5bc5b4a000)
        libpgftnrtl.so => /opt/pgi/linux86-64/16.10/lib/libpgftnrtl.so (0x00007f5bc5914000)
        libpgmp.so => /opt/pgi/linux86-64/16.10/lib/libpgmp.so (0x00007f5bc5694000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f5bc5467000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5bc524a000)
        libpgc.so => /opt/pgi/linux86-64/16.10/lib/libpgc.so (0x00007f5bc4fc2000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f5bc4dba000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f5bc4ab7000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f5bc46f4000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5bc44de000)
        /lib64/ld-linux-x86-64.so.2 (0x000056123e669000)
If the application fails, a good place to look is the list of libraries linked to the binary. If the paths have changed or if you copy the binary from one system to another, you might see a library mismatch. The ldd command is indispensable when chasing down strange issues with libraries.
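A quick way to spot such a mismatch is to look for the string "not found" in the ldd output, which is how the dynamic linker flags an unresolvable dependency. The following sketch uses /bin/sh as a stand-in binary, because it exists on essentially every Linux system:

```shell
# List the shared libraries a binary depends on.
ldd /bin/sh

# Any line containing "not found" is an unresolved dependency;
# on a healthy system this prints 0.
ldd /bin/sh | grep -c "not found"
```

If a library does turn up as "not found," a common remedy is to add its directory to LD_LIBRARY_PATH (or to the ld.so configuration) before rerunning ldd.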
find
One of those *nix commands you don't learn at first but that can really save your bacon is find. If you are looking for a specific file or a set of files with a similar name, then find is your friend.
I tend to use find two ways. The first way is fairly straightforward: If I'm looking for a file or a set of files below the directory in which I'm located (pwd), then I use some variation of the following:
$ find . -name "*.dat.2"
The dot right after the find command tells it to start searching from the current directory and then search all directories under that directory.
The -name option lets me specify a "template" that find uses to look for files. In this case, the template is *.dat.2. By using the * wildcard, find will locate any file that ends in .dat.2. If I truly get desperate to find a file, I can go to the root directory (/) and run the same find command.
The second way I run find is to use its output as input to grep. Remember, *nix is designed to have small programs that can pipe input and output from one command to another for complex processing. Here, the command chain
$ find . -name "*.dat.2" | grep -i "xv426"
takes the output from the find command and pipes it into the grep command to look for any file name that contains the string xv426 – be it uppercase or lowercase (-i). The power of *nix lies in this ability to combine find with virtually any other *nix command (e.g., sort, uniq, wc, sed).
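As a small self-contained sketch of this kind of pipeline (the directory and file names here are invented for illustration), the following creates a few data files, then finds, filters, sorts, and counts the matches:

```shell
# Set up a sandbox of files to search (names are invented).
mkdir -p /tmp/find_demo/run1 /tmp/find_demo/run2
touch /tmp/find_demo/run1/XV426_out.dat.2
touch /tmp/find_demo/run2/xv426_restart.dat.2
touch /tmp/find_demo/run2/other.dat.1

# Find all *.dat.2 files, keep only those mentioning xv426 in
# either case, and sort the resulting paths.
find /tmp/find_demo -name "*.dat.2" | grep -i "xv426" | sort

# Append wc -l instead of sort to count the matches (2 here).
find /tmp/find_demo -name "*.dat.2" | grep -i "xv426" | wc -l
```

The same pattern extends naturally: swap in sed to rewrite the paths, or xargs to run a command on every match.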
In *nix, you have more than one way to accomplish a task: Different commands can yield the same end result. Just remember that you have a large number of commands from which to draw; you don't have to write Python or Perl code to accomplish a task that you can accomplish from the command line.
ssh and pdsh
It might sound rather obvious, but two of the tools I rely on the most are a simple secure tool for remote logins, ssh, and a tool that uses ssh to run commands on remote systems in parallel or in groups, pdsh. When clusters first showed up, the tool of choice for remote logins was rsh. It had been around for a while and was very easy to use.
However, it was very insecure because it transmitted data – including passwords – from the host machine to the target machine with no encryption. Therefore, a very simple network sniff could gather lots of passwords. Anyone who used rsh, rlogin, or telnet between systems across the Internet, such as when users logged in to the cluster, was wide open to attacks and password sniffing. Very quickly people realized that something more secure was needed.
In 1995, researcher Tatu Ylönen created Secure Shell (SSH) because of a password sniffing attack on his network. It gained popularity, and SSH Communications Security was founded to commercialize it. The OpenBSD community grabbed the last open version of SSH and developed it into OpenSSH [2]. After gaining popularity in the early 2000s, the cluster community grabbed it and started using it to help secure clusters.
SSH is extremely powerful. Beyond just remote logins and commands run on a remote node, it can be used for tunneling or forwarding other protocols over SSH. Specifically, you can use SSH to forward X from a remote host to your desktop and copy data from one host to another (scp), and you can use it in combination with rsync to back up, mirror, or copy data.
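These uses boil down to a handful of one-liners. The host names, user name, and paths below are invented for illustration:

```shell
# Log in to a remote node with X forwarding enabled, so graphical
# programs started there display on your local desktop.
ssh -X jlayton@node001

# Copy a file to a remote host with scp (secure copy over SSH).
scp results.dat jlayton@node001:/home/jlayton/

# Mirror a directory to a remote host with rsync over SSH;
# -a preserves permissions and times, -v is verbose.
rsync -av -e ssh /home/jlayton/project/ jlayton@node001:/backup/project/
```

Because rsync only transfers files that have changed, the last command is a simple building block for recurring backups.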
SSH was a great development for clusters and HPC, because it finally provided a way to log in to systems and send commands securely and remotely. However, SSH could only do this for a single system, and HPC systems can comprise hundreds or thousands of nodes; therefore, admins needed a way to send the same command to a number of nodes or a fixed set of nodes.
In a past article [3], I wrote about a class of tools that accomplishes this goal: parallel shells. The most common is pdsh [4]. In theory, it is fairly simple to use. It uses a specific remote command to run a common command on specified nodes. You have a choice of underlying tools when you build and use pdsh. I prefer to use SSH because of the security it offers.
To point pdsh at a simple file containing the list of hosts you want it to use by default, enter:

export WCOLL=/home/laytonjb/PDSH/hosts

WCOLL is an environment variable that points to the location of the file that lists hosts. You can put this command in either your home or global .bashrc file.
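With that in place, running the same command across the cluster is a one-liner. The node names below are invented for illustration:

```shell
# Run uptime on every host listed in the WCOLL file, using SSH as
# the underlying remote command; each output line is prefixed with
# the host name it came from.
pdsh -R ssh uptime

# Target an explicit set of nodes instead, using pdsh's hostlist
# range syntax (-w overrides WCOLL).
pdsh -R ssh -w node[001-004] date

# Pipe through dshbak (shipped with pdsh) to group and deduplicate
# identical output from multiple nodes.
pdsh -R ssh -w node[001-004] uname -r | dshbak -c
```

The hostlist syntax (node[001-004]) expands to node001 through node004, which scales nicely to clusters with hundreds of nodes.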