What to Know Before Moving into HPC

The SC24 supercomputing conference really impressed upon me the level of excitement that people have around HPC when they are first introduced to it. The key moment for me came on the bus ride back to the hotel, sitting next to a young person who was so impressed with what HPC had been doing to improve the world and where it was headed. They were very excited to get back to their employer and start jumping into HPC. As many of us know, although systems on the TOP500 are amazingly impressive, there is nothing like a home cluster that you can use for learning, development, and downright solving real problems. (Yes, you can solve real problems on simple Raspberry Pi clusters at home.)

Many people in HPC, and I’m no exception, have been asked how to get started with HPC. Depending on experience level, I recommend people get a simple desktop or laptop and start learning Linux. Lots of articles talk about how to set up a home Linux system, which led to my previous article about basic Linux commands to know before jumping into HPC. As of the date of writing this article (late March), I haven’t seen anyone publicly complain about my list of commands, but disagreement will always exist, which is perfectly acceptable. The next step is to go beyond a list of commands and emphasize HPC concepts that go along with the commands.

Networking

In undergraduate engineering, faculty always advise you to take more math if you have electives. If someone going into HPC asked me the equivalent question about where to dive a bit deeper, I would always advise learning more networking. You don’t have to dive to the bottom of the network model right away; rather, you should try to understand the top layers of the model better, along with the basics of debugging problems.

I highly recommend understanding how to configure the network interface of a single home Linux system without allowing Linux to autoconfigure everything for you. Just learn the simple stuff, such as configuring the hostname, IP address, gateway, and netmask. Learn how to turn off DHCP (Dynamic Host Configuration Protocol), whereby a server (e.g., a home router) provides all of the network information (hostname, IP address, gateway, netmask, etc.) when you boot your system. Once DHCP is turned off, configure a static network interface so that the IP address can’t change if the system is rebooted; the same is true for the gateway and netmask. In many situations, HPC systems rely on static hostnames and IP addresses, and system administrators rely on them as well when troubleshooting HPC systems that are having problems. Many articles discuss configuring a network interface in Linux, which you can find with a simple search on those keywords.
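
As a concrete starting point, here is a minimal sketch of a static configuration on a system managed by NetworkManager; the connection name, hostname, and addresses are placeholders for illustration, and distributions that use netplan or ifupdown instead will look different:

# Set a hostname for the system.
sudo hostnamectl set-hostname node01

# Switch the connection from DHCP to a static (manual) configuration.
# The /24 prefix is the same as a 255.255.255.0 netmask.
sudo nmcli con mod "Wired connection 1" \
    ipv4.method manual \
    ipv4.addresses 192.168.1.50/24 \
    ipv4.gateway 192.168.1.1 \
    ipv4.dns 192.168.1.1

# Bring the connection up with the new settings.
sudo nmcli con up "Wired connection 1"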

For HPC clusters, it is a good idea to learn about private networks and the common address ranges used within those configurations. I would recommend looking at the 192.168.x.x and 10.x.x.x IPv4 ranges, because those are the most common. Note that the networks discussed here are all Ethernet networks (other network types require a different discussion).
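
To make that concrete, a small cluster on a 10.x private network might use an addressing plan like the following; the host names and addresses are made up for this example, and you could capture such a plan in /etc/hosts on every node:

# Hypothetical plan for a small cluster on the private 10.0.0.0/24 network
10.0.0.1    head          # head/login node
10.0.0.11   compute-01
10.0.0.12   compute-02
10.0.0.13   compute-03
10.0.0.14   compute-04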

If you have a simple server at home (e.g., a Raspberry Pi), then practice, practice, practice. Try configuring the network interface with different IP addresses and then open a browser to see whether you can reach the Internet. Try erasing your gateway and see what happens. If you have another system on your home network, try pinging your Linux server. Have fun and don’t get frustrated; you can always turn DHCP back on, get the system back on the network, and start over with your experimentation.
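
A few quick checks you can run while experimenting (the addresses are placeholders), plus one way to hand control back to DHCP on a NetworkManager-based system if you get stuck:

# Can I reach my gateway?
ping -c 3 192.168.1.1

# Can I reach the Internet by IP address?
ping -c 3 1.1.1.1

# Does name resolution work?
ping -c 3 www.kernel.org

# If things go sideways, revert the connection to DHCP and start over.
sudo nmcli con mod "Wired connection 1" ipv4.method auto
sudo nmcli con up "Wired connection 1"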

SSH

In my previous article, I mentioned that the ssh and scp commands are important enough that they should be learned when just getting started with Linux, and the same goes for HPC-specific situations.

The ssh command is a grand HPC command that you use as part of setting up, using, and monitoring HPC systems. In the good old days, you had insecure commands such as telnet, rsh (remote shell), rcp (remote copy), rlogin, and rexec (the dreaded R commands). SSH was developed in response to the very weak security (if any) that was used around 1995. Today Linux uses OpenSSH.

On your experimental HPC system, you can configure SSH with the help of some good articles on this topic. Once you have SSH configured, try ssh-ing to your experimental system from your experimental system. It should ask for your password. If it does, and you can log in to your system, you are golden.
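
A minimal sketch of that first test, assuming the OpenSSH server package is installed; the username and hostname are placeholders, and the service is named sshd rather than ssh on some distributions:

# Make sure the SSH daemon is enabled and running (systemd-based systems).
sudo systemctl enable --now ssh      # use "sshd" instead of "ssh" on some distributions

# Log in to the same machine you are sitting at; expect a password prompt.
ssh localhost

# Or spell out the username and hostname explicitly (placeholders here).
ssh user@node01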

Next, try ssh-ing into the experimental server from a different system at home, if you have one. This could be a Windows laptop, another Linux system, or a Mac. On Windows, install something called MobaXterm, which gives you a Linux-like terminal window from which you can ping your experimental Linux system and ssh into it. macOS is even easier because you can open a terminal window and use ping or ssh to log in to your Linux system.

Imagine using SSH to run a single command on 10 different systems in your HPC cluster. You would have to type the password 10 times. Ugh. Instead, try configuring SSH without a password. After SSH is working and you can ssh into your Linux system, erase any existing SSH configuration by removing the .ssh directory in your home directory:

rm -rf .ssh
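
A minimal sketch of one common way to set up passwordless SSH with a key pair (the key type shown is just one reasonable choice):

# Generate a new key pair; pressing Enter at the passphrase prompt gives a truly
# password-free login (a passphrase plus ssh-agent is the more secure option).
ssh-keygen -t ed25519

# Copy the public key into ~/.ssh/authorized_keys on the target system.
# For the single-machine experiment, the "target" is the machine itself.
ssh-copy-id localhost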

Once passwordless SSH is configured, repeat the experiment of ssh-ing into your experimental Linux server from that same server. This time it should not ask you for your password, which means you have achieved success. This step will be important when you build an HPC system: typically, HPC systems have their own private network, so passwordless SSH becomes essential.

SCP

SSH is great for logging in to a system, but what if you want to copy data securely from one server to another? The scp command is a secure network version of the cp command. Indeed, SCP is the acronym for Secure Copy Protocol. This command is particularly useful when you are logging in to the system from somewhere else, such as your laptop. Even in your home cluster you might want to copy data from the cluster to a laptop or desktop (the cluster will have its own private network).

SCP was originally based on the SSH protocol, but the OpenSSH developers found SCP to be outdated, inflexible, and not readily fixed, so they switched to SFTP as the underlying protocol. SCP is very easy to use and very similar to SSH.

If, or when, you have two Linux systems in your home lab, you can experiment by copying files between the two systems with scp. The syntax should be familiar, and you can even use wildcards to copy all files, as well as use the recursive option (-r) to copy data from all the subdirectories.
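
A few examples of what that experimentation might look like; the usernames, hostnames, and paths are placeholders:

# Copy one file to your home directory on another system.
scp results.dat user@compute-01:~/

# Copy several files at once with a wildcard.
scp *.dat user@compute-01:~/data/

# Pull a whole directory tree from the remote system with the recursive option.
scp -r user@compute-01:~/run42 ./run42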

Parallel Shell

In HPC you deal with many systems rather than a single system. At times, you want or need to run a command on all the systems or a subset of the systems, and you don’t want to SSH to each system and run the command. A parallel shell that runs a specific command or script on the systems you want accomplishes this task.

A quick search shows a fair number of available parallel shells.

Some of these are a bit older, but quite a few of them are still being developed. How do you choose?

My recommendation is to just pick one. Personally, I would choose one that people write about and perhaps one that is still under active development. (Look at the source code or GitHub site and see whether the code has been updated in the last couple of years.) If you ask other people which one to select, you’ll hear a lot of strong opinions about one tool or the other, so keep that in mind.

I’ve been using pdsh  for a long while. I know the tool reasonably well as a user, and it’s still under active development. Just because I use it doesn’t mean it is better than the others. Therefore, I will write a bit about it as an example of a parallel shell.

Pdsh allows you to run the same command or script on all systems you specify. For example, you could run a command to check the hostname of all of the systems in a cluster:

$ pdsh -w compute-[01-04] hostname

You should have pdsh  use SSH, which you can do by adding the following line to your .bashrc  file:

export PDSH_RCMD_TYPE=ssh

(I’m assuming you are using the Bash shell; if not, you might have to put this line in a different file for your shell.)

Also note that you want to have passwordless SSH set up on your Linux test system(s) and in your cluster.

The -w compute-[01-04] portion of the command is the option that specifies the systems on which you want to run the specific command or script. In this case, rather than list each system individually, I list the four systems in the brief notation compute-[01-04]. This notation tells pdsh that the base name of the systems starts with compute-. The part in the brackets is a range for the final part of the system names on which to run hostname. In this example, the system names are compute-01, compute-02, compute-03, and compute-04. You can use all sorts of range combinations with pdsh, such as compute-[01-03,05-08,10,13,14,15-20] and so on.

You can use pdsh  to monitor specific systems:

$ pdsh -w compute-[01-04] uptime

Pdsh has a number of other options that can pre-define sets of systems, exclude certain systems by default or on the command line, and so on. You should be able to find a few articles about pdsh , so go read them.
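
For example, pdsh can exclude hosts with the -x option, and it can read a default list of hosts from a file named by the WCOLL environment variable, so you don't have to type -w every time (the file name below is a placeholder):

# Run uptime on compute-01 through compute-08, skipping compute-03.
pdsh -w compute-[01-08] -x compute-03 uptime

# Keep a default node list in a file and point WCOLL at it.
export WCOLL=$HOME/cluster-nodes
pdsh uptime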

Shared Filesystem

A shared filesystem presents files in the same way on every system that mounts it, so it looks like a local filesystem. This approach allows each system to always have the exact same view of the data, including creating, reading, writing, and deleting files, as well as several other file functions. Although not strictly necessary, I argue that a shared filesystem is a fundamental concept in HPC. I think you need to learn about shared filesystems and use them before making the decision not to use them.

Many details go into a shared filesystem so that the data can be shared without systems stepping on each other. You can probably figure out the most common issue with shared filesystems: writing to a file that another system is still writing. The good news is that several shared filesystems contain all the semantics to avoid this issue, so users cannot do anything too problematic. You still need to be careful, though, that when one system has just finished editing and closing a file, another system doesn't immediately delete that same file (unless you really want it to). Filesystems can't save you from this problem; that is a user education issue. One shared filesystem that has been around for a while is the Network File System (NFS).

NFS

NFS is the most common shared filesystem in Linux by a large margin. It has been around since 1984, when Sun Microsystems developed it, and it is still in active development as an open standard by many companies and individuals. It uses a client-server model, wherein one or more systems are NFS servers (they contain the data) and other systems are NFS clients. (Note that NFS servers can also be NFS clients, although that can get a little weird.)

I highly recommend reading an article or two about how to configure NFS on two Linux systems. Make one the NFS server and the other the NFS client. Mixing Linux and Windows can work, but Windows sometimes needs a third-party tool to be an NFS client (and you have to pay for the tool). Therefore, I suggest using two Linux systems (e.g., a couple of Raspberry Pis). Once you have everything set up, try creating a file on the NFS client and then check whether that file also appears on the NFS server.
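
A rough sketch of that two-system experiment, assuming Debian/Ubuntu-style package names; the exported directory, mountpoint, and IP address are placeholders:

# --- On the NFS server ---
sudo apt install nfs-kernel-server

# Export /home to the private network by adding a line like this to /etc/exports:
#   /home  192.168.1.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# --- On the NFS client ---
sudo apt install nfs-common
sudo mkdir -p /mnt/nfs-home
sudo mount -t nfs 192.168.1.50:/home /mnt/nfs-home

# Create a file on the client, then check whether it shows up on the server.
touch /mnt/nfs-home/hello-from-client.txt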

One of the advantages of a shared filesystem, using NFS as an example, is that you can mount /home from the NFS server on all the systems you want and then configure passwordless SSH for your account. When you SSH to any one of the systems that mounts /home, you won't have to enter a password, which also means that you can use pdsh without having to enter a password.
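
To make that mount permanent, a line like the following (the server address is a placeholder) in /etc/fstab on each node mounts the server's /home over the local /home at boot:

# NFS server's /home mounted on every node
192.168.1.50:/home   /home   nfs   defaults   0  0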

Next Steps

These concepts and commands pretty much round out the Linux knowledge you need for HPC. Almost everything beyond them concerns the details of how you deploy (provision) and manage your cluster, write code for it, and manage its resources so that you can share it or run batch jobs without having to stay logged in, as well as how you monitor and generally administer the cluster. I hope this article has pointed you in the direction of HPC-specific concepts and commands.

Don’t be afraid to experiment while you’re learning. Making mistakes is par for the course, so read as much as you can online and don’t be afraid to ask questions. My last piece of advice is that you don’t have to memorize all the commands and their options. It’s OK to search for commands and definitely OK to look up options. Finally, have fun!
