Updates and Upgrades in HPC
What is the difference between an update and an upgrade? We’ll look at why, once an HPC distribution is installed, you are likely to see one or two minor release updates, but rarely an upgrade.
As my introduction to this article, I want to tell you a short story. I was working on my small Warewulf cluster to incorporate running containers. I tried Docker, Apptainer, enroot, and maybe something else. I was having trouble with all of them because of the prerequisites. Several times I installed them, tried something, then uninstalled them, ending up with compute nodes that no longer booted.
I tried removing packages from the container. I tried reinstalling packages during which the head node and the container were both updated to a new minor version of Rocky. I asked for help on the Warewulf Slack channel and directly asked a friend for help. I had a “crash cart” connected to the compute node to watch it boot. I could not fix it, but I wasn’t too upset because I had the HPC ADMIN articles and a bunch of notes.
I decided to start over with a fresh container, but before I did that, I thought about how I wanted to create the container this time. I could pull the Rocky 8 container and then maybe write a script to add the packages I needed or wanted, even if I added them in stages. I could also write a simple script that created a chroot and then build the container up in layers. (Although I started this process, my day job hasn’t allowed me to get back to the project for over a month, and counting, so I haven’t been able to build and test the container from the chroot.) I could also write a Dockerfile to build a container for myself. During these contemplations, I tried updating Rocky 8 on the head node and in the container, which really broke things: Nothing at all worked with the compute nodes.
At this point I started to think about general steps I could take to update or upgrade an HPC system, which is the subject of this article.
Update vs. Upgrade
I’m sure you remember the scientific method from school, but rather than begin with a hypothesis, I like to start with a problem statement that includes definitions. My Solutions Architect (SA) background has taught me that if you aren’t careful about defining terms and assumptions, miscommunications happen, as I discovered once when presenting to a customer who assumed I was lying and stormed out of the meeting. That episode might not have been caused entirely by a failure to set expectations and definitions, but it was a fun time nonetheless. Before diving into pontifications about updating or upgrading, I want at least to define “update” and “upgrade” to illustrate how they are very different.
If you poll 10 HPC administrators and ask them for their definitions of update and upgrade, you’ll get maybe 16 answers that are all very different. To me, update refers to keeping a distribution with a specific major and minor number current (e.g., keeping Rocky 8.6 up to date). As you move up the version ladder, things get a little murkier. I think of moving to the next minor version within a major version (e.g., from Rocky 8.6 to Rocky 8.7) as an update, although some disagree and think of this as an upgrade.
Going from one major distribution to another is what I think of as an upgrade (e.g., from Rocky 8 to Rocky 9). Before diving into a deeper discussion on updates and upgrades, here are a few assumptions I want to mention:
- storage is separate from compute (although for small systems such as mine, this isn’t always possible);
- you have a backup process that you are comfortable with;
- you have practiced restoring from backups by actually doing a restore;
- the critical nodes in a cluster are usually the head node, the login nodes, control nodes for services such as Slurm, and any other node type that takes work to install and you don’t want to repeat; and
- sometimes login nodes are considered “critical”; that is, they are simple nodes that users log in to and from which they submit their jobs.
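The three categories defined before the list of assumptions can be written down as a small shell function. This is only a sketch of the terminology used in this article: the function name classify_change is made up for illustration, and it assumes version strings in the "major.minor" form, which is the form I would expect to find in VERSION_ID in /etc/os-release.

```bash
# classify_change OLD NEW
# Prints which of the article's three categories a version change falls into.
# OLD and NEW are version strings such as 8.6 or 9 (hypothetical helper).
classify_change() {
    old_major=${1%%.*}   # everything before the first dot
    new_major=${2%%.*}
    if [ "$old_major" != "$new_major" ]; then
        echo "major-version upgrade"
    elif [ "$1" != "$2" ]; then
        echo "minor-version update"
    else
        echo "update within a minor version"
    fi
}

classify_change 8.6 8.6   # update within a minor version
classify_change 8.6 8.7   # minor-version update
classify_change 8.6 9.0   # major-version upgrade
```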
Update Within a Minor Version
An example of updating within a minor version is to keep Rocky 8.6 up to date with package updates that fall into the Rocky 8.6 definition. Within a minor version, package updates are almost always compatible with previous packages in that minor version. The goal is to keep everything compatible to minimize change but to incorporate, perhaps, security fixes or add minor features that do not break compatibility.
I say “almost always compatible” because in some cases a package maintainer introduces a minor package update that breaks compatibility. I’m sure the maintainer’s goal was not to do this, but it does happen. If this does happen, you need to be ready (more on that later).
For a variety of reasons I make the following recommendations for updating within a minor version:
- Do not make updates on a production system if you can help it.
- Make sure the test system matches the production system as closely as possible.
- Have a set of tests ready to run after the package updates.
- Document everything you do in as much detail as possible.
- Be ready to revert any package updates through your package manager. Be sure to practice using it so that you understand the idiosyncrasies.
- Have backups of the test system in case you have problems.
- When you are ready to roll out the update on a production system, have a maintenance window during which you can install the updates system-wide and be sure you can revert if you need to.
- Build in some time in case the updates don’t work or the post-update tests fail.
To elaborate on a few of these recommendations: Although you should never apply even minor version updates to a production system without trying them on a test system first, in some circumstances (e.g., my little cluster) I can’t adhere to this recommendation. Having a set of tests for checking the system after a package update is critical to ensuring there are no problems before rolling out to a production cluster. These tests should comprise basic commands, benchmarks, and user applications. Don’t make the set of tests too extensive or you will drive yourself crazy, but don’t make the set too small either.
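A post-update test set does not need a framework. The following is a minimal sketch of a check runner in shell, not a recommendation of specific tests: run_check is a made-up helper, the two active checks are placeholders, and the commented-out lines assume Slurm and a site-specific test script that may not exist on your system.

```bash
# Minimal post-update check runner (a sketch; the checks listed are placeholders).
passed=0
failed=0

# run_check NAME COMMAND [ARGS...]
# Runs COMMAND quietly, prints a timestamped PASS/FAIL line, updates the counters.
run_check() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        passed=$((passed + 1))
        echo "$(date '+%F %T') PASS $name"
    else
        failed=$((failed + 1))
        echo "$(date '+%F %T') FAIL $name"
    fi
}

# Basic commands first; benchmarks and user applications follow the same pattern.
run_check "kernel version readable" uname -r
run_check "root filesystem visible" df /
# run_check "scheduler responds" sinfo                 # assumes Slurm is installed
# run_check "user app smoke test" ./run_app_test.sh    # hypothetical script

echo "passed=$passed failed=$failed"
```

Redirecting the output of this script to a dated file gives you a record to compare against after the next round of updates.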
Probably the most important thing you should do is document what you are doing. Your .bash_history can help a little, but it’s not enough by itself. A tool that can really help is script, which allows you to record what you are doing. Be sure not to change terminal tabs because script won’t pick up anything in that new tab.
I like script because it is a sequential text file that can be easily searched and can be massively compressed (it’s just text). The following are a couple of tips when using script:
- Periodically list the contents of the current directory (check where you are with pwd), perhaps with something like ls -lstar.
- Frequently use the date command, so you know when things happened and their order.
- Type notes at the command prompt and any information you want to record. When you hit Enter you will get an error, but anything you entered will appear in the script file.
I might be a bit retentive, but I have been bitten in the past, and as the introduction to this article explains, I was bitten again.
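One way to leave notes in a script recording without the error message is a tiny shell function that just prints its arguments with a timestamp. This is a sketch: note is a hypothetical helper you would define in your own shell, not something that ships with script, and the logfile name is only an example.

```bash
# note TEXT...
# Prints a timestamped comment line so that it lands in the file script(1) is
# recording, without producing a "command not found" error.
# (note is a hypothetical helper, not part of script.)
note() {
    printf '# NOTE %s: %s\n' "$(date '+%F %T')" "$*"
}

# Typical use inside a recorded session (the file name is an example):
#   script -a update-session.log    # start recording; -a appends to the file
#   note "starting package updates on the test node"
#   pwd; ls -lstar
#   exit                            # stop recording
note "starting package updates on the test node"
```

Because every note starts with the same prefix, a later grep for "# NOTE" pulls your commentary out of the recording in order.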
Sometimes you need to revert a package update to a previous version. Most package managers allow you to do this, but before you need it, be sure you know the command options and practice reverting a package.
Complementary to reverting a package with a package manager, understand whether you can install an older package to replace it. This is different from reverting, which usually takes you back just one version; here you go back two or more versions. You might also need to do this if reverting a package doesn’t work.
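To practice, it helps to rehearse the commands without changing anything. The sketch below assumes a dnf-based distribution such as Rocky Linux and wraps three dnf subcommands (history undo, downgrade, and downgrade with an explicit version) in a dry-run helper. The function names, the transaction ID, the package name, and the version string are all made up for illustration; check the dnf documentation for your release before relying on any of it.

```bash
# Assumes a dnf-based distribution. With DRY_RUN=1 (the default here) commands
# are only printed, so the steps can be rehearsed safely.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Undo a whole transaction; find the ID first with "dnf history list".
revert_transaction() {
    run dnf history undo "$1"
}

# Step one package back to the highest lower version the enabled repos offer.
downgrade_package() {
    run dnf downgrade "$1"
}

# Go back further by naming an exact older version (name-version-release).
downgrade_to_version() {
    run dnf downgrade "$1"
}

revert_transaction 42                      # 42 is a made-up transaction ID
downgrade_package openmpi                  # package name is an example only
downgrade_to_version openmpi-4.1.1-3.el8   # hypothetical version string
```

Setting DRY_RUN=0 runs the commands for real, which is something to try on the test system long before you need it in production.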
Minor Version: Update or Upgrade?
After updating within the minor version, I then think about updating that version. An example of this is updating Rocky 8.6 to Rocky 8.7. Although I think of it as an update, you could easily think of this as an upgrade because it involves a bit more effort. Moreover, it can introduce regressions, so it might feel like an upgrade rather than an update. However, I still refer to this as an update, and I’ll save the term “upgrade” for a change in major versions.
The same steps I discussed in updating within a minor version apply to updating minor versions themselves. In my opinion, though, you need to pay more attention to the updated packages. Updating and testing them as individually as possible can be a bit tedious, but I still don’t recommend installing them all in one shot.
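One simple way to see exactly which packages a minor version update touched is to snapshot the package list before and after and compare the two. The sketch below assumes an RPM-based system for the real snapshots (shown only in comments); the helper name new_or_changed and the sample package versions are made up so that the example runs anywhere.

```bash
# new_or_changed BEFORE AFTER
# Prints lines that appear only in AFTER. Both files must be sorted, which
# comm(1) requires. (new_or_changed is a hypothetical helper.)
new_or_changed() {
    comm -13 "$1" "$2"
}

# On a real node the snapshots would come from the package database, e.g.:
#   rpm -qa | sort > before.txt
#   ... apply the updates ...
#   rpm -qa | sort > after.txt
# Here two small made-up lists stand in for them.
tmp=$(mktemp -d)
printf '%s\n' bash-4.4.20-4.el8 openssl-1.1.1k-6.el8 > "$tmp/before.txt"
printf '%s\n' bash-4.4.20-4.el8 openssl-1.1.1k-7.el8 > "$tmp/after.txt"
new_or_changed "$tmp/before.txt" "$tmp/after.txt"   # prints openssl-1.1.1k-7.el8
rm -rf "$tmp"
```

The resulting list is a natural checklist for deciding which post-update tests matter most.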
One addition I recommend to the previous documentation steps is to take screenshots periodically and put them in a document. I like doing this because I can step through the script file, but seeing the whole terminal adds a bit more context.
Major Version Upgrade
An example of a major version upgrade is going from Rocky 8.6 to Rocky 9. This change is big and is what I consider an upgrade to the cluster. My advice is to plan a complete wipe of the system and start from scratch, for two reasons. First, I have seen too many automatic “upgrades” create a “Franken System” that is never truly the upgraded version and always has bits left over from the previous version, and that is the best case. Second, I have first-hand experience of automatic upgrades that left the system so damaged that I had to restore it from a backup or an image.
Before wiping the old distribution from the cluster, I always make a backup of the critical nodes (e.g., head node, login nodes, control nodes for things like Slurm, and possibly gateway or storage nodes). However, backups might not be what you want or need. Think about using images of the critical nodes so that re-installation is much easier.
Backups
I don’t want to get deep in a discussion of backups and images of nodes; there is a whole philosophy around them and how to use them. I just want to mention a couple of ideas in relation to updates and upgrades.
I use “backup” to mean a copy of files from a node, such as /home or /opt, but I can’t really boot a node from these. This doesn't mean that backups aren’t useful, of course, but it does mean that if I need to re-install a node, I want to accomplish this directly.
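A file-level backup in this sense, together with the restore rehearsal mentioned in the assumptions, can be sketched with tar and diff. This is an illustration rather than a backup strategy: backup_dir and restore_check are made-up helper names, and the demonstration runs on a throwaway directory instead of /home or /opt.

```bash
# backup_dir SRC ARCHIVE    : archive the contents of SRC into a gzip'd tarball
# restore_check SRC ARCHIVE : unpack ARCHIVE into a scratch directory and compare
#                             it with SRC; returns 0 only if the two match
backup_dir() {
    tar -czf "$2" -C "$1" .
}

restore_check() {
    scratch=$(mktemp -d)
    tar -xzf "$2" -C "$scratch" && diff -r "$1" "$scratch" >/dev/null
    status=$?
    rm -rf "$scratch"
    return $status
}

# Demonstration on a throwaway directory; on a real node SRC might be /home or /opt.
work=$(mktemp -d)
mkdir "$work/src"
echo "site config" > "$work/src/settings.conf"
backup_dir "$work/src" "$work/backup.tar.gz"
if restore_check "$work/src" "$work/backup.tar.gz"; then
    echo "restore matches source"
else
    echo "restore differs from source"
fi
rm -rf "$work"
```

The point of the second function is the habit it encodes: a backup counts only after you have unpacked it somewhere and compared the result.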
I like to take images of nodes for re-installation. You can refer to these as backups if you like, but I prefer to separate backups, indicating just files, from images that you use to re-install a node and even boot that node.
Linux has a plethora of imaging tools. I have experience with Mondo Rescue, which I used when I was an admin at Lockheed Martin. One thing I really liked was that it could create an ISO image that I could put on a CD or DVD (way back then) or a USB drive (today) to re-install the node.
A combination of backups and node images might provide the best way to restore a system quickly.
HPC and Updates or Upgrades
All systems are different, but the HPC world differs from the enterprise world, the cloud world, or any other IT world for that matter. The focus of HPC is on high performance, and by HPC I mean its “center of mass.”
HPC doesn’t dismiss security patches, but its number one focus is application performance. Perhaps not always noticed is that HPC focuses on the user’s scientific applications, not on a database or a web server. This focus has two consequences: (1) You tune the operating system to get the best possible performance without wasting too much time, and (2) once you tune the operating system, you leave it alone and focus on the user applications.
The second aspect is very important. Once the distribution is reasonably tuned, you turn to the user applications. My experience with HPC systems is that once the distribution (operating system) is installed, you might see one or two minor release updates, but it is rare to see an upgrade. To repeat myself: The focus is on achieving large improvements in user applications. Therefore, you will see that HPC systems have a variety of compilers, libraries, and tools that are added, updated, or upgraded throughout the life of the system.