Managing cluster software packages
Package Deal
Setting up and configuring an HPC cluster is not as difficult as it used to be; modern provisioning tools allow almost anyone to get a cluster working in short order. An issue worth considering, however, is how easily you can change things once the cluster is working. For example, if you get a cluster set up and then a user comes to you and says, "I need package XYZ built with library EFG version 1.23," do you re-provision things to meet your user's needs, or is there an easy way to add and subtract software from a running cluster that is minimally intrusive?
The short answer to the latter question is "yes." Before I describe how you can organize a cluster to be more malleable, some mention of provisioning packages will be helpful. Three basic methods are offered by various toolsets:
- Image Based – A node disk image is propagated out to nodes on boot. Different "rolls" (images) can be constructed for different packages. An example is Rocks Clusters [1].
- NFS Root – Each node boots and mounts its root filesystem over NFS, except for things that change on each node (e.g., /etc, /var). This system can be run disk-less or disk-full. An example is oneSIS [2].
- RAM Disk – A RAM disk is created on each node that holds a running system image. The RAM disk system can be created in hybrid mode, wherein some files are available via NFS, and it can run disk-less or disk-full. An example is Warewulf [3]. (A good description of Warewulf can be found in the HPC Admin series on Warewulf [4].)
Regardless of the provisioning system, the goal is to make changes without having to reboot nodes. Not all changes can be made without rebooting nodes (i.e., changing the underlying provisioning); however, many application packages can be added or removed without too much trouble if some simple steps are taken.
Dump It into /opt
On almost all HPC clusters, users have a globally shared /home, and a globally shared /opt path is possible as well. NFS is used on small to medium-sized clusters to share these directories. On larger clusters, some type of parallel filesystem might be needed. In either case, a mechanism always exists to share files across the cluster.
The simplest method is to install packages in /opt. This approach has the advantage of "install once, available everywhere," although you might have to address some issues with logfiles; however, in general, this method will work with most software applications.
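As a minimal sketch of how /opt might be shared, assume the head node (called head here) exports it over NFS to a private cluster subnet; the hostname, subnet, and mount options are illustrative assumptions, not part of the article:
# On the head node, /etc/exports: export /opt to the cluster network
/opt    10.1.0.0/24(ro,async)

# On each compute node, /etc/fstab: mount the shared directory
head:/opt    /opt    nfs    ro,hard    0 0
Mounting read-only on the nodes is one reasonable choice, because all installs then happen only on the head node.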
The main issue administrators must deal with is dynamic library linking. Because packages are not installed in the standard /usr/lib path and you don't want to copy package entries into /etc/ld.so.conf.d/ on the nodes, you need a way to manage the location of the libraries. Of course, full static linking is one possibility, and using the LD_LIBRARY_PATH environment variable is another, but both of these solutions place extra requirements on users, and ultimately it comes back to the sys admin to support any problems. The preferred method is to install packages that "just work."
The solution is very simple. First, create /opt/etc/ld.so.conf.d/ and have all the packages place their library paths in conf files there, just as they would in /etc/ld.so.conf.d/. Next, you need to make a small addition to /etc/ld.so.conf on all nodes (i.e., it needs to be part of the node provisioning step so it is there after the node boots). The additional line is:
include /opt/etc/ld.so.conf.d/*.conf
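For reference, the resulting /etc/ld.so.conf on a node would then look something like the following sketch; the stock include line varies by distribution:
# /etc/ld.so.conf on each compute node
include /etc/ld.so.conf.d/*.conf
include /opt/etc/ld.so.conf.d/*.conf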
This new line tells ldconfig to search /opt/etc/ld.so.conf.d/ for additional library paths. If a package is added or removed, all that needs to happen is a global ldconfig on all the nodes to update the library paths. This step is easily accomplished with a tool like pdsh. Thus, installing a package globally on the cluster is as simple as installing it in /opt, making an entry in /opt/etc/ld.so.conf.d/, and running a global ldconfig.
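To illustrate the whole sequence, a hypothetical FFTW install might look like the following; the version, install prefix, and node names are assumptions used only for the example:
# Build and install into the shared /opt tree (run on the head node)
./configure --prefix=/opt/fftw/3.3.2/gnu4
make && make install

# Register the new library path for the dynamic linker
echo "/opt/fftw/3.3.2/gnu4/lib" > /opt/etc/ld.so.conf.d/fftw-3.3.2.conf

# Rebuild the linker cache on all nodes
pdsh -w n[0-31] ldconfig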
If, for example, you had the current version of Open MPI installed and a user wanted to try the PETSc libraries with a new version, you could easily install and build everything in /opt and have the user running new code without rebooting nodes or instructing them on the nuances of LD_LIBRARY_PATH. Now that you have a way to add and subtract packages easily from your cluster, you need to tell users how to use them.
Global Environment Modules
In a previous article, I described the Environment Modules package [5]. (I have recently noted that other Admin HPC authors have covered this topic as well [6].)
The use of Environment Modules [7] provides easy management of various versions and packages in a dynamic HPC environment. One of the issues, however, is how to keep your Modules environment when you use other nodes. If you use ssh to log in to nodes, then you have an easy way to keep (or not keep) your module environment.
With some configuration, the SSH protocol allows passing of environment variables. Additionally, Modules stores the currently loaded modules in an environment variable called LOADEDMODULES. For example, if I load two modules (fftw and mpich2) and then look at my environment, I will find:
LOADEDMODULES=fftw/3.3.2/gnu4:mpich2/1.4.1p1/gnu4
At this point, all I need to do is include this variable with all cluster SSH sessions, and then I can reload the Modules environment. To pass an environment variable via ssh, both the /etc/ssh/ssh_config and /etc/ssh/sshd_config files need to be changed. To begin, the client-side /etc/ssh/ssh_config file needs to have the following line added to it:
SendEnv LOADEDMODULES NOMODULES
(I will explain NOMODULES later.) Keep in mind that you can use the Host option in the ssh_config file to restrict the hosts that receive this variable. Similarly, the server-side /etc/ssh/sshd_config file on the nodes needs the following line added:
AcceptEnv LOADEDMODULES NOMODULES
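For example, a client-side entry that restricts the variables to cluster nodes, plus the step of restarting sshd on the nodes, might look like this sketch; the n* host pattern, node list, and service command are assumptions about the local setup:
# /etc/ssh/ssh_config on the login node: send the variables only to cluster nodes
Host n*
    SendEnv LOADEDMODULES NOMODULES

# After editing /etc/ssh/sshd_config on the nodes, restart sshd
# (older init systems use "service sshd restart" instead)
pdsh -w n[0-31] systemctl restart sshd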
Once the sshd service is restarted, future SSH sessions will transmit the two variables to remote SSH logins. Before the remote login can use modules, they must be loaded. This step can be done by adding a small piece of code to the user's .bashrc script, as shown in Listing 1.
Listing 1
Loading Modules
if [ -z "$NOMODULES" ] ; then
    # Reload each module listed in the colon-separated LOADEDMODULES variable
    LOADED=`echo -n $LOADEDMODULES | sed 's/:/ /g'`
    for I in $LOADED
    do
        if [ "$I" != "" ] ; then
            module load $I
        fi
    done
else
    export LOADEDMODULES=""
fi
As can be seen from this code, if NOMODULES is set, nothing is done, and no modules are loaded. If it is not set, each module listed in LOADEDMODULES is loaded. Note that this setup assumes the module package and module files are available to the node. Consider the example in Listing 2, in which two modules are loaded (fftw and mpich2) before logging in to another node (n0 in this case). On the first login, the modules are loaded on the remote node. On the second login, with NOMODULES set, no modules are available:
Listing 2
Setting NOMODULES
$ module list
Currently Loaded Modulefiles:
  1) fftw/3.3.2/gnu4     2) mpich2/1.4.1p1/gnu4
$ ssh n0
$ module list
Currently Loaded Modulefiles:
  1) fftw/3.3.2/gnu4     2) mpich2/1.4.1p1/gnu4
$ exit
$ export NOMODULES=1
$ ssh n0
$ module list
No Modulefiles Currently Loaded.
As was noted, an important assumption is the availability of module files to all the nodes. By placing the module files in the NFS-shared /opt, all the nodes can find the module files in one place, and they can be added or removed without changing the running image on the node.
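As an example of what such a shared module file might contain, here is a minimal sketch for the hypothetical fftw install used above; the /opt/modulefiles location is an assumption and would need to appear in MODULEPATH on the nodes. Note that no LD_LIBRARY_PATH entry is needed, because the library path is already handled by ldconfig:
#%Module1.0
## Hypothetical module file: /opt/modulefiles/fftw/3.3.2/gnu4
proc ModulesHelp { } {
    puts stderr "FFTW 3.3.2 (GNU 4.x build) installed under the shared /opt"
}
module-whatis "FFTW 3.3.2 built with GNU compilers, shared via /opt"
prepend-path PATH     /opt/fftw/3.3.2/gnu4/bin
prepend-path MANPATH  /opt/fftw/3.3.2/gnu4/share/man
setenv       FFTW_DIR /opt/fftw/3.3.2/gnu4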
Toward Cluster RPMs
The final ingredient in this recipe is to encapsulate both of these ideas in package RPMs; that is, an RPM that installs a package in /opt, makes the entry in /opt/etc/ld.so.conf.d/, and installs a module file. That way, except for a global ldconfig, the entire package could be installed across the cluster in one step. If pdsh (or similar) were required as part of the RPM installation process, the global ldconfig could even be done by the RPM (just as a local ldconfig is done by almost all RPMs).
Of course, building good RPMs takes some time, but once you have the basic "skeleton," it is not that difficult to adapt it to the configure/make/install steps of various packages. Once you have good cluster RPMs for your applications, installation and removal are simple, convenient, and cluster-wide.
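A minimal, hypothetical spec file skeleton along these lines might look like the following; the names, versions, and pdsh node list are assumptions, and the %prep/%build/%install sections are omitted for brevity:
Name:      fftw-cluster
Version:   3.3.2
Release:   1
Summary:   FFTW built for cluster-wide use under the shared /opt
License:   GPLv2+

%description
FFTW installed in the shared /opt tree, together with its linker path entry
and module file.

%files
/opt/fftw/3.3.2/gnu4
/opt/etc/ld.so.conf.d/fftw-3.3.2.conf
/opt/modulefiles/fftw/3.3.2/gnu4

%post
# Rebuild the linker cache cluster-wide if pdsh is available, otherwise locally
pdsh -w n[0-31] ldconfig || ldconfig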
Infos
[1] Rocks Clusters: http://www.rocksclusters.org/wordpress
[2] oneSIS: http://onesis.org/
[3] Warewulf: http://warewulf.lbl.gov/trac/
[4] Warewulf Cluster Manager series: http://www.admin-magazine.com/HPC/Articles/Warewulf-Cluster-Manager-Master-and-Compute-Nodes
[5] Managing the build environment with Environment Modules: http://www.admin-magazine.com/HPC/Articles/Managing-the-Build-Environment-with-Environment-Modules
[6] Lmod alternative environment modules: http://www.admin-magazine.com/HPC/Articles/Lmod-Alternative-Environment-Modules
[7] Environment Modules: http://modules.sourceforge.net/