Warewulf Cluster Manager – Part 2
From a software perspective, building HPC clusters does not have to be complicated or difficult. Stateless cluster tools can make your life much easier when it comes to deploying and managing HPC clusters. Recall that stateless means that the nodes do not retain state between rebooting. In many cases, people refer to this as “diskless,” although that is not truly correct because you can still have stateless nodes that have internal disks. However, with stateless clusters, the OS is not installed on the internal disk(s), so if a cluster node reboots, it gets a new image. This allows it to be stateless between reboots. This approach can greatly reduce the amount of time it takes to deploy HPC clusters and can help eliminate image skew between compute nodes (a common problem in HPC). Clusters also are easier to update because you just reboot the compute node to get an updated image. One stateless HPC cluster tool I have been using for some time, named Warewulf, pioneered many of the common methods of imaging, deploying, and managing stateless clusters (the new version of Warewulf can also deploy stateful clusters).
In the previous article, I introduced how to install Warewulf on a master node and boot your first compute node, but this does not mean it’s a complete cluster yet. A little more configuration to the master remains, and a few other tools must be installed and configured before the cluster is useful and jobs can be run. This article is the second in a four-part series about Warewulf, where these tools are installed and configured and the cluster is ready for running applications. More specifically, I will discuss the following aspects:
- Setting up NFS on the master node for exporting several directories to the compute nodes.
- Creating a Hybrid VNFS to reduce memory usage and improve boot times.
- Adding users to the compute nodes.
- Adding a parallel shell tool, pdsh, to the master node.
- Installing and configuring ntp (a key component for running MPI jobs).
These added capabilities will create a functioning HPC cluster.
In the third of the four articles in this series, I build out the development and run-time environments for MPI applications, including environment modules, compilers, and MPI libraries, and in the fourth article, I add a number of the classic cluster tools, such as Torque and Ganglia, which make operating and administering clusters easier. For this article and all others, I will use the exact same system as before.
NFS on the Master Node
One common theme in HPC clusters is the need for a shared filesystem, which lets all the compute nodes read from and write to the same filesystem, or even the same file, eliminating the need to copy files out to the compute nodes before applications can run. Traditionally, clusters have used NFS (Network File System) as that shared filesystem for several reasons:
- NFS is the only standard network filesystem.
- It comes with virtually all Linux distributions and is available for other operating systems (including Windows).
- It is easy to configure, use, and debug.
Many people will argue that NFS is old and not very useful anymore; however, a very large number of HPC systems still use NFS, proving its utility as a common filesystem. For many applications and reasonably sized clusters, it has high enough performance that I/O is not a bottleneck. Even for extremely large clusters, it still serves a very useful role, even if it isn’t used as the storage space for running applications. Dr. Tommy Minyard of the Texas Advanced Computing Center (TACC) at the University of Texas states that TACC uses NFS on its largest clusters (4,000 nodes – almost 63,000 cores) for user /home directories for the reasons stated previously: It is easy to configure, use, and debug; it comes with almost every Linux distribution; and it has enough performance for many tasks, even on extremely large systems.
In addition to serving as a common filesystem for users of an HPC cluster, NFS can also be used by Warewulf as part of a Hybrid VNFS approach to reduce the size of the image on the compute nodes. A smaller image means less data is transmitted to the compute nodes, so they boot faster, and the image occupies less memory once a node is running.
Configuring NFS for Warewulf clusters really means configuring it on the master node. In almost all cases, NFS will already be installed on your master node; if it isn’t, install it (for Scientific Linux, the server is provided by the nfs-utils package; see the quick check below). For the example system, the NFS server is already installed and only needs to be configured.
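If you are not sure whether the server is present, a quick check from the master node might look like the following (a sketch; I am assuming the stock Scientific Linux 6 package names, where nfs-utils provides the NFS server and rpcbind is the RPC service it relies on):

[root@test1 ~]# rpm -q nfs-utils rpcbind
[root@test1 ~]# yum install nfs-utils rpcbind

If rpm reports both packages as installed, you can skip the yum step.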
The first step in configuring the master node as an NFS server is to find out, using the chkconfig command, whether the NFS service runs when the node is booted:
[root@test1 etc]# chkconfig --list
...
nfs             0:off   1:off   2:off   3:off   4:off   5:off   6:off
...
This output shows that the NFS server is not turned “on,” so when the master node is booted, it is not automatically started. To fix this, use the chkconfig command again:
[root@test1 etc]# chkconfig nfs on
[root@test1 etc]# chkconfig --list
...
nfs             0:off   1:off   2:on    3:on    4:on    5:on    6:off
...
Now the NFS server will start when the master node reboots. To make it start now, you just need to use the service command.
[root@test1 ~]# service nfs restart
Shutting down NFS mountd:                                  [FAILED]
Shutting down NFS daemon:                                  [FAILED]
Shutting down NFS quotas:                                  [FAILED]
Shutting down NFS services:                                [  OK  ]
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
I used the restart option to show that the NFS server was not initially running but now it is (I also could have used the start option).
Now that the NFS server has been started and configured to start automatically when the master node is booted, I’ll move on to configuring the various directories on the master node that I want to “export” to the compute nodes using NFS. The directories I will be exporting are:
- /var/chroots (where I store the Hybrid VNFS – more on this in the next section),
- /home (for the user’s home directories),
- /opt (for optional applications and packages that will be installed at a later date), and
- /usr/local (also for optional applications and packages).
For the compute nodes, I will mount all of the directories except /home as read-only (ro). This helps minimize any OS skew between nodes because the compute nodes cannot write to them. The file that defines which directories are to be exported, /etc/exports, should look like this:
/var/chroots 10.1.0.0/255.255.255.0(ro)
/home        10.1.0.0/255.255.255.0(rw)
/opt         10.1.0.0/255.255.255.0(ro)
/usr/local   10.1.0.0/255.255.255.0(ro)
I exported these filesystems to the 10.1.0.0 network address space, which is the private compute node network I created in the first article. By specifying the range of network addresses that can mount the filesystems (directories), I can control which nodes can mount them and which can’t.
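One small note: if you edit /etc/exports while the NFS server is already running, you don’t need to restart the service for the changes to take effect. The standard exportfs tool can re-export everything in place (a sketch; -r re-exports all directories listed in /etc/exports and -v makes it verbose):

[root@test1 ~]# exportfs -rv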
A simple command lets me check what the master node is exporting:
[root@test1 ~]# showmount -e 10.1.0.250
Export list for 10.1.0.250:
/usr/local   10.1.0.0/255.255.255.0
/opt         10.1.0.0/255.255.255.0
/home        10.1.0.0/255.255.255.0
/var/chroots 10.1.0.0/255.255.255.0
You can see that all four directories are exported and to which networks.
At this point, you should have NFS working on the master node as a server, exporting filesystems to the compute node network.
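If you want a sanity check beyond showmount, you can manually mount one of the exports from any machine on the 10.1.0.0 network that has the NFS client tools installed (a quick sketch; /mnt is just a convenient temporary mountpoint):

[root@test1 ~]# mount -t nfs 10.1.0.250:/home /mnt
[root@test1 ~]# ls /mnt
[root@test1 ~]# umount /mnt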
Hybrid VNFS
A somewhat common complaint about stateless clusters is that they use ramdisks to hold the OS, reducing the amount of memory that is usable for jobs (Note: Not all stateless clusters use a ramdisk, but many do). Warewulf uses a ramdisk, but the OS components installed on the ramdisk only contain critical portions of the OS that should be local. The footprint of the OS on each node can be further reduced by using what Warewulf calls a “Hybrid” VNFS.
All Hybrid really means is that some parts of the OS are installed in the ramdisk, local to the node, and other parts are reached over NFS. Because of the design and implementation of Warewulf, the basic process is easy: mount the VNFS from the master node on the compute node, and the Hybrid VNFS then uses symlinks that point into the NFS-mounted VNFS for those components. The steps are easy to follow because Warewulf is designed for Hybrid VNFS from the start.
The first step is to create a directory in the chroot environment (/var/chroots/sl6.2) on which to mount the VNFS. Remember that this chroot environment is the basis of the VNFS, the distribution installed on the compute nodes, and that it lives on the master node in the directory /var/chroots/sl6.2. Therefore, you can just cd to that location and create a mountpoint for the VNFS that will be NFS-mounted from the master node.
[root@test1 ~]# cd /var/chroots/sl6.2
[root@test1 sl6.2]# mkdir vnfs
[root@test1 sl6.2]# ls -s
total 88
 4 bin    4 dev   0 fastboot   4 lib     4 media   4 opt    4 root   4 selinux   4 sys   4 usr   4 vnfs
 4 boot   4 etc   4 home      12 lib64   4 mnt     4 proc   4 sbin   4 srv       4 tmp   4 var
The next step is to define the NFS mountpoints in /etc/fstab for the chroot environment. These changes to the chroot are static and will not change as new compute nodes are added; the only time they change is if you use a different NFS server for the filesystems. Because I’m already in the directory /var/chroots/sl6.2/, I just need to edit the file etc/fstab (the full path is /var/chroots/sl6.2/etc/fstab). The fstab file in the chroot should look like the listing below.
[root@test1 sl6.2]# cd etc
[root@test1 etc]# pwd
/var/chroots/sl6.2/etc
[root@test1 etc]# more fstab
#GENERATED_ENTRIES#
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
10.1.0.250:/var/chroots/sl6.2 /vnfs nfs defaults 0 0
10.1.0.250:/home /home nfs defaults 0 0
10.1.0.250:/opt /opt nfs defaults 0 0
10.1.0.250:/usr/local /usr/local nfs defaults 0 0
Here I have mounted the specific chroot environment for this node, which in this case is sl6.2. Also notice that I have included the other NFS filesystems, /home, /opt, and /usr/local, in etc/fstab so I can mount those as well (I will need them later). This completes the changes I need to make to the chroot environment (the VNFS), and I can now move on to the master node to complete one last configuration step before rebuilding the VNFS.
When Warewulf is installed, it should create a file, /etc/warewulf/vnfs.conf, on the master node that describes the VNFS at a high level. This file holds a list of paths to exclude from the VNFS image; with hybridization enabled, the excluded paths are reached through symlinks into the NFS-mounted VNFS rather than being stored locally on the compute nodes. For Warewulf 3.1, the exclude list created when Warewulf is installed should look something like the following listing:
[root@test1 ~]# more /etc/warewulf/vnfs.conf
# General Vnfs configuration file.
#
# You can also create a VNFS specific configuration file in the vnfs/
# configuration directory named "vnfs/[name].conf". There you can put
# additional information and configuration paramaters.
excludes += /usr/share
excludes += /usr/X11R6
excludes += /usr/lib/locale
excludes += /usr/lib64/locale
excludes += /usr/src
excludes += /usr/include
excludes += /usr/local
# These are optional because they may break things if your not enabled
# hybridization
#excludes += /usr/lib64/R
#excludes += /usr/lib64/python2.4
#excludes += /usr/lib/perl5
#excludes += /usr/openv
#excludes += /usr/lib64/perl5
#excludes += /usr/lib64/dri
Now that the compute node chroot environment has been configured as well as the VNFS configuration file on the master node, the next step is to just rebuild the VNFS but with an additional flag to tell Warewulf that the VNFS is now a Hybrid.
[root@test1 ~]# wwvnfs --chroot /var/chroots/sl6.2 --hybridpath=/vnfs
Using 'sl6.2' as the VNFS name
Creating VNFS image for sl6.2
Building template VNFS image
Excluding files from VNFS
Building and compressing the final image
Cleaning temporary files
Are you sure you wish to overwrite the Warewulf VNFS Image 'sl6.2'?
Yes/No> y
Importing into existing VNFS Object: sl6.2
Done.
Notice that I wrote over the old VNFS because I really want the smallest VNFS possible, and I have no particular reason to keep the previous one. (Note: Be very careful if you rebuild the chroot from scratch with mkroot-rh.sh because it will overwrite any manual changes you have made to the chroot environment.)
Because Warewulf already knows about the VNFS named sl6.2, you do not have to re-associate the VNFS with specific compute nodes – it’s already done. If you like, you can check the size of the VNFS with a simple command:
[root@test1 ~]# wwsh vnfs list
VNFS NAME            SIZE (M)
sl6.2                55.3
To test the new Hybrid VNFS, simply reboot the compute node and then check to see whether the filesystems are mounted with the mount command on the compute node.
-bash-4.1# mount
none on / type tmpfs (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.1.0.250:/var/chroots/sl6.2 on /vnfs type nfs (ro,vers=4,addr=10.1.0.250,clientaddr=10.1.0.1)
10.1.0.250:/home on /home type nfs (rw,vers=4,addr=10.1.0.250,clientaddr=10.1.0.1)
10.1.0.250:/opt on /opt type nfs (ro,vers=4,addr=10.1.0.250,clientaddr=10.1.0.1)
10.1.0.250:/usr/local on /vnfs/usr/local type nfs (ro,vers=4,addr=10.1.0.250,clientaddr=10.1.0.1)
As you can see, the four NFS filesystems are mounted from the master node (10.1.0.250). In case you didn’t know, beginning with version 6.x, Red Hat Enterprise Linux uses NFSv4 by default instead of v3. Because Scientific Linux is based on Red Hat Enterprise Linux, it uses NFSv4 as well, which you can see in the listing as vers=4 for each of the mountpoints.
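If you ever need to force NFSv3 for one of these mounts (for example, while chasing an ID-mapping problem that only appears with v4), the standard vers mount option can be added to the corresponding line in the chroot’s etc/fstab. A sketch for the /home entry, which I have not needed on this test system:

10.1.0.250:/home /home nfs defaults,vers=3 0 0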
If you want to see how much memory the VNFS is using, you can simply use the df command on the compute node.
-bash-4.1# df
Filesystem                    1K-blocks     Used  Available Use% Mounted on
none                            1478332   221404    1256928  15% /
tmpfs                           1468696      148    1468548   1% /dev/shm
10.1.0.250:/var/chroots/sl6.2  54987776 29582336   22611968  57% /vnfs
10.1.0.250:/home               54987776 29582336   22611968  57% /home
10.1.0.250:/opt                54987776 29582336   22611968  57% /opt
10.1.0.250:/usr/local          54987776 29582336   22611968  57% /vnfs/usr/local
-bash-4.1# df -h
Filesystem                    Size  Used Avail Use% Mounted on
none                          1.5G  217M  1.2G  15% /
tmpfs                         1.5G  148K  1.5G   1% /dev/shm
10.1.0.250:/var/chroots/sl6.2  53G   29G   22G  57% /vnfs
10.1.0.250:/home               53G   29G   22G  57% /home
10.1.0.250:/opt                53G   29G   22G  57% /opt
10.1.0.250:/usr/local          53G   29G   22G  57% /vnfs/usr/local
From the output, it can be seen that only 217MB of memory is used on the compute node for storing the local OS. Given that you can easily and inexpensively buy 8GB for consumer boxes such as the one I’m using in this test, 217MB is pretty easy to tolerate, especially when it means I can boot compute nodes very quickly (the time to reboot the compute node was less than a minute).
Adding Users
At this point, you can start adding users to the compute nodes. There are really two things that you need to do: (1) Make sure the password and group information is on the compute nodes and (2) have the user’s /home directories mounted on the compute nodes. With Warewulf, these steps are pretty easy, and I’ll introduce you to some pretty cool Warewulf capabilities.
The first step is to make sure the user’s account information is on the compute nodes. Really, you need to focus on three files: /etc/passwd, /etc/group, and /etc/shadow. This assumes you have shadow passwords enabled – if you don’t know for sure, it’s almost a certainty that they are being used because they are on by default in most distributions. The first obvious solution is to include these files in the VNFS, rebuild the VNFS, and then reboot the compute nodes. Although this will work, what happens if the files change, such as when new users are added? Fortunately, Warewulf has a much easier method for handling this – using the capabilities of wwsh.
The Warewulf shell, wwsh, is a very powerful way for you to interact with the Warewulf provisioning system. It is primarily used for configuring and managing Warewulf. I won’t go over it in detail because you can consult the documentation, but I do want to highlight some of its features. The shell allows you to:
- manage bootstrap and VNFS images,
- manage the node database,
- manage the provisioning of nodes, and
- provide a generic interface to the datastore through the object command, and control the DHCP services.
These four things seem pretty simple, but in fact, they are very powerful, and I will be using them in the rest of this article and subsequent articles. If you haven’t noticed, I have already used wwsh a bit in this and the previous article.
wwsh is an interactive shell run by root (the cluster administrator). Once you get into the shell by typing wwsh, you have a prompt where you can enter commands. The commands take the form of
wwsh [command] [options] [targets]
with several commands:
- bootstrap: manage bootstrap images
- dhcp: manage DHCP services and configuration
- events: control how events are handled
- file: manage files within the Warewulf datastore
- node: manage the node database
- object: generic interface to the Warewulf datastore
- provision: manage the provisioning of bootstrap images, VNFS images, and files
- pxe: configure node PXE configuration
- vnfs: manage VNFS images
- quit: exit the Warewulf shell
If you get stuck, you can always enter help. If you type help at the command prompt, you get help on wwsh itself. If you type bootstrap help, you get help on the bootstrap command and its options.
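Note that you don’t have to be inside the interactive shell to use these commands; wwsh also accepts them directly from the bash prompt, which is how I use it in the rest of this article. For example, to list the nodes currently in the Warewulf datastore, you should be able to run:

[root@test1 ~]# wwsh node list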
Using wwsh, I am going to “import” the three files I need for user accounts into the Warewulf database; then, I’ll associate the files with the nodes I want. When the compute nodes are booted, Warewulf will automagically copy these files to the nodes on my behalf (pretty sweet, if you ask me). The first step is to import the files into the Warewulf database.
[root@test1 ~]# wwsh file import /etc/passwd
Imported passwd into a new object
[root@test1 ~]# wwsh file import /etc/group
Imported group into a new object
[root@test1 ~]# wwsh file import /etc/shadow
Imported shadow into a new object
[root@test1 ~]# wwsh file list
NAME               FORMAT     #NODES   SIZE(K)  FILE PATH
================================================================================
dynamic_hosts      data       0        0.3      /etc/hosts
passwd             data       0        1.9      /etc/passwd
group              data       0        0.9      /etc/group
shadow             data       0        1.1      /etc/shadow
When the files are imported, Warewulf assigns a “name” to the file. For example, the file /etc/passwd became passwd in the Warewulf database. When working with Warewulf, be sure to use the “name” and not the full path.
With the above commands, I imported three files into the Warewulf database. However, when I list the files, you can see that there are actually four. The fourth file, dynamic_hosts, is something that Warewulf creates, and it will be used later in this article.
Now that the files are imported into the Warewulf database, the next step is to associate the files with the nodes to be provisioned. This is also very easy to do with the wwsh command and the --fileadd option for the provision command.
[root@test1 ~]# wwsh provision set n[0001-0249] --fileadd passwd
Are you sure you want to make the following changes to 1 node(s):
     ADD: FILES = passwd
Yes/No> y
[root@test1 ~]# wwsh provision set n[0001-0249] --fileadd group
Are you sure you want to make the following changes to 1 node(s):
     ADD: FILES = group
Yes/No> y
[root@test1 ~]# wwsh provision set n[0001-0249] --fileadd shadow
Are you sure you want to make the following changes to 1 node(s):
     ADD: FILES = shadow
Yes/No> y
As I stated previously, be sure you use the Warewulf “name” of the file and not the full path in the above commands. Also notice that Warewulf tells you only one node has been defined up to this point.
Now, I’ll reboot the compute node to see whether Warewulf pushes the files to the node. I’ll check to see whether this happened by watching the file /etc/group, which is one of the files I imported into the Warewulf database. Before rebooting the compute node, the file looks like this:
-bash-4.1# more /etc/group
root:x:0:root
bin:x:1:root,bin,daemon
daemon:x:2:root,bin,daemon
sys:x:3:root,bin,adm
adm:x:4:root,adm,daemon
tty:x:5:
disk:x:6:root
lp:x:7:daemon,lp
mem:x:8:
kmem:x:9:
wheel:x:10:root
mail:x:12:mail
uucp:x:14:uucp
man:x:15:
games:x:20:
gopher:x:30:
video:x:39:
dip:x:40:
ftp:x:50:
lock:x:54:
audio:x:63:
nobody:x:99:
users:x:100:
utmp:x:22:
utempter:x:35:
floppy:x:19:
vcsa:x:69:
rpc:x:32:
cdrom:x:11:
tape:x:33:
dialout:x:18:
rpcuser:x:29:
nfsnobody:x:65534:
sshd:x:74:
After the compute node has rebooted, the /etc/group file looks like this:
-bash-4.1# more /etc/group
root:x:0:root
bin:x:1:root,bin,daemon
daemon:x:2:root,bin,daemon
sys:x:3:root,bin,adm
adm:x:4:root,adm,daemon
tty:x:5:
disk:x:6:root
lp:x:7:daemon,lp
mem:x:8:
kmem:x:9:
wheel:x:10:root
mail:x:12:mail,postfix
uucp:x:14:uucp
man:x:15:
games:x:20:
gopher:x:30:
video:x:39:
dip:x:40:
ftp:x:50:
lock:x:54:
audio:x:63:
nobody:x:99:
users:x:100:
dbus:x:81:
utmp:x:22:
utempter:x:35:
rpc:x:32:
avahi-autoipd:x:170:
oprofile:x:16:
desktop_admin_r:x:499:
desktop_user_r:x:498:
floppy:x:19:
vcsa:x:69:
ctapiusers:x:497:
rtkit:x:496:
cdrom:x:11:
tape:x:33:
dialout:x:18:
saslauth:x:76:
avahi:x:70:
postdrop:x:90:
postfix:x:89:
qpidd:x:495:
ntp:x:38:
cgred:x:494:
apache:x:48:
rpcuser:x:29:
nfsnobody:x:65534:
haldaemon:x:68:haldaemon
pulse:x:493:
pulse-access:x:492:
stapdev:x:491:
stapusr:x:490:
stap-server:x:155:
fuse:x:489:
gdm:x:42:
sshd:x:74:
tcpdump:x:72:
slocate:x:21:
laytonjb:x:500:
mysql:x:27:
dhcpd:x:177:
Notice that this file is substantially different from the original; all I did was reboot the node, and Warewulf took care of everything else.
Recall that the intent of importing these files into the Warewulf database was to allow users to log in to the compute nodes, so as a final test, as a user, I ssh into the compute node. If this works, I should see all of my files.
[laytonjb@test1 ~]$ ssh 10.1.0.1
The authenticity of host '10.1.0.1 (10.1.0.1)' can't be established.
RSA key fingerprint is 64:cf:46:1f:03:b4:93:8f:fb:ca:15:8e:c7:1c:07:d1.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.1.0.1' (RSA) to the list of known hosts.
laytonjb@10.1.0.1's password:
[laytonjb@n0001 ~]$ ls -s
total 44
4 BENCHMARK   4 CLUSTERBUFFER   4 Desktop   4 Documents   4 Downloads   4 Music
4 Pictures    4 Public          4 src       4 Templates   4 Videos
Notice that I had to supply a password to SSH to the node. For HPC clusters, which are on a private network, this can be a pain when starting jobs that use a large number of nodes. What I really want is a passwordless SSH to the compute nodes for users. Fortunately, this is pretty easy to do.
Basically, so I will not have to use a password when logging in, I will use public key authentication with an empty passphrase and then copy the public part of the key to a specific file. The first step is to generate the keys as a user on the master node:
[laytonjb@test1 ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/laytonjb/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/laytonjb/.ssh/id_rsa.
Your public key has been saved in /home/laytonjb/.ssh/id_rsa.pub.
The key fingerprint is:
a9:90:af:81:69:fc:4f:b5:ef:6e:b5:d4:b7:cb:c6:02 laytonjb@test1
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|                 |
|      . .        |
|     o S .       |
| . o o o . Eo . .|
|  = . + . o..... |
| . . + .. ...+   |
|     o.. ++ oo.  |
+-----------------+
When prompted for a passphrase, I just hit Enter. This process creates a few files in a .ssh subdirectory at the top level of your home directory.
[laytonjb@test1 ~]$ cd .ssh
[laytonjb@test1 .ssh]$ ls -s
total 12
4 id_rsa  4 id_rsa.pub  4 known_hosts
I want to copy the public key contained in the id_rsa.pub file to a file named authorized_keys in the same subdirectory. This means the public key I created is an “authorized” key for the account:
[laytonjb@test1 .ssh]$ cp id_rsa.pub authorized_keys
[laytonjb@test1 .ssh]$ ls -s
total 16
4 authorized_keys  4 id_rsa  4 id_rsa.pub  4 known_hosts
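If you ever want to script this setup for a batch of new accounts, the same steps can be run non-interactively. A small sketch (-N "" sets an empty passphrase, -f chooses the key file, and appending to authorized_keys instead of copying preserves any keys that are already there):

[laytonjb@test1 ~]$ ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa
[laytonjb@test1 ~]$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
[laytonjb@test1 ~]$ chmod 600 $HOME/.ssh/authorized_keys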
Everything is now set for the user. You can SSH to the compute node because your home directory is NFS-mounted on the compute nodes:
[laytonjb@test1 ~]$ ssh 10.1.0.1
Last login: Sat May 26 11:45:39 2012 from 10.1.0.250
[laytonjb@n0001 ~]$
I didn’t have to supply a password while sshing into the compute nodes this time. A key to this is that the authorized_keys file is in the home directory of the user logging in to that node. Remember that I used NFS to mount the user’s home directory on the compute nodes.
One additional item to note is that the first time you SSH to a node, it will prompt you to say that the authenticity of the host is not known. If you continue, ssh will add this node to a list of known hosts (the file known_hosts in your .ssh subdirectory). You only need to do this once, but it is necessary.
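If the first-connection prompt bothers you on a trusted private cluster network, the ssh client can also be told to relax host-key checking for the compute nodes. This is a convenience/security trade-off, so treat the following ~/.ssh/config entry as strictly optional (a sketch; the Host patterns assume nodes named n0001, n0002, and so on, on the 10.1.0.0 network):

Host n* 10.1.0.*
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null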
It would also be good to have passwordless SSH for root, but root’s home directory is not NFS-exported to the compute nodes (and probably shouldn’t be). A simple way of getting root’s authorized_keys file to the compute nodes is to use wwsh to import the file, associate it with the desired compute nodes, and then reboot the compute nodes.
To start this process, root must go through the same key generation process that the user did, using the Enter key for the passphrase and then copying the public keys to the authorized_keys file. Then, this file can be imported into the Warewulf database and associated with the compute nodes:
[root@test1 ~]# wwsh file import /root/.ssh/authorized_keys
Imported authorized_keys into a new object
[root@test1 .ssh]# wwsh file list
NAME               FORMAT     #NODES   SIZE(K)  FILE PATH
================================================================================
dynamic_hosts      data       0        0.3      /etc/hosts
passwd             data       1        1.9      /etc/passwd
group              data       1        0.9      /etc/group
shadow             data       1        1.1      /etc/shadow
hosts              data       1        0.4      /etc/hosts
authorized_keys    data       0        0.4      /root/.ssh/authorized_keys
[root@test1 .ssh]# wwsh provision set n[0001-0249] --fileadd authorized_keys
Are you sure you want to make the following changes to 1 node(s):
     ADD: FILES = authorized_keys
Yes/No> y
After rebooting the compute node, which literally takes less than one minute over a Fast Ethernet network, root can log in to the compute node with ssh and no password.
[root@test1 ~]# ssh 10.1.0.1
-bash-4.1# ls -s
total 0
One additional thing I did when sshing to the compute nodes that I would rather not do in the future was enter the IP address of the compute node; I would rather use the node name. But for that to work, the nodes need a way to resolve node names to IP addresses. Once again, Warewulf comes to the rescue.
Recall that when I listed the files imported into the Warewulf database, Warewulf had defined a file named dynamic_hosts. It turns out that this file contains a list of node names and IP addresses that go into /etc/hosts (Note: For more experienced HPC cluster admins, Warewulf uses a hosts file for name resolution rather than DNS, but you could configure DNS if you like). A quick peek at the dynamic_hosts file in the Warewulf database reveals:
[root@test1 ~]# wwsh file show dynamic_hosts
#### dynamic_hosts ############################################################
# Host file template dynamically generated by Warewulf
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain4
10.1.0.1 n0001 n0001.localdomain
This file looks a great deal like what I need in /etc/hosts for the compute nodes and for the master node (don’t forget that the master node doesn’t know the compute node names either). Warewulf does the work for you by adding each compute node’s name and IP address to this dynamically generated hosts list and keeping it up to date.
Creating the /etc/hosts file for the master node is very easy to do with the use of the dynamic_hosts content.
[root@test1 ~]# cp /etc/hosts /etc/hosts.old
[root@test1 ~]# wwsh file show dynamic_hosts > /etc/hosts
[root@test1 ~]# more /etc/hosts
# Host file template dynamically generated by Warewulf
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain4
10.1.0.1 n0001 n0001.localdomain
#### dynamic_hosts ############################################################
Notice that I made a copy of /etc/hosts before I sent the contents of dynamic_hosts to the file, overwriting what was there. To test whether this worked, ssh to the node n0001 as root.
[root@test1 ~]# ssh n0001
Last login: Sat May 26 12:00:06 2012 from 10.1.0.250
The /etc/hosts on the master node works fine.
Getting a full list of hostnames onto the compute nodes really has two parts: adding the name of the master node and adding the list of compute nodes. The result is an /etc/hosts file on each compute node that contains the master node name and all of the compute node names.
The first step is to get the master node into /etc/hosts. When Warewulf installs on the master node, a file called /etc/warewulf/hosts-template serves as the template for the hostnames (dynamic_hosts in the Warewulf database). Just add the entry for the master node to this file. In the case of the test system I’ve been using, this is 10.1.0.250, so the /etc/warewulf/hosts-template on the master node should look like this:
[root@test1 ~]# cd /etc/warewulf
[root@test1 warewulf]# more hosts-template
# Host file template dynamically generated by Warewulf
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain4
10.1.0.250 test1 test1.localdomain
Notice that the last line is the master node. Warewulf will use this template when it constructs the file dynamic_hosts.
For the compute nodes to get the list of compute node names, all you need to do is associate the dynamic_hosts file with the compute nodes using wwsh.
[root@test1 ~]# wwsh file list
NAME               FORMAT     #NODES   SIZE(K)  FILE PATH
================================================================================
dynamic_hosts      data       0        0.3      /etc/hosts
passwd             data       1        1.9      /etc/passwd
group              data       1        0.9      /etc/group
shadow             data       1        1.1      /etc/shadow
authorized_keys    data       1        0.4      /root/.ssh/authorized_keys
[root@test1 ~]# wwsh provision set n[0001-0249] --fileadd dynamic_hosts
Are you sure you want to make the following changes to 1 node(s):
     ADD: FILES = dynamic_hosts
Yes/No> y
Before you can reboot the compute nodes so they get the new /etc/hosts file, you need to force Warewulf to rebuild dynamic_hosts so that it picks up the updated hosts template. This can be done by creating a fictitious node and then deleting it (thanks to bpape on the Warewulf list for this handy command):
echo y | wwsh node new --ipaddr 10.10.10.10 --netmask 255.255.255.0 --netdev=eth0 ntest ; echo y | wwsh node delete ntest
Just be sure you aren’t using the IP address 10.10.10.10 on the cluster network and that you haven’t used the node name ntest. Note that a future version of Warewulf will have a simple option for rebuilding dynamic_hosts.
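Before rebooting, you can confirm that the regeneration took effect by displaying the file from the datastore again; the output should now begin with the entries from the updated template, including the test1 line for the master node:

[root@test1 ~]# wwsh file show dynamic_hosts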
Now you can test that your changes have worked by rebooting the compute nodes and checking the contents of the /etc/hosts file on the compute node.
[root@test1 ~]# ssh n0001
-bash-4.1# more /etc/hosts
# Host file template dynamically generated by Warewulf
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain4
10.1.0.250 test1 test1.localdomain
10.1.0.1 n0001 n0001.localdomain
You can also test this as a user:
[laytonjb@test1 ~]$ ssh n0001
[laytonjb@n0001 ~]$ ssh test1
Last login: Sun Jun 3 16:15:49 2012
Looks like a winner!
Installing and Configuring Cluster Tools
At this point, you have user accounts on the compute nodes, and you are able to log in to the nodes using node names rather than IP addresses – and you don’t have to enter passwords. Now you can run HPC applications as users, and things should work correctly and easily. However, you should add some tools to the cluster to make it easier to administer, including one that can help with parallel applications. More specifically, I want to install and configure two things: (1) a parallel shell tool that makes life much easier when running the same command across a cluster and (2) NTP, the Network Time Protocol, which allows the compute nodes to sync their clocks to the master node. (Note: If you don’t keep the clocks in sync, you run the risk of problems in applications and tools that compare timestamps across nodes.) To begin, I’ll install and configure the parallel shell, pdsh, on the master node.
Several parallel shells can be found around the web. They all focus on running a single command across a range of nodes, which comes in handy when administering HPC clusters. Each one takes a slightly different approach, and some have features that others don’t, but probably the most commonly used parallel shell in HPC is pdsh. Pdsh is open source, easy to install, very flexible, and in use on a large number of systems, so getting help or answers to your questions is fairly easy.
If you installed the extra Yum repos on your Scientific Linux, then installing pdsh is very easy with
yum install pdsh
on the master node. You could also install pdsh on the compute nodes, but I’m not going to, because I don’t see the need to run parallel shell commands from the compute nodes. If you want it there, feel free to install pdsh in the chroot environment by following the basic steps of install, configure, rebuild the VNFS, and reboot the compute nodes. Installing pdsh on the master node is easy using Yum:
[root@test1 ~]# yum install pdsh
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
 * elrepo: mirror.symnds.com
 * epel: mirror.umd.edu
 * rpmforge: mirror.us.leaseweb.net
 * sl: ftp.scientificlinux.org
 * sl-security: ftp.scientificlinux.org
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package pdsh.x86_64 0:2.27-1.el6.rf will be installed
--> Finished Dependency Resolution

Dependencies Resolved

========================================================================
 Package       Arch          Version                Repository     Size
========================================================================
Installing:
 pdsh          x86_64        2.27-1.el6.rf          rpmforge      146 k

Transaction Summary
========================================================================
Install       1 Package(s)

Total download size: 146 k
Installed size: 753 k
Is this ok [y/N]: y
Downloading Packages:
pdsh-2.27-1.el6.rf.x86_64.rpm                    | 146 kB     00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : pdsh-2.27-1.el6.rf.x86_64                            1/1

Installed:
  pdsh.x86_64 0:2.27-1.el6.rf

Complete!
As you can see in the output above, pdsh is pulled from the RPMforge repository (the package is built for RHEL 6 and its clones, hence the el6.rf in the version), not from the stock distribution repos.
Once pdsh is installed, you can immediately start using it. Here is a simple example of checking the date on a compute node:
[laytonjb@test1 sl6.2]$ pdsh -w n0001 date
n0001: Sat Jun  2 11:52:01 EDT 2012
Notice that I specified which nodes to use with the -w option. You can also configure pdsh to read the list of nodes from an environment variable, so you don’t have to use the -w option if you don’t want to. Please read the documentation on the pdsh site for more details.
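As an example of the environment-variable approach just mentioned, pdsh honors WCOLL, which points to a file of target hosts, one per line (a small sketch; ~/hosts.all is just a name I made up):

[laytonjb@test1 ~]$ echo n0001 > ~/hosts.all
[laytonjb@test1 ~]$ export WCOLL=$HOME/hosts.all
[laytonjb@test1 ~]$ pdsh date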
The next package to install is ntp. NTP is a way for clocks on disparate nodes to synchronize to each other. This aspect is important for clusters because, if the clocks are too skewed relative to each other, the application might have problems. You need to install ntp on both the master node and the VNFS, but start by installing it on the master node.
As with other packages, installing ntp is very easy with Yum. Typically, it is installed as part of a basic build, but if it isn’t present, you can install it on the master node like this:
[root@test1 ~]# yum install ntp
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
 * elrepo: mirror.symnds.com
 * epel: mirror.symnds.com
 * rpmforge: mirror.teklinks.com
 * sl: ftp.scientificlinux.org
 * sl-security: ftp.scientificlinux.org
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package ntp.x86_64 0:4.2.4p8-2.el6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

========================================================================
 Package       Arch          Version                Repository     Size
========================================================================
Installing:
 ntp           x86_64        4.2.4p8-2.el6          sl            444 k

Transaction Summary
========================================================================
Install       1 Package(s)

Total download size: 444 k
Installed size: 1.2 M
Is this ok [y/N]: y
Downloading Packages:
ntp-4.2.4p8-2.el6.x86_64.rpm                     | 444 kB     00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : ntp-4.2.4p8-2.el6.x86_64                             1/1

Installed:
  ntp.x86_64 0:4.2.4p8-2.el6

Complete!
Configuring NTP on the master node isn’t too difficult and might not require anything more than making sure the service is running now and that it starts again when the master node is rebooted (this process should be familiar to you by now):
[root@test1 ~]# chkconfig --list
...
ntpd            0:off   1:off   2:off   3:off   4:off   5:off   6:off
...
[root@test1 ~]# chkconfig ntpd on
[root@test1 ~]# chkconfig --list
...
ntpd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
...
[root@test1 ~]# service ntpd restart
Shutting down ntpd:                                        [FAILED]
Starting ntpd:                                             [  OK  ]
I used the install defaults on Scientific Linux 6.2 on my test system. The file /etc/ntp.conf defines the NTP configuration, with three important lines that point to external servers for synchronizing the clock of the master node:
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
server 0.rhel.pool.ntp.org
server 1.rhel.pool.ntp.org
server 2.rhel.pool.ntp.org
This means the master node is using three external sources for synchronizing clocks. To check this, you can simply use the ntpq command along with the lpeers option:
[root@test1 ~]# ntpq
ntpq> lpeers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10504.x.rootbsd 198.30.92.2      2 u   27   64    3   40.860  141.975   4.746
 wwwco1test12.mi 64.236.96.53     2 u   24   64    3  106.821  119.694   4.674
 mail.ggong.info 216.218.254.202  2 u   25   64    3   98.118  145.902   4.742
The output indicates that the master node is syncing time with three different servers on the Internet. Knowing how to point NTP to a particular server will be important for configuring the VNFS.
The next step is to install NTP into the chroot environment that is the basis for the VNFS (again, with Yum). Yum is flexible enough that you can tell it the root location where you want it installed. This works perfectly with the chroot environment:
[root@test1 ~]# yum --tolerant --installroot /var/chroots/sl6.2 -y install ntp
Loaded plugins: fastestmirror, refresh-packagekit, security
Determining fastest mirrors
epel/metalink                                    |  12 kB     00:00
 * elrepo: mirror.symnds.com
 * epel: mirror.hiwaay.net
 * rpmforge: mirror.us.leaseweb.net
 * sl: ftp.scientificlinux.org
 * sl-security: ftp.scientificlinux.org
atrpms                                           | 3.5 kB     00:00
elrepo                                           | 1.9 kB     00:00
rpmforge                                         | 1.9 kB     00:00
sl                                               | 3.2 kB     00:00
sl-security                                      | 1.9 kB     00:00
warewulf-rhel-6                                  | 2.3 kB     00:00
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package ntp.x86_64 0:4.2.4p8-2.el6 will be installed
--> Processing Dependency: ntpdate = 4.2.4p8-2.el6 for package: ntp-4.2.4p8-2.el6.x86_64
--> Running transaction check
---> Package ntpdate.x86_64 0:4.2.4p8-2.el6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

========================================================================
 Package          Arch         Version               Repository    Size
========================================================================
Installing:
 ntp              x86_64       4.2.4p8-2.el6         sl           444 k
Installing for dependencies:
 ntpdate          x86_64       4.2.4p8-2.el6         sl            57 k

Transaction Summary
========================================================================
Install       2 Package(s)

Total download size: 501 k
Installed size: 1.2 M
Downloading Packages:
(1/2): ntp-4.2.4p8-2.el6.x86_64.rpm              | 444 kB     00:00
(2/2): ntpdate-4.2.4p8-2.el6.x86_64.rpm          |  57 kB     00:00
------------------------------------------------------------------------
Total                                   902 kB/s | 501 kB     00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : ntpdate-4.2.4p8-2.el6.x86_64                         1/2
  Installing : ntp-4.2.4p8-2.el6.x86_64                             2/2

Installed:
  ntp.x86_64 0:4.2.4p8-2.el6

Dependency Installed:
  ntpdate.x86_64 0:4.2.4p8-2.el6

Complete!
Now that NTP is installed into the chroot, you just need to configure it by editing the etc/ntp.conf file that is in the chroot. For the example in this article, the full path to the file is /var/chroots/sl6.2/etc/ntp.conf. The file should look like this,
[root@test1 etc]# more ntp.conf
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).

#driftfile /var/lib/ntp/drift

restrict default ignore
restrict 127.0.0.1
server 10.1.0.250
restrict 10.1.0.250 nomodify
where 10.1.0.250 is the IP address of the master node. (It uses the private cluster network.)
To get NTP ready to run on the compute nodes, you have one more small task. When ntp is installed into the chroot, it is not configured to start automatically, so you need to force it to run. A simple way to do this is to put the commands into the etc/rc.d/rc.local file in the chroot. For this example, the full path to the file is /var/chroots/sl6.2/etc/rc.d/rc.local and should look as follows:
#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.

touch /var/lock/subsys/local
chkconfig ntpd on
service ntpd start
Although there are probably better, more elegant ways to do this, I chose this approach for the sake of expediency.
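One such alternative, sketched here on the assumption that chkconfig inside the image behaves the same way it does on the master node, is to enable the service directly in the chroot instead of forcing it from rc.local:

[root@test1 ~]# chroot /var/chroots/sl6.2 chkconfig ntpd on

Either way, the VNFS has to be rebuilt afterward so the change reaches the compute nodes.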
Now ntp is installed and configured in the chroot environment, but because it has changed, you now have to rebuild the VNFS for the compute nodes. Just remember that this is easy to do with the wwvnfs command (don’t forget the hybridpath option):
[root@test1 ~]# wwvnfs --chroot /var/chroots/sl6.2 --hybridpath=/vnfs
Using 'sl6.2' as the VNFS name
Creating VNFS image for sl6.2
Building template VNFS image
Excluding files from VNFS
Building and compressing the final image
Cleaning temporary files
Are you sure you wish to overwrite the Warewulf VNFS Image 'sl6.2'?
Yes/No> y
Importing into existing VNFS Object: sl6.2
Done.
If the chroot is large, this could take some time, so please be patient. You can also check the load on the master node while it is building the VNFS, so when the load starts coming down, it is probably finished.
Now all you do is reboot your compute node to refresh the VNFS. If you haven’t noticed, up to this point, just sshing to the compute node and typing reboot works very easily. You can even use pdsh, but be careful you don’t reboot a compute node while it is being used. On the test system I’ve been using, the time from when I reboot the compute node to when the compute node is up and I can ssh into it is less than a minute – sometimes around 30 seconds – with my slow Fast Ethernet connection between the master node and the compute nodes. This shows you just how fast a good stateless tool such as Warewulf can provision HPC clusters.
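As an example, once the new VNFS is ready, a single pdsh command from the master node can reboot a node (or a whole range of nodes) in one shot; depending on how your pdsh was built, you might need to select the ssh back end with -R ssh, and make sure nothing is running on the nodes first:

[root@test1 ~]# pdsh -w n0001 reboot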
To see whether NTP is working on the compute node, you can use the same commands you used to check NTP on the master node:
[root@test1 ~]# ssh n0001
Last login: Sat May 26 12:50:01 2012 from 10.1.0.250
-bash-4.1# ntpq
ntpq> lpeers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.1.0.250      .STEP.          16 u    2   64    0    0.000    0.000   0.000
ntpq> quit
So the compute node is using the master node for synchronizing its clock.
Summary
In the first article about Warewulf, I just installed Warewulf on the master node and booted a compute node. This doesn’t necessarily mean that the HPC cluster is ready to run applications. You also need to add or configure a few other things on the master and the compute node’s VNFS so that the HPC cluster is much easier to use. This article builds on the first one by adding those “other things” you need to have for a properly functioning cluster (at least in my opinion).
In this article, I focused on adding the following capabilities to the HPC cluster:
- Setting up NFS on the master node
- Creating a Hybrid VNFS to reduce memory usage and improve boot times
- Adding users to the compute node
- Adding a very common parallel shell tool, pdsh, to the master node
- Installing and configuring NTP (a key component for running MPI jobs)
Adding these tools to the HPC cluster gives you a fully functioning cluster that is ready to run jobs. You will have an efficient VNFS that is very small and uses a minimal set of resources on the compute node, allowing very fast boot times. You will also have user accounts with passwordless logins, and the clocks on the nodes will be synchronized so that parallel jobs will run. Finally, you will have a parallel shell, pdsh, that allows you to administer the cluster better.
Although the explanation of these points in this article might seem long, adding this capability to Warewulf is relatively easy. Most of the time, it consists of installing the tool on the master node, in the chroot environment, or both; configuring the tool; possibly rebuilding the VNFS; and rebooting the compute nodes. Nothing too drastic or difficult. Plus, Warewulf allows really fast reboot times, so it’s very simple to make a change and reboot the compute node. Because Warewulf is stateless, these changes are reflected on the compute node as soon as it reboots. With stateful nodes, you would have to push the change to the OS installed on every node’s disk and then reboot from disk, which greatly slows down the process. Also, you don’t have to worry about ensuring that all of the compute nodes are configured exactly the same way.
In the next article in this series, I will build a development environment on the test system, and I will discuss building compilers, MPI libraries, and environment modules and using them with Warewulf. Stay tuned!