Slurm Job Scheduling System
One way to share HPC systems among several users is to use a software tool called a resource manager. Slurm, probably the most common job scheduler in use today, is open source, scalable, and easy to install and customize.
In previous articles, I examined some fundamental tools for HPC systems, including pdsh (parallel shells), Lmod environment modules, and shared storage with NFS and SSHFS. One remaining, virtually indispensable tool is a job scheduler.
One of the most critical pieces of software on a shared cluster is the resource manager, commonly called a job scheduler, which allows users to share the system in a very efficient and cost-effective way. The idea is fairly simple: Users write small scripts, commonly called “jobs,” that define what they want to run and the required resources, which they then submit to the resource manager. When the resources are available, the resource manager executes the job script on behalf of the user. Typically this approach is for batch jobs (i.e., jobs that are not interactive), but it can also be used for interactive jobs, for which the resource manager gives you a shell prompt to the node that is running your job.
Some resource managers are commercially supported and some are open source, either with or without a support option. The list of candidates is fairly long, but the one I talk about in this article is Slurm.
Slurm
Slurm has been around for a while. I remember using it at Linux Networx in the early 2000s. Over the years, it has been developed by Lawrence Livermore National Laboratory, SchedMD, Linux Networx, Hewlett-Packard, and Groupe Bull. According to the website, Slurm provides three functions:
- “… it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.”
- “… it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.”
- “… it arbitrates contention for resources by managing a queue of pending work.”
These three points are the classic functions of a resource manager (job scheduler), and Slurm does them well.
Slurm is very extensible, with more than 100 optional plugins to cover everything from accounting, to various job reservation approaches, to backfill scheduling, to topology-aware resource selection, to job arrays, to resource limits by user or bank account and other job priority tools. It can even schedule resources and jobs according to the energy consumption of the job itself.
Architecture
The Slurm architecture is very similar to other job schedulers. Each node in the cluster runs a daemon, which in this case is named slurmd. The resources are referred to as nodes. The daemons can communicate in a hierarchical fashion that accommodates fault tolerance. On the Slurm master node, the daemon is slurmctld, which also has failover capability.
The compute resources (nodes) can be divided into partitions that can overlap, allowing partitions to spill over into other partitions according to resource needs. Partitions can be considered job queues with certain boundaries, such as limits on job size and job time, which users may submit to the partition, and so on.
Installing Slurm
The Slurm community builds Ubuntu binaries for download. For other distributions, you will probably have to build the binaries yourself, which is not that difficult, although you will need a few dependencies. A good example of installing Slurm binaries on Ubuntu 16.04 is discussed on GitHub, and it even has very useful example configuration files for building a Slurm master (controller) node and one compute (client) node.
The following tips for building and installing Slurm are generally independent of the distribution used.
1. Synchronize clocks across the cluster.
2. Make sure passwordless SSH is working between the control node and all compute nodes, and make sure to do this as a user and not as root.
3. To make life easier, use shared storage between the controller and the compute nodes.
4. Make sure the UIDs and GIDs are consistent throughout the cluster.
5. The general installation flow on the control node is:
- Install the dependencies.
- Install MUNGE, which is an authentication service for creating and validating credentials. Make sure all nodes in your cluster have the same munge.key and the MUNGE daemon, munged, is running before you start the Slurm daemons.
- Install MariaDB (it is good to have a database) and start the daemon:
systemctl enable mysql
systemctl start mysql
- Build and install Slurm.
- Start the Slurm daemons (e.g., run the following commands as root):
systemctl enable slurmctld
systemctl enable slurmdbd (enable the database)
systemctl enable slurmd (compute node)
- Create the initial Slurm cluster, account, and user (performed by root):
sacctmgr add cluster compute-cluster
sacctmgr add account compute-account description="Compute accounts" \
  Organization=OurOrg
sacctmgr create user myuser account=compute-account adminlevel=None
6. Install Slurm on the compute nodes.
- Install/test MUNGE on the compute node:
systemctl enable munge
systemctl restart munge
- Install Slurm.
7. Set up cgroups (if needed).
8. Optional: Enable Slurm PAM SSH control.
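For the optional PAM step, a common approach is the pam_slurm_adopt module, which denies SSH logins to a compute node unless the user has a job running there. A sketch of the line typically appended to the SSH PAM stack on the compute nodes (file path and module availability depend on your distribution and Slurm build):

```
# /etc/pam.d/sshd (compute nodes) -- deny SSH access unless
# the user has a running job on this node
account    required     pam_slurm_adopt.so
```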
Installation might look difficult, but it's not. Notice you install and enable Slurm on the master node (control node) and the compute nodes in the first part.
If you don't want to build and install Slurm on every compute node, you can build RPMs for distributions that use that format, or you can use the Ubuntu files. Slurm is popular enough that you might be able to find RPMs built for the distribution you use.
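Once the daemons are up, a quick sanity check is worthwhile before moving on to configuration. These commands must be run on a cluster with MUNGE and Slurm installed as above; the expected status message comes from the unmunge tool:

```
$ munge -n | unmunge          # should end with STATUS: Success (0)
$ systemctl status slurmctld  # on the control node
$ systemctl status slurmd     # on a compute node
$ sinfo                       # compute nodes should appear in a partition
```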
Configuring Slurm
Slurm is very flexible, and you can configure it for almost any scenario. The first configuration file is slurm.conf (Listing 1).
Listing 1: slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=compute-cluster
ControlMachine=slurm-ctrl
#
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
PluginDir=/usr/lib/slurm
ReturnToService=1
TaskPlugin=task/cgroup
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK,\
CR_ONE_TASK_PER_CORE
FastSchedule=1
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/cgroup
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageUser=slurm
#
# COMPUTE NODES (PARTITIONS)
GresTypes=gpu
DefMemPerNode=64000
NodeName=linux1 Gres=gpu:8 CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 \
RealMemory=515896 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
This file offers a large number of configuration options, and the man pages can help explain them, so the following is just a quick review.
- Notice the use of ports 6817 (slurmctld) and 6818 (slurmd).
- SchedulerType=sched/backfill tells Slurm to use the backfill scheduler.
- In several places, GPUs are considered in the configuration:
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=linux1 Gres=gpu:8 ...
The term gres, capitalized or not, stands for “generic resource.” Slurm allows you to define resources beyond the defaults of run time, number of CPUs, and so on; a generic resource could be disk space or almost anything else you can dream up.
Two very important lines in the configuration file define the node names with their configuration and a partition for the compute nodes. For this configuration file, these lines are,
NodeName=slurm-node-0[0-1] Gres=gpu:2 CPUs=10 Sockets=1 CoresPerSocket=10 \
ThreadsPerCore=1 RealMemory=30000 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=48:00:00 DefaultTime=04:00:00 \
MaxNodes=2 State=UP DefMemPerCPU=3000
Notice that you can use abbreviations for a range of nodes. The NodeName line tells Slurm how many generic resources each node contains (in this case, two GPUs); then, you tell it the number of CPUs, sockets, cores per socket, threads per core, and the amount of memory available (e.g., 30,000MB, or 30GB, here).
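As a quick sanity check of a NodeName line, the CPUs value should equal Sockets x CoresPerSocket x ThreadsPerCore. A minimal shell sketch, using the values from the example above:

```shell
#!/bin/bash
# Check that a node's CPUs value is consistent with its topology:
# CPUs = Sockets * CoresPerSocket * ThreadsPerCore
sockets=1
cores_per_socket=10
threads_per_core=1
cpus=$((sockets * cores_per_socket * threads_per_core))
echo "Computed CPUs: $cpus"   # should match CPUs=10 in the NodeName line
```

The same arithmetic explains the node in Listing 1: 2 sockets x 20 cores x 2 threads = 80 CPUs.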
A second configuration file, cgroup.conf, controls how Slurm uses control groups (cgroups):

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
#TaskAffinity=yes
This file allows cgroups to restrict the cores, the devices, and the memory space (ConstrainRAMSpace) a job can use, which allows you, the Slurm admin, to control and limit each job's consumption of cores and memory.
In the file gres.conf, you can configure the generic resources, which in this case are GPUs:
Name=gpu File=/dev/nvidia0 CPUs=0-4
Name=gpu File=/dev/nvidia1 CPUs=5-9
The first line says that the first GPU is associated with cores 0-4 (the first five cores, or half the cores in the node). The second line defines the second GPU for cores 5-9, or the second half of the cores. When submitting a job to Slurm that uses these resources, you can specify them with a simple option, for example,
$ srun --gres=gpu:1
which submits the job requesting a single GPU.
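A complete invocation pairs the option with an application to run; here, my_gpu_app is a hypothetical binary used for illustration:

```
$ srun --gres=gpu:1 ./my_gpu_app
```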
Common Slurm Commands
Slurm comes with a range of commands for administering, using, and monitoring a Slurm configuration. A number of tutorials detail their use, but to be complete, I will look at a few of the most common commands.
sinfo
The all-purpose command sinfo lets users discover how Slurm is configured:
$ sinfo -s
PARTITION  AVAIL  TIMELIMIT  NODES(A/I/O/T)  NODELIST
p100       up     infinite   4/9/3/16        node[212-213,215-218,220-229]
This example lists the status, time limit, node information, and node list of the p100 partition. In the NODES(A/I/O/T) column, A/I/O/T stands for allocated/idle/other/total.
sbatch
To submit a batch serial job to Slurm, use the sbatch command:
$ sbatch runscript.sh
For batch jobs, sbatch is one of the most important commands, made powerful by its large number of options.
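A minimal job script might look like the following sketch; the job name, partition, and resource requests are assumptions you would adjust for your own cluster:

```shell
#!/bin/bash
#SBATCH --job-name=test-job     # job name shown in the queue
#SBATCH --partition=compute     # partition (queue) to submit to
#SBATCH --nodes=1               # number of nodes
#SBATCH --ntasks=1              # number of tasks (processes)
#SBATCH --time=00:10:00         # wall-clock time limit
#SBATCH --mem=1000              # memory per node in MB

# Commands below run on the allocated compute node; the #SBATCH
# lines above are comments to the shell but directives to Slurm.
echo "Job running on $(hostname)"
```

You would submit this script with sbatch, and Slurm would run it when the requested resources become available.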
srun
To run parallel jobs or interactive sessions, use srun:

$ srun --pty -p test -t 10 --mem 1000 [script or app]

The same command with /bin/bash as the application,

$ srun --pty -p test -t 10 --mem 1000 /bin/bash

gives you an interactive shell on a compute node.
scancel
The scancel command allows you to cancel a specific job; for example,
$ scancel 999999
cancels job 999999. You can find the ID of your job with the squeue command.
squeue
To print a list of jobs in the job queue or for a particular user, use squeue. For example,
$ squeue -u akitzmiller
lists the jobs for a particular user.
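Typical output uses the default squeue columns; the job data shown here is purely illustrative:

```
$ squeue -u akitzmiller
 JOBID PARTITION   NAME        USER ST  TIME  NODES NODELIST(REASON)
999999   compute  myjob akitzmiller  R  5:31      1 slurm-node-00
```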
sacct
The sacct command displays the accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database, and you can run the command against a specific job number:
$ sacct -j 999999
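The --format option selects which columns sacct reports; JobID, JobName, Elapsed, and State are standard accounting fields:

```
$ sacct -j 999999 --format=JobID,JobName,Elapsed,State
```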
Summary
A resource manager is one of the most critical pieces of software in HPC. It allows systems and their resources to be shared efficiently, and it is remarkably flexible, allowing the creation of multiple queues according to resource types or generic resources (e.g., GPUs in this article). Slurm also has job accounting by default.
The Slurm resource manager is one of the most common job schedulers in use today for very good reasons, some of which I covered here. Prepare to be “Slurmed.”