Slurm Job Scheduling System
One way to share HPC systems among several users is to use a software tool called a resource manager. Slurm, probably the most common job scheduler in use today, is open source, scalable, and easy to install and customize.
In previous articles, I examined some fundamental tools for HPC systems, including pdsh (parallel shells), Lmod environment modules, and shared storage with NFS and SSHFS. One remaining, virtually indispensable tool is a job scheduler.
One of the most critical pieces of software on a shared cluster is the resource manager, commonly called a job scheduler, which allows users to share the system in a very efficient and cost-effective way. The idea is fairly simple: Users write small scripts, commonly called “jobs,” that define what they want to run and the required resources, which they then submit to the resource manager. When the resources are available, the resource manager executes the job script on behalf of the user. Typically this approach is for batch jobs (i.e., jobs that are not interactive), but it can also be used for interactive jobs, for which the resource manager gives you a shell prompt to the node that is running your job.
Some resource managers are commercially supported and some are open source, either with or without a support option. The list of candidates is fairly long, but the one I talk about in this article is Slurm.
Slurm
Slurm has been around for a while. I remember using it at Linux Networx in the early 2000s. Over the years, it has been developed by Lawrence Livermore National Laboratory, SchedMD, Linux Networx, Hewlett-Packard, and Groupe Bull. According to the website, Slurm provides three functions:
- “… it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.”
- “… it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.”
- “… it arbitrates contention for resources by managing a queue of pending work.”
These three points are the classic functions of a resource manager (job scheduler), and Slurm does them well.
Slurm is very extensible, with more than 100 optional plugins to cover everything from accounting, to various job reservation approaches, to backfill scheduling, to topology-aware resource selection, to job arrays, to resource limits by user or bank account and other job priority tools. It can even schedule resources and jobs according to the energy consumption of the job itself.
Architecture
The Slurm architecture is very similar to other job schedulers. Each node in the cluster runs a daemon, which in this case is named slurmd. The resources are referred to as nodes. The daemons can communicate in a hierarchical fashion that accommodates fault tolerance. On the Slurm master node, the daemon is slurmctld, which also has failover capability.
The compute resources (nodes) can be divided into partitions that can overlap, allowing partitions to spill over into other partitions according to resource needs. Partitions can be considered job queues with certain boundaries, such as limits on job size and job time, which users may submit to the partition, and so on.
Installing Slurm
The Slurm community builds Ubuntu binaries for download. For other distributions, you will probably have to build the binaries yourself, which is not that difficult, although you will need a few dependencies. A good example of installing Slurm binaries on Ubuntu 16.04 is discussed on GitHub, and it even has very useful example configuration files for building a Slurm master (controller) node and one compute (client) node.
The following tips for building and installing Slurm are generally independent of the distribution used.
1. Synchronize clocks across the cluster.
2. Make sure passwordless SSH is working between the control node and all compute nodes, and make sure to do this as a user and not as root.
3. To make life easier, use shared storage between the controller and the compute nodes.
4. Make sure the UIDs and GIDs are consistent throughout the cluster.
5. The general installation flow on the control node is:
- Install the dependencies.
- Install MUNGE, which is an authentication service for creating and validating credentials. Make sure all nodes in your cluster have the same munge.key and the MUNGE daemon, munged, is running before you start the Slurm daemons.
- Install MariaDB (it is good to have a database) and start the daemon:
systemctl enable mysql
systemctl start mysql
- Build and install Slurm.
- Start the Slurm daemons (e.g., run the following commands as root):
systemctl enable slurmctld
systemctl enable slurmdbd (enable the database)
systemctl enable slurmd (compute node)
- Create the initial Slurm cluster, account, and user (performed by root):
sacctmgr add cluster compute-cluster
sacctmgr add account compute-account description="Compute accounts" \
  Organization=OurOrg
sacctmgr create user myuser account=compute-account adminlevel=None
6. Install Slurm on the compute nodes.
- Install/test MUNGE on the compute node:
systemctl enable munge
systemctl restart munge
- Install Slurm.
7. Set up cgroups (if needed).
8. Optional: Enable Slurm PAM SSH control.
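For the optional PAM step, a common approach is the pam_slurm_adopt module, which denies SSH logins to a compute node unless the user has a job running there. A sketch of the line typically appended to the SSH PAM stack on the compute nodes (file path and module availability depend on your distribution and Slurm build):

```
# /etc/pam.d/sshd (compute nodes) -- deny SSH access unless
# the user has a running job on this node
account    required     pam_slurm_adopt.so
```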
Installation might look difficult, but it's not. Notice you install and enable Slurm on the master node (control node) and the compute nodes in the first part.
If you don't want to build and install Slurm on every compute node, you can build RPMs for distributions that use that format, or you can use the Ubuntu files. Slurm is popular enough that you might be able to find RPMs built for the distribution you use.
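Once the daemons are up, a quick sanity check is worthwhile before moving on to configuration. These commands must be run on a cluster with MUNGE and Slurm installed as above; the expected status message comes from the unmunge tool:

```
$ munge -n | unmunge          # should end with STATUS: Success (0)
$ systemctl status slurmctld  # on the control node
$ systemctl status slurmd     # on a compute node
$ sinfo                       # compute nodes should appear in a partition
```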
Configuring Slurm
Slurm is very flexible, and you can configure it for almost any scenario. The first configuration file is slurm.conf (Listing 1).
Listing 1: slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=compute-cluster
ControlMachine=slurm-ctrl
#
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
PluginDir=/usr/lib/slurm
ReturnToService=1
TaskPlugin=task/cgroup
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK,\
CR_ONE_TASK_PER_CORE
FastSchedule=1
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/cgroup
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageUser=slurm
#
# COMPUTE NODES (PARTITIONS)
GresTypes=gpu
DefMemPerNode=64000
NodeName=linux1 Gres=gpu:8 CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 \
RealMemory=515896 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
This file offers a large number of configuration options, and the man pages can help explain them, so the following is just a quick review.
- Notice the use of ports 6817 (slurmctld) and 6818 (slurmd).
- SchedulerType=sched/backfill tells Slurm to use the backfill scheduler.
- In several places, GPUs are considered in the configuration:
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=linux1 Gres=gpu:8 ...
The term gres, capitalized or not, stands for “generic resource.” Slurm allows you to define resources beyond the defaults of run time, number of CPUs, and so on; a generic resource could be disk space or almost anything else you can dream up.
Two very important lines in the configuration file define the node names with their configuration and a partition for the compute nodes. For this configuration file, these lines are,
NodeName=slurm-node-0[0-1] Gres=gpu:2 CPUs=10 Sockets=1 CoresPerSocket=10 \
ThreadsPerCore=1 RealMemory=30000 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=48:00:00 DefaultTime=04:00:00 \
MaxNodes=2 State=UP DefMemPerCPU=3000
Notice that you can use abbreviations for a range of nodes. The NodeName line tells Slurm how many generic resources each node contains (in this case, two GPUs); then, you tell it the number of CPUs, sockets, cores per socket, threads per core, and the amount of memory available (e.g., 30,000MB, or 30GB, here).
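As a quick sanity check of a NodeName line, the CPUs value should equal Sockets x CoresPerSocket x ThreadsPerCore. A minimal shell sketch, using the values from the example above:

```shell
#!/bin/bash
# Check that a node's CPUs value is consistent with its topology:
# CPUs = Sockets * CoresPerSocket * ThreadsPerCore
sockets=1
cores_per_socket=10
threads_per_core=1
cpus=$((sockets * cores_per_socket * threads_per_core))
echo "Computed CPUs: $cpus"   # should match CPUs=10 in the NodeName line
```

The same arithmetic explains the node in Listing 1: 2 sockets x 20 cores x 2 threads = 80 CPUs.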
A second configuration file, cgroup.conf, controls how Slurm uses control groups (cgroups):

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
#TaskAffinity=yes
This file allows cgroups to restrict the cores, the devices, and the memory space (ConstrainRAMSpace) a job can use, which allows you, the Slurm admin, to control and limit each job's consumption of cores and memory.
In the file gres.conf, you can configure the generic resources, which in this case are GPUs:
Name=gpu File=/dev/nvidia0 CPUs=0-4
Name=gpu File=/dev/nvidia1 CPUs=5-9
The first line says that the first GPU is associated with cores 0-4 (the first five cores, or half the cores in the node). The second line defines the second GPU for cores 5-9, or the second half of the cores. When submitting a job to Slurm that uses these resources, you can specify them with a simple option, for example,
$ srun --gres=gpu:1
which submits the job requesting a single GPU.
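A complete invocation pairs the option with an application to run; here, my_gpu_app is a hypothetical binary used for illustration:

```
$ srun --gres=gpu:1 ./my_gpu_app
```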
Common Slurm Commands
Slurm comes with a range of commands for administering, using, and monitoring a Slurm configuration. A number of tutorials detail their use, but to be complete, I will look at a few of the most common commands.
sinfo
The all-purpose command sinfo lets users discover how Slurm is configured:
$ sinfo -s
PARTITION  AVAIL  TIMELIMIT  NODES(A/I/O/T)  NODELIST
p100       up     infinite   4/9/3/16        node[212-213,215-218,220-229]
This example lists the status, time limit, node information, and node list of the p100 partition. In the NODES(A/I/O/T) column, A/I/O/T stands for allocated/idle/other/total.
sbatch
To submit a batch serial job to Slurm, use the sbatch command:
$ sbatch runscript.sh
For batch jobs, sbatch is one of the most important commands, made powerful by its large number of options.
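A minimal job script might look like the following sketch; the job name, partition, and resource requests are assumptions you would adjust for your own cluster:

```shell
#!/bin/bash
#SBATCH --job-name=test-job     # job name shown in the queue
#SBATCH --partition=compute     # partition (queue) to submit to
#SBATCH --nodes=1               # number of nodes
#SBATCH --ntasks=1              # number of tasks (processes)
#SBATCH --time=00:10:00         # wall-clock time limit
#SBATCH --mem=1000              # memory per node in MB

# Commands below run on the allocated compute node; the #SBATCH
# lines above are comments to the shell but directives to Slurm.
echo "Job running on $(hostname)"
```

You would submit this script with sbatch, and Slurm would run it when the requested resources become available.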
srun
To run parallel jobs or interactive sessions, use srun:

$ srun --pty -p test -t 10 --mem 1000 [script or app]

The same command with /bin/bash as the application,

$ srun --pty -p test -t 10 --mem 1000 /bin/bash

gives you an interactive shell on a compute node.
scancel
The scancel command allows you to cancel a specific job; for example,
$ scancel 999999
cancels job 999999. You can find the ID of your job with the squeue command.
squeue
To print a list of jobs in the job queue or for a particular user, use squeue. For example,
$ squeue -u akitzmiller
lists the jobs for a particular user.
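Typical output uses the default squeue columns; the job data shown here is purely illustrative:

```
$ squeue -u akitzmiller
 JOBID PARTITION   NAME        USER ST  TIME  NODES NODELIST(REASON)
999999   compute  myjob akitzmiller  R  5:31      1 slurm-node-00
```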
sacct
The sacct command displays the accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database, and you can run the command against a specific job number:
$ sacct -j 999999
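The --format option selects which columns sacct reports; JobID, JobName, Elapsed, and State are standard accounting fields:

```
$ sacct -j 999999 --format=JobID,JobName,Elapsed,State
```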
Summary
A resource manager is one of the most critical pieces of software in HPC. It allows systems and their resources to be shared efficiently, and it is remarkably flexible, allowing the creation of multiple queues according to resource types or generic resources (e.g., GPUs in this article). Slurm also has job accounting by default.
The Slurm resource manager is one of the most common job schedulers in use today for very good reasons, some of which I covered here. Prepare to be “Slurmed.”