Resource Management with Slurm
Slurm Job Scheduling System
In previous articles, I examined some fundamental tools for HPC systems, including pdsh [1] (parallel shells), Lmod environment modules [2], and shared storage with NFS and SSHFS [3]. One remaining, virtually indispensable tool is a job scheduler.
One of the most critical pieces of software on a shared cluster is the resource manager, commonly called a job scheduler, which lets users share the system efficiently and cost-effectively. The idea is fairly simple: Users write small scripts, commonly called "jobs," that define what they want to run and the resources required, and then submit them to the resource manager. When the resources become available, the resource manager executes the job script on behalf of the user. Typically this approach is used for batch jobs (i.e., jobs that are not interactive), but it can also be used for interactive jobs, in which case the resource manager gives you a shell prompt on the node that is running your job.
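To make the idea of a job script concrete, here is a minimal sketch in Python that assembles the text of a Slurm batch script. The helper name render_job_script, the job name, the resource values, and the command are all hypothetical and chosen only for illustration; the #SBATCH lines use standard sbatch options (--job-name, --nodes, --ntasks, --time, --output), but a real site will likely need others, such as a partition or account.

```python
def render_job_script(job_name, nodes, ntasks, time_limit, command):
    """Return the text of a Slurm batch script as a string.

    Hypothetical helper for illustration; it only builds text and
    does not talk to Slurm.
    """
    lines = [
        "#!/bin/bash",
        # Resource requests are written as #SBATCH comment lines,
        # which sbatch reads before the script body runs.
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        # %x expands to the job name and %j to the job ID.
        "#SBATCH --output=%x-%j.out",
        "",
        # The work itself: what the user wants to run.
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values are assumptions, not recommendations.
    print(render_job_script("hello", 2, 4, "00:10:00", "srun hostname"), end="")
```

Saving the printed text to a file (say, hello.sh, a made-up name) and running sbatch hello.sh would hand the job to the resource manager, which runs it once two nodes are free. Generating the script from a function is just one way to keep the resource requests and the command together; writing the file by hand works equally well.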
Some resource managers are commercially supported and some are open source, either with or without a support option. The list of candidates is fairly long, but the one I talk about in this article is Slurm [4].
Slurm
Slurm has been around for a while. I remember using it at Linux Networx in the early 2000s. Over the years, it has been developed by Lawrence Livermore National Laboratory, SchedMD [5], Linux Networx, Hewlett-Packard, and Groupe Bull [6]. According to the website, Slurm provides three functions [7]:
- "… it