Slurm Job Scheduling System
Common Slurm Commands
Slurm comes with a range of commands for administering, using, and monitoring a Slurm configuration. A number of tutorials detail their use, but to be complete, I will look at a few of the most common commands.
sinfo
The all-purpose command sinfo lets users discover how Slurm is configured:
$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
p100         up   infinite         4/9/3/16  node[212-213,215-218,220-229]
This example lists the availability, time limit, node counts (allocated/idle/other/total), and node list of the p100 partition.
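If you want per-node detail rather than a partition summary, sinfo can also produce a node-oriented, long-format listing (output not shown here, because it varies by site):
$ sinfo -N -l
This prints one line per node with its state, CPU count, memory, and other attributes.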
sbatch
To submit a batch serial job to Slurm, use the sbatch command:
$ sbatch runscript.sh
For batch jobs, sbatch is one of the most important commands, made powerful by its large number of options.
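As a sketch of what runscript.sh might contain (the job name, partition, and executable are placeholders, not site defaults), note that each #SBATCH line is just an ordinary sbatch command-line option embedded in the script:

#!/bin/bash
#SBATCH -J mytest          # job name
#SBATCH -p test            # partition to submit to
#SBATCH -n 1               # number of tasks
#SBATCH -t 10              # wall time limit (minutes)
#SBATCH --mem=1000         # memory per node (MB)
#SBATCH -o mytest_%j.out   # standard output file (%j expands to the job ID)
#SBATCH -e mytest_%j.err   # standard error file

./myapp

Anything you can pass to sbatch on the command line can live in the script instead, which keeps job submissions reproducible.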
srun
To run parallel jobs, use srun:
$ srun -p test -t 10 --mem 1000 [script or app]
This example requests 10 minutes of wall time and 1,000MB of memory in the test partition. The same command with the --pty option and /bin/bash as the program to run,
$ srun --pty -p test -t 10 --mem 1000 /bin/bash
launches an interactive shell session on a compute node instead.
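srun also serves as the task launcher for parallel work; as a simple illustration, this command launches four copies of hostname across the allocation:
$ srun -n 4 -p test -t 10 hostname
Each of the four tasks prints the name of the node it landed on, which is a quick way to confirm where Slurm placed your job.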
scancel
The scancel command allows you to cancel a specific job; for example,
$ scancel 999999
cancels job 999999. You can find the ID of your job with the squeue command.
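scancel can also match jobs by attribute rather than by ID; for instance, to cancel all of your own jobs at once (substituting your own username for the example user):
$ scancel -u akitzmiller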
squeue
To print a list of jobs in the job queue or for a particular user, use squeue. For example,
$ squeue -u akitzmiller
lists the jobs for a particular user.
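squeue filters can be combined to narrow the listing further; for example, the -t option restricts output to jobs in a given state, such as RUNNING or PENDING:
$ squeue -u akitzmiller -t RUNNING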
sacct
The sacct command displays the accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database, and you can run the command against a specific job number:
$ sacct -j 999999
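By default, sacct prints a fixed set of columns, but you can choose your own with --format; the fields below are standard Slurm accounting fields:
$ sacct -j 999999 --format=JobID,JobName,Partition,Elapsed,State,MaxRSS
MaxRSS, for example, reports the peak memory use of each job step, which is handy for right-sizing the --mem request of future jobs.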
Summary
A resource manager is one of the most critical pieces of software in HPC. It allows systems and their resources to be shared efficiently, and it is remarkably flexible, allowing the creation of multiple queues according to resource types or generic resources (e.g., GPUs in this article). Slurm also has job accounting by default.
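As an illustration of the generic resource mechanism, a job on a GPU-equipped partition such as p100 might request a device with the --gres option (the gpu resource name must match whatever the site configured in gres.conf):
$ srun -p p100 --gres=gpu:1 -t 10 --pty /bin/bash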
The Slurm resource manager is one of the most common job schedulers in use today for very good reasons, some of which I covered here. Prepare to be “Slurmed.”