Submitting jobs with Slurm¶
Resource sharing on a high-performance cluster dedicated to scientific computing is organized by a piece of software called a resource manager or job scheduler. Indeed, users do not run software on such a cluster as they would on a workstation; instead, they must submit jobs, which are run unattended by the job scheduler at the time, and on the resources, decided by its algorithm.
Slurm is a resource manager and job scheduler designed to do just that, and much more. It was originally created by people at the Livermore Computing Center, and has grown into a full-fledged open-source software package backed by a large community, commercially supported by the original developers, and installed on many of the Top500 supercomputers.
Anatomy of a cluster¶
A high-performance compute cluster (HPC cluster) is made of a set of powerful computers, called nodes in this context, linked together through a fast network. Multiple form factors exist, but a common one is the “pizza box” shape, depicted in the image below showing multiple such servers stacked in a data center rack.
A cluster typically hosts multiple types of nodes:
- management nodes: these computers run the management services like Slurm, but also the user identities database, the monitoring and reporting applications, etc. ;
- login nodes: these are the computers users login into through SSH to install software and submit jobs ;
- storage nodes: these computers host user files in possibly multiple filesystems ; and
- compute nodes: these are the computers where jobs run and where the computations actually take place. They are often organised into partitions.
Alongside these most common types, a cluster can also comprise
- visualization nodes: computers dedicated to data visualization with specific hardware ;
- I/O nodes: computers dedicated to data ingress and egress to and from the cluster.
The typical usage therefore involves connecting to a login node, making sure the data are stored on the right filesystems and the software is available, and then submitting jobs. The jobs are dispatched to the compute nodes by the resource manager based on the match between the resources requested by the job and those offered by the compute nodes.
Warning
On a cluster, you do not run computations on the login node. Computations belong on the compute nodes, when and where decided by the scheduler.
Inside a compute node, the computation is carried out by the processors. A processor looks like this:
A typical compute node has one, two, or four sockets on the motherboard, each of which can host a processor. Processors are made of multiple cores, which can be thought of as independent computing units, each one in turn possibly “divided” into two hardware threads.
Hardware threads are a technology designed to hide the latencies of the memory and feed the compute units fast enough to keep them busy all the time. The technology can be activated or deactivated depending on the cluster. Depending on the workload, it can benefit or slightly harm computation times.
Compute nodes also provide volatile working memory, or RAM, where data are stored while they are being processed or produced, before they are eventually saved to permanent storage (disks).
Compute nodes can also offer accelerators, such as GPUs, which are pieces of hardware onto which heavy computations can be offloaded. GPUs are often much more powerful than regular processors, but they do not have the same flexibility and require specific programming skills.
Gathering cluster information¶
Slurm offers the sinfo command to get an overview of the resources offered by the cluster.
By default, sinfo lists the partitions that are available. A partition is a set of compute nodes (computers dedicated to computing) grouped logically based on either physical properties of the hardware or job scheduling policies. Typical examples include partitions dedicated to debugging, where only small and short jobs can be scheduled, or partitions dedicated to visualization with nodes equipped with specific graphic cards.
# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch        up   infinite      2  alloc node[08-09]
batch        up   infinite      7   idle node[10-16]
debug*       up      30:00      7   idle node[01-07]
In the above example, we see two partitions, named batch and debug. The latter is the default partition, as it is marked with an asterisk. All nodes of the debug partition are idle, while two nodes of the batch partition are being used. The nodes in this example are named node01 to node16.
The sinfo command also lists the time limit (column TIMELIMIT) to which jobs are subject. On every cluster, jobs are limited to a maximum run time, to allow job rotation and give every user a chance to see their jobs started. Generally, the larger the cluster, the shorter the maximum allowed time.
If the output of the sinfo command is organised differently from the above, it probably means that default options are set through environment variables, that wrappers are used at your site to provide more suitable defaults, or simply that you have a different Slurm version.
The sinfo command can also output the information in a node-oriented fashion, with the -N argument.
# sinfo -N -l
NODELIST       NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node[01-02]        2    debug*        idle   32    2:8:2   3448    38536     16    Intel (null)
node[03,05-07]     4    debug*        idle   32    2:8:2   3384    38536     16    Intel (null)
node04             1    debug*        down   32    2:8:2   3394    38536     16    Intel "Disk replacement"
node[08-09]        2     batch   allocated   32    2:8:2    246    82306     16      AMD (null)
node[10-16]        7     batch        idle   32    2:8:2    246    82306     16      AMD (null)
With the -l argument, more information about the nodes is provided, among which the number of “CPUs” (CPUS), which is the number of processing units that the jobs can use. It should generally correspond to the number of sockets (S) times the number of cores per socket (C) times the number of hardware threads per core (T in the S:C:T column), but can be lower in the case some CPUs are reserved for system use.
The other columns report the volatile working memory (RAM – MEMORY), the size of the local temporary disk (also called local scratch space – TMP_DISK), and the node “weight” (an internal parameter specifying preferences in nodes for allocations when there are multiple possibilities). The last-but-one column (AVAIL_FE) shows the so-called features of the nodes, which are set by the administrator and can refer to a processor vendor or family, a specific network equipment, or any desirable feature of the node that can be used to choose one node type over another.
The last column (REASON), if not null, describes the reason why a node would not be available.
Note
You can actually specify precisely what information you would like sinfo to output by using its --format argument. For more details, have a look at the command manpage with man sinfo.
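For instance, the following sketch uses standard sinfo format specifiers (partition, availability, time limit, node count, state, and node list) to print a compact summary; adapt the list to what you want to see:
$ sinfo --format="%P %a %l %D %t %N"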
Anatomy of a job¶
Jobs are made of one or multiple sequential steps, each consisting of one or multiple parallel tasks that will be dispatched to possibly distinct nodes.
Each task will be allocated CPUs, memory, and possibly other generic resources in an exclusive manner by Slurm. Two jobs cannot share the same resources unless sharing is explicitly forced by the admins, which is generally not the case. Therefore, a job can only start when all the resources it needs are free and not needed by another, higher-priority job. Jobs are indeed assigned a priority when they are submitted, which can depend upon multiple factors. More on that later.
For the scheduling process to work properly, you will need to describe your job before you submit it:
- what are the steps (i.e. which program must be run and how) ;
- how many tasks there will be ;
- what resource each task needs (CPU, memory, etc.), and
- for how long the job is supposed to run.
All of these, along with potentially additional submission options, can be described in a submission script.
A submission script is a shell script, e.g. a Bash script, whose comments, if they are prefixed with #SBATCH, are understood by Slurm as parameters describing resource requests and other submission options.
You can get the complete list of parameters from the sbatch manpage: man sbatch.
Important
The #SBATCH directives must appear at the top of the submission file, before any other line except for the very first line, which should be the shebang (e.g. #!/bin/bash).
The script itself is a job step. Other job steps are created with the srun command, which takes as arguments the name and options of the program to run.
Note
If there is only one step, the srun command can be omitted, but doing so has consequences for how signals are handled and how accurately resource usage is reported. It is often best to use it in all cases.
For instance, the following script, hypothetically named submit.sh,
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --partition=debug
#
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100
srun hostname
srun sleep 60
describes a job made of 2+1 steps (the submission script itself plus two explicit steps created by calling srun twice), each step consisting of only one task that needs one CPU and 100MB of RAM. The first step will run the hostname command, and the second one the (useless) sleep command. The job is supposed to run for 10 minutes on the debug partition, be named test, and create an output file named res.txt.
Important
It is important to note that, as the job will run unattended, it will not be attached to a terminal (screen), so everything the program would normally write to the terminal is redirected by Slurm to a file, whose name is specified by the --output parameter. (Note that, for the same reason, the program cannot expect input from the user through the keyboard to run properly.)
Once the submission script is written properly, you need to submit it to Slurm through the sbatch command, which, upon success, responds with the jobid attributed to the job. (The dollar sign below is the shell prompt.)
$ sbatch submit.sh
sbatch: Submitted batch job 12321
Warning
Make sure to submit the job with sbatch and not bash; also, do not execute it directly. Doing so would ignore all resource requests, and the job would run with minimal resources, or could possibly run on the frontend rather than on a compute node.
The job then enters the queue in the PENDING state. You can verify this with
$ squeue --me
Once resources become available and the job has the highest priority, an allocation is created for it and it goes to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state; otherwise, it is set to the FAILED state.
Why is my job not starting?¶
A job can stay in the PENDING state for a long time, especially if the maximum allowed time on the cluster is large. It is nevertheless always important to check the reason why the job is pending. If the reason is Priority, there is not much you can do, but there are other possible reasons, some of which require action on your part.
The reason is listed in the last column of the default output of squeue, and the description of the reason codes is available in the documentation.
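As a quick check, a sketch using a standard squeue option: the following lists your pending jobs along with the reason and, when the scheduler has computed one, the expected start time.
$ squeue --me --start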
Gathering job information¶
The squeue command shows the list of jobs which are currently running (in the RUNNING state, noted as ‘R’) or waiting for resources (noted as ‘PD’, short for PENDING).
# squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345     debug     job1     dave  R       0:21      4 node[09-12]
12346     debug     job2     dave PD       0:00      8 (Resources)
12348     debug     job3       ed PD       0:00      4 (Priority)
The above output shows that one job is running, whose name is job1 and whose jobid is 12345. The jobid is a unique identifier that is used by many Slurm commands when actions must be taken about one particular job. For instance, to cancel job job1, you would use scancel 12345. TIME is the time the job has been running until now. NODES is the number of nodes which are allocated to the job, while the NODELIST column lists the nodes which have been allocated for running jobs. For pending jobs, that column gives the reason why the job is pending. In the example, job 12346 is pending because the requested resources (CPUs, or other) are not available in sufficient amounts, while job 12348 is waiting for job 12346, whose priority is higher, to run. Each job is indeed assigned a priority depending on several parameters whose details are explained in section Slurm priorities. Note that the priority of pending jobs can be obtained with the sprio command. Other reasons can exist; they are listed in the Slurm documentation.
There are many switches you can use to filter the output: by user with --user, by partition with --partition, by state with --state, etc. As with the sinfo command, you can choose what you want squeue to output with the --format parameter.
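For instance, to list only your pending jobs, or to print a custom layout (the format specifiers below are standard squeue ones for jobid, partition, name, state, elapsed time, node count and reason/nodelist):
$ squeue --user=$USER --state=PENDING
$ squeue --me --format="%.10i %.9P %.10j %.2t %.10M %.6D %R"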
Interestingly, you can get near-realtime information about your job when it is running (memory consumption, etc.) with the sstat command, by running sstat -j jobid.
You can select what you want sstat to output with the --format parameter. Refer to the manpage (man sstat) for more information.
Upon completion, the output file contains the result of the commands run in the script file. In the above example, you can see it with the cat res.txt command.
The example shown above illustrates a serial job which uses a single CPU on a single node. It does not take advantage of multi-processor nodes or of the multiple compute nodes available in a cluster, by contrast with the example jobs in the squeue output, which use multiple nodes.
The next sections explain how to create parallel jobs.
Going parallel¶
There are several ways a parallel job, one that leverages multiple compute units (CPUs) at the same time, can be created.
Warning
The type of job that can be created depends on the capabilities of the software being used. Parallel computing requires specific programming techniques; if they are not used, the same computation is simply performed multiple times, with no actual benefit.
If the resource request fits the capabilities of the program, Slurm will be able to allocate every parallel sequence of computing instructions, called a thread in programming, to a different CPU in a precise one-to-one mapping. This process is called “binding”, or “pinning”. If the program creates more threads than there are CPUs available, the operating system will iteratively start and stop threads to allocate CPU time to every thread, in a procedure called “context switching”. That procedure consumes resources in itself and incurs enough overhead to be considered undesirable for high-performance computing.
Single-node parallelism¶
In the Linux operating system, running a program initiates a process, that is, a running copy of the program in the main memory. A process initially consists of one single thread, but it can clone itself into multiple threads that all share the same memory space (shared-memory programming) and can perform computations in parallel; it can also spawn (“fork” in Linux terms) other processes, whose memory spaces are independent and with which it can communicate internally.
Multithreaded programs can be written using the pthreads system calls, but most multithreaded scientific software is written with OpenMP. OpenMP allows creating Single Program, Multiple Data (SPMD) programs: the same program is run multiple times, but each instance is identified by a thread ID, or rank, that can be used to differentiate the behavior of each instance, for instance working on a separate subset of the data. Optimised linear algebra or signal processing libraries use OpenMP a lot for parallelism.
Note
Even if the software you write is purely sequential and does not use OpenMP, if it links to optimized scientific libraries (as is the case with interpreted languages like Python, R or Julia), those libraries may use OpenMP behind the scenes, so you should consider requesting at least 4 or 8 CPUs.
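To put single-node parallelism into practice, here is a minimal sketch of a shared-memory submission script; the program name my_threaded_program is hypothetical, and OMP_NUM_THREADS is honored by OpenMP programs and by many optimized libraries:
#!/bin/bash
#
#SBATCH --job-name=test_omp
#SBATCH --output=res_omp.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
# Tell OpenMP (and OpenMP-based libraries) how many threads to use
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_threaded_program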
Multi-node parallelism¶
Multi-node parallelism assumes distributed memory programming, with either
- no communication between the processes on the different nodes ; or
- message passing through the network ; or
- communication through a common filesystem.
The first case, called embarrassingly parallel, simply relies on multiple instances of the same program being started, each with a different environment and possibly different arguments.
The second case is typical of high-performance computing, and requires specific high-bandwidth, low-latency network hardware. Most message passing scientific software uses MPI, a library that takes care of instantiating multiple instances of the same program on different nodes, and allows them to send and receive messages through the network. This is another example of SPMD programming.
The last case is mostly encountered in Master/Worker setups (a specific case of Multiple Program, Multiple Data) where one program (the master) is designed to prepare and distribute work, while another (the worker) is designed to perform the work. Only one master is run, but there can be as many worker instances as needed.
Important
The above taxonomy must be taken with a grain of salt. Some Master/Worker programs use the network rather than the disk for communication, or use interprocess communication; in that latter case, they can only do single-node parallelism. As for message passing, it can also be done entirely within a single node. And shared-memory programming can be done across multiple nodes if specific libraries are used.
Slurm tasks¶
In the Slurm context, a task is to be understood as a running instance of a program, which matches the definition of a “process” seen earlier. Tasks for each step are spawned by using the srun command as explained earlier.
Note
MPI programs can also be started with the mpirun command instead. The mpirun command is provided with the MPI library, while srun is provided by Slurm. There are pros and cons to using one versus the other, so the choice basically depends on the specific parameters that are needed for starting the program.
The number of tasks (processes) that srun must spawn is decided by the value of the --ntasks option. Each process will be allocated the number of CPUs dictated by the value of the --cpus-per-task option, and the amount of memory obtained by multiplying the value of the --mem-per-cpu option by the number of CPUs allocated per task.
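As an illustration with hypothetical values, the following request yields 4 x 2 = 8 CPUs in total, and 2 x 1000 MB = 2000 MB of RAM per task:
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1000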
Tasks cannot be split across several compute nodes, so all CPUs requested with the --cpus-per-task option will be located on the same compute node. By contrast, CPUs allocated to distinct tasks can end up on distinct nodes.
Each task has an environment variable called SLURM_PROCID, set by Slurm to a distinct value from 0 to the number of tasks minus one.
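As a minimal sketch, a step like the following prints one line per task, each reporting its own SLURM_PROCID and the node it runs on (with --ntasks=4 as above, you would get four lines numbered 0 to 3):
srun bash -c 'echo "Task $SLURM_PROCID runs on $(hostname)"'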
More submission script examples¶
Here are some quick sample submission scripts. For more detailed information, make sure to have a look at the Slurm FAQ and to follow our training sessions or to watch the training video. There is also an interactive Script Generation Wizard you can use to help you create submission scripts.
Message passing example (MPI)¶
#!/bin/bash
#
#SBATCH --job-name=test_mpi
#SBATCH --output=res_mpi.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
module load OpenMPI
srun hello.mpi
Request four cores on the cluster for 10 minutes, using 100 MB of RAM per core. Assuming hello.mpi was compiled with MPI support, srun will create four instances of it, on the nodes allocated by Slurm.
You can try the above example by downloading the example hello world program from Wikipedia (name it for instance wiki_mpi_example.c), and compiling it with
module load OpenMPI
mpicc wiki_mpi_example.c -o hello.mpi
The res_mpi.txt file should contain something like
We have 4 processors
Hello 1! Processor 1 reporting for duty
Hello 2! Processor 2 reporting for duty
Hello 3! Processor 3 reporting for duty
Embarrassingly parallel workload example (job array)¶
This setup is useful for problems based on random draws (e.g. Monte Carlo simulations). In such cases, if you have four programs each drawing 1000 random samples and you combine their output afterwards (with another program), you get the equivalent of drawing 4000 samples.
Another typical use of this setting is a parameter sweep. In this case, the same computation is carried out several times by a given code, each run differing only in the initial value of some high-level parameter. An example could be the optimisation of an integer-valued parameter through range scanning in a job array:
#!/bin/bash
#
#SBATCH --job-name=test_emb_arr
#SBATCH --output=res_emb_arr.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#
#SBATCH --array=1-8
srun ./my_program.exe $SLURM_ARRAY_TASK_ID
In that configuration, the my_program.exe command will be run eight times, in eight distinct jobs, each time with a different argument passed through the SLURM_ARRAY_TASK_ID environment variable defined by Slurm, ranging from 1 to 8, as specified by the --array parameter.
The same idea can be used to process several data files. Each instance of the program must be passed a different file to read, based upon the value set in the $SLURM_* environment variable. For instance, assuming there are exactly eight files in /path/to/data, we can create the following script:
#!/bin/bash
#
#SBATCH --job-name=test_emb_arr
#SBATCH --output=res_emb_arr.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#
#SBATCH --array=0-7
FILES=(/path/to/data/*)
srun ./my_program.exe ${FILES[$SLURM_ARRAY_TASK_ID]}
In this case, eight jobs will be submitted, each with a different filename, taken from the FILES[] array, given as an argument to my_program.exe. As the FILES[] Bash array is zero-indexed, the Slurm job array IDs must also start at 0, so the argument is --array=0-7. One pain point is that the number of files in the directory must match the number of jobs in the array.
Note that the same recipe can be used with a numerical argument that is not simply an integer sequence, by defining a Bash array ARGS[] containing the desired values:
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}
Here again, the Slurm job array numbering must start at 0 to make sure all items in the ARGS[] Bash array are processed.
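With the seven values in the ARGS[] array above, for instance, the matching directive would be
#SBATCH --array=0-6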
Warning
If the running time of your program is small, say ten minutes or less, creating a job array will incur a lot of overhead and you should consider packing your jobs.
Packed jobs example¶
By default, the srun command in a submission script inherits all the non-GRES resources allocated to the job, but with the --exact parameter, you can split the resources and allocate them to multiple steps running in parallel.
As an example, the following job submission script asks Slurm for 8 CPUs, then runs the myprog program 1000 times with arguments from 1 to 1000. But thanks to the -N1 -n1 -c1 --exact options, at any point in time only 8 instances will effectively be running, each being allocated one CPU. You can decide to allocate several CPUs or tasks per instance by adapting the corresponding parameters.
#! /bin/bash
#
#SBATCH --ntasks=8
for i in {1..1000}
do
srun -N1 -n1 -c1 --exact ./myprog $i &
done
wait
The for-loop can be replaced with GNU parallel if it is installed on your system:
parallel -P $SLURM_NTASKS srun -N1 -c1 -n1 --exact ./myprog ::: {1..1000}
Similarly, many files can be processed with one job submission script. The following script will run myprog for every file in /path/to/data, with a maximum of 8 running at any time, each using one CPU.
#! /bin/bash
#
#SBATCH --ntasks=8
for file in /path/to/data/*
do
srun -N1 -n1 -c1 --exact ./myprog $file &
done
wait
Here again, the for-loop can be replaced with another command, xargs:
find /path/to/data -type f -print0 | xargs -0 -n1 -P $SLURM_NTASKS srun -n1 -c1 --exact ./myprog
Master/worker program example¶
#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=res_ms.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
srun --multi-prog multi.conf
With file multi.conf
being, for example, as follows
0 echo I am the Master
1-3 echo I am worker %t
The above instructs Slurm to create four tasks (or processes), one running echo 'I am the Master', and the other three running echo I am worker %t. The %t placeholder will be replaced with the task id. This is typically used in a producer/consumer setup where one program (the master) creates computing tasks for the other program (the workers) to perform.
Upon completion of the above job, the file res_ms.txt will contain
I am worker 2
I am worker 3
I am worker 1
I am the Master
though not necessarily in the same order.
Hybrid jobs¶
You can mix multi-processing (MPI) and multi-threading (OpenMP) in the same job, simply like this:
#! /bin/bash
#
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
module load OpenMPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./myprog
or even a job array of hybrid jobs:
#! /bin/bash
#
#SBATCH --array=1-10
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
module load OpenMPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./myprog $SLURM_ARRAY_TASK_ID
GPU jobs¶
If you want to claim a GPU for your job, you need to specify the GRES (Generic Resource Scheduling) parameter in your job script. Please note that GPUs are only available in a specific partition whose name depends on the cluster.
#SBATCH --partition=PostP
#SBATCH --gres=gpu:1
A sample job file requesting a node with a GPU could look like this:
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=1000
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
module load CUDA
srun ./my_cuda_program
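If you need more than one GPU on the same node, you can increase the count in the GRES specification; depending on the cluster configuration, a GPU type can also be requested with gpu:<type>:<count>, but the type names are cluster-specific, so check with your administrators. For instance, to request two GPUs:
#SBATCH --gres=gpu:2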
Interactive jobs¶
Slurm jobs are normally batch jobs in the sense that they are run unattended. If you want to have a direct view on your job, for tests or debugging, you have two options.
If you simply need an interactive Bash session on a compute node, with the same environment settings as batch jobs, run the following command:
srun --pty bash -l
Doing that, you are submitting a 1-CPU, default memory, default duration job that will return a Bash prompt when it starts.
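If the defaults are not enough, you can pass the usual resource options to srun; for instance (a sketch, adapt the values to your needs):
srun --cpus-per-task=4 --mem-per-cpu=1000 --time=1:00:00 --pty bash -l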
If you need more flexibility, you will need to use the salloc command. The salloc command accepts the same parameters as sbatch as far as resource requirements are concerned. Once the allocation is granted, you can use srun the same way you would in a submission script.
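For instance, a sketch of an interactive session with salloc (the job id and the exact messages will differ on your cluster):
$ salloc --ntasks=8 --time=1:00:00 --mem-per-cpu=500
salloc: Granted job allocation 12350
$ srun hostname
$ exit
salloc: Relinquishing job allocation 12350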
Further reading¶
Now that you know how to submit a job, you might be interested in learning more about the following topics:
- Slurm : Slurm FAQ, Priority computation, Official documentation
- Checkpointing: when a job is too long to fit within the maximum wall time of the cluster, it must be checkpointed and restarted. See our training session video on the subject.
- Workflow managers: when you have a multitude of jobs to manage, you might find a workflow manager to be a good helping tool. See the videos of the workshop we organised on the subject.