Slurm F.A.Q¶
Q00 Can I bypass Slurm?¶
No. All computations must be submitted as Slurm jobs. You are not allowed to run jobs on the login nodes (except for small tests), nor may you connect to compute nodes by SSH outside of a Slurm allocation.
Warning
By using the CÉCI clusters, you accept the rules related to the Fair use of the CÉCI clusters.
Q01 Is there a generic submission script I can use?¶
See the Slurm Script Generation Wizard.
Q02 How can I get my job to start early?¶
First, make sure you only request the resources you need. The more you ask for, the longer you will wait. Then, try to make your job flexible in terms of resources. If your program is able to work through the network, do not ask for all tasks to be on the same node. Use the Slurm options cleverly. For instance, the --nodes option allows specifying a range of numbers of nodes, e.g. --nodes=2-4, meaning that your job will start as soon as at least two nodes are available, but if, by then, four nodes are available, you will be allocated four nodes. Another example is --time-min, which lets you specify a minimum running time you are willing to accept for your job if that allows it to start earlier (through backfilling). Note that you can get, at any moment, the remaining time for your job in your script with squeue -h -j $SLURM_JOBID -o %L, and that the --signal option can be used to be warned before the job is killed.
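For instance, a job script header combining these options could look like the following sketch (the time limits and the signal choice are arbitrary examples):
#SBATCH --nodes=2-4
#SBATCH --time=12:00:00
#SBATCH --time-min=02:00:00
#SBATCH --signal=B:USR1@300
Here USR1 would be sent to the batch script 300 seconds before the time limit is reached.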
Of course, it is always easier to get a job running when there are few jobs waiting in the queue. Try to plan ahead your work and submit your jobs when people usually do not work on the cluster: holidays, exam periods, etc.
Q03 How do I cancel a job?¶
Use the scancel jobid command with the id of the job you want cancelled. If you want to cancel all your jobs, type scancel -u login. You can also cancel all your pending jobs, for instance with scancel -t PD.
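Summarizing, with 123456 standing for a hypothetical job id:
scancel 123456   # cancel one specific job
scancel -u login # cancel all your jobs
scancel -t PD    # cancel all your pending jobs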
Q04 How do I submit a job to a specific queue?¶
Slurm uses the term partition rather than queue. To submit a job to
a given partition, use the --partition
option of the sbatch
, srun
, or
salloc
commands.
Queues can also take the form of qualities of service (QOS).
To use a particular QOS, use the --qos
option of the above-listed
commands.
To view all partitions on a cluster, use sinfo
, while qualities of
service can be listed with sacctmgr list qos
.
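For instance, in a submission script (the partition and QOS names below are placeholders; use the names reported by sinfo and sacctmgr list qos on your cluster):
#SBATCH --partition=somepartition
#SBATCH --qos=someqos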
Q05 How do I create a parallel environment?¶
Slurm has no concept of a parallel environment as such. Slurm simply requires that the number of nodes, or the number of cores, be specified. But you can control how the cores are allocated (on a single node, on several nodes, etc.) using, for instance, the --cpus-per-task and --ntasks-per-node options.
With those options, there are several ways to get the same allocation. For instance, --nodes=4 --ntasks=4 --cpus-per-task=4 is equivalent in terms of resource allocation to --ntasks=16 --ntasks-per-node=4. But each one will lead to environment variables being set, and understood, differently by srun and mpirun: in the first case 4 processes are launched, while in the second one 16 processes will be launched.
Suppose you need 16 cores, these are some possible scenarios:
- you use mpi and do not care about where those cores are distributed:
--ntasks=16
- you want to launch 16 independent processes (no communication):
--ntasks=16
- you want those cores to spread across distinct nodes:
--ntasks=16 --ntasks-per-node=1 or --ntasks=16 --nodes=16
- you want those cores to spread across distinct nodes and no
interference from other jobs:
--ntasks=16 --nodes=16 --exclusive
- you want 16 processes to spread across 8 nodes to have two processes
per node:
--ntasks=16 --ntasks-per-node=2
- you want 16 processes to stay on the same node:
--ntasks=16 --ntasks-per-node=16
- you want one process that can use 16 cores for multithreading:
--ntasks=1 --cpus-per-task=16
- you want 4 processes that can use 4 cores each for
multithreading:
--ntasks=4 --cpus-per-task=4
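As an illustration, a minimal submission script for the last scenario (4 processes with 4 cores each, e.g. for a hybrid MPI/OpenMP program) could look like the following sketch; the executable name is a placeholder:
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
# one OpenMP thread per core allocated to each task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid_program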
Q06 How do I choose a node with certain features (e.g. CPU, GPU, etc.) ?¶
Slurm associates with each node a set of Features and a set of Generic Resources. Features are immutable characteristics of the node (e.g. the network connection type), while generic resources are consumable resources, meaning that as users reserve them, they become unavailable to others (e.g. compute accelerators).
Features are requested with --constraint="feature1&feature2"
or
--constraint="feature1|feature2"
. The former requests both, while the
latter, as one would expect, requests at least one of feature1 or
feature2. More complex expressions can be constructed. Type man
sbatch
for details.
Generic resources are requested with, for instance, --gres="resource:2" to request two units of resource.
Generic resources and features can be listed with the command scontrol show nodes
.
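For example, a job could request a feature and a generic resource as follows; InfiniBand is one of the features listed in the next question, while the gpu resource name is an assumption to adapt to what scontrol show nodes reports on your cluster:
#SBATCH --constraint="InfiniBand"
#SBATCH --gres="gpu:1"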
Q07 How do I get the list of features and resources of each node ?¶
The command sinfo
gives such information. You need to run it with
specific output parameters though:
sinfo -o "%15N %10c %10m %25f %10G"
It will output something like:
ceciuser@cecicluster:~ $ sinfo -o "%15N %10c %10m %25f %10G"
NODELIST CPUS MEMORY FEATURES GRES
mback[01-02] 8 31860+ Opteron,875,InfiniBand (null)
mback[03-04] 4 31482+ Opteron,852,InfiniBand (null)
mback05 8 64559 Opteron,2356 (null)
mback06 16 64052 Opteron,885 (null)
mback07 8 24150 Xeon,X5550 TeslaC1060
mback[08-19] 8 24151 Xeon,L5520,InfiniBand (null)
mback[20-32,34] 8 16077 Xeon,L5420 (null)
In the above output, we see that some compute nodes have InfiniBand connections, some have Intel processors, while others have AMD processors. The node mback07 furthermore has one GPU, which is a generic resource in the sense that once requested by a job, it becomes unavailable to others.
Q08 Is OpenMP ‘slurm-aware’ ?¶
No, you need to set export OMP_NUM_THREADS=...
in your submission
script.
For instance, if you requested several cores with the --cpus-per-task
option, you
can write:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
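A complete OpenMP submission script could therefore look like the following sketch (the number of cores and the executable name are placeholders):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# tell OpenMP to use exactly the cores allocated to the task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./openmp_program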
Q09 Is MPI ‘slurm-aware’ ?¶
Yes, you do not need to specify the -np, -host, or -hostfile options to mpirun or mpiexec. Simply go with
mpirun ./a.out
assuming you requested several cores with --ntasks.
Do not forget to set the environment correctly with something like
module load openmpi/gcc
if necessary.
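Putting it together, a minimal MPI submission script could look like this sketch (the module name and a.out are examples to adapt to your cluster and program):
#!/bin/bash
#SBATCH --ntasks=16
module load openmpi/gcc
# mpirun picks up the task count and host list from Slurm
mpirun ./a.out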
Q10 How do I know how much memory my job is using/has used ?¶
If your job is still running, you can get memory information with
sstat
. If your job is done, the information is provided by sacct
. Both
support the --format
option so you can run, for instance:
sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize
See the manpages for both utilities:
man sstat
man sacct
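For a running job, you can for instance check the memory used so far by its batch step (123456 is a hypothetical job id):
sstat --format=JobID,MaxRSS,MaxVMSize,AveRSS -j 123456.batch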
Q11 How do I use the local scratch space ?¶
The CÉCI clusters are configured so that a temporary directory is created for each job on each node, and referenced by the $LOCALSCRATCH environment variable. The temporary directory will typically be something like /scratch/<login_name>/<job_id>.
Therefore, in the case of a single-node job, you can simply cp
or rsync
the file or directory you need directly to $LOCALSCRATCH
. For instance:
cp $HOME/inputfile $LOCALSCRATCH/
cd $LOCALSCRATCH
# do some work here
The $LOCALSCRATCH directory is automatically deleted at the end of the job. It is therefore necessary to copy any results back to a global filesystem, either $HOME or $GLOBALSCRATCH, for further processing or as temporary storage before moving them to a long-term storage system.
cp -r $HOME/inputs $LOCALSCRATCH/
cd $LOCALSCRATCH
# do some work here
cp -r ./results $GLOBALSCRATCH
In the event of a multi-node job, Slurm offers the sbcast
command that propagates a file to the local file
systems of the nodes that were allocated to the job. However, sbcast
works one file at a time.
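For instance, to copy a single input file to the local scratch space of every node allocated to the job, from within the submission script:
sbcast $HOME/inputfile $LOCALSCRATCH/inputfile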
Another way to deal with several files is to run something like srun cp .... Beware that in that case the srun command will run as many cp commands as the number of ntasks requested for the job, which could be larger than the number of allocated nodes; multiple cp commands would then fight over access to the destination directory and might overwrite each other's files. Make sure then to restrict srun to the number of allocated nodes:
srun -n $SLURM_JOB_NUM_NODES cp -r $HOME/inputs $LOCALSCRATCH/
Q12 How do I get the node list in full rather than in compressed format ?¶
Slurm describes node lists with notations like node[05-07,09-17]. To get
the full list, use the scontrol
command:
ceciuser@cecicluster:~ $ scontrol show hostname node[05-07,09-17] | paste -d, -s
node05,node06,node07,node09,node10,node11,node12,node13,node14,node15,node16,node17
Q13 How do I know which slots exactly are assigned to my job ?¶
The command scontrol show -d job jobid
gives very detailed
information about jobs.
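For instance, the CPU ids allocated on each node appear in the CPU_IDs field of the detailed output, which you can extract with grep (123456 is a hypothetical job id):
scontrol show -d job 123456 | grep CPU_IDs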
Q14 How do I translate a script from another job scheduler into Slurm ?¶
The Slurm developers maintain a 'Rosetta Stone of Workload Managers', which gives the correspondence between the options of several job schedulers. [Direct link - pdf 156kb].
Q15 When will my job start ?¶
A job starts either when it has the highest priority and the required resources are
available, or when it has an opportunity to backfill (See the document Slurm
priorities for details). The squeue --start
command
gives an estimation of the date and time a job is supposed to start, but beware that the estimation is based on the situation at a given time. Slurm cannot anticipate higher-priority jobs being submitted after yours, machine downtimes which leave fewer resources for the jobs, or job crashes which can lead to large jobs starting earlier than expected, making smaller jobs scheduled for backfilling lose that opportunity.
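For example, to see the estimated start time of all your pending jobs (replace login with your own login):
squeue --start -u login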
Q16 How do I know to which partition I should submit my job so that it starts as early as possible?¶
Simply submit the job to all the partitions you are considering, by listing
them with the --partition
option:
#SBATCH --partition=partition1,partition2
The job will be submitted to the partition which offers the earliest allocation according to your job parameters and priority.
Q17 What are the queue settings and cpu resources available per user on each CÉCI cluster?¶
The default settings are summarized on the cluster description page. These settings are indicative only, and may change depending on the load on the clusters. To obtain current settings on a given queue, see Question 4.
Q18 Why do I get allocated more CPUs than I requested?¶
There are multiple reasons why your job could get allocated more resources than it explicitly requested. One obvious reason is the use of --exclusive; even if the job requests 1 CPU, if it specifies --exclusive, it receives all the CPUs of the node.

Another possible reason is hardware multithreading. Both AMD and Intel CPUs offer two "compute units" per physical core, called hardware threads, that are not entirely independent from one another. Historically, hardware multithreading was a key component of the performance of database systems or web servers, but a liability in HPC, and was therefore often disabled. Nowadays, the impact is not so clear and really depends on the software, even for CPU-bound HPC applications. Slurm is aware of this and lets the user choose whether or not to use both hardware threads, i.e. whether to place two OpenMP threads or two MPI ranks on the same physical core. This is done with the "placement options", of which the simplest to use is --hint=[no]multithread. Jobs that set nomultithread, or that equivalently request --threads-per-core=1, will in practice receive both hardware threads for each CPU they request, hence doubling the size of the allocation: Slurm will not "split" a physical core between two different jobs.

Finally, it could also be due to a restriction on the memory that a job can request per core, set up through the MaxMemPerCPU cluster configuration option. Jobs that request --mem-per-cpu values above that configured value, rather than being denied submission, will in effect receive more CPUs so that this limit is honored.
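As an illustration, the following sketch requests one thread per physical core; on nodes with two hardware threads per core, Slurm will then typically report twice as many allocated CPUs as requested tasks (the task count is arbitrary):
#SBATCH --ntasks=8
#SBATCH --hint=nomultithread
# from within the job, check the size of the actual allocation:
scontrol show job $SLURM_JOB_ID | grep NumCPUs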