Slurm priorities
Jobs are scheduled by Slurm according to their priority: this is not a
first-come, first-served queue. We cannot stress this enough; the waiting time
for a job is not related to the length of the queue, it is related to the
user’s fairshare and to the size of the job. Even if the queue has 1000 jobs
pending, your job could start right away if your fairshare is favorable,
or if your job is so small it can be backfilled and scheduled in the shadow of
a larger, higher-priority job. But of course, if you do not submit your job,
it has zero chance of starting... Priorities are computed per
partition. Some clusters have special partitions with a small maximum time that allow
for fast job turnaround. On those partitions, your small jobs are even more
likely to start soon. To view the partitions on the clusters, use the sinfo
command, or consult the cluster details on the CÉCI website.
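For example, to list each partition with its time limit and node count (the partition names and numbers below are purely illustrative):

sinfo -o "%P %l %D"

PARTITION TIMELIMIT NODES
batch*    5-00:00:00  42
debug     1:00:00      2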
Slurm recomputes job priorities regularly and updates them to reflect the continuously changing situation. For instance, if the priority is configured to take into account the past usage of the cluster by each user, the running jobs of one user lower the priority of that user’s pending jobs.
The way the priority is updated depends on many configuration details. This document explains how to discover them and find the appropriate documentation, so that you can understand how priorities are computed on a particular cluster.
Two parameters in Slurm’s configuration determine how priorities are computed. They are named SchedulerType and PriorityType.
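Both can be read at once from the running configuration with:

scontrol show config | grep -E 'SchedulerType|PriorityType'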
Internal or external scheduling
The first parameter, SchedulerType, determines how jobs are scheduled based on available resources, requested resources, and job priorities. Scheduling can be taken care of by an external program such as Moab or Maui, or by Slurm itself.
In the latter case, the scheduling type can be builtin, in which case all jobs run strictly in priority order, or backfill. Backfilling is a mechanism by which lower-priority jobs can start earlier to fill idle slots, provided they are expected to finish before the next high-priority job is due to start based on resource availability.
To find out which solution is implemented on a cluster, you can issue the following command:
scontrol show config | grep SchedulerType
If the answer is sched/wiki, this means that scheduling is handled by Maui, while sched/wiki2 means scheduling is done by Moab. If scheduling is actually configured to be managed by Slurm, the above command should return sched/builtin or sched/backfill.
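On a cluster relying on backfilling, for instance, the output looks like:

SchedulerType           = sched/backfill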
See the slurm.conf manpage and search for ‘SchedulerType’ for more information.
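When backfilling is active, its behaviour is governed by the SchedulerParameters option; for instance, bf_interval sets how often the backfill scheduler runs and bf_window how far into the future it plans. These can be inspected in the same manner (values below are illustrative):

scontrol show config | grep SchedulerParameters

SchedulerParameters     = bf_interval=60,bf_window=2880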
If the scheduling is performed externally to Slurm (by Maui or Moab), you will need to look for the corresponding documentation. If, as is most likely, the scheduling is handled internally, the following section explains how to understand priority computations.
Priority computation
The way the priority is computed for a job depends on another parameter which is called PriorityType. It can take the following values:
- priority/basic Jobs are given a strictly first-come, first-served priority. (Mostly used in the case of an external scheduler.)
- priority/multifactor Jobs are prioritized according to several criteria such as past cluster usage, job size, queue time, etc.
- priority/multifactor2 A variation of the previous plugin.
To find out which solution is implemented on a cluster, you can issue the following command:
scontrol show config | grep PriorityType
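If the multifactor plugin is in use, the output reads:

PriorityType            = priority/multifactor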
The configuration you are most likely to encounter is priority/multifactor. The priority then depends on five elements:
- Job age: how long the job has been waiting in the queue;
- User fairshare: a measure of past usage of the cluster by the user;
- Job size: the number of CPUs a job requests;
- Partition: the partition to which the job is submitted, specified with the --partition submission parameter;
- QOS: a quality of service associated with the job, specified with the --qos submission parameter.
Note that the job age factor is bounded (by the PriorityMaxAge parameter) so that the priority stops increasing once the bound is reached. The job size factor can be configured to favour either small or large jobs (through the PriorityFavorSmall parameter), although it is used most of the time to favour large jobs. The fairshare factor has a ‘forgetting’ parameter (PriorityDecayHalfLife) that leads to considering only the recent history of the user, and not their total usage over the lifetime of the cluster.
All these factors are combined in a weighted sum to form the priority. The weights can be found by running
sprio -w
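In simplified form (leaving aside additional factors, such as the association, site, or TRES factors, that some configurations add), the priority is the sum of each factor, a number between 0.0 and 1.0, multiplied by its weight:

Priority = PriorityWeightAge       * age_factor
         + PriorityWeightFairshare * fairshare_factor
         + PriorityWeightJobSize   * jobsize_factor
         + PriorityWeightPartition * partition_factor
         + PriorityWeightQOS       * qos_factor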
A detailed description of how these factors are computed (including the fairshare) is given in the Slurm documentation for multifactor and for scheduling.
The precise configuration for a cluster can be found by running the following command:
scontrol show config | grep ^Priority
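On a hypothetical cluster, this could return something like (values purely illustrative):

PriorityDecayHalfLife   = 7-00:00:00
PriorityFavorSmall      = No
PriorityMaxAge          = 7-00:00:00
PriorityType            = priority/multifactor
PriorityWeightAge       = 1000
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS       = 1000

In that example, past usage is forgotten with a half-life of one week, the age factor stops growing after one week spent in the queue, and the fairshare factor, carrying the largest weight, dominates the priority.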
Finding a user’s current fairshare situation is done with the sshare command.
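For instance, to restrict the output to your own associations (the column layout varies with the Slurm version; values below are illustrative):

sshare -U

Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
myaccount   myuser          1    0.062500      123456      0.041234   0.754321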
Getting the priority given to a job can be done either with squeue
squeue -o %Q -j jobid
or with the sprio command, which gives the details of the computation.
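For example, for a hypothetical job 12345 (the exact columns depend on the Slurm version and configuration; values are illustrative):

sprio -j 12345

  JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
  12345       5243        134       3909        200       1000          0

The last five columns are the already-weighted contributions of each factor; here they add up to the value in the PRIORITY column.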