Slurm interactive jobs¶
When submitting an interactive job to Slurm, there are several issues to take into account:
- the interactive session might start when you are not available;
- your personal computer might get disconnected.
Planning the start of the job¶
While there is no way, besides creating an advance reservation, to ensure that a job starts at a specific time, you can make sure it does not start at an unwanted time. For instance, you might not want a job to start at 2:00 AM. Or it is 4:00 PM, you leave in one hour, and you do not want the job to start with so little time left.
When you submit the job, you can specify the earliest time at which the job should start, and Slurm will not consider that job until that specified time has come. That is done with the --begin option, common to sbatch, srun and salloc. Here are some examples:
--begin=2023-03-12T08:30
--begin=tomorrowT09:00
--begin=now+5daysT09:30:00
Make sure to specify a start time in addition to the date, because the default is 00:00:00.
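As a minimal sketch of the advice above, the helper below builds a --begin value that always contains both a date and a time. It assumes GNU date (for the -d option); the function name begin_at and the --ntasks=1 option are only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: turn a human-readable date into the
# YYYY-MM-DDTHH:MM:SS form that --begin accepts.
# Assumes GNU date, which understands strings such as "tomorrow 09:00".
begin_at() {
    date -d "$1" +%Y-%m-%dT%H:%M:%S
}

# Print the command instead of running it, so it can be reviewed first.
echo "salloc --begin=$(begin_at 'tomorrow 09:00') --ntasks=1"
```

Spelling out the full timestamp avoids relying on the 00:00:00 default.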
Once the job has been submitted, you can alter the start time with the scontrol command if you change your mind:
scontrol update jobid=<JOBID> StartTime=now+5daysT09:30:00
And you can remove the constraint with
scontrol update jobid=<JOBID> StartTime=now
See the Slurm documentation for the full description of --begin.
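To check that a start time was taken into account, you can query the job with squeue. The sketch below wraps the query in a function whose name, show_start, is hypothetical. In the output format, %i is the job ID, %T the job state, %S the actual or expected start time, and %r the reason the job is pending.

```shell
#!/bin/sh
# Hypothetical helper: show the state, start time and pending reason of a job.
show_start() {
    squeue -j "$1" -o '%i %T %S %r'
}

# Usage: show_start <JOBID>
```

A job that is deferred by --begin is typically listed as pending with the reason BeginTime.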
Preventing disconnections¶
Interactive sessions are typically requested with srun --pty bash, or with the salloc command. Both commands will block until Slurm allocates the resources, and will not survive an SSH disconnection from the login node.
Start srun or salloc in a Tmux session¶
To make sure that salloc or srun keeps waiting for the job to start even if your laptop or desktop is disconnected, put to sleep, or abruptly rebooted, you need to use a terminal multiplexer. The most widely used terminal multiplexers are GNU Screen and Tmux. This document focuses on Tmux, as it offers more features than Screen while being nearly as commonly installed.
The basic idea is to run the tmux command as soon as you are connected with SSH. The screen will clear, and you will enter a Tmux session. From there, you can work as usual, and if you happen to be disconnected, you can SSH back to the login node and run the tmux attach command to re-attach to the running Tmux session, which has survived the disconnection.
For an interactive job, start a Tmux session and, inside that session, run the srun or salloc command. Once the command has started and is waiting for Slurm to allocate the resources, you can detach from the session if you want by pressing and releasing CTRL-b, then hitting d “blindly”. This will bring you back to the initial shell session you got when you connected with SSH.
Tmux sessions can have names, so you can start several of them and attach to or detach from them at will. For a more comprehensive tutorial on the features of Tmux, you can refer to this document.
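The named-session workflow can be sketched as follows. The function name tmux_session and the session name interactive are arbitrary choices for this illustration; the -A option of tmux new-session attaches to the session if it already exists and creates it otherwise.

```shell
#!/bin/sh
# Hypothetical helper: attach to the named Tmux session if it exists,
# otherwise create it. The default name "interactive" is arbitrary.
tmux_session() {
    tmux new-session -A -s "${1:-interactive}"
}

# Typical use on the login node:
#   tmux_session interactive   # first connection: creates the session
#   srun --pty bash            # inside the session: request the job
#   (detach with CTRL-b then d, or lose the SSH connection)
#   tmux ls                    # after reconnecting: list existing sessions
#   tmux_session interactive   # re-attach to the same session
```

Using the same command to create and to re-attach means you do not have to remember whether the session already exists.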
Start tmux in a sbatch submission script¶
The previous section explained how to start a Tmux session on the login node to protect the srun or salloc command from SSH disconnections.
Another option, which is slightly more complicated to set up but offers more flexibility, is to start Tmux in a (non-interactive) sbatch job and attach to it once the job has started.
This allows you to
- run commands before tmux is started, or in parallel;
- have an interactive job that survives login node reboots or other problems;
- have split panes in the Tmux session that all run on the compute node;
- start commands inside Tmux automatically when the job starts.
Tmux must be started in “detached” mode from the beginning, otherwise it will complain “open terminal failed: not a terminal”. It is also a good idea to name the session in case you have multiple jobs running on the same compute node.
An example submission script can be:
#!/bin/bash
#SBATCH ... # some job options
# Commands to start outside of the tmux session
# Append an `&` sign to make the command run in parallel
# to the tmux session
./prepare-job.sh
# Start tmux
tmux new -d -s "$SLURM_JOB_ID"
# Run tmux commands to set up the session
tmux split-window -h -t "$SLURM_JOB_ID" # split the window into two side-by-side panes
# Run commands inside the Tmux session
tmux send-keys -t "$SLURM_JOB_ID" "echo hello world" Enter
# Finally, make the job wait for the user to connect
sleep 8h
Once the job has started, you can use srun to attach to the Tmux session:
srun --jobid <JOBID> --pty tmux a -t <JOBID>
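If you prefer not to check by hand whether the job has started, the wait can be scripted. The following is only a sketch: the function name attach_when_running and the 30-second polling interval are assumptions, and it expects the Tmux session to be named after the job ID, as in the submission script above.

```shell
#!/bin/sh
# Hypothetical helper: wait for a job to start, then attach to its Tmux session.
attach_when_running() {
    jobid=$1
    while :; do
        # -h suppresses the header line, %T prints the job state.
        state=$(squeue -h -j "$jobid" -o %T 2>/dev/null)
        case "$state" in
            RUNNING) break ;;
            PENDING|CONFIGURING) sleep 30 ;;
            *)
                # Empty or unexpected state: the job has ended or does not exist.
                echo "job $jobid is not pending or running (state: ${state:-unknown})" >&2
                return 1
                ;;
        esac
    done
    srun --jobid "$jobid" --pty tmux a -t "$jobid"
}

# Usage: attach_when_running <JOBID>
```

Note that the job may be reported as running slightly before the Tmux session exists, for instance while the preparation commands are still executing; in that case simply run the function again.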