Slurm interactive jobs

When submitting an interactive job to Slurm, there are two issues to take into account:

  1. the interactive session might start when you are not available;
  2. your personal computer might get disconnected.

Planning the start of the job

While there is no way, short of creating an advance reservation, to ensure that a job starts at a specific time, you can make sure it does not start at an unwanted time. For instance, you might not want a job to start at 2:00 AM. Or it is 4:00 PM, you leave in one hour, and you do not want the job to start with so little time left.

When you submit the job, you can specify the earliest time at which the job should start, and Slurm will not consider that job until that specified time has come. That is done with the --begin option common to sbatch, srun and salloc. Here are some examples:

--begin=2023-03-12T08:30
--begin=tomorrowT09:00
--begin=now+5daysT09:30:00

Make sure to specify a start time in addition to the date, because the default is 00:00:00.
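
As a rough illustration of the 4:00 PM case described earlier, the following sketch picks a --begin value depending on the hour of submission. The begin_value function name, the 16:00 cut-off and the --time value are assumptions made for this example, and the date -d syntax requires GNU date. The command is only printed, not submitted.

```shell
#!/bin/sh
# begin_value [HOUR]: print a value suitable for --begin.
# HOUR (0-23) defaults to the current hour. Hypothetical helper.
begin_value() {
    hour="${1:-$(date +%H)}"
    # Drop the leading zero of two-digit hours: "08" -> "8", "00" -> "0"
    case "$hour" in
        0?) hour="${hour#0}" ;;
    esac
    if [ "$hour" -ge 16 ]; then
        # Too late today: tomorrow at 09:00, as an explicit timestamp
        date -d "tomorrow 09:00" +%Y-%m-%dT%H:%M
    else
        echo "now"
    fi
}

# Dry run: show the command rather than submitting it
echo "salloc --begin=$(begin_value) --time=02:00:00"
```

Writing the date and the time explicitly avoids relying on the 00:00:00 default mentioned above.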

Once the job has been submitted, you can still change the start time with the scontrol command if you change your mind:

scontrol update jobid=<JOBID> StartTime=now+5daysT09:30:00

You can remove the constraint with:

scontrol update jobid=<JOBID> StartTime=now

See the Slurm documentation for the full description of --begin.

Preventing disconnections

Interactive sessions are typically requested with srun --pty bash, or with the salloc command. Both commands will block until Slurm allocates the resources, and will not survive an SSH disconnection from the login node.

Start srun or salloc in a Tmux session

To make sure the salloc or srun command keeps waiting for the job to start even if your laptop or desktop is disconnected, put to sleep, or abruptly rebooted, you need to use a terminal multiplexer. The most widely used terminal multiplexers are GNU Screen and Tmux. This document focuses on Tmux, as it offers more features than Screen while being nearly as commonly installed.

The basic idea is to start the tmux command as soon as you are connected with SSH. You will see the screen clear up, and you will enter a tmux session. From there, you can work as usual, and if you happen to be disconnected, you can SSH back to the login node and run the tmux attach command to re-attach to the running Tmux session that has survived the disconnection.

To protect an interactive job, start a Tmux session and run the srun or salloc command inside it. Once the command has started and is waiting for Slurm to allocate the resources, you can detach from the session by pressing and releasing CTRL-b, then pressing d; nothing is echoed on screen while you type this key sequence. This will bring you back to the initial shell session you got when you connected with SSH.

Tmux sessions can be named, so you can start several of them and attach to or detach from them at will. For a more comprehensive tutorial on the features of Tmux, you can refer to this document.
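
The workflow described in this section can also be scripted. The sketch below starts salloc directly inside a named, detached session and attaches to it afterwards. The session name "interactive", the two function names and the salloc option are illustrative assumptions, not site conventions.

```shell
#!/bin/sh
# Sketch: run salloc inside a named, detached Tmux session.
# SESSION and the salloc options are illustrative assumptions.
SESSION="${SESSION:-interactive}"

start_session() {
    # -d: do not attach now; -s: session name; the last argument is the
    # command that tmux runs inside the session
    tmux new-session -d -s "$SESSION" "salloc --time=01:00:00"
}

attach_session() {
    # Re-attach, for instance after logging back in to the login node
    tmux attach-session -t "$SESSION"
}
```

Call start_session once on the login node, then attach_session whenever you want to check on the job; tmux ls lists the existing sessions. Keep in mind that a session created this way ends as soon as the command it runs exits.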

Start tmux in a sbatch submission script

The previous section explained how to start a Tmux session on the login node to protect the srun or salloc command from SSH disconnections.

Another option, which is slightly more complicated to set up but offers more flexibility, is to start Tmux in a (non-interactive) sbatch job and attach to it once the job has started.

This allows you to

  • have commands starting before tmux is started, or in parallel;
  • have an interactive job that survives login node reboots or other problems;
  • have split panes in the Tmux session that are all running on the compute node;
  • start commands inside Tmux automatically upon job start.

Tmux must be started in “detached” mode from the beginning, otherwise it will complain “open terminal failed: not a terminal”. It is also a good idea to name the session in case you have multiple jobs running on the same compute node.

An example submission script can be:

#!/bin/bash
#SBATCH ... # some job options

# Commands to run outside of the Tmux session.
# Append an `&` to run a command in parallel with the Tmux session.
./prepare-job.sh

# Start Tmux in detached mode, naming the session after the job ID
tmux new -d -s "$SLURM_JOB_ID"

# Run tmux commands to set up the session
tmux split-window -h -t "$SLURM_JOB_ID" # creates two side-by-side panes

# Run commands inside the Tmux session
tmux send-keys -t "$SLURM_JOB_ID" "echo hello world" Enter

# Finally, keep the job alive while waiting for the user to connect
sleep 8h
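
A fixed sleep keeps the allocation busy for its whole duration, even if you closed the Tmux session long before. One possible alternative, sketched here, is to poll tmux has-session and let the script, and therefore the job, end once the session is gone. The wait_for_session name and the 60-second default interval are assumptions made for this example.

```shell
#!/bin/sh
# wait_for_session NAME [INTERVAL]: return once the Tmux session NAME
# no longer exists. INTERVAL is the polling period in seconds.
# Hypothetical helper, not part of Slurm or Tmux.
wait_for_session() {
    session="$1"
    interval="${2:-60}"
    # has-session exits with a non-zero status when the session is gone
    while tmux has-session -t "$session" 2>/dev/null; do
        sleep "$interval"
    done
}
```

In the submission script, the last line would then become wait_for_session "$SLURM_JOB_ID" instead of sleep 8h. The job remains subject to its time limit either way.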

Once the job has started, you can use srun to attach to the Tmux session:

srun --jobid <JOBID> --pty tmux a -t <JOBID>
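
If you would rather not check the queue by hand, the following sketch polls squeue until the job is running and then attaches to its Tmux session. It assumes the session is named after the job ID, as in the submission script of this section; the function names and the 30-second polling interval are assumptions as well.

```shell
#!/bin/sh
# wait_for_running JOBID [INTERVAL]: block until the job is RUNNING.
# Returns 1 if the job is not (or no longer) in the queue.
wait_for_running() {
    jobid="$1"
    interval="${2:-30}"   # seconds between polls (arbitrary default)
    while :; do
        # -h: no header line; -o %T: print only the job state
        state=$(squeue -h -j "$jobid" -o %T 2>/dev/null)
        case "$state" in
            RUNNING) return 0 ;;
            "")      echo "job $jobid is not in the queue" >&2; return 1 ;;
        esac
        sleep "$interval"
    done
}

# attach_to_job JOBID: wait for the job, then attach to its session
attach_to_job() {
    wait_for_running "$1" && srun --jobid "$1" --pty tmux attach -t "$1"
}

# Usage (job.sh is a placeholder for your submission script):
#   jobid=$(sbatch --parsable job.sh)
#   attach_to_job "$jobid"
```

The --parsable option makes sbatch print only the job ID, followed by a semicolon and the cluster name on multi-cluster setups, in which case the cluster name must be stripped first. Running attach_to_job inside a Tmux session on the login node protects the waiting loop itself from disconnections.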