Advanced SLURM Features

Dependency chains

SLURM supports dependency chains, i.e. collections of batch jobs with defined dependencies. Job dependencies are specified using the --dependency option of the sbatch command:

#SBATCH --dependency=<type>

The available dependency types for job chains are:

  • after:<jobID> job starts when the job with <jobID> has begun execution

  • afterany:<jobID> job starts when the job with <jobID> terminates

  • afterok:<jobID> job starts when the job with <jobID> terminates successfully

  • afternotok:<jobID> job starts when the job with <jobID> terminates with failure

  • singleton job starts when any previously submitted job with the same job name and user terminates
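As an illustration, a chain of two jobs can be built by capturing the job ID returned by sbatch and passing it to the next submission. This is only a sketch; the job scripts preprocess.sh and compute.sh are hypothetical placeholders:

```shell
#!/bin/bash
# Sketch of a two-step dependency chain; preprocess.sh and
# compute.sh are hypothetical example job scripts.
# --parsable makes sbatch print only the job ID.
jobid1=$(sbatch --parsable preprocess.sh)

# compute.sh starts only if preprocess.sh terminates successfully.
jobid2=$(sbatch --parsable --dependency=afterok:${jobid1} compute.sh)

echo "Submitted chain: ${jobid1} -> ${jobid2}"
```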

Job Arrays

SLURM supports job arrays, a mechanism for submitting and managing collections of similar jobs quickly and easily. Job arrays are only supported for the sbatch command and are defined using the option --array=<indices>. All jobs in the array share the same initial options (e.g. number of nodes, time limit, etc.); however, since each array task has access to the SLURM_ARRAY_TASK_ID environment variable, individual settings for each job are possible. For example, the job submission

$ sbatch --array=1-3 -N1

will generate a job array containing three jobs. Assuming that the jobID reported by sbatch is 42, the tasks of the array will have the following environment variables set:

# array index 1
SLURM_JOB_ID=42
SLURM_ARRAY_JOB_ID=42
SLURM_ARRAY_TASK_ID=1

# array index 2
SLURM_JOB_ID=43
SLURM_ARRAY_JOB_ID=42
SLURM_ARRAY_TASK_ID=2

# array index 3
SLURM_JOB_ID=44
SLURM_ARRAY_JOB_ID=42
SLURM_ARRAY_TASK_ID=3
Additional patterns are available for the options specifying file names for standard input (stdin), standard output (stdout), and standard error (stderr): the marker %A will be replaced by the value of SLURM_ARRAY_JOB_ID and the marker %a by the value of SLURM_ARRAY_TASK_ID.
The following example creates a job array of 42 jobs with indices 0-41. Each job will run on a separate node with 24 tasks per node. Depending on the queuing situation, some jobs may be running and some may be waiting in the queue. Each part of the job array will execute the same binary but with different input files.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --output=prog-%A_%a.out
#SBATCH --error=prog-%A_%a.err
#SBATCH --time=00:30:00
#SBATCH --array=0-41
#SBATCH --account=xz0123

srun --ntasks-per-node=24 ./prog input_${SLURM_ARRAY_TASK_ID}.txt
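A useful addition when working with large job arrays: the number of array tasks running simultaneously can be limited with the "%" separator of the --array option. In the following sketch, the script name job_script.sh is a placeholder:

```shell
# Run all 42 array tasks, but at most 4 of them at the same time:
sbatch --array=0-41%4 job_script.sh
```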


Hyper-Threading

Similar to the IBM Power6 processors used in BLIZZARD, the Haswell and Broadwell processors deployed in MISTRAL offer Simultaneous Multithreading (SMT) in the form of the Intel Hyper-Threading (HT) Technology. With HT enabled, each (physical) processor core can execute two threads or tasks simultaneously. The following visualization refers to the Haswell nodes only; the Broadwell nodes behave analogously.

Each node of the MISTRAL phase 1 partition 'compute' consists of two Intel Xeon E5-2680 v3 processors, located on sockets zero and one. The first 24 processing units are physical cores, labelled from 0 to 23. The second 24 processing units are Hyper-Threads, labelled from 24 to 47. The figure below depicts a node schematically and illustrates the naming convention.

On MISTRAL, HT is enabled on each compute node and SLURM always uses the option --threads-per-core=2 implicitly, so the user is urged to bind tasks/threads in an appropriate way. The SLURM option --cpus-per-task specifies the number of logical CPUs to be used for each MPI task. Setting e.g. --cpus-per-task=1 for a pure MPI-parallel program results in the use of Hyper-Threads (this is also the default if no option is given), while --cpus-per-task=2 runs the code on physical cores only, not using Hyper-Threads. For OpenMP threads to run on logical CPUs, the OpenMP runtime environment needs to be modified as follows:

export KMP_AFFINITY=granularity=thread,compact

The keyword granularity specifies whether OpenMP threads are bound to Hyper-Threads (granularity=thread) or to physical cores (use granularity=core in that case).
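Putting these options together, a hybrid MPI/OpenMP job on one Haswell compute node could look like the following sketch; the binary ./hybrid_prog and the chosen task/thread counts are assumptions for illustration:

```shell
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=4
#SBATCH --time=00:30:00

# 12 MPI tasks x 4 logical CPUs = 48 processing units, i.e. the
# full node including Hyper-Threads. Each task runs 4 OpenMP
# threads, pinned with thread granularity.
export OMP_NUM_THREADS=4
export KMP_AFFINITY=granularity=thread,compact

srun ./hybrid_prog
```

Choosing --cpus-per-task=2 with OMP_NUM_THREADS=2 instead would place one OpenMP thread per physical core, leaving the Hyper-Threads unused.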