Slurm Introduction#
This page gives an overview of the user commands provided by Slurm and explains how to use them to run jobs on Levante. A concise cheat sheet for Slurm can be downloaded here.
Slurm Commands#
Slurm offers a rich suite of user commands to allocate resources, launch parallel applications, query job status, control jobs, inspect accounting information, etc. The most commonly used commands are:
sinfo: show information about all partitions and nodes managed by Slurm as well as about general system state. It has a wide variety of filtering, sorting, and formatting options.
squeue: query the list of pending and running jobs.
sbatch: submit a batch script.
salloc: request an (interactive) job allocation. When the job starts, a shell (or other program specified on the command line) is started on the allocated node. The allocation is released when the user exits the shell or cancels the job.
srun: initiate parallel job steps within a job or start an interactive job.
scancel: cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
scontrol: (primarily used by the administrators) useful to manage job settings or get some information about the system entities such as nodes, partitions, jobs, and configurations.
sstat: query near-realtime status information about e.g. CPU and memory usage for a running job.
sacct: retrieve accounting information about jobs and job steps. For completed jobs, sacct queries the accounting database.
sacctmgr: (primarily used by the administrators) query account information.
Examples for usage of the above commands along with more specific explanations can be found in section Slurm Command Examples below.
Allocating Resources with Slurm#
A job allocation, which is a set of computing resources (nodes, cores, and/or GPUs) assigned to a user’s request for a specified amount of time, can be created using the Slurm salloc, sbatch or srun commands. The salloc and sbatch commands make resource allocations only. The srun command launches parallel tasks, hence it replaces mpirun and mpiexec in Slurm installations, and implicitly creates a resource allocation if not started within one. In addition, srun distributes tasks across the allocated resources and binds individual tasks to cores and GPUs.
The most common way to allocate resources and execute a job on Levante is to write a batch script and submit it to Slurm with the sbatch command. The batch script is a shell script consisting of two parts: resource requests and job steps. Resource requests are specifications such as the number of nodes needed to execute the job, the time duration of the job, etc. Job steps are user-defined tasks to be executed using those resources. The resource requests and other Slurm submission options are prefixed by #SBATCH directives and must precede any executable commands in the batch script. For example:
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:30:00
#SBATCH --output=log.%j.txt
# Begin of section with executable commands
set -e
ls -l
srun ./my_program
The shell script itself is regarded by Slurm as job step 0 and is (serially) executed on the first compute node in the job allocation. To execute parallel (MPI) tasks, users call the Slurm srun command within the script, which initiates a new job step. Parallel programs can be executed as job steps in any configuration within the job allocation: a single job step can use all allocated resources, or several job steps (created via multiple srun calls) can use subsets of the allocated resources and run simultaneously.
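For illustration, the sketch below runs two job steps concurrently, one per node of a two-node allocation (the program names ./model_a and ./model_b are placeholders):
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=2
#SBATCH --time=00:30:00
# Start two job steps in the background, one per node. With recent Slurm
# versions the --exact option may be needed so that the steps do not block
# each other by implicitly claiming all resources.
srun --nodes=1 --ntasks=128 --exact ./model_a &
srun --nodes=1 --ntasks=128 --exact ./model_b &
wait   # do not finish the batch job before both steps are done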
By default, the current working directory of a Slurm job is the directory from which sbatch was invoked, which is not necessarily the directory in which the job script resides.
Task binding to CPUs/cores and an appropriate task distribution are essential to achieving good performance on clusters like Levante. For this purpose, the srun options --cpu-bind and --distribution should be used. The --cpu-bind option controls whether tasks are bound (pinned) to specific CPUs, whole cores, entire sockets, or not at all. The --distribution option controls the distribution of allocated CPUs across nodes:sockets:cores for binding to tasks. It is essential for minimizing communication overhead when running parallel jobs. Usually, the two-part form nodes:sockets suffices. The command below, for example,
srun --cpu-bind=cores --distribution=block:cyclic ./my_program
will bind the tasks of ./my_program to dedicated cores, thus preventing the operating system from moving them to other cores during program execution. The first distribution method (block) ensures that consecutive tasks are placed on the same node until the maximum number of tasks per node is reached. The second distribution method (cyclic) means that within a node, tasks are distributed in a round-robin (cyclic) manner between sockets.
When the option --cpu-bind=verbose is set, srun reports the selected distribution and binding of tasks to CPUs. The CPU mask, which is written to standard error, can then be visualized on the Slurm Binding Website provided for Levante by DKRZ to check whether the actual binding matches expectations.
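For example, the following command applies the binding described above and additionally reports the resulting CPU masks:
$ srun --cpu-bind=verbose,cores --distribution=block:cyclic ./my_program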
The following table lists the most common job allocation and submission options (in long and short form) that can be used in a batch script or as command line options:
job allocation option | default value | description
---|---|---
--account=<project account>, -A <project account> | none | Project account that should be charged. This option is mandatory on Levante.
--partition=<name>, -p <name> | none | Partition to run the job. This option is mandatory on Levante.
--job-name=<jobname>, -J <jobname> | job script’s name | Job name
--nodes=<number>, -N <number> | 1 | Number of nodes for the allocation
--ntasks=<number>, -n <number> | 1 [1] | Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given.
--ntasks-per-node=<num> | 1 [1] | Number of tasks per node
--cpus-per-task=<number>, -c <number> | 1 [1] | Number of threads (logical CPUs) per task. Used for OpenMP or hybrid jobs.
--mem=<size>[units] | partition dependent | Real memory required per node. Set to 0 to request all memory of a node.
--output=<path>/<file pattern>, -o <path>/<file pattern> | slurm-%j.out | Standard output file
--error=<path>/<file pattern>, -e <path>/<file pattern> | merged with standard output | Standard error file
--time=<walltime>, -t <walltime> | partition dependent | Requested walltime limit for the job; possible time formats are [hours:]minutes[:seconds] (e.g. 20, 01:20, 01:20:30) and days-hours[:minutes][:seconds] (e.g. 2-0, 1-5:20, 1-5:20:30)
--constraint=<list>, -C <list> | none | Node features requested for the job. See configuration for available features.
--mail-user=<email> | email associated with the DKRZ user account | Email address for notifications
--mail-type=<type> | – | Event types for email notifications. Possible values are BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_[90,80,50]
--exclude=<nodelist>, -x <nodelist> | – | Exclude specified nodes from the job allocation
--nodelist=<nodelist>, -w <nodelist> | – | Request specified nodes for the job allocation (if necessary, additional nodes will be added to fulfill the requirement for the number of nodes)
--requeue or --no-requeue | no-requeue | Specifies whether the batch job should be requeued after a node failure. Caution: if a job is requeued, the whole batch script is executed from its beginning
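For illustration, several of these options combined in one job header (the account, job name and output file name are placeholders):
#!/bin/bash
#SBATCH --account=xz0123           # placeholder project account
#SBATCH --partition=compute
#SBATCH --job-name=myexp
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL           # notify by email only if the job fails
#SBATCH --output=myexp.%j.log      # %j is replaced by the JobID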
Note
On Levante, the -A or --account option must be specified to submit a job; otherwise the submission will be rejected. You can query the accounts for which job submission is allowed using the Slurm sacctmgr command:
$ sacctmgr -s show user name=$USER
Furthermore, you must specify the partition on which the job will run using either the -p or --partition option. Otherwise, the submission will also be rejected.
The complete list of parameters can be found in the sbatch man page:
$ man sbatch
Batch scripts are submitted using the Slurm sbatch command:
$ sbatch [OPTIONS] <job script>
Command line options passed to the sbatch command override the corresponding #SBATCH settings in the job script. On success, sbatch writes the JobID to standard output.
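For example, to override the walltime limit set in the script at submission time (my_job_script.sh is a placeholder name):
$ sbatch --time=01:00:00 my_job_script.sh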
Note
Setting Slurm options via environment variables will override any matching options set in a batch script, and command line options will override any matching environment variable.
All environment variables set at the time of submission are propagated to Slurm jobs. This default behaviour can be changed with some options of the allocation commands (like --export for sbatch or srun). Users can load modules and prepare the desired environment before job submission, and this environment will then be passed to the submitted jobs. Nevertheless, it is good practice to include the necessary module commands in job scripts, in order to explicitly configure the environment of the jobs and provide documentation for yourself and colleagues. For maximum reproducibility, we recommend starting the script with module purge and following up with loading all required environment modules.
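A minimal sketch of this pattern (the module names are placeholders; the modules actually required depend on how my_program was built):
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=1
#SBATCH --time=00:30:00
module purge                # start from a clean environment
module load gcc openmpi     # placeholder module names
srun ./my_program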
Slurm Command Examples#
Query Commands#
List all jobs submitted to Slurm:
$ squeue
This command shows an overview of submitted jobs. Apart from JobID, partition, job name, user name and number of nodes, the command shows the job state, the running time, and the node list for running jobs, or the reason why a job is pending. Typical job states from submission until completion are: PENDING (PD), RUNNING (R), COMPLETING (CG), COMPLETED (CD), and CANCELLED (CA). Some usual reasons for pending jobs are:
Job is waiting for resources to become available (Resources)
Higher priority jobs are occupying the queue (Priority)
The maximum number of jobs allowed for a user within the association was exceeded (AssocMaxJobsLimit)
The specified job run-time exceeds the limit in place for the selected partition (PartitionTimeLimit)
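To list only your pending jobs together with the reason shown in the last column, filter by job state:
$ squeue --me -t PENDING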
List all jobs submitted by you:
$ squeue -u $USER
$ squeue --me
$ squeue -o "%.18i %.9P %.20j %.8u %.8a %.2t %.10M %.6D %R" --me
Check all available partitions and nodes:
$ sinfo
The sinfo command reports the states of the partitions and the nodes. The partitions may be in state UP, DOWN or INACTIVE. The UP state means that a partition will accept new submissions and the jobs will be scheduled. The DOWN state allows submissions to a partition, but the jobs will not be scheduled. The INACTIVE state means that submissions are not allowed. The nodes can also be in various states, such as alloc (allocated), comp (completing), down, idle, maint (maintenance), resv (reserved) etc. A description of all node states can be found in the sinfo man page.
List state summary for all partitions:
$ sinfo -s
The column NODES(A/I/O/T) shows the number of nodes in the states allocated/idle/other/total for each Slurm partition.
Query information about the nodes in the specified partition (e.g. interactive):
$ sinfo -p interactive
Query configuration and limits of a specific partition (e.g. compute):
$ scontrol show partition compute
Check properties of a specific node (e.g. l10042):
$ scontrol show node l10042
Job Control Commands#
Cancel job with Slurm JobID 10815:
$ scancel 10815
Cancel all your jobs:
$ scancel -u $USER
With the additional option -i (interactive mode), Slurm asks for confirmation before canceling the job.
Display status information of a running job with Slurm JobID 12345:
$ sstat -a -j 12345
sstat provides various status information (e.g. CPU time, Virtual Memory (VM) usage, Resident Set Size (RSS), Disk I/O etc.) for running jobs. The metrics of interest can be specified using the option --format or -o (see the next example).
Display selected status information of the running job with Slurm JobID 12345:
$ sstat -o jobid%20,avecpu,avepages,maxrss,maxvmsize -j 12345
# For a list of all available metrics use the option --helpformat or look into the sstat man page:
$ sstat --helpformat
$ man sstat
Hold pending job with Slurm JobId 12345:
$ scontrol hold 12345
Release held job with Slurm JobId 12345:
$ scontrol release 12345
Accounting Commands#
Display all associations of your user account:
$ sacctmgr -s show user $USER
A user association refers to the relationship between a user and the resources they are authorized to use. It defines access to various project accounts, usage limits, available QOS (Quality Of Service) and other job-related policies. The above command can be used to query project accounts you are allowed to use to submit batch jobs on Levante.
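To narrow the output to the most relevant columns, the format option can be used, for example:
$ sacctmgr -s show user $USER format=user,account,qos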
Check job history for user b123456:
$ sacct -X -u b123456
Check job history (jobid, number of nodes, list of nodes, job state and exit code) for user b123456 in specified time period (e.g. March 2022):
$ sacct -X -u b123456 -o "jobid,nnodes,nodelist,state,exit" \
    -S 2022-03-01 -E 2022-03-31T23:59:59
Check memory usage for the completed job with Slurm JobID 12345:
$ sacct --duplicates -j 12345 \
    --format=JobID,JobName,MaxRSS,MaxRSSNode,MaxRSSTask,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask
Interactive Usage#
Resources for interactive sessions can be allocated using the Slurm salloc command. The following command, for example, will allocate 2 nodes from the compute partition for 30 minutes:
$ salloc --x11 --partition=compute --nodes=2 --time=00:30:00 --account xz0123
Once an allocation has been made, the salloc command starts a shell on the first allocated node. After a successful allocation, users can either run interactive processing on the compute node or execute srun from that shell to spawn their applications. For example:
$ srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=4 ./my_program
The interactive session is terminated by exiting the shell.
Requesting Specific Features for Nodes#
When asking Slurm to provide a dedicated set of nodes, the --constraint option can be used. This option is especially useful if your job needs nodes with a larger amount of memory than 256G. In general, we provide nodes with the memory features 256G, 512G, and 1024G. Different constraints can be combined using the & (AND) or | (OR) operator, e.g.:
#SBATCH --constraint="512G|1024G"
To actually get this memory, the job also needs the following directive:
#SBATCH --mem=0
To schedule jobs in the gpu partition only on nodes with 80 GB NVIDIA A100 GPUs, use
#SBATCH --constraint=a100_80
Please refer to the detailed hardware list to identify which node features are defined. Also note that the more you constrain the eligible nodes, the longer the job may wait until it is scheduled.
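Putting these directives together, a sketch of a job script requesting a large-memory node could look like this (account and program names are placeholders):
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=1
#SBATCH --constraint="512G|1024G"  # run only on 512G or 1024G nodes
#SBATCH --mem=0                    # make the node's full memory available
#SBATCH --time=00:30:00
srun ./my_program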