Slurm Introduction#
This page gives an overview of the user commands provided by Slurm and explains how to use them to run jobs on Levante. A concise cheat sheet for Slurm can be downloaded here.
Slurm Commands#
Slurm offers a rich suite of user commands to allocate resources, launch parallel applications, query job status, control jobs, inspect accounting information, etc. The most commonly used commands are:
sinfo: show information about all partitions and nodes managed by Slurm as well as about general system state. It has a wide variety of filtering, sorting, and formatting options.
squeue: query the list of pending and running jobs.
sbatch: submit a batch script.
salloc: request an (interactive) job allocation. When the job starts, a shell (or other program specified on the command line) is started on the allocated node. The allocation is released when the user exits the shell or cancels the job.
srun: initiate parallel job steps within a job or start an interactive job.
scancel: cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
scontrol: (primarily used by the administrators) useful to manage job settings or get some information about the system entities such as nodes, partitions, jobs, and configurations.
sstat: query near-realtime status information about e.g. CPU and memory usage for a running job.
sacct: retrieve accounting information about jobs and job steps. For completed jobs, sacct queries the accounting database.
sacctmgr: (primarily used by the administrators) query account information.
Examples for usage of the above commands along with more specific explanations can be found in section Slurm Command Examples below.
Allocating Resources with Slurm#
A job allocation, which is a set of computing resources (nodes, cores, and/or GPUs) assigned to a user’s request for a specified amount of time, can be created using the Slurm salloc, sbatch or srun commands. The salloc and sbatch commands make resource allocations only. The srun command launches parallel tasks, hence it replaces mpirun and mpiexec in Slurm installations, and implicitly creates a resource allocation if not started within one. In addition, srun distributes tasks across the allocated resources and binds individual tasks to cores and GPUs.
The most common way to allocate resources and execute a job on Levante is to write a batch script and submit it to Slurm with the sbatch command. The batch script is a shell script consisting of two parts: resource requests and job steps. Resource requests are specifications such as the number of nodes needed to execute the job, the time duration of the job, etc. Job steps are user-defined tasks to be executed using those resources. The resource requests and other Slurm submission options are prefixed by #SBATCH directives and must precede any executable commands in the batch script. For example:
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:30:00
#SBATCH --output=log.%j.txt
# Begin of section with executable commands
set -e
ls -l
srun ./my_program
The shell script itself is regarded by Slurm as job step 0 and is (serially) executed on the first compute node in the job allocation. To execute parallel (MPI) tasks, users call the Slurm srun command within the script, which initiates a new job step. Parallel programs can be executed as job steps in any configuration within the job allocation: a single job step can use all allocated resources, or several job steps (created via multiple srun calls) can use subsets of the allocated resources and run simultaneously.
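For illustration, the sketch below runs two job steps concurrently, one per node of a two-node allocation (the program names ./model_a and ./model_b are placeholders):
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=2
#SBATCH --time=00:30:00
# Start two job steps in the background, one per node. With recent Slurm
# versions the --exact option may be needed so that the steps do not block
# each other by implicitly claiming all resources.
srun --nodes=1 --ntasks=128 --exact ./model_a &
srun --nodes=1 --ntasks=128 --exact ./model_b &
wait   # do not finish the batch job before both steps are done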
By default, the current working directory of a Slurm job is the directory from which sbatch was invoked, which is not necessarily the directory in which the job script resides.
Task binding to CPUs/cores and an appropriate task distribution are essential to achieving good performance on clusters like Levante. For this purpose, the srun options --cpu-bind and --distribution should be used. The --cpu-bind option controls whether tasks are bound (pinned) to specific CPUs, whole cores, entire sockets, or not at all. The --distribution option controls the distribution of allocated CPUs across nodes:sockets:cores for binding to tasks. It is essential for minimizing communication overhead when running parallel jobs. Usually, the two-part form nodes:sockets suffices. The command below, for example,
srun --cpu-bind=cores --distribution=block:cyclic ./my_program
will bind the tasks of ./my_program to dedicated cores, thus preventing the operating system from moving them to other cores during program execution. The first distribution method (block) ensures that consecutive tasks are placed on the same node until the maximum number of tasks per node is reached. The second distribution method (cyclic) means that within a node, tasks are distributed in a round-robin (cyclic) manner between sockets.
When the option --cpu-bind=verbose is set, srun reports the selected distribution and binding of tasks to CPUs. The CPU mask, which is written to standard error, can then be visualized on the Slurm Binding Website provided for Levante by DKRZ to check whether the actual binding matches expectations.
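For example, the following command applies the binding described above and additionally reports the resulting CPU masks:
$ srun --cpu-bind=verbose,cores --distribution=block:cyclic ./my_program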
The following table lists the most common job allocation and submission options (in long and short form) that can be used in a batch script or as command line options:
job allocation option | default value | description
---|---|---
--account=<project account>, -A <project account> | none | Project account that should be charged. This option is mandatory on Levante.
--partition=<name>, -p <name> | none | Partition to run the job. This option is mandatory on Levante.
--job-name=<jobname>, -J <jobname> | job script’s name | Job name
--nodes=<number>, -N <number> | 1 | Number of nodes for the allocation
--ntasks=<number>, -n <number> | 1 [1] | Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given.
--ntasks-per-node=<num> | 1 [1] | Number of tasks per node
--cpus-per-task=<number>, -c <number> | 1 [1] | Number of threads (logical CPUs) per task. Used for OpenMP or hybrid jobs.
--mem=<size>[units] | partition dependent | Real memory required per node. Set to 0 to request all memory of a node.
--output=<path>/<file pattern>, -o <path>/<file pattern> | slurm-%j.out | Standard output file
--error=<path>/<file pattern>, -e <path>/<file pattern> | merged with standard output | Standard error file
--time=<walltime>, -t <walltime> | partition dependent | Requested walltime limit for the job; possible time formats are [hours:]minutes[:seconds] (e.g. 20, 01:20, 01:20:30) and days-hours[:minutes][:seconds] (e.g. 2-0, 1-5:20, 1-5:20:30)
--constraint=<list>, -C <list> | none | Node features requested for the job. See configuration for available features.
--mail-user=<email> | email associated with the DKRZ user account | Email address for notifications
--mail-type=<type> | – | Event types for email notifications. Possible values are BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_[90,80,50]
--exclude=<nodelist>, -x <nodelist> | – | Exclude specified nodes from the job allocation
--nodelist=<nodelist>, -w <nodelist> | – | Request specified nodes for the job allocation (if necessary, additional nodes will be added to fulfill the requirement for the number of nodes)
--requeue or --no-requeue | no-requeue | Specifies whether the batch job should be requeued after a node failure. Caution: if a job is requeued, the whole batch script is executed from its beginning
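For illustration, several of these options combined in one job header (the account, job name and output file name are placeholders):
#!/bin/bash
#SBATCH --account=xz0123           # placeholder project account
#SBATCH --partition=compute
#SBATCH --job-name=myexp
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL           # notify by email only if the job fails
#SBATCH --output=myexp.%j.log      # %j is replaced by the JobID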
Note
On Levante, the -A or --account option must be specified to submit a job; otherwise the submission will be rejected. You can query the accounts for which job submission is allowed using the Slurm sacctmgr command:
$ sacctmgr -s show user name=$USER
Furthermore, you must specify the partition on which the job will run using either the -p or --partition option. Otherwise, the submission will also be rejected.
The complete list of parameters can be found in the sbatch man page:
$ man sbatch
Batch scripts are submitted using the Slurm sbatch command:
$ sbatch [OPTIONS] <job script>
Command line options passed to the sbatch command override the corresponding #SBATCH settings in the job script. On success, sbatch writes the JobID to standard output.
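For example, to override the walltime limit set in the script at submission time (my_job_script.sh is a placeholder name):
$ sbatch --time=01:00:00 my_job_script.sh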
Note
Setting Slurm options via environment variables will override any matching options set in a batch script, and command line options will override any matching environment variable.
All environment variables set at the time of submission are propagated to Slurm jobs. This default behaviour can be changed with some options of the allocation commands (like --export for sbatch or srun). Users can load modules and prepare the desired environment before job submission, and this environment will then be passed to the submitted jobs. Nevertheless, it is good practice to include the necessary module commands in job scripts, in order to explicitly configure the environment of the jobs and provide documentation for yourself and colleagues. For maximum reproducibility, we recommend starting the script with module purge and following up with loading all required environment modules.
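A minimal sketch of this pattern (the module names are placeholders; the modules actually required depend on how my_program was built):
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=1
#SBATCH --time=00:30:00
module purge                # start from a clean environment
module load gcc openmpi     # placeholder module names
srun ./my_program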
Slurm Command Examples#
Query Commands#
List all jobs submitted to Slurm:
$ squeue
This command shows an overview of submitted jobs. Apart from JobID, partition, job name, user name and number of nodes, the command shows the job state, the running time, and the node list for running jobs, or the reason why a job is pending. Typical job states from submission until completion are: PENDING (PD), RUNNING (R), COMPLETING (CG), COMPLETED (CD), and CANCELLED (CA). Some usual reasons for pending jobs are:
Job is waiting for resources to become available (Resources)
Higher priority jobs are occupying the queue (Priority)
The maximum number of jobs allowed for a user within the association was exceeded (AssocMaxJobsLimit)
The specified job run-time exceeds the limit in place for the selected partition (PartitionTimeLimit)
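To list only your pending jobs together with the reason shown in the last column, filter by job state:
$ squeue --me -t PENDING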
List all jobs submitted by you:
$ squeue -u $USER
$ squeue --me
$ squeue -o "%.18i %.9P %.20j %.8u %.8a %.2t %.10M %.6D %R" --me
Check all available partitions and nodes:
$ sinfo
The sinfo command reports the states of the partitions and the nodes. The partitions may be in state UP, DOWN or INACTIVE. The UP state means that a partition will accept new submissions and the jobs will be scheduled. The DOWN state allows submissions to a partition, but the jobs will not be scheduled. The INACTIVE state means that submissions are not allowed. The nodes can also be in various states, such as alloc (allocated), comp (completing), down, idle, maint (maintenance), resv (reserved) etc. A description of all node states can be found in the sinfo man page.
List state summary for all partitions:
$ sinfo -s
The column NODES(A/I/O/T) shows the number of nodes in the states allocated/idle/other/total for each Slurm partition.
Query information about the nodes in the specified partition (e.g. interactive):
$ sinfo -p interactive
Query configuration and limits of a specific partition (e.g. compute):
$ scontrol show partition compute
Check properties of a specific node (e.g. l10042):
$ scontrol show node l10042
Job Control Commands#
Cancel job with Slurm JobID 10815:
$ scancel 10815
Cancel all your jobs:
$ scancel -u $USER
With the additional option -i (interactive mode), Slurm asks for confirmation before canceling the job.
Display status information of a running job with Slurm JobID 12345:
$ sstat -a -j 12345
sstat provides various status information (e.g. CPU time, Virtual Memory (VM) usage, Resident Set Size (RSS), Disk I/O etc.) for running jobs. The metrics of interest can be specified using the option --format or -o (see the next example).
Display selected status information of the running job with Slurm JobID 12345:
$ sstat -o jobid%20,avecpu,avepages,maxrss,maxvmsize -j 12345
# For a list of all available metrics use the option --helpformat or look into the sstat man page:
$ sstat --helpformat
$ man sstat
Hold pending job with Slurm JobId 12345:
$ scontrol hold 12345
Release held job with Slurm JobId 12345:
$ scontrol release 12345
Accounting Commands#
Display all associations of your user account:
$ sacctmgr -s show user $USER
A user association refers to the relationship between a user and the resources they are authorized to use. It defines access to various project accounts, usage limits, available QOS (Quality Of Service) and other job-related policies. The above command can be used to query project accounts you are allowed to use to submit batch jobs on Levante.
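To narrow the output to the most relevant columns, the format option can be used, for example:
$ sacctmgr -s show user $USER format=user,account,qos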
Check job history for user b123456:
$ sacct -X -u b123456
Check job history (jobid, number of nodes, list of nodes, job state and exit code) for user b123456 in specified time period (e.g. March 2022):
$ sacct -X -u b123456 -o "jobid,nnodes,nodelist,state,exit" \
    -S 2022-03-01 -E 2022-03-31T23:59:59
Check memory usage for the completed job with Slurm JobID 12345:
$ sacct --duplicates -j 12345 \
    --format=JobID,JobName,MaxRSS,MaxRSSNode,MaxRSSTask,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask
Interactive Usage#
Resources for interactive sessions can be allocated using the Slurm salloc command. The following command, for example, will allocate 2 nodes from the compute partition for 30 minutes:
$ salloc --x11 --partition=compute --nodes=2 --time=00:30:00 --account xz0123
Once an allocation has been made, the salloc command starts a shell on the first allocated node. After a successful allocation, users can either run interactive processing on the compute node or execute srun from that shell to spawn their applications. For example:
$ srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=4 ./my_program
The interactive session is terminated by exiting the shell.
Requesting Specific Features for Nodes#
When asking Slurm to provide a dedicated set of nodes, the --constraint option can be used. This option is especially useful if your job needs nodes with a larger amount of memory than 256G. In general, we provide nodes with the memory features 256G, 512G, and 1024G. Different constraints can be combined using the & (AND) or | (OR) operator, e.g.:
#SBATCH --constraint="512G|1024G"
To actually get this memory, the job also needs the following directive:
#SBATCH --mem=0
To schedule jobs in the gpu partition only on nodes with 80 GB NVIDIA A100 GPUs, use
#SBATCH --constraint=a100_80
Please refer to the detailed hardware list to identify which node features are defined. Also note that the more you constrain the eligible nodes, the longer the job may wait until it is scheduled.
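Putting these directives together, a sketch of a job script requesting a large-memory node could look like this (account and program names are placeholders):
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=1
#SBATCH --constraint="512G|1024G"  # run only on 512G or 1024G nodes
#SBATCH --mem=0                    # make the node's full memory available
#SBATCH --time=00:30:00
srun ./my_program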