SLURM Introduction#

This page gives an overview of the user commands provided by SLURM and explains how to use them to run jobs on Levante. A concise cheat sheet for SLURM can be downloaded here. A comparison of commands for different job scheduling systems can be found here.

SLURM Commands#

SLURM offers a variety of user commands for all necessary actions concerning job management. These commands give users a rich interface to allocate resources, query job status, control jobs, manage accounting information, and simplify their work with some utility commands. For examples of how to use these commands, see the section SLURM Command Examples.

  • sinfo shows information about all partitions and nodes managed by SLURM as well as about the general system state. It has a wide variety of filtering, sorting, and formatting options.

  • squeue queries the list of pending and running jobs. By default it reports the list of pending jobs sorted by priority and the list of running jobs sorted separately according to the job priority. The most relevant job states are running (R), pending (PD), completing (CG), completed (CD) and cancelled (CA). The TIME field shows the actual job execution time. The NODELIST (REASON) field indicates on which nodes the job is running or the reason why the job is pending. Typical reasons for pending jobs are waiting for resources to become available (Resources) and queuing behind a job with higher priority (Priority).

  • sbatch submits a batch script. The script will be executed on the first node of the allocation. The working directory coincides with the directory from which sbatch was invoked. Within the script one or multiple srun commands can be used to create job steps and execute parallel applications.

  • scancel cancels a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

  • salloc requests interactive jobs/allocations. When the job starts, a shell (or other program specified on the command line) is started on the allocated node. The allocation is released when the user exits the shell or cancels the job.

  • srun initiates parallel job steps within a job or starts an interactive job.

  • scontrol (primarily used by administrators) provides some functionality for users to manage jobs or query information about the system configuration, such as nodes, partitions and jobs.

  • sstat queries near-real-time status information related to CPU, task, node, RSS and virtual memory for a running job.

  • sacct retrieves accounting information about jobs and job steps. For completed jobs, sacct queries the accounting database.

  • sacctmgr (primarily used by administrators) queries information about accounts and other accounting data.

Allocating Resources with SLURM#

A job allocation, which is a set of computing resources (nodes or cores) assigned to a user’s request for a specified amount of time, can be created using the SLURM salloc, sbatch or srun commands. The salloc and sbatch commands make resource allocations only. The srun command launches parallel tasks and implicitly creates a resource allocation if not started within one.

The usual way to allocate resources and execute a job on Levante is to write a batch script and submit it to SLURM with the sbatch command. A batch script is a shell script consisting of two parts: resource requests and job steps. Resource requests specify, for example, the number of nodes needed to execute the job and the duration of the job. Job steps are the user’s tasks that must be executed. Resource requests and other SLURM submission options are prefixed by #SBATCH directives and must precede any executable commands in the batch script. For example:

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:30:00

# Begin of section with executable commands
set -e
ls -l
srun ./my_program

The script itself is regarded by SLURM as the first job step and is (serially) executed on the first compute node in the job allocation. To execute parallel (MPI) tasks, users call the SLURM srun command within the script, which initiates a new job step. Parallel programs can be executed as job steps in any configuration within the job allocation: a job step can use all allocated resources, or several job steps (created via multiple srun calls) can each use a subset of the allocated resources.
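
For illustration, a batch script with several job steps sharing one allocation might look as follows (the program names my_preproc and my_model are placeholders, and the resource sizes are only an example):

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:30:00

# Job step 1: serial preprocessing using a single task
srun --ntasks=1 --ntasks-per-node=1 ./my_preproc

# Job step 2: parallel execution on all allocated resources
srun ./my_model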

The following table describes the most common or required allocation and submission options that can be defined in a batch script (long and short options are listed):

| sbatch option | default value | description |
|---|---|---|
| --nodes=<number>, -N <number> | 1 | Number of nodes for the allocation |
| --ntasks=<number>, -n <number> | 1 | Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given |
| --ntasks-per-node=<number> | 1 | Number of tasks per node |
| --cpus-per-task=<number>, -c <number> | 1 | Number of threads (logical CPUs) per task. Used for OpenMP or hybrid jobs |
| --mem=<size>[units] | partition dependent | Real memory required per node. Set to 0 to request the entire memory of a node |
| --output=<path>/<file pattern>, -o <path>/<file pattern> | slurm-%j.out, where %j is replaced by the job ID | Standard output file |
| --error=<path>/<file pattern>, -e <path>/<file pattern> | slurm-%j.out, where %j is replaced by the job ID | Standard error file |
| --time=<walltime>, -t <walltime> | partition dependent | Requested walltime limit for the job; possible time formats are [hours:]minutes[:seconds] (e.g. 20, 01:20, 01:20:30) and days-hours[:minutes][:seconds] (e.g. 2-0, 1-5:20, 1-5:20:30) |
| --partition=<name>, -p <name> | none | Partition to run the job in |
| --constraint=<list>, -C <list> | none | Node features requested for the job. See configuration for available features |
| --mail-user=<email> | email associated with the DKRZ user account | Email address for notifications |
| --mail-type=<mode> | | Event types for email notifications. Possible values are BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50 |
| --job-name=<jobname>, -J <jobname> | job script’s name | Job name |
| --account=<project account>, -A <project account> | none | Project account that should be charged |
| --exclude=<nodelist>, -x <nodelist> | | Exclude specified nodes from the job allocation |
| --nodelist=<nodelist>, -w <nodelist> | | Request specified nodes for the job allocation (if necessary, additional nodes will be added to fulfill the requested number of nodes) |
| --requeue or --no-requeue | no-requeue | Specifies whether the batch job should be requeued after a node failure. Caution: if a job is requeued, the whole batch script is executed from its beginning |
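
For illustration, the output, error and notification options from the table might be used like this in a job script header (job name, file names and the email address are placeholders):

#SBATCH --job-name=my_experiment
#SBATCH --output=my_experiment.%j.log
#SBATCH --error=my_experiment.%j.err
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=user@example.org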

The complete list of parameters can be found in the sbatch man page:

man sbatch

Batch scripts are submitted using the SLURM sbatch command:

sbatch [OPTIONS] <job script>

Options passed on the sbatch command line override the corresponding #SBATCH settings in the script. On success, sbatch writes the job ID to standard output.
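
For example, to override the walltime limit and the job name defined in the script at submission time (the script name is a placeholder):

sbatch --time=01:00:00 --job-name=test_run my_job_script.sh

On success, sbatch typically responds with a line like "Submitted batch job <jobid>".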

Note

On Levante, specifying -A or --account is necessary to submit a job, otherwise the submission will be rejected. You can query the accounts for which job submission is allowed using the SLURM sacctmgr command:

sacctmgr -s show user name=$USER

Furthermore, you have to specify the partition on which the job will run using the -p or --partition option of sbatch. Otherwise the submission will be rejected.

Note

Setting SLURM options via environment variables will override any matching options set in a batch script, and command line options will override any matching environment variable.
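
Most sbatch options have a corresponding input environment variable (e.g. SBATCH_ACCOUNT for --account), so the precedence described above can be seen as follows (the account names and script name are placeholders):

export SBATCH_ACCOUNT=xz0123                # overrides a matching --account line in the batch script
sbatch --account=yz9876 my_job_script.sh    # a command line option in turn overrides the environment variable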

Remember the difference between options for selection, allocation and distribution in SLURM. Selection and allocation work with sbatch, but task distribution and binding should be specified directly with srun (within the batch script). The following four steps give an overview.

  1. Resource selection, e.g.

#SBATCH --nodes=2
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=64
#SBATCH --mem=500G

which can only be satisfied by those nodes in the Levante cluster that have at least 512 GB of memory.

  2. Resource allocation, e.g.

#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=2

which defines how many MPI tasks will be used on the selected resources.

  3. Start the application relying on the sbatch options only; task binding and distribution are specified with srun, e.g.

srun --cpu_bind=cores --distribution=block:cyclic ./my_program

  4. Start the application using only a subset of the allocated resources. In this case you need to pass all relevant allocation options (like --ntasks or --ntasks-per-node) to srun again, e.g.

srun --ntasks=2 --ntasks-per-node=1 --cpu_bind=cores --distribution=block:cyclic ./my_program

All environment variables set at the time of submission are propagated to SLURM jobs. This default behaviour can be changed with some options of the allocation commands (like --export for sbatch or srun). Users can load modules and prepare the desired environment before job submission, and this environment is then passed to the submitted jobs. Nevertheless, it is good practice to include module commands in job scripts in order to have full control over the jobs' environment.
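
For example (the module name is a placeholder), a job script can set up its environment explicitly instead of relying on what was loaded at submission time:

# Inside the batch script: start from a clean environment and load what the job needs
module purge
module load openmpi
srun ./my_program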

SLURM Command Examples#

Query Commands#

Normally, jobs pass through several states during their life cycle. Typical job states from submission until completion are: PENDING (PD), RUNNING (R), COMPLETING (CG) and COMPLETED (CD). Further job state codes are described in the squeue man page. Below, some examples of SLURM query commands are provided.

  • List all jobs submitted to SLURM:

    squeue
    
  • List all jobs submitted by you:

    squeue -u $USER
    
  • Check available partitions and nodes:

    sinfo
    

    The sinfo command reports the states of the partitions and the nodes. The partitions may be in state UP, DOWN or INACTIVE. The UP state means that a partition will accept new submissions and the jobs will be scheduled. The DOWN state allows submissions to a partition but the jobs will not be scheduled. The INACTIVE state means that submissions are not allowed. The nodes can also be in various states, such as alloc (allocated), comp (completing), down, idle, maint, resv (reserved) etc. A description of all node states can be found in the sinfo man page.

  • List partition state summary

    sinfo -s
    

    The column NODES(A/I/O/T) shows the number of nodes in the states “allocated/idle/other/total” for each SLURM partition.

  • Query configuration and limits for one specific partition (here compute):

    scontrol show partition compute
    
  • Check one node (here l10042):

    scontrol show node l10042
    
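  • Show selected columns (job ID, name, state, elapsed time, number of nodes and nodelist/reason) for your own jobs using squeue output formatting; the field widths chosen here are arbitrary:

    squeue -u $USER -o "%.10i %.20j %.8T %.10M %.6D %R"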

Job control#

  • Cancel job with SLURM JobId 10815:

    scancel 10815
    
  • Cancel all your jobs:

    scancel -u $USER
    

    With the additional option -i (interactive mode) SLURM asks for confirmation before canceling the job.

  • Display status information of a running job 14242:

    sstat -j 14242
    

    sstat provides various status information (e.g. CPU time, Virtual Memory (VM) usage, Resident Set Size (RSS), Disk I/O etc.) for running jobs. The metrics of interest can be specified using the option --format or -o (see next example).

  • Display selected status information of the running job 14242:

    sstat -o jobid,avecpu,avepages,maxrss,maxvmsize -j 14242
    
    # For a list of all available metrics, use the option --helpformat or look into the sstat man page
    sstat --helpformat
    man sstat
    
  • Hold pending job with SLURM JobId 10815:

    scontrol hold 10815
    
  • Release job with SLURM JobId 10815:

    scontrol release 10815
    

Accounting Commands#

  • Check user association (here for user b123456):

    sacctmgr show assoc where user=b123456
    # or
    sacctmgr -s show user name=b123456
    
  • Check job history for user b123456:

    sacct -X -u b123456
    
  • Check job history (jobid, number of nodes, list of nodes, job state and exit code) for user b123456 in specified time period (March 2022):

    sacct -X -u b123456 -o "jobid,nnodes,nodelist,state,exit" -S 2022-03-01 -E 2022-03-31T23:59:59
    
  • Check memory usage for the completed job with the jobid 12345:

    sacct --duplicates -j 12345 --format=JobID,JobName,MaxRSS,MaxRSSNode,MaxRSSTask,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask
    

Interactive Usage#

Interactive sessions can be allocated using the SLURM salloc command. The following command for example will allocate 2 nodes (in the compute partition) for 30 minutes:

salloc --partition=compute --nodes=2 --time=00:30:00 --account xz0123

Once the allocation has been made, salloc starts a shell on the allocated node. Users can then either run interactive processing on the compute node or execute srun from that shell to spawn their applications. For example:

srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=4 ./my_program

The interactive session is terminated by exiting the shell.

Requesting specific features for nodes#

When asking SLURM to provide a dedicated set of nodes, one can use the --constraint option. This option is especially useful if your job needs nodes with more than 256 GB of memory. In general we provide nodes with the memory features 256G, 512G, and 1024G. Different constraints can be combined using the & (AND) or | (OR) operator, e.g.:

#SBATCH --constraint="512G|1024G"

To actually get this memory, the job also needs the following directive:

#SBATCH --mem=0
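
Putting both directives together, a job header requesting one of the large-memory nodes could, for example, look like this:

#SBATCH --partition=compute
#SBATCH --constraint="512G|1024G"
#SBATCH --mem=0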

To schedule jobs in the gpu partition only to nodes with 80 GB NVIDIA A100 GPUs, use

#SBATCH --constraint=a100_80

Please refer to the detailed hardware list to identify which node features are defined.