Using srun¶
Overview¶
With the SLURM srun command, users can spawn any kind of application, process or task inside a job allocation, or directly start executing a parallel job (and indirectly ask SLURM to create the appropriate allocation). This can be a shell command, any single- or multi-threaded executable in binary or script format, an MPI application, or a hybrid application with MPI and OpenMP. When no allocation options are defined with the srun command, the options from sbatch or salloc are inherited. srun should preferably be used either inside a job script submitted by sbatch or after calling salloc.
The allocation options of the srun command for job steps are (almost) the same as those of sbatch and salloc described in the SLURM introduction. The example command below spawns 48 tasks on 2 nodes (24 tasks per node) for 30 minutes:
$ srun -N 2 -n 48 -t 30 -A xz0123 ./my_small_test_job
You need to specify the project account (-A) to be charged for this job in the same manner as for the salloc and sbatch commands.
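The allocation can also be created interactively with salloc first; a subsequent srun then inherits the allocation options. A minimal sketch, reusing the example values from above:
$ salloc -N 2 -n 48 -t 30 -A xz0123
$ srun ./my_small_test_job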
Process and Thread Binding¶
MPI jobs¶
The mapping of processes (i.e. the placement of tasks on nodes and sockets) can be specified via the srun option --distribution (or -m). The syntax is as follows:
srun --distribution=<block|cyclic|arbitrary|plane=<options>[:block|cyclic]>
The first argument (before the ":") controls the distribution of ranks across nodes (i.e. whether ranks are placed block-wise or cyclically on successive nodes). The second (optional) distribution specification (after the ":") controls the distribution of ranks across sockets within a node (i.e. block-wise or cyclically on successive sockets).
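For illustration, the two calls below (a sketch using a hypothetical ./myapp binary) place ranks block-wise and cyclically across the allocated nodes, respectively:
srun --distribution=block ./myapp    # consecutive ranks fill one node before the next is used
srun --distribution=cyclic ./myapp   # ranks are assigned to the nodes in a round-robin fashion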
Process/task binding to cores and CPUs can be done via the srun option --cpu_bind. The syntax is:
srun --cpu_bind=[{quiet,verbose},]type
To bind tasks to physical cores, replace type by cores.
To bind tasks to logical CPUs / Hyper-Threads, replace type by threads.
For custom bindings use map_cpu:<list>, where <list> is a comma separated list of CPU IDs (0,1,2,…,23).
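A sketch of a custom binding (the CPU IDs and the ./myapp binary are placeholders):
# pin 4 tasks per node to the physical cores 0, 6, 12 and 18
srun --ntasks-per-node=4 --cpu_bind=verbose,map_cpu:0,6,12,18 ./myapp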
For details please take a look at the man page of the srun command or contact DKRZ user’s consultancy. In most cases use
srun --cpu_bind=verbose,cores --distribution=block:cyclic ./myapp
if you do not want to use Hyper-Threads and
srun --cpu_bind=verbose,threads --distribution=block:cyclic ./myapp
if you intend to use Hyper-Threads. You might also benefit from different task distributions than block:cyclic.
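For instance, a cyclic placement across nodes combined with a block placement within the sockets would read:
srun --cpu_bind=verbose,cores --distribution=cyclic:block ./myapp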
OpenMP jobs¶
Thread binding is accomplished via the Intel runtime library using the KMP_AFFINITY environment variable. The syntax is
export KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]
with
modifier
– verbose: gives detailed output on how the binding was done
– granularity=core: reserves full physical cores (i.e. two logical CPUs) to run threads on
– granularity=thread/fine: reserves logical CPUs / Hyper-Threads to run threads
type
– compact: places the threads as close to each other as possible
– scatter: distributes the threads as evenly as possible across the entire allocation
permute – controls which levels are most significant when sorting the machine topology map, i.e. 0=CPUs (default), 1=cores, 2=sockets/LLC
offset – indicates the starting position for thread assignment
For details please take a look at the Intel manuals or contact DKRZ user’s consultancy. In most cases use
export KMP_AFFINITY=granularity=core,compact,1
if you do not want to use Hyper-Threads and
export KMP_AFFINITY=granularity=thread,compact,1
if you intend to use Hyper-Threads. You might also try scatter instead of compact placement to benefit from the bigger L3 cache.
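A scatter placement on physical cores would, for example, read:
export KMP_AFFINITY=granularity=core,scatter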
Hybrid MPI/OpenMP jobs¶
In this case you need to combine the two binding methods mentioned above. Keep in mind that we are using --threads-per-core=2 throughout the cluster. Hence you need to specify the number of logical CPUs per process/task on the basis of Hyper-Threads even if you do not intend to use Hyper-Threads! Examples on how to achieve correct binding using a full node can be found in the HLRE-3 MISTRAL User's Manual.
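As an illustration only (not taken from the manual), the fragment below sketches a hybrid job on one Haswell compute node with 24 physical cores (48 Hyper-Threads), running 4 MPI tasks with 6 OpenMP threads each; the layout and the ./hybrid_app binary are assumptions:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12       # 6 physical cores per task x 2 Hyper-Threads
#SBATCH --time=00:30:00
#SBATCH --account=xz0123
export OMP_NUM_THREADS=6                        # threads actually used per task
export KMP_AFFINITY=granularity=core,compact,1  # pin threads to physical cores
srun --cpu_bind=verbose,cores --distribution=block:cyclic ./hybrid_app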
Multiple Program Multiple Data¶
SLURM supports the MPMD (Multiple Program Multiple Data) execution model that
can be used for MPI applications, where multiple executables can have one
common MPI_COMM_WORLD communicator. In order to use MPMD the user has to set
the srun option --multi-prog <filename>. This option expects a configuration text file as an argument, in contrast to the SPMD (Single Program Multiple Data) case where srun has to be given the executable.
Each line of the configuration file can have two or three fields separated by spaces; the format is
<list of task ranks> <executable> [<input arguments>]
The first field defines a comma separated list of ranks for the MPI tasks that will be spawned. Possible values are integer numbers or ranges of numbers. The second field is the path/name of the executable. The third field is optional and specifies the arguments of the program.
The following example provides a job script template for the execution of a coupled atmosphere-ocean model using 8 nodes:
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:30:00
#SBATCH --exclusive
#SBATCH --account=xz0123
# Domain decomposition for atmosphere model
ECHAM_NPROCA=6
ECHAM_NPROCB=16
# Domain decomposition for ocean model
MPIOM_NPROCX=12
MPIOM_NPROCY=8
# Paths to executables
ECHAM_EXECUTABLE=../bin/echam6
MPIOM_EXECUTABLE=../bin/mpiom.x
# Derived values useful for creation of MPMD configuration file
(( ECHAM_NCPU = ECHAM_NPROCA * ECHAM_NPROCB ))
(( MPIOM_NCPU = MPIOM_NPROCX * MPIOM_NPROCY ))
(( NCPU = ECHAM_NCPU + MPIOM_NCPU ))
(( MPIOM_LAST_CPU = MPIOM_NCPU - 1 ))
(( ECHAM_LAST_CPU = NCPU - 1 ))
# Create MPMD configuration file
cat > mpmd.conf <<EOF
0-${MPIOM_LAST_CPU} $MPIOM_EXECUTABLE
${MPIOM_NCPU}-${ECHAM_LAST_CPU} $ECHAM_EXECUTABLE
EOF
# Run MPMD parallel program using Intel MPI
module load intelmpi
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun -l --cpu_bind=verbose,cores --multi-prog mpmd.conf
Frequency Scaling¶
The Intel Haswell and Broadwell processors allow for CPU frequency scaling, which in general enables the operating system to scale the CPU frequency up or down in order to save power. CPU frequencies can be scaled automatically depending on the system load or manually by userspace programs. This is done via power schemes for the CPU, so-called governors, of which only one may be active at a time. The default governor is "ondemand", which allows the operating system to scale down the CPU frequency on the compute nodes to 1.2 GHz if they are in an idle state. The user can set the governor to "userspace" and specify a fixed CPU frequency instead. To do so, the batch job needs to define the desired behaviour via the environment variable SLURM_CPU_FREQ_REQ or via the srun option --cpu-freq.
To set the CPU frequency to the nominal value (2.5 GHz for Haswell in the compute partition and 2.1 GHz for Broadwell in the compute2 partition) use:
export SLURM_CPU_FREQ_REQ=HighM1
You might also request a different frequency (that must be specified in kHz, e.g. 2100000) or enable automatic frequency scaling depending on the workload by setting
export SLURM_CPU_FREQ_REQ=ondemand
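The frequency can also be requested per job step via the srun option --cpu-freq, for example (with a hypothetical ./myapp binary):
srun --cpu-freq=HighM1 ./myapp     # nominal frequency (highest minus one)
srun --cpu-freq=2100000 ./myapp    # fixed frequency of 2.1 GHz, given in kHz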
On the DKRZ HPC system mistral we are using SLURM plugins to configure all cores to run at the fixed nominal frequency depending on the chosen CPU (Haswell or Broadwell) if you are using srun to execute a parallel job. Therefore, you normally do not need to explicitly set the frequency.