Example Batch Scripts#

Two-way simultaneous multithreading (SMT) is enabled on all Levante nodes, i.e. the operating system recognizes 256 logical CPUs per node, while there are only 128 physical cores. In most cases it is advisable not to use the simultaneous threads for the application, but to leave them to the operating system.
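
A quick way to verify this is to inspect the CPU topology on a compute node, for instance via a small one-task allocation. The following command is only a sketch (the account name is a placeholder, as in the scripts below); lscpu should report 256 logical CPUs and 2 threads per core, and nproc prints the number of logical CPUs:

srun --partition=compute --account=xz0123 --time=00:05:00 \
  bash -c 'lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core"; nproc'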

Examples of batch scripts for different use cases are provided below.

MPI job without simultaneous multithreading#

The overall structure of the batch script does not depend on whether IntelMPI or OpenMPI (or any other MPI implementation) is used. Specific environment variables should be set in order to fine-tune the chosen MPI library. In particular, the parallel application should always be started with the srun command instead of invoking mpirun, mpiexec or others.

In the following example, 12*128 = 1536 cores are used to execute a parallel program.

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=compute
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=128
#SBATCH --exclusive
#SBATCH --time=00:30:00
#SBATCH --mail-type=FAIL
#SBATCH --account=xz0123
#SBATCH --output=my_job.%j.out

# limit stacksize ... adjust to your program's needs
# and core file size
ulimit -s 204800
ulimit -c 0

# Replace this block according to https://docs.dkrz.de/doc/levante/running-jobs/runtime-settings.html#mpi-runtime-settings
echo "Replace this block according to  https://docs.dkrz.de/doc/levante/running-jobs/runtime-settings.html#mpi-runtime-settings"
exit 23
# End of block to replace

# Use srun (not mpirun or mpiexec) command to launch
# programs compiled with any MPI library
srun -l --cpu_bind=verbose --hint=nomultithread \
  --distribution=block:cyclic ./myprog

Note: --hint=nomultithread cannot be used in conjunction with --ntasks-per-core, --threads-per-core and --cpu-bind (--cpu-bind=verbose is allowed, though).
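
If you do need one of those options, a possible alternative is to avoid the simultaneous threads by restricting each task to one hardware thread per core and binding tasks to physical cores explicitly. This is only a sketch, not an official recommendation:

# Sketch: avoid SMT without --hint=nomultithread by using at most one
# hardware thread per core and binding each task to a physical core
srun -l --cpu-bind=verbose,cores --threads-per-core=1 \
  --distribution=block:cyclic ./myprog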

Please also read the section on compiling and linking MPI programs on Levante.

Hybrid (MPI/OpenMP) job without simultaneous multithreading#

The following example will allocate 4 nodes from the compute partition for 1 hour. The job will launch 32 MPI ranks per node and 4 OpenMP threads per rank, so that all 128 physical cores of each node are used.

#!/bin/bash
#SBATCH --job-name=my_job      # Specify job name
#SBATCH --partition=compute    # Specify partition name
#SBATCH --nodes=4              # Specify number of nodes
#SBATCH --ntasks-per-node=32   # Specify number of (MPI) tasks on each node
#SBATCH --time=01:00:00        # Set a limit on the total run time
#SBATCH --mail-type=FAIL       # Notify user by email in case of job failure
#SBATCH --account=xz0123       # Charge resources on this project account
#SBATCH --output=my_job.o%j    # File name for standard output

# Bind your OpenMP threads
export OMP_NUM_THREADS=4
export KMP_AFFINITY="verbose,granularity=fine,scatter"
export KMP_LIBRARY="turnaround"


# limit stacksize ... adjust to your program's needs
# and core file size
ulimit -s 204800
ulimit -c 0
export OMP_STACKSIZE=128M

# Replace this block according to https://docs.dkrz.de/doc/levante/running-jobs/runtime-settings.html#mpi-runtime-settings
echo "Replace this block according to https://docs.dkrz.de/doc/levante/running-jobs/runtime-settings.html#mpi-runtime-settings"
exit 23
# End of block to replace


# Use srun (not mpirun or mpiexec) command to launch
# programs compiled with any MPI library
srun -l --cpu_bind=verbose --cpus-per-task=4 --hint=nomultithread \
  --distribution=block:cyclic:block ./myprog
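
The KMP_* variables above are specific to the Intel (and LLVM) OpenMP runtime. If your program is built with GCC, the standard OpenMP environment variables can be used instead; the values below are a sketch and may need tuning for your application:

# Portable OpenMP thread binding (sketch, for non-Intel OpenMP runtimes)
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=spread   # spread threads over the available places
export OMP_PLACES=cores       # one place per physical core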

Serial job#

The following script allocates resources in the shared partition to execute a non-MPI program.

#!/bin/bash
#SBATCH --job-name=my_job      # Specify job name
#SBATCH --partition=shared     # Specify partition name
#SBATCH --mem=10G              # Specify amount of memory needed
#SBATCH --time=00:30:00        # Set a limit on the total run time
#SBATCH --mail-type=FAIL       # Notify user by email in case of job failure
#SBATCH --account=xz0123       # Charge resources on this project account
#SBATCH --output=my_job.o%j    # File name for standard output

set -e
ulimit -s 204800

module load python3

# Execute serial programs, e.g.
python -u /path/to/myscript.py

In case your serial job needs more memory (i.e. its execution is terminated with an OUT_OF_MEMORY status), you need to increase the amount of memory requested via --mem. Slurm automatically adjusts the number of CPUs allocated to the job to satisfy its memory request. In the above example, --mem=10G causes Slurm to actually allocate 12 logical CPUs, since the shared partition has a limit of 940 MB of memory per CPU (10 GiB at 940 MB per CPU corresponds to 11 CPUs, rounded up to 12 because logical CPUs are allocated in pairs, i.e. as full physical cores).
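
To check how many CPUs were actually granted, the accounting information can be queried once the job has started. The following command is a sketch, with <jobid> standing for the job ID printed by sbatch:

sacct -j <jobid> --format=JobID,Partition,AllocCPUS,ReqMem,State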

MPI job using half a GPU node#

This script allocates one CPU and one GPU for each task; the remaining resources of the node might be shared with other jobs. If the job has a higher memory demand, use --mem= or --mem-per-cpu= to account for that.

#!/bin/bash
#SBATCH --job-name=my_gpu_job      # Specify job name
#SBATCH --partition=gpu            # Specify partition name
#SBATCH --nodes=1                  # Specify number of nodes
#SBATCH --ntasks-per-node=2        # Specify number of (MPI) tasks on each node
#SBATCH --gpus-per-task=1          # Specify number of GPUs per task
#SBATCH --time=00:30:00            # Set a limit on the total run time
#SBATCH --mail-type=FAIL           # Notify user by email in case of job failure
#SBATCH --account=xz0123           # Charge resources on this project account
#SBATCH --output=my_job.o%j        # File name for standard output

set -e
ulimit -s 204800

# Check GPUs available for the job
# nvidia-smi

# Check GPUs visible for each task
# srun -l nvidia-smi

srun -l  bash -c 'echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"'

To get information on the GPUs allocated to the job, run nvidia-smi. The GPUs are numbered starting at 0. When the command is executed for each task using srun, each call only shows the GPUs dedicated to that respective task. In this example, each task sees only one GPU (reported as number 0 in both cases).

If not manually overwritten, Slurm sets the environment variable CUDA_VISIBLE_DEVICES, which contains the indices of the GPUs visible to the respective task.
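
To see which physical GPU each task is actually bound to, the devices can also be listed by UUID; this is only a convenience sketch complementing the commented-out commands above:

srun -l nvidia-smi -L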

MPI job using multiple full GPU nodes#

This script runs a job with 8 tasks on two nodes, using all four GPUs available on each node. Since all GPU resources are used, the Slurm option --exclusive is added to get access to all resources of each node. To be able to use all of each node's memory, the option --mem=0 has to be added as well.

#!/bin/bash
#SBATCH --job-name=my_gpu_job      # Specify job name
#SBATCH --partition=gpu            # Specify partition name
#SBATCH --nodes=2                  # Specify number of nodes
#SBATCH --ntasks-per-node=4        # Specify number of (MPI) tasks on each node
#SBATCH --gpus-per-task=1          # Specify number of GPUs per task
#SBATCH --exclusive                # https://slurm.schedmd.com/sbatch.html#OPT_exclusive
#SBATCH --mem=0                    # Request all memory available on all nodes
#SBATCH --time=00:30:00            # Set a limit on the total run time
#SBATCH --mail-type=FAIL           # Notify user by email in case of job failure
#SBATCH --account=xz0123           # Charge resources on this project account
#SBATCH --output=my_job.o%j        # File name for standard output

set -e
ulimit -s 204800

# Check GPUs available for the job
# nvidia-smi

# Check GPUs visible for each task
# srun -l nvidia-smi

srun -l  bash -c 'echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"'

Apart from allocating more resources, this example is similar to the previous script.

For specific binding and runtime settings for GPU nodes please refer to GPU specific runtime settings.
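
As an illustration only (the linked page contains the authoritative recommendations for Levante), Slurm also provides a generic GPU binding option that binds each task to the GPU closest to its allocated CPUs and reports the chosen binding:

# Sketch only -- prefer the settings from the GPU runtime settings page
srun -l --cpu_bind=verbose --gpu-bind=verbose,closest ./myprog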

Machine learning example using GPUs#

For individual applications it is sufficient to specify the total number of GPUs needed for the job by setting the option --gpus=<#GPUs>. This is, for example, the case for many machine learning applications, where the underlying frameworks recognize and use the GPU resources themselves. Here, all four GPUs of one node are requested and a Python script that makes use of the GPUs is executed.

#!/bin/bash
#SBATCH --job-name=my_gpu_job      # Specify job name
#SBATCH --partition=gpu            # Specify partition name
#SBATCH --nodes=1                  # Specify number of nodes
#SBATCH --gpus=4                   # Specify number of GPUs needed for the job
#SBATCH --exclusive                # https://slurm.schedmd.com/sbatch.html#OPT_exclusive
#SBATCH --mem=0                    # Request all memory available on the node
#SBATCH --time=00:30:00            # Set a limit on the total run time
#SBATCH --mail-type=FAIL           # Notify user by email in case of job failure
#SBATCH --account=xz0123           # Charge resources on this project account
#SBATCH --output=my_job.o%j        # File name for standard output

module load pytorch

# Run GPU-aware script, e.g.
python -u /path/to/myscript.py
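
To quickly verify that the framework actually sees the requested GPUs, a check like the following can be added to the job script before the main program. This is a sketch and assumes that the pytorch module provides the torch package:

# Sketch: confirm the number of GPUs visible to PyTorch
python -c "import torch; print('GPUs visible:', torch.cuda.device_count())"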

For more information on machine learning on Levante, please refer to Machine Learning on Levante.