Runtime Settings#
When running programs on Levante, various settings might be needed to achieve satisfactory performance or, in some cases, even to allow a program to run at all. This section describes the environment settings that most often need to be set in Slurm batch scripts.
MPI Runtime Settings#
Modern MPI library implementations provide a large number of user-configurable parameters and algorithms for performance tuning. Although the local configuration of MPI libraries is initially performed by the vendor to match the characteristics of the cluster, the performance of a specific application can often be improved by up to 15% through an optimal choice of tunable parameters.
Since tuning options are specific to an MPI library and an application, the following recommendations for MPI runtime settings are only a starting point.
OpenMPI#
As a minimal environment setting, we recommend the following to make use of the UCX toolkit:
export OMPI_MCA_osc="ucx"
export OMPI_MCA_pml="ucx"
export OMPI_MCA_btl="self"
export UCX_HANDLE_ERRORS="bt"
export OMPI_MCA_pml_ucx_opal_mem_hooks=1
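For illustration, these exports could be embedded in a Slurm batch script roughly as follows; this is only a sketch, and the partition, account, resource numbers and executable name are placeholders that have to be adapted to your job:
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --partition=compute        # placeholder: choose the partition fitting your job
#SBATCH --nodes=4                  # placeholder resource request
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:30:00
#SBATCH --account=xz0123           # placeholder: your project account

# minimal UCX-related OpenMPI settings from above
export OMPI_MCA_osc="ucx"
export OMPI_MCA_pml="ucx"
export OMPI_MCA_btl="self"
export UCX_HANDLE_ERRORS="bt"
export OMPI_MCA_pml_ucx_opal_mem_hooks=1

srun ./my_program                  # placeholder executable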
The following settings can further improve the runtime of programs, but their effect depends on the respective configuration:
export OMPI_MCA_io="romio321" # basic optimisation of I/O
export UCX_TLS="shm,rc_mlx5,rc_x,self" # for jobs using LESS than 150 nodes
export UCX_TLS="shm,dc_mlx5,dc_x,self" # for jobs using MORE than 150 nodes
export UCX_UNIFIED_MODE="y" # JUST for homogeneous jobs on CPUs, do not use for GPU nodes
For some applications (e.g. the high-resolution version of ICON) a special MPI_Alltoallv algorithm has to be used in case you notice a deadlock in Alltoallv:
export OMPI_MCA_coll_tuned_use_dynamic_rules="true"
export OMPI_MCA_coll_tuned_alltoallv_algorithm=2
The ompi_info tool can be used to get detailed information about the OpenMPI installation and local configuration:
ompi_info --all
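For recent OpenMPI versions, the output can also be narrowed down to a single framework or component; for example, the tunable parameters of the coll/tuned component used above can be listed with:
ompi_info --param coll tuned --level 9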
Intel MPI#
Environment variables for Intel MPI start with an I_MPI_ prefix. The complete reference of environment variables can be found at Intel’s site.
On Levante, to run programs built with Intel MPI, you should set at least
the following environment variables:
export I_MPI_PMI=pmi
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
For large jobs we recommend using PMI-2 instead of PMI. The corresponding settings are:
export I_MPI_PMI=pmi2
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 ...
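Put together, a batch script for a large Intel MPI job using PMI-2 might look roughly like the following sketch; the partition, account, resource numbers and executable name are placeholders:
#!/bin/bash
#SBATCH --job-name=impi_job
#SBATCH --partition=compute        # placeholder
#SBATCH --nodes=64                 # placeholder resource request
#SBATCH --ntasks-per-node=128
#SBATCH --time=01:00:00
#SBATCH --account=xz0123           # placeholder: your project account

# PMI-2 settings recommended above for large Intel MPI jobs
export I_MPI_PMI=pmi2
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

srun --mpi=pmi2 ./my_program       # placeholder executable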
OpenMP Runtime Settings#
For typical HPC applications, one thread per physical core is often reasonable. For hybrid MPI/OpenMP jobs, OMP_NUM_THREADS values of up to 8 are usually a good choice, since 8 cores share one last-level cache (LLC).
export KMP_AFFINITY="granularity=fine,scatter"
export KMP_LIBRARY="turnaround"
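As an illustration, a hybrid MPI/OpenMP job could derive the thread count from the Slurm allocation roughly as in the following sketch; the resource numbers, account and executable name are placeholders:
#!/bin/bash
#SBATCH --nodes=2                  # placeholder resource request
#SBATCH --ntasks-per-node=16       # MPI ranks per node (placeholder)
#SBATCH --cpus-per-task=8          # 8 cores per rank share one LLC
#SBATCH --time=00:30:00
#SBATCH --account=xz0123           # placeholder: your project account

# one OpenMP thread per core allocated to each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export KMP_AFFINITY="granularity=fine,scatter"
export KMP_LIBRARY="turnaround"

srun ./my_hybrid_program           # placeholder executable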
GPU specific runtime settings#
The following list shows the default runtime settings used for ICON when running on Levante’s gpu partition.
export OMPI_MCA_pml=ucx # Use UCX to support InfiniBand devices and CUDA [1]
export OMPI_MCA_btl="self" # Only use self transport to reduce overhead [2]
export UCX_RNDV_SCHEME=put_zcopy # Preferred communication scheme with Rendezvous protocol
export UCX_RNDV_THRESH=16384 # Threshold when to switch transport from TCP to NVLINK [3]
export UCX_IB_GPU_DIRECT_RDMA=yes # Allow remote direct memory access from/to GPU
export UCX_TLS=cma,rc,mm,cuda_ipc,cuda_copy,gdr_copy # Include cuda and gdr based transport layers for communication [4]
export UCX_MEMTYPE_CACHE=n # Prevent misdetection of GPU memory as host memory [5]
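For orientation, a GPU batch script could combine these settings roughly as sketched below; only the partition name is taken from the text above, while the GPU request, account and executable name are placeholders to be adapted:
#!/bin/bash
#SBATCH --partition=gpu            # Levante's gpu partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4        # e.g. one task per GPU (placeholder)
#SBATCH --gpus-per-node=4          # placeholder: adapt the GPU request to your job
#SBATCH --account=xz0123           # placeholder: your project account

export OMPI_MCA_pml=ucx            # ... plus the remaining settings from the list above

srun ./my_gpu_program              # placeholder executable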
Levante’s GPU nodes contain two AMD CPUs, each connected to its own network device (mlx5_0 and mlx5_1, respectively). Of the four GPUs, two are attached to each CPU and hence have more direct access to the corresponding network device. The following table illustrates this configuration.
Socket | NUMA | Core # | SMT # | GPU # | Network device
---|---|---|---|---|---
0 | 0 | 0-15 | 128-143 | | mlx5_0
0 | 1 | 16-31 | 144-159 | 1 | mlx5_0
0 | 2 | 32-47 | 160-175 | | mlx5_0
0 | 3 | 48-63 | 176-191 | 0 | mlx5_0
1 | 4 | 64-79 | 192-207 | | mlx5_1
1 | 5 | 80-95 | 208-223 | 3 | mlx5_1
1 | 6 | 96-111 | 224-239 | | mlx5_1
1 | 7 | 112-127 | 240-255 | 2 | mlx5_1
For applications communicating across several nodes, it might be beneficial to specify which of the two network controllers to use. This can be done by setting the environment variable UCX_NET_DEVICES=mlx5_[0,1]:1 in a wrapper script for each task, depending on which socket it is running on. What this wrapper script looks like depends on the application and thus cannot be generalized. If you have questions about it, please contact the User support.
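As a rough illustration only, such a wrapper could look like the following sketch. It assumes four tasks per node that are bound consecutively to the sockets (local ranks 0-1 on socket 0, local ranks 2-3 on socket 1); the script name and this rank-to-socket mapping are assumptions and must be adapted to your actual binding:
#!/bin/bash
# select_hca.sh (hypothetical name): choose the network device according to
# the socket this task is assumed to be running on
if [ "${SLURM_LOCALID}" -lt 2 ]; then
    export UCX_NET_DEVICES=mlx5_0:1   # tasks on socket 0 use mlx5_0
else
    export UCX_NET_DEVICES=mlx5_1:1   # tasks on socket 1 use mlx5_1
fi
exec "$@"                             # start the actual program
The wrapper would then be placed between srun and the program, e.g. srun ... ./select_hca.sh ./my_program.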
Other Runtime Settings#
There are a number of other settings that can be controlled via environment variables and in some cases have a positive effect on the runtime of a job; here we just name a few.
export MALLOC_TRIM_THRESHOLD_="-1"
With this setting the heap can only grow, since memory that has been mapped once is not returned to the kernel. This avoids some system calls in this regard.
export MKL_DEBUG_CPU_TYPE="5"
export MKL_ENABLE_INSTRUCTIONS="AVX2"
Useful if the executable was linked against Intel MKL. These settings are no longer effective for newer Intel oneAPI compilers/libraries, since MKL now has implementation branches for “Zen” that are selected automatically.
Resource Limits#
Stack Size#
Using an unlimited stack size might have a negative influence on performance. Also, an unlimited stack can hide invalid memory accesses. Therefore it is recommended to define the actually needed amount. For example, to set the stack size limit to 200 MB (200*1024 KB), use one of the following statements:
ulimit -s 204800 # bash
limit stacksize 204800 # tcsh
It might be necessary to further increase the stack size if your program uses large automatic arrays. If the stack size is too small, the program will usually crash with an error message like this:
"Caught signal 11 (Segmentation fault: address not mapped to object at
address 0x0123456789abcdef)".
Obviously, the actual address will vary. If increasing the stack size does not resolve the program abort, a Segmentation fault error is a strong indication of a bug in your program.
Core File Size#
It is also recommended to disable core file generation unless needed for debugging purposes:
ulimit -c 0 # bash
limit core 0 # tcsh
All current limits can be listed with the following command:
ulimit -a # bash
limit # tcsh