When running programs on Levante, various settings might be needed to achieve satisfactory performance or, in some cases, even to allow a program to run at all. This section describes the environment settings that most often need to be set in Slurm batch scripts.
MPI Runtime Settings#
Modern MPI library implementations provide a large number of user-configurable parameters and algorithms for performance tuning. Although the local configuration of the MPI libraries is initially performed by the vendor to match the characteristics of the cluster, the performance of a specific application can often be improved further, by up to 15%, through an optimal choice of tunable parameters.
Since tuning options are specific to an MPI library and an application, the following recommendations for MPI runtime settings are only a starting point for each version.
As a minimal environment setting for OpenMPI, we recommend the following to make use of the UCX toolkit.
export OMPI_MCA_osc="ucx"
export OMPI_MCA_pml="ucx"
export OMPI_MCA_btl="self"
export UCX_HANDLE_ERRORS="bt"
export OMPI_MCA_pml_ucx_opal_mem_hooks=1
The following settings can further improve the runtime of programs, but whether they help depends on the respective configuration.
export OMPI_MCA_io="romio321"          # basic optimisation of I/O
export UCX_TLS="shm,rc_mlx5,rc_x,self" # for jobs using LESS than 150 nodes
export UCX_TLS="shm,dc_mlx5,dc_x,self" # for jobs using MORE than 150 nodes
export UCX_UNIFIED_MODE="y"            # JUST for homogeneous jobs on CPUs, do not use for GPU nodes
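The two UCX_TLS lines above are alternatives, so a batch script needs to pick one of them. The following is a minimal sketch of how that choice could be made from the node count of the job. The helper name `select_ucx_tls` is hypothetical, the fallback to one node outside of a job is an assumption, and so is the handling of exactly 150 nodes, which the recommendation above leaves open.

```bash
#!/bin/bash
# Hypothetical helper: print a UCX_TLS value for a given node count.
# The 150-node threshold and both transport lists are taken from the
# recommendation above; treating exactly 150 nodes like a larger job
# is an assumption of this sketch.
select_ucx_tls() {
    local nodes="$1"
    if [ "$nodes" -lt 150 ]; then
        echo "shm,rc_mlx5,rc_x,self"
    else
        echo "shm,dc_mlx5,dc_x,self"
    fi
}

# SLURM_JOB_NUM_NODES is set by Slurm inside a job allocation.
# Fall back to 1 so the snippet also runs outside of a job.
export UCX_TLS="$(select_ucx_tls "${SLURM_JOB_NUM_NODES:-1}")"
```

Keeping the decision in one small function means the same batch script can be reused when the number of nodes requested for a job changes.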
For some applications (e.g. the high-resolution version of ICON), a special MPI_Alltoallv algorithm has to be selected if you notice a deadlock in MPI_Alltoallv.
export OMPI_MCA_coll_tuned_use_dynamic_rules="true"
export OMPI_MCA_coll_tuned_alltoallv_algorithm=2
The ompi_info tool can be used to get detailed information about the OpenMPI installation and its local configuration.
Environment variables for Intel MPI start with an I_MPI_ prefix. The complete reference of environment variables can be found at Intel’s site.
On Levante, to run programs built with Intel MPI, you should set at least the following environment variables:
export I_MPI_PMI=pmi
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
For large jobs we recommend using PMI-2 instead of PMI. The corresponding settings are:
export I_MPI_PMI=pmi2
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 ...
OpenMP Runtime Settings#
For typical HPC applications, one thread per physical core is often reasonable. For hybrid OpenMP/MPI jobs, values of OMP_NUM_THREADS up to 8 usually work well, because 8 cores share one last-level cache (LLC).
export KMP_AFFINITY="granularity=fine,scatter"
export KMP_LIBRARY="turnaround"
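As an illustration of the rule of thumb above, the thread count for a hybrid job could be derived from the Slurm allocation and capped at 8. This is only a sketch: the helper name `cap_omp_threads` is hypothetical, and it assumes that SLURM_CPUS_PER_TASK (set by Slurm when `--cpus-per-task` is requested) reflects the number of threads you actually want per MPI task. If the allocation counts hardware threads rather than physical cores, the value would need to be adjusted first.

```bash
#!/bin/bash
# Hypothetical helper: cap a requested thread count at 8, the number of
# cores that share one last-level cache according to the text above.
cap_omp_threads() {
    local requested="$1"
    local max_threads=8
    if [ "$requested" -gt "$max_threads" ]; then
        echo "$max_threads"
    else
        echo "$requested"
    fi
}

# Assumption: SLURM_CPUS_PER_TASK holds the desired threads per MPI task.
# Fall back to 1 thread if the variable is not set.
export OMP_NUM_THREADS="$(cap_omp_threads "${SLURM_CPUS_PER_TASK:-1}")"
```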
Other Runtime Settings#
There are a number of other settings that can be controlled via environment variables and that, in some cases, reduce the runtime of a job. Here we name just a few.
so the heap can only grow: memory that has been mapped once is not returned to the kernel, which avoids some system calls.
export MKL_DEBUG_CPU_TYPE="5"
export MKL_ENABLE_INSTRUCTIONS="AVX2"
Useful if the executable was linked against Intel MKL. These settings are no longer effective with newer Intel oneAPI compilers and libraries, as MKL now has implementation branches for “Zen” that are selected automatically.
An unlimited stack size might have a negative influence on performance. Also, an unlimited stack hides invalid memory accesses. It is therefore recommended to set the limit to the amount that is actually needed. For example, to set the stack size limit to 200 MB (200*1024 KB), use one of the following statements:
ulimit -s 204800       # bash
limit stacksize 204800 # tcsh
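Since the stack limit is given in KB, the value can also be computed in the script instead of being hard-coded. The following bash sketch reproduces the 200 MB example from above. The helper name `mb_to_kb` is hypothetical, and the warning on failure is an addition for the case that the hard limit on the system is lower than the requested value.

```bash
#!/bin/bash
# Hypothetical helper: convert a size in MB to KB, the unit used by
# "ulimit -s" in bash.
mb_to_kb() {
    echo $(( $1 * 1024 ))
}

# 200 MB as in the example above, i.e. 200 * 1024 KB.
stack_kb="$(mb_to_kb 200)"

# Set the soft stack limit for this shell and the processes it starts.
# Print a warning instead of aborting if the limit cannot be set.
ulimit -s "$stack_kb" || echo "could not set stack size to ${stack_kb} KB" >&2
```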
It might be necessary to increase the stack size further if your program uses large automatic arrays. If the stack size is too small, the program will usually crash with an error message like this:
"Caught signal 11 (Segmentation fault: address not mapped to object at address 0x0123456789abcdef)".
Obviously, the actual address will vary. If increasing the stack size does not resolve the program abort, a Segmentation fault error is a strong indication of a bug in your program.
Core File Size#
It is also recommended to disable core file generation unless needed for debugging purposes:
ulimit -c 0  # bash
limit core 0 # tcsh
All current limits can be listed with the following command:
ulimit -a # bash
limit     # tcsh