MPI Runtime Settings¶
Modern MPI library implementations provide a large number of user-configurable parameters and algorithms for performance tuning. Although the local configuration of MPI libraries is initially performed by the vendor to match the characteristics of the cluster, the performance of a specific application can often be improved further, in some cases by up to 15%, through an optimal choice of the tunable parameters.
Since tuning options are specific to the MPI library and the application, the recommendations for MPI runtime settings below are only a starting point for each version.
OpenMPI based MPI libraries¶
OpenMPI 2.0.0 and later¶
As a minimal environment setting we recommend the following to make use of the Mellanox HPC-X toolkit. This is just a starting point; users will have to tune the environment depending on the application used.
export OMPI_MCA_pml=cm          # sets the point-to-point management layer
export OMPI_MCA_mtl=mxm         # sets the matching transport layer (MPI-2 one-sided comm.)
export MXM_RDMA_PORTS=mlx5_0:1
export MXM_LOG_LEVEL=ERROR
export MXM_HANDLE_ERRORS=bt
export UCX_HANDLE_ERRORS=bt

# enable HCOLL based collectives
export OMPI_MCA_coll=^fca               # disable FCA for collective MPI routines
export OMPI_MCA_coll_hcoll_enable=1     # enable HCOLL for collective MPI routines
export OMPI_MCA_coll_hcoll_priority=95
export OMPI_MCA_coll_hcoll_np=8         # use HCOLL for all communications with more than 8 tasks
export HCOLL_MAIN_IB=mlx5_0:1
export HCOLL_ENABLE_MCAST=1
export HCOLL_ENABLE_MCAST_ALL=1

# disable specific HCOLL functions (strongly depends on the application)
export HCOLL_ML_DISABLE_BARRIER=1
export HCOLL_ML_DISABLE_IBARRIER=1
export HCOLL_ML_DISABLE_BCAST=1
export HCOLL_ML_DISABLE_REDUCE=1
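The same MCA parameters can alternatively be passed on the mpirun command line instead of being exported in the environment; a minimal sketch, where the executable name is only a placeholder:

mpirun --mca pml cm --mca mtl mxm --mca coll_hcoll_enable 1 ./myprog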
The ompi_info tool can be used to get detailed information about the Open MPI installation and local configuration:
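For example (the options shown are standard ompi_info options; the exact set of reported parameters depends on the installed Open MPI version):

ompi_info --all                          # list all MCA parameters and their current values
ompi_info --param coll hcoll --level 9   # show the parameters of the HCOLL collective component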
BullxMPI (out of support)¶
In general, it is advisable to use the Bullx MPI installation with MXM (Mellanox Messaging) support to accelerate the underlying send/receive (or put/get) message transfers. The following variables have to be set:
export OMPI_MCA_pml=cm          # sets the point-to-point management layer
export OMPI_MCA_mtl=mxm         # sets the matching transport layer (MPI-2 one-sided comm.)
export MXM_RDMA_PORTS=mlx5_0:1
Alternatively, the default OpenMPI behavior can be specified using:
export OMPI_MCA_pml=ob1
export OMPI_MCA_mtl=^mxm
When mistral was installed it was recommended to accelerate MPI collective operations with the Mellanox FCA (Fabric Collective Accelerator) tools - this is no longer possible! If you are using old job scripts and find your jobs aborting due to FCA/FMM errors, please deactivate FCA usage with the following variable:
export OMPI_MCA_coll=^ghc,fca   # disable BULL's GHC and Mellanox FCA tools for collectives
2019-02-19: FCA is no longer supported and the centralized fcamanager is not available on mistral - we recommend switching to OpenMPI instead.
The Bullx MPI documentation provided by BULL/Atos can be found in the Manuals section.
Intel MPI 2017 and later¶
A good starting point for MPI-based tuning is the following setting, which enforces shared memory for intra-node MPI communication and DAPL-based inter-node communication:
export I_MPI_FABRICS=shm:dapl
export I_MPI_FALLBACK=disable
export I_MPI_SLURM_EXT=0
export I_MPI_LARGE_SCALE_THRESHOLD=8192   # set to a value larger than the number of MPI tasks used!
export I_MPI_DYNAMIC_CONNECTION=1
export I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0
export I_MPI_HARD_FINALIZE=1
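To verify which fabrics are actually selected at runtime, the Intel MPI debug output can be increased; level 5 is a common choice, and the exact output format varies between Intel MPI versions:

export I_MPI_DEBUG=5   # print the selected fabrics and process pinning at startup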
To further tune the MPI library usage, one might enable the collection of MPI call statistics and afterwards analyse which MPI functions are used (see lightweight MPI analysis).
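A minimal sketch using the native Intel MPI statistics interface; the statistics level and output file name are only illustrative:

export I_MPI_STATS=10              # statistics level (1-10, higher means more detail)
export I_MPI_STATS_FILE=stats.txt  # illustrative name of the statistics output file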
One might, for example, switch the algorithm used for MPI_Alltoallv via the corresponding tuning variable:
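In Intel MPI this is controlled by the I_MPI_ADJUST family of environment variables; the algorithm number below is only an example, and the optimal choice is application dependent:

export I_MPI_ADJUST_ALLTOALLV=2   # select collective algorithm 2 for MPI_Alltoallv (example value)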
An unlimited stack size might have a negative influence on performance; it is better to use the amount actually needed, e.g.
ulimit -s 102400         # using bash
limit stacksize 102400   # using csh
It is also recommended to disable core file generation if it is not needed for debugging purposes.
ulimit -c 0            # using bash
limit coredumpsize 0   # using csh
In batch jobs, you will also have to propagate the modified settings from the job head node to all other compute nodes when invoking srun, i.e.
srun --propagate=STACK,CORE [any other options]
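Putting this together, a minimal batch script could look like the following sketch; the partition, task counts and executable name are placeholders:

#!/bin/bash
#SBATCH --partition=compute     # placeholder partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24

ulimit -s 102400                # stack size actually needed by the application
ulimit -c 0                     # disable core file generation

srun --propagate=STACK,CORE ./myprog   # propagate the limits set above to all compute nodes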