MPI Runtime Settings

Modern MPI library implementations provide a large number of user-configurable parameters and algorithms for performance tuning. Although the local configuration of MPI libraries is initially set up by the vendor to match the characteristics of the cluster, the performance of a specific application can often be improved by up to 15% through an optimal choice of tunable parameters.

Since tuning options are specific to the MPI library and the application, the recommendations for MPI runtime settings below are only a starting point for each version.

OpenMPI based MPI libraries

OpenMPI 2.0.0 and later

As a minimal environment setting we recommend the following to make use of the Mellanox HPC-X toolkit. This is just a starting point; users will have to tune the environment depending on the application used.

export OMPI_MCA_pml=cm         # sets the point-to-point management layer
export OMPI_MCA_mtl=mxm        # sets the matching transport layer (MPI-2 one-sided comm.)
export MXM_RDMA_PORTS=mlx5_0:1
export MXM_LOG_LEVEL=ERROR
export MXM_HANDLE_ERRORS=bt
export UCX_HANDLE_ERRORS=bt

# enable HCOLL based collectives
export OMPI_MCA_coll=^fca              # disable FCA for collective MPI routines
export OMPI_MCA_coll_hcoll_enable=1    # enable HCOLL for collective MPI routines
export OMPI_MCA_coll_hcoll_priority=95
export OMPI_MCA_coll_hcoll_np=8        # use HCOLL for all communications with more than 8 tasks
export HCOLL_MAIN_IB=mlx5_0:1
export HCOLL_ENABLE_MCAST=1
export HCOLL_ENABLE_MCAST_ALL=1

# disable specific HCOLL functions (strongly depends on the application)
export HCOLL_ML_DISABLE_BARRIER=1
export HCOLL_ML_DISABLE_IBARRIER=1
export HCOLL_ML_DISABLE_BCAST=1
export HCOLL_ML_DISABLE_REDUCE=1
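
For illustration, the settings above can be collected in a batch script. The following is only a sketch assuming a SLURM job launched with srun; partition name, node and task counts, and the executable name are placeholders that have to be adapted:

#!/bin/bash
#SBATCH --partition=compute        # placeholder partition name
#SBATCH --nodes=2                  # placeholder resources
#SBATCH --ntasks-per-node=36

# point-to-point layer via MXM (see above)
export OMPI_MCA_pml=cm
export OMPI_MCA_mtl=mxm
export MXM_RDMA_PORTS=mlx5_0:1

# HCOLL based collectives (see above)
export OMPI_MCA_coll=^fca
export OMPI_MCA_coll_hcoll_enable=1
export HCOLL_MAIN_IB=mlx5_0:1

srun ./my_mpi_application          # placeholder executable name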

The ompi_info tool can be used to get detailed information about the OpenMPI installation and its local configuration:

ompi_info --all
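
To inspect only the parameters of a single framework or component, e.g. the HCOLL collectives used above, the output can be restricted or filtered; the two calls below are equivalent ways to do this:

ompi_info --param coll hcoll --level 9   # list all MCA parameters of the hcoll component
ompi_info --all | grep coll_hcoll        # alternatively, filter the full parameter list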

BullxMPI (out of support)

In general, it is advisable to use the Bullx MPI installation with MXM (Mellanox Messaging) support to accelerate the underlying send/receive (or put/get) messages. The following variables have to be set:

export OMPI_MCA_pml=cm         # sets the point-to-point management layer
export OMPI_MCA_mtl=mxm        # sets the matching transport layer (MPI-2 one-sided comm.)
export MXM_RDMA_PORTS=mlx5_0:1
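
Since Bullx MPI is based on OpenMPI, each OMPI_MCA_* environment variable can also be passed as an --mca option on the command line. The following sketch assumes the application is started with mpirun instead of srun; task count and executable name are placeholders:

mpirun --mca pml cm --mca mtl mxm -x MXM_RDMA_PORTS=mlx5_0:1 -np 4 ./my_mpi_application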

Alternatively, the default OpenMPI behavior can be specified using:

export OMPI_MCA_pml=ob1
export OMPI_MCA_mtl=^mxm

When mistral was installed, it was recommended to accelerate MPI collective operations with the Mellanox FCA (Fabric Collective Accelerator) tools; this is no longer possible. If you are using old job scripts and find your jobs aborting due to FCA/FMM errors, please deactivate FCA usage with the following variable:

export OMPI_MCA_coll=^ghc,fca         # disable BULL's GHC and Mellanox FCA tools for collectives

Note

2019-02-19: FCA is no longer supported and the centralized fcamanager is not available on mistral - we recommend switching to OpenMPI instead.

The Bullx MPI documentation provided by BULL/Atos can be found in the section Manuals.

Intel MPI 2017 and later

A good starting point for MPI-based tuning is the following setting, which enforces shared memory for intranode MPI communication and DAPL-based internode communication:

export I_MPI_FABRICS=shm:dapl
export I_MPI_FALLBACK=disable
export I_MPI_SLURM_EXT=0
export I_MPI_LARGE_SCALE_THRESHOLD=8192 # set to a value larger than the number of MPI-tasks used !!!
export I_MPI_DYNAMIC_CONNECTION=1
export I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0
export I_MPI_HARD_FINALIZE=1
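
To check which fabric is actually selected at startup, the Intel MPI debug output can be raised temporarily; this increases the log volume and should not be kept for production runs:

export I_MPI_DEBUG=5   # prints e.g. the selected fabric and the process pinning at startup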

To further tune the MPI library usage, one might enable collection of MPI call statistics, e.g.

export I_MPI_STATS=20

and afterwards analyse the results with respect to the MPI functions used (see lightweight MPI analysis). One might, for example, switch the algorithm used for MPI_Alltoallv via

export I_MPI_ADJUST_ALLTOALLV=1
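
Note that the native statistics are written to a text file in the working directory after MPI_Finalize. The sketch below assumes the default behavior of Intel MPI 2017; the file name (default stats.txt) and the exact content of the report depend on the Intel MPI version:

export I_MPI_STATS=20
export I_MPI_STATS_FILE=my_stats.txt   # optional, choose a custom file name
srun ./my_mpi_application              # placeholder executable name
less my_stats.txt                      # inspect the collected statistics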

All MPIs

An unlimited stack size might have a negative influence on performance; it is better to set the stack size to the amount actually needed, e.g.

ulimit -s 102400       # using bash
limit stacksize 102400 # using csh

It is also recommended to disable core file generation if it is not needed for debugging purposes.

ulimit -c 0    # using bash
limit core 0   # using csh

In batch jobs, you will also have to propagate the modified settings from the job head node to all other compute nodes when invoking srun, i.e.

srun --propagate=STACK,CORE [any other options]
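
Putting the settings of this section together, the relevant part of a batch script could look like the following sketch (bash syntax; resources and executable name are placeholders):

#!/bin/bash
#SBATCH --nodes=2                  # placeholder resources
#SBATCH --ntasks-per-node=36

ulimit -s 102400                   # limit the stack size on the job head node
ulimit -c 0                        # disable core files on the job head node

# propagate both limits from the head node to all compute nodes of the job
srun --propagate=STACK,CORE ./my_mpi_application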