Bus error in jobs

The solution

Update: The problem was solved by an update of the Lustre client provided by our storage vendor. The workaround described below should no longer be necessary. If one of your jobs still runs into a bus error, please let us know.

The problem

Jobs running on Levante sometimes fail with a bus error similar to the example below:

4800: =================================
4800: [l10148:3175595] *** Process received signal ***
4800: [l10148:3175595] Signal: Bus error (7)
4800: [l10148:3175595] Signal code:  (-6)
4800: [l10148:3175595] Failing at address: 0x5e6a003074ab
4800: [l10148:3175595] [ 0] /usr/lib64/libpthread.so.0(+0x12b20)[0x7ffef7214b20]

In most cases, the stack frames that follow this one belong to the MPI library.
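When reporting such a failure, it may help to know which nodes were affected. The following is a minimal sketch that extracts the host names from log lines of the form shown above. The function name bus_error_nodes and the log file name in the usage comment are hypothetical and not part of any Levante tooling.

```shell
# Hypothetical helper: print the unique host names that reported a bus error.
# Expects log lines such as "4800: [l10148:3175595] Signal: Bus error (7)".
bus_error_nodes() {
    # Capture the text between "[" and ":" on lines that report the signal.
    sed -n 's/.*\[\([^]:]*\):[0-9]*\] Signal: Bus error.*/\1/p' "$@" | sort -u
}

# Usage (the log file name is an assumption):
#   bus_error_nodes my_job.12345.log
```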

According to our observations, the bus error happens on nodes doing large amounts of output and, more specifically, on NUMA domains that host output ranks. The overall memory consumption on the node is rather low.

A probable reason

The memory of a Levante node is split into 8 parts (NUMA domains). Each of these is best accessed by the 16 cores “next” to it, but can also be accessed by the other cores. Cross-domain memory access is slow, though. Under most circumstances, a process receives memory from its own NUMA domain until that domain is full and only then receives memory from the other domains.
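To see which cores belong to which NUMA domain on the node you are logged in to, the Linux sysfs tree can be read directly. This is a minimal sketch. The function name list_numa_cpus is hypothetical, and the optional directory argument only exists so that the function can be pointed at a different sysfs-style tree.

```shell
# Hypothetical helper: print the CPU list of every NUMA domain.
# The optional first argument overrides the default sysfs location.
list_numa_cpus() {
    numa_base=${1:-/sys/devices/system/node}
    for numa_dir in "$numa_base"/node[0-9]*; do
        # Skip entries without a readable cpulist (e.g. an unmatched glob).
        [ -r "$numa_dir/cpulist" ] || continue
        printf '%s: %s\n' "${numa_dir##*/}" "$(cat "$numa_dir/cpulist")"
    done
}

list_numa_cpus
```

If numactl is installed, numactl --hardware reports similar information together with the memory size of each domain.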

Observing the memory consumption at the NUMA level, one finds that shortly before the bus error the used memory in the NUMA domain doing output grows significantly. We assume that the subsequent bus error is due to a bug in the Lustre kernel memory management. DDN is working on a fix.
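One way to watch the memory per NUMA domain is to read the per-node meminfo files in sysfs. The sketch below is not necessarily how the observation above was made. The function name numa_free_mem is hypothetical, and the optional directory argument is only there to allow pointing the function at another sysfs-style tree.

```shell
# Hypothetical helper: print the free memory of every NUMA domain.
# The optional first argument overrides the default sysfs location.
numa_free_mem() {
    numa_base=${1:-/sys/devices/system/node}
    for numa_file in "$numa_base"/node[0-9]*/meminfo; do
        # Skip entries that are not readable (e.g. an unmatched glob).
        [ -r "$numa_file" ] || continue
        # Lines in these files look like "Node 0 MemFree:  123456 kB".
        awk '$3 == "MemFree:" { printf "node%s free: %s %s\n", $2, $4, $5 }' "$numa_file"
    done
}

numa_free_mem
```

Calling the function repeatedly on a compute node while a job is writing output shows whether a single domain runs low while the others still have plenty of free memory.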

How to mitigate

Since, as discussed in the previous section, one NUMA domain runs low on memory, the obvious mitigation is to let the ranks doing output allocate memory from more than one NUMA domain. The following mpmd.conf ensures that rank 0 allocates memory from all NUMA domains of the first socket and rank 96 from all NUMA domains of the second socket. All other processes keep using “their” domain.

0 numactl --interleave=0-3 -- /path/to/your/model
96 numactl --interleave=4-7 -- /path/to/your/model
* numactl --localalloc -- /path/to/your/model

You have to adapt the rank numbers to the ranks doing output, and the NUMA domain lists to the sockets to which these ranks are bound. Beware that interleaving memory across NUMA domains slows down computation considerably. If you are doing classical rank-0 I/O, i.e. you interleave computation on rank 0 with output, this mitigation can therefore be costly performance-wise.
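If the output ranks change between experiments, the file can also be generated instead of edited by hand. The following is a minimal sketch. The function name make_mpmd_conf and its RANK:DOMAINS argument format are hypothetical conventions chosen for this example.

```shell
# Hypothetical helper: print an mpmd.conf to standard output.
# Usage: make_mpmd_conf MODEL RANK:DOMAINS [RANK:DOMAINS ...]
make_mpmd_conf() {
    mpmd_model=$1
    shift
    for mpmd_spec in "$@"; do
        # The part before the colon is the rank, the part after it the
        # list of NUMA domains to interleave memory across.
        printf '%s numactl --interleave=%s -- %s\n' \
            "${mpmd_spec%%:*}" "${mpmd_spec#*:}" "$mpmd_model"
    done
    # All remaining ranks keep allocating from their local NUMA domain.
    # The "*" entry has to be the last line of the file.
    printf '* numactl --localalloc -- %s\n' "$mpmd_model"
}

# Reproduces the three-line example for output ranks 0 and 96.
make_mpmd_conf /path/to/your/model 0:0-3 96:4-7
```

Redirect the output into a file, e.g. make_mpmd_conf ... > mpmd.conf, to obtain the configuration file.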

To use the mpmd.conf, add --multi-prog mpmd.conf to your srun command line.

You can also use numactl with normal (non-MPI) binaries, e.g.

numactl --interleave=0-7 matlab [matlab options and arguments] &