Bus error in jobs
The solution
Note
Update 2022-06-14: The problem was solved by an update of the Lustre client by our storage vendor. The workaround described below should no longer be necessary. If one of your jobs runs into a bus error, please let us know.
The problem
Jobs running on Levante sometimes fail with a bus error similar to the example below:
4800: =================================
4800: [l10148:3175595] *** Process received signal ***
4800: [l10148:3175595] Signal: Bus error (7)
4800: [l10148:3175595] Signal code: (-6)
4800: [l10148:3175595] Failing at address: 0x5e6a003074ab
4800: [l10148:3175595] [ 0] /usr/lib64/libpthread.so.0(+0x12b20)[0x7ffef7214b20]
In most cases, the stack frames that follow belong to the MPI library.
According to our observations, the bus error occurs on nodes that do large amounts of output, and more specifically on NUMA domains that host output ranks. Overall memory consumption on the node is rather low at that point.
A probable reason
The memory of a Levante node is split into 8 parts (NUMA domains). Each of these is accessed fastest by the 16 cores “next” to it, but can also be accessed by the other cores; cross-domain memory access is slow, though. Under most circumstances, a process receives memory from its own NUMA domain until that domain is full, and only then receives memory from the other domains.
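If you want to check this layout yourself, the standard tools are sufficient. The sketch below assumes you are on a compute node (e.g. inside an interactive job); the numbers you should see there (8 domains with 16 cores each) are the ones described above.
numactl --hardware      # lists NUMA nodes 0-7 and the cores attached to each
lscpu | grep -i numa    # compact summary of the same information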
Observing the memory consumption at the NUMA level, one finds that shortly before the bus error the memory used in the NUMA domain doing output grows significantly. We assume that the subsequent bus error is due to a bug in the Lustre kernel memory management. DDN is working on a fix.
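One way to reproduce this observation is to watch the per-domain memory counters from a second shell on the compute node while the job is running. numastat ships with the numactl package; the 5-second interval below is an arbitrary choice.
watch -n 5 numastat -m                                    # per-NUMA-domain memory usage, refreshed every 5 s
grep -H MemFree /sys/devices/system/node/node*/meminfo    # raw counters straight from sysfs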
How to mitigate
Since, as discussed in the previous section, one NUMA domain is running low on memory, the obvious mitigation is to bind the ranks doing output to more than one NUMA domain. The following mpmd.conf ensures that rank 0 allocates memory from all NUMA domains of the first socket and rank 96 from all NUMA domains of the second socket. All other processes keep using “their” domain.
0 numactl --interleave=0-3 -- /path/to/your/model
96 numactl --interleave=4-7 -- /path/to/your/model
* numactl --localalloc -- /path/to/your/model
You have to adapt the ranks to the ones doing output and the sockets to the ones to which the corresponding ranks are bound. Beware that this slows down computation considerably. So if you are doing classical rank-0 I/O, i.e. you interleave computation on rank 0 with output, this mitigation can be costly performance-wise.
To use the mpmd.conf, add --multi-prog mpmd.conf to your srun command line.
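Putting the pieces together, a job script could look like the following minimal sketch. The partition, account, and task counts are placeholders; the rank numbers assume 128 tasks per node with block distribution, so that rank 96 actually lands on the second socket of the first node.
#!/bin/bash
#SBATCH --partition=compute        # placeholder, use your usual partition
#SBATCH --account=xz0123           # placeholder project account
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128      # one rank per core, block distribution
#SBATCH --time=08:00:00

# mpmd.conf as shown above: output ranks 0 and 96 interleave over the
# NUMA domains of their socket, all other ranks stay on their local domain.
srun -l --cpu-bind=cores --multi-prog mpmd.conf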
You can also use numactl with normal (non-MPI) binaries, e.g.
numactl --interleave=0-7 matlab [matlab options and arguments] &