How to get more memory for my Slurm job?#
Overview#
The amount of memory specified on the Levante configuration page for different node types refers to the total physical memory installed in a node. Since some memory is reserved for the needs of the operating system and the memory-based local file system (e.g. /tmp, /usr), the amount of memory actually available for job execution is less than the total physical memory of a node.
The table below provides the preset amounts of physical memory (RealMemory), memory reserved for the system (MemSpecLimit) and memory available for job execution (i.e. the difference between RealMemory and MemSpecLimit) for the three Levante node variants:
| Node type | Feature | RealMemory (M) | MemSpecLimit (M) | Available Memory (M) |
|---|---|---|---|---|
| Standard memory | | 257400 | 16384 | 241016 (ca. 235G) |
| Large memory | 512G | 515000 | 16384 | 498616 (ca. 486G) |
| Fat memory | 1024G | 1031600 | 16384 | 1015216 (ca. 991G) |
By default, the amount of memory available for a job is the product of the number of CPUs allocated by Slurm for the job and the default memory per CPU (DefMemPerCPU), which is set to 940M for all Levante partitions regardless of the node type. For example, for a full (i.e. exclusively used) node the amount of available memory would be 256 CPUs * 940M = 240640M (ca. 235G). This means that even if you select nodes with more RAM by choosing the interactive or gpu partition or by adding one of the following --constraint options to your job script:
#SBATCH --constraint=512G # Selects large memory node
#SBATCH --constraint=512G|1024G # Select large or ("|") fat memory node
the usable memory will still be limited to 235G by default. On Levante, Slurm does not automatically allocate all of a node's available main memory. To achieve that, it is necessary to additionally specify the following option:
#SBATCH --mem=0 # Use all available memory
In this context, zero is treated as a special case and all available memory on each node is allocated for job execution.
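If you want to verify how much memory Slurm actually granted to a job, one possibility is to inspect the job's trackable resources (TRES), for example as sketched below; the exact output format may differ between Slurm versions:
$ scontrol show job <jobid> | grep -i tres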
If your job tries to use more memory than allocated by Slurm, it will be terminated with the OUT_OF_MEMORY state, as can retrospectively be checked with the sacct command, e.g.:
$ sacct -X -j <jobid> -o jobid,jobname%20,state%30
JobID JobName State
------------ ---------------------------- -----------------------------------
<jobid> my_jobname OUT_OF_MEMORY
The corresponding job log file often contains the following error message:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=jobid.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: l40272: task 0: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=jobid.0
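To see how much memory the job actually consumed before it was killed, you can additionally query the accounting data of the individual job steps, e.g. the peak resident set size (MaxRSS) together with the requested memory (ReqMem). This is only a sketch; the availability of these fields depends on the accounting configuration:
$ sacct -j <jobid> -o jobid,jobname%20,reqmem,maxrss,state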
Below we describe a number of methods you can try if your job runs out of memory. In many cases these only provide a starting point; ultimately, code changes may be needed, such as checking for memory leaks and deleting/deallocating large arrays that are no longer used.
You can use ClusterCockpit to monitor used memory over time for your jobs.
Approaches to get more memory#
Increase the memory request in the batch script#
Slurm provides the --mem and --mem-per-cpu options for this purpose, e.g.:
#SBATCH --mem=50G
This approach is only reasonable if less than a full node is requested, e.g. for serial or multithreaded applications. Slurm automatically adjusts the number of allocated CPUs to match the memory request of the job.
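Alternatively, the memory request can be expressed per allocated CPU via --mem-per-cpu; the value below is only an illustration and has to be adapted to your application:
#SBATCH --mem-per-cpu=2000M # Request 2000 MB per allocated CPU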
Once the whole node is used, the memory request should be specified as
#SBATCH --mem=0
Select nodes equipped with more RAM#
To do this, you need to specify the --constraint and --mem=0 options on Levante, for example:
#SBATCH --constraint=512G # Selects large memory node
#SBATCH --constraint=512G|1024G # Select large or ("|") fat memory node
#SBATCH --mem=0 # Use all available memory
Please note that due to the small number of large memory (294) and fat memory (18) nodes, the pending times of your jobs might increase considerably.
Increase the number of nodes#
For parallel applications which can be executed on multiple nodes and show good memory scaling, you can try to use more nodes to increase the total amount of available memory, e.g.:
#SBATCH --nodes=10
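For example, a job header distributing an MPI application over ten full nodes and requesting all of their memory might look like the following sketch; the node and task counts are only an illustration and have to be adapted to your application:
#SBATCH --nodes=10              # Use ten nodes to increase the total memory
#SBATCH --ntasks-per-node=128   # One MPI task per physical core
#SBATCH --mem=0                 # Use all available memory on each node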
Reduce the number of processes (tasks) per node#
For MPI parallel programs it is generally recommended to use 128 MPI tasks per node since all Levante nodes have 128 physical cores available. In case of memory issues, you can try to reduce the number of tasks per node and spread the tasks over more nodes. For example, to run a parallel program with 256 tasks you can replace the recommended setting:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
with
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64
For hybrid (MPI + OpenMP) parallel codes, thread parallelism should be used to exploit the available CPUs per node more efficiently. You might need to configure the software build to actually enable OpenMP work sharing. Since threads share all the memory of a task except for stack memory, memory scaling is usually better than for pure MPI. A sketch of such a setup is given below.
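As a sketch, a hybrid job could combine fewer MPI tasks per node with several OpenMP threads per task; the values and the program name my_hybrid_program are placeholders that need to be adapted to your application:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32    # Fewer MPI tasks per node ...
#SBATCH --cpus-per-task=4       # ... with 4 OpenMP threads per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # Match thread count to the allocation
srun ./my_hybrid_program        # Placeholder for your hybrid executable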