How to get more memory for my Slurm job?#
Overview#
The amount of memory specified on the Levante configuration page for different node types refers to the total physical memory installed in a node. Since some memory is reserved for the needs of the operating system and the memory-based local file system (e.g. /tmp, /usr), the amount of memory actually available for job execution is less than the total physical memory of a node.
The table below provides the preset amounts of physical memory (RealMemory), memory reserved for the system (MemSpecLimit) and memory available for job execution (i.e. the difference between RealMemory and MemSpecLimit) for the three Levante node variants:
| Node type | Feature | RealMemory (M) | MemSpecLimit (M) | Available Memory (M) |
|---|---|---|---|---|
| Standard memory | | 257400 | 16384 | 241016 (ca. 235G) |
| Large memory | 512G | 515000 | 16384 | 498616 (ca. 486G) |
| Fat memory | 1024G | 1031600 | 16384 | 1015216 (ca. 991G) |
By default, the amount of memory available for a job is the product of the number of CPUs allocated by Slurm for the job and the default memory per CPU (DefMemPerCPU), which is set to 940M for all Levante partitions regardless of the node type. For example, for a full (i.e. exclusively used) node the amount of available memory would be 256 CPUs * 940M = 240640M (ca. 235G). This means that even if you select nodes with more RAM by choosing the interactive or gpu partition or by adding one of the following --constraint options to your job script:
#SBATCH --constraint=512G # Selects large memory node
#SBATCH --constraint=512G|1024G # Select large or ("|") fat memory node
the usable memory will still be limited to 235G by default. On Levante, Slurm does not automatically allocate all of a node's available main memory. To achieve that, it is necessary to additionally specify the following option:
#SBATCH --mem=0 # Use all available memory
In this context, zero is treated as a special case and all available memory on each node is allocated for job execution.
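If you want to verify how much memory Slurm actually granted to a job, one possibility is to inspect the job's trackable resources (TRES), for example as sketched below; the exact output format may differ between Slurm versions:
$ scontrol show job <jobid> | grep -i tres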
If your job tries to use more memory than allocated by Slurm, it will be terminated with the OUT_OF_MEMORY state, as can retrospectively be checked with the sacct command, e.g.:
$ sacct -X -j <jobid> -o jobid,jobname%20,state%30
JobID JobName State
------------ ---------------------------- -----------------------------------
<jobid> my_jobname OUT_OF_MEMORY
The corresponding job log file often contains the following error message:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=jobid.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: l40272: task 0: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=jobid.0
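To see how much memory the job actually consumed before it was killed, you can additionally query the accounting data of the individual job steps, e.g. the peak resident set size (MaxRSS) together with the requested memory (ReqMem). This is only a sketch; the availability of these fields depends on the accounting configuration:
$ sacct -j <jobid> -o jobid,jobname%20,reqmem,maxrss,state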
Below we describe a number of methods you can try if your job runs out of memory. In many cases these only provide a starting point; ultimately, code changes may be needed, such as checking for memory leaks and deleting/deallocating large arrays that are no longer used.
You can use ClusterCockpit to monitor used memory over time for your jobs.
Approaches to get more memory#
Increase the memory request in the batch script#
Slurm provides the --mem and --mem-per-cpu options for this purpose, e.g.:
#SBATCH --mem=50G
This approach is only reasonable if less than a full node is requested, e.g. for serial or multithreaded applications. Slurm automatically adjusts the number of allocated CPUs to match the memory request of the job.
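Alternatively, the memory request can be expressed per allocated CPU via --mem-per-cpu; the value below is only an illustration and has to be adapted to your application:
#SBATCH --mem-per-cpu=2000M # Request 2000 MB per allocated CPU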
Once the whole node is used, the memory request should be specified as
#SBATCH --mem=0
Select nodes equipped with more RAM#
To do this, you need to specify the --constraint and --mem=0 options on Levante, for example:
#SBATCH --constraint=512G # Selects large memory node
#SBATCH --constraint=512G|1024G # Select large or ("|") fat memory node
#SBATCH --mem=0 # Use all available memory
Please note that due to the small number of large memory (294) and fat memory (18) nodes, the pending times of your jobs might increase considerably.
Increase the number of nodes#
For parallel applications which can be executed on multiple nodes and show good memory scaling, you can try to use more nodes to increase the total amount of available memory, e.g.:
#SBATCH --nodes=10
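For example, a job header distributing an MPI application over ten full nodes and requesting all of their memory might look like the following sketch; the node and task counts are only an illustration and have to be adapted to your application:
#SBATCH --nodes=10              # Use ten nodes to increase the total memory
#SBATCH --ntasks-per-node=128   # One MPI task per physical core
#SBATCH --mem=0                 # Use all available memory on each node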
Reduce the number of processes (tasks) per node#
For MPI parallel programs it is generally recommended to use 128 MPI tasks per node since all Levante nodes have 128 physical cores available. In case of memory issues, you can try to reduce the number of tasks per node and spread the tasks over more nodes. For example, to run a parallel program with 256 tasks you can replace the recommended setting:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
with
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64
For hybrid (MPI + OpenMP) parallel codes, thread parallelism should be used to exploit the available CPUs per node more efficiently. You might need to configure the software build to actually enable OpenMP work sharing. Since threads share all the memory of a task except for stack memory, memory scaling is usually better than for pure MPI. A sketch of such a setup is given below.
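As a sketch, a hybrid job could combine fewer MPI tasks per node with several OpenMP threads per task; the values and the program name my_hybrid_program are placeholders that need to be adapted to your application:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32    # Fewer MPI tasks per node ...
#SBATCH --cpus-per-task=4       # ... with 4 OpenMP threads per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # Match thread count to the allocation
srun ./my_hybrid_program        # Placeholder for your hybrid executable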