Partitions and Limits#

Partitions#

In Slurm, multiple nodes can be grouped into partitions, which are sets of nodes with associated limits for wall-clock time, job size, etc. These are hard limits for the jobs and can only be overruled by a QOS (quality of service). The defined partitions can overlap, i.e. one node might be contained in several partitions.

Jobs are allocations of resources by users in order to execute tasks on the cluster for a specified period of time. Furthermore, the concept of job steps is used by Slurm to describe a set of different tasks within the job. One can imagine job steps as smaller allocations or jobs within the job, which can be executed sequentially or in parallel during the main job allocation.

The Slurm sinfo command lists all partitions and nodes managed by Slurm on Levante and provides general information about the current status of the nodes (Allocated/Idle/Other/Total):

$ sinfo -o "%.11P %.5a %.10l %.14F %.20f %N" -p compute,gpu,shared,interactive

  PARTITION AVAIL  TIMELIMIT NODES(A/I/O/T)       AVAIL_FEATURES NODELIST
        gpu    up   12:00:00     11/41/4/56  512G,cell13,a100_80 l[50000,50003,50006,50009,50012,...]
        gpu    up   12:00:00        0/1/1/2  512G,cell09,a100_40 l[40360,40363]
        gpu    up   12:00:00        0/2/0/2 1024G,cell09,a100_40 l[40366,40369]
    compute    up    8:00:00    285/0/0/285          256G,cell01 l[10000-10058,10060-10095,10100-...]
    compute    up    8:00:00    283/5/0/288          256G,cell02 l[10300-10395,10400-10495,10500-...]
    compute    up    8:00:00    280/8/0/288          256G,cell03 l[10600-10695,10700-10795,20000-...]
    compute    up    8:00:00    288/0/0/288          256G,cell04 l[20100-20195,20200-20295,20300-...]
    compute    up    8:00:00    288/0/0/288          256G,cell05 l[20400-20495,20500-20595,20600-...]
    compute    up    8:00:00    285/0/3/288          256G,cell06 l[30000-30095,30100-30195,30200-...]
    compute    up    8:00:00    288/0/0/288          256G,cell07 l[30300-30395,30400-30495,30500-...]
    compute    up    8:00:00    200/7/0/207          256G,cell08 l[30600-30695,30700-30795,40015-...]
    compute    up    8:00:00      30/6/0/36          512G,cell08 l[40030-40047,40072-40083,40090-...]
    compute    up    8:00:00   217/17/0/234          512G,cell09 l[40100-40183,40190-40195,40200-...]
    compute    up    8:00:00        7/2/0/9         1024G,cell09 l[40351-40359]
    compute    up    8:00:00    276/0/0/276          256G,cell11 l[40400-40495,40500-40550,40560-...]
    compute    up    8:00:00        9/0/0/9         1024G,cell11 l[40551-40559]
    compute    up    8:00:00    159/0/0/159          256G,cell10 l[50200-50295,50300-50359,50369-...]
     shared    up 7-00:00:00      15/0/0/15          256G,cell08 l[40000-40014]
interactive    up   12:00:00      17/7/0/24          512G,cell08 l[40048-40071]
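To see at a glance how many nodes are free, the idle counts in the A/I/O/T column can be summed per partition. The following is a minimal sketch, not an official tool: the helper name idle_per_partition is hypothetical, and it assumes sinfo is called without a header (-h) and with the simplified format string shown in the script.

```shell
#!/bin/bash
# Hypothetical helper: sums the idle-node counts per partition.
# Expects lines of the form "<partition> <avail> <timelimit> <A/I/O/T>" on stdin.
idle_per_partition() {
    awk '{
        split($4, n, "/")      # n[1]=allocated, n[2]=idle, n[3]=other, n[4]=total
        idle[$1] += n[2]
    }
    END { for (p in idle) print p, idle[p] }' | sort
}

# Only query Slurm where the client tools are installed.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -o "%P %a %l %F" -p compute,gpu,shared,interactive | idle_per_partition
fi
```

Applied to the three gpu lines of the listing above, the helper would report 44 idle nodes (41 + 1 + 2).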

For detailed information about all available partitions and their limits use the Slurm scontrol command as follows:

$ scontrol show partition
$ scontrol show partition interactive
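The scontrol output consists of many Key=Value pairs, so it can help to extract a single field. The sketch below assumes the usual space-separated Key=Value layout; the helper name partition_field is hypothetical, and MaxTime and MaxNodes are used as examples of field names.

```shell
#!/bin/bash
# Hypothetical helper: prints the value of one Key=Value field
# from "scontrol show partition" output read on stdin.
partition_field() {
    # Put every Key=Value pair on its own line, then strip the "Key=" prefix.
    tr -s ' ' '\n' | sed -n "s/^$1=//p"
}

# Only query Slurm where the client tools are installed.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show partition interactive | partition_field MaxTime
    scontrol show partition compute | partition_field MaxNodes
fi
```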

The following publicly available partitions are currently defined on Levante:

compute#

This partition consists of 2943 AMD EPYC 7763 Milan compute nodes and is intended for running parallel scientific applications. The compute nodes allocated for a job are used exclusively and cannot be shared with other jobs.

The partition contains nodes with different memory configurations. If you want to use the entire memory of a larger node, you have to request all memory with the --mem=0 option.
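As an illustration, a batch script requesting one larger-memory node with all of its memory might look like the sketch below. It assumes that the feature names shown in the AVAIL_FEATURES column of sinfo (here 512G) can be selected with --constraint; the account xz0123 and the program ./my_program are hypothetical placeholders.

```shell
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123        # hypothetical project account
#SBATCH --nodes=1
#SBATCH --constraint=512G       # assumption: feature name as listed by sinfo
#SBATCH --mem=0                 # request all memory of the allocated node
#SBATCH --time=01:00:00

# Hypothetical application; replace with your own executable.
srun ./my_program
```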

shared#

This partition is defined on 15 nodes and can be used to run small jobs that do not require a whole node, so that one compute node can be shared between different jobs. The partition is intended for shared-memory applications parallelized with OpenMP or pthreads as well as for serial and parallel data processing jobs that need a considerably longer allocation period than usual compute jobs.

interactive#

The interactive partition is made up of 24 nodes but can be dynamically expanded if there is a short-term need. It is intended for memory- or compute-intensive data processing and compilation tasks that should not run on the login nodes. Nodes of this partition can be shared with other jobs if a single job does not allocate all resources. Use salloc to allocate the resources and jump directly to the allocated node. In general, requesting resources from this partition should not incur any waiting time. The total amount of resources per user in this partition is limited to the equivalent of one node.

gpu#

The 60 nodes in this partition are each equipped with 2 AMD EPYC Milan 7713 CPUs and, in addition, 4 Nvidia A100 GPUs. They can be used for GPGPU-aware scientific applications (e.g. via OpenACC programming) or for interactive 3-dimensional data visualization via VirtualGL/TurboVNC. More details on how to use the GPU nodes are given under Using GPU nodes.

Caution

All jobs on Levante have to be assigned to a partition; there is no default partition. The partition can be chosen in several ways:

  • Batch script option:

    #SBATCH [-p|--partition=]<partitionname>
    
  • Environment variable:

    $ export SBATCH_PARTITION=<partitionname>
    
  • Command line option:

    $ sbatch [-p|--partition=]<partitionname>
    

Note that an environment variable will override any matching option set in a batch script, and command line options will override any matching environment variable.
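The precedence rule can be pictured with a small toy model. This is only an illustration of the order described above, not how Slurm implements it; the function effective_partition and the script name job.sh in the comments are hypothetical.

```shell
#!/bin/bash
# Toy model of the precedence rule (not how Slurm is implemented):
# command line option > SBATCH_PARTITION > #SBATCH line in the script.
effective_partition() {
    local cli_value="$1" script_value="$2"
    if [ -n "$cli_value" ]; then
        echo "$cli_value"
    elif [ -n "${SBATCH_PARTITION:-}" ]; then
        echo "$SBATCH_PARTITION"
    else
        echo "$script_value"
    fi
}

# Assume job.sh contains the line "#SBATCH --partition=compute".
unset SBATCH_PARTITION
effective_partition "" compute      # like: sbatch job.sh                 -> compute
export SBATCH_PARTITION=shared
effective_partition "" compute      # like: sbatch job.sh                 -> shared
effective_partition gpu compute     # like: sbatch --partition=gpu job.sh -> gpu
unset SBATCH_PARTITION
```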

Limits#

Limits are meant to enforce fair resource usage policies and keep the cluster available. On Levante, limits are set to avoid situations where a single user consumes a disproportionate share of cluster resources or unintentionally overwhelms the cluster with excessive jobs. The Memory per CPU settings ensure that total memory usage stays below the memory available on the node and prevent any single job from consuming an excessive amount of memory that could negatively impact other jobs running on the same node.

The Slurm limits configured for different partitions on Levante are:

Partition Name  Max Nodes per Job  Max Job Runtime  Max resources*  Shared Node Usage  Default Memory per CPU  Max Memory per CPU
compute         512                8 hours          no limit        no                 940 MB                  3940 MB
shared          1                  7 days           512 CPUs        yes                940 MB                  940 MB
interactive     1                  12 hours         256 CPUs        yes                1940 MB                 1940 MB
gpu             60                 12 hours         no limit        yes                1940 MB                 3940 MB
gpu-devel       1                  30 minutes       no limit        yes                1940 MB                 1940 MB

*used simultaneously by all running jobs of a user
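To estimate how much memory a job on a shared node receives by default, multiply the number of requested CPUs by the Default Memory per CPU of the partition. The sketch below hard-codes the values from the table above and assumes the job requests CPUs but sets neither --mem nor --mem-per-cpu; the function names are hypothetical.

```shell
#!/bin/bash
# Default memory per CPU in MB, copied from the limits table above.
default_mem_per_cpu() {
    case "$1" in
        compute|shared)            echo 940 ;;
        interactive|gpu|gpu-devel) echo 1940 ;;
        *) echo "unknown partition: $1" >&2; return 1 ;;
    esac
}

# Memory in MB a job gets when it requests only CPUs and no explicit memory.
default_job_mem() {
    local per_cpu
    per_cpu=$(default_mem_per_cpu "$1") || return 1
    echo $(( per_cpu * $2 ))
}

default_job_mem shared 16        # 16 * 940  = 15040
default_job_mem interactive 8    # 8 * 1940  = 15520
```

If a job needs more than this, it has to request memory explicitly, within the Max Memory per CPU of the partition.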

Additionally, we enforce the following restrictions regarding the number of jobs per user and project account:

  • Maximum number of submitted jobs: 1000

  • Maximum number of running jobs in compute and shared partitions together: 20

  • Maximum number of running jobs in gpu partition: 5

  • Maximum number of running jobs in interactive partition: 5
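To check how close you are to these limits, the running jobs can be counted per partition. This is a rough sketch: the helper name running_vs_limits is hypothetical, the limits are copied from the list above, and it counts all running jobs of the user without distinguishing project accounts.

```shell
#!/bin/bash
# Hypothetical helper: counts running jobs per partition group and prints
# them next to the limits listed above. Expects one partition name per line.
running_vs_limits() {
    awk '
        $1 == "compute" || $1 == "shared" { cs++ }
        $1 == "gpu"                       { gpu++ }
        $1 == "interactive"               { ia++ }
        END {
            printf "compute+shared %d/20\n", cs
            printf "gpu %d/5\n", gpu
            printf "interactive %d/5\n", ia
        }'
}

# Only query Slurm where the client tools are installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -t RUNNING -o "%P" | running_vs_limits
fi
```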

Hint

If you require longer execution times, more concurrent jobs, or more nodes per job, you can ask for higher limits by sending a request to support@dkrz.de. Please include the following information in your request: username(s), project id, the reason why you need higher limits, which limits to increase, and for how long they should be increased. A brief justification by your project administrator is also needed. Subject to technical limitations and fair resource sharing, the predefined limits can then be adjusted for a limited time by assigning an appropriate Quality of Service (QOS).

A QOS can be requested for a job by specifying the --qos option as follows:

  • Batch script option:

    #SBATCH --qos=name_of_qos
    
  • Environment variable:

    $ export SBATCH_QOS=name_of_qos
    $ export SALLOC_QOS=name_of_qos
    
  • Command line option:

    $ sbatch --qos=name_of_qos myscript.sh
    

To list the job limits and QOS relevant to you, use the sacctmgr command, for example:

$ sacctmgr -s show user $USER

$ sacctmgr -s show user $USER format=user,account,maxjobs,maxsubmit,qos