Posts tagged slurm

How to get more memory for my Slurm job

The amount of memory specified on the Levante configuration page for different node types refers to the total physical memory installed in a node. Since some memory is reserved for the needs of the operating system and the memory-based local file system (e.g. /tmp, /usr), the amount of memory actually available for job execution is less than the total physical memory of a node.

The table below provides numbers for the preset amounts of physical memory (RealMemory), memory reserved for the system (MemSpecLimit) and memory available for job execution (which is the difference between RealMemory and MemSpecLimit) for three Levante node variants:

Read more ...
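The difference between RealMemory and MemSpecLimit can also be read off a node directly. The sketch below assumes that `scontrol show node <nodename>` prints both values as whitespace-separated `Key=Value` fields in MB, and that MemSpecLimit appears only when it is configured. The function name `avail_job_mem` is hypothetical.

```shell
# Hypothetical helper: memory (in MB) available for jobs on a node,
# i.e. RealMemory minus MemSpecLimit, parsed from `scontrol show node` output.
avail_job_mem() {
  awk '
    {
      # scontrol prints whitespace-separated Key=Value pairs
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "RealMemory")   real = kv[2]
        if (kv[1] == "MemSpecLimit") spec = kv[2]
      }
    }
    # If MemSpecLimit is not shown, spec stays empty and counts as 0
    END { print real - spec }
  '
}
```

Usage: `scontrol show node <nodename> | avail_job_mem`. A job whose `--mem` request exceeds this value cannot be placed on nodes of that type, because MemSpecLimit is not available for user allocations.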


Slurm-managed cronjobs

To execute recurring batch jobs at specified dates, times, or intervals, you can use the Slurm scrontab tool. It provides a reliable alternative to the traditionally used cron utility to automate periodic tasks on Levante.

To define recurring jobs, Slurm uses a configuration file, the so-called crontab, which is managed with the scrontab command. Running scrontab with the -e option starts an editing session in which you can create or modify your crontab:

Read more ...
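As an illustration, an entry created in such an editing session might look like the one below. It assumes the scrontab convention that `#SCRON` lines pass sbatch options to the entry that follows them. The project account, time limit and script path are hypothetical placeholders.

```shell
# Example crontab content for `scrontab -e` (placeholders, adjust before use)
#SCRON --account=<project>
#SCRON --time=00:30:00
# minute hour day-of-month month day-of-week command
0 6 * * * /path/to/daily_task.sh
```

This entry would make `daily_task.sh` eligible to run as a batch job every day at 06:00, charged to the given account.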


Why does my job wait so long before being executed? or: Why is my job being overtaken by other jobs in the queue?

There are several possible reasons for a job to be queued for a long time and/or to be overtaken by …

… later-submitted jobs with a higher priority (usually these belong to users or projects that have used less of their share than you have).

Read more ...
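To see which of these reasons applies to a particular job, Slurm's own query tools can help. The sketch below combines two commands:

- `squeue`, which reports the job state and the pending reason via the `%T` and `%r` format fields;
- `sprio`, which breaks the priority down into its factors (assuming the multifactor priority plugin is in use).

The function name `why_pending` is hypothetical.

```shell
# Hypothetical helper: show why a job is still waiting and how its
# priority is composed. Usage: why_pending <jobid>
why_pending() {
  local jobid="$1"
  # %T = job state, %r = reason for that state (e.g. Priority, Resources)
  squeue -h -j "$jobid" -o "%T %r"
  # Priority broken down into its factors (age, fair share, job size, ...)
  sprio -j "$jobid"
}
```

A reason of `Priority` means that jobs with a higher priority are ahead in the queue. `Resources` means the job is waiting for resources to become available.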


When will my SLURM job start?

The Slurm squeue command with the options --start and -j provides an estimate of the job start time, for example:

Read more ...
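For use in scripts, the estimate can be reduced to the start-time field alone. This sketch relies on the `%S` output field of `squeue`, which prints the actual or expected start time, and it assumes that `N/A` (or empty output) means no estimate is available yet. The function name is hypothetical.

```shell
# Hypothetical helper: print the estimated start time of a pending job.
# Usage: job_start_estimate <jobid>
job_start_estimate() {
  local start
  # --start selects pending jobs with their expected start; %S prints it
  start=$(squeue --start -h -j "$1" -o "%S")
  if [ -z "$start" ] || [ "$start" = "N/A" ]; then
    echo "Job $1: no start time estimate available"
  else
    echo "Job $1: expected start $start"
  fi
}
```

The estimate is only a snapshot. It can move earlier or later as other jobs finish or are submitted.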


How to set the default Slurm project account

Specifying the project account (via option -A or --account) is required to submit a job or create a job allocation; otherwise your request will be rejected. To set a default project account, you can use the following Slurm input environment variables:

SLURM_ACCOUNT - interpreted by the srun command

Read more ...
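A minimal sketch for `~/.bashrc`, assuming a hypothetical project `ab1234` and a hypothetical helper name. Besides `SLURM_ACCOUNT`, the Slurm documentation lists `SBATCH_ACCOUNT` and `SALLOC_ACCOUNT` as the corresponding input variables for sbatch and salloc. Command-line options still override these variables. For sbatch, the variables in turn override `#SBATCH` directives in the job script.

```shell
# Hypothetical helper: make one project the default for srun, sbatch and salloc.
# Usage (e.g. in ~/.bashrc): set_default_slurm_account ab1234
set_default_slurm_account() {
  export SLURM_ACCOUNT="$1"    # read by srun
  export SBATCH_ACCOUNT="$1"   # read by sbatch
  export SALLOC_ACCOUNT="$1"   # read by salloc
}
```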


How to display the batch script for a running job

Once your batch job has started execution (i.e. is in RUNNING state), your job script is copied to the Slurm admin nodes and kept there until the job finishes. This prevents problems that might occur if the job script is modified while the job is running. As a side effect, you can delete the job script without interfering with the execution of the job.

If you accidentally removed or modified the job script of a running job, you can use the following command to query for the script that is actually used for executing the job:

Read more ...
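One way to do this with standard Slurm tooling is `scontrol write batch_script`. It writes the stored script of a job to a file (by default `slurm-<jobid>.sh`) or, with `-` as the file name, to standard output. The wrapper below is a hypothetical convenience function around it.

```shell
# Hypothetical wrapper around `scontrol write batch_script`.
# Usage: show_job_script <jobid> [file]   (default: print to stdout)
show_job_script() {
  local jobid="$1" target="${2:--}"
  scontrol write batch_script "$jobid" "$target"
}
```

For example, `show_job_script 12345 restored_job.sh` would save the script of a (hypothetical) job 12345 so that you can recover a deleted or modified copy.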


How to write a shell alias or function for quick login to a node managed by SLURM

For tasks that are best run interactively on dedicated resources, it can be convenient to save the recurring pattern of reserving resources and starting a new shell on them as an alias or function, as explained below.

If you use bash as your default shell, you can place the following alias definition in your ~/.bashrc file and source that file from ~/.bash_profile or ~/.profile:

Read more ...
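A sketch of the function variant for bash follows. It assumes that an interactive shell on a compute node can be obtained with `srun --pty bash -l`. The function name `quicknode`, the partition name `interactive` and the one-hour default time limit are also assumptions; adapt them to your project and to the partitions available to you.

```shell
# Hypothetical function: allocate one node and open a login shell on it.
# Usage: quicknode <account> [time]   (partition name is an assumption)
quicknode() {
  local account="$1" walltime="${2:-01:00:00}"
  if [ -z "$account" ]; then
    echo "usage: quicknode <account> [time]" >&2
    return 1
  fi
  srun --account="$account" --partition=interactive --nodes=1 \
       --time="$walltime" --pty bash -l
}
```

After sourcing `~/.bashrc`, running `quicknode ab1234` (hypothetical project) requests one node for an hour and starts a login shell there. Leaving that shell with `exit` ends the allocation.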


How can I see on which nodes my job was running?

You can use the Slurm sacct command with the following options:

Read more ...
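For example, the node list of a job can be retrieved with standard sacct options:

- `-X` restricts output to the job allocation itself;
- `-n` drops the header;
- `-P` gives parsable output;
- `-o NodeList` selects the field.

The helper name is hypothetical.

```shell
# Hypothetical helper: print the node list of a (running or finished) job.
# Usage: job_nodes <jobid>
job_nodes() {
  # -X: allocation only (no steps), -n: no header, -P: parsable output
  sacct -j "$1" -X -n -P -o NodeList
}
```

The result is a compressed host list (of the form `node[01-04]`). `scontrol show hostnames "$(job_nodes <jobid>)"` expands it to one node name per line.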


How can I choose which account to use, if I am subscribed to more than one project?

Just insert the following line into your job script:

There is no default project account.

Read more ...
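A minimal job script header illustrating this, assuming a hypothetical project `ab1234` and placeholder resource requests. `SLURM_JOB_ACCOUNT` is the output variable in which Slurm reports the account associated with the job.

```shell
#!/bin/bash
#SBATCH --account=ab1234     # hypothetical project, replace with one of yours
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# Slurm sets SLURM_JOB_ACCOUNT inside the job; outside Slurm it is unset
echo "Account used for this job: ${SLURM_JOB_ACCOUNT:-unknown}"
```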


Can I run cron jobs on HPC login nodes?

Update 2024-10-01: This procedure has been superseded by the Slurm scrontab feature, now available on Levante.

For system administration reasons, users are not allowed to schedule and execute periodic jobs on DKRZ HPC systems using the cron utility. Our recommendation is to use the functionality provided by the workload manager Slurm for this purpose. With the --begin option of the sbatch command you can postpone the execution of your jobs until the specified time. For example, to run a job every day after 12pm you can use the following job script, which resubmits itself at the beginning of its execution:

Read more ...
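A sketch of such a self-resubmitting job script follows. It assumes GNU `date`, used to compute tomorrow 12:00 in the ISO format that `--begin` accepts. The script location `$HOME/jobs/daily_task.sh`, the account and the time limit are hypothetical placeholders. Resubmitting before the actual work keeps the chain alive even if the task itself fails.

```shell
#!/bin/bash
#SBATCH --account=ab1234          # hypothetical project account
#SBATCH --job-name=daily_task
#SBATCH --time=00:30:00

# Hypothetical location of this very script (Slurm runs a spooled copy,
# so a fixed path is used for resubmission instead of $0)
SCRIPT_PATH="$HOME/jobs/daily_task.sh"

# Tomorrow at 12:00 in the YYYY-MM-DDTHH:MM:SS form accepted by --begin (GNU date)
next_begin() {
  date -d 'tomorrow 12:00' +%Y-%m-%dT%H:%M:%S
}

# Placeholder for the actual periodic work
run_daily_task() {
  echo "daily task on $(hostname) at $(date)"
}

# Only resubmit and do the work when running as a Slurm job
if [ -n "${SLURM_JOB_ID:-}" ]; then
  # Resubmit first, so the chain continues even if the task below fails
  sbatch --begin="$(next_begin)" "$SCRIPT_PATH"
  run_daily_task
fi
```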