Posts tagged slurm
How to get more memory for my Slurm job
- 30 September 2024
The amount of memory specified on the Levante configuration page for different node types refers to the total physical memory installed in a node. Since some memory is reserved for the needs of the operating system and the memory-based local file system (e.g. /tmp, /usr), the amount of memory actually available for job execution is less than the total physical memory of a node.
The table below provides numbers for the preset amounts of physical memory (RealMemory), memory reserved for the system (MemSpecLimit), and memory available for job execution (which is the difference between RealMemory and MemSpecLimit) for three Levante node variants:
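To inspect these limits for a concrete node, or to request a larger-memory node variant, something like the following can be used. The node name, the feature name 512G, and the script name are placeholders; the actual feature names for the Levante node variants may differ:

```shell
# Inspect the memory configuration of a compute node
# (node name is a placeholder; list nodes with "sinfo"):
scontrol show node l10000 | grep -oE '(RealMemory|MemSpecLimit)=[0-9]+'

# Request a larger-memory node variant and all of its usable memory;
# --mem=0 asks Slurm for all memory available on the node:
sbatch --constraint=512G --mem=0 job_script.sh
```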
Slurm-managed cronjobs
- 06 August 2024
To execute recurring batch jobs at specified dates, times, or intervals, you can use the Slurm scrontab tool. It provides a reliable alternative to the traditionally used cron utility to automate periodic tasks on Levante.
To define the recurring jobs, Slurm uses a configuration file, a so-called crontab, which is handled using the scrontab command. The scrontab command with the -e option invokes an editing session, so you can create or modify a crontab:
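A minimal scrontab entry might look as follows (the job name, time limit, and script path are made-up examples). Lines starting with #SCRON pass sbatch options to the recurring job:

```shell
# Run the script every day at 06:00.
#SCRON --job-name=daily-sync
#SCRON --time=00:10:00
0 6 * * * /path/to/daily_sync.sh
```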
Why does my job wait so long before being executed? or: Why is my job being overtaken by other jobs in the queue?
- 19 June 2017
There are several possible reasons for a job to be queued for a long time and/or to be overtaken …
… later submitted jobs with a higher priority (usually these have used less of their share than your job).
When will my SLURM job start?
- 19 June 2017
The SLURM squeue command with the options --start and -j provides an estimate for the job start time, for example:
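A sketch of such a query (the job ID is a placeholder):

```shell
# Print the estimated start time of the pending job 1234567:
squeue --start -j 1234567
```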
How to set the default Slurm project account
- 19 June 2017
Specification of the project account (via option -A or --account) is necessary to submit a job or make a job allocation; otherwise your request will be rejected. To set the default project account you can use the following SLURM input environment variables:
SLURM_ACCOUNT - interpreted by the srun command
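The corresponding variables for sbatch and salloc follow the same naming pattern. A sketch for the ~/.bashrc file, with xz0123 as a placeholder account name:

```shell
# Default project account for the individual Slurm commands:
export SLURM_ACCOUNT=xz0123    # interpreted by srun
export SBATCH_ACCOUNT=xz0123   # interpreted by sbatch
export SALLOC_ACCOUNT=xz0123   # interpreted by salloc
```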
How to display the batch script for a running job
- 19 June 2017
Once your batch job has started execution (i.e. is in RUNNING state), your job script is copied to the Slurm admin nodes and kept until the job finalizes. This prevents problems that might occur if the job script is modified while the job is running. As a side effect, you can delete the job script without interfering with the execution of the job.
If you accidentally removed or modified the job script of a running job, you can use the following command to query for the script that is actually used for executing the job:
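One way to do this is scontrol's write batch_script subcommand (the job ID is a placeholder); with - as the file name the script is written to standard output instead of a file:

```shell
# Dump the batch script of the running job 1234567 to stdout:
scontrol write batch_script 1234567 -
```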
How to write a shell alias or function for quick login to a node managed by SLURM
- 19 June 2017
For tasks better run in a dedicated but interactive fashion, it might be advantageous to save the repeating pattern of reserving resources and starting a new associated shell in an alias or function, as explained below.
If you use bash as your default shell, you can place the following alias definition in your ~/.bashrc file and source this file in the ~/.bash_profile or in the ~/.profile file:
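A sketch of such an alias, assuming an interactive partition exists; the partition, account, and time-limit values are placeholders:

```shell
# Allocate one task and open an interactive shell on the compute node:
alias sshell='srun --partition=interactive --account=xz0123 --time=01:00:00 --pty bash -i'
```

After sourcing ~/.bashrc, typing sshell drops you into a shell on an allocated node; exiting the shell releases the allocation.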
How can I see on which nodes my job was running?
- 19 June 2017
You can use the SLURM sacct command with the following options:
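For example (the job ID is a placeholder):

```shell
# List the nodes on which job 1234567 ran:
sacct -j 1234567 --format=JobID,JobName,NodeList
```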
How can I choose which account to use, if I am subscribed to more than one project?
- 19 June 2017
Just insert the following line into your job script:
There is no default project account.
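A sketch of the relevant job script line, with xz0123 as a placeholder for one of your project accounts:

```shell
#!/bin/bash
#SBATCH --account=xz0123   # project account to be charged for this job
```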
Can I run cron jobs on HPC login nodes?
- 19 June 2017
Update 2024-10-01: This procedure has been superseded by the Slurm scrontab feature, now available on Levante.
For system administration reasons, users are not allowed to schedule and execute periodic jobs on DKRZ HPC systems using the cron utility. Our recommendation is to use the functionality provided by the workload manager Slurm for this purpose. With the option --begin of the sbatch command you can postpone the execution of your jobs until the specified time. For example, to run a job every day after 12pm you can use the following job script re-submitting itself at the beginning of the execution:
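A minimal sketch of such a self-resubmitting script; the job name, account, time limit, and the actual work are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=daily-task
#SBATCH --account=xz0123      # placeholder project account
#SBATCH --time=00:05:00

# Re-submit this script for execution 24 hours from now,
# before doing the actual work:
sbatch --begin=now+24hours "$0"

# ... the periodic work of the job goes here ...
```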