Posted in 2017
Why does my job wait so long before being executed? or: Why is my job being overtaken by other jobs in the queue?
- 19 June 2017
There are several possible reasons for a job to be queued for a long time and/or to be overtaken …
… later submitted jobs with a higher priority (usually these have used less of their share than your job).
When will my SLURM job start?
- 19 June 2017
The SLURM squeue command with the options --start and -j provides an estimate of the job start time, for example:
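A minimal sketch of such a query, wrapped in a shell function for reuse. The function name job_start_estimate is an illustrative choice, not something Slurm provides; the job ID is passed as the only argument.

```shell
# Hypothetical helper: ask Slurm for the estimated start time of one job.
# Usage: job_start_estimate <jobid>
job_start_estimate() {
    squeue --start -j "$1"
}
```

For a pending job the estimate appears in the START_TIME column of the output. It can change as other jobs finish or are submitted.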
How to use environment modules in batch scripts
- 19 June 2017
module is a shell function which modifies the shell environment after
loading or unloading a module file. If you are using different shells
as login shell and for batch job scripts (e.g. tcsh as login shell
while your job scripts start with #!/bin/bash), you need to add an
appropriate source command to your script before any invocation of
the module function (otherwise the error message
module: command not found. will result and the shell environment
won’t be modified as intended):
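A minimal sketch of that pattern for a bash job script. The location of the initialization file that defines the module function is site-specific: /etc/profile below is only an assumed default, and MODULE_INIT is a hypothetical variable introduced here so the path can be overridden.

```shell
#!/bin/bash
# Make the module function available if this shell has not defined it yet.
# MODULE_INIT is a hypothetical override; /etc/profile is an assumed default.
ensure_module() {
    if ! type module >/dev/null 2>&1; then
        # "." is the portable spelling of the source command.
        . "${MODULE_INIT:-/etc/profile}"
    fi
}
```

Call ensure_module once near the top of the job script, before the first module load.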
How to use SSHFS to mount remote lustre filesystem over SSH
- 19 June 2017
In order to interact with directories and files located on the Lustre filesystem, users can mount the remote filesystem via SSHFS (SSH Filesystem) over a normal ssh connection.
SSHFS is Linux-based software that needs to be installed on your local computer. On Ubuntu and Debian based systems it can be installed through apt-get. On Mac OSX you can install SSHFS as well: you will need to download FUSE and SSHFS from the osxfuse site. On Windows you will need to grab the latest win-sshfs package from the Google Code repository or use an alternative approach like WinSCP.
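Once SSHFS is installed, mounting could look like the following sketch. The function name, the remote specification and the mount point are placeholders chosen for illustration; substitute your own user name, login host and Lustre path.

```shell
# Hypothetical helper: mount a remote directory on a local mount point.
# Usage: mount_remote_lustre <user@host:/remote/path> <local_mountpoint>
mount_remote_lustre() {
    mkdir -p "$2" && sshfs "$1" "$2"
}

# To unmount again on Linux:
#   fusermount -u <local_mountpoint>
```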
How to set the default Slurm project account
- 19 June 2017
Specification of the project account (via option -A or --account) is
necessary to submit a job or make a job allocation, otherwise your
request will be rejected. To set the default project account you can
use the following SLURM input environment variables:
SLURM_ACCOUNT - interpreted by the srun command
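A sketch of how such variables might be set, for example in ~/.bashrc. The project id xz0123 is a placeholder. SBATCH_ACCOUNT and SALLOC_ACCOUNT are the corresponding Slurm input variables for the sbatch and salloc commands.

```shell
# Placeholder project id; replace with one of your own projects.
export SLURM_ACCOUNT=xz0123             # read by srun
export SBATCH_ACCOUNT="$SLURM_ACCOUNT"  # read by sbatch
export SALLOC_ACCOUNT="$SLURM_ACCOUNT"  # read by salloc
```

An explicit -A or --account option on the command line still takes precedence over these variables.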
How to improve interactive performance of MATLAB
- 19 June 2017
When using ssh X11 forwarding (options -X or -Y), MATLAB can be slow
to start and can also respond slowly to interactive use. This is
because X11 sends many small packets over the network, often awaiting
a response before continuing. This interacts unfavorably with medium-
or high-latency connections, e.g. WiFi.
A way to avoid this issue is to use dedicated resources and to start
a remote desktop session that does not suffer from network latencies
in the same manner. This approach requires a VNC client (vncviewer)
installed on your local machine. A widely used VNC client is TurboVNC.
How to display the batch script for a running job
- 19 June 2017
Once your batch job has started execution (i.e. is in RUNNING state),
your job script is copied to the Slurm admin nodes and kept until the
job finalizes. This prevents problems that might occur if the job
script gets modified while the job is running. As a side effect you
can delete the job script without interfering with the execution of
the job.
If you accidentally removed or modified the job script of a running job, you can use the following command to query for the script that is actually used for executing the job:
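One way to do this, assuming a Slurm version that supports the scontrol write batch_script subcommand (older versions may not). The function name show_job_script is illustrative.

```shell
# Hypothetical helper: print the stored batch script of a job.
# Usage: show_job_script <jobid>
show_job_script() {
    # "-" sends the script to stdout instead of a file named slurm-<jobid>.sh
    scontrol write batch_script "$1" -
}
```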
How to write a shell alias or function for quick login to a node managed by SLURM
- 19 June 2017
For tasks better run in a dedicated but interactive fashion, it might be advantageous to save the repeating pattern of reserving resources and starting a new associated shell in an alias or function, as explained below.
If you use bash as default shell you can place the following alias
definition in your ~/.bashrc file and source this file in the
~/.bash_profile or in the ~/.profile file:
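The original definition is not reproduced in this excerpt, so the following is only a hedged sketch of the idea. It assumes that salloc starts a shell associated with the new allocation, which is its default behavior; whether that shell runs on the login node or on the allocated node depends on the site configuration. The name node_login and the one-hour limit are illustrative.

```shell
# Hypothetical helper for ~/.bashrc: reserve one node and start a shell
# tied to the allocation. Extra salloc options can be appended.
# Usage: node_login --account=<project> --partition=<partition>
node_login() {
    salloc --nodes=1 --time=01:00:00 "$@"
}
```

A function is used instead of an alias here because it can forward additional options such as the project account.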
How can I see on which nodes my job was running?
- 19 June 2017
You can use the SLURM sacct command with the following options:
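A minimal sketch, assuming the job ID is known. The selection of output fields is one reasonable choice rather than the only one, and the function name job_nodes is illustrative.

```shell
# Hypothetical helper: list the nodes a job ran on, with its name and state.
# Usage: job_nodes <jobid>
job_nodes() {
    sacct -j "$1" --format=JobID,JobName,NodeList,State
}
```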
How can I run a short MPI job using up to 4 nodes?
- 19 June 2017
You can use SLURM Quality of Service (QOS) express by inserting the following line into your job script:
or using the option --qos with the sbatch command:
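Both variants, sketched under assumptions: the script name express_job.sh and the program name are placeholders, and options your site requires (such as the project account or the partition) are omitted for brevity.

```shell
# Create a minimal job script that requests the express QOS.
cat > express_job.sh <<'EOF'
#!/bin/bash
#SBATCH --qos=express
#SBATCH --nodes=4
# Placeholder for the actual MPI program:
srun ./my_mpi_program
EOF

# Alternative: leave the script unchanged and pass the option at submission:
#   sbatch --qos=express express_job.sh
```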
How can I get a stack trace if my program crashes?
- 19 June 2017
The classical approach to find the location where your program crashed is to run it in a debugger or inspect a core file with the debugger. A quick way to get the stack trace without the need for a debugger is to compile your program with the following options:
In case of a segmentation violation during execution of the program, detailed information on the location of the problem (call stack trace with routine names and line numbers) will be provided:
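The exact options depend on the compiler and are not shown in this excerpt. As an assumption-laden sketch: the Intel compilers accept -traceback and GNU Fortran accepts -fbacktrace, each combined with -g so that line numbers are available. The function name build_with_traceback is illustrative.

```shell
# Hypothetical helper: compile with run-time stack trace support.
# Usage: build_with_traceback <intel|gnu> <compiler arguments...>
build_with_traceback() {
    family=$1
    shift
    case "$family" in
        intel) ifort -g -traceback "$@" ;;
        gnu)   gfortran -g -fbacktrace "$@" ;;
        *)     echo "unknown compiler family: $family" >&2; return 1 ;;
    esac
}
```

For example, build_with_traceback intel -o myprog myprog.f90 would run ifort with both options added.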
How can I choose which account to use, if I am subscribed to more than one project?
- 19 June 2017
Just insert the following line into your job script:
There is no default project account.
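The line in question is an #SBATCH directive near the top of the job script. In this sketch the project id xz0123 is a placeholder.

```shell
#!/bin/bash
# Placeholder project id; replace with the project to be charged.
#SBATCH --account=xz0123
```

The same choice can be made per submission with sbatch --account=<project> on the command line.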
How can I check my disk space usage?
- 19 June 2017
Your individual disk space usage in HOME and SCRATCH as well as the quota state of your projects can be checked using the commands listed in File Systems. An additional overview is provided in the DKRZ online portal. The numbers there are updated daily.
(updated 2023-07-26)
How can I access my Lustre data from outside DKRZ/ZMAW?
- 19 June 2017
For data transfer you can use either sftp:
or rsync command:
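A hedged sketch of the rsync variant. The user name, host name and function name are placeholders for illustration; use your own account and the login host of the system.

```shell
# Placeholders: replace with your user name and the site's login host.
REMOTE="user@login.example.org"

# Copy a remote directory to the local machine over ssh.
# Usage: fetch_from_lustre <remote_path> <local_dir>
fetch_from_lustre() {
    rsync -av -e ssh "$REMOTE:$1" "$2"
}

# Interactive alternative with sftp:
#   sftp user@login.example.org
```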
Can I run cron jobs on HPC login nodes?
- 19 June 2017
For system administration reasons users are not allowed to schedule
and execute periodic jobs on DKRZ HPC systems using the cron utility.
Our recommendation is to use the functionality provided by the
workload manager Slurm for this purpose. With the option --begin of
the sbatch command you can postpone the execution of your jobs until
the specified time. For example, to run a job every day after 12pm
you can use the following job script, which re-submits itself at the
beginning of its execution:
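A sketch of such a self-resubmitting script, written to a file here so it can be inspected. The file name, the path used for resubmission and the workload script are all placeholders, and site-specific options such as account and partition are omitted.

```shell
# Create a job script that queues its own next run before doing the work.
cat > daily_job.sh <<'EOF'
#!/bin/bash
#SBATCH --begin=12:00
# Queue the next run first, so it is scheduled even if the work below fails.
# Assumption: the script is stored at this (hypothetical) path.
sbatch "$HOME/jobs/daily_job.sh"

# Placeholder for the actual periodic work:
./do_daily_work.sh
EOF
```

Because the resubmission happens after 12:00, that time of day has already passed and Slurm defers the new job to the next day.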
A variety of different date and time specifications is possible with
the --begin option, for example: now+1hour, midnight, noon, teatime,
YYYY-MM-DD[Thh:mm:ss], 7AM, 6PM etc. For more details see the manual
pages of the sbatch command:
Is a FTP client available on Levante?
- 18 June 2017
LFTP is installed on Levante for download and upload of files from/to an external server via File Transfer Protocol (FTP):
The user name for authentication can be provided via option -u or --user.
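A minimal sketch of opening a session with a given user name. The function name ftp_connect and the server name in the usage line are placeholders.

```shell
# Hypothetical helper: open an lftp session as a given user.
# Usage: ftp_connect <user> <server>, e.g. ftp_connect myuser ftp.example.org
ftp_connect() {
    lftp -u "$1" "$2"
}
```

When only the user name is given, lftp asks for the password interactively.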