Posts in FAQ

How to get more memory for my Slurm job

30 September 2024

The amount of memory specified on the Levante configuration page for different node types refers to the total physical memory installed in a node. Since some memory is reserved for the needs of the operating system and the memory-based local file system (e.g. /tmp, /usr), the amount of memory actually available for job execution is less than the total physical memory of a node.

The table below provides numbers for the preset amounts of physical memory (RealMemory), memory reserved for the system (MemSpecLimit) and memory available for job execution (which is the difference between RealMemory and MemSpecLimit) for three Levante node variants:

Read more ...
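
The relationship described above (memory available to jobs = RealMemory - MemSpecLimit) can be sketched in a few lines of Python. The function name and the example figures are hypothetical placeholders, not values from the Levante configuration, and both quantities are assumed to be given in megabytes.

```python
def available_memory_mb(real_memory_mb: int, mem_spec_limit_mb: int) -> int:
    """Return the memory left for job execution (RealMemory - MemSpecLimit)."""
    if mem_spec_limit_mb > real_memory_mb:
        raise ValueError("MemSpecLimit cannot exceed RealMemory")
    return real_memory_mb - mem_spec_limit_mb


# Placeholder numbers for illustration only (NOT actual Levante settings).
example_real_memory_mb = 256_000
example_mem_spec_limit_mb = 16_000

print(available_memory_mb(example_real_memory_mb, example_mem_spec_limit_mb))
```

A memory request that exceeds the value computed this way for a given node type cannot be satisfied by a node of that type.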

Slurm-managed cronjobs

06 August 2024

To execute recurring batch jobs at specified dates, times, or intervals, you can use the Slurm scrontab tool. It provides a reliable alternative to the traditionally used cron utility to automate periodic tasks on Levante.

To define the recurring jobs, Slurm uses a configuration file, the so-called crontab, which is managed with the scrontab command. The scrontab command with the -e option starts an editing session in which you can create or modify a crontab:

Read more ...

How to keep disk usage under control

11 January 2024

Below, we provide some tips on how to monitor your disk usage on Levante and how to find large files and directories in case you are nearing or exceeding your disk storage quota.

You can check your disk quota on Levante with the lfsquota.sh wrapper script stored in the /sw/bin directory. To display disk usage and limits for the personal directories assigned to your user account (i.e. /home and /scratch), use:

Read more ...
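
As a complement to the quota check, the following sketch shows one possible way to locate the largest entries directly below a directory. It is not the lfsquota.sh script, and the function names tree_size and largest_entries are hypothetical. It sums apparent file sizes in bytes and does not follow symbolic links.

```python
import os


def tree_size(path):
    """Sum the apparent sizes (in bytes) of all files below path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total


def largest_entries(root, top=10):
    """Return up to `top` (path, bytes) pairs directly below root, largest first."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = tree_size(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
            sizes.append((entry.path, size))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]


# Example call (may take a while on large directory trees):
# for path, size in largest_entries(os.path.expanduser("~")):
#     print(f"{size / 1024**2:10.1f} MiB  {path}")
```

Walking a large directory tree generates many metadata requests, so it is usually better to run such a scan on one subdirectory at a time.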

Set up Remote Development/Debugging

24 February 2023

This post collects some workarounds related to remote development/debugging on Levante. They have not been tested with all available remote tools and are often specific to a particular IDE. We encourage you to test them with your favorite IDE and report issues to support@dkrz.de.

Tested on VSCode

Read more ...

How can I log into Levante, change my password and login shell?

24 January 2023

To access Levante login nodes via ssh with X11 forwarding, use:

On macOS, we recommend using the -Y option instead of -X.

Read more ...

I want to add my own packages to Python or R but they won’t compile

24 January 2023

Python and R, among other scripting languages, allow users to create customized environments including their own set of packages.

For Python, you can use virtualenv or conda; R can also use locally installed packages.

Read more ...

Which Compiler and MPI library should I use?

24 January 2023

For model simulations in production mode, we recommend using Intel compilers and Open MPI:

Do not forget to consult the recommended environment settings and adjust your run script accordingly. Without these settings applications can run unexpectedly slowly.

Read more ...

How do I log into the same login node I used before?

24 January 2023

levante.dkrz.de maps to a whole group of nodes to distribute the load:

All login nodes share the same file system so most of the time you do not have to care which node you are on. However, there are reasons why you may want to connect to a specific node (for example, to reattach to a running tmux session). You first have to find out on which node you are. This may be indicated in your prompt or you can also use hostname for this purpose:

Read more ...

How big are my files?

21 October 2022

Our server luv shows you how much storage space your project is using and also how much each individual project member is contributing. For technical reasons, we can only show the apparent size (see below) of a user’s files. For the entire project, however, we show the actually occupied disk space. We are not happy about this inconsistency but for now, we have to live with it.

Here we try to explain the difference between the two ways to measure the size of files.

Read more ...
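
The two measures can be compared for a single file with Python's os.stat. This is a minimal sketch under two assumptions: that "apparent size" means the byte length of the file (st_size), and that st_blocks counts 512-byte units, as it does on Linux. The function names are hypothetical.

```python
import os


def apparent_size(path):
    """Byte length of the file as seen by programs reading it."""
    return os.stat(path).st_size


def occupied_size(path):
    """Disk space allocated to the file.

    Assumes st_blocks is reported in 512-byte units (true on Linux).
    """
    return os.stat(path).st_blocks * 512
```

The two values can differ in either direction: a sparse file occupies less space than its apparent size, while a very small file usually occupies at least one full allocation unit of the file system.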

Can I get my compute time back?

22 August 2022

Who wouldn’t want to use all of their node hours on Levante in the most productive way? After all, this is how we present our project in the proposal. Everything will go according to plan. What could go wrong? In fact, a lot, so can I get the time back in that case? Here is an incomplete list of things we often hear.

I found a bug in my model and now I have to run all experiments again.

Read more ...

Using tmux for persistent ssh connections

02 August 2022

Tmux allows you to keep a shell session active when closing your ssh connection. So, when you close your laptop to move to another room, you don’t lose the current state of your command line tools. It has many more useful features. However, you need to remember the login node you used to start tmux, and X11 applications (e.g. ncview) will not survive the disconnect. emacs will survive in command line mode (module load emacs ; emacs -nw).

The following examples assume that you have specified your user name and allowed (trusted) X11 forwarding for dkrz.de machines in your ~/.ssh/config.

Read more ...

How to prevent interruptions of ssh connections to DKRZ systems?

23 October 2019

If your ssh connections to login nodes are interrupted after short periods without any keyboard activity and you get an error message containing the string “broken pipe”, try setting the ServerAliveInterval parameter appropriately.

On Linux and macOS systems with the OpenSSH implementation, there are essentially two ways to adjust this client setting. First, it can be set as a command-line option to ssh:

Read more ...

How do I share files with members of another project?

17 September 2019

You can use ACLs to achieve this. As a member of project group ax0001, you would have to create a directory in your project’s work directory, for example:

It could be any other place on the Lustre file systems where you have write access. Then you grant project bx0002 permissions to this directory:

Read more ...

Why do I receive .Xauthority file error messages?

08 July 2019

When you open a new terminal session with X forwarding turned on (ssh -X ...), the .Xauthority file in your home directory gets updated by the xauth program. This file is used to keep X authentication keys in order to prevent unauthorized connections to your local display.

Sometimes, the .Xauthority file cannot be updated due to temporary issues with the Lustre file system, where your home directory is located, and you might see an error message like:

Read more ...

Python Matplotlib fails with “QXcbConnection: Could not connect to display”

18 September 2018

Matplotlib is useful for interactive 2D plotting and also for batch production of plots inside a job. The default behavior is to do interactive plotting which requires the package to open a window on your display. For this purpose you have to log into mistral with X11 forwarding enabled.

If you run matplotlib in a job script where you just want to create files of your plots, you have to tell matplotlib to use a non-interactive backend. See matplotlib’s documentation for how to do that and which backends are available. Here is how to select the Agg backend (raster graphics, PNG) inside your script. Add the following at the top of your imports:

Read more ...

How can I avoid core files if my program crashes?

30 May 2018

Core files can be very helpful when debugging a problem but they also take a long time to get written for large parallel programs. The following command will limit the core size to zero, i.e. no core files will be written:

The effect of the above command call can be checked with:

Read more ...
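
For Python programs that launch other executables, the same effect can be obtained from inside the script with the standard resource module. This is a Python analogue of the approach, not the shell command the post refers to, and the function name disable_core_files is hypothetical. Child processes started afterwards inherit the limit.

```python
import resource


def disable_core_files():
    """Set the soft core-file size limit to zero and keep the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))
    # Read the limits back so the caller can verify the change.
    return resource.getrlimit(resource.RLIMIT_CORE)


soft, hard = disable_core_files()
print(f"core soft limit: {soft}, hard limit: {hard}")
```

Only the soft limit is lowered here, so it can be raised again later in the same process up to the unchanged hard limit.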

Why does my job wait so long before being executed? or: Why is my job being overtaken by other jobs in the queue?

19 June 2017

There are several possible reasons for a job to be queued for a long time and/or to be overtaken …

… later submitted jobs with a higher priority (usually these have used less of their share than your job).

Read more ...

When will my SLURM job start?

19 June 2017

The SLURM squeue command with the options --start and -j provides an estimate for the job start time, for example:

Read more ...

How to use environment modules in batch scripts

19 June 2017

module is a shell function that modifies the shell environment when a module file is loaded or unloaded. If you are using different shells as login shell and for batch job scripts (e.g. tcsh as login shell while your job scripts start with #!/bin/bash), you need to add an appropriate source command to your script before any invocation of the module function. Otherwise, the error message “module: command not found” will result and the shell environment won’t be modified as intended:

Read more ...

How to use SSHFS to mount remote Lustre filesystem over SSH

19 June 2017

In order to interact with directories and files located on the Lustre filesystem, users can mount the remote filesystem via SSHFS (SSH Filesystem) over a normal ssh connection.

SSHFS is Linux-based software that needs to be installed on your local computer. On Ubuntu and Debian-based systems it can be installed through apt-get. On Mac OSX you can install SSHFS; you will need to download FUSE and SSHFS from the osxfuse site. On Windows you will need to grab the latest win-sshfs package from the Google Code repository or use an alternative approach like WinSCP.

Read more ...

How to set the default Slurm project account

19 June 2017

The project account must be specified (via option -A or --account) to submit a job or make a job allocation; otherwise your request will be rejected. To set the default project account, you can use the following SLURM input environment variables:

SLURM_ACCOUNT - interpreted by srun command

Read more ...

How to improve interactive performance of MATLAB

19 June 2017

When using ssh X11 forwarding (options -X or -Y), matlab can be slow to start and can also respond slowly to interactive use. This is because X11 sends many small packets over the network, often awaiting a response before continuing. This interacts unfavorably with medium or even higher latency connections, e.g. WiFi.

A way to eliminate this issue is to use dedicated resources and to start a remote desktop session that does not suffer from network latencies in the same manner. This approach requires a VNC client (vncviewer) installed on your local machine. A widely used VNC client is TurboVNC.

Read more ...

How to display the batch script for a running job

19 June 2017

Once your batch job has started execution (i.e. is in RUNNING state), your job script is copied to the Slurm admin nodes and kept until the job finishes. This prevents problems that might occur if the job script gets modified while the job is running. As a side effect, you can delete the job script without interfering with the execution of the job.

If you accidentally removed or modified the job script of a running job, you can use the following command to query for the script that is actually used for executing the job:

Read more ...

How to write a shell alias or function for quick login to a node managed by SLURM

19 June 2017

For tasks better run in a dedicated but interactive fashion, it might be advantageous to save the repeating pattern of reserving resources and starting a new associated shell in an alias or function, as explained below.

If you use bash as your default shell, you can place the following alias definition in your ~/.bashrc file and source this file in ~/.bash_profile or ~/.profile:

Read more ...

How can I see on which nodes my job was running?

19 June 2017

You can use the SLURM sacct command with the following options:

Read more ...

How can I run a short MPI job using up to 4 nodes?

19 June 2017

You can use SLURM Quality of Service (QOS) express by inserting the following line into your job script:

or using the option --qos with the sbatch command:

Read more ...

How can I get a stack trace if my program crashes?

19 June 2017

The classical approach to find the location where your program crashed is to run it in a debugger or inspect a core file with the debugger. A quick way to get the stack trace without the need for a debugger is to compile your program with the following options:

In case of a segmentation violation during execution of the program, detailed information on the location of the problem (call stack trace with routine names and line numbers) will be provided:

Read more ...

How can I choose which account to use, if I am subscribed to more than one project?

19 June 2017

Just insert the following line into your job script:

There is no default project account.

Read more ...

How can I check my disk space usage?

19 June 2017

Your individual disk space usage in HOME and SCRATCH as well as the quota state of your projects can be checked using the commands listed in File Systems. An additional overview is provided in the DKRZ online portal. The numbers there are updated daily.

(updated 2023-07-26)

Read more ...

How can I access my Lustre data from outside DKRZ/ZMAW?

19 June 2017

For data transfer you can use either sftp:

or the rsync command:

Read more ...

Can I run cron jobs on HPC login nodes?

19 June 2017

Update 2024-10-01: This procedure has been superseded by the Slurm scrontab feature, now available on Levante.

For system administration reasons, users are not allowed to schedule and execute periodic jobs on DKRZ HPC systems using the cron utility. Our recommendation is to use the functionality provided by the workload manager Slurm for this purpose. With the option --begin of the sbatch command you can postpone the execution of your jobs until the specified time. For example, to run a job every day after 12pm you can use the following job script, which re-submits itself at the beginning of its execution:

Read more ...

Is an FTP client available on Levante?

18 June 2017

LFTP is installed on Levante for download and upload of files from/to an external server via File Transfer Protocol (FTP):

The user name for authentication can be provided via option -u or --user.

Read more ...