ClusterCockpit#

ClusterCockpit is a framework for job-specific performance and power monitoring on distributed HPC clusters. The ClusterCockpit instance hosted at DKRZ provides job data for the past 180 days and detailed metric data for the past 90 days.

Note

ClusterCockpit is available at https://clustercockpit.dkrz.de for all DKRZ users who have access to Levante and who are allowed to submit batch jobs. Please contact support@dkrz.de if you encounter any problems with the login.

Job List#

The landing page after login displays the number of currently running and total jobs for your account. Following one of these links will open the job list.

ClusterCockpit job list

This overview page shows a summary of your batch jobs on levante, allowing to sort and filter based on jobs properties. The set of metrics displayed per job can be configured according to your preferences. New jobs are imported from SLURM every 5 minutes, completed jobs with less than 2 minutes runtime are skipped. A click on the Job-ID will open the job details page.

Job Details#

The job details view provides the job ressources summary, a job footprint in form of a radar plot and a roofline model. The example radar and roofline plot show an ICON R2B5 experiment using 16 compute nodes.

Job radar plot

The radar plot puts the job performance into perspective with the machine limits. These limits on compute nodes of Levante are configured as follows:

Levante Hardware Limits#

Metric

Description

Limit

cpu_load

CPU load

256 (SMT on)

mem_used

Memory usage

256 GB

mem_bw

Memory bandwidth

350 GB/s (STREAM)

ib_bw

Interconnect bandwidth

100 GBit/s (full-duplex)

lustre_bw

Filesystem bandwidth

100 GBit/s

flops_any

Floating point operations

2500 GF/s (SIMD)

Next the roofline model compares achieved performance to a “roofline” graph of peak data streaming bandwidth and peak FLOP/s capacity [1]. The different ceilings show scalar double precision performance (570 GF/s) and SIMD double precision performance (2500 GF/s) obtained from LIKWID bench. Points are colored from blue to red over the runtime with one minute resolution.

Job roofline plot