Adapting job scripts for mistral phase2¶
Since phase1 and phase2 nodes of mistral are equipped with different Intel CPUs, you will have to slightly adapt your existing job scripts in order to use both partitions. The following table gives an overview of the differences and which partitions are affected.
                        phase1                                   phase2
processor               Xeon E5-2680 v3 (Haswell - HSW)          Xeon E5-2695 v4 (Broadwell - BDW)
nominal CPU frequency   2.5 GHz                                  2.1 GHz
cores per node          24                                       36
partitions              compute, prepost, shared, gpu, miklip    compute2
As the table indicates, only two issues arise if batch scripts are to be usable for both phases:
different number of cores per node
different processor frequency
Setting the right CPU frequency for each partition¶
SLURM allows users to request that the job step initiated by srun
be run at a specific frequency (if possible) on the CPUs selected
for that step on the compute node(s). This can be done via the
srun option --cpu-freq or the environment variable SLURM_CPU_FREQ_REQ.
If none of these options is set, DKRZ SLURM automatically chooses the appropriate frequency for the underlying processor. We therefore recommend not setting the frequency explicitly.
If a wrong frequency is defined via the environment variable (e.g. setting SLURM_CPU_FREQ_REQ=2500000 for the BDW nodes in the compute2 partition), a warning message like the following is given on stdout:
[DKRZ-slurm WARNING] CPU-frequency chosen (2500000) not supported on partition compute2 - frequency will be set to **nominal** instead!
If you specify a wrong frequency via the srun option --cpu-freq,
a warning message is also given on stdout, but in this case the automatic
frequency adaption falls back to the minimal frequency:
[DKRZ-slurm WARNING] CPU-frequency chosen (2501000) not supported on partition compute2 - frequency will fall back to **minimum** instead!
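Although leaving the frequency unset is the recommended default, it can be overridden explicitly. A minimal sketch, using the 2.5 GHz Haswell value from the warning messages above (expressed in kHz, as SLURM expects):

```shell
# Request 2.5 GHz (2500000 kHz, the Haswell nominal frequency) for all
# subsequent job steps - only sensible when submitting to the compute partition.
export SLURM_CPU_FREQ_REQ=2500000

# The same request can be made per job step via the srun option:
#   srun --cpu-freq=2500000 ./myprog

echo "requested CPU frequency: ${SLURM_CPU_FREQ_REQ} kHz"
```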
Setting the right number of cores¶
When allocating nodes using the salloc command, one has
to specify the targeted partition and therefore the type of CPU directly.
Nevertheless, job scripts that were originally written to run on the 24-core
Intel Haswell nodes (i.e. in the compute partition) will in general also run
in the compute2 partition, but will not make use of the full node!
The most critical sbatch/srun option in this context is --ntasks-per-node.
Setting e.g. a value of 24 is appropriate for Haswell nodes but uses
only 2/3 of the cores on Broadwell nodes. Hence, you should pay close attention
to this option when adapting your batch scripts for the compute2 partition.
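The mismatch is easy to quantify. A small sketch, assuming a per-node task count of 24 (tuned for Haswell) running on a 36-core Broadwell node:

```shell
# Per-node task count hard-coded for Haswell (24 physical cores) ...
tasks_per_node=24
# ... but the job lands on a Broadwell node (36 physical cores):
bdw_cores=36
idle=$(( bdw_cores - tasks_per_node ))   # cores left unused per node
echo "$idle of $bdw_cores cores idle per node"
```

One third of each Broadwell node stays idle, which matches the 2/3 utilization stated above.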
Writing more flexible batch scripts that are able to run on both kinds of CPU requires avoiding sbatch options that prescribe the number of tasks per entity, i.e. you should NOT use a prescribed number of nodes or options such as --ntasks-per-node, --ntasks-per-socket, or --ntasks-per-core.
Instead, please define the total number of tasks that your MPI parallel program
will be using for srun by specifying the option --ntasks
in combination with the number of logical CPUs needed per task, e.g.
--cpus-per-task=2 (or -c 2) for a pure MPI parallel program not using HyperThreading.
The following batch script will run an MPI job without HyperThreading
either in the compute or the compute2 partition, depending only on the
#SBATCH --partition choice.
```bash
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=compute     # or compute2 for BDW nodes
#SBATCH --ntasks=72
#SBATCH --cpus-per-task=2
#SBATCH --time=00:30:00
#SBATCH --mail-type=FAIL
#SBATCH --account=xz0123
#SBATCH --output=my_job.o%j
#SBATCH --error=my_job.e%j

srun -l --propagate=STACK --cpu_bind=cores ./myprog
```
When submitted to the compute partition, the job will run on 3 nodes with 24 tasks per node, while in the compute2 partition the same job takes only 2 nodes with 36 tasks per node.
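These node counts follow directly from the requested totals. A quick sketch of the arithmetic, assuming 2 hyperthreads per physical core on both node types:

```shell
ntasks=72          # total MPI tasks requested via --ntasks
cpus_per_task=2    # logical CPUs per task (--cpus-per-task), i.e. no HyperThreading

for cores in 24 36; do                             # Haswell vs Broadwell nodes
  logical=$(( cores * 2 ))                         # 2 hyperthreads per physical core
  per_node=$(( logical / cpus_per_task ))          # tasks that fit on one node
  nodes=$(( (ntasks + per_node - 1) / per_node ))  # nodes needed (rounded up)
  echo "${cores}-core nodes: ${per_node} tasks/node on ${nodes} nodes"
done
```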
Writing job scripts eligible to run on several partitions¶
The --partition option also allows for a comma-separated list of partition names.
In this case the job will run completely on the partition offering the earliest
initiation, with no regard given to the partition name ordering - i.e. nodes
will not be mixed between the partitions! Be aware that the total number of
tasks should be a multiple of both 24 and 36 in order to fully populate
all nodes; otherwise, some nodes might be underpopulated. The example above
might therefore be modified to use
```bash
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=compute,compute2
#SBATCH --ntasks=72
#SBATCH --cpus-per-task=2
#SBATCH --mem=0
...
```
which in general will decrease the waiting time of the job in the submit queue since more nodes are suitable to schedule the job on.
Attention: there are a few facts one needs to be aware of when using job scripts that are eligible for more than one partition.
The compute2 partition (Broadwell nodes) delivers slightly lower performance due to the lower CPU frequency compared to the compute partition (Haswell nodes). The runtime limit of the job should therefore be adapted accordingly.
The option --cpus-per-task needs to be specified explicitly to enable or disable the use of Intel HyperThreading. By default, --cpus-per-task=1 will be chosen, which allows for HyperThreading. Since your code might not benefit from this setting, you should switch to --cpus-per-task=2 instead. For details on Intel® Hyper-Threading please refer to this site.
The memory resources should be specified with care! The compute and compute2 partitions are equipped with different nodes (24 vs. 36 cores and memory ranging from 64 to 256 GB), which results in very different values for the --mem-per-cpu option, which is computed automatically by SLURM if not specified explicitly. This calculation might be wrong in case two partitions are specified! Therefore, one should always use the option --mem=0 so that the job has access to the full memory of each node, independent of the partition that is finally used for the job.
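Putting these points together, a complete multi-partition version of the earlier example might look as follows. This is a sketch: job name, account, and program are placeholders carried over from that example, and the time limit is an assumed value chosen generously to cover the slower Broadwell nodes.

```shell
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=compute,compute2   # run on whichever partition starts first
#SBATCH --ntasks=72                    # multiple of both 24 and 36
#SBATCH --cpus-per-task=2              # one physical core per task, no HyperThreading
#SBATCH --mem=0                        # full memory of each node, whatever the partition
#SBATCH --time=00:45:00                # assumed limit, generous for the slower BDW nodes
#SBATCH --mail-type=FAIL
#SBATCH --account=xz0123
#SBATCH --output=my_job.o%j
#SBATCH --error=my_job.e%j

srun -l --propagate=STACK --cpu_bind=cores ./myprog
```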