Accounting and Priorities¶
Concept of job priority¶
The individual job priority is computed as a weighted sum of three different factors (see below for details):
- the time the job has been waiting in the queue
- the share of the project’s compute time that has already been used
- a special priority granted as “quality of service” to specific projects or kinds of usage
Thus, a job will get an especially high priority if it
- has already been waiting in the queue for a long time (age_factor)
- runs under an account that has not yet used its share of compute time (FairShare_factor)
- is associated with a high priority for other reasons (e.g. the express QOS for small tests)
SLURM job priority calculation¶
On mistral we use the Multi-factor Job Priority plugin of SLURM to influence job priority. A job’s priority at any given time is a weighted sum of the following factors:
- age_factor ∈ [0,1], reaching 1 when the job’s age exceeds PriorityMaxAge (30 days)
- FairShare_factor ∈ [0,1], as explained below
- QOS_factor ∈ [0,1], normalized according to ‘sacctmgr show qos’ (e.g. normal = 0, express = 0.1, bench = 1)
with the weights PriorityWeightAge, PriorityWeightFairshare, and PriorityWeightQOS, each chosen such that the corresponding weighted component lies in [0,1000].

The final priority is then calculated as
Job_priority = PriorityWeightAge * age_factor + PriorityWeightFairshare * FairShare_factor + PriorityWeightQOS * QOS_factor
and can be checked with the sprio command:
PRIORITY  = AGE + FAIRSHARE + QOS ∈ [0,3000]
AGE       = weighted age priority ∈ [0,1000]
FAIRSHARE = weighted fair-share priority ∈ [0,1000]
QOS       = weighted quality-of-service priority ∈ [0,1000]
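The calculation can be sketched in Python as follows. The weight values of 1000 are an assumption consistent with the [0,1000] component ranges above; the actual values are site configuration, not fixed by SLURM:

```python
# Sketch of the multi-factor priority sum. The weights below are
# assumptions matching the [0,1000] component ranges reported by sprio;
# the real values are taken from the site's slurm.conf.
PRIORITY_WEIGHT_AGE = 1000
PRIORITY_WEIGHT_FAIRSHARE = 1000
PRIORITY_WEIGHT_QOS = 1000

def job_priority(age_factor: float, fairshare_factor: float, qos_factor: float) -> int:
    """Weighted sum of the three normalized factors, each in [0, 1]."""
    for f in (age_factor, fairshare_factor, qos_factor):
        assert 0.0 <= f <= 1.0, "factors must be normalized to [0, 1]"
    return int(PRIORITY_WEIGHT_AGE * age_factor
               + PRIORITY_WEIGHT_FAIRSHARE * fairshare_factor
               + PRIORITY_WEIGHT_QOS * qos_factor)

# A job that has waited 10% of PriorityMaxAge, under an account with
# fair-share factor 0.975, running with express QOS (QOS_factor = 0.1):
print(job_priority(0.1, 0.975, 0.1))  # -> 1175
```
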
While squeue has format options (%p and %Q) that display a job’s composite priority, sprio can be used to display a breakdown of the priority components for each job, e.g.
$ sprio
  JOBID   PRIORITY        AGE  FAIRSHARE        QOS
1421556       1175        100        975        100
2015831        274         20        204         50
2017372        258          0        258          0
...
SLURM accounting storage¶
For each SLURM job the accounting database stores the computing cycles delivered by a machine in the units of allocated_cpus*wall_clock_seconds.
Hence, one Haswell node used for one hour in the compute partition is accounted internally as
1 NodeHour HSW = 3600*48 CPUsec = 172800 CPUsec
while a Broadwell node in the compute2 partition used for one hour in exclusive state is accounted internally as
1 NodeHour BDW = 3600*72 CPUsec = 259200 CPUsec
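The conversion underlying these two examples can be sketched as follows; the per-partition CPU counts are taken directly from the figures above:

```python
# Allocated CPUs (hardware threads) per node, from the examples above.
CPUS_PER_NODE = {
    "compute": 48,   # Haswell
    "compute2": 72,  # Broadwell
}

def cpu_seconds(partition: str, nodes: int, wall_clock_hours: float) -> float:
    """Accounted units: allocated_cpus * wall_clock_seconds."""
    return CPUS_PER_NODE[partition] * nodes * wall_clock_hours * 3600

print(cpu_seconds("compute", 1, 1))   # -> 172800.0  (1 NodeHour HSW)
print(cpu_seconds("compute2", 1, 1))  # -> 259200.0  (1 NodeHour BDW)
```
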
HLRE projects are accounted in node-hours (as shown at https://luv.dkrz.de/projects/). Since the CPU frequency of Broadwell nodes is lower than that of Haswell nodes while their core count is higher, we expect the two effects to roughly compensate, yielding a balanced node-hour metric. This might change in the future if different accounting weights are introduced for different node types.