SLURM job priority calculation
On Levante, we use SLURM's Multi-factor Job Priority plugin
to influence job priority. A job's priority at any given
time is a weighted sum of the following factors:
age_factor ∈ [0,1], reaching 1 when the job's age exceeds PriorityMaxAge (30 days, 0 hours)
FairShare_factor ∈ [0,1] as explained below
QOS_factor ∈ [0,1] normalized according to ‘sacctmgr show qos’ (e.g. normal = 0, express = 0.1, bench = 1)
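with the weights PriorityWeightAge, PriorityWeightFairshare and PriorityWeightQOS (each 1000, as implied by the [0,1000] range of every weighted component reported by sprio).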
The final priority is then calculated as
Job_priority = PriorityWeightAge * age_factor +
PriorityWeightFairshare * FairShare_factor +
PriorityWeightQOS * QOS_factor
and can be checked with the sprio command:
PRIORITY = AGE + FAIRSHARE + QOS ∈ [0,3000]
AGE = Weighted age priority ∈ [0,1000]
FAIRSHARE = Weighted fair-share priority ∈ [0,1000]
QOS = Weighted quality of service priority ∈ [0,1000]
While squeue has format options (%p and %Q) that display a job’s
composite priority, sprio can be used to display a breakdown of
the priority components for each job, e.g.
$ sprio
JOBID PRIORITY AGE FAIRSHARE QOS
1421556 1175 100 975 100
2015831 274 20 204 50
2017372 258 0 258 0
...
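The weighted sum and the sprio breakdown can be cross-checked with a few lines of Python. This is a minimal sketch, not part of SLURM: the function name job_priority is made up for illustration, and the three weights of 1000 are an assumption inferred from the [0,1000] range of each sprio column rather than read from the actual configuration.

```python
# Assumed weights: inferred from the [0,1000] range of each sprio column,
# not read from the real slurm.conf.
WEIGHT_AGE = 1000
WEIGHT_FAIRSHARE = 1000
WEIGHT_QOS = 1000


def job_priority(age_factor, fairshare_factor, qos_factor):
    """Return the weighted sum of the three factors (each must lie in [0, 1])."""
    factors = (("age", age_factor), ("fairshare", fairshare_factor), ("qos", qos_factor))
    for name, value in factors:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} factor {value} is outside [0, 1]")
    return (WEIGHT_AGE * age_factor
            + WEIGHT_FAIRSHARE * fairshare_factor
            + WEIGHT_QOS * qos_factor)


# Rows copied from the sprio sample: (JOBID, PRIORITY, AGE, FAIRSHARE, QOS)
SPRIO_ROWS = [
    (1421556, 1175, 100, 975, 100),
    (2015831, 274, 20, 204, 50),
    (2017372, 258, 0, 258, 0),
]

for jobid, priority, age, fairshare, qos in SPRIO_ROWS:
    # Convert the weighted components back to factors, then recompute the sum.
    recomputed = job_priority(age / WEIGHT_AGE,
                              fairshare / WEIGHT_FAIRSHARE,
                              qos / WEIGHT_QOS)
    print(jobid, priority, round(recomputed))
```

For each sample row, the recomputed value should agree with the PRIORITY column, since PRIORITY is simply AGE + FAIRSHARE + QOS.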
FairShare Factor
While all other factors that enter a job's priority are fairly
easy to understand, the FairShare factor needs a more detailed explanation.
The FairShare factor does not involve a fixed allotment (like the
granted NodeHours per project), whereby a user’s access to a machine
is cut off once that allotment is reached.
Instead, the FairShare factor serves to prioritize queued jobs such
that those jobs charging accounts that are under-serviced are
scheduled first, while jobs charging accounts that are over-serviced
are scheduled when the machine would otherwise go idle.
SLURM's FairShare factor is therefore mainly based on the ratio of the
computing resources the user's jobs have already consumed to
the share of the computing resources that the user has been granted. The
higher the value, the fewer shares were used compared to what was
granted, and the higher the job is placed in the queue.
DKRZ uses a two-level share hierarchy. On the top level (parent
level), project accounts are grouped according to their shareholder
grant, e.g. project ba1234 is a community project, while mh1234 is an
mpg project. On the bottom level (account level), each project is represented
by its granted NodeHours for the current period. Together, the two levels form the
Normalized Shares per project:
S_project = (S_parent / S_parent-siblings) * (S_account / S_account-siblings)
As an example take the following settings (given by sshare)
Account                    User Raw Shares Norm Shares    Raw Usage Effectv Usage  FairShare
-------------------- ---------- ---------- ----------- ------------ ------------- ----------
root                                          1.000000 256867246717      0.000000   1.000000
 root                      root          1    0.000969
 community                             205    0.198643      1000000      0.000004
  ba1234                            238938    0.032619      1000000      0.000004   0.999915
  rest_community                   1216131    0.166024
 mpg                                   205    0.198643  71017642553      0.276476
  mh1234                            238938    0.032619  16528791499      0.099181   0.121532
  mh4321                            238938    0.032619      1000000      0.045404   0.381050
  rest_mpg ...                      977193    0.133405
 rest                                  621    0.601744
Project mh1234, as a child of mpg, has the following shares:
sum of Raw Shares (parent level): S_parent-siblings = 1032
Raw Shares of mpg: S_parent = 205
Norm Shares of mpg: S_1 = 205/1032 = 0.198643 (i.e. ~20 % of the compute resources are for mpg)
sum of Raw Shares (account level): S_account-siblings = 1455069
Raw Shares of mh1234: S_account = 238938
Norm Shares of mh1234: S_2 = 238938/1455069 = 0.164211 (i.e. ~16.4 % of mpg share is for mh1234)
=> S_mh1234 = S_1 * S_2 = 0.032619 (i.e. ~3.3 % of the compute resources are for mh1234)
Users can query the actual shares for all projects they belong to via
the sshare command. It shows, for each SLURM account, the value of S_project in column
‘Norm Shares’. Furthermore, column ‘Raw Usage’ shows the CPU seconds
already consumed by this account (U_account).
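The ‘Norm Shares’ value can be recomputed from the Raw Shares of the two hierarchy levels. The following is a minimal sketch of the S_project formula using the example numbers for mh1234; the function name normalized_shares and its parameter names are made up for illustration.

```python
def normalized_shares(parent_raw, parent_level_sum, account_raw, account_level_sum):
    """S_project = (S_parent / S_parent-siblings) * (S_account / S_account-siblings)."""
    parent_fraction = parent_raw / parent_level_sum      # share of the whole machine
    account_fraction = account_raw / account_level_sum   # share within the parent
    return parent_fraction * account_fraction


# Raw Shares on the parent level: root user, community, mpg, rest
parent_level_sum = 1 + 205 + 205 + 621             # 1032
# Raw Shares below mpg: mh1234, mh4321, rest_mpg
account_level_sum = 238938 + 238938 + 977193       # 1455069

s_mh1234 = normalized_shares(205, parent_level_sum, 238938, account_level_sum)
print(f"S_mh1234 = {s_mh1234:.6f}")
```

Up to rounding, the result should match the ‘Norm Shares’ value listed for mh1234 in the example.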
SLURM implements another factor of fairness: the FairShare depends not
only on the Raw Usage of an account but also on the Raw Usage of its
parent (i.e. the sum over all siblings). As a result, accounts under
different parents with the same charged usage and the same shares can
get a different FairShare because of their sibling
accounts. This so-called ‘Effective Usage’ is ultimately used to
calculate the FairShare factor:
U_project-eff = U_account/U_total + ( (U_parent - U_account)/U_total * S_account / S_account-siblings)
For the example above this reads
U_mh1234-eff = U_mh1234/U_total + ( (U_mpg - U_mh1234)/U_total * S_mh1234 / S_mh1234-siblings)
= 16528791499/256867246717 + ( (71017642553 - 16528791499)/256867246717 * 238938 / 1455069 )
= 0.064348 + (0.212128 * 0.164211) = 0.099181
The sibling project mh4321, with much less Raw Usage but the same shares, gets:
U_mh4321-eff = U_mh4321/U_total + ( (U_mpg - U_mh4321)/U_total * S_mh4321 / S_mh4321-siblings)
= 1000000/256867246717 + ( (71017642553 - 1000000)/256867246717 * 238938 / 1455069 )
= 0.000004 + (0.276472 * 0.164211) = 0.045404
The project ba1234, with the same prerequisites as mh4321 but a different parent (community), gets:
U_ba1234-eff = U_ba1234/U_total + ( (U_community - U_ba1234)/U_total * S_ba1234 / S_ba1234-siblings)
= 1000000/256867246717 + ( (1000000 - 1000000)/256867246717 * 238938 / 1455069 )
= 0.000004 + (0.0 * 0.164211) = 0.000004
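The three worked examples above can be reproduced with a short script. This is a sketch of the Effective Usage formula as stated in this document, using the Raw Usage and Raw Shares values from the sshare example; the function name effective_usage and the dictionary layout are illustrative choices, not SLURM code.

```python
def effective_usage(u_account, u_parent, u_total, s_account, s_account_siblings):
    """U_eff = U_account/U_total + (U_parent - U_account)/U_total * S_account/S_account-siblings."""
    own_part = u_account / u_total
    sibling_part = (u_parent - u_account) / u_total
    return own_part + sibling_part * s_account / s_account_siblings


U_TOTAL = 256867246717      # Raw Usage of root
SIBLING_SHARES = 1455069    # sum of Raw Shares on the account level (same for both parents here)
RAW_SHARES = 238938         # Raw Shares of each of the three example projects

# project -> (Raw Usage of the project, Raw Usage of its parent)
PROJECTS = {
    "mh1234": (16528791499, 71017642553),   # parent: mpg
    "mh4321": (1000000, 71017642553),       # parent: mpg
    "ba1234": (1000000, 1000000),           # parent: community
}

for name, (u_account, u_parent) in PROJECTS.items():
    u_eff = effective_usage(u_account, u_parent, U_TOTAL, RAW_SHARES, SIBLING_SHARES)
    print(f"U_{name}-eff = {u_eff:.6f}")
```

The printed values should agree with the ‘Effectv Usage’ column of the example up to rounding, and they show how mh4321 inherits part of the heavy usage of its siblings while ba1234 does not.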
Finally, the Fair-share factor of a project is
FS_project = 2**(-U_project-eff/S_project)
Again, for the example above this reads
FS_mh1234 = 2**(-U_mh1234-eff/S_mh1234) = 2**(-0.099181/0.032619) = 0.121532
The sibling project mh4321 gets a better FairShare since it has used fewer resources so far:
FS_mh4321 = 2**(-U_mh4321-eff/S_mh4321) = 2**(-0.045404/0.032619) = 0.381050
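while the project ba1234 is prioritized over mh4321, although both have used the same amount of resources themselves, because the siblings of ba1234 under community have consumed far less than those of mh4321 under mpg: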
FS_ba1234 = 2**(-U_ba1234-eff/S_ba1234) = 2**(-0.000004/0.032619) = 0.999915
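The FairShare values of the three example projects follow directly from the formula FS_project = 2**(-U_project-eff/S_project). The sketch below plugs in the rounded Effective Usage and Norm Shares values from the example; the function name fairshare_factor is an illustrative choice.

```python
def fairshare_factor(u_eff, s_norm):
    """FS = 2 ** (-U_eff / S), where S is the normalized share of the project."""
    if s_norm <= 0:
        raise ValueError("normalized shares must be positive")
    return 2.0 ** (-u_eff / s_norm)


S_PROJECT = 0.032619   # Norm Shares, identical for all three example projects

# project -> Effective Usage (rounded values from the sshare example)
EFFECTIVE_USAGE = {
    "mh1234": 0.099181,
    "mh4321": 0.045404,
    "ba1234": 0.000004,
}

for name, u_eff in EFFECTIVE_USAGE.items():
    print(f"FS_{name} = {fairshare_factor(u_eff, S_PROJECT):.6f}")
```

Because the inputs are rounded to six digits, the last digit of the output may differ slightly from the ‘FairShare’ column reported by sshare.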
Some general examples:
project X has not used any CPU time so far (Raw Usage = 0) and
neither has any sibling: U_project-eff = 0 => FS_project = 1
relative to all others (sum of Raw Usage U_total), project X has used
exactly its granted share: U_project-eff = S_project => FS_project
= 0.5
Attention: “half/all/double CPUtime used so far” cases are not
directly mapped by SLURM FairShare! Only the ratio U_project-eff /
S_project is of interest for FairShare!
In general: A FairShare factor of above 0.5 indicates that the project
has consumed less than its allocated share while a FairShare factor
below 0.5 indicates that the project has consumed more than its
allocated share of the computing resources.
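To illustrate how the factor depends only on the ratio U_project-eff / S_project, the following sketch tabulates a few ratios and also inverts the formula, so that a FairShare value reported by sshare can be read back as "effective usage relative to share". The helper names fairshare_from_ratio, usage_ratio and describe are hypothetical and exist only for this example.

```python
import math


def fairshare_from_ratio(ratio):
    """FairShare factor for a given ratio U_project-eff / S_project."""
    return 2.0 ** (-ratio)


def usage_ratio(fairshare):
    """Inverse of the formula: ratio U_project-eff / S_project for a given factor."""
    if not 0.0 < fairshare <= 1.0:
        raise ValueError("FairShare factor must lie in (0, 1]")
    return -math.log2(fairshare)


def describe(fairshare):
    """Interpret a factor relative to the 0.5 threshold."""
    if fairshare > 0.5:
        return "consumed less than its share"
    if fairshare < 0.5:
        return "consumed more than its share"
    return "consumed exactly its share"


# Effective usage of nothing, half, all, and double the granted share
for ratio in (0.0, 0.5, 1.0, 2.0):
    fs = fairshare_from_ratio(ratio)
    print(f"ratio {ratio:.1f} -> FS {fs:.4f} ({describe(fs)})")

# Reading a value from sshare backwards: mh1234 from the example
print(f"mh1234: effective usage is about {usage_ratio(0.121532):.2f} times its share")
```

Note that half of the share does not give a factor of 0.75 and double the share does not give 0: the mapping is exponential, not linear.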