Accounting and Priorities

Concept of job priority

The individual job priority is computed as a weighted sum of three factors (see below for details):

the time that the job has been waiting in the queue
the share of the project’s compute time that has already been used
a special priority granted as “quality of service” to specific projects or kinds of usage

Thus, a job will get an especially high priority if it

has already been in the queue for a long time (age_factor)
runs under an account that has not yet used its share of compute time (FairShare_factor)
is associated with a high priority for other reasons (e.g. a QOS could alter its priority)

Slurm job priority calculation

On Levante we use the Multi-factor Job Priority plugin of Slurm to influence job priority. The job’s priority at any given time is a weighted sum of the following factors:

age_factor ∈ [0,1] with 1 when age is more than PriorityMaxAge (30 days, 0 hours)
FairShare_factor ∈ [0,1] as explained below
QOS_factor ∈ [0,1] normalized according to ‘sacctmgr show qos’ (e.g. normal = 0, express = 0.1, bench = 1)

with the weights:

PriorityWeightFairshare=1000
PriorityWeightQOS=1000
PriorityWeightAge=1000

The final priority is then calculated as

Job_priority = PriorityWeightAge * age_factor +
               PriorityWeightFairshare * FairShare_factor +
               PriorityWeightQOS * QOS_factor

and can be checked with the sprio command:

PRIORITY = AGE + FAIRSHARE + QOS ∈ [0,3000]
AGE = Weighted age priority ∈ [0,1000]
FAIRSHARE = Weighted fair-share priority ∈ [0,1000]
QOS = Weighted quality of service priority ∈ [0,1000]

While squeue has format options (%p and %Q) that display a job’s composite priority, sprio can be used to display a breakdown of the priority components for each job, e.g.

$ sprio
          JOBID   PRIORITY        AGE  FAIRSHARE  QOS
        1421556       1175        100        975  100
        2015831        274         20        204   50
        2017372        258          0        258    0
        ...
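
The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the linear growth of the age factor up to PriorityMaxAge is an assumption about how Slurm ramps the value, and Slurm's own integer rounding of the components is not modelled. The rows are copied from the sprio example.

```python
# Weights and PriorityMaxAge as quoted in the text for Levante.
PRIORITY_WEIGHT_AGE = 1000
PRIORITY_WEIGHT_FAIRSHARE = 1000
PRIORITY_WEIGHT_QOS = 1000
PRIORITY_MAX_AGE_DAYS = 30.0


def age_factor(days_in_queue: float) -> float:
    """Assumed linear ramp from 0 to 1, capped once PriorityMaxAge is reached."""
    return min(max(days_in_queue, 0.0) / PRIORITY_MAX_AGE_DAYS, 1.0)


def job_priority(age: float, fairshare: float, qos: float) -> float:
    """Weighted sum of the three factors, each expected to lie in [0, 1]."""
    return (PRIORITY_WEIGHT_AGE * age
            + PRIORITY_WEIGHT_FAIRSHARE * fairshare
            + PRIORITY_WEIGHT_QOS * qos)


# Rows from the sprio example: (JOBID, PRIORITY, AGE, FAIRSHARE, QOS)
SPRIO_ROWS = [
    (1421556, 1175, 100, 975, 100),
    (2015831, 274, 20, 204, 50),
    (2017372, 258, 0, 258, 0),
]

for jobid, priority, age, fairshare, qos in SPRIO_ROWS:
    # The weighted components shown by sprio add up to the total priority.
    assert priority == age + fairshare + qos, jobid

# Hypothetical job: 3 days in the queue, FairShare factor 0.975,
# QOS factor 0.1 (the value quoted for "express").
print(round(job_priority(age_factor(3), 0.975, 0.1)))  # prints 1175
```

With equal weights of 1000, each component can contribute at most one third of the maximum priority of 3000.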

Slurm accounting storage

For each Slurm job, the accounting database stores the computing cycles delivered by the machine in units of allocated_cpus * wall_clock_seconds.

Hence, one node with 256 logical CPUs used for one hour in the compute partition is accounted internally as

1 NodeHour = 3600 * 256 CPUsec = 921600 CPUsec

HLRE projects are accounted in node hours (as shown at https://luv.dkrz.de/projects/).
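
As a small illustration of this unit conversion, the following Python sketch converts between CPU seconds and node hours. The helper names and the example job are hypothetical; the 256 logical CPUs per node are taken from the text and apply to the compute partition only.

```python
LOGICAL_CPUS_PER_NODE = 256   # compute partition, as stated in the text
SECONDS_PER_HOUR = 3600
CPUSEC_PER_NODE_HOUR = LOGICAL_CPUS_PER_NODE * SECONDS_PER_HOUR  # 921600


def cpu_seconds(nodes: int, wall_clock_seconds: float) -> float:
    """CPU seconds stored for a job that allocates whole nodes."""
    return nodes * LOGICAL_CPUS_PER_NODE * wall_clock_seconds


def node_hours(cpusec: float) -> float:
    """Convert accounted CPU seconds into node hours."""
    return cpusec / CPUSEC_PER_NODE_HOUR


# Hypothetical job: 4 nodes for 90 minutes.
charged = cpu_seconds(nodes=4, wall_clock_seconds=90 * 60)
print(charged, node_hours(charged))  # 5529600 CPUsec, i.e. 6.0 node hours
```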

FairShare Factor

While the other factors that enter a job’s priority are fairly easy to understand, the FairShare factor needs to be explained in detail.

The FairShare factor does not involve a fixed allotment (like the granted NodeHours per project), whereby a user’s access to a machine is cut off once that allotment is reached.

Instead, the FairShare factor serves to prioritize queued jobs such that those jobs charging accounts that are under-serviced are scheduled first, while jobs charging accounts that are over-serviced are scheduled when the machine would otherwise go idle.

Slurm’s FairShare factor is therefore mainly based on the ratio of the computing resources that the user’s jobs have already consumed to the share of computing resources that the user has been granted. The higher the FairShare factor, the smaller the consumed fraction of the granted share, and the higher the job is placed in the queue.

DKRZ uses a two-level share hierarchy. On the top level (parent level), project accounts are grouped according to their shareholder grant, e.g. project ba1234 is a community project, while mh1234 is an mpg project. On the bottom (account) level, each project is represented by the NodeHours granted for the current period. Together these form the Normalized Shares per project:

S_project = (S_parent / S_parent-siblings) * (S_account / S_account-siblings)

As an example take the following settings (given by sshare)

             Account       User Raw Shares Norm Shares    Raw Usage Effectv Usage  FairShare
-------------------- ---------- ---------- -----------  ----------- ------------- ----------
root                                          1.000000 256867246717      0.000000   1.000000
 root                      root          1    0.000969
 community                             205    0.198643      1000000      0.000004
  ba1234                            238938    0.032619      1000000      0.000004   0.999915
  rest_community                   1216131    0.166024
 mpg                                   205    0.198643  71017642553      0.276476
  mh1234                            238938    0.032619  16528791499      0.099181   0.121532
  mh4321                            238938    0.032619      1000000      0.045404   0.381050
  rest_mpg ...                      977193    0.133405
 rest                                  621    0.601744

Project mh1234, as a child of mpg, has the following shares:

sum of Raw Shares (parent level): S_parent-siblings = 1032
Raw Shares of mpg: S_parent = 205
Norm Shares of mpg: S_1 = 205/1032 = 0.198643 (i.e. ~20 % of the compute resources are for mpg)

sum of Raw Shares (account level): S_account-siblings = 1455069
Raw Shares of mh1234: S_account = 238938
Norm Shares of mh1234: S_2 = 238938/1455069 = 0.164211 (i.e. ~16.4 % of mpg share is for mh1234)

=> S_mh1234 = S_1 * S_2 = 0.032619 (i.e. ~3.3 % of the compute resources are for mh1234)
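
The same calculation, written as a minimal Python sketch. The dictionaries hold the Raw Shares from the sshare example; the function names are hypothetical, and "rest_mpg" stands for the aggregated remainder row of the table.

```python
# Raw Shares copied from the sshare example.
PARENT_LEVEL = {"root": 1, "community": 205, "mpg": 205, "rest": 621}
MPG_ACCOUNTS = {"mh1234": 238938, "mh4321": 238938, "rest_mpg": 977193}


def level_share(raw_shares: dict, name: str) -> float:
    """Share of one entry relative to all siblings on the same level."""
    return raw_shares[name] / sum(raw_shares.values())


def normalized_share(parents: dict, parent: str,
                     accounts: dict, account: str) -> float:
    """S_project = (S_parent / S_parent-siblings) * (S_account / S_account-siblings)."""
    return level_share(parents, parent) * level_share(accounts, account)


s_mh1234 = normalized_share(PARENT_LEVEL, "mpg", MPG_ACCOUNTS, "mh1234")
print(f"{s_mh1234:.6f}")  # prints 0.032619, the 'Norm Shares' value of mh1234
```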

Users can query the actual shares for all projects they belong to via

$ sshare

This shows for each Slurm account the value of S_project in column ‘Norm Shares’. Furthermore, column ‘Raw Usage’ shows the already used CPUseconds for this account (U_account).

Slurm adds a further element of fairness: the FairShare calculation uses not only the Raw Usage of an account but also the Raw Usage of its parent (i.e. the sum over all siblings). As a result, accounts under different parents with the same charged usage and the same shares can get different FairShare values because of their sibling accounts. This so-called ‘Effective Usage’ is ultimately used to calculate the FairShare factor.

U_project-eff = U_account/U_total + ( (U_parent - U_account)/U_total * S_account / S_account-siblings)

For the example above this reads

U_mh1234-eff = U_mh1234/U_total + ( (U_mpg - U_mh1234)/U_total * S_mh1234 / S_mh1234-siblings)
         = 16528791499/256867246717 + ( (71017642553 - 16528791499)/256867246717 * 238938 / 1455069 )
         = 0.064348 + (0.212128 * 0.164211) = 0.099181

The sibling project mh4321, with much lower Raw Usage but the same shares, gets:

U_mh4321-eff = U_mh4321/U_total + ( (U_mpg - U_mh4321)/U_total * S_mh4321 / S_mh4321-siblings)
         = 1000000/256867246717 + ( (71017642553 - 1000000)/256867246717 * 238938 / 1455069 )
         = 0.000004 + (0.276472 * 0.164211) = 0.045404

The project ba1234, with the same usage and shares as project mh4321 but a different parent (community), gets:

U_ba1234-eff = U_ba1234/U_total + ( (U_community - U_ba1234)/U_total * S_ba1234 / S_ba1234-siblings)
         = 1000000/256867246717 + ( (1000000 - 1000000)/256867246717 * 238938 / 1455069 )
         = 0.000004 + (0.0 * 0.164211) = 0.000004
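
The three worked examples can be reproduced with a short Python sketch of the Effective Usage formula. The function and constant names are hypothetical; all numbers are taken from the sshare example, and the result should match the 'Effectv Usage' column up to rounding.

```python
U_TOTAL = 256_867_246_717     # Raw Usage of root
U_MPG = 71_017_642_553        # Raw Usage of parent mpg
U_COMMUNITY = 1_000_000       # Raw Usage of parent community
S_ACCOUNT = 238_938           # Raw Shares of each example project
S_ACCOUNT_SIBLINGS = 1_455_069  # sum of Raw Shares on the account level


def effective_usage(u_account: float, u_parent: float,
                    s_account: float, s_siblings: float,
                    u_total: float = U_TOTAL) -> float:
    """Own usage plus the siblings' usage weighted by the account's share."""
    own = u_account / u_total
    siblings = (u_parent - u_account) / u_total
    return own + siblings * s_account / s_siblings


u_eff = {
    "mh1234": effective_usage(16_528_791_499, U_MPG, S_ACCOUNT, S_ACCOUNT_SIBLINGS),
    "mh4321": effective_usage(1_000_000, U_MPG, S_ACCOUNT, S_ACCOUNT_SIBLINGS),
    "ba1234": effective_usage(1_000_000, U_COMMUNITY, S_ACCOUNT, S_ACCOUNT_SIBLINGS),
}

for name, value in u_eff.items():
    print(f"{name}: {value:.6f}")
```

The second term is what separates mh4321 from ba1234: both have the same Raw Usage, but mh4321 inherits part of the heavy usage of its siblings under mpg.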

Finally, the Fair-share factor of a project is

FS_project = 2**(-U_project-eff/S_project)

Again, for the example above this reads

FS_mh1234 = 2**(-U_mh1234-eff/S_mh1234) = 2**(-0.099181/0.032619) = 0.121532

The sibling project mh4321 gets a better FairShare since it has used fewer resources so far:

FS_mh4321 = 2**(-U_mh4321-eff/S_mh4321) = 2**(-0.045404/0.032619) = 0.381050

while the project ba1234 is prioritized over mh4321 (although both already used the same amount of resources):

FS_ba1234 = 2**(-U_ba1234-eff/S_ba1234) = 2**(-0.000004/0.032619) = 0.999915
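
A minimal Python sketch of the FairShare formula, using the rounded Effective Usage and Norm Shares values from the sshare example as inputs. The function name is hypothetical, and the results agree with the 'FairShare' column only up to the rounding of the inputs.

```python
def fairshare_factor(effective_usage: float, normalized_share: float) -> float:
    """FS_project = 2 ** (-U_project-eff / S_project)."""
    return 2.0 ** (-effective_usage / normalized_share)


S_PROJECT = 0.032619  # Norm Shares of mh1234, mh4321 and ba1234

# Effective Usage values as listed in the sshare example.
EFFECTIVE_USAGE = {
    "mh1234": 0.099181,
    "mh4321": 0.045404,
    "ba1234": 0.000004,
}

for name, u_eff in EFFECTIVE_USAGE.items():
    print(f"{name}: {fairshare_factor(u_eff, S_PROJECT):.6f}")

# Two direct consequences of the formula:
assert fairshare_factor(0.0, S_PROJECT) == 1.0        # nothing used at all
assert fairshare_factor(S_PROJECT, S_PROJECT) == 0.5  # usage equals share
```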

Some general examples:

project X has not used any CPU time so far (Raw Usage = 0) and none of its siblings has used any CPU time either: U_project-eff = 0 => FS_project = 1
relative to the usage of all projects (sum of raw usage U_total), project X has used exactly its granted share: U_project-eff = S_project => FS_project = 0.5

Attention: Slurm FairShare does not directly reflect whether a project has used half, all, or double its granted CPU time so far! Only the ratio U_project-eff / S_project matters for FairShare!

In general: a FairShare factor above 0.5 indicates that the project has consumed less than its allocated share of the computing resources, while a FairShare factor below 0.5 indicates that it has consumed more than its allocated share.