Getrusage

Getrusage is a tool developed at DKRZ for printing detailed resource usage statistics available via getrusage(2). To use getrusage on Mistral, the following module file needs to be loaded:

$ module load getrusage

Thereafter, runtime diagnostics about resource usage (e.g. wall-clock time or max rss, i.e. peak memory usage) can be collected and printed for a given command, for example:

$ getrusage ls -l

With the option -o, the measurements can be written to the specified output file, for example:

$ getrusage -o ls_rusage.txt ls -l
$ less ls_rusage.txt

For all details of the getrusage command, please refer to the manual page or use the --help option:

$ man getrusage
$ getrusage --help

In the following, we provide short guidance on how to use getrusage to obtain the detailed resource consumption of a parallel program. A script, getrusage_aggregate, is provided to compute the minimum, maximum, sum and mean of each metric.

  • Change the run script currently invoking the parallel program (named model in this example) like this:

    srun [srun_options...] model [model_args...]
    

    to

    mkdir -p rusage
    srun [srun_options...] bash -c 'exec getrusage -o rusage/model.${SLURM_PROCID} model [model_args...]'
    

    If model_args contains characters that are special to the executing shell, some quoting might be necessary. If you are currently using a command file to specify the executable and arguments for each task (e.g. an MPMD setup), you can replace a line like this:

    ./model [model_args...]
    

    in your command file with this:

    bash -c 'exec getrusage -o "rusage/model.${SLURM_PROCID}" ./model [model_args...]'
    

    After running the program, the directory rusage contains files named model.<MPI_rank> with the full set of measurements captured for each process in the parallel job.

  • To reduce the metrics to aggregates, add the following line to your job script:

    getrusage_aggregate rusage/model.*
    

    By default, this prints only the minimum, maximum, mean and total memory use of the job.

  • To add, for example, the time used by the processes to the report, try the following:

    getrusage_aggregate --aggregate-key='wall-clock time' rusage/model.*
    

    Other supported keys are listed in the man page and the --help output of getrusage_aggregate:

    $ man getrusage_aggregate
    $ getrusage_aggregate --help
    
  • If you do not need the resource usage of individual processes, you can remove the logs after generating the report:

    rm -rf rusage
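
The quoting remark in the first step can often be sidestepped by not pasting the arguments into the bash -c string at all. The following is a minimal sketch of that idea, not a tested recipe for Mistral: the array model_args and its contents are hypothetical, and printf stands in for getrusage and the model so that the snippet runs on any machine with bash. The commented srun line at the end shows how the same pattern would look in the run script, under the assumption that srun hands the arguments on unchanged.

```shell
#!/bin/bash
# Hypothetical model arguments containing a space and a glob character.
model_args=(--title "two words" '--pattern=*.nc')

# Everything after the script string of `bash -c` becomes $0, $1, $2, ...
# in the inner shell, so "$@" hands the arguments on unchanged.
# printf stands in for getrusage and the model so the sketch runs anywhere.
received=$(bash -c 'printf "<%s>" "$@"' bash "${model_args[@]}")
echo "$received"

# The same pattern applied to the run script line from the first step
# (an untested assumption; srun options and model name as in the text above):
#   srun [srun_options...] bash -c \
#     'exec getrusage -o "rusage/model.${SLURM_PROCID}" model "$@"' \
#     bash "${model_args[@]}"
```

Each argument is printed inside its own pair of angle brackets, which shows that the space and the asterisk reach the inner command without being split or expanded. The extra word bash after the script string only fills $0 and is not passed to the model.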