Profiling with ARM MAP

ARM MAP is a source-level profiler that shows how much time was spent on each line of code. To see the source code in MAP, compile your program with the debug flag; for most compilers this is -g. Do not profile an unoptimized debug build, though: always keep the optimization flags turned on when profiling.

Getting started

  • Load the environment, e.g.

    module load intelmpi intel arm-forge
  • Prepare the code for profiling, e.g.

    mpiifort -g -ffast-math -O3 -xHost -cpp mmult.f90 -o mmult_f90.exe
  • For statically-linked programs, you may need to compile with extra flags to ensure that the executable still has all the information MAP needs, e.g. adding -Wl,--eh-frame-hdr to the compile line

  • Modify the job script to prefix the srun command with map --profile, e.g.

    map --profile srun ./mmult_f90.exe
  • Submit job, e.g.

    sbatch mmult.sub
  • View result
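
The steps above could be put together as in the following sketch, which writes a job script named mmult.sub. The #SBATCH values (node count, tasks per node, time limit) are placeholder assumptions and are not taken from this page; adapt them to your system.

```shell
# Write a minimal Slurm job script; the resource values are hypothetical.
cat > mmult.sub <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00

# Same environment as used for compiling
module load intelmpi intel arm-forge

# Prefix the usual srun line with "map --profile" to record a profile
map --profile srun ./mmult_f90.exe
EOF
```

Submitting this with sbatch mmult.sub should leave a .map file in the working directory once the job has finished; it can then usually be opened in the GUI by calling map with that file name.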


Tips and tricks

The overhead produced by MAP is in general not more than 5%. If it is too high for your application, you can change the default sampling behaviour with the following environment variables:

  • ALLINEA_SAMPLER_INTERVAL; the default sampling interval is 20 ms. MAP increases the interval automatically if your job runs for a longer time, which leads to coarser profiling results. Hence, restrict the runtime to the required minimum and adjust the sampling interval accordingly, e.g. a 1 ms interval gives a sampling rate of 1000 Hz

  • ALLINEA_SAMPLER_NUM_SAMPLES; by default the MAP profiler collects 1000 samples per process in order to reduce the amount of memory needed while the job is running. For fine-grained profiles, increase this value to ensure that the sampling rate is not decreased automatically for long-running jobs

  • ALLINEA_SAMPLER_NO_TIME_MPI_CALLS; set this to prevent MAP from timing MPI calls. This can be useful if you are searching for vectorization or memory access issues
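
As a rough illustration of how the first two variables interact, the sketch below estimates how many samples are needed so that a chosen interval fits into the sample budget for a given runtime. The runtime and interval values are hypothetical, and the estimate (runtime divided by interval) is a simplifying assumption, not a formula taken from the MAP documentation. The interval is assumed to be given in milliseconds.

```shell
# Hypothetical inputs: expected wall time of the job and the desired interval.
expected_runtime_s=60
interval_ms=1

# Samples per process needed to cover the whole run at this interval.
num_samples=$(( expected_runtime_s * 1000 / interval_ms ))

export ALLINEA_SAMPLER_INTERVAL="$interval_ms"      # assumed unit: milliseconds
export ALLINEA_SAMPLER_NUM_SAMPLES="$num_samples"
# Optional: skip timing of MPI calls (the value 1 is an assumption).
# export ALLINEA_SAMPLER_NO_TIME_MPI_CALLS=1

echo "interval=${ALLINEA_SAMPLER_INTERVAL} ms, samples=${ALLINEA_SAMPLER_NUM_SAMPLES}"
```

These lines would go into the job script before the map --profile line. Keep in mind that collecting more samples increases the memory the profiler needs while the job is running.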

Analysing MAP profiles

Watch out for these typical problems; the metrics view in which each one shows up is given in parentheses:

  • increasing memory usage - memory leak (Memory usage)

  • work imbalance - big MPI portion (blue coloured blocks in timeline)

  • low vectorization - not using CORE-AVX2 options (CPU floating-point vector)

  • high memory access - bad data locality / cache usage (CPU memory access)

  • serial I/O parts (orange coloured blocks in timeline)

Please refer to the MAP user guide (available for download) for a detailed description of the GUI.