Profiling with ARM MAP
ARM MAP is a source-level profiler that shows how much time was spent
on each line of code. To see the source code in MAP, compile your program
with the debug flag; for most compilers this is -g. Don’t just use
a debug build; always keep optimization flags turned on when profiling.
Getting started
Load the environment, e.g.
module load intelmpi intel arm-forge
Prepare the code for profiling, e.g.
mpiifort -g -ffast-math -O3 -xHost -cpp mmult.f90 -o mmult_f90.exe
For statically-linked programs, you may need to compile with extra flags to ensure that the executable still has all the information MAP needs, e.g. adding
-Wl,--eh-frame-hdr
to the compile line.
Modify the job script to prefix the srun command, e.g.
map --profile srun ./mmult_f90.exe
Submit job, e.g.
sbatch mmult.sub
View result
map mmult_f90_Yp_YYYY-MM-DD-HH-MM.map
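As a rough illustration of the job script step, the following sketch writes a minimal mmult.sub from the shell. The #SBATCH values (job name, node count, task count, time limit) are placeholders chosen for this example, not site recommendations; only the module line and the map-prefixed srun line come from the steps above.

```shell
# Write a minimal Slurm job script for a MAP profiling run.
# All #SBATCH values below are hypothetical; adapt them to your job.
cat > mmult.sub <<'EOF'
#!/bin/bash
#SBATCH --job-name=mmult_map
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# Same environment as used for compiling
module load intelmpi intel arm-forge

# Prefix the usual srun line with "map --profile" to collect a profile
map --profile srun ./mmult_f90.exe
EOF
```

The resulting file can then be submitted with sbatch mmult.sub as described above. Keeping the time limit short fits the advice to profile only a small, representative run.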
Tips and tricks
If the overhead produced by MAP is too high (in general it is not more than 5%), you can change the default sampling behaviour with the following environment variables:
ALLINEA_SAMPLER_INTERVAL; the default sampling period is 20ms. MAP increases it automatically if your job runs for a long time, which leads to coarser profiling results. Hence, restrict the runtime to the absolute minimum required and adjust the sampling interval accordingly, e.g. setting 1ms to get a rate of 1000Hz
ALLINEA_SAMPLER_NUM_SAMPLES; by default the MAP profiler collects 1000 samples per process in order to reduce the amount of memory needed while running the job. For fine-grained profiles this can be increased to ensure that the sampling rate is not decreased automatically for long-running jobs
ALLINEA_SAMPLER_NO_TIME_MPI_CALLS; set this to prevent MAP from timing MPI calls. This might be of interest if you are searching for vectorization or memory access issues
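The variables above could be set in the job script before the map line. The sketch below assumes a hypothetical 60-second run and a 1ms interval, and estimates the sample count as runtime divided by interval. That is a rough rule of thumb derived from the description above, not a formula taken from the MAP documentation. The value 1 for ALLINEA_SAMPLER_NO_TIME_MPI_CALLS is also an assumption; the text only says the variable has to be set.

```shell
# Illustrative values only; adjust to your own job.
interval_ms=1     # desired sampling interval in milliseconds (1 ms = 1000 Hz)
runtime_s=60      # hypothetical expected wall-clock runtime in seconds

# Rough estimate of the samples per process needed to keep the
# chosen interval for the whole run: runtime / interval.
num_samples=$(( runtime_s * 1000 / interval_ms ))

export ALLINEA_SAMPLER_INTERVAL="$interval_ms"
export ALLINEA_SAMPLER_NUM_SAMPLES="$num_samples"

# Optional: do not time MPI calls (assumed value 1), e.g. when
# looking for vectorization or memory access issues.
export ALLINEA_SAMPLER_NO_TIME_MPI_CALLS=1

echo "interval=${ALLINEA_SAMPLER_INTERVAL} ms, samples=${ALLINEA_SAMPLER_NUM_SAMPLES}"
```

A larger sample count means more memory used per process, so it is usually better to shorten the run than to raise the sample count.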
Analysing MAP profiles
Watch out for these typical problems; the related metrics view is given in parentheses:
increasing memory usage - memory leak (Memory usage)
work imbalance - big MPI portion (blue coloured blocks in timeline)
low vectorization - not using CORE-AVX2 options (CPU floating-point vector)
high memory access - bad data locality / cache usage (CPU memory access)
serial I/O parts (orange coloured blocks in timeline)
Please refer to the MAP user guide for a detailed description of the GUI.