Intel VTune Profiler
Intel VTune Profiler is a performance analysis tool for single-core and multithreaded performance, i.e. for performance within a single node. For MPI analysis with multiple nodes, VTune produces a separate analysis for each node. More comprehensive MPI performance analysis is possible e.g. with Intel Trace Analyzer or Scalasca.
Available
Puhti: 2022.3
License
Usage is possible for both academic and commercial purposes.
Usage
Intel VTune Profiler is provided via the intel-oneapi-vtune module. One sets up the environment by loading the module:
module load intel-oneapi-vtune
If you want to get source code level information, compile your code with the debugging information option -g.
Results collection
Performance analysis can be started either from the VTune GUI or with the VTune command line tool. On HPC systems, one normally uses the command line tool vtune within a batch job. The first analysis that we suggest trying is the "performance snapshot". Here is a sample batch job script that can be used to collect it (please modify the script according to your application and project!):
#!/bin/bash
#SBATCH --job-name=VTune_example
#SBATCH --account=<project_name>
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=4000
# set the number of threads based on --cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load intel-oneapi-vtune
srun vtune -collect performance-snapshot -- ./my_application
By default, VTune writes the analysis results to a directory named r000ps in the current working directory. The number is incremented automatically when multiple collections are run, and the last two letters refer to the analysis type. One can also use a custom results directory with the -r results_dir_name option.
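When several collections are run in the same directory, it can be handy to pick the newest default result directory automatically. The following is a minimal sketch, not part of VTune itself: the function name latest_vtune_result is hypothetical, and it assumes the default rNNN naming described above with a three-digit, zero-padded counter followed by the two-letter analysis type, so that lexical order matches run order.

```bash
#!/bin/bash
# Hypothetical helper (not a VTune command): print the newest default
# result directory for a two-letter analysis type, e.g. "ps" for
# performance snapshot. Assumes the names r000ps, r001ps, ... exist
# in the current directory.
latest_vtune_result() {
    local type="$1"
    local latest=""
    local dir
    # Pathname expansion is sorted, so the zero-padded counter
    # makes the last matching directory the newest one
    for dir in r[0-9][0-9][0-9]"$type"; do
        # Also skips the literal pattern when nothing matches
        [ -d "$dir" ] && latest="$dir"
    done
    if [ -z "$latest" ]; then
        echo "no result directory of type '$type' found" >&2
        return 1
    fi
    echo "$latest"
}
```

For example, vtune -report summary -r "$(latest_vtune_result ps)" would summarise the most recent performance snapshot in the current directory.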
When analysing MPI applications (and running with multiple MPI tasks), one should add the -trace-mpi option:
#SBATCH ...
srun vtune -collect performance-snapshot -trace-mpi -r results_dir_name -- ./my_application
In the case of MPI jobs, the profiler generates a separate results directory for each node. To reduce the amount of collected data, one can consider collecting data only for a subset of the tasks by launching VTune inside a wrapper script:
#SBATCH ...
export VTUNE_CL="vtune -collect performance-snapshot -trace-mpi -result-dir results_dir_name --"
# Unquoted heredoc: $VTUNE_CL is expanded when the wrapper is written,
# while the escaped \$SLURM_LOCALID and \$@ are expanded by each task at run time
cat << EOF > vtune_wrapper
#!/bin/bash
# Launch VTune only for one MPI rank per node
if [ "\$SLURM_LOCALID" -eq 0 ]
then
exec $VTUNE_CL "\$@"
else
exec "\$@"
fi
EOF
chmod +x ./vtune_wrapper
srun ./vtune_wrapper ./my_application
rm -f ./vtune_wrapper
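If data from a single MPI rank of the whole job is enough, the wrapper can test the global rank SLURM_PROCID instead of the node-local SLURM_LOCALID. The sketch below also quotes the heredoc delimiter, so nothing is expanded when the file is written and the wrapper reads VTUNE_CL from its environment at run time; this assumes srun passes exported variables on to the tasks, which is Slurm's default behaviour. VTUNE_RANK is a hypothetical variable introduced only for this example.

```bash
#!/bin/bash
# Sketch: profile a single MPI rank of the whole job.
# VTUNE_RANK is a hypothetical variable for this example (default: rank 0).
export VTUNE_CL="vtune -collect performance-snapshot -trace-mpi -result-dir results_dir_name --"
export VTUNE_RANK=0

# Quoted 'EOF': the body is written verbatim and expanded only when a task runs it
cat << 'EOF' > vtune_wrapper
#!/bin/bash
if [ "$SLURM_PROCID" -eq "${VTUNE_RANK:-0}" ]
then
    # Intentionally unquoted: VTUNE_CL splits into the vtune command and its options
    exec $VTUNE_CL "$@"
else
    exec "$@"
fi
EOF
chmod +x ./vtune_wrapper
# In the batch job, launch and clean up as before:
#   srun ./vtune_wrapper ./my_application
#   rm -f ./vtune_wrapper
```

Reading VTUNE_CL at run time means the VTune options can be changed in the batch script without regenerating the wrapper, at the cost of depending on the environment being propagated to the tasks.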
Analysing the results on command line
The command line tool can be used to create reports from collected results using the -report option:
vtune -report summary -r results_dir_name
The results are printed to standard output, or to a file when the -report-output output_filename option is given.
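For multi-node MPI runs, the per-node result directories can be summarised in one go. A minimal sketch, assuming the per-node directories all start with the name given to -r (for example results_dir_name.<hostname>); the function name report_summaries and the output file names are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: write a text summary for every result directory
# whose name starts with the given prefix, e.g. one directory per node.
report_summaries() {
    local prefix="$1"
    local dir
    for dir in "$prefix"*; do
        # Skip plain files (such as earlier summaries) and the literal
        # pattern left behind when nothing matches
        [ -d "$dir" ] || continue
        vtune -report summary -r "$dir" -report-output "${dir}_summary.txt"
    done
}

# Usage: report_summaries results_dir_name
```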
VTune supports a large number of different reports, e.g. "hotspots" and "hw-events" (hardware events), and one can also compare two results by giving two result directories:
vtune -report hotspots -r results_dir_name_00 -r results_dir_name_01
By default, the report time is grouped by function, but it can also be grouped by source line (-group-by source-line) or by module (-group-by module).
Finally, it is possible to display the CPU time for call stacks (-report callstacks) or to display a call tree with the CPU time for each function (-report top-down).
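The report types and groupings above can be combined in a small script that writes several text reports for one result directory. This is a sketch; the function name write_reports and the output file names are hypothetical, and grouping by source line is only meaningful if the code was compiled with -g.

```bash
#!/bin/bash
# Hypothetical helper: write hotspot reports with different groupings,
# plus call-stack and top-down reports, for one result directory.
write_reports() {
    local dir="$1"
    local grouping
    # Hotspots grouped by function (the default), then by source line and module
    vtune -report hotspots -r "$dir" -report-output "${dir}_hotspots_function.txt"
    for grouping in source-line module; do
        vtune -report hotspots -group-by "$grouping" -r "$dir" \
            -report-output "${dir}_hotspots_${grouping}.txt"
    done
    # CPU time per call stack, and the call tree with per-function CPU time
    vtune -report callstacks -r "$dir" -report-output "${dir}_callstacks.txt"
    vtune -report top-down -r "$dir" -report-output "${dir}_top-down.txt"
}

# Usage: write_reports results_dir_name
```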
Analysing the results using GUI
Results can be viewed using the vtune-gui application, which we recommend launching via the Desktop application in the Puhti web interface. You may also copy the full results directory to your workstation for local analysis.
A particular result set can be opened by giving the name of the results directory as an argument to vtune-gui:
vtune-gui results_dir_name
Known issues
Sometimes vtune-gui fails to start with the error "Failed to launch VTune Amplifier GUI...". If that happens, kill all VTune processes that were left behind and try again:
killall -9 -r vtune
vtune-gui