Intel VTune Profiler

Intel VTune Profiler is a performance analysis tool for single-core and threading performance, i.e. for performance within a single node. For MPI runs spanning multiple nodes, VTune produces a separate analysis for each node. More comprehensive MPI performance analysis is possible with, e.g., Intel Trace Analyzer and Collector or Scalasca.

Available

Puhti:

License

Usage is possible for both academic and commercial purposes.

Usage

Intel VTune Profiler is provided via the intel-vtune module. One sets up the environment by loading the module:

module load intel-vtune

If you want source-code-level information, compile your code with optimizations enabled and also add the debugging information option -g.
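As a minimal sketch of such a build, the helper below passes an optimization flag together with -g to the compiler. The function name, the source file name and the default compiler (icc) are assumptions made for illustration; set CC to the compiler or MPI wrapper you normally use.

```bash
# Build with optimizations (-O2) and debug symbols (-g) so that the profiler
# can map samples back to source lines.
# Assumptions: build_for_profiling is a hypothetical helper, and icc is only a
# placeholder default; override it with e.g. CC=mpicc.
build_for_profiling() {
    src=$1   # source file
    exe=$2   # name of the executable to create
    "${CC:-icc}" -O2 -g -o "$exe" "$src"
}

# Usage (from the directory containing the source file):
#   build_for_profiling my_application.c my_application
```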

Basic hotspot analysis is the first analysis type you should try. Here is a sample batch job script that can be used to profile parallel applications (please modify the script according to your application and project!):

#!/bin/bash
#SBATCH --job-name=VTune_example
#SBATCH --account=<project_name>
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=4000

# set the number of threads based on --cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load intel-vtune

srun amplxe-cl -r results_dir_name -collect hotspots -- ./my_application

For a Python application, replace the last line with:

srun amplxe-cl -r results_dir_name -collect hotspots -- python3 python_script

In the case of MPI and hybrid jobs, the profiler generates a separate directory for each node and, inside it, a separate subdirectory for each task. To reduce the amount of data collected, one can consider collecting data only for a subset of the tasks; see https://software.intel.com/content/www/us/en/develop/articles/using-intel-advisor-and-vtune-amplifier-with-mpi.html.
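One possible way to profile only a subset of the tasks is a small wrapper script that starts the profiler for selected ranks and runs the application directly for all others. The sketch below is not taken from the linked article; the script name profile_wrapper.sh and the PROFILED_RANK variable are hypothetical, and it assumes the job is launched with srun so that Slurm sets SLURM_PROCID for each task.

```bash
#!/bin/bash
# profile_wrapper.sh (hypothetical name): run one MPI rank under the profiler
# and start all other ranks normally.
# Assumption: srun sets SLURM_PROCID to the global rank of each task. If the
# variable is unset (script run outside Slurm), rank 0 is assumed.

PROFILED_RANK=${PROFILED_RANK:-0}   # rank to profile; 0 is an arbitrary choice

run_task() {
    if [ "${SLURM_PROCID:-0}" -eq "$PROFILED_RANK" ]; then
        # Selected rank: collect hotspots data for this task only
        amplxe-cl -r results_dir_name -collect hotspots -- "$@"
    else
        # All other ranks: run the application without the profiler
        "$@"
    fi
}

# Run the command given on the command line, if any.
if [ "$#" -gt 0 ]; then
    run_task "$@"
fi
```

In the batch script the last line would then become something like `srun ./profile_wrapper.sh ./my_application`, after making the wrapper executable with chmod +x.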

Generating Reports

The command line tool can be used to create reports from collected results using the -report option:

amplxe-cl -report hotspots -r results_dir_name

The results are printed to stdout or written to a file (with the option -report-output output, where output is the file name).

By default, the reported time is grouped by function; however, it can also be grouped by source line (-group-by source-line) or by module (-group-by module). It is also possible to analyse the differences between two runs or two MPI tasks by generating a report showing the differences between two result directories:

amplxe-cl -report hotspots -r results_dir_name_00 -r results_dir_name_01

Finally, it is possible to display the CPU time for call stacks (-report callstacks) or display a call tree and provide the CPU time for each function (-report top-down).
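The report variants mentioned above can be generated in one go. The following sketch writes each report to its own text file; the function name write_reports, the VTUNE_CLI override and the output file names are assumptions made for illustration, while the options themselves are the ones described in this section.

```bash
# Write several report variants for one result directory to text files.
# Assumptions: write_reports is a hypothetical helper, VTUNE_CLI is a
# hypothetical override for the command name (default: amplxe-cl), and the
# output file names are placeholders.
write_reports() {
    results_dir=$1
    cli=${VTUNE_CLI:-amplxe-cl}

    # Hotspots grouped by source line and by module
    for grouping in source-line module; do
        "$cli" -report hotspots -r "$results_dir" -group-by "$grouping" \
            -report-output "hotspots_by_${grouping}.txt"
    done

    # CPU time per call stack, and a top-down call tree
    for report in callstacks top-down; do
        "$cli" -report "$report" -r "$results_dir" \
            -report-output "${report}.txt"
    done
}

# Usage:
#   write_reports results_dir_name
```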

For some configurations, the data collection may fail with the error "Stack size provided to sigaltstack is too small. Please increase the stack size to 64K minimum." In this case, run the profiling job again with the environment variable AMPLXE_RUNTOOL_OPTIONS set to --no-altstack. For more details about the issue, please see the official Intel documentation.
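In a batch script, the workaround amounts to exporting the variable before the profiler is started. A minimal sketch, assuming the rest of the script is the same as in the sample job above:

```bash
# Workaround for the sigaltstack error: pass --no-altstack to the collector.
# Place this line in the batch script before the srun line that starts
# amplxe-cl, so that every task inherits the variable.
export AMPLXE_RUNTOOL_OPTIONS=--no-altstack
```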

Analysing the Results Using GUI

Results can be viewed using the amplxe-gui application. Unfortunately, it does not work on Puhti, so installing and using the GUI on your local machine is recommended.

You can inspect the results of a profile run by giving the name of the results directory as an argument to amplxe-gui. For example, the results of the previous example can be viewed with the command:

amplxe-gui results_dir_name

Please see Intel’s documentation for more information on installing and using the GUI: https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top.html