Scalasca
Scalasca is an open-source software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior via event traces. The analysis identifies potential performance bottlenecks - in particular those concerning communication and synchronization - and offers guidance in exploring their causes. Scalasca supports applications using MPI, OpenMP, POSIX threads, or MPI+OpenMP/Pthreads parallelization.
Available
- Puhti: 2.6
- Mahti: 2.6
License
Usage is possible for both academic and commercial purposes.
Usage
Using Scalasca involves three steps:
- Instrumentation
- Execution measurement collection & analysis
- Analysis report examination
Instrumentation
Scalasca uses the Score-P measurement infrastructure to instrument the target application. Score-P can also be used as a stand-alone tool without Scalasca.
In order to instrument an application, recompile it using the Score-P
instrumentation command scorep, which is added as a prefix to the
original compile and link commands. The prefix can be given directly on
the command line or set in a Makefile, for C/C++ and, similarly, for
Fortran codes.
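As a minimal sketch of this prefixing: the compiler wrappers mpicc and mpif90, the source file names, and the -O2 flag are assumptions for illustration and may differ on your system. The commands are only printed, so the snippet can be tried without Score-P being available.

```shell
# Assumed names: mpicc/mpif90 compiler wrappers, my_app.c/my_app.f90 sources.
CC="scorep mpicc"     # C: original compiler prefixed with scorep
FC="scorep mpif90"    # Fortran: same prefix, different compiler

# Dry run: print the instrumented compile-and-link commands.
# Remove "echo" to actually build once the scalasca module is loaded.
echo $CC -O2 -o my_app my_app.c
echo $FC -O2 -o my_app my_app.f90
```

In a Makefile, the same effect is typically obtained by assigning the prefixed command to the compiler variable, for instance CC = scorep mpicc.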
Measurement collection and analysis
While applications instrumented by Score-P can be executed directly
with a measurement configuration defined via environment variables,
the scan convenience command provided by Scalasca can be used to
control certain aspects of the Score-P measurement environment during
the execution of the target application. To produce a performance
measurement using an instrumented executable, prefix srun with the
scan command in the batch job script.
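A batch job script for such a profiling run might look like the sketch below. The file name job.sh is hypothetical, the executable name my_app and the 40 tasks follow the examples on this page, and the remaining Slurm options (account, partition, time limit) are deliberately left out.

```shell
# Write a batch job script; job.sh is a hypothetical file name and the
# remaining #SBATCH options (account, partition, time) are omitted here.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=40

module load scalasca

# scan wraps srun and sets up the Score-P measurement environment
scan srun ./my_app
EOF

# Submit only where Slurm is available.
if command -v sbatch >/dev/null 2>&1; then
    sbatch job.sh
fi
```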
By default, a flat profile is collected. Upon completion, measurement
results are stored in the experiment directory, whose name by default
is composed of the prefix scorep_, the target application executable
name, the run configuration (e.g., number of MPI ranks and/or OpenMP
threads), and a few other parameters of the measurement
configuration. For example, running my_app with 40 MPI tasks under
scan produces the directory scorep_my_app_40_sum.
One can also collect event trace data. As tracing can produce huge
amounts of data, it is recommended to first estimate the size of the
trace and, if needed, filter out some functions from the measurement.
An estimate can be obtained with the scorep-score command:
$ scorep-score -r scorep_my_app_40_sum/profile.cubex
Estimated aggregate size of event trace: 1022kB
Estimated requirements for largest trace buffer (max_buf): 129kB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 4097kB
(hint: When tracing set SCOREP_TOTAL_MEMORY=4097kB to avoid intermediate flushes
or reduce requirements using USR regions filters.)
flt     type max_buf[B] visits time[s] time[%] time/visit[us]  region
         ALL    131,431 20,196   12.81   100.0         634.28  ALL
         MPI     95,054  8,076    8.65    67.5        1071.04  MPI
         USR     24,168  8,056    3.38    26.3         418.96  USR
         COM     12,168  4,056    0.78     6.1         193.47  COM
      SCOREP         41      8    0.00     0.0          48.00  SCOREP

         MPI     94,000  8,000    0.17     1.3          20.97  MPI_Sendrecv
         USR     12,000  4,000    0.00     0.0           0.25  swap_fields
         COM     12,000  4,000    0.00     0.0           0.53  exchange
         USR     12,000  4,000    3.33    26.0         832.89  evolve
         MPI        826     14    0.01     0.1         823.21  MPI_Recv
...
In order to exclude swap_fields and evolve from the measurement, one
can create a filter file scorep.filter that lists them as excluded
regions, and check the effect of filtering with the -f option of
scorep-score:
$ scorep-score -f scorep.filter -r scorep_my_app_40_sum/profile.cubex
Estimated aggregate size of event trace: 835kB
Estimated requirements for largest trace buffer (max_buf): 105kB
...
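A scorep.filter file for this check could be created as in the sketch below, which uses Score-P's region-name filter block with an EXCLUDE rule. Writing the file with a here-document is just one convenient option; any text editor works equally well.

```shell
# Create the filter file; EXCLUDE lists the regions to drop from the
# measurement (here the two USR functions from the scorep-score output).
cat > scorep.filter <<'EOF'
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    swap_fields
    evolve
SCOREP_REGION_NAMES_END
EOF
```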
One could now proceed with the trace collection by setting the
SCOREP_FILTERING_FILE environment variable and by passing the options
-q and -t to the scan command:
...
#SBATCH --ntasks=40
module load scalasca
export SCOREP_FILTERING_FILE=scorep.filter
scan -q -t srun ./my_app
After the trace collection is finished, Scalasca will carry out
trace analysis to identify various performance bottlenecks.
With tracing enabled, the experiment directory would be
scorep_my_app_40_trace.
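The scorep-score output also hints at setting SCOREP_TOTAL_MEMORY when tracing, to avoid intermediate flushes. A minimal sketch of the corresponding environment settings for the trace job, assuming the unfiltered estimate of 4097kB is used as an upper bound (with filtering the requirement should be no larger):

```shell
# Values taken from the scorep-score output on this page; replace them
# with the estimate obtained for your own application.
export SCOREP_FILTERING_FILE=scorep.filter
export SCOREP_TOTAL_MEMORY=4097kB
```

These lines would go before the scan command in the batch job script.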
Analysis report examination
The Scalasca analysis report explorer square cannot currently be run
on CSC supercomputers. However, users may install Scalasca on their
local workstation and copy the experiment directory there for
analysis.
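Copying and opening the experiment could be scripted as in the sketch below. The user name, the login host puhti.csc.fi, and the remote run directory are placeholders and assumptions to be replaced with your own values; the helper function fetch_and_view is a hypothetical name, and it presumes that rsync and a local Scalasca installation are available on the workstation.

```shell
# Hypothetical values: replace the user name and remote run directory
# with your own. puhti.csc.fi is assumed as the login host.
REMOTE="myuser@puhti.csc.fi"
REMOTE_DIR="/scratch/project_xxx/rundir"   # placeholder path
EXPERIMENT="scorep_my_app_40_trace"

fetch_and_view() {
    # Copy the whole experiment directory into the current directory.
    rsync -r "${REMOTE}:${REMOTE_DIR}/${EXPERIMENT}" . || return 1
    # Open it in the locally installed report explorer.
    square "${EXPERIMENT}"
}

# Run on the workstation with:  fetch_and_view
```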
For large traces, one may copy only the post-processed trace analysis
result file scorep_my_app_40_trace/scout.cubex.
The OTF2-formatted event trace scorep_my_app_40_trace/traces.otf2 can
also be analyzed with Intel Trace Analyzer.