Performance Analysis using the Vampir Toolchain
Robert Henschel (HPA-IU)
David Cronk (CS-UTK)
Thomas William (PSW-ZIH)
Overview

Morning Session (Innovation Center, Room 105)
• 09:00 – 10:15 Overview: Event Based Program Analysis
• 10:15 – 10:45 Break
• 10:45 – 11:45 Instrumentation and Runtime Measurement
• 11:45 – 13:00 Lunch break

Afternoon Session
• 13:00 – 13:45 Using PAPI Performance Counters
• 13:45 – 14:00 Break
• 14:00 – 15:00 Trace Visualization
• 15:00 – 15:30 Break
• 15:30 – 18:00 Hands On (Wrubel Computing Center, Building WCC, Room 107)
TU DRESDEN, ZIH, AND HPC
We do have computers in Germany too (although quite old ones)
Dresden University of Technology
• Founded in 1828
• One of the oldest technical universities in Germany
• 14 faculties and a number of specialized institutes
• More than 35,000 students, about 4,000 employees, 438 professors
• International courses of study, bachelor, masters
• One of the largest faculties for computer science in Germany
• 110 million Euro annual third-party funding
• http://tu-dresden.de
Center for Information Services and HPC (ZIH)
• Central Scientific Unit at TU Dresden
• Competence Center for "Parallel Computing and Software Tools"
• Strong commitment to support real users
• Development of algorithms and methods: Cooperation with users from all departments
• Providing infrastructure and qualified service for TU Dresden and Saxony
Structure of ZIH
• Management
– Director: Prof. Dr. Wolfgang E. Nagel
– Assistant directors: Dr. Peter Fischer (COO), Dr. Matthias S. Müller (CTO)
• Administration (7 Employees)
• Departments (ca. 100 Employees; incl. Trainees)
– Department of interdisciplinary function support and coordination (IAK)
– Department of networking and communication services (NK)
– Department of central systems and services (ZSD)
– Department of innovative methods of computing (IMC)
– Department of programming and software tool-kits (PSW)
– Department of distributed and data intensive computing (VDR)
Today's Main HPC Infrastructure (installed in 2006)

[Diagram: HPC component (6.5 TB main memory) and PC farm, connected to an HPC-SAN (68 TB hard-disk capacity) and a PC-SAN (68 TB hard-disk capacity), plus a petabyte tape storage (1 PB capacity); link bandwidths 8 GB/s, 4 GB/s, and 1.8 GB/s]
Areas of Expertise
• Research topics
– Architecture and performance analysis of High Performance Computers
– Programming methods and techniques for HPC systems
– Grid Computing
– Software tools to support programming and optimization
– Modeling algorithms of biological processes
– Mathematical models, algorithms, and efficient implementations
• Role of mediator between vendors, developers, and users
• Pick-up and preparation of new concepts, methods, and techniques
• Teaching and Education
Performance Analysis Tools
• The Vampir performance analysis toolkit
– Vampir: Scalable event trace visualization
– VampirTrace: Instrumentation and run-time data collection
– Open Trace Format (OTF): Event trace data format
Vampir-Team
http://www.tu-dresden.de/zih/ptools
http://www.vampir.eu
Matthias Jurenz
Dr. Andreas Knüpfer
Matthias Lieber
Holger Mickler
Dr. Hartmut Mix
Dr. Matthias Müller
Ronny Brendel
Jens Doleschal
Ronald Geisler
Daniel Hackenberg
Robert Henschel
Prof. Wolfgang E. Nagel
Michael Peter
Heide Rohling
Matthias Weber
Thomas William
Performance Analysis Tools
EVENT BASED PROGRAM ANALYSIS
Why performance analysis?
• Moore's Law still holds, so no need to tune performance?
• Increasingly difficult to get close to peak performance
– for sequential computation
• memory wall
• optimum pipelining, ...
– for parallel interaction
• Amdahl's law
• synchronization with a single late-comer, ...
• Efficiency is important because of limited resources
• Scalability is important to cope with the next bigger simulation
OVERVIEW
Basics about Parallelization
Performance Analysis with Profiling
Instrumentation and Tracing
Motivation
• Reasons for parallel programming:
– Higher Performance
• Solve the same problem in shorter time
• Solve larger problems in the same time
– Higher Capability
• Solve problems that cannot be solved on a single processor
• Larger memory on parallel computers
• Time constraints limit the possible problem size
(weather forecast: turn-around within a working day)
• In both cases performance is one of the major concerns:
– Also consider sequential performance within the parallel sections
Parallelization Strategies
• General strategy for parallelization:
– Distribute the work to many workers
• Limitations:
– Not all tasks can be split into smaller sub-tasks
– Dependencies between sub-tasks
– Coordination overhead
– (same as for human teams)
• Algorithms:
– Different algorithms for the same problem differ in terms of parallelization
– Different "best" algorithms for serial vs. parallel execution or for different parallelization schemes
BASICS ABOUT PARALLELIZATION
Speed-up
• Definition of speed-up S_P:

    S_P = T_S / T_P

– T_S: serial execution time
– T_P: parallel execution time with P CPUs
• Speed-up versus number of used processors:

[Figure: ideal vs. real speed-up for 1 to 8 CPUs]

• Actual speed-up often lower than the optimal one due to the aforementioned limitations.
Parallel Efficiency
• Alternative definition: parallel efficiency E_P:

    E_P = S_P / P = T_S / (P * T_P)

– T_S: serial execution time
– T_P: parallel execution time with P CPUs
• Parallel efficiency versus number of used processors:

[Figure: ideal vs. real parallel efficiency for 1 to 8 CPUs]
Amdahl’s law
• Fundamental limit of parallelization
• Only a fraction F of the algorithm is parallel with speed-up S_P
• A fraction (1-F) is serial
• Then the maximum resulting speed-up is:

    S = 1 / ( (1-F) + F/S_P )

[Figure: maximum speed-up for 1 to 16 CPUs, ideal and for F = 99%, 95%, 90%, 80%]
Amdahl’s law
• If you know your desired speed-up S you can calculate F:

    F = 1 - 1/S

– F gives you the percentage of your program that has to be executed in parallel in order to achieve a speed-up S (asymptotically).
– In order to estimate the resulting effort you need to know in which parts of your program (1-F) of the time is spent.
– This is even before considering the actual parallelization method
• Might add new serial sections
• Brings coordination overhead
• Will not scale arbitrarily high, i.e. the parallel section will stay > 0
Amdahl’s law, example
• Example program with some sub-routines calling one another:

    # calls   Time (%)   Accumulated Time (%)   Call
    155648      31.22         31.22             Calc
    603648      22.24         53.46             Multiply
    155648      10.05         63.51             Matmul
    214528       9.33         72.84             Copy
    603648       7.87         80.71             Find

– For a maximum speed-up of 2 one needs to parallelize Calc and Multiply.
– For a maximum speed-up of 5 all need to be parallelized!
General Parallelization Strategy
• Therefore, successful parallelization requires:
– Finding the actual hot-spots of work
– Sufficient potential for parallelization
– A parallelization strategy that introduces minimum coordination overhead
• There are no general rules! Things that help to achieve high performance:
– Know your application
– Know your compiler
– Understand the performance tool
– Know the characteristics of the hardware
PERFORMANCE ANALYSIS WITH PROFILING
Profiling
• Profiling gives an overview of the distribution of run time
• Usually on the level of subroutines, also possible at line-by-line level
• Rather low overhead
• Usually good enough to find computation hot spots
• Too little detail to detect performance problems and their causes
• More sophisticated ways of profiling:
– Based on hardware performance counters
– Phase-based profiles
– Call-path profiles
Profiling
• Profile Recording
– Of aggregated information (Time, Counts, …)
– About program and system entities
• Functions, loops, basic blocks
• Application, processes, threads, …
• Methods of Profile Creation
– PC sampling (statistical approach)
– Direct measurement (deterministic approach)
Profiling with gprof
• Compile with profiling support
– Using -pg for GNU, -p -g for Intel
– Optimization (-O3) might obscure the output somewhat
• Execute normally
• Used to be only for sequential programs
• Parallel profiling only works with the GMON_OUT_PREFIX trick:

%> mpicc -p -g -O2 heat-mpi-slow-big.c -o heat-mpi-slow-big
%> export GMON_OUT_PREFIX=ggg
%> mpirun -np 4 heat-mpi-slow-big
%> ls
ggg.11762 ggg.11763 ggg.11764 ggg.11765
Profiling with gprof
• Pre-process profiling output with gprof:
– Text output
– There are also GUI front-ends like pgprof (PGI) and kprof (KDE)
• For a single rank:

%> gprof [-b] heat-mpi-slow-big ggg.11765 | less

• Combine results for all ranks:

%> gprof -s heat-mpi-slow-big ggg.*
%> gprof [-b] heat-mpi-slow-big gmon.sum | less
Profiling with gprof
– Flat profile for one of four ranks:

Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.00 2.08 2.08 1 2.08 2.08 Algorithm
0.00 2.08 0.00 1 0.00 0.00 CalcBoundaries
0.00 2.08 0.00 1 0.00 0.00 DistributeNodes

– Flat profile for all four ranks combined:

Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.00 8.59 8.59 4 2.15 2.15 Algorithm
0.00 8.59 0.00 4 0.00 0.00 CalcBoundaries
0.00 8.59 0.00 4 0.00 0.00 DistributeNodes
Profiling with gprof
– Annotated call graph for one of four ranks:

Call graph
granularity: each sample hit covers 4 byte(s) for 0.48% of 2.08 seconds
index % time self children called name
2.08 0.00 1/1 main [2]
[1] 100.0 2.08 0.00 1 Algorithm [1]
-----------------------------------------------
<spontaneous>
[2] 100.0 0.00 2.08 main [2]
2.08 0.00 1/1 Algorithm [1]
0.00 0.00 1/1 DistributeNodes [4]
0.00 0.00 1/1 CalcBoundaries [3]
-----------------------------------------------
0.00 0.00 1/1 main [2]
[3] 0.0 0.00 0.00 1 CalcBoundaries [3]
-----------------------------------------------
0.00 0.00 1/1 main [2]
[4] 0.0 0.00 0.00 1 DistributeNodes [4]
-----------------------------------------------
Profiling
• Simple profiling is a good starting point
– Reveals computational hot spots
– Hides away outlier values in the average
• More details needed for
– Parallel analysis and identification of performance problems
– Finding optimization opportunities
• Advanced profiling tools:
– TAU http://www.cs.uoregon.edu/research/tau/
– HPCToolkit http://hpctoolkit.org/
INSTRUMENTATION AND TRACING
Event Tracing
• Collect more detailed information for more insight
• Do not summarize run-time information
• Collect individual events with properties during run-time
• Event Tracing can be used for:
– Visualization (VampirSuite)
– Automatic analysis (Scalasca)
– Debugging or for re-play (VampirSuite + Scalasca)
Tracing
• Recording of run-time events (points of interest)
– During program execution
– Enter/leave of functions/subroutines
– Send/receive of messages, synchronization
– More …
– Saved as event records
• Timestamp, process, thread, event type
• Event specific information
• Sorted by time stamp
– Collected via instrumentation & trace library
Profiling vs Tracing
• Tracing Advantages
– Preserve temporal and spatial relationships (context)
– Allow reconstruction of dynamic behavior on any required abstraction level
– Profiles can be calculated from trace
• Tracing Disadvantages
– Traces can become very large
– May cause perturbation
– Instrumentation and tracing is complicated
• Event buffering, clock synchronization, …
Common Event Types
• Enter/leave of function/routine/region
– Time stamp, process/thread, function ID
• Send/receive of P2P message (MPI)
– Time stamp, sender, receiver, length, tag, communicator
• Collective communication (MPI)
– Time stamp, process, root, communicator, # bytes
• Hardware performance counter values
– Time stamp, process, counter ID, value
• Etc.
Parallel Trace
10010 P 1 ENTER 5
10090 P 1 ENTER 6
10110 P 1 ENTER 12
10110 P 1 SEND TO 3 LEN 1024 ...
10330 P 1 LEAVE 12
10400 P 1 LEAVE 6
10520 P 1 ENTER 9
10550 P 1 LEAVE 9
...
10020 P 2 ENTER 5
10095 P 2 ENTER 6
10120 P 2 ENTER 13
10300 P 2 RECV FROM 3 LEN 1024 ...
10350 P 2 LEAVE 13
10450 P 2 LEAVE 6
10620 P 2 ENTER 9
10650 P 2 LEAVE 9
...
DEF TIMERRES 1000000000
DEF PROCESS 1 `Master`
DEF PROCESS 2 `Slave`
DEF FUNCTION 5 `main`
DEF FUNCTION 6 `foo`
DEF FUNCTION 9 `bar`
DEF FUNCTION 12 `MPI_Send`
DEF FUNCTION 13 `MPI_Recv`
Instrumentation
• Instrumentation: Process of modifying programs to detect and report events by calling instrumentation functions.
– Instrumentation functions provided by trace library
– Call == notification about run-time event
– There are various ways of instrumentation
Source Code Instrumentation
(manually or automatically)

Original:

int foo(void* arg){
  if (cond){
    return 1;
  }
  return 0;
}

Instrumented:

int foo(void* arg){
  enter(6);
  if (cond){
    leave(6);
    return 1;
  }
  leave(6);
  return 0;
}
Source Code Instrumentation
Manually
– Large effort, error prone
– Difficult to manage
Automatically
– Via source to source translation
– Program Database Toolkit (PDT)
http://www.cs.uoregon.edu/research/pdt/
– OpenMP Pragma And Region Instrumentor (Opari)
http://www.fz-juelich.de/zam/kojak/opari/
Wrapper Function Instrumentation
• Provide wrapper functions
– Call instrumentation function for notification
– Call original target for functionality
• Via preprocessor directives:

#define MPI_Init WRAPPER_MPI_Init
#define MPI_Send WRAPPER_MPI_Send

• Via library preload:
– Preload instrumented dynamic library
– Suitable for standard libraries (e.g. MPI, glibc)
The MPI Profiling Interface
– Each MPI function has two names:
• MPI_xxx and PMPI_xxx
– Selective replacement of MPI routines at link time

[Diagram: the user program calls MPI_Send; at link time this resolves to the wrapper library's MPI_Send, which records the event and then calls PMPI_Send in the MPI library]
Compiler Instrumentation
• gcc -finstrument-functions -c foo.c
• Many compilers support instrumentation:
(GCC, Intel, IBM, PGI, NEC, Hitachi, Sun Fortran, …)
• No source modification
void __cyg_profile_func_enter( <args> );
void __cyg_profile_func_exit( <args> );
Dynamic Instrumentation
• Modify binary executable in memory
• Insert instrumentation calls
• Very platform/machine dependent, expensive
• DynInst project (http://www.dyninst.org)
– Common interface
– Alpha/Tru64, MIPS/IRIX, PowerPC/AIX, Sparc/Solaris, x86/Linux+Windows, ia64/Linux
Instrumentation & Trace Overhead
• Overhead for an empty function call (in ticks); without instrumentation: 15 ticks

             DynInst    GCC    PDT   manual
  dummy          568     52     60       59
  id+timer       937    451    300      299
  f.id           633    219    120      119
  f.symbol       637    278    121      120
  f.addr.        638    115    117      117
Trace Libraries
• Provide instrumentation functions
• Receive events of various types
• Collect event properties
– Time stamp
– Location (thread, process, cluster node, MPI rank)
– Event specific properties
– Perhaps hardware performance counter values
• Record to memory buffer, flush eventually
• Try to be fast, minimize overhead
Trace Files & Formats
• TAU Trace Format (Univ. of Oregon)
• Epilog (ZAM, FZ Jülich)
• STF (Pallas, now Intel)
• Open Trace Format (OTF)
– ZIH, TU Dresden in coop. with Oregon & Jülich
– Single/multiple files per trace
– Fast sequential and random access
– Including API for writing/reading
– Supports auxiliary information
– See http://www.tu-dresden.de/zih/otf/
Interoperability
Other Tools
• TAU profiling (University of Oregon, USA)
– Extensive profiling and tracing for parallel applications, with visualization, comparison, etc.
http://www.cs.uoregon.edu/research/tau/
• Paraver (CEPBA, Barcelona, Spain)
– Trace-based parallel performance analysis and visualization
http://www.cepba.upc.edu/paraver/
• Scalasca (FZ Jülich)
– Tracing and automatic detection of performance problems
http://www.scalasca.org
• Intel Trace Collector & Analyzer
– Very similar to Vampir