Holger Brunst
e-mail: [email protected]
Center for Information Services and High Performance Computing (ZIH)
Performance Analysis on IU’s HPC Systems using Vampir
Trace Visualization
Outline
Introduction
Prominent Display Types
Performance Analysis Examples
New Vampir GUI
Post-mortem Event-based Performance Analysis
Performance optimization remains one of the key issues in parallel programming
Strong need for performance analysis, analysis process still not easy
Profilers do not give detailed insight into timing behavior of an application
Detailed online analysis pretty much impossible because of intrusion and data amount
Tracing is an option to capture the dynamic behavior of parallelapplications
Performance analysis done on a post-mortem basis
Background
Performance visualization and analysis tool
Targets the visualization of dynamic processes on massively parallel (compute) systems
Available for major Unix-based OSs and for Windows
Development started more than 15 years ago at Research Centre Jülich, ZAM
Since 1997 developed at TU Dresden (first in collaboration with Pallas GmbH, from 2003-2005 with the Intel Software & Solutions Group, since January 2006 at TU Dresden, ZIH / GWT-TUD)
Visualization components (Vampir) are commercial
Monitor components (VampirTrace) are Open Source
[Diagram: VampirTrace runs alongside the application on each CPU (processes 1, 2, 3, 4, … up to 10,000) and records events over time into trace data in the Open Trace Format (OTF), which Vampir visualizes. For VampirServer, the OTF trace is split into parts 1…m that are read in parallel by n << m analysis tasks.]
Flavors
Vampir
stabilized sequential version
similar set of displays and options as presented here
less scalable
no ongoing active development
VampirServer
distributed analysis engine
allows server and client on the same workstation as well
new features
Windows port in progress
VampirServer
Parallel/distributed server
runs in (part of) production environment
no need to transfer huge traces
parallel I/O
Lightweight client on local workstation
receive visual content only
already adapted to display resolution
moderate network load
Scalability
data volumes > 100 GB
number of processes > 10,000
Outline
Introduction
Prominent Display Types
Performance Analysis Examples
Vampir for Windows
Main Displays
Global Timeline
Process Timeline + Counter
Counter Timeline
Summary Timeline
Summary Chart (aka. Profile)
Message Statistics
Collective Communication Statistics
I/O Statistics
Call Tree, ...
Most Prominent Displays: Global Timeline
Time Axis
MPI Processes
Red: MPI Routines
Other Colors: Application Routines
Black Lines: MPI Messages
Thumbnail
Function Groups
Most Prominent Displays: Single Process Timeline
Time Axis
Call Stack Level
MPI I/O
Application Routines
Performance Counter
Other Tools
TAU profiling (University of Oregon, USA)
extensive profiling and tracing for parallel applications and visualization, comparison, etc.
http://www.cs.uoregon.edu/research/tau/
KOJAK (JSC, FZ Jülich)
very scalable performance tracing
automatic performance analysis and classification
http://www.fz-juelich.de/jsc/kojak/
Paraver (CEPBA, Barcelona, Spain)
trace based parallel performance analysis and visualization
http://www.cepba.upc.edu/paraver/
Outline
Introduction
Prominent Display Types
Performance Analysis Examples
New Vampir GUI
Approaching Performance Problems
Trace Visualization
Vampir provides a number of display types
each provides many customization options
Advice
make a hypothesis about performance problems
consider application's internal workings if known
select the appropriate display
use statistic displays in conjunction with timelines
Finding Performance Bottlenecks
Four Categories
1. Computation
2. Communication
3. Memory, I/O, …
4. Tracing itself
Finding Performance Bottlenecks
Computation
unbalanced workload distribution: single late comer(s) (see the sketch after this list)
strictly serial parts of program: idle processes/threads
very frequent tiny function calls: call overhead
sparse loops
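Not from the slides: a minimal OpenMP sketch of the workload-imbalance item above (the loop bounds and triangular work are made up for illustration). With schedule(static), the thread that owns the last chunk gets most of the work and becomes the single late comer visible in the global timeline; small dynamic chunks rebalance the same loop.

/* Hypothetical sketch: triangular workload, compile with e.g. -fopenmp */
#include <stdio.h>

int main(void) {
    const int n = 10000;
    double sum = 0.0;

    /* Unbalanced: iteration i costs O(i), so under static scheduling the
     * thread owning the highest-numbered chunk dominates the runtime. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            sum += (double)i * j;

    /* Rebalanced: small dynamic chunks spread the expensive iterations. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            sum += (double)i * j;

    printf("%g\n", sum);
    return 0;
}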
LM-MUSCAT Air Quality Model: Load Imbalance
Many processes waiting for individual processes to finish the chemistry computation.
LM-MUSCAT Air Quality Model: Bad CPU Partitioning
Meteorology processes are waiting most of the time
LM-MUSCAT Air Quality Model: Good CPU Partitioning
More processes for Chemistry-Transport, fewer for Meteorology: better balance
LM-MUSCAT Air Quality Model: Load Imbalance
High load imbalance only during (simulated) sunrise
Examine the runtime behavior of the application
SPEC OMP Benchmark fma3d: Unbalanced Threads
Not well-balanced OpenMP threads
WRF Weather Model: MPI/OpenMP - Idle Threads
Idle Threads
Finding Performance Bottlenecks
Communication
communication as such (domination over computation)
late sender, late receiver
point-to-point messages instead of collective communication (see the sketch after this list)
unmatched messages
overloading of MPI buffers
bursts of large messages (bandwidth)
frequent short messages (latency)
unnecessary synchronization (barrier)
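Not from the slides: a minimal MPI sketch of the point-to-point-vs-collective item (the buffer size and broadcast pattern are made up for illustration). The loop of MPI_Send calls serializes the transfer on the root and appears in the timeline as a fan of individual messages; MPI_Bcast lets the MPI library use a tree or pipeline algorithm instead.

/* Hypothetical sketch: replacing a hand-rolled broadcast with MPI_Bcast */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    double *buf = malloc(n * sizeof(double));
    if (rank == 0)
        for (int i = 0; i < n; ++i) buf[i] = (double)i;

    /* Inefficient: the root sends to every other rank, one after another. */
    if (rank == 0) {
        for (int dest = 1; dest < size; ++dest)
            MPI_Send(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Preferred: one collective call does the same job. */
    MPI_Bcast(buf, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}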
High Performance Linpack Using Open MPI
Everything looks OK here
HPL with Alternative MPI Implementation
Several slow messages. MPI problem?
HPL with Alternative MPI Implementation
Transfer rate only 1.63 MB/s!
Tracking down performance problems to individual events
Finding Performance Bottlenecks
Memory bound computation
inefficient L1/L2/L3 cache usage
TLB misses
detectable via HW performance counters
I/O bound computation
slow input/output
sequential I/O on a single process (see the sketch after this list)
I/O load imbalance
Exception handling
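Not from the slides: a minimal MPI-IO sketch for the sequential-I/O item (the file name, block size, and data are made up for illustration). Funnelling all output through rank 0, as in the Semtex example below, leaves the other ranks waiting; a collective write lets every rank store its own block in parallel.

/* Hypothetical sketch: collective MPI-IO instead of rank-0-only output */
#include <mpi.h>

#define N 1000000                    /* doubles per rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double local[N];          /* this rank's part of the result */
    for (int i = 0; i < N; ++i) local[i] = rank + i * 1e-6;

    /* Every rank writes its block at its own offset with one collective call. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "result.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}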
Performance Counters: Floating Point Exceptions
WRF Weather Model: Floating Point Exceptions
FPU exceptions lead to long runtime of routine ADVECT. Timeline interval: 77.7 ms
Other optimization, no FPU exceptions: only 10 ms for the same program section
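Not from the slides: a minimal C99 sketch for checking the FPU-exception hypothesis directly in code (the toy kernel is made up). Repeated underflow into the denormal range raises FE_UNDERFLOW, and operations on denormal numbers are handled much more slowly on many CPUs, which is the kind of effect the counters above reveal.

/* Hypothetical sketch: detect floating-point exceptions with <fenv.h> */
#include <fenv.h>
#include <stdio.h>

int main(void) {
    feclearexcept(FE_ALL_EXCEPT);

    volatile double x = 1e-300;      /* toy kernel that drifts into denormals */
    for (int i = 0; i < 100; ++i)
        x *= 0.5;

    if (fetestexcept(FE_UNDERFLOW))
        printf("kernel raised FE_UNDERFLOW (x = %g)\n", x);
    return 0;
}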
WRF Weather Model: Low I/O Performance
Transfer rate only 389 kB/s!
WRF Weather Model: Slow Metadata Operations
128 processes call open – takes more than 4 seconds
Semtex CFD Application: Serial I/O
Process 0 is performing I/O …
… while 127 processes are waiting
Complex Cell Application: RAxML (1)
RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase
Complex Cell Application: RAxML (2)
RAxML with 8 SPEs, 4000 ns window, enlargement of a small loop
shifted start of loop, constant runtime
Complex Cell Application: RAxML (3)
RAxML with 8 SPEs, 4000 ns window, enlargement of a small loop (modified): synchronous start, memory contention
Complex Cell Application: RAxML (4)
RAxML with 16 SPEs, load imbalance
Finding Performance Bottlenecks
Tracing
measurement overhead
– especially severe for tiny function calls
– solve with selective instrumentation (see the sketch after this list)
long, asynchronous trace buffer flushes
too many concurrent counters
– more data
heisenbugs
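Not from the slides: a minimal sketch of selective (manual) instrumentation, assuming the VampirTrace user API (vt_user.h with the VT_USER_START/VT_USER_END macros, built via the vtcc compiler wrapper with -DVTRACE; the function and variable names are made up). The tiny helper stays uninstrumented, so its millions of calls add no events, while the coarse phase still shows up in the timeline. The VampirTrace manual additionally describes filter files for excluding functions from automatic instrumentation.

/* Hypothetical sketch: coarse manual regions instead of per-call events */
#include "vt_user.h"

static double tiny_helper(double x) { return x * x; }    /* not instrumented */

void solver_step(double *field, int n) {
    VT_USER_START("solver_step");         /* one event pair per phase */
    for (int i = 0; i < n; ++i)
        field[i] = tiny_helper(field[i]); /* millions of calls, no trace events */
    VT_USER_END("solver_step");
}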
Trace Buffer Flush
Outline
Introduction
Prominent Display Types
Performance Analysis Examples
New Vampir GUI
Product Family
Vampir for UNIX:
Vampir Classic (all-in-one, single-threaded, Unix/OpenMotif based)
VampirServer (parallelized (MPI) client/server program approach)
Vampir for Windows:
Based on VampirServer’s parallel service engine
New Windows GUI on top of the harnessed VampirServer services
[Diagram: Vampir Classic is an all-in-one, single-threaded Motif application; Vampir for Windows HPC Server connects a Windows GUI through an API to a threaded service DLL; VampirServer connects the Motif visualization client via sockets to a parallelized service engine.]
New GUI Layout
Global Time Selection with Summary
Shared Chart Area
Chart Selection
Chart Overview
Chart Arrangement
Windows Event Tracing
Windows HPC Server 2008
Microsoft MPI (MS-MPI) integrated with
Event Tracing for Windows (ETW) infrastructure
Allows MPI tracing
No “special” builds needed. Just run the application with an extra mpiexec flag (-trace)
High-precision CPU clock correction for MS-MPI (mpicsync)
Tracing prerequisites:
User must be in Administrator or Performance Log group
Jobs should be executed exclusively through the Windows HPC Server 2008 Scheduler to avoid confusion/conflicts with the trace data
Creation of OTF Traces
1. Run myApp with tracing enabled
2. Time-sync the ETL logs
3. Convert the ETL logs to OTF
4. Copy the OTF files to the head node
[Diagram: on each rank node (rank 0 … rank N), myApp.exe runs with MS-MPI and ETW writing a local MS-MPI trace (.etl); mpicsync synchronizes the logs, etl2otf converts them, and the formatted traces (.otf) are copied to the head node under \\share\userHome\ (myApp.exe; Trace\ with trace.etl_otf.otf, trace.etl_otf.0.def, trace.etl_otf.1.events, trace.etl_otf.2.events, …).]
Creation of OTF Traces
The four steps are created as individual tasks in a cluster job.
The task options allow choosing the number of cores for the job and other parameters.
Under “Dependency”, the right order of execution of the tasks can be ensured.
Creation of OTF Traces
File system prerequisites:
“\\share\userHome” is the shared user directory throughout the cluster
MPI executable “myApp.exe” is available in the shared directory
“\\share\userHome\Trace” is the directory where the OTF files are collected
Launch the program with the -tracefile option
mpiexec -wdir \\share\userHome\ -tracefile %USERPROFILE%\trace.etl myApp.exe
-wdir sets the working directory; myApp.exe has to be there
%USERPROFILE% translates to the local home directory, e.g. “C:\Users\userHome”, on each node
the event log file (.etl) is stored locally in this directory
Creation of OTF Traces
Time-sync the log files throughout all nodes
mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl
“-cores 1”: run only one instance of mpicsync on each node
Format the log files to OTF files
mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl
Copy all OTF files from the nodes to the trace directory on the share
mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y "*_otf*" "\\share\userHome\Trace"
Summary
Hybrid MPI/OpenMP trace file with 1024 cores
256 MPI ranks
4 OpenMP threads per rank
Some feedback from users:
“This was very impressive to see live as they had never seen their application profiled at this scale, and vampir pointed us at the problem straight away”
“I was impressed by how detailed MPI functions are visualized in a zoomable fashion into micro seconds scale. We have some parallel C# programs currently run on Windows cluster of up to 128 core. I will use Vampir to test on other applications I have.”
Work in progress with regular updates
Completion of charts
Additional information sources from ETW
Team
Ronny Brendel
Dr. Holger Brunst
Jens Doleschal
Matthias Jurenz
Dr. Andreas Knüpfer
Matthias Lieber
Christian Mach
Holger Mickler
Dr. Hartmut Mix
Dr. Matthias Müller
Prof. Wolfgang E. Nagel
Michael Peter
Matthias Weber
Thomas William
Thank You
http://www.vampir.eu
http://www.tu-dresden.de/zih/vampirtrace