Histograms detect imbalances
Variable sizes capture variance
Lawrence Livermore National Laboratory
Center for Applied Scientific Computing
NC STATE UNIVERSITY
Department of Computer Science
An Open Framework for Scalable, Reconfigurable Performance Analysis
Todd Gamblin‡, Prasun Ratn*†, Bronis R. de Supinski*, Martin Schulz*, Frank Mueller†, Robert J. Fowler‡, Daniel Reed‡
*Lawrence Livermore National Laboratory, †North Carolina State University, ‡Renaissance Computing Institute
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This work was also supported in part by NSF grants CNS-0410203, CCF-0429653 and CAREER CCR-0237570, as well as the SciDAC Performance Engineering Research Institute, DE-FC02-06ER25764.
UCRL-POST-236200
ScalaTrace: Reconfigurable, Scalable Performance Analysis
Size of machines is rapidly increasing (130,000+ processors)
Tools will be overwhelmed with data
Need scalable, online measurement and analysis
Future Directions
Large Scale MPI Applications Implicit Program Behavior
Interactive User Tools and Visualization Understanding of Program Behavior and Optimizations
Feedback, Steering & Global Configuration
Data Collection & Analysis
…MPI tasks
User Interaction & Storage
Reduction Nodes
Local Configuration
Data Input: from Tracer, Profiler, Child Nodes
Data Output: to Parent Node, Tool
Communication paths
OPS Attr.1 Attr.2 Attr.N…
extensible, self-describing format
Attribute examples: PRSDs, timing, …
Reducer
Transformer
Annotator
Analyzer
[Diagram labels: Compressor (PRSDs), Timing, Loop Detection, Annotator; record attributes: MPI, PRSD, Time, Loop Information]
Types of tool components
LOCAL Node INSTANTIATION
GLOBAL REDUCTION
Trace representation using Power Regular Section Descriptors (PRSDs)
Idea: preserve time in compressed traces
• Encode time deltas instead of timestamps
• Create delta histograms automatically
• Dynamically balance histograms
• Time depends on path taken
• Distinguish histograms by path
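The bullets above can be illustrated with a small sketch. It is not the poster's implementation: the class name `AdaptiveHistogram`, the function `record_deltas`, the bin limit, and the rule of merging the lightest adjacent pair of bins are all assumptions chosen to show one way of keeping bins with similar sample counts, with one histogram per call path.

```python
from collections import defaultdict

class AdaptiveHistogram:
    """Hypothetical bounded histogram: at most max_bins [lo, hi, count] bins."""

    def __init__(self, max_bins=4):
        self.max_bins = max_bins
        self.bins = []  # sorted, non-overlapping [lo, hi, count]

    def add(self, x):
        # Count the sample in an existing bin if it falls inside one.
        for b in self.bins:
            if b[0] <= x <= b[1]:
                b[2] += 1
                return
        # Otherwise open a new single-value bin.
        self.bins.append([x, x, 1])
        self.bins.sort()
        if len(self.bins) > self.max_bins:
            # Rebalance (assumed rule): merge the adjacent pair with the
            # fewest combined samples, so heavy bins keep their resolution.
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j][2] + self.bins[j + 1][2])
            lo, _, c1 = self.bins[i]
            _, hi, c2 = self.bins[i + 1]
            self.bins[i:i + 2] = [[lo, hi, c1 + c2]]

def record_deltas(events, max_bins=4):
    """events: iterable of (timestamp, call_path) pairs.

    Stores the time delta since the previous event instead of the timestamp,
    in a separate histogram for each call path.
    """
    hists = defaultdict(lambda: AdaptiveHistogram(max_bins))
    prev = None
    for t, path in events:
        if prev is not None:
            hists[path].add(t - prev)
        prev = t
    return hists

hists = record_deltas([(0, "a"), (1, "a"), (3, "b"), (4, "a"), (14, "b")])
print(dict((path, h.bins) for path, h in hists.items()))
```

Keying the histograms by call path keeps deltas that follow different control-flow paths from being mixed into one distribution.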
Description of problem
Bins generated for synthetic input span entire range with similar sample counts
Number of histograms per record depends on number of possible call paths
Sample bimodal distribution from UMT2K collectives
...
MPI_Allreduce(...);
for (...) {
    for (...) {
        MPI_Send(...);
        MPI_Recv(...);
    }
    MPI_Barrier(...);
}
...
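To show why a loop nest like the one above can yield a near-constant trace size, here is a simplified sketch of folding repeated event sequences into nested loop descriptors. It is only an approximation of the idea behind PRSDs: real PRSDs also describe call parameters, which this sketch ignores. The names `fold`, `compress`, `expand`, the `("loop", count, body)` tuple layout, and the `max_window` limit are assumptions made for illustration.

```python
def fold(trace, max_window=8):
    """Greedily fold repeats at the tail of trace into ("loop", count, body)."""
    changed = True
    while changed:
        changed = False
        for w in range(1, min(max_window, len(trace)) + 1):
            tail = trace[-w:]
            # Case 1: the entry before the tail is already a loop over this body.
            if len(trace) > w:
                prev = trace[-w - 1]
                if isinstance(prev, tuple) and prev[0] == "loop" and prev[2] == tail:
                    trace[-w - 1:] = [("loop", prev[1] + 1, tail)]
                    changed = True
                    break
            # Case 2: the tail repeats the w entries directly before it.
            if len(trace) >= 2 * w and trace[-2 * w:-w] == tail:
                trace[-2 * w:] = [("loop", 2, tail)]
                changed = True
                break
    return trace

def compress(events):
    """Append events one at a time, folding after each append."""
    out = []
    for e in events:
        out.append(e)
        fold(out)
    return out

def expand(trace):
    """Inverse of compress: unroll loop descriptors back into a flat list."""
    out = []
    for item in trace:
        if isinstance(item, tuple) and item[0] == "loop":
            out.extend(expand(item[2]) * item[1])
        else:
            out.append(item)
    return out

# Flat trace of the loop nest: 3 outer iterations, 2 inner iterations each.
events = ["Allreduce"] + (["Send", "Recv"] * 2 + ["Barrier"]) * 3
print(compress(events))
```

In this sketch the compressed form has the same length regardless of how many iterations the loops run; only the stored counts change.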
Replay accuracy (NAS Benchmarks and UMT2K)
[Plot panels: IS, LU, FT]
Trace sizes (NAS Benchmarks and UMT2K)
[Plot panels: MG, BT]
Near-constant trace size: DT, EP, LU
Sub-linear trace size: CG, MG, FT
Non-scalable trace size: BT, IS, UMT2K
Accurate replay: DT, EP, FT, LU, IS, UMT2K
Replay inaccurate in MPI Time: CG, MG
Replay inaccurate in Compute Time: BT
ScalaReplay: Replay Using Histogram Timing Annotations
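One plausible way to drive a replay from histogram annotations is sketched below. The poster does not say how ScalaReplay chooses a delay from a histogram, so the sampling rule here (pick a bin in proportion to its count, then a uniform value inside it) is an assumption, as are the names `sample_delta` and `replay_schedule` and the `(lo, hi, count)` bin layout.

```python
import random

def sample_delta(bins, rng):
    """bins: list of (lo, hi, count). Returns one simulated time delta."""
    # Choose a bin with probability proportional to its sample count ...
    lo, hi, _ = rng.choices(bins, weights=[c for _, _, c in bins])[0]
    # ... then draw a value uniformly inside that bin (assumed rule).
    return rng.uniform(lo, hi)

def replay_schedule(ops, histograms, seed=0):
    """Turn a sequence of operation names into (op, simulated_time) pairs.

    histograms maps each operation name to the bins of its delta histogram.
    """
    rng = random.Random(seed)
    t = 0.0
    schedule = []
    for op in ops:
        t += sample_delta(histograms[op], rng)
        schedule.append((op, t))
    return schedule

# Hypothetical histograms: MPI_Send has a bimodal delta distribution.
histograms = {
    "MPI_Send": [(1.0, 2.0, 9), (10.0, 12.0, 1)],
    "MPI_Barrier": [(5.0, 5.0, 3)],
}
print(replay_schedule(["MPI_Send", "MPI_Barrier"], histograms))
```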
Evolutionary Load-Balance Analysis with Scalable Data Collection
Idea: Normalize measurements and models based on application semantics.
Progress loops
• Typically outer loops in SPMD codes
• Indicate absolute progress towards some domain-specific goal
• Basis for comparison of load over time
Effort Loops
• Variable-time loops, represent load
• Data-dependent execution
Load Modeling Framework
[Diagram labels: MPI Tracer, Compressor (PRSDs), Progress Event Annotator, Effort Region Analysis, Local Effort Records, Trace merge, Wavelet Compression, Load-annotated Trace, Visualization]
Progress instrumentation
• User marks progress loop explicitly
Effort modeled with code regions
• Dynamically detected at runtime
• Split MPI-op trace at collectives and wait operations
• User can further divide code into phases with instrumentation
Models dislocation dynamics in crystals
• Dislocations discretized as nodes & arms
• Recursive spatial domain decomposition
• Balancer subdivides nodes/arms along x, then y, then z
Scalable Compression Using Wavelet Transform
Distributed per-process effort measurements are consolidated onto fewer processes
Parallel wavelet transform
Transformed data is locally compressed by thresholding, run-length and entropy coding
Compressed representation is merged in parallel
• Load measurements are reconstructed on client
• Visualization tool allows intuitive data browsing
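The transform, threshold, encode, and reconstruct steps above can be tried on a single process with the sketch below. It makes several assumptions: a one-dimensional Haar wavelet (the poster does not name the wavelet), an input length that is a power of two, and run-length coding only. The parallel transform, the parallel merge, and the entropy coder are omitted. All function names are hypothetical.

```python
def haar_forward(data):
    """Unnormalized Haar transform: repeated pairwise averages and half-differences."""
    out = list(data)
    n = len(out)  # assumed to be a power of two
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + diff
        n = half
    return out

def haar_inverse(coeffs):
    """Rebuild the signal level by level from averages and half-differences."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [a + d, a - d]
        out[:2 * n] = merged
        n *= 2
    return out

def threshold(coeffs, eps):
    """Zero every coefficient whose magnitude is below eps (lossy step)."""
    return [c if abs(c) >= eps else 0.0 for c in coeffs]

def run_length(coeffs):
    """Encode the coefficients as [value, run_length] pairs."""
    runs = []
    for c in coeffs:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

# Hypothetical per-process effort values for 8 processes.
effort = [4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 9.0]
kept = threshold(haar_forward(effort), eps=0.3)
print(run_length(kept))
print(haar_inverse(kept))
```

Raising `eps` trades reconstruction error for longer runs of zeros, which is the knob behind the compression ratio and error the poster reports.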
Near-constant low-overhead MPI traces
Annotate with additional reconfigurable data, e.g.
Time using adaptive histograms (left)
Progress rates, load imbalance (right)
[Plots: Compression Ratio, Merge Time, Error]
[Figure: Effort for 64-node ParaDiS Run, Exact vs. Reconstructed; axes: Process Id, Progress; annotated events: Checkpoint, Force Computation, Re-partition]
[Figure: Load Balance in ParaDiS; axes: x, y, z]
Flexible framework for application-specific tools
Ability to select compression schemes for different fields, data types
Foster combination, interaction between data collection, analysis mechanisms
Adaptive, self-tuning runtime systems
Near term:
Adaptive wavelet transform topology, equivalence class detection
Full time-sequence compression