Post on 21-Jul-2020
Language-Centric Performance Analysis of OpenMP Programs with Aftermath
Andi Drebes
The University of Manchester
School of Computer Science
Advanced Processor Technologies
andi.drebes@manchester.ac.uk
Joint work with: Jean-Baptiste Brejon, Antoniu Pop, Karine Heydemann, Albert Cohen
IWOMP 2016
Analysis of OpenMP Programs
Hardware
Run-time OS
Application
Andi Drebes – Aftermath: Language-Centric Performance Analysis 1 / 10
Analysis of OpenMP Programs
Programming model:
#pragma omp task depend(...)
{ ... }
#pragma omp parallel for
for(int i = 0; i < N; i++) { ... }
Application, Hardware, Run-time, OS
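The two constructs on this slide can be fleshed out into a small, complete example. This is only an illustrative sketch: the function names `scale` and `produce_then_consume` and the computed values are assumptions made for the example and do not come from the talk. When compiled without OpenMP support the pragmas are ignored and the code runs sequentially with the same results.

```c
#include <assert.h>

/* Loop-based parallelism: the iterations of the loop are distributed
 * among the workers of the parallel region. */
void scale(int *data, int n, int factor)
{
    assert(n >= 0);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] *= factor;
}

/* Task-based parallelism: the second task reads x, which the first
 * task writes, so the depend clauses order the two tasks. */
int produce_then_consume(void)
{
    int x = 0;
    int y = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x) shared(x)
        { x = 21; }

        #pragma omp task depend(in: x) depend(out: y) shared(x, y)
        { y = 2 * x; }

        /* Wait for both tasks before leaving the single region. */
        #pragma omp taskwait
    }

    return y;
}
```

Calling `scale` on an array multiplies every element by `factor`, and `produce_then_consume()` returns 42 regardless of how many workers execute the tasks, because the dependence forces the producer to finish first.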
New Tools for Performance Analysis
Frequent topics for performance analysis:
- Amount of parallelism and load balancing
- Duration of execution phases
- Synchronization overhead (e.g., barriers)
- Choice of an appropriate loop schedule
- Data distribution on NUMA systems
- Relate hardware events to loops / tasks
Our tools: Aftermath & Aftermath-OpenMP
- Aftermath: Graphical tool for performance analysis
- Aftermath-OpenMP: Instrumented LLVM/clang run-time
Outline
1. Overview of Trace-based Analysis
2. Overview of Aftermath’s GUI
3. Demo
4. Overhead of Tracing
5. Summary & Conclusion
Trace-based Analysis with Aftermath
[Diagram: Application, Aftermath-OpenMP run-time, OS and Hardware → Trace file → Aftermath]
Aftermath provides:
- Visualizations & Exploration
- Statistics & Accurate Numbers
- Programming Model-centric Analysis
Terminology
Loop construct:
#pragma omp parallel for schedule(static, 10)
for(int i = 0; i < 100; i++) { ... }
Iteration space: 0 to 99
Loop: divided into chunks C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
Chunks and iteration sets per worker:
Worker 0: C0 C3 C6 C9 = [0-9] [30-39] [60-69] [90-99]
Worker 1: C1 C4 C7 = [10-19] [40-49] [70-79]
Worker 2: C2 C5 C8 = [20-29] [50-59] [80-89]
[Diagram: time lines of Worker 0, Worker 1 and Worker 2 with intervals labeled "Iteration period"]
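The chunk assignment on this slide follows from the round-robin distribution that OpenMP prescribes for schedule(static, chunk_size). The sketch below reproduces that mapping; the type `iter_range` and the functions `chunk_worker`, `chunk_range` and `print_schedule` are hypothetical helpers written for this illustration and are not part of Aftermath or of any OpenMP run-time.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper type: inclusive iteration range of one chunk. */
struct iter_range {
    int first;
    int last;
};

/* Worker that executes chunk `chunk` when the chunks are handed out
 * round-robin in the order of the worker number. */
int chunk_worker(int chunk, int num_workers)
{
    assert(chunk >= 0 && num_workers > 0);
    return chunk % num_workers;
}

/* Iterations covered by chunk `chunk`; the last chunk may be shorter
 * if chunk_size does not divide num_iters. */
struct iter_range chunk_range(int chunk, int chunk_size, int num_iters)
{
    struct iter_range r;

    assert(chunk >= 0 && chunk_size > 0);
    assert(chunk * chunk_size < num_iters);

    r.first = chunk * chunk_size;
    r.last = r.first + chunk_size - 1;
    if (r.last > num_iters - 1)
        r.last = num_iters - 1;

    return r;
}

/* Print, for each worker, its chunks and their iteration ranges. */
void print_schedule(int num_iters, int chunk_size, int num_workers)
{
    int num_chunks = (num_iters + chunk_size - 1) / chunk_size;

    for (int w = 0; w < num_workers; w++) {
        printf("Worker %d:", w);
        for (int c = w; c < num_chunks; c += num_workers) {
            struct iter_range r = chunk_range(c, chunk_size, num_iters);
            printf(" C%d [%d-%d]", c, r.first, r.last);
        }
        printf("\n");
    }
}
```

Calling `print_schedule(100, 10, 3)` lists C0, C3, C6 and C9 for Worker 0, C1, C4 and C7 for Worker 1, and C2, C5 and C8 for Worker 2, matching the slide.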
Aftermath: Overview of the GUI
[Screenshots of the Aftermath GUI; the recoverable captions and labels follow]
Main areas: Detailed text view, Time line, Filters, Statistics
Time line (axes: Time, Processors): activity during execution
Time line: Run-time states
- Sequential execution (orange)
- Parallel loop (green)
- Barrier synchronization (dark red)
- No activity (background visible)
Time line: Loop constructs
State statistics
Histogram showing duration of iteration periods
Detailed text view for parallel loops
Filter for loop constructs
Demo: NPB’s MG benchmark
Benchmark: NPB MG
- NPB 2.3 C implementation from the Omni Compiler Project
- C input class (512 × 512 elements)
Test platform
- SGI UV 2000 (Xeon E5-4640)
- 192 cores (Hyperthreading disabled)
- 24 NUMA nodes, 756 GiB RAM
- LLVM/clang 3.8.0
- Aftermath-OpenMP for trace generation
DEMO
Demo: Summary
Execution phases
- Parallel initializations + main computation
- Sequential execution in between
Time spent in barriers
- States on time line / statistics panel
Load imbalance
- Sufficient parallelism
- High load imbalance, but not due to partitioning / schedule
- Same NUMA node → approx. same execution time
Solution
- Change allocation scheme: one big allocation
- Reduce number of workers: #iters = n × #workers
- Result: 35× speedup
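The condition #iters = n × #workers means that the number of workers divides the number of iterations, so every worker receives the same number of iterations n. The slides do not say how the worker count was chosen in the demo; the function `balanced_worker_count` below is a hypothetical helper that shows one simple way to pick such a count under a given upper limit.

```c
#include <assert.h>

/* Largest worker count not exceeding max_workers that divides
 * num_iters evenly, so that num_iters = n * workers for an integer n.
 * Falls back to 1, which always divides num_iters. */
int balanced_worker_count(long num_iters, int max_workers)
{
    assert(num_iters > 0 && max_workers > 0);

    for (int w = max_workers; w > 1; w--) {
        if (num_iters % w == 0)
            return w;
    }

    return 1;
}
```

As an example, for 512 iterations and at most 192 workers the function returns 128, which gives each worker n = 4 iterations.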
Overhead of Tracing
[Bar chart: Relative increase of execution time [%] (mean for 50 runs / error bars: standard deviation)]

Benchmark               Suite                      Increase [%]
CG                      NPB-2.3 (loop-based)        0.88
EP                      NPB-2.3 (loop-based)        0.60
FT                      NPB-2.3 (loop-based)       -0.66
LU                      NPB-2.3 (loop-based)        0.35
MG                      NPB-2.3 (loop-based)        1.77
sparselu                BOTS 1.1.2 (task-based)     0.01
strassen                BOTS 1.1.2 (task-based)     5.79
alignment               BOTS 1.1.2 (task-based)    -0.03
fft                     BOTS 1.1.2 (task-based)     0.24
sort                    BOTS 1.1.2 (task-based)     4.07
Geometric mean (abs.)                               0.46

Test system
- SGI UV 2000 (192 cores, 24 NUMA nodes)
Missing benchmarks
- Outlier: floorplan (+380% execution time; very small tasks)
- Segfaults (BT, nqueens, uts) / Excessive execution time (IS) / Verification failure (health)
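A relative increase of execution time in percent can be derived from the mean execution times of runs with and without tracing. The sketch below shows one plausible way to compute it. The exact procedure used for the chart is not given in the slides, and the function names `mean_time` and `relative_increase_percent` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n measured execution times. */
double mean_time(const double *times, size_t n)
{
    double sum = 0.0;

    assert(n > 0);

    for (size_t i = 0; i < n; i++)
        sum += times[i];

    return sum / (double)n;
}

/* Relative increase of the mean execution time with tracing over the
 * mean execution time without tracing, in percent. The result is
 * negative if the traced runs are faster on average. */
double relative_increase_percent(const double *untraced,
                                 const double *traced, size_t n)
{
    double base = mean_time(untraced, n);

    assert(base > 0.0);

    return 100.0 * (mean_time(traced, n) - base) / base;
}
```

A negative result, as for two of the benchmarks on the slide, simply means that the traced runs were slightly faster on average than the untraced ones.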
Using Aftermath & Aftermath-OpenMP
Drop-in replacement for libomp with wrapper script:
$ aftermath-openmp-trace -o events.ost -- <program> <args>
$ aftermath events.ost
Source code and tutorial: http://www.openstream.info/aftermath
Virtual Machine (Aftermath + Aftermath-OpenMP + sample traces + documentation): http://www.openstream.info/vm
Summary
Aftermath
- Reactive graphical user interface for trace analysis
- Programming model-centric analysis: Loops and tasks
Aftermath-OpenMP
- Instrumented LLVM/clang OpenMP run-time
- Low tracing overhead
Future work
- Dependent tasks
- Automate recurring analyses
On-line resources
- http://www.openstream.info/aftermath (Main website)
- http://www.openstream.info/vm (VM image)