John Mellor-Crummey
Department of Computer Science
Center for High Performance Software Research
Taming High Performance Computing with Compiler Technology
www.cs.rice.edu/~johnmc/presentations/rice-4-04.pdf
2
High Performance Computing Applications
• Scientific inquiry ranging from elementary particles to cosmology
• Pollution modeling and remediation planning
• Storm forecasting and climate prediction
• Advanced vehicle design
• Computational chemistry and drug design
• Molecular nanotechnology
• Cryptology
• Nuclear weapons stewardship
3
High Performance Applications
[Diagram: high performance applications draw on algorithms, data structures, and architectures]
• Effective parallelizations ↔ scalability
• Single-processor performance can differ by integer factors
4
Status of Highly-parallel Systems
“[Scalable, highly-parallel, microprocessor-based systems] remain in the research and experimental stage primarily because we lack adequate software technology, application-development tools, and, ultimately, well-developed applications.”
— “Information Technology Research: Investing in our Future” PITAC Report to the President, 1999
5
Challenges for Highly Parallel Computing
• Effective algorithms for complex problems
• Programming models and compilers
• Application development tools
• Operating systems for large-scale machines
• Better designs for high-performance architectures
6
Current Research Themes
• Compiler support for data parallel programming
—Implicitly and explicitly parallel global address space languages
• Technology for auto-tuning software
—Automatically tailor code to a microprocessor architecture
• Performance analysis tools
—Understanding application behavior on current systems
• Performance modeling
—How will applications perform at different scales and on future systems?
• Compiler technology for scientific scripting languages
—R language for statistical programming
7
Outline
• Motivation
• Compiler technology for HPC
☛Compiling data-parallel languages
— Semi-automatic synthesis of performance models
• Challenges for the future
• Other work
8
Compiling data-parallel languages
• Introduction
— Data parallelism
— Compiling HPF-like languages
• Rice dHPF compiler
— Data partitioning research
— Analysis and code generation
• Experimental results
9
Data Parallelism
• Apply the same operation to many data elements (see the sketch below)
—need not be synchronous
—need not be completely uniform
• Applicable to many problems in science and engineering
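A minimal Fortran sketch of the idea (my own illustration, not from the deck): the update below applies the same operation to every element, and no iteration depends on another, so the iterations may run in any order or in parallel.

      program data_parallel
        integer :: i
        real :: a(8), b(8)
        b = 1.0
        ! same operation applied to every element; iterations are
        ! independent, which is what makes this data parallel
        forall (i = 1:8) a(i) = 2.0 * b(i) + 1.0
        print *, a
      end program data_parallel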
10
Data Parallel Programming Alternatives
• Hand-coded parallelizations using library-based models
—complete applicability
—difficult to design and implement
—all responsibility for tuning falls to the developer
• Application frameworks
—easy to use
—limited applicability
• Single-threaded data-parallel languages
—much more flexible than application frameworks
—much simpler to use than hand-coded parallelizations
—compilers largely determine performance
– offload details of tuning from the developer
– compilers are enormously complex
– out of luck if the compiler doesn’t deliver performance
11
Data Parallel Compilation
[Diagram: an HPF (High Performance Fortran) program, i.e. a Fortran program + data partitioning, passes through compilation (partition computation, insert communication, manage storage; same answers as the sequential program) onto a parallel machine]
Partitioning of data drives partitioning of computation, communication, and synchronization
12
Example HPF Program
CHPF$ processors P(3,3)
CHPF$ distribute A(block, block) onto P
CHPF$ distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i,j) = .25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))
[Figure: the (BLOCK,BLOCK) distribution of the data for A and B across processors P(0,0) … P(2,2)]
13
Compiling HPF-like Languages
• Partition data
• Select mapping of computation to processors
• Analyze communication requirements
• Partition computation by reducing loop bounds (sketched below)
• Insert communication
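A hedged sketch of the loop-bounds step above (illustrative only; dHPF's actual generated code differs): for a 1-D BLOCK distribution, each SPMD process intersects the global loop bounds with the index range it owns.

      program spmd_bounds
        integer, parameter :: n = 100, np = 4
        integer :: pid, blk, lo, hi, i
        real :: a(n)
        a = 0.0
        blk = (n + np - 1) / np            ! block size of A(BLOCK) on np procs
        do pid = 0, np - 1                 ! simulate the np SPMD processes
          lo = max(2, pid*blk + 1)         ! owned index range intersected ...
          hi = min(n - 1, (pid + 1)*blk)   ! ... with the original bounds 2..n-1
          do i = lo, hi                    ! the reduced local loop
            a(i) = real(i)
          end do
        end do
        print *, a(1), a(2), a(n-1), a(n)  ! 0.0, 2.0, 99.0, 0.0
      end program spmd_bounds

Each process touches only the indices it owns, so the interior of a is written exactly once and the boundary elements stay untouched.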
14
The Devil is in the Details …
• Good data and computation partitionings are a must
—without good partitionings, parallelism suffers!
• Excess communication undermines scalability
—both frequency and volume must be right!
• Single processor efficiency is critical
—must use caches effectively
—node code must be amenable to optimization
Goal: compiler and runtime techniques that enable simple and natural programming, yet deliver the performance of hand-coded parallelizations
15
Rice dHPF Compiler
Achievements
• parallelize sequential codes with minimal rewriting
• near hand-coded performance for tightly coupled codes
Innovations
• Sophisticated data partitionings
• Abstract set-based framework for communication analysis and code generation
• Sophisticated computation partitionings
—partial replication to reduce communication
• Comprehensive optimizations
16
Data Partitioning
• Good parallel performance requires suitable partitioning
• Tightly-coupled computations are problematic
• Line-sweep computations, e.g., ADI integration
—recurrences make parallelization difficult with BLOCK partitionings:
      do j = 1, n
        do i = 2, n
          a(i,j) = … a(i-1,j) …
17
Coarse-Grain Pipelining
[Figure: a block-partitioned 2-D domain across Processors 0–3]
• Compute along partitioned dimensions
• Partial serialization induces wavefront parallelism with block partitioning
19
Parallelizing Line Sweeps
[Figure: execution traces contrasting hand-coded multipartitioning with compiler-generated coarse-grain pipelining]
20
Diagonal Multipartitioning
[Figure: a 4×4 tiling of the domain distributed diagonally across Processors 0–3]
• Each processor owns 1 tile between each pair of cuts along each distributed dimension
• Enables full parallelism for a sweep along any partitioned dimension (ownership rule sketched below)
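The ownership rule behind the picture can be sketched in a few lines of Fortran (my illustration; the deck's figures show the same 2-D case): with p processors and a p×p tiling, assigning tile (i,j) to processor (j - i) mod p places each processor exactly once in every tile row and column.

      program multipart_map
        integer, parameter :: p = 4
        integer :: i, j
        do i = 0, p - 1
          ! owner of tile (i,j); modulo handles negative differences
          print '(4i3)', (modulo(j - i, p), j = 0, p - 1)
        end do
      end program multipart_map

For p = 4 this prints a 4×4 Latin square of owners, which is exactly the property that keeps all processors busy during a sweep along either dimension.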
22
Generalized Multipartitioning
Given an n-dimensional data domain and p processors, select
—which λ dimensions to partition, 2 ≤ λ ≤ n, and how many cuts in each
• Partitioning constraints
—# tiles in each (λ - 1)-dimensional hyperplane is a multiple of p
—no more cuts than necessary
• Mapping constraints
—load balance: in a hyperplane, each processor has the same # of tiles
—neighbor: in any particular direction, the neighbor of a given processor is the same
• Objective function: minimize communication volume
—pick the configuration of cuts to minimize total cross-section
IPDPS 2002 Best Paper in Algorithms; JPDC 2003
23
Choosing the Best Partitioning
• Enumerate all elementary partitionings
—candidates depend on factorization of p
• Evaluate their communication cost
• Select the minimum cost partitioning
• Complexity
—grows with the number of choices for picking a pair of dimensions to partition with a number of cuts divisible by a particular prime factor, and with the number of possible unique prime factors of p
—very fast in practice
—worst case: p is a product of unique prime factors
24
Mapping Tiles with Modular Mappings
[Figure: a basic tile shape is replicated an integral number of times and shifted modularly in each dimension to cover the domain]
25
Formal Compilation Framework
• 3 types of sets: Data, Iterations, Processors
• 3 types of mappings
—Layout: processors → data
—Reference: iterations → data
—CompPart: processors → iterations
• Representation
—integer tuples with Presburger arithmetic for constraints
• Analysis: use set equations to compute set(s) of interest
—iterations allocated to a processor
—communication sets
• Code generation: synthesize loops from set(s), e.g.
—parallel (SPMD) loop nests
—message packing and unpacking
[Adve & Mellor-Crummey, PLDI98]
26
Why Symbolic Sets?
      processors P(3,3)
      distribute A(block, block) onto P
      distribute B(block, block) onto P
      DO i = 2, n - 1
        DO j = 2, n - 1
          A(i, j) = .25 * (B(i-1, j) + B(i+1, j) + B(i, j-1) + B(i, j+1))
[Figure: the 3×3 processor grid P(0,0) … P(2,2), highlighting for processor P(x,y) its local section (and the iterations it executes), the non-local data it accesses, and the iterations that access non-local data]
Data / loop partitioning for P(x,y):
      { [i,j] : 20x+2 ≤ i ≤ 20x+19 && 30y+2 ≤ j ≤ 30y+29 }
27
Integer-Set Framework: Example
      real A(100)
      distribute A(BLOCK) on P(4)
      do i = 1, N
        ... = A(i-1) + A(i-2) + ...     ! ON_HOME A(i-1)
      enddo
symbolic N
Layout := { [pid] -> [i] : 25*pid + 1 ≤ i ≤ 25*pid + 25 }
Loop := { [i] : 1 ≤ i ≤ N }
CPSubscript := { [i] -> [i-1] }
RefSubscript := { [i] -> [i-2] }
CompPart := (Layout o CPSubscript⁻¹) ∩ Loop
DataAccessed = CompPart o RefSubscript
NonLocalDataAccessed = DataAccessed - Layout
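Working these equations through (a worked instance I have added, assuming a 0-based pid on an interior processor, i.e. pid ≥ 1 and 25*pid + 26 ≤ N):

      CPSubscript⁻¹ = { [d] -> [d+1] }
      CompPart = { [pid] -> [i] : 25*pid + 2 ≤ i ≤ 25*pid + 26 }
      DataAccessed = { [pid] -> [d] : 25*pid ≤ d ≤ 25*pid + 24 }
      NonLocalDataAccessed = { [pid] -> [d] : d = 25*pid }

So each such processor must fetch exactly one element, A(25*pid), which the BLOCK layout places on its left neighbor; the A(i-1) reference is local everywhere because ON_HOME A(i-1) makes it the home reference.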
28
Optimizations using Integer Sets
• Partially replicate computation to reduce communication
—66% lower message volume, 38% faster: NAS BT @ 64 procs
• Coalesce communication sets for multiple references
—41% lower message volume, 35% faster: NAS SP @ 64 procs
• Split loops into “local-only” and “off-processor” loops
—10% fewer Dcache misses, 9% faster: NAS SP @64 procs
• Processor set constraints on communication sets
—12% fewer Icache misses, 7% faster: NAS SP @ 64 procs
PACT 2002 Best Student Paper (with Daniel Chavarría-Miranda)
29
Experimental Evaluation
• NAS SP & BT benchmarks from NASA Ames
—use ADI to solve the Navier-Stokes equation in 3D
—forward & backward line sweeps on each dimension, each time step
• Compare four variants
—MPI hand-coded multipartitioning (NASA)
—dHPF: multipartitioned
—dHPF: 2D partitioning, coarse-grain pipelining
—PGI’s pghpf: 1D partitioning with transpose
• Platform
—SGI Origin 2000: 128 250 MHz processors
—SGI compilers + SGI MPI
30
Efficiency for NAS SP (102³ ‘B’ size)
[Chart: parallel efficiency vs. processor count; annotations: “> 2x multipartitioning comm. volume”, “similar comm. volume, more serialization”]
31
Efficiency for NAS BT (102³ ‘B’ size)
[Chart: parallel efficiency vs. processor count; annotation: “> 2x multipartitioning comm. volume”]
Platform: SGI Origin 2000
32
NAS BT Parallelizations
[Execution traces for NAS BT Class ‘A’, 16 processors, SGI Origin 2000: hand-coded 3D multipartitioning vs. compiler-generated 3D multipartitioning]
33
Observations
• High performance requires perfection
—parallelism and load-balance
—communication frequency
—communication volume
—scalar performance
• Data-parallel compiler technology can
—ease the programming burden
—yield near hand-coded performance
34
Data-parallel Related Work
• Linear equations/set-based compilation
—[Pugh et al; Ancourt et al; Amarasinghe & Lam]
• Commercial HPF compilers
—xlHPF, pghpf, xHPF
• HPF/JA
—14 Teraflops on a code for the Earth Simulator
• Lots of research compiler efforts
—e.g. Polaris, CAPTOOLS
None support partially-replicated computation
None support multipartitioning
None achieve linear scaling on tightly-coupled codes
35
Outline
• Motivation
• Compiler technology for HPC
— Data-parallel programming systems
☛Semi-automatic synthesis of performance models
• Challenges for the future
• Other work
36
Why Performance Modeling?
• Insight into applications
—barriers to scalability
—insight into optimizations
• Mapping applications to systems
—Grid resource selection & scheduling
—intelligent run-time adaptation
• Workload-based design of future systems
37
Modeling Challenges
• Performance depends on:
—architecture specific factors
—application characteristics
—input data parameters
• Difficult to model execution time directly
• Collecting data at scale is expensive
38
Approach
Separate contribution of application characteristics
• Measure the application-specific factors
—static analysis
—dynamic analysis
• Construct scalable models
• Explore interactions with hardware
Use binary analysis and instrumentation for language and programming model independence
[Marin & Mellor-Crummey SIGMETRICS 04]
39
Toolkit Design Overview
[Diagram]
• Static analysis: a binary analyzer extracts the control flow graph, loop nesting structure, and BB (basic block) instruction mix from the object code
• Dynamic analysis: a binary instrumenter produces instrumented code whose execution yields BB counts, communication volume & frequency, and memory reuse distance
• Post processing: a post-processing tool combines these into an architecture-neutral model; a scheduler, driven by an architecture description, turns the model into a performance prediction for the target architecture
40
Building Scalable Models
• Collect data from multiple runs
—n+1 runs to compute a model of degree n
• Approximation function: F(X) = cn*Bn(X) + cn-1*Bn-1(X) + … + c0*B0(X)
—the Bi(X) are a set of basis functions
—include constraints
• Goal: determine the coefficients
—use quadratic programming (least-squares core sketched below)
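A hedged sketch of the fitting step (the toolkit itself uses constrained quadratic programming; this illustration solves only the unconstrained least-squares core, with monomial basis functions and synthetic samples generated from the degree-2 model in the example that follows):

      program polyfit
        integer, parameter :: m = 4, nc = 3
        real(8) :: x(m) = [5d0, 10d0, 20d0, 40d0]
        real(8) :: y(m), a(nc,nc), b(nc), f
        integer :: i, j, k, r
        y = 482d0*x**2 + 1446d0*x + 964d0         ! synthetic exact samples
        a = 0d0; b = 0d0
        do k = 1, m                               ! accumulate normal equations
          do i = 1, nc
            do j = 1, nc
              a(i,j) = a(i,j) + x(k)**(i-1) * x(k)**(j-1)
            end do
            b(i) = b(i) + y(k) * x(k)**(i-1)
          end do
        end do
        do r = 1, nc - 1                          ! Gaussian elimination
          do i = r + 1, nc
            f = a(i,r) / a(r,r)
            a(i,:) = a(i,:) - f * a(r,:)
            b(i) = b(i) - f * b(r)
          end do
        end do
        do i = nc, 1, -1                          ! back substitution
          b(i) = (b(i) - sum(a(i,i+1:nc) * b(i+1:nc))) / a(i,i)
        end do
        print *, 'c0, c1, c2 =', b                ! expect 964, 1446, 482
      end program polyfit

With exact quadratic samples the residual is zero, matching the Err = 0% of the degree-2 fit on the slides that follow.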
41
Execution Frequency Modeling Example
[Chart: execution frequency vs. problem size, showing the collected data points]
42
Execution Frequency Modeling Example
[Chart: collected data with a degree-0 model overlaid: Y = 41416, Err = 131%]
43
Execution Frequency Modeling Example
[Chart: collected data with degree-0 and degree-1 models: Y = 41416, Err = 131%; Y = 16776*X - 42366, Err = 60.4%]
44
Execution Frequency Modeling Example
[Chart: collected data with degree-0, 1, and 2 models: Y = 41416, Err = 131%; Y = 16776*X - 42366, Err = 60.4%; Y = 482*X² + 1446*X + 964, Err = 0%]
45
Predict Schedule Latency for an Architecture
• Input:
—basic block and edge execution frequency
• Methodology:
—recover executed paths
—SPARC instructions ➔ generic RISC
—instantiate scheduler for architecture
—construct schedule for executed paths
—determine inefficiencies
46
Toolkit Design Overview
[Diagram, repeated from slide 39: object code feeds static analysis (binary analyzer) and dynamic analysis (binary instrumenter, instrumented code, execution: BB counts, communication volume & frequency, memory reuse distance); post processing yields an architecture-neutral model, which a scheduler and an architecture description turn into a performance prediction for the target architecture]
47
Memory Reuse Distance
• MRD: # unique data blocks referenced since the target block was last accessed (computation sketched after the example)
Reference trace (∞ marks a cold reference):
      instruction:  I1  I2  I3  I2  I3  I2  I3
      memory block: A   B   A   C   A   B   B
      MRD:          ∞   ∞   1   ∞   1   2   0
• I1: 1 cold miss
• I2: 2 cold misses, 1 @ distance 2
• I3: 1 @ distance 0, 2 @ distance 1
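A hedged Fortran sketch of the MRD computation on this trace (my illustration; the actual toolkit computes it from instrumented binaries with far more efficient data structures). It uses an O(T·B) scan: for each reference, count the distinct blocks touched since the same block was last accessed.

      program mrd_demo
        integer, parameter :: t = 7, nblocks = 3
        ! trace from the slide: A B A C A B B  (A=1, B=2, C=3)
        integer :: trace(t) = [1, 2, 1, 3, 1, 2, 2]
        integer :: last(nblocks), dist, i, k
        logical :: seen(nblocks)
        last = 0
        do i = 1, t
          if (last(trace(i)) == 0) then
            dist = -1                      ! cold: never accessed before (∞)
          else
            seen = .false.                 ! count distinct blocks in between
            do k = last(trace(i)) + 1, i - 1
              seen(trace(k)) = .true.
            end do
            dist = count(seen)
          end if
          print '(a,i2,a,i3)', 'ref ', i, '  MRD =', dist
          last(trace(i)) = i
        end do
      end program mrd_demo

Running it reproduces the MRD row above: -1, -1, 1, -1, 1, 2, 0.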
49
Modeling Memory Reuse Distance
• More complex than execution frequency
—cold misses
—histogram of reuse distances
– number of bins is not constant
• Average reuse distance is misleading, e.g.
—1 access with distance 10,000
—3 accesses with distance 0
—average distance 2500: in a cache with 1024 blocks, the average predicts all four accesses miss, when in fact three hit and one misses
50
Modeling Memory Reuse Distance
[Histogram: normalized frequency vs. reuse distance; 50% of references at distance 2, 30% at distance 13, 20% at distance 40]
52
Predict Number of Cache Misses
• Instantiate the model for problem size 100 (prediction rule sketched below)
[Figure: predicted vs. measured cache misses; annotations: 74%, 96%]
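The prediction rule itself is simple (a sketch under the standard stack-distance assumption of a fully associative LRU cache; the histogram numbers here are made up): every non-cold reference whose reuse distance reaches the cache's block count misses, and the rest hit.

      program predict_misses
        integer, parameter :: nbins = 3, nblk = 1024
        integer :: dist(nbins) = [2, 13, 4000]    ! bin reuse distances (made up)
        real    :: frac(nbins) = [0.5, 0.3, 0.2]  ! fraction of references per bin
        real    :: miss
        ! a reference hits iff its reuse distance fits in the cache
        miss = sum(frac, mask = dist >= nblk)
        print *, 'predicted miss ratio =', miss   ! 0.2 for this histogram
      end program predict_misses

With the earlier histogram (distances 2, 13, 40) and 1024 blocks, the predicted miss ratio from non-cold references is zero; only cold misses remain.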
53
Prediction: NAS BT 3.0 Memory Hierarchy Utilization
[Chart: miss count / cell / time step vs. mesh size (0–200); curves: L1 measured, L1 predicted, L2 measured (×10), L2 predicted (×10), TLB measured (×10), TLB predicted (×10)]
54
Prediction: NAS BT 3.0 Time on SGI Origin
[Chart: NAS BT 3.0 from SPARC to SGI Origin; cycles / cell / time step vs. mesh size (0–200); curves: measured time, scheduler latency, L1 miss penalty, L2 miss penalty, TLB miss penalty, predicted time; inset bar chart: measured time vs. scheduler latency and L1 miss penalty]
55
Open Performance Modeling Issues
• Short term
—Better modeling of the memory subsystem
– # of outstanding loads, to accurately predict memory latency
—Explore modeling of irregular applications
• Long term
—Model parallel applications
– present modeling applies between synchronization points
– combine with manually constructed parallel models
– semi-automatically recover parallel trends
—Understand dynamic parallelism
56
Modeling Related Work
• Reuse distance
—cache utilization [Beyls & D’Hollander]
—investigating optimizations [Ding et al.]
• Program instrumentation
—EEL, QPT [Ball, Larus, Schnarr]
• Scalable analytic models
—[Vernon et al.; Hoisie et al.]
• Cross-architecture models at scale
—[Snavely et al.; Cascaval et al.]
• Simulation (trace-based and execution-driven)
None yield semi-automatically derived scalable models
57
HPC Compiler Challenges for the Future
• Programming systems for large-scale machines
—Abstraction and greater expressiveness are needed
—Potential parallelism must be readily accessible
– implicit parallelism or explicit element-wise parallelism
—Locality and latency tolerance are both critical for performance
—Dynamic self-scheduled parallelism will be necessary
—Failure will occur and must be expected and handled
• Support for “self-tuning software” for complex architectures
• Compiler-based tools
—Debugging and performance analysis of large-scale software on dynamic systems is a major open problem
• Insight into hardware design
—Understanding the impact of proposed designs on whole programs
58
Past Work
• Multiprocessor synchronization
—locks, synchronous barriers [ASPLOS89, TOCS91]
—reader-writer synchronization [PPOPP91]
—fuzzy barriers [IJPP94]
• Parallel debugging
—execution replay [JPDC90, TOC87]
—software instruction counter [ASPLOS89]
—detecting data races [WPDD93, SC91, SC90]
• Parallel programming environments
—Parascope [PIEEE 93], Dsystem [TPDT94]
• Parallel applications
—molecular dynamics [JCC92]
59
Ongoing Work
• Global address space parallel languages
—Co-array Fortran [LCPC03]
• Performance analysis
— [TJS02, LACSI01, ICS01, SIGMETRICS01]
• Improving node performance
—irregular mesh and particle codes [ICS99, IJPP00]
—sparse matrices [LACSI02, IJHPCA04]
—multigrid [ICS01]
—dense matrices [LACSI03]
• Grid computing [IJHPCA01]
• Library-based domain languages [JPDC01]