The R-Stream High-Level Program Transformation Tool
N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin
Reservoir Labs, Harvard, 04/12/2011
Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream
Power efficiency driving architectures

[Figure: four processing tiles, each combining a GPP, SIMD units, an FPGA, DMA engines, and a local memory]

• Heterogeneous processing
• Distributed local memories, explicitly managed
• NUMA architecture
• Bandwidth starved
• Hierarchical (including board, chassis, cabinet)
• Multiple execution models
• Mixed parallelism types
• Multiple spatial dimensions
Computation choreography

• Expressing it
  • Annotations and pragma dialects for C
  • Explicitly (e.g., new languages like CUDA and OpenCL)
• But before expressing it, how can programmers find it?
  • Manual constructive procedures: art, sweat, time
    – Artisans get complete control over every detail
  • Automatically (our focus)
    – An operations research problem
    – Like scheduling trucks to save fuel
    – Model, solve, implement
    – Faster, and sometimes better, than a human
How to do automatic scheduling?

• Naïve approach
  • Model
    – Tasks, dependences
    – Resource use, latencies
    – Machine model with connectivity, resource capacities
  • Solve with ILP
    – Minimize overall task length
    – Subject to dependences, resource use
  • Problems
    – Complexity: the task graph is huge!
    – Dynamics: loop lengths are unknown.
• So we do something much cooler.
Program Transformations Specification

[Figure: iteration space of a statement S(i,j), with axes i and j, mapped by a schedule θ : Z² → Z² to time axes (t1, t2)]

• Schedule maps iterations to multi-dimensional time (formalized below)
  • A feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space
  • UHPC in progress, partially done
• Layout maps data elements to multi-dimensional space
  • UHPC in progress
• Hierarchical by design; tiling serves separation of concerns
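In standard polyhedral notation (our sketch; these symbols are conventional, not R-Stream-specific), the three maps for a statement S and an array A are affine functions:

\[
\theta_S(\vec{\imath}) = T_S\,\vec{\imath} + \vec{t}_S \quad \text{(schedule: iterations to multi-dimensional time)}
\]
\[
\pi_S(\vec{\imath}) = P_S\,\vec{\imath} + \vec{p}_S \quad \text{(placement: iterations to multi-dimensional space)}
\]
\[
\lambda_A(\vec{a}) = L_A\,\vec{a} + \vec{\ell}_A \quad \text{(layout: data elements to memory space)}
\]

A schedule is feasible when every dependence from instance \(\vec{\imath}\) to instance \(\vec{\jmath}\) satisfies \(\theta(\vec{\imath}) \prec \theta(\vec{\jmath})\), i.e., sources execute before sinks in lexicographic time order.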
Loop transformations

Original:
  for(i=0; i<N; i++)
    for(j=0; j<N; j++)
      S(i,j);

permutation, θ(i,j) = (j,i):
  for(j=0; j<N; j++)
    for(i=0; i<N; i++)
      S(i,j);

reversal, θ(i,j) = (-i,j):
  for(i=N-1; i>=0; i--)
    for(j=0; j<N; j++)
      S(i,j);

skewing, θ(i,j) = (i, j+αi):
  for(i=0; i<N; i++)
    for(j=α*i; j<N+α*i; j++)
      S(i,j-α*i);

scaling (not unimodular), θ(i,j) = (βi, j):
  for(i=0; i<β*N; i+=β)
    for(j=0; j<N; j++)
      S(i/β,j);

Permutation, reversal, and skewing are unimodular transformations. A worked skewing example follows below.
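Why these reorderings matter: in the hand-written sketch below (ours, not R-Stream output), neither loop of a 2D recurrence is parallel, but skewing time to t = i+j makes every iteration on an anti-diagonal independent:

  #define N 1024
  double A[N][N];

  void wavefront(void) {
    /* Original nest: for(i) for(j) A[i][j] = A[i-1][j] + A[i][j-1];
       both loops carry dependences. After skewing with t = i + j,
       iterations along each anti-diagonal are independent. */
    for (int t = 2; t <= 2*(N-1); t++) {
      int lo = (t - (N-1) > 1) ? t - (N-1) : 1;
      int hi = (t - 1 < N-1) ? t - 1 : N - 1;
      #pragma omp parallel for
      for (int i = lo; i <= hi; i++) {
        int j = t - i;
        A[i][j] = A[i-1][j] + A[i][j-1];
      }
    }
  }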
Loop fusion and distribution

distributed:
  for(i=0; i<N; i++) {
    for(j=0; j<N; j++)
      S1(i,j);
    for(j=0; j<N; j++)
      S2(i,j);
  }

fused:
  for(i=0; i<N; i++)
    for(j=0; j<N; j++) {
      S1(i,j);
      S2(i,j);
    }

Schedules (standard 2d+1 encoding):
  distribution: θ_S1(i,j) = (i, 0, j),  θ_S2(i,j) = (i, 1, j)
  fusion:       θ_S1(i,j) = (i, j, 0),  θ_S2(i,j) = (i, j, 1)

A concrete fusion instance follows below.
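A concrete instance (our illustration): fusing a producer with its consumer lets each B[i][j] be reused while still in a register or cache, instead of being written out and re-read much later.

  enum { N = 512 };
  double A[N][N], B[N][N], C[N][N];

  void distributed_version(void) {
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        B[i][j] = 2.0 * A[i][j];      /* S1: producer */
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        C[i][j] = B[i][j] + 1.0;      /* S2: consumer re-reads B from memory */
  }

  void fused_version(void) {
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++) {
        B[i][j] = 2.0 * A[i][j];      /* S1 */
        C[i][j] = B[i][j] + 1.0;      /* S2: consumes B[i][j] immediately */
      }
  }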
Enabling technology is new compiler math

• Uniform Recurrence Equations [Karp et al. 1970]
• Loop Transformations and Parallelization [1970-]
  • Many: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
  • Mostly linear-algebraic
  • Unimodular transformations
  • Dependence summary: direction/distance vectors
  • Vectorization, SMP, locality optimizations
• Systolic Array Mapping
• Polyhedral Model [1980-]
  • Many: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
  • New computational techniques based on polyhedral representations
  • Exact dependence analysis
  • General affine transformations
  • Loop synthesis via polyhedral scanning
R-Stream model: polyhedra

  n = f();
  for (i=5; i<=n; i+=2) {
    A[i][i] = A[i][i]/B[i];
    for (j=0; j<=i; j++) {
      if (j<=10) {
        … A[i+2j+n][i+3] …
      }
    }
  }

Iteration domain of the inner statement:
  D = { (i,j) ∈ Z² | ∃k ∈ Z : 5 ≤ i ≤ n, 0 ≤ j ≤ i, j ≤ 10, i = 2k+1 }

Access function of A:
  f_A(i,j) = (i + 2j + n, i + 3)

• Loop code represented (exactly or conservatively) with polyhedra
• High-level, mathematical view of a mapping
• But targets concrete properties: parallelism, locality, memory footprint
• Affine and non-affine transformations
• Order and place of operations and data
Polyhedral slogans

• Parametric imperfect loop nests
• Subsumes classical transformations
• Compacts the transformation search space
• Parallelization, locality optimization (communication avoiding)
• Preserves semantics
• Analytic joint formulations of optimizations
• Not just for affine static control programs
Polyhedral model – challenges in building a compiler

• Killer math
• Scalability of optimizations/code generation
• Mostly confined to dependence-preserving transformations
• Code can be radically transformed – outputs can look wildly different
• Modeling indirections, pointers, non-affine code
• Many of these challenges are solved
R-Stream blueprint

[Figure: compilation pipeline – EDG C front end → scalar representation → raising → polyhedral mapper (driven by the machine model) → lowering → pretty printer]
Inside the polyhedral mapper

[Figure: the GDG representation (built on Jolylib, …) is transformed by optimization modules orchestrated by a Tactics Module]

• Parallelization / locality optimization
• Tiling
• Placement
• Communication generation
• Memory promotion
• Synchronization generation
• Layout optimization
• Polyhedral scanning
• …
Optimization modules are engineered to expose "knobs" that can be used by an auto-tuner.
Driving the mapping: the machine model

• Target machine characteristics that influence how the mapping should be done:
  • Local memory / cache sizes
  • Communication facilities: DMA, cache(s)
  • Synchronization capabilities
  • Symmetrical or not
  • SIMD width
  • Bandwidths
• Currently: two-level model (Host and Accelerators)
• XML schema and graphical rendering (a hypothetical parameter sketch follows below)
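As a rough illustration only (R-Stream's actual machine model is an XML schema; the field names below are invented), the recorded characteristics are of this flavor:

  #include <stddef.h>

  /* Hypothetical sketch of the parameters a machine-model level records. */
  typedef struct {
    size_t local_memory_bytes;   /* scratchpad / cache size */
    int    has_dma;              /* bulk-transfer engine available? */
    int    simd_width;           /* vector lanes */
    int    symmetric;            /* are all children identical? */
    int    num_children;         /* fan-out to the level below */
    double bandwidth_gb_s;       /* bandwidth to the parent level */
  } MachineLevel;

  /* Currently a two-level model: host plus accelerators. */
  typedef struct {
    MachineLevel host;
    MachineLevel accelerator;
  } MachineModel;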
Machine model example: multi-Tesla

[Figure: machine model rendered from an XML file – a Host running an OpenMP morph, with one thread per GPU, each driving a CUDA morph]
Mapping process

[Figure: pipeline of mapping stages, linked by dependencies]

1. Scheduling: parallelism, locality, tilability
2. Task formation:
   – Coarse-grain atomic tasks
   – Master/slave side operations
   – Local / global data layout optimization
   – Multi-buffering (explicitly managed; see the sketch below)
   – Synchronization (barriers)
   – Bulk communications
   – Thread generation -> master/slave
   – CUDA-specific optimizations
3. Placement: assign tasks to blocks/threads
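A minimal hand-written sketch of the multi-buffering pattern, assuming hypothetical dma_get_async/dma_wait primitives (the real transfer API depends on the target):

  #include <stddef.h>

  void dma_get_async(void *dst, const void *src, size_t bytes);  /* hypothetical */
  void dma_wait(void *dst);                                      /* hypothetical */
  void compute_tile(float *tile, int n);

  enum { TILE = 256, NTILES = 64 };
  static float buf[2][TILE];

  void stream_tiles(const float *global_in) {
    int cur = 0;
    dma_get_async(buf[cur], &global_in[0], TILE * sizeof(float));
    for (int t = 0; t < NTILES; t++) {
      int nxt = 1 - cur;
      if (t + 1 < NTILES)              /* prefetch the next tile ... */
        dma_get_async(buf[nxt], &global_in[(t + 1) * TILE], TILE * sizeof(float));
      dma_wait(buf[cur]);              /* ... while finishing this one */
      compute_tile(buf[cur], TILE);
      cur = nxt;
    }
  }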
Program Transformations Specification (recap)

• Schedule maps iterations to multi-dimensional time; a feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space (UHPC in progress, partially done)
• Layout maps data elements to multi-dimensional space (UHPC in progress)
• Hierarchical by design; tiling serves separation of concerns
Model for scheduling trades 3 objectives jointly

[Figure: loop fusion/fission tradeoff triangle]

• Loop fission → more parallelism → sufficient occupancy
• Loop fusion → more locality → fewer global memory accesses
• Successive-thread contiguity → memory coalescing → better effective bandwidth

Patent pending
Optimization with BLAS vs. global optimization

  /* Optimization with BLAS */
  for loop {
    … BLAS call 1 …
    … BLAS call 2 …
    …
    … BLAS call n …
  }

Outer loop(s): retrieve data Z from disk, store Z back to disk, retrieve Z from disk again (!); numerous cache misses.

vs.

  /* Global optimization */
  doall loop {
    …
    for loop {
      … [read from Z] … [write to Z] … [read from Z] …
    }
    …
  }

Loop fusion can improve locality, and the outer loop(s) can be parallelized.

Global optimization can expose better parallelism and locality (a concrete BLAS-call sketch follows below).
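As a concrete illustration (ours, not the deck's), the CSLC-LMS kernel used on the next slides can be phrased as repeated BLAS calls; every cblas_sgemv call re-streams the whole matrix B through the cache:

  #include <cblas.h>

  enum { K = 400, M = 3997, N = 4000 };
  static float B[M][N], x[K][N], z[M], w[M];

  void blas_version(void) {
    for (int k = 0; k < K; k++) {
      /* z = B * x[k]: streams all of B on every radar iteration */
      cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N,
                  1.0f, &B[0][0], N, x[k], 1, 0.0f, z, 1);
      /* w = w + z */
      cblas_saxpy(M, 1.0f, z, 1, w, 1);
    }
  }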
Tradeoffs between parallelism and locality

• Significant parallelism is needed to fully utilize all resources
• Locality is also critical to minimize communication
• Parallelism can come at the expense of locality
• Our approach: the R-Stream compiler exposes parallelism via affine scheduling that simultaneously improves locality using loop fusion

[Figure: high on-chip parallelism, limited bandwidth at the chip border; reusing data once loaded on chip = locality]
Parallelism/locality tradeoff example

  /*
   * Original code:
   * Simplified CSLC LMS
   */
  for (k=0; k<400; k++) {
    for (i=0; i<3997; i++) {
      z[i] = 0;
      for (j=0; j<4000; j++)
        z[i] = z[i] + B[i][j]*x[k][j];
    }
    for (i=0; i<3997; i++)
      w[i] = w[i] + z[i];
  }

Maximum parallelism (no fusion); array z gets expanded to introduce another level of parallelism:

  doall (i=0; i<400; i++)
    doall (j=0; j<3997; j++)
      z_e[j][i] = 0;
  doall (i=0; i<400; i++)
    doall (j=0; j<3997; j++)
      for (k=0; k<4000; k++)
        z_e[j][i] = z_e[j][i] + B[j][k]*x[i][k];
  doall (i=0; i<3997; i++)       /* data accumulation */
    for (j=0; j<400; j++)
      w[i] = w[i] + z_e[i][j];
  doall (i=0; i<3997; i++)
    z[i] = z_e[i][399];

2 levels of parallelism, but poor data reuse (on array z_e): maximum distribution destroys locality.
Parallelism/locality tradeoff example (cont.)

Maximum fusion (original code as on the previous slide):

  doall (i=0; i<3997; i++)
    for (j=0; j<400; j++) {
      z[i] = 0;
      for (k=0; k<4000; k++)
        z[i] = z[i] + B[i][k]*x[j][k];
      w[i] = w[i] + z[i];
    }

Very good data reuse (on array z), but aggressive loop fusion destroys parallelism: only 1 level of parallelism remains.
Parallelism/locality tradeoff example (cont.)

Parallelism with partial fusion (original code as above); array z is expanded, and partial fusion doesn't decrease parallelism:

  doall (i=0; i<3997; i++) {
    doall (j=0; j<400; j++) {
      z_e[i][j] = 0;
      for (k=0; k<4000; k++)
        z_e[i][j] = z_e[i][j] + B[i][k]*x[j][k];
    }
    for (j=0; j<400; j++)        /* data accumulation */
      w[i] = w[i] + z_e[i][j];
  }
  doall (i=0; i<3997; i++)
    z[i] = z_e[i][399];

2 levels of parallelism with good data reuse (on array z_e).
Parallelism/locality tradeoffs: performance numbers

[Figure: performance chart]

• Code with a good balance between parallelism and fusion performs best
• On explicitly managed memory/scratchpad architectures this is even more pronounced
R-Stream: affine scheduling and fusion

• R-Stream uses a heuristic based on an objective function with several cost coefficients:
  • the slowdown in execution if a loop p is executed sequentially rather than in parallel
  • the cost in performance if two loops p and q remain unfused rather than fused
• These two cost coefficients address parallelism and locality in a unified and unbiased manner (as opposed to traditional compilers)
• Fine-grained parallelism, such as SIMD, can also be modeled with a similar formulation

  minimize   Σ_{l ∈ loops} w_l·p_l  +  Σ_{e ∈ edges} u_e·f_e

where p_l = 1 if loop l runs sequentially (w_l: slowdown in sequential execution) and f_e = 1 if the two loops joined by edge e remain unfused (u_e: cost of unfusing two loops).

Patent pending
Parallelism + locality + spatial locality

  minimize   Σ_{l ∈ loops} w_l·p_l  +  Σ_{e ∈ edges} u_e·f_e
             (benefits of parallel execution)   (benefits of improved locality)

• New algorithm (unpublished) balances contiguity to enhance coalescing for GPUs and SIMDization
  • modulo data-layout transformations
• Hypothesis: auto-tuning should adjust these parameters
Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream
What R-Stream does for you – in a nutshell

• Input: sequential
  – Short and simple textbook C code
  – Just add a "#pragma rstream map" and R-Stream figures out the rest
What R-Stream does for you – in a nutshell

• Output: OpenMP + CUDA code
  – Hundreds of lines of tightly optimized GPU-side CUDA code
  – A few lines of host-side OpenMP C code
  – Very difficult to hand-optimize
  – Not available in any standard library
• Example input: Gauss-Seidel 9-point stencil
  – Used in iterative PDE solvers
  – Scientific modeling (heat, fluid flow, waves, etc.)
  – Building block for faster iterative solvers like Multigrid or AMR
  – (a sketch of such input follows below)
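A plausible sketch of such input (ours; the deck does not show its exact source, and the pragma spelling follows the RTM example later in this deck):

  enum { X = 2048 };

  #pragma rstream map
  void gs9(double (*U)[X], int pX, int pY, int steps) {
    /* Gauss-Seidel 9-point stencil: in-place sweeps over the grid. */
    for (int t = 0; t < steps; t++)
      for (int j = 1; j < pY - 1; j++)
        for (int i = 1; i < pX - 1; i++)
          U[j][i] = (U[j-1][i-1] + U[j-1][i] + U[j-1][i+1] +
                     U[j][i-1]   + U[j][i]   + U[j][i+1]   +
                     U[j+1][i-1] + U[j+1][i] + U[j+1][i+1]) / 9.0;
  }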
What R-Stream does for you – in a nutshell

• The mapping steps will be illustrated in the next few slides
• Achieving up to
  – 20 GFLOPS on a GTX 285
  – 25 GFLOPS on a GTX 480
Finding and utilizing available parallelism

Extracting and mapping parallel loops.

[Figure: GPU architecture – SMs with shared memory, instruction unit, and SPs with registers, plus constant and texture caches and off-chip device memory (global, constant, texture) – alongside an excerpt of automatically generated code]

R-Stream AUTOMATICALLY finds and forms parallelism.
Memory compaction on GPU scratchpad

[Figure: GPU architecture, with the per-SM shared memory highlighted, alongside an excerpt of automatically generated code]

R-Stream AUTOMATICALLY manages the local scratchpad (a hand-written sketch of the pattern follows below).
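For intuition, a hand-written sketch of the scratchpad pattern (not R-Stream output): each 32x32 thread block stages a tile into shared memory, synchronizes, then computes from on-chip storage.

  #define TILE 32

  __global__ void scaled_copy(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE];                  /* on-chip scratchpad */
    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;
    if (gx < n && gy < n)
      tile[threadIdx.y][threadIdx.x] = in[gy * n + gx]; /* coalesced load */
    __syncthreads();
    if (gx < n && gy < n)
      out[gy * n + gx] = 0.5f * tile[threadIdx.y][threadIdx.x];
  }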
GPU DRAM to scratchpad coalesced communication

[Figure: GPU architecture alongside an excerpt of automatically generated code, with the coalesced GPU DRAM accesses highlighted]

R-Stream AUTOMATICALLY chooses parallelism to favor coalescing (illustrated below).
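Concretely (our illustration), coalescing means successive threads touch successive addresses, so the mapper wants threadIdx.x on the fastest-varying array dimension:

  __global__ void copy_coalesced(const float *in, float *out, int n) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < n)                                /* a warp reads consecutive */
      out[row * n + col] = in[row * n + col];   /* floats: few transactions */
  }

  __global__ void copy_uncoalesced(const float *in, float *out, int n) {
    int col = blockIdx.y;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n)                                /* a warp reads with stride n: */
      out[row * n + col] = in[row * n + col];   /* many memory transactions */
  }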
Host-to-GPU communication

[Figure: host memory and CPU connected over PCI Express to the GPU, alongside an excerpt of automatically generated code]

R-Stream AUTOMATICALLY chooses the partition and sets up host-to-GPU communication (sketched below).
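The host-side choreography this automates is of this familiar flavor (our sketch; "kernel" stands in for the generated GPU code):

  #include <cuda_runtime.h>

  __global__ void kernel(float *a, int n);     /* generated device code (assumed) */

  void offload(float *hA, int n) {             /* hA: n floats on the host */
    float *dA;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **)&dA, bytes);                     /* allocate on the GPU */
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);   /* host -> GPU over PCIe */
    kernel<<<(n + 255) / 256, 256>>>(dA, n);             /* run the mapped code */
    cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost);   /* results back */
    cudaFree(dA);
  }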
Multi-GPU mapping

[Figure: CPUs with host memory driving several GPUs, each with its own GPU memory, alongside an excerpt of automatically generated code; the mapping spans all GPUs]

R-Stream AUTOMATICALLY finds another level of parallelism, across GPUs.
Multi-GPU mapping

[Figure: the same CPU/GPU system, now with multi-streaming of host-GPU communication, alongside an excerpt of automatically generated code]

R-Stream AUTOMATICALLY creates n-way software pipelines for communications (sketched below).
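A hand-written sketch of such an n-way pipeline using CUDA streams (not R-Stream's generated code; "process" stands in for a generated kernel, and h must be pinned via cudaMallocHost for true overlap):

  #include <cuda_runtime.h>

  __global__ void process(float *chunk);       /* generated device code (assumed) */

  #define NSTREAMS 4

  void pipelined(float *h, float *d, int chunks, size_t chunk_bytes) {
    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++) cudaStreamCreate(&s[i]);
    for (int c = 0; c < chunks; c++) {
      cudaStream_t st = s[c % NSTREAMS];
      size_t off = (size_t)c * chunk_bytes / sizeof(float);
      /* the copy of chunk c overlaps with compute on earlier chunks */
      cudaMemcpyAsync(d + off, h + off, chunk_bytes, cudaMemcpyHostToDevice, st);
      process<<<64, 256, 0, st>>>(d + off);
    }
    for (int i = 0; i < NSTREAMS; i++) cudaStreamDestroy(s[i]);
  }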
Future capabilities – mapping to CPU-GPU clusters

[Figure: the program is split across CPU+GPU nodes (each CPU with its DRAM, each GPU with its own DRAM), connected by a high-speed interconnect (e.g. InfiniBand); each node runs an MPI process, which is an OpenMP process launching CUDA]
Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream
Experimental evaluation

• Main comparisons:
  • R-Stream High-Level C Transformation Tool 3.1.2
  • Intel MKL 10.2.1

• Configuration 1: MKL – radar code with MKL calls
• Configuration 2: Low-level compilers – radar code compiled directly with GCC / ICC
• Configuration 3: R-Stream – radar code optimized by R-Stream, then compiled with GCC / ICC
Experimental evaluation (cont.)

• Intel Xeon workstation:
  • Dual quad-core E5405 Xeon processors (8 cores total)
  • 9 GB memory
• 8 OpenMP threads
• Single-precision floating-point data
• Low-level compilers and the flags used:
  • GCC: -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp
  • ICC: -fast -openmp
Radar benchmarks

• Beamforming algorithms:
  • MVDR-SER: Minimum Variance Distortionless Response using Sequential Regression
  • CSLC-LMS: Coherent Sidelobe Cancellation using Least Mean Squares
  • CSLC-RLS: Coherent Sidelobe Cancellation using Robust Least Squares
• Expressed in sequential ANSI C
• 400 radar iterations
• Compute 3 radar sidelobes (for CSLC-LMS and CSLC-RLS)
MVDR-SER

[Figure: performance chart]

CSLC-LMS

[Figure: performance chart]

CSLC-RLS

[Figure: performance chart]
3D Discretized wave equation input code (RTM)

25-point, 8th-order (in space) stencil:

  #pragma rstream map
  void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
              int pX, int pY, int pZ) {
    double temp;
    int i, j, k;
    for (k=4; k<pZ-4; k++) {
      for (j=4; j<pY-4; j++) {
        for (i=4; i<pX-4; i++) {
          temp = C0 * U2[k][j][i] +
                 C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                       U2[k][j-1][i] + U2[k][j+1][i] +
                       U2[k][j][i-1] + U2[k][j][i+1]) +
                 C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                       U2[k][j-2][i] + U2[k][j+2][i] +
                       U2[k][j][i-2] + U2[k][j][i+2]) +
                 C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                       U2[k][j-3][i] + U2[k][j+3][i] +
                       U2[k][j][i-3] + U2[k][j][i+3]) +
                 C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                       U2[k][j-4][i] + U2[k][j+4][i] +
                       U2[k][j][i-4] + U2[k][j][i+4]);
          U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
        }
      }
    }
  }
3D Discretized wave equation mapped code (RTM)

[Figure: excerpt of the automatically generated code, annotated]

• Not so naïve …
• Communication autotuning knobs
• threadIdx.x divergence is expensive
Outline

• R-Stream Overview
• Compilation Walk-through
• Performance Results
• Getting R-Stream
Current status

• Ongoing development also supported by DOE and Reservoir
  • Improvements in scope, stability, performance
• Installations/evaluations at US government laboratories
• Forward collaboration with Georgia Tech on Keeneland
  • HP SL390 – 3 Fermi GPUs, 2 Westmere CPUs per node
• Basis of the compiler for the DARPA UHPC Intel Corporation team
Availability

• Per-developer-seat licensing model
• Support (releases, bug fixes, services)
• Available with commercial-grade external solvers
• Government has limited rights / SBIR data rights
• Academic source licenses with collaborators
• Professional team, continuity, software engineering
R-Stream Gives More Insight About Programs

• Teaches you about parallelism and the polyhedral model:
  • Generates correct code for any transformation
    – The transformation may be incorrect if specified by the user
  • Imperative code generation was the bottleneck until 2005
    – Three theses written on the topic, and it is still not completely covered
    – It is good to have an intuition of how these things work
• R-Stream has meaningful metrics to represent your program:
  • Maximal amount of parallelism given minimal expansion
  • Tradeoffs between coarse-grained and fine-grained parallelism
  • Loop types (doall, red, perm, seq)
• Helps with algorithm selection
• Tiling of imperfectly nested loop nests (tiling sketch below)
• Generates code for explicitly managed memory
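For reference, a hand-written tiling sketch (ours, on a perfectly nested example; R-Stream handles the harder imperfectly nested case): the point loops are wrapped in tile loops so each TILE x TILE block stays resident in local memory.

  #define TILE 64

  void tiled_transpose(int n, double A[n][n], double B[n][n]) {
    for (int jj = 0; jj < n; jj += TILE)
      for (int ii = 0; ii < n; ii += TILE)
        /* intra-tile loops: the working set is one TILE x TILE block */
        for (int j = jj; j < (jj + TILE < n ? jj + TILE : n); j++)
          for (int i = ii; i < (ii + TILE < n ? ii + TILE : n); i++)
            B[i][j] = A[j][i];
  }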
How to Use R-Stream Successfully

• R-Stream is a great transformation tool, and also a great learning tool:
  • Takes "simple C" input code
  • Applies multiple transformations and lists them explicitly
  • Can dump code at any step in the process
  • It's the tool I wish I had during my PhD
    – To be fair, I already had a great set of tools
• R-Stream can be used in multiple modes:
  • Fully automatic + compile flag options
  • Autotuning mode (more than just tile sizes and unrolling …)
  • Scripting / programmatic mode (Beanshell + interfaces)
  • A mix of these modes + manual post-processing
Use Case: Fully Automatic Mode + Compile Flag Options

• Akin to traditional gcc / icc compilation with flags:
  • Predefined transformations can be parameterized
    – Except with less/no phase-ordering issues
  • Except you can see what each transformation does at each step
  • Except you can generate compilable and executable code at (almost) any step in the process
  • Except you can control code generation for compactness or performance
Use Case: Autotuning Mode

• More advanced than traditional approaches:
  • Knobs go far beyond loop unrolling + unroll-and-jam + tiling
  • Knobs are based on well-understood models
  • Knobs target high-level properties of the program
    – Amount of parallelism, amount of memory expansion, depth of pipelining of communications and computations …
  • Knobs depend on the target machine, the program, and the state of the mapping process:
    – Our tool has introspection
• We are really building a hierarchical autotuning transformation tool
Use Case: "Power user" interactive interface

• Beanshell access to optimizations
• Can direct and review the process of compilation
  • Automatic tactics (affine scheduling)
• Can direct code generation
• Access to "tuning parameters"
• All options / commands available on the command-line interface
Conclusion

• R-Stream simplifies software development and maintenance
• Porting: reduces expense and delivery delays
• Does this by automatically parallelizing loop code
  • While optimizing for data locality, coalescing, etc.
• Addresses
  • Dense loop-intensive computations
• Extensions
  • Data-parallel programming idioms
  • Sparse representations
  • Dynamic runtime execution
Contact us

• Per-developer seat, floating, cloud-based licensing
• Discounts for academic users
• Research collaborations with academic partners
• For more information:
  • Call us at 212-780-0527, or
  • See Rich, Ann, or
  • E-mail us at {sales,lethin,johnson}@reservoir.com