CR18: Advanced Compilers, L06: Code Generation. Tomofumi Yuki.

Post on 18-Jan-2016



1

CR18: Advanced Compilers

L06: Code Generation

Tomofumi Yuki

2

Code Generation

Completing the transformation loop

Problem: how to generate code to
  scan a polyhedron?
  scan a union of polyhedra?
  generate tiled code?
  generate parametrically tiled code?

3

Evolution of Code Gen

Ancourt & Irigoin 1991
  single polyhedron scanning
LooPo: Griebl & Lengauer 1996
  1st step to unions of polyhedra: scan bounding box + guards
Omega Code Gen 1995
  generate inefficient code (convex hull + guards), then try to remove inefficiencies

4

Evolution of Code Gen

LoopGen: Quilleré-Rajopadhye-Wilde 2000
  efficiently scanning unions of polyhedra
CLooG: Bastoul 2004
  improvements to the QRW algorithm
  robust and well-maintained implementation
AST Generation: Grosser 2015
  polyhedral AST generation is more than scanning polyhedra: scanning is not enough!

5

Scanning a Polyhedron

Scanning Polyhedra with DO Loops [1991]

Problem: generate bounds on loops
  outermost loop: constants and params only
  inner loops: + surrounding iterators

Approach: Fourier-Motzkin elimination, projecting out variables

6

Single Polyhedron Example

What is the loop nest for the lexicographic scan?

(figure: triangular iteration domain in the (i,j) plane)
Constraints: i≤N, j≥0, i-j≥0

for (i = 0 .. N)
  for (j = 0 .. i)
    S;

7

Single Polyhedron Example

What is the loop nest for the permuted case? (j as the outer loop)

(figure: the same triangular domain)
Constraints: i≤N, j≥0, i-j≥0

for (j = 0 .. N)
  for (i = j .. N)
    S;

8

Scanning Unions of Polyhedra

Consider scanning two statements:
  S1: [N]->{ S1[i]->[i]   : 0≤i≤N }
  S2: [N]->{ S2[i]->[i+5] : 0≤i≤N }

After the change of basis (CoB), the target domains are:
  S1: [N]->{ [i] : 0≤i≤N }
  S2: [N]->{ [i] : 5≤i≤N+5 }

Naïve approach: bounding box
for (i=0 .. N+5)
  if (0<=i && i<=N)   S1;
  if (5<=i && i<=N+5) S2;

9

Slightly Better than BBox

Make the domains disjoint:

for (i=0 .. 4)      S1;
for (i=5 .. N)      { S1; S2; }
for (i=N+1 .. N+5)  S2;

But this is also problematic: code size can grow quickly, e.g. with domains over different parameters:
  S1: [N,M]->{ S1[i]->[i] : 0≤i<N }
  S2: [N,M]->{ S2[i]->[i] : 0≤i<M }

10

QRW Algorithm

Key: Recursive Splitting
Given a set of n-D domains to scan, start at d=1 and context=parameters:
1. Restrict the domains to the context
2. Project the domains onto the outer d dimensions
3. Make the projections disjoint
4. Recurse for each disjoint projection (d=d+1, context=a piece of the projection)
5. Sort the resulting loops

11

Example

Scan the following domains

(figure: iteration domains of S1 and S2 in the (i,j) plane)

d=1, context=universe

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort


13

Example

Scan the following domains

(figure: iteration domains of S1 and S2 in the (i,j) plane)

d=1, context=universe

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort

for (i=0..1) ...

for (i=2..6) ...

14

Example

Scan the following domains

(figure: the slice of the domains for 0≤i≤2)

d=2, context=0≤i≤2

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort

for (i=0..1)
  for (j=0..4)
    S1;

15

Example

Scan the following domains

(figure: the slice of the domains for 2≤i≤6)

d=2, context=2≤i≤6

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort

for (i=2..6) ...

16

Example

Scan the following domains

(figure: the slice of the domains for 2≤i≤6)

d=2, context=2≤i≤6

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort

for (i=2..6) ...

(figure: the four inner loops in the slice, labeled L1-L4)

17

Example

Scan the following domains

(figure: the slice of the domains for 2≤i≤6)

d=2, context=2≤i≤6

Step1: Projection

Step2: Separation

Step3: Recurse

Step4: Sort

for (i=2..6)
  L2; L1; L3; L4;   (the inner loops sorted into execution order)

18

CLooG: Chunky Loop Generator

A few problems in the QRW algorithm:
  high complexity
  code size is not controlled

CLooG uses:
  pattern matching to avoid costly polyhedral operations during separation
  may stop the recursion at some depth and generate loops with guards to reduce code size

19

Tiled Code Generation

Tiling with fixed size: we did this earlier
Tiling with parametric size: problem: non-affine!

20

Tiling Review

What does the tiled code look like?

for (i=0; i<=N; i++)
  for (j=0; j<=N; j++)
    S;

With tile size ts, stepping the tile origins by ts:

for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S;

Or, with tile indices:

for (ti=0; ti<=floor(N/ts); ti++)
  for (tj=0; tj<=floor(N/ts); tj++)
    for (i=ti*ts; i<min(N+1,(ti+1)*ts); i++)
      for (j=tj*ts; j<min(N+1,(tj+1)*ts); j++)
        S;

21

Two Approaches

Use fixed-size tiling:
  if the tile size is a constant, the code stays affine
  pragmatic choice made by many tools

Use non-polyhedral code generation:
  much better for tuning tile sizes
  makes sense for semi-automatic tools

22

Difficulties in Tiled Code Gen

This is still a very simplified view. In practice, we tile after transformation (skewing, etc.)

Let's see the tiled iteration space with tvis:

for (ti=0; ti<=N; ti+=ts)
  for (tj=0; tj<=N; tj+=ts)
    for (i=ti; i<min(N+1,ti+ts); i++)
      for (j=tj; j<min(N+1,tj+ts); j++)
        S;

23

Full Tiles, Inset / Outset

Partial tiles have a lot of control overhead

Challenges for parametric tiled code gen:
  make sure to scan the outset
  but also separate out the inset
  use efficient point loops for the inset

All without polyhedral analysis

24

Point Loops for Full/Partial Tiles

Full tile point loop:
for (i=ti; i<ti+si; i++)
  for (j=tj; j<tj+sj; j++)
    for (k=tk; k<tk+sk; k++)
      ...

Partial/empty tile point loop:
for (i=max(ti,...); i<min(ti+si,...); i++)
  for (j=max(tj,...); j<min(tj+sj,...); j++)
    for (k=max(tk,...); k<min(tk+sk,...); k++)
      if (...) ...

25

Progression of Parametric Tiling

Perfectly nested, single level of tiling:
  TLoG [Renganarayana et al. 2007]
Multiple levels of tiling:
  HiTLoG [Renganarayana et al. 2007]
  PrimeTile [Hartono 2009]
Parallelizing the tiles:
  DynTile [Hartono 2010]
  D-Tiling [Kim 2011]

26

Computing the Outset

We start with some domain and expand each dimension by (symbolic) tile size - 1, except for upper bounds:

{[i,j]: 0≤i≤10 and i≤j≤i+10}
becomes
{[i,j]: -(ts-1)≤i≤10 and i-(ts-1)≤j and j-(ts-1)≤i+10}

27

Computing the Inset

We start with some domain and shrink each dimension by (symbolic) tile size - 1, except for lower bounds:

{[i,j]: 0≤i≤10 and i≤j≤i+10}
becomes
{[i,j]: 0≤i≤10-(ts-1) and i≤j-(ts-1) and j≤i+10-(ts-1)}

28

Syntactic Manipulation

We cannot use polyhedral code generators, so back to modifying the AST:
  modify the loop bounds to get loops that visit the outset
  add guards to switch between point loops

Up to here is HiTLoG/PrimeTile

29

Problem: Parallelization

After tiling, there is parallelism; however, it requires skewing of tiles

We need non-polyhedral skewing. The key equation relates:
  d: number of tiled dimensions
  ti: tile origins
  ts: tile sizes

30

D-Tiling

The equation enables skewing of tiles: if any one of the time or the tile origins is unknown, it can be computed from the others

Generated code (tix is the (d-1)-th tile origin):
for (time=start:end)
  for (ti1=ti1LB:ti1UB)
    …
      for (tix=tixLB:tixUB) {
        tid = f(time, ti1, …, tix);
        // compute tile ti1, ti2, …, tix, tid
      }

31

Distributed Memory Parallelization

Problems implicitly handled by shared memory now need explicit treatment

Communication:
  Which processors need to send/receive?
  Which data to send/receive?
  How to manage communication buffers?

Data partitioning:
  How do you allocate memory across nodes?

32

MPI Code Generator

Distributed memory parallelization:
  tiling based
  parameterized tile sizes
  C+MPI implementation

Uniform dependences as the key enabler:
  many affine dependences can be uniformized
  shared memory performance carries over to distributed memory
  scales as well as PLuTo, but across multiple nodes

33

Related Work (Polyhedral)

Polyhedral approaches:
  initial idea [Amarasinghe 1993]
  analysis for fixed-size tiling [Claßen 2006]
  further optimization [Bondhugula 2011]

"Brute force" polyhedral analysis for handling communication:
  no hope of handling parametric tile sizes
  can handle arbitrary affine programs

34

Outline

Introduction
"Uniform-ness" of Affine Programs
  Uniformization
  Uniform-ness of PolyBench
MPI Code Generation
  Tiling
  Uniform-ness simplifies everything
  Comparison against PLuTo with PolyBench
Conclusions and Future Work

35

Affine vs Uniform

Affine dependences: f(x) = Ax + b
  Examples: (i,j->j,i), (i,j->i,i), (i->0)

Uniform dependences: f(x) = Ix + b (translations only)
  Examples: (i,j->i-1,j), (i->i-1)

36

Uniformization

(figure: the affine dependence (i->0) is uniformized into the chain (i->i-1))

37

Uniformization

Uniformization is a classic technique:
  "solved" in the 1980's
  has been "forgotten" in the multi-core era

Any affine dependence can be uniformized by adding a dimension [Roychowdhury 1988]

Nullspace pipelining:
  simple technique for uniformization
  many dependences are uniformized

38

Uniformization and Tiling

Uniformization does not influence tilability

39

PolyBench [Pouchet2010]

Collection of 30 polyhedral kernels, proposed by Pouchet as a benchmark for polyhedral compilation

Goal: a benchmark small enough that individual results are reported; no averages

Kernels from:
  data mining
  linear algebra kernels, solvers
  dynamic programming
  stencil computations

40

Uniform-ness of PolyBench

5 of them are "incorrect" and are excluded

Embedding: match dimensions of statements
Phase detection: separate the program into phases; the output of one phase is used as input to the next

Stage                  Number of Fully Uniform Programs
Uniform at start       8/25  (32%)
After embedding        13/25 (52%)
After pipelining       21/25 (84%)
After phase detection  24/25 (96%)

41

Outline

Introduction
Uniform-ness of Affine Programs
  Uniformization
  Uniform-ness of PolyBench
MPI Code Generation
  Tiling
  Uniform-ness simplifies everything
  Comparison against PLuTo with PolyBench
Conclusions and Future Work

42

Basic Strategy: Tiling

We focus on tilable programs

43

Dependences in Tilable Space

All in the non-positive direction

44

Wave-front Parallelization

All tiles with the same color can run in parallel

45

Assumptions

Uniform in at least one of the dimensions; the uniform dimension is made outermost
Tilable space is fully permutable
One-dimensional processor allocation
Large enough tile sizes: dependences do not span multiple tiles

Then, communication is extremely simplified

46

Processor Allocation

Outermost tile loop is distributed

(figure: tiles in the (i1,i2) plane, columns mapped to processors P0..P3)

47

Values to be Communicated

Faces of the tiles (may be thicker than 1)

(figure: tile faces crossing the P0..P3 processor boundaries in the (i1,i2) plane)

48

Naïve Placement of Send and Receive Codes

Receiver is the consumer tile of the values

(figure: sends (S) at producer tiles and receives (R) at consumer tiles across P0..P3)

49

Problems in Naïve Placement

Receiver is in the next wave-front time

(figure: wave-front times t=0..3; each receive happens one wave-front after its send)

50

Problems in Naïve Placement

Receiver is in the next wave-front time
Number of communications "in-flight" = amount of parallelism
MPI_Send will deadlock:
  it may not return control if the system buffer is full
Asynchronous communication is required:
  must manage your own buffers
  required buffering = amount of parallelism, i.e., the number of virtual processors

51

Proposed Placement of Send and Receive Codes

Receiver is one tile below the consumer

(figure: receives (R) moved one tile earlier, so sends (S) and receives (R) fall in the same wave-front time across P0..P3)

52

Placement within a Tile

Naïve placement: Receive -> Compute -> Send

Proposed placement:
  issue asynchronous receive (MPI_Irecv)
  compute
  issue asynchronous send (MPI_Isend)
  wait for values to arrive

Overlap of computation and communication
Only two buffers (send and receive) per physical processor

53

Evaluation

Compare performance with PLuTo (shared memory version with the same strategy)
Cray: 24 cores per node, up to 96 cores
Goal: similar scaling as PLuTo
Tile sizes are searched with educated guesses

PolyBench:
  7 are too small
  3 cannot be tiled or have limited parallelism
  9 cannot be used due to a PLuTo/PolyBench issue

54

Performance Results

Linear extrapolation from the speed-up at 24 cores
Broadcast cost: at most 2.5 seconds