Section E Loop Nest Optimization
Page 1: E Lno

Section E: Loop Nest Optimization

Page 2: E Lno

Loop Nest Optimizer (LNO) Overview

Performs transformations on a loop nest

Scope of work: each top-level loop nest

Does not build a control flow graph
Driven by data dependence analysis
Uses alias and use-def information provided by the scalar optimizer

Annotates data dependence information for use by the code generator (innermost loop only)

Requires modeling of hardware resources

Page 3: E Lno

Dependence Testing

Dependence

Given two references R1 and R2, R2 depends on R1 if they may access the same memory location and there is a path from R1 to R2

– true dependence, anti dependence, output dependence

Access Array: each array subscript expressed in terms of the loop index variables

Access Vector: the vector of all the subscripts’ access arrays

Dependence Testing (input: access arrays)

refer to “Efficient and Exact Data Dependence Analysis”, Dror Maydan et al., PLDI ’91

output: dependence vector, each dimension representing a loop level

DO I = 1, N
  DO J = 1, N
    ... a(2*I + J, 3*J) ...

access arrays: (2, 1) and (0, 3)

access vector: [(2, 1), (0, 3)]
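The flavor of testing on these access arrays can be illustrated with the classic GCD test on a single subscript dimension. This is a much weaker test than the exact algorithm of Maydan et al. cited above, and the function names are illustrative, not LNO's:

```c
#include <stdlib.h>

/* Euclid's algorithm. */
static int gcd(int a, int b) {
    a = abs(a); b = abs(b);
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD dependence test on one dimension: references a(c1*I + k1) and
 * a(c2*I' + k2) can name the same element only if gcd(c1, c2) divides
 * (k2 - k1).  Returns 1 if a dependence is possible, 0 if it is
 * provably absent. */
int gcd_test(int c1, int k1, int c2, int k2) {
    int g = gcd(c1, c2);
    if (g == 0)                    /* both subscripts are constants */
        return k1 == k2;
    return (k2 - k1) % g == 0;
}
```

For example, a(2*I) and a(2*I + 1) can never conflict (2 does not divide 1), while a(2*I) and a(2*I + 4) may.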

Page 4: E Lno

Three Classes of Optimizations by LNO

1. Transformations for Data Cache

2. Transformations that help other optimizations

3. Vectorization and Parallelization

Page 5: E Lno

LNO Transformations for Data Cache

Cache blocking

Transform loop to work on sub-matrices that fit in cache

Loop interchange

Array Padding

Reduce cache conflicts

Prefetch generation

Hide the long latency of cache miss references

Loop fusion

Loop fission

Page 6: E Lno

Cache Blocking

Matrix B misses all the time

n^3 + 2n^2 cache misses (ignoring line size)

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    for (k=0; k<n; k++)
      c[i][j] = c[i][j] + a[i][k]*b[k][j];

(diagram: A * B = C)

Page 7: E Lno

Cache Blocking

Use sub-matrices that fit entirely in cache.

(diagram: A, B, C each partitioned into 2x2 blocks A11, A12, A21, A22; B11, B12, B21, B22; C11, C12, C21, C22)

C11 = A11 * B11 + A12 * B21

For sub-matrices of size b, (n/b)*n^2 + 2n^2 cache misses instead of n^3 + 2n^2
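A minimal runnable sketch of cache blocking in C. The function names, the tiny sizes, and the block size BS are illustrative only; a real LNO pass picks the block size so that the working tiles fit in cache:

```c
#define N  8
#define BS 4   /* block size; in practice chosen so three BS x BS tiles fit in cache */

/* The naive i-j-k nest from the previous slide. */
void matmul_naive(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Blocked version: the three outer loops walk BS x BS sub-matrices, so
 * each tile of b is reused from cache instead of being streamed in again
 * for every i.  Assumes N is a multiple of BS. */
void matmul_blocked(double a[N][N], double b[N][N], double c[N][N]) {
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++)
                        for (int k = kk; k < kk + BS; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

Because the k values for each c[i][j] are still visited in ascending order, the blocked version adds the same products in the same order and produces bit-identical results here.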

Page 8: E Lno

Loop Interchange (permutation)

Improve Data Locality

Unimodular Transformation: interchange combined with loop reversal, loop skewing, cache blocking, etc. to improve the overall data locality of the loop nest

Enables Vectorization and Parallelization

DO I = 1, N

DO J = 1, M

A(I, J) = B(I, J) + C

No spatial reuse: a cache miss for every reference of A and B

DO J = 1, M

DO I = 1, N

A(I, J) = B(I, J) + C

Miss once every 16 iterations (element size: 4 bytes, cache line size: 64 bytes)
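The slide is Fortran, which is column-major, so there the J-outer form is the stride-1 one; in row-major C the preferred order flips. A sketch with illustrative names, showing that both orders store the same values (the interchange is legal because every iteration touches a distinct element):

```c
#define NI 32
#define NJ 32

/* j outer, i inner: in row-major C, a[i][j] and b[i][j] are accessed
 * with stride NJ, so successive iterations land on different cache
 * lines (this order is the good one in column-major Fortran). */
void add_ji(double a[NI][NJ], double b[NI][NJ], double cval) {
    for (int j = 0; j < NJ; j++)
        for (int i = 0; i < NI; i++)
            a[i][j] = b[i][j] + cval;
}

/* Interchanged: i outer, j inner is stride-1 in C, so consecutive
 * iterations reuse the cache line just fetched. */
void add_ij(double a[NI][NJ], double b[NI][NJ], double cval) {
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            a[i][j] = b[i][j] + cval;
}
```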

Page 9: E Lno

Software Prefetch

Major Considerations for Software Prefetch

what to prefetch – for references that most likely cause cache misses

when to prefetch – neither too early nor too late

avoid useless prefetches – register pressure, cache pollution, memory bandwidth

Major Phases in Our Prefetch Engine

Process_Loop – build internal structures, etc.

Build_Base_LGs – build locality group (references that likely access same cache line)

Volume – compute data volume for each loop, from innermost to outermost

Find_Loc_Loops – determine which locality groups need prefetch

Gen_Prefetch – generate prefetch instructions

Page 10: E Lno

Software Prefetch

Prefetch N Cache Lines Ahead
for a(i), prefetch a(i + N*line_size)
-LNO:prefetch_ahead=N (default 2)

One Prefetch for Each Cache Line

Versioning in loop:

DO I = 1, N
  if I % 16 == 0 then
    loop body with prefetch code
  else
    loop body without prefetch code

Combine with unrolling:

DO I = 1, N, 16
  prefetch b(I + 2*16)
  a = a + b(I)
  a = a + b(I+1)
  ...
  a = a + b(I+15)
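A C sketch of the unrolled prefetch loop above, assuming 4-byte elements and 64-byte lines (16 elements per line) and fetching 2 lines ahead, as with -LNO:prefetch_ahead=2. __builtin_prefetch is a GCC/Clang extension, and the guard keeps the prefetch address inside the array:

```c
int sum_with_prefetch(const int *b, int n) {
    int a = 0;
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        if (i + 3 * 16 <= n)                     /* stay inside the array */
            __builtin_prefetch(&b[i + 2 * 16]);  /* 2 cache lines ahead */
        for (int u = 0; u < 16; u++)             /* unrolled body */
            a += b[i + u];
    }
    for (; i < n; i++)                           /* remainder, no prefetch */
        a += b[i];
    return a;
}
```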

Page 11: E Lno

Loop Fission and Fusion

Loop Fission
Enables loop interchange
Enables vectorization and parallelization
Reduce conflict misses

Loop Fusion
Reduce loop overhead
Improve data reuse
Larger loop body

DO I = 1, N

a(I) = b(I) + C

ENDDO

DO I = 1, N

d(I) = a(I) + E

ENDDO

DO I = 1, N

a(I) = b(I) + C

d(I) = a(I) + E

ENDDO
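A C sketch of the fusion example, with the constants 3.0 and 5.0 standing in for C and E (function names are illustrative). Fusion is legal here because the only dependence between the two loops, on a(I), is between the same iteration I:

```c
/* Before fusion: two loop overheads, and each a[i] is written in the
 * first loop and re-fetched n iterations later in the second. */
void before_fusion(const double *b, double *a, double *d, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 3.0;          /* a(I) = b(I) + C */
    for (int i = 0; i < n; i++)
        d[i] = a[i] + 5.0;          /* d(I) = a(I) + E */
}

/* After fusion: one loop overhead, and a[i] is reused while still in a
 * register. */
void after_fusion(const double *b, double *a, double *d, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + 3.0;
        d[i] = a[i] + 5.0;
    }
}
```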

Page 12: E Lno

LNO Transformations that Help Other Optimizations

Scalar Expansion / Array Expansion
Reduce inter-loop dependencies
Enable parallelization

Scalar Variable Renaming
Loop nests can be optimized separately
Fewer constraints for register allocation

Array Scalarization
Improves register allocation

Hoist Messy Loop Bounds
Outer Loop Unrolling
Array Substitution (Forward and Backward)
Loop Unswitching
Hoist IF
Gather-Scatter
Move Invariant Array References out of Loops
Inter-Iteration CSE

Page 13: E Lno

Outer Loop Unrolling
Form larger loop bodies
Reduce loop overhead

for (i=0; i<n; i++)
  for (j=0; j<m; j++)
    a[i][j] = a[i][j] + x*b[j]*c[j];

for (i=0; i<n-1; i+=2)
  for (j=0; j<m; j++) {
    a[i][j] = a[i][j] + x*b[j]*c[j];
    a[i+1][j] = a[i+1][j] + x*b[j]*c[j];
  }
if (i < n)                      /* leftover row when n is odd */
  for (j=0; j<m; j++)
    a[i][j] = a[i][j] + x*b[j]*c[j];

1 add, 2 mults, 3 loads, 1 store per original iteration;
2 adds, 4 mults, 4 loads, 2 stores per unrolled iteration (b[j] and c[j] are loaded once for both rows)

Page 14: E Lno

Gather-Scatter

do i = 1, n
  if (c(i) .gt. 0.0) then
    a(i) = c(i) / b(i)
    c(i) = c(i) * b(i)
  end if
end do

inc_0 = 0
do i = 1, n
  deref_gs(inc_0+1) = i
  if (c(i) .gt. 0.0) then
    inc_0 = inc_0 + 1
  end if
end do
do ind_0 = 0, inc_0-1
  i_gs = deref_gs(ind_0+1)
  a(i_gs) = c(i_gs)/b(i_gs)
  c(i_gs) = c(i_gs)*b(i_gs)
end do
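The same transformation sketched in C (illustrative names; the fixed-size index buffer is a simplification of the compiler-generated deref_gs array): pass 1 gathers the indices that satisfy the condition, pass 2 is a dense compute loop over only the matching iterations:

```c
void gather_scatter(double *a, double *c, const double *b, int n) {
    int idx[256];                    /* index buffer; sketch assumes n <= 256 */
    int cnt = 0;
    for (int i = 0; i < n; i++) {
        idx[cnt] = i;                /* store the index unconditionally ... */
        if (c[i] > 0.0)
            cnt++;                   /* ... keep it only if the test holds */
    }
    for (int k = 0; k < cnt; k++) {  /* dense compute loop, no branch */
        int i = idx[k];
        a[i] = c[i] / b[i];
        c[i] = c[i] * b[i];
    }
}
```

Note that every condition is evaluated against the original c values, just as in the untransformed loop, since no iteration's update can affect another iteration's test.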

Page 15: E Lno

Forward and Backward Array Substitution

DO i = 1, N
  DO j = 1, N
    s = C(i,j)
    DO k = 1, N
      s = s + A(i,k) * B(k,j)
    ENDDO
    C(i,j) = s
  ENDDO
ENDDO

DO i = 1, N
  DO j = 1, N
    DO k = 1, N
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    ENDDO
  ENDDO
ENDDO
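A C sketch of backward substitution (illustrative names). Substituting the scalar s away leaves a nest that references only arrays, which is what later passes such as interchange and blocking want to see:

```c
#define N 4

/* Before: the inner product is accumulated in scalar s. */
void mm_scalar(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = c[i][j];
            for (int k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

/* After substituting s away: the k loop carries only the reduction on
 * c[i][j], and the nest is now a candidate for interchange/blocking. */
void mm_subst(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}
```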

Page 16: E Lno

Hoist IF

Remove the loop by replicating the matching iterations

DO i = 1, N

if (i == winner && F(i))

G(i)

ENDDO

if (F(winner))

G(winner)

Page 17: E Lno

Loop Unswitching

Move IFs with invariant conditions out of the loop

DO i = 1, N

if (cond)

G(i)

ENDDO

If (cond)

DO i = 1, N

G(i)

ENDDO

Page 18: E Lno

Inter-Iteration CSE

DO I = 1, N

c(I) = a(I) + b(I)

d(I) = a(I+1) + b(I+1)

ENDDO

temp1 = a(1) + b(1)

DO I = 1, N

temp = temp1

temp1 = a(I+1) + b(I+1)

c(I) = temp

d(I) = temp1

ENDDO
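The transformation works because d(I) in iteration I computes the same sum that c(I+1) needs in the next iteration, so each a+b sum is computed once and carried across iterations in a temporary. A C sketch (illustrative name; arrays a and b must have n+1 elements):

```c
void inter_iter_cse(const int *a, const int *b, int *c, int *d, int n) {
    int temp1 = a[0] + b[0];       /* temp1 on the slide */
    for (int i = 0; i < n; i++) {
        int temp = temp1;          /* reuse last iteration's sum */
        temp1 = a[i + 1] + b[i + 1];
        c[i] = temp;               /* c(I) = a(I) + b(I) */
        d[i] = temp1;              /* d(I) = a(I+1) + b(I+1) */
    }
}
```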

Page 19: E Lno

LNO Parallelization

SIMD code generation

Highly dependent on the SIMD instructions in target

Generate vector intrinsics

Based on the library functions available

Automatic parallelization

Leverage OpenMP support in rest of backend

Page 20: E Lno

Vectorization

Applied to the Innermost Loop

Any statement not involved in a dependence cycle may be vectorized

General Vectorization Implementation
Constraints Checking
Dependence Analysis for Innermost Loop
– build statement dependence graph
– detect dependence cycles (strongly connected components (SCCs))
Techniques to Enable Vectorization
– applies when a dependence cycle exists (see next slide)
Rewrite the Loop to its Vectorized Version

DO I = 1, N

a(I) = a(I) + C

Can be vectorized

DO I = 1, N

a(I+1) = a(I) + C

Dependence Cycle! Cannot be vectorized

Page 21: E Lno

Techniques to Enable Vectorization

Scalar Expansion

Loop Fission

Other Approaches

Loop interchange, array renaming, etc

DO I = 1, N

a(I+1) = a(I) + C //cycle

b(I) = b(I) + D //no cycle

ENDDO

DO I = 1, N

a(I+1) = a(I) + C //cycle, loop not vectorizable

ENDDO

DO I = 1, N

b(I) = b(I) + D //no cycle, vectorizable loop

ENDDO

DO I = 1, N

T = a(I)

a(I) = b(I)

b(I) = T

ENDDO

DO I = 1, N

t(I) = a(I)

a(I) = b(I)

b(I) = t(I)

ENDDO

expand scalar T to array t

Page 22: E Lno

Extra Considerations for SIMD

Special type of vectorization

Array references must be contiguous
e.g., A[2*I] and A[2*(I+1)] are not contiguous
Loop versioning may be required for F90 arrays to guarantee contiguity

Alignment
Sometimes no benefit if not aligned to a 128-bit boundary
peeling may be required

May need a remainder loop

DO I = 1, N

a(I) = a(I) + C

ENDDO

DO I = 1, N-N%4, 4

a(I:I+3) = a(I:I+3) + C

ENDDO

! remainder

DO I = N - N%4 + 1, N

a(I) = a(I) + C

ENDDO
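A scalar C sketch of the strip-mined form (illustrative name): the main loop advances in 4-element strips, each of which would become one SIMD add, and a scalar remainder loop covers the last n % 4 elements, with the bounds chosen so every element is updated exactly once:

```c
void add_const_simd_style(double *a, double cval, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* vectorizable strips of 4 */
        a[i]     += cval;
        a[i + 1] += cval;
        a[i + 2] += cval;
        a[i + 3] += cval;
    }
    for (; i < n; i++)             /* remainder loop */
        a[i] += cval;
}
```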

Page 23: E Lno

SIMD with Reduction

Replicate the accumulator for each SIMD lane

sum = 0

DO i = 1, N

sum = sum + A(i)

ENDDO

sum0 = 0

sum1 = 0

sum2 = 0

sum3 = 0

DO i = 1, N, 4

sum0 = sum0 + A(i)

sum1 = sum1 + A(i+1)

sum2 = sum2 + A(i+2)

sum3 = sum3 + A(i+3)

ENDDO

sum = sum0 + sum1 + sum2 + sum3
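A scalar C sketch of the accumulator replication (illustrative name; integer data chosen so the result is exact). The four independent accumulators break the single loop-carried dependence on sum and map onto the lanes of one 4-wide SIMD register; for floating point this reassociates the additions, so compilers only do it under relaxed FP rules. Assumes n % 4 == 0, as on the slide:

```c
int sum_reduction4(const int *a, int n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];                 /* each accumulator takes every */
        s1 += a[i + 1];             /* 4th element: no dependence   */
        s2 += a[i + 2];             /* between the four chains      */
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;       /* final horizontal combine */
}
```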

Page 24: E Lno

Generating Vector Intrinsic

Fission is usually needed to isolate the intrinsics

for (i=0; i<N; i++) {
  a[i] = a[i] + 3.0;
  u[i] = cos(ct[i]);
}

vcos(&ct[0], &u[0], N, 1, 1);
for (i=0; i<N; i++)
  a[i] = a[i] + 3.0;

Page 25: E Lno

LNO Phase Structure

1. Pre-optimization

2. Fully Unroll Short Loops [lnopt_main.cxx]

3. Build Array Dependence Graph [be/com/dep_graph.cxx]

4. Miscellaneous Optimizations

hoist varying lower bounds [access_main.cxx]

form min/max [ifminmax.cxx]

dead store eliminate arrays [dead.cxx]

array substitutions (forward and backward) [forward.cxx]

loop reversal [reverse.cxx]

5. Loop Unswitching [cond.cxx]

6. Cache Blocking [tile.cxx]

7. Loop Interchange [permute.cxx]

8. Loop Fusion [fusion.cxx]

9. Hoist Messy Loop Bounds [array_bounds.cxx]

Page 26: E Lno

LNO Phase Structure (continued)

10. Array Padding [pad.cxx]

11. Parallelization [parallel.cxx]

12. Shackling [shackle.cxx]

13. Gather Scatter [fis_gthr.cxx]

14. Loop Fission [fission.cxx]

15. SIMD [simd.cxx]

16. Hoist IF [lnopt_hoistif.cxx]

17. Generate Vector Intrinsics [vintr_fis.cxx]

18. Generate Prefetches [prefetch.cxx]

19. Array Scalarization [sclrze.cxx]

20. Move Invariant Outside the Loop [minvariant.cxx]

21. Inter-Iteration Common Sub-Expression Elimination [cse.cxx]

