Section E: Loop Nest Optimization
Loop Nest Optimizer (LNO) Overview
- Performs transformations on a loop nest
- Scope of work: each top-level loop nest
- Does not build a control flow graph
- Driven by data dependency analysis
- Uses alias and use-def information provided by the scalar optimizer
- Annotates data dependency information for use by the code generator (innermost loop only)
- Requires modeling of hardware resources
Dependence Testing
- Dependence: given two references R1 and R2, R2 depends on R1 if they may access the same memory location and there is a path from R1 to R2
  – true dependence, anti dependence, output dependence
- Access array: each array subscript expressed in terms of the loop index variables
- Access vector: a vector of all the subscripts' access arrays
- Dependence testing (input: access arrays); refer to "Efficient and Exact Data Dependence Analysis", Dror Maydan et al., PLDI '91
  – output: a dependence vector, each dimension representing a loop level
DO I = 1, N
  DO J = 1, N
    ...a(2I + J, 3J)...

access arrays: (2, 1), (0, 3)
access vector: [(2, 1), (0, 3)]
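The slides point to Maydan et al.'s exact test; as a simpler illustration of how subscript coefficients feed a dependence test, here is a sketch of the classic conservative GCD test for one one-dimensional subscript pair. The function names are my own, not LNO's: a write to a(c1*I + c0) and a read of a(d1*I + d0) can touch the same element only if gcd(c1, d1) divides (d0 - c0).

```c
#include <stdlib.h>

/* Euclid's algorithm on absolute values. */
static int gcd(int a, int b) {
    a = abs(a); b = abs(b);
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* Classic GCD test (conservative, unlike the exact test the slides cite):
   returns 1 if a dependence between a(c1*I + c0) and a(d1*I + d0) is
   possible, 0 if the references are provably independent. */
int gcd_test(int c1, int c0, int d1, int d0) {
    int g = gcd(c1, d1);
    if (g == 0)                  /* both subscripts are constants */
        return c0 == d0;
    return (d0 - c0) % g == 0;   /* divisibility => solutions may exist */
}
```

For example, a(2I) and a(2I+1) are independent because gcd(2, 2) = 2 does not divide 1, while a(2I) and a(4I+2) may conflict.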
Three Classes of Optimizations by LNO
1. Transformations for Data Cache
2. Transformations that help other optimizations
3. Vectorization and Parallelization
LNO Transformations for Data Cache
- Cache blocking: transform the loop to work on sub-matrices that fit in cache
- Loop interchange
- Array padding: reduce cache conflicts
- Prefetch generation: hide the long latency of cache-miss references
- Loop fusion
- Loop fission
Cache Blocking

Matrix B misses all the time: n^3 + 2n^2 cache misses (ignoring line size).

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];

(Figure: A * B = C)
Cache Blocking

Use sub-matrices that fit entirely in cache.

(Figure: A, B, C each partitioned into 2x2 blocks A11..A22, B11..B22, C11..C22)

C11 = A11 * B11 + A12 * B21

For sub-matrices of size b: (n/b)*n^2 + 2n^2 cache misses instead of n^3 + 2n^2.
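The blocked product above can be sketched in C as follows. The sizes N and B and the function name are illustrative, not from the slides; in a real compiler the block size is derived from the cache capacity.

```c
#include <string.h>

#define N 8   /* matrix size (illustrative) */
#define B 4   /* block size: chosen so three B x B tiles fit in cache */

/* Cache-blocked matrix multiply: each (i0, j0, k0) triple multiplies
   B x B sub-matrices, so the B x B tile of b is reused from cache
   instead of being re-fetched for every i. */
void matmul_blocked(double a[N][N], double b[N][N], double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int i0 = 0; i0 < N; i0 += B)
        for (int j0 = 0; j0 < N; j0 += B)
            for (int k0 = 0; k0 < N; k0 += B)
                for (int i = i0; i < i0 + B; i++)
                    for (int j = j0; j < j0 + B; j++)
                        for (int k = k0; k < k0 + B; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```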
Loop Interchange (Permutation)
- Improves data locality
- Unimodular transformation: combined with cache blocking, loop reversal, loop skewing, etc., to improve the overall data locality of the loop nests
- Enables vectorization and parallelization

DO I = 1, N
  DO J = 1, M
    A(I, J) = B(I, J) + C

No spatial reuse: a cache miss for every reference of A and B.

DO J = 1, M
  DO I = 1, N
    A(I, J) = B(I, J) + C

Miss once every 16 iterations (element size: 4 bytes, cache line size: 64 bytes).
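For readers working in C rather than Fortran, the profitable order is mirrored, since C is row-major while Fortran is column-major. A sketch, with illustrative names and sizes:

```c
#define NR 8
#define NC 8

/* C arrays are row-major: a[i][j] and a[i][j+1] are adjacent in memory,
   so keeping j innermost walks each cache line sequentially (spatial
   reuse). This is the mirror image of the Fortran example above, where
   column-major layout makes I the profitable innermost index. */
void add_const_row_major(double a[NR][NC], double b[NR][NC], double c) {
    for (int i = 0; i < NR; i++)        /* rows: outer loop        */
        for (int j = 0; j < NC; j++)    /* columns: inner, stride-1 */
            a[i][j] = b[i][j] + c;
}
```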
Software Prefetch

Major Considerations for Software Prefetch
- What to prefetch: references that most likely cause cache misses
- When to prefetch: neither too early nor too late
- Avoid useless prefetches: register pressure, cache pollution, memory bandwidth

Major Phases in Our Prefetch Engine
- Process_Loop: build internal structures, etc.
- Build_Base_LGs: build locality groups (references that likely access the same cache line)
- Volume: compute the data volume for each loop, from innermost to outermost
- Find_Loc_Loops: determine which locality groups need prefetch
- Gen_Prefetch: generate prefetch instructions
Software Prefetch
- Prefetch N cache lines ahead: for a(i), prefetch a(i + N*line_size); -LNO:prefetch_ahead=N (default 2)
- One prefetch for each cache line: versioning in the loop, or combine with unrolling

Versioning:

DO I = 1, N
  if I % 16 == 0 then
    loop body with prefetch code
  else
    loop body without prefetch code

Combined with unrolling:

DO I = 1, N, 16
  prefetch b(I + 2*16)
  a = a + b(I)
  a = a + b(I+1)
  ...
  a = a + b(I+15)
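A hand-written sketch of the unrolled-with-prefetch pattern, using the GCC/Clang `__builtin_prefetch` builtin. The function name and the summation kernel are illustrative, not LNO output; the slide's parameters (4-byte elements, 64-byte lines, so 16 elements per line, prefetching 2 lines ahead) are kept.

```c
/* Sum an array with one prefetch per cache line: the loop is unrolled
   by 16 (one 64-byte line of 4-byte floats) and each iteration
   prefetches the line 2 ahead. Prefetch is only a hint, so touching an
   address slightly past the array is harmless on common targets. */
float sum_with_prefetch(const float *b, int n) {
    float a = 0.0f;
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __builtin_prefetch(&b[i + 2 * 16]);  /* 2 lines ahead */
        for (int u = 0; u < 16; u++)         /* unrolled body, written compactly */
            a += b[i + u];
    }
    for (; i < n; i++)                       /* remainder iterations */
        a += b[i];
    return a;
}
```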
Loop Fission and Fusion
- Loop fission: enables loop interchange; enables vectorization and parallelization; reduces conflict misses
- Loop fusion: reduces loop overhead; improves data reuse; larger loop body
DO I = 1, N
a(I) = b(I) + C
ENDDO
DO I = 1, N
d(I) = a(I) + E
ENDDO
DO I = 1, N
a(I) = b(I) + C
d(I) = a(I) + E
ENDDO
LNO Transformations that Help Other Optimizations
- Scalar expansion / array expansion: reduce inter-loop dependencies; enable parallelization
- Scalar variable renaming: loop nests can be optimized separately; fewer constraints for register allocation
- Array scalarization: improves register allocation
- Hoist messy loop bounds
- Outer loop unrolling
- Array substitution (forward and backward)
- Loop unswitching
- Hoist IF
- Gather-scatter
- Move invariant array references out of loops
- Inter-iteration CSE
Outer Loop Unrolling
- Form larger loop bodies
- Reduce loop overhead
for (i = 0; i < n; i++)
  for (j = 0; j < m; j++)
    a[i][j] = a[i][j] + x * b[j] * c[j];

(per iteration: 1 add, 2 mults, 3 loads, 1 store)

for (i = 0; i < n-1; i += 2)
  for (j = 0; j < m; j++) {
    a[i][j]   = a[i][j]   + x * b[j] * c[j];
    a[i+1][j] = a[i+1][j] + x * b[j] * c[j];
  }
if (i < n)                     /* remainder row when n is odd */
  for (j = 0; j < m; j++)
    a[i][j] = a[i][j] + x * b[j] * c[j];

(per unrolled iteration: 2 adds, 4 mults, 4 loads, 2 stores; b[j] and c[j] are loaded once for two rows)
Gather-Scatter

do i = 1, n
  if (c(i) .gt. 0.0) then
    a(i) = c(i) / b(i)
    c(i) = c(i) * b(i)
  end if
end do

inc_0 = 0
do i = 1, n
  deref_gs(inc_0+1) = i
  if (c(i) .gt. 0.0) then
    inc_0 = inc_0 + 1
  end if
end do
do ind_0 = 0, inc_0-1
  i_gs = deref_gs(ind_0+1)
  a(i_gs) = c(i_gs) / b(i_gs)
  c(i_gs) = c(i_gs) * b(i_gs)
end do
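The same transformation can be sketched in C; the names mirror the Fortran temporaries above. A first pass gathers the indices whose condition holds, and a second, branch-free pass then does the real work and is a candidate for vectorization.

```c
/* Gather-scatter: deref_gs must have room for n indices.
   The gather pass speculatively stores each candidate index and only
   "commits" it by advancing inc when the condition holds. */
void gather_scatter(double *a, const double *b, double *c, int n,
                    int *deref_gs) {
    int inc = 0;
    for (int i = 0; i < n; i++) {        /* gather pass */
        deref_gs[inc] = i;
        if (c[i] > 0.0)
            inc++;
    }
    for (int k = 0; k < inc; k++) {      /* compute pass, no branch */
        int i = deref_gs[k];
        a[i] = c[i] / b[i];
        c[i] = c[i] * b[i];
    }
}
```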
Forward and Backward Array Substitution
DO i = 1, N
DO j = 1, N
s = C(i,j)
DO k = 1,N
s = s + A(i, k) * B(k, j)
ENDDO
C(i, j) = s
ENDDO
ENDDO
DO i = 1, N
DO j = 1, N
DO k = 1,N
C(i,j) = C(i,j) + A(i, k) * B(k, j)
ENDDO
ENDDO
ENDDO
Hoist IF
Remove the loop by replicating the matching iterations
DO i = 1, N
if (i == winner && F(i))
G(i)
ENDDO
if (F(winner))
G(winner)
Loop Unswitching
Move IFs with invariant conditions out of the loop
DO i = 1, N
if (cond)
G(i)
ENDDO
if (cond)
DO i = 1, N
G(i)
ENDDO
Inter-Iteration CSE
DO I = 1, N
c(I) = a(I) + b(I)
d(I) = a(I+1) + b(I+1)
ENDDO
temp1 = a(1) + b(1)
DO I = 1, N
temp = temp1
temp1 = a(I+1) + b(I+1)
c(I) = temp
d(I) = temp1
ENDDO
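A C rendering of the transformed loop, useful for checking the equivalence (the function name is illustrative). The point is that a(I+1) + b(I+1) in iteration I is the same expression as a(I) + b(I) in iteration I+1, so each sum is computed once and carried across iterations.

```c
/* Inter-iteration CSE: a and b must have n+1 elements, c and d have n.
   Each a[i] + b[i] sum is computed exactly once and carried forward
   in temp1 instead of being recomputed by the next iteration. */
void cse_loop(double *c, double *d, const double *a, const double *b,
              int n) {
    double temp1 = a[0] + b[0];
    for (int i = 0; i < n; i++) {
        double temp = temp1;
        temp1 = a[i + 1] + b[i + 1];   /* reused by iteration i+1 */
        c[i] = temp;                   /* was a[i]   + b[i]   */
        d[i] = temp1;                  /* was a[i+1] + b[i+1] */
    }
}
```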
LNO Parallelization
- SIMD code generation: highly dependent on the SIMD instructions in the target
- Generate vector intrinsics: based on the library functions available
- Automatic parallelization: leverages the OpenMP support in the rest of the backend
Vectorization

Applied to the innermost loop: any statement not involved in a dependence cycle may be vectorized.

General Vectorization Implementation
- Constraints checking
- Dependence analysis for the innermost loop
  – build the statement dependence graph
  – detect dependence cycles (strongly connected components (SCCs))
- Techniques to enable vectorization
  – applied when a dependence cycle exists (see next slide)
- Rewrite the loop to its vectorized version
DO I = 1, N
a(I) = a(I) + C
Can be vectorized
DO I = 1, N
a(I+1) = a(I) + C
Dependence Cycle! Cannot be vectorized
Techniques to Enable Vectorization
- Scalar expansion
- Loop fission
- Other approaches: loop interchange, array renaming, etc.
DO I = 1, N
a(I+1) = a(I) + C //cycle
b(I) = b(I) + D //no cycle
ENDDO
DO I = 1, N
a(I+1) = a(I) + C //cycle, loop not vectorizable
ENDDO
DO I = 1, N
b(I) = b(I) + D //no cycle, vectorizable loop
ENDDO
DO I = 1, N
T = a(I)
a(I) = b(I)
b(I) = T
ENDDO
DO I = 1, N
t(I) = a(I)
a(I) = b(I)
b(I) = t(I)
ENDDO
expand scalar T to array t
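The expanded swap can be sketched in C (function name illustrative). The single scalar T is reused by every iteration, which the vectorizer must treat as a dependence across lanes; expanding T into an array gives each iteration its own slot, removing the cycle.

```c
/* Scalar expansion: t must have n elements, one per iteration,
   replacing the single scalar temporary T of the original loop. */
void swap_arrays_expanded(double *a, double *b, double *t, int n) {
    for (int i = 0; i < n; i++) {
        t[i] = a[i];   /* t(I) = a(I) */
        a[i] = b[i];   /* a(I) = b(I) */
        b[i] = t[i];   /* b(I) = t(I) */
    }
}
```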
Extra Considerations for SIMD (a special type of vectorization)
- Array references must be contiguous; e.g., A[2*I] and A[2*(I+1)] are not contiguous
  – loop versioning may be required for F90 arrays to guarantee contiguity
- Alignment: sometimes no benefit if not aligned to a 128-bit boundary; peeling may be required
- May need a remainder loop
DO I = 1, N
a(I) = a(I) + C
ENDDO
DO I = 1, N - mod(N,4), 4
  a(I:I+3) = a(I:I+3) + C
ENDDO
! remainder
DO I = N - mod(N,4) + 1, N
  a(I) = a(I) + C
ENDDO
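The strip-mined shape in C, with the main loop processing groups of 4 (an assumed vector width of 4 elements) followed by a remainder loop; the function name is illustrative.

```c
/* Strip-mined loop: the main loop's body touches a[i..i+3] with no
   cross-iteration dependence, so it maps directly onto 4-wide SIMD;
   the second loop handles the n % 4 leftover elements. */
void add_const_strip_mined(double *a, int n, double c) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* vectorizable main loop */
        a[i]     += c;
        a[i + 1] += c;
        a[i + 2] += c;
        a[i + 3] += c;
    }
    for (; i < n; i++)             /* remainder loop */
        a[i] += c;
}
```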
SIMD with Reduction
Replicate the accumulator for each SIMD lane
sum = 0
DO i = 1, N
sum = sum + A(i)
ENDDO
sum0 = 0
sum1 = 0
sum2 = 0
sum3 = 0
DO i = 1, N, 4
sum0 = sum0 + A(i)
sum1 = sum1 + A(i+1)
sum2 = sum2 + A(i+2)
sum3 = sum3 + A(i+3)
ENDDO
sum = sum0 + sum1 + sum2 + sum3
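A C sketch of the replicated-accumulator reduction (assumes for simplicity that n is a multiple of 4; a remainder loop would handle the general case). Note that this reassociates the floating-point sum, so compilers apply it only when relaxed FP semantics are allowed.

```c
/* Reduction with 4 replicated accumulators: each accumulator takes
   every 4th element, breaking the single serial dependence chain so
   the four partial sums can run in SIMD lanes; they are combined once
   after the loop. Requires n % 4 == 0. */
double sum_replicated(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```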
Generating Vector Intrinsics

Fission is usually needed to isolate the intrinsics.

for (i = 0; i < N; i++) {
  a[i] = a[i] + 3.0;
  u[i] = cos(ct[i]);
}

vcos(&ct[0], &u[0], N, 1, 1);
for (i = 0; i < N; i++)
  a[i] = a[i] + 3.0;
LNO Phase Structure
1. Pre-optimization
2. Fully Unroll Short Loops [lnopt_main.cxx]
3. Build Array Dependence Graph [be/com/dep_graph.cxx]
4. Miscellaneous Optimizations
   – hoist varying lower bounds [access_main.cxx]
   – form min/max [ifminmax.cxx]
   – dead store eliminate arrays [dead.cxx]
   – array substitutions (forward and backward) [forward.cxx]
   – loop reversal [reverse.cxx]
5. Loop Unswitching [cond.cxx]
6. Cache Blocking [tile.cxx]
7. Loop Interchange [permute.cxx]
8. Loop Fusion [fusion.cxx]
9. Hoist Messy Loop Bounds [array_bounds.cxx]
LNO Phase Structure (continued)
10. Array Padding [pad.cxx]
11. Parallelization [parallel.cxx]
12. Shackling [shackle.cxx]
13. Gather Scatter [fis_gthr.cxx]
14. Loop Fission [fission.cxx]
15. SIMD [simd.cxx]
16. Hoist IF [lnopt_hoistif.cxx]
17. Generate Vector Intrinsics [vintr_fis.cxx]
18. Generate Prefetches [prefetch.cxx]
19. Array Scalarization [sclrze.cxx]
20. Move Invariant Outside the Loop [minvariant.cxx]
21. Inter-Iteration Common Sub-Expression Elimination [cse.cxx]