Section E: Loop Nest Optimization
Loop Nest Optimizer (LNO) Overview
- Performs transformations on a loop nest
- Scope of work: each top-level loop nest
- Does not build a control flow graph
- Driven by data dependency analysis
- Uses alias and use-def information provided by the scalar optimizer
- Annotates data dependency information for use by the code generator (innermost loop only)
- Requires modeling of hardware resources
Dependence Testing
- Dependence: given two references R1 and R2, R2 depends on R1 if they may access the same memory location and there is a path from R1 to R2
  – true dependence, anti dependence, output dependence
- Access array: each array subscript expressed in terms of the loop index variables
- Access vector: a vector of all the subscripts' access arrays
- Dependence testing (input: access arrays); refer to "Efficient and Exact Data Dependence Analysis", Dror Maydan et al., PLDI '91
  – output: a dependence vector, each dimension representing a loop level
DO I = 1, N
  DO J = 1, N
    ...a(2I + J, 3J)...

access arrays: (2, 1), (0, 3)
access vector: [(2, 1), (0, 3)]
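The slides point to Maydan et al.'s exact test; as a simpler illustration of how subscript coefficients feed a dependence test, here is a sketch of the classic conservative GCD test for one one-dimensional subscript pair. The function names are my own, not LNO's: a write to a(c1*I + c0) and a read of a(d1*I + d0) can touch the same element only if gcd(c1, d1) divides (d0 - c0).

```c
#include <stdlib.h>

/* Euclid's algorithm on absolute values. */
static int gcd(int a, int b) {
    a = abs(a); b = abs(b);
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* Classic GCD test (conservative, unlike the exact test the slides cite):
   returns 1 if a dependence between a(c1*I + c0) and a(d1*I + d0) is
   possible, 0 if the references are provably independent. */
int gcd_test(int c1, int c0, int d1, int d0) {
    int g = gcd(c1, d1);
    if (g == 0)                  /* both subscripts are constants */
        return c0 == d0;
    return (d0 - c0) % g == 0;   /* divisibility => solutions may exist */
}
```

For example, a(2I) and a(2I+1) are independent because gcd(2, 2) = 2 does not divide 1, while a(2I) and a(4I+2) may conflict.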
Three Classes of Optimizations by LNO
1. Transformations for Data Cache
2. Transformations that help other optimizations
3. Vectorization and Parallelization
LNO Transformations for Data Cache
- Cache blocking: transform the loop to work on sub-matrices that fit in cache
- Loop interchange
- Array padding: reduce cache conflicts
- Prefetch generation: hide the long latency of cache-miss references
- Loop fusion
- Loop fission
Cache Blocking

Matrix B misses all the time: n^3 + 2n^2 cache misses (ignoring line size).

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];

(Figure: A * B = C)
Cache Blocking

Use sub-matrices that fit entirely in cache.

(Figure: A, B, C each partitioned into 2x2 blocks A11..A22, B11..B22, C11..C22)

C11 = A11 * B11 + A12 * B21

For sub-matrices of size b: (n/b)*n^2 + 2n^2 cache misses instead of n^3 + 2n^2.
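The blocked product above can be sketched in C as follows. The sizes N and B and the function name are illustrative, not from the slides; in a real compiler the block size is derived from the cache capacity.

```c
#include <string.h>

#define N 8   /* matrix size (illustrative) */
#define B 4   /* block size: chosen so three B x B tiles fit in cache */

/* Cache-blocked matrix multiply: each (i0, j0, k0) triple multiplies
   B x B sub-matrices, so the B x B tile of b is reused from cache
   instead of being re-fetched for every i. */
void matmul_blocked(double a[N][N], double b[N][N], double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int i0 = 0; i0 < N; i0 += B)
        for (int j0 = 0; j0 < N; j0 += B)
            for (int k0 = 0; k0 < N; k0 += B)
                for (int i = i0; i < i0 + B; i++)
                    for (int j = j0; j < j0 + B; j++)
                        for (int k = k0; k < k0 + B; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```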
Loop Interchange (Permutation)
- Improves data locality
- Unimodular transformation: combined with cache blocking, loop reversal, loop skewing, etc., to improve the overall data locality of the loop nests
- Enables vectorization and parallelization

DO I = 1, N
  DO J = 1, M
    A(I, J) = B(I, J) + C

No spatial reuse: a cache miss for every reference of A and B.

DO J = 1, M
  DO I = 1, N
    A(I, J) = B(I, J) + C

Miss once every 16 iterations (element size: 4 bytes, cache line size: 64 bytes).
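For readers working in C rather than Fortran, the profitable order is mirrored, since C is row-major while Fortran is column-major. A sketch, with illustrative names and sizes:

```c
#define NR 8
#define NC 8

/* C arrays are row-major: a[i][j] and a[i][j+1] are adjacent in memory,
   so keeping j innermost walks each cache line sequentially (spatial
   reuse). This is the mirror image of the Fortran example above, where
   column-major layout makes I the profitable innermost index. */
void add_const_row_major(double a[NR][NC], double b[NR][NC], double c) {
    for (int i = 0; i < NR; i++)        /* rows: outer loop        */
        for (int j = 0; j < NC; j++)    /* columns: inner, stride-1 */
            a[i][j] = b[i][j] + c;
}
```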
Software Prefetch

Major Considerations for Software Prefetch
- What to prefetch: references that most likely cause cache misses
- When to prefetch: neither too early nor too late
- Avoid useless prefetches: register pressure, cache pollution, memory bandwidth

Major Phases in Our Prefetch Engine
- Process_Loop: build internal structures, etc.
- Build_Base_LGs: build locality groups (references that likely access the same cache line)
- Volume: compute the data volume for each loop, from innermost to outermost
- Find_Loc_Loops: determine which locality groups need prefetch
- Gen_Prefetch: generate prefetch instructions
Software Prefetch
- Prefetch N cache lines ahead: for a(i), prefetch a(i + N*line_size); -LNO:prefetch_ahead=N (default 2)
- One prefetch for each cache line: versioning in the loop, or combine with unrolling

Versioning:

DO I = 1, N
  if I % 16 == 0 then
    loop body with prefetch code
  else
    loop body without prefetch code

Combined with unrolling:

DO I = 1, N, 16
  prefetch b(I + 2*16)
  a = a + b(I)
  a = a + b(I+1)
  ...
  a = a + b(I+15)
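A hand-written sketch of the unrolled-with-prefetch pattern, using the GCC/Clang `__builtin_prefetch` builtin. The function name and the summation kernel are illustrative, not LNO output; the slide's parameters (4-byte elements, 64-byte lines, so 16 elements per line, prefetching 2 lines ahead) are kept.

```c
/* Sum an array with one prefetch per cache line: the loop is unrolled
   by 16 (one 64-byte line of 4-byte floats) and each iteration
   prefetches the line 2 ahead. Prefetch is only a hint, so touching an
   address slightly past the array is harmless on common targets. */
float sum_with_prefetch(const float *b, int n) {
    float a = 0.0f;
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __builtin_prefetch(&b[i + 2 * 16]);  /* 2 lines ahead */
        for (int u = 0; u < 16; u++)         /* unrolled body, written compactly */
            a += b[i + u];
    }
    for (; i < n; i++)                       /* remainder iterations */
        a += b[i];
    return a;
}
```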
Loop Fission and Fusion
- Loop fission: enables loop interchange; enables vectorization and parallelization; reduces conflict misses
- Loop fusion: reduces loop overhead; improves data reuse; larger loop body
DO I = 1, N
a(I) = b(I) + C
ENDDO
DO I = 1, N
d(I) = a(I) + E
ENDDO
DO I = 1, N
a(I) = b(I) + C
d(I) = a(I) + E
ENDDO
LNO Transformations that Help Other Optimizations
- Scalar expansion / array expansion: reduce inter-loop dependencies; enable parallelization
- Scalar variable renaming: loop nests can be optimized separately; fewer constraints for register allocation
- Array scalarization: improves register allocation
- Hoist messy loop bounds
- Outer loop unrolling
- Array substitution (forward and backward)
- Loop unswitching
- Hoist IF
- Gather-scatter
- Move invariant array references out of loops
- Inter-iteration CSE
Outer Loop Unrolling
- Form larger loop bodies
- Reduce loop overhead
for (i = 0; i < n; i++)
  for (j = 0; j < m; j++)
    a[i][j] = a[i][j] + x * b[j] * c[j];

(per iteration: 1 add, 2 mults, 3 loads, 1 store)

for (i = 0; i < n-1; i += 2)
  for (j = 0; j < m; j++) {
    a[i][j]   = a[i][j]   + x * b[j] * c[j];
    a[i+1][j] = a[i+1][j] + x * b[j] * c[j];
  }
if (i < n)                     /* remainder row when n is odd */
  for (j = 0; j < m; j++)
    a[i][j] = a[i][j] + x * b[j] * c[j];

(per unrolled iteration: 2 adds, 4 mults, 4 loads, 2 stores; b[j] and c[j] are loaded once for two rows)
Gather-Scatter

do i = 1, n
  if (c(i) .gt. 0.0) then
    a(i) = c(i) / b(i)
    c(i) = c(i) * b(i)
  end if
end do

inc_0 = 0
do i = 1, n
  deref_gs(inc_0+1) = i
  if (c(i) .gt. 0.0) then
    inc_0 = inc_0 + 1
  end if
end do
do ind_0 = 0, inc_0-1
  i_gs = deref_gs(ind_0+1)
  a(i_gs) = c(i_gs) / b(i_gs)
  c(i_gs) = c(i_gs) * b(i_gs)
end do
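The same transformation can be sketched in C; the names mirror the Fortran temporaries above. A first pass gathers the indices whose condition holds, and a second, branch-free pass then does the real work and is a candidate for vectorization.

```c
/* Gather-scatter: deref_gs must have room for n indices.
   The gather pass speculatively stores each candidate index and only
   "commits" it by advancing inc when the condition holds. */
void gather_scatter(double *a, const double *b, double *c, int n,
                    int *deref_gs) {
    int inc = 0;
    for (int i = 0; i < n; i++) {        /* gather pass */
        deref_gs[inc] = i;
        if (c[i] > 0.0)
            inc++;
    }
    for (int k = 0; k < inc; k++) {      /* compute pass, no branch */
        int i = deref_gs[k];
        a[i] = c[i] / b[i];
        c[i] = c[i] * b[i];
    }
}
```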
Forward and Backward Array Substitution
DO i = 1, N
DO j = 1, N
s = C(i,j)
DO k = 1,N
s = s + A(i, k) * B(k, j)
ENDDO
C(i, j) = s
ENDDO
ENDDO
DO i = 1, N
DO j = 1, N
DO k = 1,N
C(i,j) = C(i,j) + A(i, k) * B(k, j)
ENDDO
ENDDO
ENDDO
Hoist IF
Remove the loop by replicating the matching iterations
DO i = 1, N
if (i == winner && F(i))
G(i)
ENDDO
if (F(winner))
G(winner)
Loop Unswitching
Move IFs with invariant conditions out of the loop
DO i = 1, N
if (cond)
G(i)
ENDDO
if (cond)
DO i = 1, N
G(i)
ENDDO
Inter-Iteration CSE
DO I = 1, N
c(I) = a(I) + b(I)
d(I) = a(I+1) + b(I+1)
ENDDO
temp1 = a(1) + b(1)
DO I = 1, N
temp = temp1
temp1 = a(I+1) + b(I+1)
c(I) = temp
d(I) = temp1
ENDDO
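A C rendering of the transformed loop, useful for checking the equivalence (the function name is illustrative). The point is that a(I+1) + b(I+1) in iteration I is the same expression as a(I) + b(I) in iteration I+1, so each sum is computed once and carried across iterations.

```c
/* Inter-iteration CSE: a and b must have n+1 elements, c and d have n.
   Each a[i] + b[i] sum is computed exactly once and carried forward
   in temp1 instead of being recomputed by the next iteration. */
void cse_loop(double *c, double *d, const double *a, const double *b,
              int n) {
    double temp1 = a[0] + b[0];
    for (int i = 0; i < n; i++) {
        double temp = temp1;
        temp1 = a[i + 1] + b[i + 1];   /* reused by iteration i+1 */
        c[i] = temp;                   /* was a[i]   + b[i]   */
        d[i] = temp1;                  /* was a[i+1] + b[i+1] */
    }
}
```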
LNO Parallelization
- SIMD code generation: highly dependent on the SIMD instructions in the target
- Generate vector intrinsics: based on the library functions available
- Automatic parallelization: leverages the OpenMP support in the rest of the backend
Vectorization

Applied to the innermost loop: any statement not involved in a dependence cycle may be vectorized.

General Vectorization Implementation
- Constraints checking
- Dependence analysis for the innermost loop
  – build the statement dependence graph
  – detect dependence cycles (strongly connected components (SCCs))
- Techniques to enable vectorization
  – applied when a dependence cycle exists (see next slide)
- Rewrite the loop to its vectorized version
DO I = 1, N
a(I) = a(I) + C
Can be vectorized
DO I = 1, N
a(I+1) = a(I) + C
Dependence Cycle! Cannot be vectorized
Techniques to Enable Vectorization
- Scalar expansion
- Loop fission
- Other approaches: loop interchange, array renaming, etc.
DO I = 1, N
a(I+1) = a(I) + C //cycle
b(I) = b(I) + D //no cycle
ENDDO
DO I = 1, N
a(I+1) = a(I) + C //cycle, loop not vectorizable
ENDDO
DO I = 1, N
b(I) = b(I) + D //no cycle, vectorizable loop
ENDDO
DO I = 1, N
T = a(I)
a(I) = b(I)
b(I) = T
ENDDO
DO I = 1, N
t(I) = a(I)
a(I) = b(I)
b(I) = t(I)
ENDDO
expand scalar T to array t
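The expanded swap can be sketched in C (function name illustrative). The single scalar T is reused by every iteration, which the vectorizer must treat as a dependence across lanes; expanding T into an array gives each iteration its own slot, removing the cycle.

```c
/* Scalar expansion: t must have n elements, one per iteration,
   replacing the single scalar temporary T of the original loop. */
void swap_arrays_expanded(double *a, double *b, double *t, int n) {
    for (int i = 0; i < n; i++) {
        t[i] = a[i];   /* t(I) = a(I) */
        a[i] = b[i];   /* a(I) = b(I) */
        b[i] = t[i];   /* b(I) = t(I) */
    }
}
```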
Extra Considerations for SIMD (a special type of vectorization)
- Array references must be contiguous; e.g., A[2*I] and A[2*(I+1)] are not contiguous
  – loop versioning may be required for F90 arrays to guarantee contiguity
- Alignment: sometimes no benefit if not aligned to a 128-bit boundary; peeling may be required
- May need a remainder loop
DO I = 1, N
a(I) = a(I) + C
ENDDO
DO I = 1, N - mod(N,4), 4
  a(I:I+3) = a(I:I+3) + C
ENDDO
! remainder
DO I = N - mod(N,4) + 1, N
  a(I) = a(I) + C
ENDDO
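The strip-mined shape in C, with the main loop processing groups of 4 (an assumed vector width of 4 elements) followed by a remainder loop; the function name is illustrative.

```c
/* Strip-mined loop: the main loop's body touches a[i..i+3] with no
   cross-iteration dependence, so it maps directly onto 4-wide SIMD;
   the second loop handles the n % 4 leftover elements. */
void add_const_strip_mined(double *a, int n, double c) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* vectorizable main loop */
        a[i]     += c;
        a[i + 1] += c;
        a[i + 2] += c;
        a[i + 3] += c;
    }
    for (; i < n; i++)             /* remainder loop */
        a[i] += c;
}
```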
SIMD with Reduction
Replicate the accumulator for each SIMD lane
sum = 0
DO i = 1, N
sum = sum + A(i)
ENDDO
sum0 = 0
sum1 = 0
sum2 = 0
sum3 = 0
DO i = 1, N, 4
sum0 = sum0 + A(i)
sum1 = sum1 + A(i+1)
sum2 = sum2 + A(i+2)
sum3 = sum3 + A(i+3)
ENDDO
sum = sum0 + sum1 + sum2 + sum3
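A C sketch of the replicated-accumulator reduction (assumes for simplicity that n is a multiple of 4; a remainder loop would handle the general case). Note that this reassociates the floating-point sum, so compilers apply it only when relaxed FP semantics are allowed.

```c
/* Reduction with 4 replicated accumulators: each accumulator takes
   every 4th element, breaking the single serial dependence chain so
   the four partial sums can run in SIMD lanes; they are combined once
   after the loop. Requires n % 4 == 0. */
double sum_replicated(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```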
Generating Vector Intrinsics

Fission is usually needed to isolate the intrinsics.

for (i = 0; i < N; i++) {
  a[i] = a[i] + 3.0;
  u[i] = cos(ct[i]);
}

vcos(&ct[0], &u[0], N, 1, 1);
for (i = 0; i < N; i++)
  a[i] = a[i] + 3.0;
LNO Phase Structure
1. Pre-optimization
2. Fully Unroll Short Loops [lnopt_main.cxx]
3. Build Array Dependence Graph [be/com/dep_graph.cxx]
4. Miscellaneous Optimizations
   – hoist varying lower bounds [access_main.cxx]
   – form min/max [ifminmax.cxx]
   – dead store eliminate arrays [dead.cxx]
   – array substitutions (forward and backward) [forward.cxx]
   – loop reversal [reverse.cxx]
5. Loop Unswitching [cond.cxx]
6. Cache Blocking [tile.cxx]
7. Loop Interchange [permute.cxx]
8. Loop Fusion [fusion.cxx]
9. Hoist Messy Loop Bounds [array_bounds.cxx]
LNO Phase Structure (continued)
10. Array Padding [pad.cxx]
11. Parallelization [parallel.cxx]
12. Shackling [shackle.cxx]
13. Gather Scatter [fis_gthr.cxx]
14. Loop Fission [fission.cxx]
15. SIMD [simd.cxx]
16. Hoist IF [lnopt_hoistif.cxx]
17. Generate Vector Intrinsics [vintr_fis.cxx]
18. Generate Prefetches [prefetch.cxx]
19. Array Scalarization [sclrze.cxx]
20. Move Invariant Outside the Loop [minvariant.cxx]
21. Inter-Iteration Common Sub-Expression Elimination [cse.cxx]