Post on 19-Dec-2015
transcript
Motivational Example Matrix Multiplication:
DO J = 1, M DO I = 1, N T = 0.0 DO K = 1,L
T = T + A(I,K) * B(K,J) ENDDO C(I,J) = T ENDDO ENDDO
Codegen: tries to find parallelism using transformations of loop distribution and statement reordering
codegen will not uncover any vector operations because of the use of the temporary scalar T which creates several dependences carried by each of the loops.
Motivational Example I
Replacing the scalar T with the vector T$ will eliminate most of the dependences:
DO J = 1, M DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO ENDDO
Distribution of the I-loop gives us: DO J = 1, M DO I = 1, N T$(I) = 0.0 ENDDO DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO DO I = 1, N C(I,J) = T$(I) ENDDO ENDDO
Motivational Example II
T$(1:N) = 0.0
C(1:N,J) = T$(1:N)
Motivational Example III
Interchanging I and K loops, we get: DO J = 1, M
T$(1:N) = 0.0
DO I = 1,N DO K = 1,L
DO K = 1,L DO I = 1,N
T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO
ENDDO
C(1:N,J) = T$(1:N)
ENDDO
Motivational Example IV
Finally, Vectorizing the I-Loop: DO J = 1, M T$(1:N) = 0.0
DO K = 1,L DO I = 1,N
T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
ENDDO C(1:N,J) = T$(1:N) ENDDO
Motivational Example - Summary
We used two new transformations:
Loop interchange (replacing I,K loops)
Scalar Expansion (T => T$)
Loop Interchange
Loop Interchange - Switching the nesting order of two loops
DO I = 1, N DO J = 1, MS A(I,J+1) = A(I,J) + B ENDDO ENDDO
Applying loop interchange: DO J = 1, M DO I = 1, NS A(I,J+1) = A(I,J) + B ENDDO ENDDO
leads to: DO J = 1, MS A(1:N,J+1) = A(1:N,J) + B ENDDO
Impossible to vectorize since there is a cyclic dependence from S to itself, carried by the J-Loop.
Loop Interchange: Safety
Safety: not all loop interchanges are safe. S(I,J) – the instance of statement S with the
parameters (iteration numbers) I and J
DO J = 1, M DO I = 1, N S: A(I,J+1) = A(I+1,J) ENDDO ENDDO
S(1,1)
S(1,3)
S(1,2)
S(1,4)
S(2,1)
S(2,3)
S(2,2)
S(2,4)
S(3,1)
S(3,3)
S(3,2)
S(3,4)
S(4,1)
S(4,3)
S(4,2)
S4,4)
J = 1
J = 2
J = 3
J = 4
I = 1 I = 2 I = 3 I = 4
A(2,2) is assigned at S(2,1) and used at S(1,2).
Loop Interchange: Safety
A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.
Distance Vector - Reminder
Consider a dependence in a loop nest of n loops Statement S1 on iteration i is the source
of the dependence Statement S2 on iteration j is the sink of
the dependence The distance vector is a vector of
length n d(i,j) such that: d(i,j)k = jk – ik
Direction Vector - Reminder
Suppose that there is a dependence from statement S1 on iteration i of a loop nest of n loops and statement S2 on iteration j, then the dependence direction vector is D(i,j) is defined as a vector of length n such that
”>“ if d(i,j)k > 0 ”=“ if d(i,j)k = 0
”>“ if d(i,j)k < 0
D(i,j)k=
Direction Vector - Reminder
DO I = 1, N
DO J = 1, M
DO K = 1, L
A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
ENDDO
ENDDO
ENDDO
(I+1,J+1,K) - (I,J,K) = (1,1,0) => (<,<,=)
(I+1,J+1,K) - (I,J+1,K+1) = (1,0,-1) => (<,=,>)
Direction Vector - Reminder
We say that the I’th loop carries a dependence with direction-vector d, if the leftmost non-”=“ in d is in the I’th place.
(<,<,=) (<,=,>) It is valid to convert a sequential loop
to a parallel loop if the loop carries no dependence.
Loop Interchange: Safety Theorem 5.1 Let D(i,j) be a direction vector for a
dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops is determined by applying the same permutation to the elements of D(i,j).
DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO ENDDO ENDDO (I+1,J+1,K) - (I,J,K) = (1,1,0) => (<,<,=) =>
(<,=,<)
Loop Interchange: Safety The direction matrix for a nest of loops is a matrix in which
each row is a direction vector for some dependence between statements contained in the nest and every such direction vector is represented by a row.
DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO ENDDO ENDDO (1,1,0) => (<,<,=) (1,0,-1) => (<,=,>)
< < =
< = >
Loop Interchange: Safety
Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non- "=" direction in any row.
i j k< < =
< = >
j i k< < =
= < >
j k i< = <
= > <
Loop Interchange : profitabilityIn general, our goal is to enhance vectorization by increasing the number
of innermost loops that do not carry any dependence:
DO I = 1, N DO J = 1, MS A(I,J+1) = A(I,J) + A(I-1,J) + B ENDDO ENDDO
Applying loop interchange: DO J = 1, M DO I = 1, NS A(I,J+1) = A(I,J) + A(I-1,J) + B ENDDO ENDDO
leads to: DO J = 1, MS A(1:N,J+1) = A(1:N,J) + A(0:N-1,J) + B ENDDO
I J = < < <
J I < = < <
Loop Interchange: profitability
Profitability depends on architecture DO I = 1, N DO J = 1, M DO K = 1, LS A(I+1,J+1,K) = A(I,J,K) + B ENDDO ENDDO ENDDO For SIMD machines with large number of FU’s: DO I = 1, NS A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B ENDDO
i j k< < =
Loop Interchange: profitability
Since Fortran stores in column-major order we would like to vectorize the I-loop
Thus, after Interchanging the I-loop to be the inner loop, we can transform to:
DO J = 1, M DO K = 1, LS A(2:N+1,J+1,K) = A(1:N,J,K) + B ENDDO ENDDO
j k i< = <
A(I,1:M,1:L)
Loop Interchange: profitability
MIMD machines with vector execution units want to cut down synchronization costs, therefore, shifting the K-loop to outermost level will get:
PARALLEL DO K = 1, L DO J = 1, M A(2:N+1,J+1,K) = A(1:N,J,K) + B ENDDO END PARALLEL DO
k j i = < <
Loop Shifting
Motivation: Identify loops which can be moved and move them to “optimal” nesting levels.
Trying all the permutations is not practical. Theorem 5.3 In a perfect loop nest, if loops at level i, i+1,...,i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.
> = > = = == > > > > == = = = > >= = = = = >
> = = > = == > > > > == = > = = >= = = = = >
Loop Shifting
Heuristic – move the loops that carry no dependence to the innermost position:
DO I = 1, N DO J = 1, N DO K = 1, N A(I,J,k) = A(I,J,k+1) ENDDO ENDDO ENDDO
DO K= 1, N DO I = 1, N DO J = 1, N A(I,J,k) = A(I,J,k+1) ENDDO ENDDO ENDDO
I J K = = <
K I J < = =
A(1:N,1:N,k) = A(1:N,1:N,k+1)
Loop Shifting
The heuristic fail in:
DO I = 1, N
DO J = 1, M
A(I+1,J+1) = A(I,J) + A(I+1,J)
ENDDO
ENDDO But we still want to get:
DO J = 1, M
A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J)
ENDDO
I J < < = <
J I < < < =
Loop Shifting
A recursive heuristec: If the outermost loop carries no dependence, then let
the outermost loop that carries a dependence to be the outermost loop. (the former heuristic.)
If the outermost loop carries a dependence, find the outermost loop L that: L can be safely shifted to the outermost position L carries a dependence d whose direction vector contains
an "=" in every position but the Lth.
And move L to the outermost position.
Loop Shifting
< < = == = < <= < = == = = <
< < = = = = < < < = = = = = = <
< = < = = < = << = = == = = <
< = = < = < < =< = = == = < =
Scalar Expansion DO I = 1, N T = A(I) A(I) = B(I) B(I) = T ENDDO Scalar Expansion : DO I = 1, N T$(I) = A(I) A(I) = B(I) B(I) = T$(I) ENDDO T = T$(N) leads to: T$(1:N) = A(1:N) A(1:N) = B(1:N) B(1:N) = T$(1:N) T = T$(N)
Scalar Expansion
However, not always profitable. Consider: DO I = 1, N T = T + A(I) + A(I-1) A(I) = T ENDDO Scalar expansion gives us: T$(0) = T DO I = 1, N T$(I) = T$(I-1) + A(I) + A(I-1) A(I) = T$(I) ENDDO T = T$(N)
Scalar Expansion – How to
Naïve approach: Expand all scalars, vectorize, shrink all unnecessary expansions.
Dependences due to reuse of memory location vs. reuse of values: Dependences due to reuse of values must be
preserved Dependences due to reuse of memory location
can be deleted by expansion We will try to find the removable dependence
Reminder - SSA
X= X= X=
=X =X =X
1S 2S 3S
4S
5S 6S 7S
1S 2S 3S
4S
5S 6S 7S
1X 2X 3X
),,( 3214 XXXX
4X 4X 4X
Covering definition A definition X of a scalar S is a covering definition for loop
L if a definition of S placed at the beginning of L reaches no uses of S that occur past X.
DO I = 1, 100S1 T = X(I)S2 Y(I) = T ENDDO DO I = 1, 100 IF (A(I) .GT. 0) THENS1 T = X(I)S2 Y(I) = T ENDIF ENDDO
T=X(I)
Y(I)=T
1S
2S
)(T
Covering definition A covering definition does not always exist:
DO I = 1, 100 IF (A(I) .GT. 0) THENS2 T = X(I) ENDIFS3 Y(I) = T ENDDO
In SSA terms: There does not exist a covering definition for a variable T if the edge out of the first assignment to T goes to a -function later in the loop which merges its values with those for another control flow path through the loop
T=X(I)
Y(I)=T
2S
)(T
3S
Covering definition
Expanded definition of covering definition:
A collection C of covering definitions for T in a loop in which every control path has a covering definition.
DO I = 1, 100 IF (A(I) .GT. 0) THENS1 T = X(I) ELSES2 T = -X(I) ENDIFS3 Y(I) = T ENDDO
T=X(I) T=-X(I)
Y(I)=T
2S
)(T
3S
1 S
Find Covering definitions DO I = 1, 100
IF (A(I) .GT. 0) THENS1 T = X(I) ENDIFS2 Y(I) = T ENDDO To form a collection of covering definitions, we can insert dummy
assignments: DO I = 1, 100 IF (A(I) .GT. 0) THENS1 T = X(I) ELSES2 T = T ENDIFS3 Y(I) = T ENDDO
Algorithm to insert dummy assignments and compute the collection, C, of covering definitions: Central idea: Look for parallel paths to a -
function following the first assignment, and add dummy assignments.
Find Covering definition
T=X(I)
Y(I)=T
2S
)(T
3S
T=X(I)
Y(I)=T
2S
)(T
3S
T=T DO I = 1, 100 IF (A(I) .GT. 0) THEN T = X(I) ELSE T = T ENDIF Y(I) = TENDDO
Scalar Expansion
Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop: Create an array T$ of appropriate length For each S in the covering definition collection C, replace the T on
the left-hand side by T$(I). For every other definition of T and every use of T in the loop body
reachable by SSA edges that do not pass through S0, the -function at the beginning of the loop, replace T by T$(I).
For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1).
If S0 is not null, then insert T$(0) = T before the loop. If there is an SSA edge from any definition in the loop to a use
outside the loop, insert T = T$(U) after the loop, were U is the loop upper bound.
T$(0) = TDO I = 1, 100 IF (A(I) .GT. 0) THENS1 T$(I) = X(I) ELSE T$(I) = T$(I-1) ENDIFS2 Y(I) = T$(I)ENDDO
Deletable dependence: Given the covering Definitions, we can identify
deletable dependence: Backward carried antidependences:
Writing to T[i] after reading from T[i-1] at the previous iteration Backward carried output dependences
Writing to T[i] after writing to T[i-1] at the previous iteration Forward carried output dependences
Writing to T[i] and writing to T[i+1] at the next iteration Loop-independent antidependences into the covering
definition Writing to T[i] after reading from T[i-1] at the same iteration
Loop-carried true dependences from a covering definition reading from T[i] after writing to T[i-1] at the previous iteration
Scalar Expansion: Drawbacks Expansion increases memory requirements Solutions:
Expand in a single loopStrip mine loop before expansion:
Forward substitution:
DO I = 1, N T = A(I) + A(I+1) A(I) = T + B(I) ENDDO
DO I = 1, N,64 DO J = 0,63 T = A(I+J) +A(I+J+1) A(I) = T + B(I+J) ENDDO ENDDO
DO I = 1, N A(I) = A(I) + A(I+1) + B(I) ENDDO
DO I = 1, N T = A(I) + A(I+1) A(I) = T + B(I) ENDDO
Scalar Renaming
S3 T2$(1:100) = D(1:100) - B(1:100)
S4 A(2:101) = T2$(1:100) * T2$(1:100)
S1 T1$(1:100) = A(1:100) + B(1:100)
S2 C(1:100) = T1$(1:100) + T1$(1:100)
T = T2$(100)
DO I = 1, 100S1 T1 = A(I) + B(I)S2 C(I) = T1 + T1S3 T2 = D(I) - B(I)S4 A(I+1) = T2 * T2 ENDDO
DO I = 1, 100S1 T = A(I) + B(I)S2 C(I) = T + TS3 T = D(I) - B(I)S4 A(I+1) = T * T ENDDO
Scalar Renaming Renaming algorithm partitions all definitions
and uses into equivalent classes using the definition-use graph:Pick definitionAdd all uses that the definition reaches to the
equivalence classAdd all definitions that reach any of the
uses… ..until fixed point is reached
Scalar Renaming: Profitability
Scalar renaming will remove a loop-independent output dependence or antidependence.
(Writing to the same scalar at the same iteration or writing to the scalar after reading from it.)
Relatively cheap to use scalar renaming. Usually done by compilers when calculating live
ranges for register allocation.
Array Renaming
DO I = 1, NS1 A(I) = A(I-1) + XS2 Y(I) = A(I) + ZS3 A(I) = B(I) + C ENDDO
Rename A(I) to A$(I): DO I = 1, NS1 A$(I) = A(I-1) + XS2 Y(I) = A$(I) + ZS3 A(I) = B(I) + C ENDDO
A(1:N) =B(1:N) +C
A$(1:N)=A(0:N-1)+X
Y(1:N) =A$(1:N) +Z
Array Renaming: Profitability determine edges that are removed by array renaming
and analyze effects on dependence graph procedure array_partition:
Assumes no control flow in loop body identifies collections of references to
arrays which refer to the same value identifies deletable output dependences
and antidependences Use this procedure to generate code
Minimize amount of copying back to the “original” array at the beginning and the end