High-Level Transformations for Embedded Computing
Organization of a hypothetical optimizing compiler
DEPENDENCE ANALYSIS
A dependence is a relationship between two computations that places constraints on their execution order.
Dependence analysis identifies these constraints, which are then used to determine whether a particular transformation can be applied without changing the semantics of the computation.
Types of dependences: (i) control dependence and (ii) data dependence.
Two statements have a data dependence if they cannot be executed simultaneously due to conflicting uses of the same variable.
Types of Data Dependences (1)
Flow dependence (also called true dependence): S1 writes a variable that S2 later reads
S1: a = c*10
S2: d = 2*a + c
Anti-dependence: S1 reads a variable that S2 later writes
S1: e = f*4 + g
S2: g = 2*h
Types of Data Dependences (2)
Output dependence: both statements write the same variable
S1: a = b*c
S2: a = d+e
Input dependence: two accesses to the same memory location are both reads
Dependence Graph
Nodes represent statements and arcs represent dependences between computations.
Loop Dependence Analysis
To compute dependence information for loops, the key problem is understanding the use of arrays; scalar variables are relatively easy to manage. To track array behavior, the compiler must analyze the subscript expressions in each array reference. To discover whether there is a dependence in the loop nest, it is sufficient to determine whether any of the iterations can write a value that is read or written by any of the other iterations.
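As a minimal sketch of what the subscript analysis has to distinguish, the two loops below differ only in their array subscripts. The function names are hypothetical and chosen for illustration.

```c
#include <stddef.h>

/* Loop-carried flow dependence: iteration i reads a[i-1], which
   iteration i-1 wrote, so the iterations must run in order. */
void prefix_sum(int *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}

/* No loop-carried dependence (assuming a and b do not overlap):
   iteration i touches only a[i] and b[i], so the iterations could
   run in any order or in parallel. */
void add_arrays(int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
```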
TRANSFORMATIONS
Data-Flow Based Loop Transformations
Loop Reordering Transformations
Loop Restructuring Transformations
Loop Replacement Transformations
Memory Access Transformations
Partial Evaluation
Redundancy Elimination
Procedure Call Transformations
Data-Flow Based Loop Transformations (1)
A number of classical loop optimizations are based on data-flow analysis, which tracks the flow of data through a program's variables.
Loop-based Strength Reduction
Reduction in strength replaces an expression in a loop with one that is equivalent but uses a less expensive operator.
It is commonly applied to expressions involving induction variables.
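A small sketch of strength reduction on an induction-variable expression, written by hand to show what a compiler might produce. The function names are hypothetical, and the values are assumed small enough that no integer overflow occurs.

```c
/* Before: one multiplication per iteration. */
void fill_mul(int *a, int n, int c) {
    for (int i = 0; i < n; i++)
        a[i] = i * c;
}

/* After: the multiplication i * c is replaced by a running sum t,
   which equals i * c at the top of every iteration. */
void fill_add(int *a, int n, int c) {
    int t = 0;
    for (int i = 0; i < n; i++) {
        a[i] = t;
        t += c;
    }
}
```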
Data-Flow Based Loop Transformations (2)
Loop-invariant Code Motion
When a computation appears inside a loop but its result does not change between iterations, the compiler can move that computation outside the loop.
It is most beneficial for expensive operations.
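The following sketch hoists an invariant division out of a loop. The function names are hypothetical. The guard on n is an assumption added here so that the hoisted division is only evaluated when the original loop would have evaluated it at least once.

```c
/* Before: x / y is recomputed on every iteration although
   neither x nor y changes inside the loop. */
void offset_before(int *a, int n, int x, int y) {
    for (int i = 0; i < n; i++)
        a[i] += x / y;
}

/* After: the invariant division is computed once. */
void offset_after(int *a, int n, int x, int y) {
    if (n <= 0)
        return;          /* loop would not run: do not divide at all */
    int t = x / y;       /* hoisted loop-invariant computation */
    for (int i = 0; i < n; i++)
        a[i] += t;
}
```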
Data-Flow Based Loop Transformations (3)
Loop Unswitching
Unswitching is applied when a loop contains a conditional with a loop-invariant test condition. The loop is then replicated inside each branch of the conditional, saving the overhead of conditional branching inside the loop, reducing the code size of the loop body, and possibly enabling the parallelization of a branch of the conditional.
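A minimal sketch of unswitching, assuming a hypothetical flag use_b that does not change inside the loop:

```c
/* Before: the invariant test on use_b is evaluated every iteration. */
void update_before(int *a, const int *b, int n, int use_b) {
    for (int i = 0; i < n; i++) {
        if (use_b)
            a[i] += b[i];
        else
            a[i] += 1;
    }
}

/* After: the test is evaluated once and the loop is replicated
   inside each branch, leaving two branch-free loop bodies. */
void update_after(int *a, const int *b, int n, int use_b) {
    if (use_b) {
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    } else {
        for (int i = 0; i < n; i++)
            a[i] += 1;
    }
}
```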
Loop Reordering Transformations
These transformations change the relative order of execution of the iterations of a loop nest or nests. They are used to expose parallelism and improve memory locality.
Loop Reordering Transformations (1)
Loop Interchange exchanges the position of two loops in a loop nest. It can be used to:
enable vectorization by interchanging an inner, dependent loop with an outer, independent loop;
improve vectorization by moving the independent loop with the largest range into the innermost position;
improve parallel performance by moving an independent loop outwards in a loop nest to increase the granularity of each iteration and reduce the number of barrier synchronizations;
reduce stride, ideally to stride 1; and
increase the number of loop-invariant expressions in the inner loop.
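The stride-reduction case can be sketched as follows. C stores arrays in row-major order, so the inner loop should vary the last subscript. The array sizes and function names are hypothetical.

```c
enum { R = 4, C = 3 };

/* Before: the inner loop varies the row index, so consecutive
   accesses are C elements apart in memory (stride C). */
void scale_ji(int out[R][C], int in[R][C]) {
    for (int j = 0; j < C; j++)
        for (int i = 0; i < R; i++)
            out[i][j] = 2 * in[i][j];
}

/* After interchange: the inner loop varies the column index, so
   accesses are to adjacent elements (stride 1). The interchange is
   legal here because no iteration depends on any other. */
void scale_ij(int out[R][C], int in[R][C]) {
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            out[i][j] = 2 * in[i][j];
}
```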
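Loop Reordering Transformations (2)
Loop Skewing shifts the iteration range of the inner loop by an amount that depends on the outer loop index, skewing the order in which iterations execute. It is useful for enabling loop interchange.
Example skewing factor: “i” (the inner loop bounds are offset by the outer index i).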
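A hedged sketch of skewing on a simple wavefront computation, with the inner loop offset by the outer index i. Array sizes and function names are hypothetical. Skewing alone does not change the execution order; it changes the dependence distances from (1,0) and (0,1) to (1,1) and (0,1), which is what makes a subsequent interchange profitable.

```c
enum { N = 5, M = 6 };

/* Original nest: each element depends on its upper and left neighbours. */
void wave_before(int a[N][M]) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < M; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];
}

/* Skewed nest: the inner index is jj = j + i, so its bounds are
   shifted by i and the original j is recovered as jj - i. */
void wave_skewed(int a[N][M]) {
    for (int i = 1; i < N; i++)
        for (int jj = i + 1; jj < i + M; jj++) {
            int j = jj - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
}
```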
Loop Reordering Transformations (3)
Loop Reversal
Reversal changes the direction in which the loop traverses its iteration range.
It is often used in conjunction with other iteration space reordering transformations because it changes the dependence vectors.
Loop Reordering Transformations (4)
Strip Mining
Strip mining is a method of adjusting the granularity of an operation, especially a parallelizable operation.
The loop is divided into strips of a fixed number of iterations, for example 64, that can execute in parallel.
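A minimal sketch of strip mining with a strip length of 64, matching the figure quoted on the slide. The function name is hypothetical, and the inner loop is written serially here; on a vector or parallel machine it is the part that would map onto 64 lanes or workers.

```c
enum { STRIP = 64 };

/* The single loop over 0..n-1 is split into an outer loop over strips
   and an inner loop over at most STRIP independent iterations. */
void add_one_strips(int *a, int n) {
    for (int is = 0; is < n; is += STRIP) {
        int end = (is + STRIP < n) ? is + STRIP : n;  /* last strip may be short */
        for (int i = is; i < end; i++)
            a[i] += 1;
    }
}
```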
Loop Reordering Transformations (5)
Tiling is the multi-dimensional generalization of strip-mining.
Tiling (also called blocking) is primarily used to improve cache reuse (QC) by dividing an iteration space into tiles and transforming the loop nest to iterate over them
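Tiling can be sketched on a matrix transpose, where the untiled loop nest walks one of the two arrays with a large stride. The sizes TN and TILE and the function name are hypothetical; real tile sizes would be chosen from the cache parameters.

```c
enum { TN = 8, TILE = 4 };

/* The two outer loops step over tiles; the two inner loops visit the
   elements of one TILE x TILE tile, so the parts of src and dst that
   are touched together stay small enough to remain in cache. */
void transpose_tiled(int dst[TN][TN], int src[TN][TN]) {
    for (int ii = 0; ii < TN; ii += TILE)
        for (int jj = 0; jj < TN; jj += TILE)
            for (int i = ii; i < ii + TILE && i < TN; i++)
                for (int j = jj; j < jj + TILE && j < TN; j++)
                    dst[j][i] = src[i][j];
}
```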
Loop Reordering Transformations (6)
Loop Distribution (also called loop fission or loop splitting) breaks a single loop into several loops, each iterating over the same range but executing a subset of the original body.
It is used to:
create perfect loop nests;
create sub-loops with fewer dependences;
improve instruction cache and instruction TLB locality due to shorter loop bodies;
reduce memory requirements by iterating over fewer arrays; and
increase register re-use by decreasing register pressure.
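A sketch of distribution used to create a sub-loop with fewer dependences. The function names are hypothetical, and the arrays are assumed not to overlap.

```c
/* Before: one loop holds a statement with a loop-carried dependence
   (S1) and a statement with none (S2). */
void both_before(int *a, int *b, const int *c, int n) {
    for (int i = 1; i < n; i++) {
        a[i] = a[i - 1] + c[i];   /* S1: depends on the previous iteration */
        b[i] = c[i] * 2;          /* S2: independent across iterations */
    }
}

/* After distribution: the second loop carries no dependence and
   becomes a candidate for vectorization or parallelization. */
void both_after(int *a, int *b, const int *c, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + c[i];
    for (int i = 1; i < n; i++)
        b[i] = c[i] * 2;
}
```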
Loop Reordering Transformations (7)
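Loop Fusion (also called loop merging) combines adjacent loops with the same iteration range into a single loop; it is the inverse of loop distribution.
It can improve performance by:
reducing loop overhead;
increasing instruction parallelism;
improving register, vector, data cache, TLB, or page locality; and
improving the load balance of parallel loops.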
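A minimal sketch of fusion, with hypothetical function names and non-overlapping arrays assumed. The fusion is legal because the second statement only reads the element of a produced in the same iteration.

```c
/* Before: a is written by one loop and re-read by the next, by which
   time its early elements may have left the cache or registers. */
void two_loops(int *a, int *c, const int *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1;
    for (int i = 0; i < n; i++)
        c[i] = a[i] * 2;
}

/* After fusion: a[i] is consumed right after it is produced,
   and the loop overhead is paid once instead of twice. */
void fused_loop(int *a, int *c, const int *b, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + 1;
        c[i] = a[i] * 2;
    }
}
```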
Loop Restructuring Transformations
These transformations change the structure of the loop, but leave the computations performed by an iteration of the loop body and their relative order unchanged.
Loop Restructuring Transformations (1)
Loop Unrolling replicates the body of a loop some number of times, called the unrolling factor (u), and iterates by step u instead of step 1.
It is a fundamental technique for generating the long instruction sequences required by VLIW machines.
Unrolling can improve performance by:
reducing loop overhead;
increasing instruction parallelism; and
improving register, data cache, or TLB locality.
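A sketch of unrolling with u = 4. The function name is hypothetical; the second loop is the usual remainder loop for trip counts that are not a multiple of u.

```c
/* Sum of an array, unrolled by a factor of 4. */
long sum_unrolled(const int *a, int n) {
    long s = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {   /* main loop: step u = 4 */
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)            /* remainder: at most 3 iterations */
        s += a[i];
    return s;
}
```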
Loop Restructuring Transformations (2)
Software Pipelining is another technique for improving instruction parallelism.
In software pipelining, the operations of a single loop iteration are broken into s stages, and a single iteration of the pipelined loop performs stage 1 from iteration i, stage 2 from iteration i-1, etc. Startup code must be generated before the loop to initialize the pipeline for the first s-1 iterations, and cleanup code must be generated after the loop to drain it for the last s-1 iterations.
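A hand-written sketch of a two-stage (s = 2) software pipeline at source level. A real compiler does this on machine instructions; the function names and the split of the body into "multiply" and "add and store" stages are assumptions for illustration, and a and b are assumed not to overlap.

```c
/* Original loop: stage 1 computes t, stage 2 consumes it. */
void plain(int *b, const int *a, int n) {
    for (int i = 0; i < n; i++) {
        int t = a[i] * 2;   /* stage 1 */
        b[i] = t + 1;       /* stage 2 */
    }
}

/* Pipelined loop: each iteration runs stage 1 of iteration i together
   with stage 2 of iteration i-1, so the two stages are independent. */
void pipelined(int *b, const int *a, int n) {
    if (n <= 0)
        return;
    int t = a[0] * 2;              /* startup: stage 1 of iteration 0 */
    for (int i = 1; i < n; i++) {
        int t_next = a[i] * 2;     /* stage 1 of iteration i */
        b[i - 1] = t + 1;          /* stage 2 of iteration i-1 */
        t = t_next;
    }
    b[n - 1] = t + 1;              /* cleanup: stage 2 of the last iteration */
}
```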
Loop unrolling vs. software pipelining
The difference between unrolling and software pipelining: unrolling reduces overhead, while pipelining reduces the startup cost of each iteration.
Loop Restructuring Transformations (3)
Loop Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable.
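A minimal sketch of coalescing a two-deep nest. The sizes and function names are hypothetical; the original indices i and j are recovered from the single induction variable t by division and remainder.

```c
enum { CR = 3, CC = 4 };

/* Original loop nest. */
void init_nested(int a[CR][CC]) {
    for (int i = 0; i < CR; i++)
        for (int j = 0; j < CC; j++)
            a[i][j] = i + j;
}

/* Coalesced: one loop of CR * CC iterations, which is easier to
   divide evenly among processors than CR outer iterations. */
void init_coalesced(int a[CR][CC]) {
    for (int t = 0; t < CR * CC; t++) {
        int i = t / CC;   /* original outer index */
        int j = t % CC;   /* original inner index */
        a[i][j] = i + j;
    }
}
```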
Loop Restructuring Transformations (4)
Loop Collapsing is a simpler, more efficient, but less general version of coalescing in which the number of dimensions of the array is actually reduced.
Collapsing eliminates the overhead of multiple nested loops and multi-dimensional array indexing.
Loop Restructuring Transformations (5)
Loop Peeling: a small number of iterations are removed from the beginning or end of the loop and executed separately.
Peeling has two uses: for removing dependences created by the first or last few loop iterations, thereby enabling parallelization; and for matching the iteration control of adjacent loops to enable fusion.
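The first use can be sketched as follows. In the original loop the first iteration writes b[0], which every later iteration reads; peeling that iteration removes the dependence. The function names are hypothetical.

```c
/* Before: iteration 0 writes b[0]; iterations 1..n-1 read it,
   so the loop carries a flow dependence from its first iteration. */
void add_first_before(int *b, int n) {
    for (int i = 0; i < n; i++)
        b[i] = b[i] + b[0];
}

/* After peeling: the first iteration runs separately, and b[0] is
   invariant in the remaining loop, whose iterations are independent. */
void add_first_after(int *b, int n) {
    if (n > 0)
        b[0] = b[0] + b[0];          /* peeled first iteration */
    for (int i = 1; i < n; i++)
        b[i] = b[i] + b[0];
}
```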
Loop Replacement Transformations
These transformations operate on whole loops and completely alter their structure.
Reduction Recognition: A reduction is an operation that computes a scalar value from an array. Common reductions include computing either the sum or the maximum value of the elements in an array.
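Once a loop is recognized as a sum reduction, one possible replacement (an assumption here, not necessarily what any given compiler emits) is to accumulate into several independent partial sums and combine them at the end. The function names are hypothetical. This relies on addition being associative, which holds for integers without overflow but only approximately for floating point.

```c
/* Serial reduction: every iteration depends on the previous value of s. */
long sum_serial(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Reduction with four partial sums: the four accumulators are
   independent, so they could live in vector lanes or on separate
   processors; they are combined once after the loop. */
long sum_partial(const int *a, int n) {
    long p[4] = {0, 0, 0, 0};
    for (int i = 0; i < n; i++)
        p[i % 4] += a[i];
    return p[0] + p[1] + p[2] + p[3];
}
```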
Loop Replacement Transformations (2)
Array Statement Scalarization
When a loop is expressed in array notation, the compiler can either convert it into vector operations or scalarize it into one or more serial loops. However, the conversion is not completely straightforward because array notation requires that the operation be performed as if every value on the right-hand side and every sub-expression on the left-hand side were computed before any assignments are performed.
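To illustrate why the conversion is not straightforward, consider an array statement that assigns twice the old values of a[0..n-2] to a[1..n-1]. The sketch below uses hypothetical function names; reversing the loop is one way to preserve the array semantics here, and copying through a temporary array is another.

```c
/* Naive scalarization: WRONG for this statement, because a[i-1] has
   already been overwritten by the time iteration i reads it. */
void shift_naive(int *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = 2 * a[i - 1];
}

/* Correct scalarization: running the loop backwards guarantees that
   every right-hand-side value is read before it is overwritten. */
void shift_reversed(int *a, int n) {
    for (int i = n - 1; i >= 1; i--)
        a[i] = 2 * a[i - 1];
}
```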
Memory Access Transformations
The CPU and DRAM operate at different speeds, so memory access patterns strongly affect performance.
Factors affecting memory performance include:
Re-use, denoted by Q and QC, the ratio of uses of an item to the number of times it is loaded;
Parallelism. Vector machines often divide memory into banks, allowing vector registers to be loaded in a parallel or pipelined fashion.
Working Set Size. If all the memory elements accessed inside of a loop do not fit in the data cache, then items that will be accessed in later iterations may be flushed, decreasing QC.
Memory system performance can be improved using loop interchange, loop tiling, loop unrolling, loop fusion, and various optimizations that eliminate register saves at procedure calls.
Memory Access Transformations (1)
Array Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays.
Padding is used to ameliorate a number of memory system conflicts, in particular:
bank conflicts on vector machines with banked memory;
cache set or TLB set conflicts;
cache misses; and
false sharing of cache lines on shared-memory multiprocessors.
With cache conflicts, later references can evict lines loaded by the earlier references, precluding re-use.
Memory Access Transformations (2)
Scalar Expansion
Loops often contain variables that are used as temporaries within the loop body. Such variables will create an anti-dependence from one iteration to the next, and will have no other loop-carried dependences. Allocating one temporary for each iteration removes the dependence and makes the loop a candidate for parallelization.
Scalar expansion can also increase instruction-level parallelism by removing dependences.
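A minimal sketch of scalar expansion. The function names are hypothetical, and the expanded temporary is passed in as a caller-provided scratch array tmp of at least n elements, which is an assumption made to keep the example free of allocation.

```c
/* Before: every iteration reuses the single scalar t, which creates
   an anti-dependence between consecutive iterations. */
void square_sum_before(int *c, const int *a, const int *b, int n) {
    int t;
    for (int i = 0; i < n; i++) {
        t = a[i] + b[i];
        c[i] = t * t;
    }
}

/* After: t is expanded into tmp[i], one temporary per iteration,
   so no iteration touches storage used by another. */
void square_sum_after(int *c, const int *a, const int *b, int *tmp, int n) {
    for (int i = 0; i < n; i++) {
        tmp[i] = a[i] + b[i];
        c[i] = tmp[i] * tmp[i];
    }
}
```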
Partial Evaluation
Partial evaluation refers to the general technique of performing part of a computation at compile time.
Constant propagation is one of the most important optimizations that a compiler can perform and a good optimizing compiler will apply it aggressively. Programs typically contain many constants; by propagating them through the program, the compiler can do a significant amount of pre-computation. The propagation reveals many opportunities for other optimizations.
Constant folding is a companion to constant propagation: when an expression contains an operation with constant values as operands, the compiler can replace the expression with the result.
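Constant propagation and folding can be sketched on a toy function; the names are hypothetical. Propagating w = 4 makes h = 4 * 2, which folds to 8, and the return expression 4 * 8 + 1 then folds to 33.

```c
/* Before: all operands are known at compile time. */
int area_before(void) {
    int w = 4;
    int h = w * 2;
    return w * h + 1;
}

/* After constant propagation and folding: no arithmetic is left
   to perform at run time. */
int area_after(void) {
    return 33;
}
```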
Partial Evaluation (1)
Forward substitution is a generalization of copy propagation. The use of a variable is replaced by its defining expression, which must be live at that point.
Forward substitution can expose further optimizations, such as parallel reduction.
Partial Evaluation (2)
Strength Reduction: replace an expensive operator with an equivalent less expensive operator.
Redundancy Elimination
These optimizations improve performance by identifying redundant computations and removing them.
Redundancy-eliminating transformations remove two kinds of computations: those that are unreachable and those that are useless.
A computation is unreachable if it is never executed; removing it from the program will have no semantic effect on the instructions executed.
A computation is useless if none of the outputs of the program are dependent on it.
Redundancy Elimination (1)
Unreachable Code Elimination
Useless Code Elimination
Dead Variable Elimination
Common Sub-expression Elimination
Procedure Call Transformations
These optimizations attempt to reduce the overhead of procedure calls in one of four ways:
eliminating the call entirely;
eliminating execution of the called procedure's body;
eliminating some of the entry/exit overhead; and
avoiding some steps in making a procedure call when the behavior of the called procedure is known or can be altered.
Procedure Call Transformations (1)
Procedure Inlining replaces a procedure call with a copy of the body of the called procedure.
When a call is inlined, all the overhead for the invocation is eliminated.
After the call is inlined, the compiler may be able to prove loop independence, thereby allowing vectorization or parallelization.
Inlining also affects the instruction cache behavior of the program.
[Figure, panel (b): foo after parameter promotion on max]
Procedure Call Transformations (2)