Page 1: High-Level Transformations for Embedded Computing

High-Level Transformations for Embedded Computing

Page 2: High-Level Transformations for Embedded Computing

Organization of a hypothetical optimizing compiler

Page 3: High-Level Transformations for Embedded Computing

DEPENDENCE ANALYSIS

A dependence is a relationship between two computations that places constraints on their execution order.

Dependence analysis identifies these constraints, which are then used to determine whether a particular transformation can be applied without changing the semantics of the computation.

Types of dependences: (i) control dependence and (ii) data dependence.

Control Dependence: a statement is control dependent on an earlier statement when the outcome of the earlier statement determines whether the later one executes.

Data Dependence: two statements have a data dependence if they cannot be executed simultaneously due to conflicting uses of the same variable.

Page 4: High-Level Transformations for Embedded Computing

Types of Data Dependences (1)

Flow dependence (also called true dependence): the second statement reads a value written by the first.
    S1: a = c*10
    S2: d = 2*a + c

Anti-dependence: the second statement writes a variable that the first reads.
    S1: e = f*4 + g
    S2: g = 2*h

Page 5: High-Level Transformations for Embedded Computing

Types of Data Dependences (2)

Output dependence: both statements write the same variable.
    S1: a = b*c
    S2: a = d+e

Input dependence: two accesses to the same memory location are both reads.

Dependence Graph: nodes represent statements, and arcs represent dependences between computations.

Page 6: High-Level Transformations for Embedded Computing

Loop Dependence Analysis

To compute dependence information for loops, the key problem is understanding the use of arrays; scalar variables are relatively easy to manage. To track array behavior, the compiler must analyze the subscript expressions in each array reference. To discover whether there is a dependence in the loop nest, it is sufficient to determine whether any of the iterations can write a value that is read or written by any of the other iterations.

Page 7: High-Level Transformations for Embedded Computing

TRANSFORMATIONS

- Data-Flow Based Loop Transformations
- Loop Reordering
- Loop Restructuring
- Loop Replacement Transformations
- Memory Access Transformations
- Partial Evaluation
- Redundancy Elimination
- Procedure Call Transformations

Page 8: High-Level Transformations for Embedded Computing

Data-Flow Based Loop Transformations (1)

A number of classical loop optimizations are based on data-flow analysis, which tracks the flow of data through a program's variables.

Loop-based Strength Reduction: reduction in strength replaces an expression in a loop with one that is equivalent but uses a less expensive operator. It is commonly applied to induction-variable expressions, as in the sketch below.
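
A minimal C sketch of the idea (the loop and variable names are illustrative, not from the slides; surrounding declarations assumed):

    /* Before: the induction-variable expression i*4 costs a multiply
       on every iteration. */
    for (int i = 0; i < n; i++)
        a[i] = i * 4;

    /* After loop-based strength reduction: the multiply is replaced
       by a cheaper running addition. */
    int t = 0;
    for (int i = 0; i < n; i++) {
        a[i] = t;
        t += 4;
    }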

Page 9: High-Level Transformations for Embedded Computing

Data-Flow Based Loop Transformations (2)

Loop-invariant Code Motion: when a computation appears inside a loop but its result does not change between iterations, the compiler can move that computation outside the loop. It is most profitable for expensive operations; see the sketch below.
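
An illustrative C sketch (assuming c and d do not change inside the loop; declarations omitted):

    /* Before: the loop-invariant expression c/d is recomputed on
       every iteration. */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c / d;

    /* After loop-invariant code motion: the expensive division is
       computed once, outside the loop. */
    double t = c / d;
    for (int i = 0; i < n; i++)
        a[i] = b[i] + t;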

Page 10: High-Level Transformations for Embedded Computing

Data-Flow Based Loop Transformations (3)

Loop Unswitching is applied when a loop contains a conditional with a loop-invariant test condition. The loop is then replicated inside each branch of the conditional, saving the overhead of conditional branching inside the loop, reducing the code size of the loop body, and possibly enabling the parallelization of a branch of the conditional.
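
A C sketch with an assumed loop-invariant flag:

    /* Before: the invariant test on flag executes n times. */
    for (int i = 0; i < n; i++) {
        if (flag)
            a[i] = b[i] + c[i];
        else
            a[i] = b[i] - c[i];
    }

    /* After loop unswitching: the loop is replicated in each branch
       and the test executes only once. */
    if (flag) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    } else {
        for (int i = 0; i < n; i++)
            a[i] = b[i] - c[i];
    }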

Page 11: High-Level Transformations for Embedded Computing

Loop Reordering Transformations change the relative order of execution of the iterations of a loop nest or nests. They expose parallelism and improve memory locality.

Page 12: High-Level Transformations for Embedded Computing

Loop Reordering Transformations (1)

Loop Interchange can:
- enable vectorization by interchanging an inner, dependent loop with an outer, independent loop;
- improve vectorization by moving the independent loop with the largest range into the innermost position;
- improve parallel performance by moving an independent loop outwards in a loop nest to increase the granularity of each iteration and reduce the number of barrier synchronizations;
- reduce stride, ideally to stride 1 (see the sketch after this list); and
- increase the number of loop-invariant expressions in the inner loop.
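
A stride-reduction sketch in C, assuming a row-major n-by-n array:

    /* Before: the inner loop walks down a column, so consecutive
       accesses are a whole row apart in memory. */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            a[i][j] = 0.0;

    /* After loop interchange: the inner loop has stride 1. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i][j] = 0.0;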

Page 13: High-Level Transformations for Embedded Computing

Loop Reordering Transformations (2)

Loop Skewing skews the execution of the inner loop's iterations relative to the outer loop, by adding a multiple of the outer index (the skewing factor, "i" in the slide's figure) to the inner loop's bounds and subscripts. It is useful as an enabler for Loop Interchange, after which the inner iterations can run in parallel.
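
A sketch of skewing with factor 1 (array and bounds are illustrative): substituting jj = j + i shifts the inner iteration space without changing what is computed.

    /* Before: dependences run along both i and j. */
    for (int i = 1; i < n; i++)
        for (int j = 1; j < m; j++)
            a[i][j] = a[i-1][j] + a[i][j-1];

    /* After skewing by i (inner index jj = j + i): the same
       computation, with the inner loop shifted by i each time. */
    for (int i = 1; i < n; i++)
        for (int jj = i + 1; jj < m + i; jj++)
            a[i][jj-i] = a[i-1][jj-i] + a[i][jj-i-1];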

Page 14: High-Level Transformations for Embedded Computing

Loop Reordering Transformations (3)

Loop Reversal changes the direction in which the loop traverses its iteration range. It is often used in conjunction with other iteration-space reordering transformations because it changes the dependence vectors.
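
A minimal C sketch:

    /* Before: the loop runs upward over its range. */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1.0;

    /* After loop reversal: the same iterations, executed in the
       opposite direction. */
    for (int i = n - 1; i >= 0; i--)
        a[i] = b[i] + 1.0;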

Page 15: High-Level Transformations for Embedded Computing

Loop Reordering Transformations (4)

Strip Mining executes a specific number of iterations at a time, in a parallel fashion. It is a method of adjusting the granularity of an operation, especially a parallelizable operation. The slide's figure shows strips of 64 parallel iterations.
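
A C sketch matching that strip size (declarations assumed):

    /* Before: a single loop over n iterations. */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    /* After strip mining: the outer loop steps over strips; each
       strip of up to 64 iterations can be issued as one unit
       (e.g., a vector operation). */
    for (int is = 0; is < n; is += 64) {
        int stop = (is + 64 < n) ? is + 64 : n;
        for (int i = is; i < stop; i++)
            a[i] = b[i] + c[i];
    }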

Page 16: High-Level Transformations for Embedded Computing

Loop Reordering Transformations (5)

Tiling is the multi-dimensional generalization of strip mining. Tiling (also called blocking) is primarily used to improve cache reuse (QC) by dividing an iteration space into tiles and transforming the loop nest to iterate over them; a sketch follows.
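
A tiling sketch in C; the tile size TS is an assumed tuning parameter:

    #define TS 32   /* assumed tile size, tuned to the cache */

    /* Before: a transpose-like nest in which b's cache lines are
       evicted before they are reused. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i][j] = b[j][i];

    /* After tiling: each TS x TS tile of b stays in cache while it
       is being consumed. */
    for (int ii = 0; ii < n; ii += TS)
        for (int jj = 0; jj < n; jj += TS)
            for (int i = ii; i < ii + TS && i < n; i++)
                for (int j = jj; j < jj + TS && j < n; j++)
                    a[i][j] = b[j][i];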

Page 17: High-Level Transformations for Embedded Computing

Loop Reordering Transformations (6)

Loop Distribution (also called loop fission or loop splitting) breaks a single loop into many. It is used to:
- create perfect loop nests;
- create sub-loops with fewer dependences (see the sketch after this list);
- improve instruction cache and instruction TLB locality due to shorter loop bodies;
- reduce memory requirements by iterating over fewer arrays; and
- increase register re-use by decreasing register pressure.
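
An illustrative C sketch (declarations assumed):

    /* Before: an independent statement and a recurrence share one
       loop, so the whole loop is serialized. */
    for (int i = 1; i < n; i++) {
        a[i] = b[i] + c;         /* independent across iterations */
        x[i] = x[i-1] + a[i];    /* loop-carried recurrence */
    }

    /* After loop distribution: the first loop is parallelizable or
       vectorizable; only the second carries a dependence. */
    for (int i = 1; i < n; i++)
        a[i] = b[i] + c;
    for (int i = 1; i < n; i++)
        x[i] = x[i-1] + a[i];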

Page 18: High-Level Transformations for Embedded Computing

Loop Reordering Transformations (7)

Loop Fusion (loop merging) combines adjacent loops into one (see the sketch after this list). It can improve performance by:
- reducing loop overhead;
- increasing instruction parallelism;
- improving register, vector, data cache, TLB, or page locality; and
- improving the load balance of parallel loops.
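
A C sketch:

    /* Before: two loops over the same range; each b[i] written by
       the first loop is re-loaded from memory by the second. */
    for (int i = 0; i < n; i++)
        b[i] = a[i] + 1.0;
    for (int i = 0; i < n; i++)
        c[i] = b[i] * 2.0;

    /* After loop fusion: one traversal, and each b[i] is still in a
       register when it is reused. */
    for (int i = 0; i < n; i++) {
        b[i] = a[i] + 1.0;
        c[i] = b[i] * 2.0;
    }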

Page 19: High-Level Transformations for Embedded Computing

Loop Restructuring Transformations are transformations that change the structure of the loop, but leave the computations performed by an iteration of the loop body, and their relative order, unchanged.

Page 20: High-Level Transformations for Embedded Computing

Loop Restructuring Transformations (1)

Loop Unrolling replicates the body of a loop some number of times, called the unrolling factor (u), and iterates by step u instead of step 1. It is a fundamental technique for generating the long instruction sequences required by VLIW machines. A sketch with u = 4 follows this list.

Unrolling can improve performance by:
- reducing loop overhead;
- increasing instruction parallelism; and
- improving register, data cache, or TLB locality.
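
A C sketch, assuming n is a multiple of 4 (a cleanup loop would handle the remainder otherwise):

    /* Before: one element per iteration, plus branch and increment
       overhead every time around. */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];

    /* After unrolling by u = 4: the overhead is paid once per four
       elements, and the four independent adds can issue in parallel. */
    for (int i = 0; i < n; i += 4) {
        a[i]   = a[i]   + b[i];
        a[i+1] = a[i+1] + b[i+1];
        a[i+2] = a[i+2] + b[i+2];
        a[i+3] = a[i+3] + b[i+3];
    }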

Page 21: High-Level Transformations for Embedded Computing

Loop Restructuring Transformations (2)

Software Pipelining is another transformation that improves instruction parallelism. The operations of a single loop iteration are broken into s stages, and a single iteration of the transformed loop performs stage 1 from iteration i, stage 2 from iteration i-1, and so on. Startup code must be generated before the loop to fill the pipeline for the first s-1 iterations, and cleanup code after the loop to drain it for the last s-1 iterations.
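
A two-stage (s = 2) sketch in C, assuming n >= 1; the load of b runs one iteration ahead of the multiply and store:

    /* Source loop: each iteration loads, computes, and stores. */
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c;

    /* Software-pipelined version. */
    double t = b[0];                  /* startup: fill the pipeline  */
    for (int i = 0; i < n - 1; i++) {
        double next = b[i + 1];       /* stage 1 of iteration i + 1  */
        a[i] = t * c;                 /* stage 2 of iteration i      */
        t = next;
    }
    a[n - 1] = t * c;                 /* cleanup: drain the pipeline */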

Page 22: High-Level Transformations for Embedded Computing

Loop unrolling vs. software pipelining

The difference between unrolling and software pipelining: unrolling reduces overhead, while pipelining reduces the startup cost of each iteration.

Page 23: High-Level Transformations for Embedded Computing

Loop Restructuring Transformations (3)

Loop Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable.
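
A C sketch:

    /* Before: a doubly nested loop over an n x m iteration space. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i][j] = 0.0;

    /* After loop coalescing: one loop of n*m iterations; the
       original indices are recovered from the single induction
       variable k. */
    for (int k = 0; k < n * m; k++) {
        int i = k / m;
        int j = k % m;
        a[i][j] = 0.0;
    }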

Page 24: High-Level Transformations for Embedded Computing

Loop Restructuring Transformations (4)

Loop Collapsing is a simpler, more efficient, but less general version of coalescing in which the number of dimensions of the array is actually reduced. Collapsing eliminates the overhead of multiple nested loops and of multi-dimensional array indexing.
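
Continuing the previous sketch (valid because a C array is contiguous in memory):

    /* After loop collapsing: the n x m array is addressed as one
       linear array, so no index recomputation is needed at all. */
    double *p = &a[0][0];
    for (int k = 0; k < n * m; k++)
        p[k] = 0.0;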

Page 25: High-Level Transformations for Embedded Computing

Loop Restructuring Transformations (5)

Loop Peeling: a small number of iterations are removed from the beginning or end of the loop and executed separately.

Peeling has two uses: removing dependences created by the first or last few loop iterations, thereby enabling parallelization (see the sketch below); and matching the iteration control of adjacent loops to enable fusion.
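
A C sketch:

    /* Before: iteration i == 0 both writes and reads a[0], creating
       a dependence that serializes the whole loop. */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + a[0];

    /* After peeling the first iteration: the remaining loop only
       reads a[0] and is fully parallel. */
    a[0] = a[0] + a[0];
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[0];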

Page 26: High-Level Transformations for Embedded Computing

Loop Replacement Transformations

Transformations that operate on whole loops and completely alter their structure.

Reduction Recognition: a reduction is an operation that computes a scalar value from an array. Common reductions include computing either the sum or the maximum value of the elements in an array.
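
A sum reduction in C; once this pattern is recognized, the compiler can substitute a vector or tree-structured parallel reduction:

    /* The scalar s accumulates across all iterations. */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];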

Page 27: High-Level Transformations for Embedded Computing

Loop Replacement Transformations (2)

Array Statement Scalarization: when a loop is expressed in array notation, the compiler can either convert it into vector operations or scalarize it into one or more serial loops. However, the conversion is not completely straightforward, because array notation requires that the operation be performed as if every value on the right-hand side and every sub-expression on the left-hand side were computed before any assignments are performed.
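
A sketch of why this matters, for the array statement a[1:n-1] = a[0:n-2] (illustrative notation) rendered in C:

    /* WRONG scalarization: each iteration reads a value the previous
       iteration already overwrote, so a[0] is propagated everywhere. */
    for (int i = 1; i < n; i++)
        a[i] = a[i-1];

    /* Correct scalarization: reversing the loop makes every read
       happen before the write that would clobber it (a temporary
       array would also work). */
    for (int i = n - 1; i >= 1; i--)
        a[i] = a[i-1];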

Page 28: High-Level Transformations for Embedded Computing

Memory Access Transformations

These transformations address the speed gap between the CPU and DRAM. Factors affecting memory performance include:
- Re-use, denoted by Q and QC: the ratio of uses of an item to the number of times it is loaded;
- Parallelism: vector machines often divide memory into banks, allowing vector registers to be loaded in a parallel or pipelined fashion;
- Working-set size: if all the memory elements accessed inside a loop do not fit in the data cache, then items that will be accessed in later iterations may be flushed, decreasing QC.

Memory system performance can be improved using loop interchange (6.2.1), loop tiling (6.2.6), loop unrolling (6.3.1), loop fusion (6.2.8), and various optimizations that eliminate register saves at procedure calls (6.8).

Page 29: High-Level Transformations for Embedded Computing

Memory Access Transformations (1)

Array Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays.

Padding is used to ameliorate a number of memory system conflicts, in particular:
- bank conflicts on vector machines with banked memory;
- cache set or TLB set conflicts;
- cache misses that occur when a reference evicts lines loaded by earlier references, precluding re-use; and
- false sharing of cache lines on shared-memory multiprocessors.
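
An intra-array padding sketch in C (the sizes are assumed for illustration):

    /* Before: the row stride is a power of two, so walking down a
       column maps every element to the same few cache sets. */
    double a[1024][1024];

    /* After padding with one unused column: the stride becomes 1025,
       spreading column accesses across all cache sets. */
    double a_padded[1024][1025];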

Page 30: High-Level Transformations for Embedded Computing

Memory Access Transformations (2)

Scalar Expansion: loops often contain variables that are used as temporaries within the loop body. Such variables will create an anti-dependence from one iteration to the next, and will have no other loop-carried dependences. Allocating one temporary for each iteration removes the dependence and makes the loop a candidate for parallelization.

Scalar expansion can also increase instruction-level parallelism by removing dependences.
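
A C sketch; t is the scalar temporary, and tv is an assumed compiler-introduced temporary array:

    /* Before: the scalar t carries an anti-dependence between
       iterations, serializing the loop. */
    for (int i = 0; i < n; i++) {
        t = a[i] + b[i];
        c[i] = t * t;
    }

    /* After scalar expansion: each iteration owns tv[i], so the
       iterations are independent and can run in parallel. */
    for (int i = 0; i < n; i++) {
        tv[i] = a[i] + b[i];
        c[i] = tv[i] * tv[i];
    }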

Page 31: High-Level Transformations for Embedded Computing

Partial Evaluation Partial evaluation refers to the general technique of performing part of a computation at compile time. Constant propagation is one of the most important optimizations that a compiler can perform and a good optimizing compiler will apply it aggressively. Programs typically contain many constants; by propagating them through the program, the compiler can do a significant amount of pre-computation. The propagation reveals many opportunities for other optimizations.

Constant folding is a companion to constant propagation: when an expression contains an operation with constant values as operands, the compiler can replace the expression with the result.
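
A combined C sketch of propagation and folding (names illustrative):

    /* Before: n is known to be a constant at this point. */
    int n = 64;
    for (int i = 0; i < n * 4; i++)
        a[i] = 0;

    /* After constant propagation (n -> 64) and constant folding
       (64 * 4 -> 256): */
    for (int i = 0; i < 256; i++)
        a[i] = 0;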

Page 32: High-Level Transformations for Embedded Computing

Partial Evaluation (1)

Forward Substitution is a generalization of copy propagation: the use of a variable is replaced by its defining expression, which must be live at that point. (In the slide's example it is combined with a parallel reduction optimization.)
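
A C sketch of forward substitution on a loop bound:

    /* Before: the bound is hidden behind m. */
    int m = n + 2;
    for (int i = 0; i < m; i++)
        a[i] = a[i] + 1.0;

    /* After forward substitution: the defining expression replaces
       the use, which can sharpen later dependence analysis. */
    for (int i = 0; i < n + 2; i++)
        a[i] = a[i] + 1.0;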

Page 33: High-Level Transformations for Embedded Computing

Partial Evaluation (2)

Strength Reduction: replace an expensive operator with an equivalent, less expensive operator.
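
A classic C instance (valid for non-negative x, since left-shifting a negative value is undefined in C):

    y = x * 8;    /* Before: integer multiply. */
    y = x << 3;   /* After strength reduction: an equivalent shift. */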

Page 34: High-Level Transformations for Embedded Computing

Redundancy Elimination

These optimizations improve performance by identifying redundant computations and removing them. Redundancy-eliminating transformations remove two kinds of computations: those that are unreachable and those that are useless.

A computation is unreachable if it is never executed; removing it from the program will have no semantic effect on the instructions executed.

A computation is useless if none of the outputs of the program are dependent on it.

Page 35: High-Level Transformations for Embedded Computing

Redundancy Elimination (1)

- Unreachable Code Elimination
- Useless Code Elimination
- Dead Variable Elimination
- Common Sub-expression Elimination
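
A common sub-expression elimination sketch in C (declarations assumed):

    /* Before: b + c is computed twice. */
    x = (b + c) + 1;
    y = (b + c) * 2;

    /* After CSE: the shared sub-expression is computed once. */
    t = b + c;
    x = t + 1;
    y = t * 2;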

Page 36: High-Level Transformations for Embedded Computing

Procedure Call Transformations

These optimizations attempt to reduce the overhead of procedure calls in one of four ways:
- eliminating the call entirely;
- eliminating execution of the called procedure's body;
- eliminating some of the entry/exit overhead; and
- avoiding some steps in making a procedure call when the behavior of the called procedure is known or can be altered.

Page 37: High-Level Transformations for Embedded Computing

Procedure Call Transformations (1)

Procedure Inlining replaces a procedure call with a copy of the body of the called procedure.

When a call is inlined, all the overhead for the invocation is eliminated. After the call is inlined, the compiler may be able to prove loop independence, thereby allowing vectorization or parallelization. Inlining also affects the instruction cache behavior of the program.

(Slide figure, panel (b): foo after parameter promotion on max.)
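
An inlining sketch in C; scale is a hypothetical helper, not from the slides:

    /* A hypothetical helper procedure. */
    void scale(double *p, double s) { *p = *p * s; }

    /* Before: the call hides whether loop iterations are independent. */
    for (int i = 0; i < n; i++)
        scale(&a[i], 2.0);

    /* After inlining: the body is substituted at the call site; the
       invocation overhead disappears and the loop is visibly parallel. */
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 2.0;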

Page 38: High-Level Transformations for Embedded Computing

Procedure Call Transformations (2)

