9.Ilp Dynamic Sched1
Transcript
Slide 1/23

    CSE502 Computer Architecture

    Dr. Ilchul Yoon ([email protected])

    Slides adapted from:

    Larry Wittie, SBU & John Kubiatowicz, EECS, UC Berkeley

    Lecture 09

    Instruction Level Parallelism

Slide 2/23

    Outline

    - ILP: Instruction Level Parallelism
    - Compiler techniques to increase ILP
    - Loop Unrolling
    - Static Branch Prediction
    - Dynamic Branch Prediction
    - Overcoming Data Hazards with Dynamic Scheduling
    - (Start) Tomasulo Algorithm
    - Conclusion

Slide 3/23

    Recall from Pipelining Review

    - Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls (see the worked example below)
    - Ideal pipeline CPI: a measure of the maximum performance attainable by the implementation
    - Structural hazards: the hardware cannot support this combination of instructions
    - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
    - Control hazards: caused by the delay between fetching instructions and control-flow decisions (branches and jumps)
      - e.g., in MIPS: j (jump), jal (call), jr (return)
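    A small worked instance of the CPI equation (a sketch; the stall rates below are assumed for illustration, not taken from the lecture):

        #include <stdio.h>

        int main(void) {
            /* Assumed, illustrative stall contributions per instruction
               (in cycles); these numbers are not from the slides. */
            double ideal_cpi  = 1.0;
            double structural = 0.05;
            double data_stall = 0.20;
            double control    = 0.15;

            double cpi = ideal_cpi + structural + data_stall + control;
            printf("Pipeline CPI = %.2f\n", cpi);               /* 1.40 */
            printf("Slowdown vs. ideal = %.2fx\n", cpi / ideal_cpi);
            return 0;
        }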

Slide 4/23

    Instruction Level Parallelism

    - Instruction-Level Parallelism (ILP): overlap the execution of instructions to run programs faster (improve performance)
    - Two approaches to exploiting ILP:
      1. Rely on hardware to discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power)
      2. Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2 (IA-64))

Slide 5/23

    Instruction-Level Parallelism (ILP)

    - ILP within a basic block (BB) is quite small
      - BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
      - Average dynamic branch frequency is 15% to 25%, so only about 1/0.25 = 4 to 1/0.15 ≈ 7 instructions execute between a pair of branches
      - Other problem: instructions in a BB are likely to depend on each other
    - To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (e.g., trace scheduling)

Slide 6/23

    Instruction-Level Parallelism (ILP)

    - Simplest: loop-level parallelism, which exploits parallelism among the iterations of a loop
    - For example (the slide's code is truncated after "for (j=0; j"; the classic vector-add loop completes it):

        for (j=0; j<1000; j=j+1)
            x[j] = x[j] + y[j];

      Here every iteration is independent of every other, so the iterations can overlap.
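    For contrast, a short C sketch (illustrative; the function name and the second array y are assumptions, not from the slide) of a loop whose iterations cannot overlap freely:

        #define N 1000

        /* Loop-carried dependence: iteration j reads x[j-1], which is
           written by iteration j-1, so iterations must logically run
           in order; this loop has little loop-level parallelism. */
        void running_sum(double x[N], const double y[N]) {
            for (int j = 1; j < N; j = j + 1)
                x[j] = x[j-1] + y[j];
        }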

Slide 7/23

    Loop-Level Parallelism

    - Exploit loop-level parallelism by unrolling the loop, either
      1. dynamically, via branch prediction, or
      2. statically, via loop unrolling by the compiler
      (Another way is vectors, to be covered later)
    - Determining dependences is critical! If two instructions are
      - parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
      - dependent, they are not parallel and must be executed in order, although they may often be partially overlapped

Slide 8/23

    Data Dependence and Hazards

    - InstrJ is data dependent (aka true dependence) on InstrI if:
      1. InstrJ tries to read an operand before InstrI writes it

            I: add r1,r2,r3
            J: sub r4,r1,r3

      2. or InstrJ is data dependent on InstrK, which in turn is dependent on InstrI
    - If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
    - A data dependence in the instruction sequence reflects a data dependence in the source code; the effect of the original data dependence must be preserved
    - If a data dependence causes a hazard in a pipeline, it is a Read After Write (RAW) hazard ("RAW is real")

Slide 9/23

    ILP and Data Dependencies, Hazards

    - HW/SW must preserve the illusion of program order: code must give the same results as if instructions were executed sequentially in the original order of the source program
    - Dependences are a property of programs: the presence of a dependence indicates the potential for a hazard, but whether an actual hazard occurs, and the length of any stall, are properties of the pipeline
    - Importance of data dependences:
      1. They indicate the possibility of a hazard
      2. They determine the order in which results must be calculated
      3. They set an upper bound on how much parallelism can possibly be exploited to speed up a program
    - HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program

Slide 10/23

    Name Dependence #1: Anti-dependence

    - Name dependence: two instructions use the same register or memory location (a "name"), but no data flows between the instructions associated with that name; there are two versions of name dependence, which may cause WAR and WAW hazards
    - Anti-dependence: InstrJ writes an operand before InstrI reads it

        I: sub r4,r1,r3
        J: add r1,r2,r3
        K: mul r6,r1,r7

      Called an anti-dependence by compiler writers; it results from reuse of the name r1
    - If an anti-dependence causes a hazard in the pipeline, that's a Write After Read (WAR) hazard

Slide 11/23

    Name Dependence #2: Output dependence

    - Output dependence: InstrJ writes an operand before InstrI writes it

        I: sub r1,r4,r3
        J: add r1,r2,r3
        K: mul r6,r1,r7

      Called an output dependence by compiler writers; this also results from reuse of the name r1
    - If an output dependence causes a hazard in the pipeline, that's a Write After Write (WAW) hazard
    - Instructions involved in a name dependence can execute simultaneously if we rename so that the instructions do not conflict (see the sketch below):
      - register renaming by hardware, or
      - use of different register names by the compiler
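    A source-level C analogue (illustrative, not from the slides) of how renaming removes the conflict; the variable t plays the role of r1:

        /* Name reuse: both I and J write t, an output dependence (WAW);
           reordering J above I would also break K, which reads J's t. */
        double with_reuse(double r2, double r3, double r4, double r7) {
            double t, r6;
            t  = r4 - r3;     /* I: sub r1,r4,r3 */
            t  = r2 + r3;     /* J: add r1,r2,r3 -- overwrites t */
            r6 = t * r7;      /* K: mul r6,r1,r7 */
            return r6;
        }

        /* Renamed: J and K use the fresh name t2; I and J now share no
           name, so they may execute in either order or simultaneously. */
        double with_renaming(double r2, double r3, double r4, double r7) {
            double t, t2, r6;
            t  = r4 - r3;     /* I */
            t2 = r2 + r3;     /* J: writes t2 instead of t */
            r6 = t2 * r7;     /* K: reads t2 */
            (void)t;          /* t is computed but unused in this fragment */
            return r6;
        }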

Slide 12/23

    Control Dependencies

    - Every instruction is control dependent on some set of branches and, in general, these control dependencies must be preserved to preserve program order

        if p1 {
            S1;
        }
        if p2 {
            S2;
        }

    - S1 is control dependent on the proposition p1, and S2 is control dependent on p2, but not on p1

Slide 13/23

    Carefully Violate Control Dependencies

    - Control dependence need NOT always be preserved
      - It can be violated by executing instructions that should not have been executed, provided doing so does NOT affect program results
      - e.g., in speculative execution, HW throws away the results of bad branch guesses
    - Instead, the two properties critical to program correctness are:
      - exception behavior, and
      - data flow

Slide 14/23

    Exception Behavior Is Important

    - Preserving exception behavior: any change in instruction execution order must NOT change how exceptions are raised in the program (i.e., no new exceptions)
    - Example (this code assumes branches are not delayed):

        DADDU R2,R3,R4
        BEQZ  R2,L1
        LW    R1,-1(R2)
        L1:

    - What is the problem with moving LW before BEQZ?
      - e.g., array overflow: what if R2 = 0, so the address -1 + R2 is out of the program's memory bounds? The hoisted load could raise an exception the original program never raises (see the C sketch below)
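    A C analogue (illustrative; the function and parameter names are assumptions, not from the slide) of why the load must stay behind the branch:

        /* The branch guards the load: when r2 == 0 the load never runs.
           Hoisting the load above the test would access base[r2 - 1]
           even when r2 == 0 (i.e., base[-1]), possibly raising an
           exception the original program never raises. */
        int guarded_load(const int *base, long r2) {
            int r1 = 0;
            if (r2 != 0)
                r1 = base[r2 - 1];
            return r1;
        }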

Slide 15/23

    Data Flow Of Values Must Be Preserved

    - Data flow: the actual flow of data values from instructions that produce results to those that consume them
      - Branches make the flow dynamic (we know the details only at run time); we must determine which instruction is the data supplier
    - Example:

        DADDU R1,R2,R3
        BEQZ  R4,L
        DSUBU R1,R5,R6
        L:
        OR    R7,R1,R8

    - Does OR depend on DADDU or DSUBU? It depends on the branch outcome, so compilers and HW must preserve the data flow during execution (see the C sketch below)
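    The same choice written in C (illustrative; the function and parameter names are assumptions, not from the slide):

        /* BEQZ R4,L skips the DSUBU exactly when r4 == 0, so the OR may
           consume either DADDU's or DSUBU's result; any reordering must
           keep this data flow intact. */
        long or_result(long r2, long r3, long r4,
                       long r5, long r6, long r8) {
            long r1 = r2 + r3;       /* DADDU R1,R2,R3 */
            if (r4 != 0)
                r1 = r5 - r6;        /* DSUBU R1,R5,R6 (skipped if r4 == 0) */
            return r1 | r8;          /* OR R7,R1,R8 */
        }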

Slide 16/23

    Outline

    - ILP: Instruction Level Parallelism
    - Compiler techniques to increase ILP
    - Loop Unrolling
    - Static Branch Prediction
    - Dynamic Branch Prediction
    - Overcoming Data Hazards with Dynamic Scheduling
    - (Start) Tomasulo Algorithm
    - Conclusion

Slide 17/23

    Software Techniques - Example

    - This code adds a scalar to a vector:

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

    - Assume the following latencies for all examples
    - Ignore delayed branches in these examples

        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1
        Load double                    Store double               0
        Integer op                     Integer op                 0

Slide 18/23

    FP Loop: Where are the Hazards?

    - First, translate into MIPS code. To simplify, assume:
      - 8 is the lowest address
      - F2 holds the scalar s
      - R1 starts with the address of x[1000]

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

        Loop: L.D    F0,0(R1)    ;F0 = vector element
              ADD.D  F4,F0,F2    ;add scalar from F2
              S.D    0(R1),F4    ;store result
              DADDUI R1,R1,-8    ;decrement pointer by 8 bytes (DW)
              BNEZ   R1,Loop     ;branch if R1 != zero

Slide 19/23

    FP Loop Showing Stalls

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

        1 Loop: L.D    F0,0(R1)   ;F0 = vector element
        2       stall
        3       ADD.D  F4,F0,F2   ;add scalar in F2
        4       stall
        5       stall
        6       S.D    0(R1),F4   ;store result
        7       DADDUI R1,R1,-8   ;decrement pointer 8 bytes (DW)
        8       stall             ;assume cannot forward to branch
        9       BNEZ   R1,Loop    ;branch if R1 != zero

        plus branch delay!

        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1

    The loop takes 9 clock cycles per iteration. How can we reorder the code to minimize stalls?

Slide 20/23

    Revised FP Loop Minimizing Stalls

    Before (9 cycles per iteration):

        1 Loop: L.D    F0,0(R1)
        2       stall
        3       ADD.D  F4,F0,F2
        4       stall
        5       stall
        6       S.D    0(R1),F4
        7       DADDUI R1,R1,-8
        8       stall
        9       BNEZ   R1,Loop

    After swapping DADDUI and S.D, and changing the address offset of S.D:

        1 Loop: L.D    F0,0(R1)
        2       DADDUI R1,R1,-8
        3       ADD.D  F4,F0,F2
        4       stall
        5       stall
        6       S.D    8(R1),F4   ;offset altered 0 => 8 because DADDUI moved up
        7       BNEZ   R1,Loop

        Instruction producing result   Instruction using result   Stall cycles between
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1
        Load double                    Store double               0
        Integer op => R                Branch on R                1
        Integer op                     Integer op                 0

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

    The loop now takes 7 clock cycles per iteration:
    - 3 for execution (L.D, ADD.D, S.D)
    - 4 for loop overhead (DADDUI, BNEZ, and two stalls)
    How can we make it faster?

Slide 21/23

    Unroll Original Loop Four Times

    - A straightforward way to save time! (The left-hand numbers are the clock cycles on which each instruction issues.)

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

        1  Loop: L.D    F0,0(R1)
        3        ADD.D  F4,F0,F2     ;1-cycle stall after each L.D
        6        S.D    0(R1),F4     ;2-cycle stall after each ADD.D; drop DADDUI & BNEZ
        7        L.D    F6,-8(R1)
        9        ADD.D  F8,F6,F2
        12       S.D    -8(R1),F8    ;drop DADDUI & BNEZ
        13       L.D    F10,-16(R1)
        15       ADD.D  F12,F10,F2
        18       S.D    -16(R1),F12  ;drop DADDUI & BNEZ
        19       L.D    F14,-24(R1)
        21       ADD.D  F16,F14,F2
        24       S.D    -24(R1),F16
        25       DADDUI R1,R1,#-32   ;decrement altered to 4*8
        27       BNEZ   R1,LOOP      ;1-cycle stall after DADDUI

    Four iterations take 27 clock cycles, or 6.75 per iteration!
    (Assumes the number of loop iterations is a multiple of 4.)
    How can we rewrite the loop to minimize stalls?

Slide 22/23

    Loop Unrolling Detail - Strip Mining

    - We do not usually know the upper bound of a loop
    - Suppose it is n, and we would like to unroll the loop to make k copies of the body
    - Instead of a single unrolled loop, generate a pair of consecutive loops (see the sketch below):
      - the 1st executes (n mod k) times and has a body that is the original loop (this is called strip mining the loop)
      - the 2nd is the unrolled body, surrounded by an outer loop that iterates ⌊n/k⌋ times
    - For large values of n, most of the execution time will be spent in the ⌊n/k⌋ unrolled iterations
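    A C sketch of strip mining (illustrative; the function name and the choice k = 4 are assumptions, not from the slides):

        #define K 4   /* unroll factor k */

        /* Add scalar s to x[n..1], strip-mined: a short cleanup loop
           runs the (n mod K) leftover iterations with the original
           body, then the unrolled loop runs floor(n/K) times. */
        void add_scalar(double *x, int n, double s) {
            int i = n;
            for (; i % K != 0; i = i - 1)     /* 1st loop: n mod K times */
                x[i] = x[i] + s;
            for (; i > 0; i = i - K) {        /* 2nd loop: floor(n/K) times */
                x[i]   = x[i]   + s;
                x[i-1] = x[i-1] + s;
                x[i-2] = x[i-2] + s;
                x[i-3] = x[i-3] + s;
            }
        }

    For n = 1000 and K = 4 the cleanup loop never runs and the unrolled loop runs 250 times, matching the four-times-unrolled loop on the previous slide.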

Slide 23/23

    Unrolled Loop with Minimal Stalls

    Before scheduling (27 cycles; left-hand numbers are issue cycles):

        1  Loop: L.D    F0,0(R1)
        3        ADD.D  F4,F0,F2
        6        S.D    0(R1),F4
        7        L.D    F6,-8(R1)
        9        ADD.D  F8,F6,F2
        12       S.D    -8(R1),F8
        13       L.D    F10,-16(R1)
        15       ADD.D  F12,F10,F2
        18       S.D    -16(R1),F12
        19       L.D    F14,-24(R1)
        21       ADD.D  F16,F14,F2
        24       S.D    -24(R1),F16
        25       DADDUI R1,R1,#-32
        27       BNEZ   R1,LOOP

    After scheduling (loads, adds, and stores grouped):

        1  Loop: L.D    F0,0(R1)
        2        L.D    F6,-8(R1)
        3        L.D    F10,-16(R1)
        4        L.D    F14,-24(R1)
        5        ADD.D  F4,F0,F2
        6        ADD.D  F8,F6,F2
        7        ADD.D  F12,F10,F2
        8        ADD.D  F16,F14,F2
        9        S.D    0(R1),F4
        10       S.D    -8(R1),F8
        11       S.D    -16(R1),F12
        12       DADDUI R1,R1,#-32
        13       S.D    8(R1),F16   ;8 - 32 = -24
        14       BNEZ   R1,LOOP

    Four iterations now take 14 clock cycles, or 3.5 per iteration.

