9.Ilp Dynamic Sched1
Transcript
Slide 1/23

    CSE502 Computer Architecture

    Dr. Ilchul Yoon ([email protected])

    Slides adapted from:

    Larry Wittie, SBU & John Kubiatowicz, EECS, UC Berkeley

    Lecture 09

    Instruction Level Parallelism

Slide 2/23

    Outline

    - ILP: Instruction Level Parallelism
    - Compiler techniques to increase ILP
    - Loop Unrolling
    - Static Branch Prediction
    - Dynamic Branch Prediction
    - Overcoming Data Hazards with Dynamic Scheduling
    - (Start) Tomasulo Algorithm
    - Conclusion

Slide 3/23

    Recall from Pipelining Review

    - Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls (see the worked example below)
    - Ideal pipeline CPI: a measure of the maximum performance attainable by the implementation
    - Structural hazards: the hardware cannot support this combination of instructions
    - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
    - Control hazards: caused by the delay between fetching instructions and control-flow decisions (branches and jumps)
      - e.g., in MIPS: j (jump), jal (call), jr (return)
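    A small worked instance of the CPI equation (a sketch; the stall rates below are assumed for illustration, not taken from the lecture):

        #include <stdio.h>

        int main(void) {
            /* Assumed, illustrative stall contributions per instruction
               (in cycles); these numbers are not from the slides. */
            double ideal_cpi  = 1.0;
            double structural = 0.05;
            double data_stall = 0.20;
            double control    = 0.15;

            double cpi = ideal_cpi + structural + data_stall + control;
            printf("Pipeline CPI = %.2f\n", cpi);               /* 1.40 */
            printf("Slowdown vs. ideal = %.2fx\n", cpi / ideal_cpi);
            return 0;
        }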

Slide 4/23

    Instruction Level Parallelism

    - Instruction-Level Parallelism (ILP): overlap the execution of instructions to run programs faster (improve performance)
    - Two approaches to exploiting ILP:
      1. Rely on hardware to discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power)
      2. Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2 (IA-64))

Slide 5/23

    Instruction-Level Parallelism (ILP)

    - ILP within a basic block (BB) is quite small
      - BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
      - Average dynamic branch frequency is 15% to 25%, so only about 1/0.25 = 4 to 1/0.15 ≈ 7 instructions execute between a pair of branches
      - Other problem: instructions in a BB are likely to depend on each other
    - To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (e.g., trace scheduling)

Slide 6/23

    Instruction-Level Parallelism (ILP)

    - Simplest: loop-level parallelism, which exploits parallelism among the iterations of a loop
    - For example (the slide's code is truncated after "for (j=0; j"; the classic vector-add loop completes it):

        for (j=0; j<1000; j=j+1)
            x[j] = x[j] + y[j];

      Here every iteration is independent of every other, so the iterations can overlap.
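    For contrast, a short C sketch (illustrative; the function name and the second array y are assumptions, not from the slide) of a loop whose iterations cannot overlap freely:

        #define N 1000

        /* Loop-carried dependence: iteration j reads x[j-1], which is
           written by iteration j-1, so iterations must logically run
           in order; this loop has little loop-level parallelism. */
        void running_sum(double x[N], const double y[N]) {
            for (int j = 1; j < N; j = j + 1)
                x[j] = x[j-1] + y[j];
        }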

Slide 7/23

    Loop-Level Parallelism

    - Exploit loop-level parallelism by unrolling the loop, either
      1. dynamically, via branch prediction, or
      2. statically, via loop unrolling by the compiler
      (Another way is vectors, to be covered later)
    - Determining dependences is critical! If two instructions are
      - parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
      - dependent, they are not parallel and must be executed in order, although they may often be partially overlapped

Slide 8/23

    Data Dependence and Hazards

    - InstrJ is data dependent (aka true dependence) on InstrI if:
      1. InstrJ tries to read an operand before InstrI writes it

            I: add r1,r2,r3
            J: sub r4,r1,r3

      2. or InstrJ is data dependent on InstrK, which in turn is dependent on InstrI
    - If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
    - A data dependence in the instruction sequence reflects a data dependence in the source code; the effect of the original data dependence must be preserved
    - If a data dependence causes a hazard in a pipeline, it is a Read After Write (RAW) hazard ("RAW is real")

Slide 9/23

    ILP and Data Dependencies, Hazards

    - HW/SW must preserve the illusion of program order: code must give the same results as if instructions were executed sequentially in the original order of the source program
    - Dependences are a property of programs: the presence of a dependence indicates the potential for a hazard, but whether an actual hazard occurs, and the length of any stall, are properties of the pipeline
    - Importance of data dependences:
      1. They indicate the possibility of a hazard
      2. They determine the order in which results must be calculated
      3. They set an upper bound on how much parallelism can possibly be exploited to speed up a program
    - HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program

Slide 10/23

    Name Dependence #1: Anti-dependence

    - Name dependence: two instructions use the same register or memory location (a "name"), but no data flows between the instructions associated with that name; there are two versions of name dependence, which may cause WAR and WAW hazards
    - Anti-dependence: InstrJ writes an operand before InstrI reads it

        I: sub r4,r1,r3
        J: add r1,r2,r3
        K: mul r6,r1,r7

      Called an anti-dependence by compiler writers; it results from reuse of the name r1
    - If an anti-dependence causes a hazard in the pipeline, that's a Write After Read (WAR) hazard

Slide 11/23

    Name Dependence #2: Output dependence

    - Output dependence: InstrJ writes an operand before InstrI writes it

        I: sub r1,r4,r3
        J: add r1,r2,r3
        K: mul r6,r1,r7

      Called an output dependence by compiler writers; this also results from reuse of the name r1
    - If an output dependence causes a hazard in the pipeline, that's a Write After Write (WAW) hazard
    - Instructions involved in a name dependence can execute simultaneously if we rename so that the instructions do not conflict (see the sketch below):
      - register renaming by hardware, or
      - use of different register names by the compiler
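    A source-level C analogue (illustrative, not from the slides) of how renaming removes the conflict; the variable t plays the role of r1:

        /* Name reuse: both I and J write t, an output dependence (WAW);
           reordering J above I would also break K, which reads J's t. */
        double with_reuse(double r2, double r3, double r4, double r7) {
            double t, r6;
            t  = r4 - r3;     /* I: sub r1,r4,r3 */
            t  = r2 + r3;     /* J: add r1,r2,r3 -- overwrites t */
            r6 = t * r7;      /* K: mul r6,r1,r7 */
            return r6;
        }

        /* Renamed: J and K use the fresh name t2; I and J now share no
           name, so they may execute in either order or simultaneously. */
        double with_renaming(double r2, double r3, double r4, double r7) {
            double t, t2, r6;
            t  = r4 - r3;     /* I */
            t2 = r2 + r3;     /* J: writes t2 instead of t */
            r6 = t2 * r7;     /* K: reads t2 */
            (void)t;          /* t is computed but unused in this fragment */
            return r6;
        }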

Slide 12/23

    Control Dependencies

    - Every instruction is control dependent on some set of branches and, in general, these control dependencies must be preserved to preserve program order

        if p1 {
            S1;
        }
        if p2 {
            S2;
        }

    - S1 is control dependent on the proposition p1, and S2 is control dependent on p2, but not on p1

Slide 13/23

    Carefully Violate Control Dependencies

    - Control dependence need NOT always be preserved
      - It can be violated by executing instructions that should not have been executed, provided doing so does NOT affect program results
      - e.g., in speculative execution, HW throws away the results of bad branch guesses
    - Instead, the two properties critical to program correctness are:
      - exception behavior, and
      - data flow

Slide 14/23

    Exception Behavior Is Important

    - Preserving exception behavior: any change in instruction execution order must NOT change how exceptions are raised in the program (i.e., no new exceptions)
    - Example (this code assumes branches are not delayed):

        DADDU R2,R3,R4
        BEQZ  R2,L1
        LW    R1,-1(R2)
        L1:

    - What is the problem with moving LW before BEQZ?
      - e.g., array overflow: what if R2 = 0, so the address -1 + R2 is out of the program's memory bounds? The hoisted load could raise an exception the original program never raises (see the C sketch below)
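    A C analogue (illustrative; the function and parameter names are assumptions, not from the slide) of why the load must stay behind the branch:

        /* The branch guards the load: when r2 == 0 the load never runs.
           Hoisting the load above the test would access base[r2 - 1]
           even when r2 == 0 (i.e., base[-1]), possibly raising an
           exception the original program never raises. */
        int guarded_load(const int *base, long r2) {
            int r1 = 0;
            if (r2 != 0)
                r1 = base[r2 - 1];
            return r1;
        }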

Slide 15/23

    Data Flow Of Values Must Be Preserved

    - Data flow: the actual flow of data values from instructions that produce results to those that consume them
      - Branches make the flow dynamic (we know the details only at run time); we must determine which instruction is the data supplier
    - Example:

        DADDU R1,R2,R3
        BEQZ  R4,L
        DSUBU R1,R5,R6
        L:
        OR    R7,R1,R8

    - Does OR depend on DADDU or DSUBU? It depends on the branch outcome, so compilers and HW must preserve the data flow during execution (see the C sketch below)
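    The same choice written in C (illustrative; the function and parameter names are assumptions, not from the slide):

        /* BEQZ R4,L skips the DSUBU exactly when r4 == 0, so the OR may
           consume either DADDU's or DSUBU's result; any reordering must
           keep this data flow intact. */
        long or_result(long r2, long r3, long r4,
                       long r5, long r6, long r8) {
            long r1 = r2 + r3;       /* DADDU R1,R2,R3 */
            if (r4 != 0)
                r1 = r5 - r6;        /* DSUBU R1,R5,R6 (skipped if r4 == 0) */
            return r1 | r8;          /* OR R7,R1,R8 */
        }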

Slide 16/23

    Outline

    - ILP: Instruction Level Parallelism
    - Compiler techniques to increase ILP
    - Loop Unrolling
    - Static Branch Prediction
    - Dynamic Branch Prediction
    - Overcoming Data Hazards with Dynamic Scheduling
    - (Start) Tomasulo Algorithm
    - Conclusion

Slide 17/23

    Software Techniques - Example

    - This code adds a scalar to a vector:

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

    - Assume the following latencies for all examples
    - Ignore delayed branches in these examples

        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1
        Load double                    Store double               0
        Integer op                     Integer op                 0

Slide 18/23

    FP Loop: Where are the Hazards?

    - First, translate into MIPS code. To simplify, assume:
      - 8 is the lowest address
      - F2 holds the scalar s
      - R1 starts with the address of x[1000]

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

        Loop: L.D    F0,0(R1)    ;F0 = vector element
              ADD.D  F4,F0,F2    ;add scalar from F2
              S.D    0(R1),F4    ;store result
              DADDUI R1,R1,-8    ;decrement pointer by 8 bytes (DW)
              BNEZ   R1,Loop     ;branch if R1 != zero

Slide 19/23

    FP Loop Showing Stalls

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

        1 Loop: L.D    F0,0(R1)   ;F0 = vector element
        2       stall
        3       ADD.D  F4,F0,F2   ;add scalar in F2
        4       stall
        5       stall
        6       S.D    0(R1),F4   ;store result
        7       DADDUI R1,R1,-8   ;decrement pointer 8 bytes (DW)
        8       stall             ;assume cannot forward to branch
        9       BNEZ   R1,Loop    ;branch if R1 != zero

        plus branch delay!

        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1

    The loop takes 9 clock cycles per iteration. How can we reorder the code to minimize stalls?

Slide 20/23

    Revised FP Loop Minimizing Stalls

    Before (9 cycles per iteration):

        1 Loop: L.D    F0,0(R1)
        2       stall
        3       ADD.D  F4,F0,F2
        4       stall
        5       stall
        6       S.D    0(R1),F4
        7       DADDUI R1,R1,-8
        8       stall
        9       BNEZ   R1,Loop

    After swapping DADDUI and S.D, and changing the address offset of S.D:

        1 Loop: L.D    F0,0(R1)
        2       DADDUI R1,R1,-8
        3       ADD.D  F4,F0,F2
        4       stall
        5       stall
        6       S.D    8(R1),F4   ;offset altered 0 => 8 because DADDUI moved up
        7       BNEZ   R1,Loop

        Instruction producing result   Instruction using result   Stall cycles between
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1
        Load double                    Store double               0
        Integer op => R                Branch on R                1
        Integer op                     Integer op                 0

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

    The loop now takes 7 clock cycles per iteration:
    - 3 for execution (L.D, ADD.D, S.D)
    - 4 for loop overhead (DADDUI, BNEZ, and two stalls)
    How can we make it faster?

Slide 21/23

    Unroll Original Loop Four Times

    - A straightforward way to save time! (The left-hand numbers are the clock cycles on which each instruction issues.)

        for (i=1000; i>0; i=i-1)
            x[i] = x[i] + s;

        1  Loop: L.D    F0,0(R1)
        3        ADD.D  F4,F0,F2     ;1-cycle stall after each L.D
        6        S.D    0(R1),F4     ;2-cycle stall after each ADD.D; drop DADDUI & BNEZ
        7        L.D    F6,-8(R1)
        9        ADD.D  F8,F6,F2
        12       S.D    -8(R1),F8    ;drop DADDUI & BNEZ
        13       L.D    F10,-16(R1)
        15       ADD.D  F12,F10,F2
        18       S.D    -16(R1),F12  ;drop DADDUI & BNEZ
        19       L.D    F14,-24(R1)
        21       ADD.D  F16,F14,F2
        24       S.D    -24(R1),F16
        25       DADDUI R1,R1,#-32   ;decrement altered to 4*8
        27       BNEZ   R1,LOOP      ;1-cycle stall after DADDUI

    Four iterations take 27 clock cycles, or 6.75 per iteration!
    (Assumes the number of loop iterations is a multiple of 4.)
    How can we rewrite the loop to minimize stalls?

Slide 22/23

    Loop Unrolling Detail - Strip Mining

    - We do not usually know the upper bound of a loop
    - Suppose it is n, and we would like to unroll the loop to make k copies of the body
    - Instead of a single unrolled loop, generate a pair of consecutive loops (see the sketch below):
      - the 1st executes (n mod k) times and has a body that is the original loop (this is called strip mining the loop)
      - the 2nd is the unrolled body, surrounded by an outer loop that iterates ⌊n/k⌋ times
    - For large values of n, most of the execution time will be spent in the ⌊n/k⌋ unrolled iterations
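    A C sketch of strip mining (illustrative; the function name and the choice k = 4 are assumptions, not from the slides):

        #define K 4   /* unroll factor k */

        /* Add scalar s to x[n..1], strip-mined: a short cleanup loop
           runs the (n mod K) leftover iterations with the original
           body, then the unrolled loop runs floor(n/K) times. */
        void add_scalar(double *x, int n, double s) {
            int i = n;
            for (; i % K != 0; i = i - 1)     /* 1st loop: n mod K times */
                x[i] = x[i] + s;
            for (; i > 0; i = i - K) {        /* 2nd loop: floor(n/K) times */
                x[i]   = x[i]   + s;
                x[i-1] = x[i-1] + s;
                x[i-2] = x[i-2] + s;
                x[i-3] = x[i-3] + s;
            }
        }

    For n = 1000 and K = 4 the cleanup loop never runs and the unrolled loop runs 250 times, matching the four-times-unrolled loop on the previous slide.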

Slide 23/23

    Unrolled Loop with Minimal Stalls

    Before scheduling (27 cycles; left-hand numbers are issue cycles):

        1  Loop: L.D    F0,0(R1)
        3        ADD.D  F4,F0,F2
        6        S.D    0(R1),F4
        7        L.D    F6,-8(R1)
        9        ADD.D  F8,F6,F2
        12       S.D    -8(R1),F8
        13       L.D    F10,-16(R1)
        15       ADD.D  F12,F10,F2
        18       S.D    -16(R1),F12
        19       L.D    F14,-24(R1)
        21       ADD.D  F16,F14,F2
        24       S.D    -24(R1),F16
        25       DADDUI R1,R1,#-32
        27       BNEZ   R1,LOOP

    After scheduling (loads, adds, and stores grouped):

        1  Loop: L.D    F0,0(R1)
        2        L.D    F6,-8(R1)
        3        L.D    F10,-16(R1)
        4        L.D    F14,-24(R1)
        5        ADD.D  F4,F0,F2
        6        ADD.D  F8,F6,F2
        7        ADD.D  F12,F10,F2
        8        ADD.D  F16,F14,F2
        9        S.D    0(R1),F4
        10       S.D    -8(R1),F8
        11       S.D    -16(R1),F12
        12       DADDUI R1,R1,#-32
        13       S.D    8(R1),F16   ;8 - 32 = -24
        14       BNEZ   R1,LOOP

    Four iterations now take 14 clock cycles, or 3.5 per iteration.

