CISC 662 Graduate Computer Architecture Lecture 12 - CPI < 1

CISC 662 Graduate Computer Architecture

Lecture 12 - CPI < 1

Michela Taufer

http://www.cis.udel.edu/~taufer/teaching/CIS662F07

PowerPoint lecture notes from John Hennessy and David Patterson’s Computer Architecture, 4th edition

Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley)

2

Review Tomasulo

3

Review: Tomasulo With Reorder buffer:

[Figure: FP op queue, reorder buffer (ROB1 oldest to ROB7 newest), registers, reservation stations feeding the FP adders and FP multipliers, paths to and from memory]

Reorder buffer (entry: dest, instruction, done?)
ROB1: F0, LD F0,10(R2), N

Load buffer (dest, address)
1, 10+R2

4

Review: Tomasulo With Reorder buffer:

Reorder buffer (entry: dest, instruction, done?)
ROB2: F10, ADDD F10,F4,F0, N
ROB1: F0, LD F0,10(R2), N

FP adder reservation stations (dest, op, operands)
2, ADDD, R(F4), ROB1

Load buffer (dest, address)
1, 10+R2

5

Review: Tomasulo With Reorder buffer

Reorder buffer (entry: dest, instruction, done?)
ROB3: F2, DIVD F2,F10,F6, N
ROB2: F10, ADDD F10,F4,F0, N
ROB1: F0, LD F0,10(R2), N

FP adder reservation stations (dest, op, operands)
2, ADDD, R(F4), ROB1

FP multiplier reservation stations (dest, op, operands)
3, DIVD, ROB2, R(F6)

Load buffer (dest, address)
1, 10+R2

6

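Reorder buffer (entry: dest, instruction, done?)
ROB6: F0, ADDD F0,F4,F6, N
ROB5: F4, LD F4,0(R3), N
ROB4: --, BNE F2,<…>, N
ROB3: F2, DIVD F2,F10,F6, N
ROB2: F10, ADDD F10,F4,F0, N
ROB1: F0, LD F0,10(R2), N

FP adder reservation stations (dest, op, operands)
2, ADDD, R(F4), ROB1
6, ADDD, ROB5, R(F6)

FP multiplier reservation stations (dest, op, operands)
3, DIVD, ROB2, R(F6)

Load buffer (dest, address)
1, 10+R2
6, 0+R3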

7

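Reorder buffer (entry: dest, value, instruction, done?)
ROB7: --, ROB5, ST 0(R3),F4, N
ROB6: F0, (none), ADDD F0,F4,F6, N
ROB5: F4, (none), LD F4,0(R3), N
ROB4: --, (none), BNE F2,<…>, N
ROB3: F2, (none), DIVD F2,F10,F6, N
ROB2: F10, (none), ADDD F10,F4,F0, N
ROB1: F0, (none), LD F0,10(R2), N

FP adder reservation stations (dest, op, operands)
2, ADDD, R(F4), ROB1
6, ADDD, ROB5, R(F6)

FP multiplier reservation stations (dest, op, operands)
3, DIVD, ROB2, R(F6)

Load buffer (dest, address)
1, 10+R2
6, 0+R3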

8

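Reorder buffer (entry: dest, value, instruction, done?)
ROB7: --, M[10], ST 0(R3),F4, Y
ROB6: F0, (none), ADDD F0,F4,F6, N
ROB5: F4, M[10], LD F4,0(R3), Y
ROB4: --, (none), BNE F2,<…>, N
ROB3: F2, (none), DIVD F2,F10,F6, N
ROB2: F10, (none), ADDD F10,F4,F0, N
ROB1: F0, (none), LD F0,10(R2), N

FP adder reservation stations (dest, op, operands)
2, ADDD, R(F4), ROB1
6, ADDD, M[10], R(F6)

FP multiplier reservation stations (dest, op, operands)
3, DIVD, ROB2, R(F6)

Load buffer (dest, address)
1, 10+R2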

9



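Reorder buffer (entry: dest, value, instruction, done?)
ROB7: --, M[10], ST 0(R3),F4, Y
ROB6: F0, <val2>, ADDD F0,F4,F6, Ex
ROB5: F4, M[10], LD F4,0(R3), Y
ROB4: --, (none), BNE F2,<…>, N
ROB3: F2, (none), DIVD F2,F10,F6, N
ROB2: F10, (none), ADDD F10,F4,F0, N
ROB1: F0, (none), LD F0,10(R2), N

Load buffer (dest, address)
1, 10+R2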

10

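Reorder buffer (entry: dest, value, instruction, done?)
ROB7: --, M[10], ST 0(R3),F4, Y
ROB6: F0, <val2>, ADDD F0,F4,F6, Ex
ROB5: F4, M[10], LD F4,0(R3), Y
ROB4: --, (none), BNE F2,<…>, N
ROB3: F2, (none), DIVD F2,F10,F6, N
ROB2: F10, (none), ADDD F10,F4,F0, N
ROB1: F0, (none), LD F0,10(R2), N

Load buffer (dest, address)
1, 10+R2

What about memory hazards???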

11

Dynamic Memory Disambiguation
• Order of loads and stores must be preserved
• Since they access memory locations, we can examine order only after we calculate the effective address
• Effective address calculation is performed in order:
– Address of a load is examined against A fields of all store buffers
– Address of a store is examined against A fields of all load and store buffers
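
The address checks described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each active buffer entry just records its effective address in a field named A; the names `BufferEntry`, `load_may_proceed`, and `store_may_proceed` are hypothetical and not from the lecture.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BufferEntry:
    """One active load or store buffer entry (hypothetical layout)."""
    tag: int  # buffer number
    A: int    # effective address held in the A field


def load_may_proceed(addr: int, store_buffers: List[BufferEntry]) -> bool:
    # A load conflicts with any earlier, still-active store to the same address.
    return all(entry.A != addr for entry in store_buffers)


def store_may_proceed(addr: int, load_buffers: List[BufferEntry],
                      store_buffers: List[BufferEntry]) -> bool:
    # A store conflicts with any earlier active load or store to that address.
    return all(entry.A != addr for entry in load_buffers + store_buffers)


stores = [BufferEntry(tag=1, A=0x100)]
loads = [BufferEntry(tag=2, A=0x200)]
print(load_may_proceed(0x100, stores))          # False: pending store to 0x100
print(load_may_proceed(0x108, stores))          # True: no matching address
print(store_may_proceed(0x200, loads, stores))  # False: pending load from 0x200
```

The sketch relies on the in-order address calculation stated above: every entry already in a buffer belongs to an earlier instruction, so a simple equality check is enough.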

12

CPI < 1

13

CPI < 1?
• CPI < 1 is not possible if only one instruction is issued per clock cycle
• Need to allow multiple instructions to be issued in a clock cycle

14

Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
– Multimedia instructions being added to many processors
• Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)
– IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
• (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates (TBD)
– Intel Architecture-64 (IA-64) 64-bit address
» Renamed: “Explicitly Parallel Instruction Computer (EPIC)”
• Anticipated success of multiple instructions led to Instructions Per Clock cycle (IPC) vs. CPI

15

Superscalar Processors
• Instructions either statically or dynamically scheduled:
– Statically scheduled by compilers
– Dynamically scheduled by techniques based on scoreboarding or Tomasulo’s algorithm
• Issue varying number of instructions per clock

16

Very Long Instruction Word
• Issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet
• Instructions statically scheduled by the compiler

17

Implementing Superscalar Processors
• To have multiple instructions per clock, either:
– Run each step (i.e., assigning a reservation station and updating the pipeline control) in half a clock cycle so that two instructions can be processed in one clock cycle
– Build the logic necessary to handle two instructions at once, including any dependency between instructions

18

Getting CPI < 1: Issuing Multiple Instructions/Cycle

• Superscalar: assume 2 instructions, 1 FP & 1 anything else
– Fetch 64 bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair

Type              Pipe stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB

• 1-cycle load delay expands to 3 instructions in SS
– instruction in right half can’t use it, nor instructions in next slot

19

Multiple Issue Issues
• Issue packet: group of instructions from fetch unit that could potentially issue in 1 clock
– If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue
– 0 to N instruction issues per clock cycle, for N-issue
• Performing issue checks in 1 cycle could limit clock cycle time: O(n^2 - n) comparisons
– => issue stage usually split and pipelined
– 1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards among selected instructions and those already issued
– => higher branch penalties => prediction accuracy important
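
One way to see where an n^2 - n comparison count can come from is to count the register checks inside a single issue packet. The sketch below is an illustration under stated assumptions, not necessarily the derivation the slide has in mind: each instruction has two source registers and one destination, and each source is compared against the destination of every earlier instruction in the same packet. The function name is hypothetical.

```python
def issue_check_comparisons(n: int, sources_per_instr: int = 2) -> int:
    """Count register comparisons for intra-packet dependence checks."""
    comparisons = 0
    for position in range(n):
        earlier = position  # instructions ahead of this one in the packet
        comparisons += sources_per_instr * earlier
    return comparisons


for width in (2, 4, 8):
    print(width, issue_check_comparisons(width), width * width - width)
```

With two sources per instruction the count is 2 * (0 + 1 + ... + (n-1)) = n^2 - n, so doubling the issue width roughly quadruples the checks that must fit in the issue stage.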

20

Dynamic Scheduling in Superscalar: The easy way
• How to issue two instructions and keep in-order instruction issue for Tomasulo?
– Assume 1 integer + 1 floating point
– 1 Tomasulo control for integer, 1 for floating point
• Issue 2X Clock Rate, so that issue remains in order
• Only loads/stores might cause dependency between integer and FP issue:
– Replace load reservation station with a load queue; operands must be read in the order they are fetched
– Load checks addresses in Store Queue to avoid RAW violation
– Store checks addresses in Load Queue to avoid WAR, WAW

21

How much to Speculate?
• Speculation Pro: uncover events that would otherwise stall the pipeline (cache misses)
• Speculation Con: speculation is costly if an exceptional event occurs when speculation was incorrect
• Typical solution: speculation allows only low-cost exceptional events (1st-level cache miss)
• When an expensive exceptional event occurs (2nd-level cache miss or TLB miss), processor waits until the instruction causing the event is no longer speculative before handling the event
• Assuming single branch per cycle: future may speculate across multiple branches!

22

Review: Unrolled Loop that Minimizes Stalls for Scalar

1  Loop: LD   F0,0(R1)
2        LD   F6,-8(R1)
3        LD   F10,-16(R1)
4        LD   F14,-24(R1)
5        ADDD F4,F0,F2
6        ADDD F8,F6,F2
7        ADDD F12,F10,F2
8        ADDD F16,F14,F2
9        SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16   ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

LD to ADDD: 1 cycle
ADDD to SD: 2 cycles

23

Loop Unrolling in Superscalar

      Integer instruction   FP instruction     Clock cycle
Loop: LD F0,0(R1)                              1
      LD F6,-8(R1)                             2
      LD F10,-16(R1)        ADDD F4,F0,F2      3
      LD F14,-24(R1)        ADDD F8,F6,F2      4
      LD F18,-32(R1)        ADDD F12,F10,F2    5
      SD 0(R1),F4           ADDD F16,F14,F2    6
      SD -8(R1),F8          ADDD F20,F18,F2    7
      SD -16(R1),F12                           8
      SD -24(R1),F16                           9
      SUBI R1,R1,#40                           10
      BNEZ R1,LOOP                             11
      SD -32(R1),F20                           12

• Unrolled 5 times to avoid delays (+1 due to SS)
• 12 clocks, or 2.4 clocks per iteration (1.5X)
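
The per-iteration figures quoted for the scalar and superscalar schedules can be checked with a few lines of arithmetic. This is only a sketch of that calculation; the helper name `clocks_per_iteration` is made up for illustration.

```python
def clocks_per_iteration(total_clocks: int, iterations: int) -> float:
    return total_clocks / iterations


scalar = clocks_per_iteration(14, 4)       # scalar loop, unrolled 4 times
superscalar = clocks_per_iteration(12, 5)  # 2-issue loop, unrolled 5 times
speedup = scalar / superscalar
print(scalar, superscalar, round(speedup, 2))  # 3.5 2.4 1.46
```

The ratio of about 1.46 is what the slide rounds to 1.5X.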

24

Statically Scheduled Superscalar MIPS
• The compiler is responsible for finding independent instructions to issue
– E.g., unroll loop to make n copies
• Problems might arise:
– We will need additional hardware in the pipeline
– Maintaining precise exceptions is hard because instructions may complete out of order
– Hazard penalties are longer

25

Dynamically Scheduled Superscalar MIPS
• Extend Tomasulo’s algorithm to support issue of 2 instructions per cycle
• We must issue instructions to reservation stations in order
• Issue stage can either be
– Pipelined – issue one instruction in half a cycle, another one in the other half
– Extended – add more hardware and issue instructions simultaneously

26

Dynamically Scheduled Superscalar MIPS

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,LOOP

• Any two instructions can be issued (not only integer + FP)
• One INT unit used both for ALU and effective address calculation
• Integer ALU takes 1 cycle, load 2, FP add 3
• Pipelined FP units, 2 CDBs, perfect branch prediction
• One cycle is needed for issue and one for write results (this stage adds one cycle delay)
• Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
• Show resource usage for integer unit, FP unit, data cache and CDB
• Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
• Assume instructions following branch cannot proceed with execution until we know branch outcome
• Assume one single memory port

27


Dual-issue version without speculation

Iteration  Instruction        Issue  Execute  Memory  Write CDB  Comment
1          L.D F0,0(R1)       1      2        3       4
1          ADD.D F4,F0,F2     1      5                8          Wait for L.D
1          S.D F4,0(R1)       2      3        9                  Wait for ADD.D
1          DADDIU R1,R1,#-8   2      4                5          Wait for ALU
1          BNE R1,R2,Loop     3      6                           Wait for DADDIU
2          L.D F0,0(R1)       4      7        8       9          Wait for BNE
2          ADD.D F4,F0,F2     4      10               13         Wait for L.D
2          S.D F4,0(R1)       5      8        14                 Wait for ADD.D
2          DADDIU R1,R1,#-8   5      9                10         Wait for ALU
2          BNE R1,R2,Loop     6      11                          Wait for DADDIU
3          L.D F0,0(R1)       7      12       13      14         Wait for BNE
3          ADD.D F4,F0,F2     7      15               18         Wait for L.D
3          S.D F4,0(R1)       8      13       19                 Wait for ADD.D
3          DADDIU R1,R1,#-8   8      14               15         Wait for ALU
3          BNE R1,R2,Loop     9      16                          Wait for DADDIU

CPI = 16/15 = 1.07
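
The CPI figure above appears to divide the cycle in which the last instruction (the third BNE) executes by the number of instructions. A minimal sketch of that calculation follows, assuming this convention; the `schedule` list is a hand transcription of the issue and execute columns, and the execute cycle of the third L.D (12) is inferred as one cycle after the second BNE executes.

```python
# (iteration, instruction, issue cycle, execute cycle)
schedule = [
    (1, "L.D", 1, 2), (1, "ADD.D", 1, 5), (1, "S.D", 2, 3),
    (1, "DADDIU", 2, 4), (1, "BNE", 3, 6),
    (2, "L.D", 4, 7), (2, "ADD.D", 4, 10), (2, "S.D", 5, 8),
    (2, "DADDIU", 5, 9), (2, "BNE", 6, 11),
    (3, "L.D", 7, 12),  # 12 is inferred, see the note above
    (3, "ADD.D", 7, 15), (3, "S.D", 8, 13),
    (3, "DADDIU", 8, 14), (3, "BNE", 9, 16),
]

instructions = len(schedule)
last_execute = schedule[-1][3]               # cycle in which the final BNE executes
cpi = last_execute / instructions
issue_rate = instructions / schedule[-1][2]  # instructions issued per clock
print(f"CPI = {last_execute}/{instructions} = {cpi:.2f}")
print(f"issue rate = {issue_rate:.2f} instructions/clock")
```

Although the front end issues about 1.67 instructions per clock, waiting for each branch to resolve keeps the execution rate slightly worse than one instruction per clock.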

28



• Any two instructions can be issued (not only integer + FP)
• One INT unit used for ALU
• One INT unit is used for effective address calculation
• Integer ALU takes 1 cycle, load 2, FP add 3
• Pipelined FP units, 2 CDBs, perfect branch prediction
• One cycle is needed for issue and one for write results (this stage adds one cycle delay)
• Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
• Show resource usage for integer unit, FP unit, data cache and CDB
• Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
• Assume instructions following branch cannot proceed with execution until we know branch outcome
• Assume one single memory port

29




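Iteration  Instruction        Issue  Execute  Memory  Write CDB  Comment
1          L.D F0,0(R1)       1      2        3       4
1          ADD.D F4,F0,F2     1      5                8          Wait for L.D
1          S.D F4,0(R1)       2      3        9                  Wait for ADD.D
1          DADDIU R1,R1,#-8   2      3                4
1          BNE R1,R2,Loop     3      5                           Wait for DADDIU
2          L.D F0,0(R1)       4      6        7       8          Wait for BNE
2          ADD.D F4,F0,F2     4      9                12         Wait for L.D
2          S.D F4,0(R1)       5      7        13                 Wait for ADD.D
2          DADDIU R1,R1,#-8   5      6                7
2          BNE R1,R2,Loop     6      8                           Wait for DADDIU
3          L.D F0,0(R1)       7      9        10      11         Wait for BNE
3          ADD.D F4,F0,F2     7      12               15         Wait for L.D
3          S.D F4,0(R1)       8      10       16                 Wait for ADD.D
3          DADDIU R1,R1,#-8   8      9                10
3          BNE R1,R2,Loop     9      11                          Wait for DADDIU

CPI = 11/15 = 0.73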

30



• Any two instructions can be issued (not only integer + FP)
• One INT unit used for ALU
• One INT unit is used for effective address calculation
• Integer ALU takes 1 cycle, load 2, FP add 3
• Pipelined FP units, 2 CDBs, perfect branch prediction
• One cycle is needed for issue and one for write results (this stage adds one cycle delay)
• Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
• Show resource usage for integer unit, FP unit, data cache and CDB
• Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
• Assume instructions following branch can proceed with execution even if we do not know the branch outcome (speculation)
• Assume one single memory port

31




Comment

CPI = _____

32



• Any two instructions can be issued (not only integer + FP)
• One INT unit used for ALU
• One INT unit is used for effective address calculation
• Integer ALU takes 1 cycle, load 2, FP add 3
• Pipelined FP units, 2 CDBs, perfect branch prediction
• One cycle is needed for issue and one for write results (this stage adds one cycle delay)
• Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
• Show resource usage for integer unit, FP unit, data cache and CDB
• Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
• Assume instructions following branch can proceed with execution even if we do not know the branch outcome (speculation)
• Assume two memory ports

33




Comment

CPI = _____

34

Increasing Instruction Fetch Bandwidth
• Predicts next instruction address, sends it out before decoding instruction
• PC of branch sent to BTB
• When match is found, Predicted PC is returned
• If branch predicted taken, instruction fetch continues at Predicted PC

[Figure: Branch Target Buffer (BTB)]
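
The lookup flow on this slide can be sketched as a toy model. The code below is an illustration only and makes several assumptions that are not in the lecture: the BTB is an unbounded dictionary with no tags or associativity, entries are kept only for branches last seen taken, and instructions are 4 bytes. The names `BranchTargetBuffer` and `next_fetch_pc` are hypothetical.

```python
from typing import Dict, Optional


class BranchTargetBuffer:
    """Toy BTB: maps the PC of a branch to its predicted target PC."""

    def __init__(self) -> None:
        self.entries: Dict[int, int] = {}

    def lookup(self, pc: int) -> Optional[int]:
        # A hit means the instruction at pc is predicted to be a taken branch.
        return self.entries.get(pc)

    def update(self, pc: int, taken: bool, target: int) -> None:
        # Assumed policy: keep entries only for branches last seen taken.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


def next_fetch_pc(btb: BranchTargetBuffer, pc: int, instr_bytes: int = 4) -> int:
    # The PC is sent to the BTB before the instruction is decoded.
    predicted = btb.lookup(pc)
    return predicted if predicted is not None else pc + instr_bytes


btb = BranchTargetBuffer()
print(hex(next_fetch_pc(btb, 0x40)))  # miss: fall through to 0x44
btb.update(0x40, taken=True, target=0x10)
print(hex(next_fetch_pc(btb, 0x40)))  # hit: fetch continues at 0x10
```

Because the lookup uses only the fetch PC, the predicted address is available without decoding the instruction, which is the point of the first bullet.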

35

Branch Folding (I)
• Branch folding allows:
– 0-cycle unconditional branches (always)
– 0-cycle conditional branches (sometimes)
• BF eliminates an instruction (the branch) from the code stream
• BF eliminates the single-cycle pipeline bubble that usually occurs immediately after a branch

[Figure label: Predicted instruction]

36

Branch folding (II)

[Figure label: Predicted instructions]

If the processor is issuing two instructions per cycle

37



• Any two instructions can be issued (not only integer + FP)
• One INT unit used for ALU
• One INT unit is used for effective address calculation
• Integer ALU takes 1 cycle, load 2, FP add 3
• Pipelined FP units, 2 CDBs, perfect branch prediction
• One cycle is needed for issue and one for write results (this stage adds one cycle delay)
• Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
• Show resource usage for integer unit, FP unit, data cache and CDB
• Assume that we do have a branch-target buffer with branch folding that allows us to know whether the as-yet-undecoded instruction is a branch and what the next two instructions are
• Assume instructions following branch can proceed with execution even if we do not know the branch outcome (speculation)

38




Comment

CPI = _____

39

Multiple Issue Challenges
• While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
– Exactly 50% FP operations AND no hazards
• If more instructions issue at same time, greater difficulty of decode and issue:
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue (N-issue: ~O(N^2 - N) comparisons)
– Register file: need 2x reads and 1x writes/cycle
– Rename logic: must be able to rename same register multiple times in one cycle! For instance, consider 4-way issue:

add r1, r2, r3        add p11, p4, p7
sub r4, r1, r2   ⇒    sub p22, p11, p4
lw  r1, 4(r4)         lw  p23, 4(p22)
add r5, r1, r2        add p12, p23, p4

Imagine doing this transformation in a single cycle!
– Result buses: need to complete multiple instructions/cycle
» So, need multiple buses with associated matching logic at every reservation station.
» Or, need multiple forwarding paths
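
The 4-way renaming example above can be reproduced with a small sequential sketch. This is an illustration, not how hardware does it: the loop renames one instruction at a time, whereas the slide's point is that real rename logic must produce the same result for all four instructions within a single cycle. The initial mapping (r2 to p4, r3 to p7) and the free-list order are assumptions chosen so the output matches the physical register names on the slide; the function name `rename` is hypothetical.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, List[str]]  # (opcode, destination, sources)


def rename(program: List[Instr], mapping: Dict[str, str],
           free_list: List[str]) -> List[Instr]:
    """Rename destinations to fresh physical registers, in program order."""
    mapping = dict(mapping)
    free = list(free_list)
    renamed = []
    for op, dest, sources in program:
        # Sources read the newest mapping, including mappings created by
        # earlier instructions in this same group.
        new_sources = [mapping.get(s, s) for s in sources]
        new_dest = free.pop(0)
        mapping[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed


program = [
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r4", ["r1", "r2"]),
    ("lw",  "r1", ["r4"]),        # 4(r4): only the base register is tracked
    ("add", "r5", ["r1", "r2"]),
]
initial = {"r2": "p4", "r3": "p7"}        # assumed starting map
free_regs = ["p11", "p22", "p23", "p12"]  # assumed free-list order
result = rename(program, initial, free_regs)
for op, dest, sources in result:
    print(op, dest, ", ".join(sources))
```

Note that r1 is renamed twice (to p11, then p23) and the final add must see the second mapping, which is exactly the intra-group dependence that makes single-cycle renaming hard.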

40

More about VLIW
• VLIW packages multiple operations into one very long instruction
• The compiler chooses the instructions to be issued
• Enough parallelism is needed in a straight-line code sequence to fill the available operation slots
– Unroll loops
– Schedule code across basic blocks using global scheduling techniques

41

Loop Unrolling in VLIW

Memory reference 1  Memory reference 2  FP operation 1    FP op. 2          Int. op/branch   Clock
LD F0,0(R1)         LD F6,-8(R1)                                                             1
LD F10,-16(R1)      LD F14,-24(R1)                                                           2
LD F18,-32(R1)      LD F22,-40(R1)      ADDD F4,F0,F2     ADDD F8,F6,F2                      3
LD F26,-48(R1)                          ADDD F12,F10,F2   ADDD F16,F14,F2                    4
                                        ADDD F20,F18,F2   ADDD F24,F22,F2                    5
SD 0(R1),F4         SD -8(R1),F8        ADDD F28,F26,F2                                      6
SD -16(R1),F12      SD -24(R1),F16                                                           7
SD -32(R1),F20      SD -40(R1),F24                                          SUBI R1,R1,#48   8
SD -0(R1),F28                                                               BNEZ R1,LOOP     9

• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
• Average: 2.5 ops per clock, 50% efficiency
• Note: Need more registers in VLIW (15 vs. 6 in SS)
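
The summary figures for the VLIW schedule can be recomputed from the operation counts in the table. The sketch below is just that arithmetic; the five-slot width (2 memory, 2 FP, 1 integer/branch) is read off the table header, and the variable names are made up for illustration.

```python
ops = {"LD": 7, "ADDD": 7, "SD": 7, "SUBI": 1, "BNEZ": 1}
clocks = 9
slots_per_clock = 5  # 2 memory + 2 FP + 1 integer/branch
iterations = 7

total_ops = sum(ops.values())
ops_per_clock = total_ops / clocks
efficiency = total_ops / (clocks * slots_per_clock)
clocks_per_iter = clocks / iterations
print(total_ops, round(ops_per_clock, 2), round(efficiency, 2),
      round(clocks_per_iter, 2))  # 23 2.56 0.51 1.29
```

These values agree with the slide's rounded 2.5 ops per clock, 50% efficiency, and 1.3 clocks per iteration.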

42

Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
• HW determines address conflicts
• HW better branch prediction
• HW maintains precise exception model
• HW does not execute bookkeeping instructions
• Works across multiple implementations
• SW speculation is much easier for HW design

43

Superscalar v. VLIW

Superscalar:
• Smaller code size
• Binary compatibility across generations of hardware

VLIW:
• Simplified hardware for decoding, issuing instructions
• No interlock hardware (compiler checks?)
• More registers, but simplified hardware for register ports (multiple independent register files?)

44

Limits in Multi-issue Processors
• Inherent limitations of ILP in programs
• Difficulties in building the underlying hardware
• Limitations specific to either a superscalar or VLIW implementation

45

Deadlines

9 Oct 28 Lec12 - Multiple Issue

9 Oct 30 Lec13 - Study of the Limitations of ILP (Q4)

10 Nov 4 Election day – no class

10 Nov 6 Lec14 - Review Cache and Review Virtual Memory (Chap 4)

11 Nov 10 Homework 3 due

11 Nov 11 Lec15 - Multiprocessors and Thread-Level Parallelism; Symmetric Shared Memory (Q5)

11 Nov 13 Lec16 - Distributed Shared Memory

Nov 17 Homework 4 due

12 Nov 18 Lec18 – Homework 3 review (Chap 5)

12 Nov 20 Lec19 – Homework 4 review

13 Nov 25 Lec17 - Synchronization (Q6)

13 Nov 27 Thanksgiving – Holiday
