CISC 662 Graduate Computer Architecture
Lecture 12 - CPI < 1
Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/CIS662F07
PowerPoint lecture notes from John Hennessy and David Patterson’s Computer Architecture, 4th edition
Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley)
Review: Tomasulo With Reorder Buffer
[Figure: Tomasulo datapath with FP Op Queue, Registers, Reorder Buffer, Reservation Stations, FP adders, FP multipliers, and paths to and from memory]
Reorder buffer (ROB1 is the oldest entry, ROB7 the newest):
Entry  Dest  Value   Instruction       Done?
ROB1   F0            LD F0,10(R2)      N
Load buffer (Dest, address): 1 10+R2
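Review: Tomasulo With Reorder Buffer
[Figure: Tomasulo datapath with FP Op Queue, Registers, Reorder Buffer, Reservation Stations, FP adders, FP multipliers, and paths to and from memory]
Reorder buffer (ROB1 is the oldest entry, ROB7 the newest):
Entry  Dest  Value   Instruction       Done?
ROB1   F0            LD F0,10(R2)      N
ROB2   F10           ADDD F10,F4,F0    N
Reservation stations (Dest, operation): 2 ADDD R(F4),ROB1
Load buffer (Dest, address): 1 10+R2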
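Review: Tomasulo With Reorder Buffer
[Figure: Tomasulo datapath with FP Op Queue, Registers, Reorder Buffer, Reservation Stations, FP adders, FP multipliers, and paths to and from memory]
Reorder buffer (ROB1 is the oldest entry, ROB7 the newest):
Entry  Dest  Value   Instruction       Done?
ROB1   F0            LD F0,10(R2)      N
ROB2   F10           ADDD F10,F4,F0    N
ROB3   F2            DIVD F2,F10,F6    N
Reservation stations (Dest, operation): 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6)
Load buffer (Dest, address): 1 10+R2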
Review: Tomasulo With Reorder Buffer
[Figure: Tomasulo datapath with FP Op Queue, Registers, Reorder Buffer, Reservation Stations, FP adders, FP multipliers, and paths to and from memory]
Reorder buffer (ROB1 is the oldest entry, ROB7 the newest):
Entry  Dest  Value   Instruction       Done?
ROB1   F0            LD F0,10(R2)      N
ROB2   F10           ADDD F10,F4,F0    N
ROB3   F2            DIVD F2,F10,F6    N
ROB4   --            BNE F2,<…>        N
ROB5   F4            LD F4,0(R3)       N
ROB6   F0            ADDD F0,F4,F6     N
Reservation stations (Dest, operation): 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6); 6 ADDD ROB5,R(F6)
Load buffers (Dest, address): 1 10+R2; 6 0+R3
Review: Tomasulo With Reorder Buffer
[Figure: Tomasulo datapath with FP Op Queue, Registers, Reorder Buffer, Reservation Stations, FP adders, FP multipliers, and paths to and from memory]
Reorder buffer (ROB1 is the oldest entry, ROB7 the newest):
Entry  Dest  Value   Instruction       Done?
ROB1   F0            LD F0,10(R2)      N
ROB2   F10           ADDD F10,F4,F0    N
ROB3   F2            DIVD F2,F10,F6    N
ROB4   --            BNE F2,<…>        N
ROB5   F4            LD F4,0(R3)       N
ROB6   F0            ADDD F0,F4,F6     N
ROB7   --    ROB5    ST 0(R3),F4       N
Reservation stations (Dest, operation): 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6); 6 ADDD ROB5,R(F6)
Load buffers (Dest, address): 1 10+R2; 6 0+R3
Review: Tomasulo With Reorder Buffer
[Figure: Tomasulo datapath with FP Op Queue, Registers, Reorder Buffer, Reservation Stations, FP adders, FP multipliers, and paths to and from memory]
Reorder buffer (ROB1 is the oldest entry, ROB7 the newest):
Entry  Dest  Value   Instruction       Done?
ROB1   F0            LD F0,10(R2)      N
ROB2   F10           ADDD F10,F4,F0    N
ROB3   F2            DIVD F2,F10,F6    N
ROB4   --            BNE F2,<…>        N
ROB5   F4    M[10]   LD F4,0(R3)       Y
ROB6   F0            ADDD F0,F4,F6     N
ROB7   --    M[10]   ST 0(R3),F4       Y
Reservation stations (Dest, operation): 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6); 6 ADDD M[10],R(F6)
Load buffer (Dest, address): 1 10+R2
Review: Tomasulo With Reorder Buffer
[Figure: Tomasulo datapath with FP Op Queue, Registers, Reorder Buffer, Reservation Stations, FP adders, FP multipliers, and paths to and from memory]
Reorder buffer (ROB1 is the oldest entry, ROB7 the newest):
Entry  Dest  Value   Instruction       Done?
ROB1   F0            LD F0,10(R2)      N
ROB2   F10           ADDD F10,F4,F0    N
ROB3   F2            DIVD F2,F10,F6    N
ROB4   --            BNE F2,<…>        N
ROB5   F4    M[10]   LD F4,0(R3)       Y
ROB6   F0    <val2>  ADDD F0,F4,F6     Ex
ROB7   --    M[10]   ST 0(R3),F4       Y
Reservation stations (Dest, operation): 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6)
Load buffer (Dest, address): 1 10+R2
Review: Tomasulo With Reorder Buffer
[Figure: Tomasulo datapath with FP Op Queue, Registers, Reorder Buffer, Reservation Stations, FP adders, FP multipliers, and paths to and from memory]
Reorder buffer (ROB1 is the oldest entry, ROB7 the newest):
Entry  Dest  Value   Instruction       Done?
ROB1   F0            LD F0,10(R2)      N
ROB2   F10           ADDD F10,F4,F0    N
ROB3   F2            DIVD F2,F10,F6    N
ROB4   --            BNE F2,<…>        N
ROB5   F4    M[10]   LD F4,0(R3)       Y
ROB6   F0    <val2>  ADDD F0,F4,F6     Ex
ROB7   --    M[10]   ST 0(R3),F4       Y
Reservation stations (Dest, operation): 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6)
Load buffer (Dest, address): 1 10+R2
What about memory hazards???
Dynamic Memory Disambiguation
• Order of loads and stores must be preserved
• Since they access memory locations, we can examine order only after we calculate the effective address
• Effective address calculation is performed in order:
– Address of a load is examined against the A fields of all store buffers
– Address of a store is examined against the A fields of all load and store buffers
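The address checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `BufferEntry` class and the `must_wait` function are hypothetical names, the `a` attribute stands in for the A field, and the `active` list is assumed to hold only earlier memory operations that have not yet completed.

```python
from dataclasses import dataclass


@dataclass
class BufferEntry:
    """One active load or store buffer entry (hypothetical model)."""
    kind: str  # "load" or "store"
    a: int     # the A field: effective address of the earlier access


def must_wait(kind, addr, active):
    """Return True if a new memory operation conflicts with an earlier active one.

    A load is compared only against earlier stores (RAW through memory).
    A store is compared against earlier loads (WAR) and earlier stores (WAW).
    """
    for entry in active:
        if entry.a != addr:
            continue  # different address, no memory hazard with this entry
        if kind == "load" and entry.kind == "store":
            return True
        if kind == "store":
            return True
    return False


earlier = [BufferEntry("store", 100), BufferEntry("load", 200)]
print(must_wait("load", 100, earlier))   # load after store to the same address
print(must_wait("load", 200, earlier))   # load after load: no hazard
print(must_wait("store", 200, earlier))  # store after load to the same address
```

Because effective addresses are computed in program order, every entry in `active` is older than the operation being checked, which is what makes a simple scan sufficient in this sketch.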
CPI < 1?
• CPI < 1 is not possible if only one instruction is issued per clock cycle
• Need to allow multiple instructions to be issued in a clock cycle
Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
– Multimedia instructions being added to many processors
• Superscalar: varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)
– IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
• (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates (TBD)
– Intel Architecture-64 (IA-64) 64-bit address
» Renamed: “Explicitly Parallel Instruction Computer (EPIC)”
• Anticipated success of multiple instructions led to Instructions Per Clock cycle (IPC) vs. CPI
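The relation between CPI and IPC, and the reason single issue cannot go below CPI = 1, can be written out as a small calculation. The function names and the sample cycle counts below are illustrative assumptions, not figures from the lecture.

```python
def cpi(cycles, instructions):
    """Clock cycles per instruction."""
    return cycles / instructions


def ipc(cycles, instructions):
    """Instructions per clock cycle, the reciprocal of CPI."""
    return instructions / cycles


def best_case_cpi(issue_width):
    """Lower bound on CPI when at most `issue_width` instructions issue per cycle."""
    return 1 / issue_width


# Hypothetical run: 100 instructions finishing in 60 cycles on a 2-issue machine.
print(cpi(60, 100), ipc(60, 100))
print(best_case_cpi(1), best_case_cpi(2))
```

With `issue_width` equal to 1 the bound is exactly 1.0, so any CPI below 1 requires issuing more than one instruction in at least some cycles.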
Superscalar Processors
• Instructions are either statically or dynamically scheduled:
– Statically scheduled by compilers
– Dynamically scheduled by techniques based on scoreboarding or Tomasulo’s algorithm
• Issue a varying number of instructions per clock
Very Long Instruction Word
• Issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet
• Instructions statically scheduled by the compiler
Implementing Superscalar Processors
• To have multiple instructions per clock:
– Run each step (i.e., assigning a reservation station and updating the pipeline control) in half a clock cycle so that two instructions can be processed in one clock cycle
– Build the logic necessary to handle two instructions at once, including any dependency between instructions
Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Superscalar: assume 2 instructions, 1 FP & 1 anything else
– Fetch 64 bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair
Type              Pipe stages
Int. instruction  IF   ID   EX   MEM  WB
FP instruction    IF   ID   EX   MEM  WB
Int. instruction       IF   ID   EX   MEM  WB
FP instruction         IF   ID   EX   MEM  WB
Int. instruction            IF   ID   EX   MEM  WB
FP instruction              IF   ID   EX   MEM  WB
• 1 cycle load delay expands to 3 instructions in SS
– instruction in right half can’t use it, nor instructions in next slot
Multiple Issue Issues
• Issue packet: group of instructions from the fetch unit that could potentially issue in 1 clock
– If an instruction causes a structural hazard or a data hazard, either due to an earlier instruction in execution or to an earlier instruction in the issue packet, then the instruction does not issue
– 0 to N instructions issue per clock cycle, for N-issue
• Performing issue checks in 1 cycle could limit clock cycle time: O(n^2 - n) comparisons
– => issue stage usually split and pipelined
– 1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards among selected instructions and those already issued
– => higher branch penalties => prediction accuracy important
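One way to see where the n^2 - n count comes from is to enumerate the register comparisons inside a single issue packet. The sketch below assumes two source registers and one destination per instruction and counts only checks against earlier instructions in the same packet; `packet_comparisons` is a hypothetical helper name.

```python
def packet_comparisons(n, sources_per_instr=2):
    """List every (consumer, source slot, producer) comparison in an n-wide packet.

    Each source register of an instruction is compared with the destination
    register of every earlier instruction in the same packet.
    """
    checks = []
    for consumer in range(n):
        for src in range(sources_per_instr):
            for producer in range(consumer):  # only earlier slots can be producers
                checks.append((consumer, src, producer))
    return checks


for n in (2, 4, 8):
    print(n, len(packet_comparisons(n)), n * n - n)
```

Under these assumptions the count is 2 * (0 + 1 + ... + (n-1)) = n^2 - n, so doubling the issue width roughly quadruples the comparator count, which is the pressure on cycle time the slide refers to.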
Dynamic Scheduling in Superscalar: The Easy Way
• How to issue two instructions and keep in-order instruction issue for Tomasulo?
– Assume 1 integer + 1 floating point
– 1 Tomasulo control for integer, 1 for floating point
• Issue at 2X clock rate, so that issue remains in order
• Only loads/stores might cause a dependency between integer and FP issue:
– Replace load reservation station with a load queue; operands must be read in the order they are fetched
– Load checks addresses in Store Queue to avoid RAW violation
– Store checks addresses in Load Queue to avoid WAR, WAW
How Much to Speculate?
• Speculation pro: uncover events that would otherwise stall the pipeline (cache misses)
• Speculation con: speculation is costly if an exceptional event occurs when the speculation was incorrect
• Typical solution: speculation allows only low-cost exceptional events (1st-level cache miss)
• When an expensive exceptional event occurs (2nd-level cache miss or TLB miss), the processor waits until the instruction causing the event is no longer speculative before handling the event
• Assuming single branch per cycle: future may speculate across multiple branches!
Review: Unrolled Loop that Minimizes Stalls for Scalar
1  Loop: LD   F0,0(R1)
2        LD   F6,-8(R1)
3        LD   F10,-16(R1)
4        LD   F14,-24(R1)
5        ADDD F4,F0,F2
6        ADDD F8,F6,F2
7        ADDD F12,F10,F2
8        ADDD F16,F14,F2
9        SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16    ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
Loop Unrolling in Superscalar
Integer instruction      FP instruction     Clock cycle
Loop: LD F0,0(R1)                           1
      LD F6,-8(R1)                          2
      LD F10,-16(R1)     ADDD F4,F0,F2      3
      LD F14,-24(R1)     ADDD F8,F6,F2      4
      LD F18,-32(R1)     ADDD F12,F10,F2    5
      SD 0(R1),F4        ADDD F16,F14,F2    6
      SD -8(R1),F8       ADDD F20,F18,F2    7
      SD -16(R1),F12                        8
      SD -24(R1),F16                        9
      SUBI R1,R1,#40                        10
      BNEZ R1,LOOP                          11
      SD -32(R1),F20                        12
• Unrolled 5 times to avoid delays (+1 due to SS)
• 12 clocks, or 2.4 clocks per iteration (1.5X)
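The per-iteration figures for the scalar and superscalar schedules follow from a short calculation. The sketch below uses only the cycle and unroll counts quoted on these two slides; the variable names are illustrative.

```python
# Scalar schedule: 4 unrolled iterations in 14 clock cycles.
scalar_clocks, scalar_iterations = 14, 4
# 2-issue superscalar schedule: 5 unrolled iterations in 12 clock cycles.
ss_clocks, ss_iterations = 12, 5

scalar_per_iteration = scalar_clocks / scalar_iterations
ss_per_iteration = ss_clocks / ss_iterations
speedup = scalar_per_iteration / ss_per_iteration

print(scalar_per_iteration, ss_per_iteration, round(speedup, 2))
```

The ratio comes out slightly under 1.5, which the slide rounds to 1.5X.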
Statically Scheduled Superscalar MIPS
• The compiler is responsible for finding independent instructions to issue
– E.g., unroll loop to make n copies
• Problems might arise:
– We will need additional hardware in the pipeline
– Maintaining precise exceptions is hard because instructions may complete out of order
– Hazard penalties are longer
Dynamically Scheduled Superscalar MIPS
• Extend Tomasulo’s algorithm to support issue of 2 instructions per cycle
• We must issue instructions to reservation stations in order
• Issue stage can either be:
– Pipelined: issue one instruction in half a cycle, another one in the other half
– Extended: add more hardware and issue instructions simultaneously
Dynamically Scheduled Superscalar MIPS
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,LOOP
• Any two instructions can be issued (not only integer + FP)
• One INT unit used both for ALU and effective address calculation
• Integer ALU takes 1 cycle, load 2, FP add 3
• Pipelined FP units, 2 CDBs, perfect branch prediction
• One cycle is needed for issue and one for write results (this stage adds one cycle delay)
• Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
• Show resource usage for integer unit, FP unit, data cache and CDB
• Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
• Assume instructions following branch cannot proceed with execution until we know branch outcome
• Assume one single memory port
Dynamically Scheduled Superscalar MIPS
Iteration  Instruction        Issue  Execute  Memory  Write CDB  Comment
1          L.D F0,0(R1)       1      2        3       4
1          ADD.D F4,F0,F2     1      5                8          Wait for L.D
1          S.D F4,0(R1)       2      3        9                  Wait for ADD.D
1          DADDIU R1,R1,#-8   2      4                5          Wait for ALU
1          BNE R1,R2,Loop     3      6                           Wait for DADDIU
2          L.D F0,0(R1)       4      7        8       9          Wait for BNE
2          ADD.D F4,F0,F2     4      10               13         Wait for L.D
2          S.D F4,0(R1)       5      8        14                 Wait for ADD.D
2          DADDIU R1,R1,#-8   5      9                10         Wait for ALU
2          BNE R1,R2,Loop     6      11                          Wait for DADDIU
3          L.D F0,0(R1)       7      12       13      14         Wait for BNE
3          ADD.D F4,F0,F2     7      15               18         Wait for L.D
3          S.D F4,0(R1)       8      13       19                 Wait for ADD.D
3          DADDIU R1,R1,#-8   8      14               15         Wait for ALU
3          BNE R1,R2,Loop     9      16                          Wait for DADDIU
CPI = 16/15 = 1.07
Dual-issue version without speculation
Dynamically Scheduled Superscalar MIPS
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,LOOP
• Any two instructions can be issued (not only integer + FP)
• One INT unit used for ALU
• One INT unit is used for effective address calculation
• Integer ALU takes 1 cycle, load 2, FP add 3
• Pipelined FP units, 2 CDBs, perfect branch prediction
• One cycle is needed for issue and one for write results (this stage adds one cycle delay)
• Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
• Show resource usage for integer unit, FP unit, data cache and CDB
• Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
• Assume instructions following branch cannot proceed with execution until we know branch outcome
• Assume one single memory port
Dynamically Scheduled Superscalar MIPS
Iteration  Instruction        Issue  Execute  Memory  Write CDB  Comment
1          L.D F0,0(R1)       1      2        3       4
1          ADD.D F4,F0,F2     1      5                8          Wait for L.D
1          S.D F4,0(R1)       2      3        9                  Wait for ADD.D
1          DADDIU R1,R1,#-8   2      3                4
1          BNE R1,R2,Loop     3      5                           Wait for DADDIU
2          L.D F0,0(R1)       4      6        7       8          Wait for BNE
2          ADD.D F4,F0,F2     4      9                12         Wait for L.D
2          S.D F4,0(R1)       5      7        13                 Wait for ADD.D
2          DADDIU R1,R1,#-8   5      6                7
2          BNE R1,R2,Loop     6      8                           Wait for DADDIU
3          L.D F0,0(R1)       7      9        10      11         Wait for BNE
3          ADD.D F4,F0,F2     7      12               15         Wait for L.D
3          S.D F4,0(R1)       8      10       16                 Wait for ADD.D
3          DADDIU R1,R1,#-8   8      9                10
3          BNE R1,R2,Loop     9      11                          Wait for DADDIU
CPI = 11/15 = 0.73
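The timing table for the two-integer-unit configuration can be cross-checked against the stated latencies with a short script. This is a sketch, not a simulator: the cycle numbers are transcribed from the slide, `TABLE` and `check` are hypothetical names, and the dependency rules in the assertions are a reading of the "Wait for ..." comments.

```python
# (iteration, instruction, issue, execute, memory, write_cdb); None = stage not used.
TABLE = [
    (1, "L.D", 1, 2, 3, 4),
    (1, "ADD.D", 1, 5, None, 8),
    (1, "S.D", 2, 3, 9, None),
    (1, "DADDIU", 2, 3, None, 4),
    (1, "BNE", 3, 5, None, None),
    (2, "L.D", 4, 6, 7, 8),
    (2, "ADD.D", 4, 9, None, 12),
    (2, "S.D", 5, 7, 13, None),
    (2, "DADDIU", 5, 6, None, 7),
    (2, "BNE", 6, 8, None, None),
    (3, "L.D", 7, 9, 10, 11),
    (3, "ADD.D", 7, 12, None, 15),
    (3, "S.D", 8, 10, 16, None),
    (3, "DADDIU", 8, 9, None, 10),
    (3, "BNE", 9, 11, None, None),
]


def check(table):
    """Assert that the transcribed schedule respects the data and control dependencies."""
    rows = {(it, op): (issue, ex, mem, cdb) for it, op, issue, ex, mem, cdb in table}
    for it in (1, 2, 3):
        ld, add, sd = rows[it, "L.D"], rows[it, "ADD.D"], rows[it, "S.D"]
        daddiu, bne = rows[it, "DADDIU"], rows[it, "BNE"]
        assert add[1] == ld[3] + 1      # ADD.D starts the cycle after L.D writes the CDB
        assert add[3] == add[1] + 3     # FP add takes 3 cycles before its CDB write
        assert sd[2] == add[3] + 1      # store accesses memory once F4 is broadcast
        assert bne[1] == daddiu[3] + 1  # branch waits for the updated R1
        if it < 3:
            # Without speculation the next load executes only after the branch resolves.
            assert rows[it + 1, "L.D"][1] == bne[1] + 1
    return True


check(TABLE)
last_branch_execute = TABLE[-1][3]
print(round(last_branch_execute / len(TABLE), 2))
```

The slide's CPI numerator (11) coincides with the cycle in which the third BNE executes, which is the quantity the last two lines divide by the 15 instructions.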
Dynamically Scheduled Superscalar MIPS
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,LOOP
• Any two instructions can be issued (not only integer + FP)
• One INT unit used for ALU
• One INT unit is used for effective address calculation
• Integer ALU takes 1 cycle, load 2, FP add 3
• Pipelined FP units, 2 CDBs, perfect branch prediction
• One cycle is needed for issue and one for write results (this stage adds one cycle delay)
• Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
• Show resource usage for integer unit, FP unit, data cache and CDB
• Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
• Assume instructions following branch can proceed with execution even if we do not know the branch outcome (speculation)
• Assume one single memory port
Dynamically Scheduled Superscalar MIPS
Iteration  Instruction        Issue  Execute  Memory  Write CDB  Comment
1          L.D F0,0(R1)
1          ADD.D F4,F0,F2
1          S.D F4,0(R1)
1          DADDIU R1,R1,#-8
1          BNE R1,R2,Loop
2          L.D F0,0(R1)
2          ADD.D F4,F0,F2
2          S.D F4,0(R1)
2          DADDIU R1,R1,#-8
2          BNE R1,R2,Loop
3          L.D F0,0(R1)
3          ADD.D F4,F0,F2
3          S.D F4,0(R1)
3          DADDIU R1,R1,#-8
3          BNE R1,R2,Loop
CPI = _____
Dynamically Scheduled Superscalar MIPS
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,LOOP
• Any two instructions can be issued (not only integer + FP)
• One INT unit used for ALU
• One INT unit is used for effective address calculation
• Integer ALU takes 1 cycle, load 2, FP add 3
• Pipelined FP units, 2 CDBs, perfect branch prediction
• One cycle is needed for issue and one for write results (this stage adds one cycle delay)
• Show when each instruction issues, begins execution and writes to CDB for the first 3 iterations of the loop
• Show resource usage for integer unit, FP unit, data cache and CDB
• Assume that we do not have any hardware that allows us to know whether the as-yet-undecoded instruction is a branch
• Assume instructions following branch can proceed with execution even if we do not know the branch outcome (speculation)
• Assume two memory ports
Dynamically Scheduled Superscalar MIPS
Iteration  Instruction        Issue  Execute  Memory  Write CDB  Comment
1          L.D F0,0(R1)
1          ADD.D F4,F0,F2
1          S.D F4,0(R1)
1          DADDIU R1,R1,#-8
1          BNE R1,R2,Loop
2          L.D F0,0(R1)
2          ADD.D F4,F0,F2
2          S.D F4,0(R1)
2          DADDIU R1,R1,#-8
2          BNE R1,R2,Loop
3          L.D F0,0(R1)
3          ADD.D F4,F0,F2
3          S.D F4,0(R1)
3          DADDIU R1,R1,#-8
3          BNE R1,R2,Loop
CPI = _____
Increasing Instruction Fetch Bandwidth
• Predicts next instruction address, sends it out before decoding instruction
• PC of branch sent to BTB
• When match is found, Predicted PC is returned
• If branch predicted taken, instruction fetch continues at Predicted PC
[Figure: Branch Target Buffer (BTB)]
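The lookup described above can be modeled as a table keyed by the branch PC. The sketch below makes several simplifying assumptions that are not on the slide: the buffer is an unbounded dictionary rather than a fixed-size cache, only taken branches are kept in it, and instructions are 4 bytes wide. `BranchTargetBuffer`, `predict` and `update` are hypothetical names.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the PC of a predicted-taken branch to its predicted target PC."""

    def __init__(self, instr_bytes=4):
        self.instr_bytes = instr_bytes  # assumed fixed instruction width
        self.entries = {}               # branch PC -> predicted target PC

    def predict(self, pc):
        """Return the next fetch address for `pc` before the instruction is decoded."""
        if pc in self.entries:          # match found: fetch continues at the predicted PC
            return self.entries[pc]
        return pc + self.instr_bytes    # no match: fall through to the next instruction

    def update(self, pc, taken, target):
        """Record the resolved outcome of the branch at `pc`."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
btb.update(pc=0x40, taken=True, target=0x10)
print(hex(btb.predict(0x40)), hex(btb.predict(0x44)))
```

Keeping only taken branches in the table means a miss and a not-taken prediction look the same to the fetch unit, which is one common simplification; a real design also has to handle capacity and tag matching.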
Branch Folding (I)
[Figure label: Predicted instruction]
• Branch folding allows:
– 0-cycle unconditional branches (always)
– 0-cycle conditional branches (sometimes)
• BF eliminates an instruction (the branch) from the code stream
• BF eliminates the single-cycle pipeline bubble that usually occurs immediately after a branch
Branch Folding (II)
[Figure label: Predicted instructions]
If the processor is issuing two instructions per cycle
Dynamically Scheduled Superscalar MIPS
Loop: L.D F0,0(R1)ADD.D F4, F0, F2S.D F4, 0(R1)DADDIU R1, R1, #-8BNE R1, R2, LOOP
• Any two instruction can be issued (not only integer + FP)• One INT unit used for ALU• One INT unit is used for effective address calculation• Integer ALU takes 1 cycle, load 2, FP add 3• Pipelined FP units, 2 CDBs, perfect branch prediction• One cycle is needed for issue and one for write results (this stage adds one
cycle delay)• Show when each instruction issues, begins execution and writes to CDB for
the first 3 iterations of the loop• Show resource usage for integer unit, FP unit, data cache and CDB• Assume that we do have a branch-target buffers with branch folding that
allows us to know whether the as-yet-undecoded instruction is a branch andwhat are the next two instructions
• Assume instructions following branch can proceed with execution even if wedo not know the branch outcome - speculation
Dynamically Scheduled Superscalar MIPS
Iteration  Instruction        Issue  Execute  Memory  Write CDB  Comment
1          L.D F0,0(R1)
1          ADD.D F4,F0,F2
1          S.D F4,0(R1)
1          DADDIU R1,R1,#-8
1          BNE R1,R2,Loop
2          L.D F0,0(R1)
2          ADD.D F4,F0,F2
2          S.D F4,0(R1)
2          DADDIU R1,R1,#-8
2          BNE R1,R2,Loop
3          L.D F0,0(R1)
3          ADD.D F4,F0,F2
3          S.D F4,0(R1)
3          DADDIU R1,R1,#-8
3          BNE R1,R2,Loop
CPI = _____
Multiple Issue Challenges
• While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
– Exactly 50% FP operations AND no hazards
• If more instructions issue at same time, greater difficulty of decode and issue:
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue; (N-issue ~O(N^2 - N) comparisons)
– Register file: need 2x reads and 1x writes/cycle
– Rename logic: must be able to rename same register multiple times in one cycle! For instance, consider 4-way issue:
    add r1, r2, r3          add p11, p4, p7
    sub r4, r1, r2    ⇒     sub p22, p11, p4
    lw  r1, 4(r4)           lw  p23, 4(p22)
    add r5, r1, r2          add p12, p23, p4
  Imagine doing this transformation in a single cycle!
– Result buses: need to complete multiple instructions/cycle
» So, need multiple buses with associated matching logic at every reservation station.
» Or, need multiple forwarding paths
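The renaming transformation in the 4-way example can be reproduced sequentially to make the intra-group dependence explicit. This is a software sketch only; hardware has to produce the same result for all four instructions within one cycle. The `rename` function, the tuple encoding of instructions, the starting map (r2 to p4, r3 to p7) and the free-list order are assumptions chosen so that the output matches the slide.

```python
import re


def rename(instrs, mapping, free_list):
    """Rename architectural registers (rN) to physical registers (pN) in program order."""
    mapping = dict(mapping)   # architectural register -> current physical register
    free = list(free_list)    # physical registers available for new destinations
    renamed = []
    for op, dest, srcs in instrs:
        # Sources must see the mapping left by all earlier instructions in the group.
        new_srcs = [re.sub(r"r\d+", lambda m: mapping[m.group()], s) for s in srcs]
        mapping[dest] = free.pop(0)   # every destination gets a fresh physical register
        renamed.append((op, mapping[dest], new_srcs))
    return renamed


group = [
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r4", ["r1", "r2"]),
    ("lw", "r1", ["4(r4)"]),
    ("add", "r5", ["r1", "r2"]),
]
result = rename(group, {"r2": "p4", "r3": "p7"}, ["p11", "p22", "p23", "p12"])
for op, dest, srcs in result:
    print(op, dest, ", ".join(srcs))
```

Note that r1 is renamed twice in the group (to p11, then to p23), and the last add must pick up the second mapping, which is exactly the case the slide flags as hard to do in a single cycle.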
More about VLIW
• VLIW packages multiple operations into one very long instruction
• The compiler chooses the instructions to be issued
• Enough parallelism is needed in a straight-line code sequence to fill the available operation slots
– Unroll loops
– Schedule code across basic blocks using a global scheduling technique
Loop Unrolling in VLIW
Memory            Memory            FP                 FP                 Int. op/          Clock
reference 1       reference 2       operation 1        op. 2              branch
LD F0,0(R1)       LD F6,-8(R1)                                                              1
LD F10,-16(R1)    LD F14,-24(R1)                                                            2
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2                        3
LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2                      4
                                    ADDD F20,F18,F2    ADDD F24,F22,F2                      5
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                         6
SD -16(R1),F12    SD -24(R1),F16                                                            7
SD -32(R1),F20    SD -40(R1),F24                                          SUBI R1,R1,#48    8
SD -0(R1),F28                                                             BNEZ R1,LOOP      9
• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
• Average: 2.5 ops per clock, 50% efficiency
• Note: Need more registers in VLIW (15 vs. 6 in SS)
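The VLIW utilization figures can be recomputed from the schedule. The operation counts below are read off the table (7 loads, 7 FP adds, 7 stores, plus SUBI and BNEZ), and the five slots per instruction correspond to its five operation columns; the variable names are illustrative.

```python
loads, fp_adds, stores, int_ops = 7, 7, 7, 2   # int_ops: SUBI and BNEZ
clocks, slots_per_clock, iterations = 9, 5, 7

ops = loads + fp_adds + stores + int_ops
ops_per_clock = ops / clocks
efficiency = ops / (clocks * slots_per_clock)   # fraction of operation slots filled
clocks_per_iteration = clocks / iterations

print(ops, round(ops_per_clock, 2), round(efficiency, 2), round(clocks_per_iteration, 2))
```

The results are close to the rounded values on the slide: 23 operations fill about half of the 45 available slots, at roughly 1.3 clocks per loop iteration.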
Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
• HW determines address conflicts
• HW better branch prediction
• HW maintains precise exception model
• HW does not execute bookkeeping instructions
• Works across multiple implementations
• SW speculation is much easier for HW design
Superscalar v. VLIW
Superscalar:
• Smaller code size
• Binary compatibility across generations of hardware
VLIW:
• Simplified hardware for decoding, issuing instructions
• No interlock hardware (compiler checks?)
• More registers, but simplified hardware for register ports (multiple independent register files?)
Limits in Multi-issue Processors
• Inherent limitations of ILP in programs
• Difficulties in building the underlying hardware
• Limitations specific to either a superscalar or VLIW implementation
Deadlines
9   Oct 28  Lec12 - Multiple Issue
9   Oct 30  Lec13 - Study of the Limitations of ILP                                            Q4
10  Nov 4   Election day - no class
10  Nov 6   Lec14 - Review Cache and Review Virtual Memory                                     Chap 4
11  Nov 10  Homework 3 due
11  Nov 11  Lec15 - Multiprocessors and Thread-Level Parallelism; Symmetric Shared Memory      Q5
11  Nov 13  Lec16 - Distributed Shared Memory
    Nov 17  Homework 4 due
12  Nov 18  Lec18 - Homework 3 review                                                          Chap 5
12  Nov 20  Lec19 - Homework 4 review
13  Nov 25  Lec17 - Synchronization                                                            Q6
13  Nov 27  Thanksgiving - Holiday