A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Out-of-Order Execution Scheduling
Instruction Level Parallel Processing
• Sequential Execution Semantics
• Out-of-Order Execution
  – How it can help
  – Issues:
    • Maintaining Sequential Semantics
    • Scheduling
      – Scoreboard
    • Register Renaming
• Initially, we’ll focus on Registers, Memory later on
Sequential Semantics - Review
• Instructions appear as if they executed:
  – In the order they appear in the program
  – One after the other
[Figure: program order as executed under Pipelining, Superscalar, and Out-of-Order]
Out-of-Order Execution
do { sum += a[++m]; i--; } while (i != 0);
loop: add r4, r4, 1
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
[Figure: fetch/decode/execute timelines of add, ld, add, sub, bne for Superscalar and for out-of-order execution]
Sequential Semantics?
• Execution does NOT adhere to sequential semantics
• To be precise: Eventually it may
• Simplest solution: Define problem away
• Not acceptable today: e.g., Virtual Memory
• Three-phase Instruction execution
  – In-Progress, Completed and Committed
[Figure: out-of-order timeline of add, ld, add, sub, bne, marking where machine state is inconsistent and where it is consistent]
Out-of-order Execution Issues
• Preserving Sequential Semantics
• Stalling Instructions w/ dependences
• Issuing Instructions when dependences are satisfied
Back to Sequential Semantics
• Instr. exec. in 3 phases:
  – In-progress, Completed, Committed
  – OOO for in-progress and Completed
  – In-order Commits
• Completed - out-of-order: "Visible only inside"
  – Results visible to subsequent instructions
  – Results not visible to outsiders
    • On interrupts completed results are discarded
• Committed - in-order: "Visible to all"
  – Results visible to subsequent instructions
  – Results visible to outsiders
    • On interrupt committed results are preserved
How Completes Help w/ Performance
DIV R3, _, _
ADD R1, _, _
ADD _, R1, _
[Figure: timelines (horizontal axis: Time) comparing in-order completes with in-order commits against out-of-order completes with in-order commits; fetch, decode, complete and commit are marked for add, ld, add, sub, bne]
Implementing Completes/Commits
• Key idea:
  – Maintain sufficient state around to be able to roll-back when necessary
  – Roll-back:
    • Discard (aka Squash) all not committed
• One solution (conceptual):
  – Upon Complete instruction records previous value of target register
  – Upon Discard, instruction restores target value
  – Upon Commit, nothing to do
• We will return to this shortly
• Focus on scheduling mechanisms
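The conceptual solution above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `RollbackRegisterFile`, the sequence numbers, and the log layout are hypothetical, and writes to the same register are assumed to complete in program order (which holds while WAW hazards stall).

```python
class RollbackRegisterFile:
    """Registers plus a log of overwritten values for not-yet-committed writes."""

    def __init__(self, num_regs):
        self.regs = [0] * num_regs
        self.log = []  # entries: (seq, register, previous value), in completion order

    def complete(self, seq, reg, value):
        # Upon Complete: record the previous value, then update the register.
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def commit(self, seq):
        # Upon Commit: nothing to restore, just forget the saved value.
        self.log = [entry for entry in self.log if entry[0] != seq]

    def squash(self):
        # Upon Discard: undo every uncommitted write, youngest completion first.
        while self.log:
            _, reg, old = self.log.pop()
            self.regs[reg] = old


rf = RollbackRegisterFile(4)
rf.complete(seq=0, reg=1, value=10)
rf.commit(seq=0)                     # r1 = 10 is now visible to outsiders
rf.complete(seq=1, reg=1, value=20)  # completed, not committed
rf.complete(seq=2, reg=2, value=30)  # completed, not committed
rf.squash()                          # e.g., an interrupt arrives
print(rf.regs)                       # [0, 10, 0, 0]
```

After the squash only the committed write to r1 survives, which is the "visible only inside" versus "visible to all" distinction from the earlier slide.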
Out-of-Order Execution Overview
Program Form:
  – Static program
  – Dynamic inst. stream (trace)
  – Execution window
  – Completed instructions
Processing Phase:
  – Dispatch / dependences
  – Inst. issue
  – Inst. execution
  – Inst. reorder & commit
Instruction states: In-Progress, Completed, Committed
Out-of-Order Execution: Stages
• Fetch: get instruction from memory
• Decode/Dispatch: what is it? What are the dependences?
• Issue: Go – all dependences satisfied
• Execute: perform operation
• Complete: result available to other insts.
• Commit: result available to outsiders
• We’ll start w/ Decode/Dispatch
• Then we’ll consider Issue
OOO Scheduling
• Instruction @ Decode:
  – Do I have dependences yet to be satisfied?
  – Yes, stall until they are
  – No, clear to issue
• Wakeup Instructions Stalled:
  – Dependences satisfied
  – Allow instruction to issue
• Dependence:
  – (later instruction, earlier instruction) & type
• We’ll first consider RAW and then move on to WAW and WAR
Stalling @ Decode for RAW
• Are there unsatisfied dependences?
  – RAW: have to wait for register value
  – We don’t really care who is producing the value
  – Only whether it is available
• Can use the Register Availability Vector as in pipelining/superscalar
  – Also known as scoreboard
• At Decode
  – Reset bit corresponding to your target
  – At writeback set
  – Check all bits for source regs: if any is 0 stall
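A minimal Python sketch of the Register Availability Vector follows. The class and method names are hypothetical. One assumption is made explicit: the source bits are checked before the target bit is reset, so an instruction such as add r4, r4, 4 does not stall on its own target.

```python
class RegisterAvailabilityVector:
    """One bit per register: 1 = value available, 0 = write pending."""

    def __init__(self, num_regs):
        self.avail = [1] * num_regs

    def decode(self, srcs, tgt):
        # Check all source bits first: if any is 0, stall (RAW).
        if any(self.avail[r] == 0 for r in srcs):
            return False
        # Then reset the bit of our own target (None for branches/stores).
        if tgt is not None:
            self.avail[tgt] = 0
        return True

    def writeback(self, tgt):
        # The result is produced: set the bit again.
        self.avail[tgt] = 1


rav = RegisterAvailabilityVector(5)   # r0..r4
add_ok = rav.decode(srcs=[4], tgt=4)      # add r4, r4, 4
ld_stalled = rav.decode(srcs=[4], tgt=2)  # ld r2, 10(r4) must wait for r4
rav.writeback(4)                          # add writes r4
ld_ok = rav.decode(srcs=[4], tgt=2)       # now the load can proceed
print(add_ok, ld_stalled, ld_ok)          # True False True
```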
Issuing Instructions: Scheduling
• Determine when an instruction can issue
  – Ignore resources for the time being
• Stalled because of RAW w/ preceding instruction
• Concept:
  – Producer (write) notifies consumers (read)
• Requirements:
  – Consumers need to be able to identify producer
  – The register name is one possible link
• Mechanism
  – Consumer placed in a reservation station
  – Producer on complete broadcasts identity
  – Waiting instructions observe
  – Update Operand Availability
  – Issue if all operands now available
Reservation Station
• State pertaining to an instruction
  – What registers it reads
  – Whether they are available
  – What is the destination register
  – What state is the instruction in
    • Waiting
    • Executing
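The reservation-station state and the broadcast/wakeup mechanism can be sketched as below. This is an illustration, not the lecture's implementation: `ReservationStation`, `broadcast`, and the use of the register name as the producer's identity are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ReservationStation:
    op: str
    srcs: dict            # source register -> is it available?
    tgt: str              # destination register (None if there is none)
    state: str = "Waiting"

    def observe(self, produced_reg):
        # A producer broadcast its identity: update operand availability.
        if produced_reg in self.srcs:
            self.srcs[produced_reg] = True

    def ready(self):
        return self.state == "Waiting" and all(self.srcs.values())


def broadcast(stations, produced_reg):
    """Producer completes: every waiting instruction observes the register name."""
    for rs in stations:
        rs.observe(produced_reg)
    return [rs.op for rs in stations if rs.ready()]


add = ReservationStation("add r3, r3, r2", {"r3": True, "r2": False}, "r3")
bne = ReservationStation("bne r1, r0, loop", {"r1": False, "r0": True}, None)
stations = [add, bne]

ready_before = [rs.op for rs in stations if rs.ready()]
ready_after = broadcast(stations, "r2")   # the load produces r2
print(ready_before, ready_after)          # [] ['add r3, r3, r2']
```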
Out-Of-Order Exec. Example
loop: add r4, r4, 4
      ld  r2, 10(r4)     4 cycles lat
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
RAV:  r1  r2  r3  r4
      1   1   1   1
Cycle 0
op   src1  src2  tgt  status
Out-Of-Order Exec. Example: Cycle 0
loop: add r4, r4, 4
      ld  r2, 10(r4)     5 cycles lat
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
RAV:  r1  r2  r3  r4
      1   1   1   0
op   src1  src2  tgt   status  note
add  r4/1  NA/1  r4/0  Rdy     Ready to be executed
Cycle 1
RAV:  r1  r2  r3  r4
      1   0   1   1
op   src1  src2  tgt  status  note
add  r4/1  NA/1  r4   Exec    R4 gets produced now; notify those waiting for R4
ld   r4/1  NA/1  r2   Rdy
Cycle 2
RAV:  r1  r2  r3  r4
      1   0   0   1
op   src1  src2  tgt  status  note
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6
add  r3/1  r2/0  r3   Wait    Wait for r2
Cycle 3
RAV:  r1  r2  r3  r4
      0   0   0   1
op   src1  src2  tgt  status  note
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6
add  r3/1  r2/0  r3   Wait    Wait for r2
sub  r1/1  NA/1  r1   Rdy     No dependences
Cycle 4
RAV:  r1  r2  r3  r4
      1   0   0   1
op   src1  src2  tgt  status  note
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6
add  r3/1  r2/0  r3   Wait    Wait for r2
sub  r1/1  NA/1  r1   Exec    r1 produced now; notify consumers
bne  r1/1  r0/1  NA   Rdy     r1 will be available next cycle
Cycle 5
RAV:  r1  r2  r3  r4
      1   0   0   1
op   src1  src2  tgt  status  note
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6
add  r3/1  r2/0  r3   Wait    Wait for r2
sub  r1/1  NA/1  r1   Compl   Completed
bne  r1/1  r0/1  NA   Exec    Executing
Cycle 6
RAV:  r1  r2  r3  r4
      1   1   0   1
op   src1  src2  tgt  status  note
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6; notify consumers
add  r3/1  r2/1  r3   Rdy
sub  r1/1  NA/1  r1   Compl   Completed
bne  r1/1  r0/1  NA   Exec    Executing
Cycle 7
RAV:  r1  r2  r3  r4
      1   1   1   1
op   src1  src2  tgt  status  note
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Cmtd
add  r3/1  r2/1  r3   Exec    Executing; notify consumers
sub  r1/1  NA/1  r1   Compl   Completed
bne  r1/1  r0/1  NA   Compl
Cycle 8
RAV:  r1  r2  r3  r4
      1   1   1   1
op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Cmtd
add  r3/1  r2/1  r3   Cmtd
sub  r1/1  NA/1  r1   Cmtd
bne  r1/1  r0/1  NA   Cmtd
Notifying Consumers
• Identity of Producer
• Uniquely Identify the Instruction
• Easily retrievable @ decode by others
  – Target Register
    • Recall we stall on WAR or WAW
  – Functional Unit
    • If not pipelined
  – Place in instruction window
  – PC? not. Why?
Name Dependences and OOO
• WAW or WAR: We need to update register but others are still using it
  – add r1, r1, 10
  – sw r1, 20(r2)
  – add r1, r3, 30
  – sub r2, r1, 40
• There is only one r1
  – sw needs to see the value of 1st add
  – sub needs to wait for 2nd add and not 1st
• Solution: Stall decode when WAW or WAR
Detecting WAW and WAR
• WAW? Look at Scoreboard
  – If bit is 0 then there is a pending write
  – Stall
• WAR? Need to know whether all preceding consumers have read the value
  – Keep a count per register
  – Increase at decode for all reads
  – Decrease on issue
• More elegant solution via register renaming
  – Soon
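The two checks can be sketched together in Python. The class `NameHazardDetector` and its method names are hypothetical, and RAW handling is deliberately left out so that only the name-dependence logic is visible. The hazard check runs before the instruction registers its own reads, so add r1, r1, 10 does not block itself.

```python
class NameHazardDetector:
    def __init__(self, num_regs):
        self.write_done = [1] * num_regs     # scoreboard bit: 0 = pending write
        self.pending_reads = [0] * num_regs  # consumers that have not issued yet

    def decode(self, srcs, tgt):
        if tgt is not None:
            waw = self.write_done[tgt] == 0   # an earlier write is still pending
            war = self.pending_reads[tgt] > 0 # earlier readers have not read yet
            if waw or war:
                return False                  # stall decode, change nothing
            self.write_done[tgt] = 0
        for r in srcs:
            self.pending_reads[r] += 1        # increase at decode for all reads
        return True

    def issue(self, srcs):
        for r in srcs:
            self.pending_reads[r] -= 1        # decrease on issue

    def writeback(self, tgt):
        self.write_done[tgt] = 1


d = NameHazardDetector(4)
first_add = d.decode(srcs=[1], tgt=1)      # add r1, r1, 10
d.issue(srcs=[1])
sw = d.decode(srcs=[1, 2], tgt=None)       # sw r1, 20(r2)
stall_waw = d.decode(srcs=[3], tgt=1)      # add r1, r3, 30: 1st add not written
d.writeback(1)
stall_war = d.decode(srcs=[3], tgt=1)      # sw has not read r1 yet
d.issue(srcs=[1, 2])
second_add = d.decode(srcs=[3], tgt=1)     # now clear
print(first_add, sw, stall_waw, stall_war, second_add)
# True True False False True
```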
Window vs. Scheduler
• Window
  – Distance between oldest and youngest instruction that can co-exist inside the CPU
  – Larger window: potential for more ILP
• Scheduler
  – Number of instructions that are waiting to be issued
• Window
  – Instructions enter at Fetch
  – Exit at Commit
• Scheduler
  – Instructions enter at Decode
  – Leave at writeback
• Window >= Scheduler
  – Can be the same structure
• In window but not in scheduler: completed instructions
Scoreboarding
• Schedule based on RAW dependences
• WAW and WAR cause stalls
  – WAW at decode
  – WAR at writeback
    • Optimization: Why is this OK?
• Implemented in the CDC 6600 in ‘64
  – 18 non-pipelined FUs
    • 4 FP: 2 mul, 1 add, 1 div
    • 7 MEM: 5 load, 2 store
    • 7 INT: add, shift, logical etc.
• Centralized Control Scheme
  – Controls all Instruction Issue
  – Detects all hazards
MIPS/DLX w/ Scoreboarding
[Figure: Register File connected to the functional units (FP mul, FP mul, FP divide, FP add, integer), all controlled by the scoreboard]
Scoreboarding Overview
• Ignore IF and MEM for simplicity
• 4-stage execution
  – Issue: Check for structural hazards; check for WAW hazards; stall until all clear
  – ReadOp: Check for RAW hazards; wait until all operands ready; read registers
  – Execute: Execute operations; notify scoreboard when complete
  – Write: Check for WAR hazards; stall write until all clear
• A completing instruction cannot write dest if an earlier instruction has not read dest.
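The Issue and Write checks can be written as two small predicates. The sketch below is an assumption-laden illustration: the function names and the dictionary layout are hypothetical, the field names Fi/Fj/Fk/Rj/Rk anticipate the functional unit status table, and a ready bit that is True is taken to mean "operand ready but not yet read" (the usual textbook convention).

```python
def can_issue(fu_busy, result_status, fu, dest):
    """Issue: stall on a structural hazard (FU busy) or on WAW (dest has a pending writer)."""
    return not fu_busy[fu] and result_status.get(dest) is None


def can_write(fu_status, fu):
    """Write: stall while another FU still has to read the old value of our dest (WAR)."""
    dest = fu_status[fu]["Fi"]
    for other, st in fu_status.items():
        if other == fu:
            continue
        if (st["Fj"] == dest and st["Rj"]) or (st["Fk"] == dest and st["Rk"]):
            return False
    return True


fu_busy = {"Add": False, "Mult1": True}
result_status = {"F0": "Mult1"}                       # Mult1 will produce F0
waw = can_issue(fu_busy, result_status, "Add", "F0")          # False: WAW on F0
ok = can_issue(fu_busy, result_status, "Add", "F8")           # True
structural = can_issue(fu_busy, result_status, "Mult1", "F2") # False: FU busy

# DIVD F10, F0, F6 waits for F0 and has not read F6; ADDD F6, F8, F2 has completed.
fu_status = {
    "Divide": {"Fi": "F10", "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True},
    "Add":    {"Fi": "F6",  "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False},
}
blocked = can_write(fu_status, "Add")                 # False: DIVD must read F6 first
fu_status["Divide"]["Rk"] = False                     # DIVD has read its operands
released = can_write(fu_status, "Add")                # True
print(waw, ok, structural, blocked, released)         # False True False False True
```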
Scoreboarding Optimizations/Tricks
• WAW as in original OOO
• WAR is optimized
  – Second Producer is allowed to execute up to complete
  – It is stalled there until preceding consumers complete
• No Commit
  – No precise interrupts
• Window is implemented in the scoreboard
• One entry per Functional Unit
  – Recall not pipelined
  – Instructions identified by FU id
Scoreboarding Organization
• Three structures
  – Instruction Status
  – Functional Unit Status
  – Register Result Status
• Instruction Status
  – Which stage the instruction is currently in
• Functional Unit Status: scheduling
  – Busy
  – OP: Operation
  – Fi: Dest. Reg.
  – Fj, Fk: Source Regs
  – Qj, Qk: FUs producing sources
  – Rj, Rk: Ready bits for sources
• Register Result Status: dep. determination
  – Which FU will produce a register
Scoreboarding explained
• Register status reg:
  – Which FU produces the register
• Use at decode
  – Source reg match is a RAW
  – Target reg match is a WAW stall
Functional Unit Status
• Busy:
  – resource allocation
• OP:
  – what to do once issued (e.g., add, sub)
• Dest. Reg.:
  – Where to write result
  – To find WAR
• Fj, Fk Source Regs
  – for WAR: can’t write if consumers pending for previous value of register (if FU not the same)
• Qj, Qk FUs producing sources
  – To wait for appropriate producer
• Rj, Rk Ready bits for sources
  – To determine when ready: all ready
Scoreboarding Example
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write Result
LD    F6      34+  R2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional Unit Status
Name     Busy  Op  Fi (dest)  Fj (S1)  Fk (S2)  Qj (FU for j)  Qk (FU for k)  Rj (Fj?)  Rk (Fk?)
Integer  No
Mult1    No
Mult2    No
Add      No
Divide   No
Register result status
Clock   F0  F2  F4  F6  F8  F10  F12  ...  F30
FU
Example: Cycle 0
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write Result
LD    F6      34+  R2   1
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional Unit Status
Name     Busy  Op  Fi (dest)  Fj (S1)  Fk (S2)  Qj (FU for j)  Qk (FU for k)  Rj (Fj?)  Rk (Fk?)
Integer  Yes   LD  F6
Mult1    No
Mult2    No
Add      No
Divide   No
Register result status
Clock   F0  F2  F4  F6       F8  F10  F12  ...  F30
FU                  integer
Example, contd.
• The rest you’ll find on the web site
• Go through it
• Source: Patterson
• Summary:
  – Execution proceeds in an order dictated by dependences
  – RAW, WAR and WAW force ordering
  – Tricks may be possible
Beyond Simple OoO
A: LF F6, 34(R2)
B: LF F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4
[Figure: dependence graph over instructions A, B, C, D, E]
• E will wait for B, C and D.
  – WAR w/ C and D
  – WAW w/ B
• Can we do better?
What if we had infinite registers
Original:
A: LF F6, 34(R2)
B: LF F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4
With a fresh register for E:
A: LF F6, 34(R2)
B: LF F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F9, F7, F4
• No false dependences anymore
• Since we do not reuse a name we can’t have WAW and WAR
Why we can’t have Infinite Registers
• False/Name dependences (WAR and WAW)
  – Artifact of having finite registers
• There is no such thing as infinite
• There is no such thing as large enough
  – Well there is (in a sec.)
  – Computers execute Billions of Instructions per sec. Even a multi-billion register file would soon be exhausted
• Want to exploit parallelism across several instances of the same code
  – Loops, recursive functions (most frequent part)
Yes, there is “large enough”
• At any given point there will be a finite number of instructions in the window
• If each instruction has a single register target
• If there are N instructions
• How many registers do we need?
  – N?
  – N + X?
Register Renaming
• Register Version
  – Every Write creates a new version
  – Uses read the last version
  – Need to keep a version until all uses have read it.
• Register Renaming:
  – Architectural vs. Physical Registers
    • more phys. than arch.
  – Maintain a map of arch. to phys. regs.
  – Use in-order decoding to properly identify dependences.
  – Instructions wait only for input op. availability.
  – Only last version is written to reg. file.
Register Renaming
Instruction            Renamed (dest, src1, src2)
A: DIVF F3, F1, F0     r1, -, -
B: SUBF F2, F1, F0     r2, -, -
C: MULF F0, F2, F4     r3, r2, -
D: SUBF F6, F2, F3     r4, r2, r1
E: ADDF F2, F5, F4     r5, -, -
F: ADDF F0, F0, F2     r6, r3, r5
Register Rename Table (contents after each instruction is decoded)
     F0   F1   F2   F3   F5   F6   F7   ...  F30
A                   R1
B              R2   R1
C    R3        R2   R1
D    R3        R2   R1        R4
E    R3        R5   R1        R4
F    R6        R5   R1        R4
• Need more physical registers than architectural
• Ignore control flow for the time being.
Register Renaming Process
• Only need to remember last producer of each architectural register
  – Vector
• At decode
  – Find the most recent producers for all source registers
  – After: declare self as most recent producer of target register
• Complication:
  – May have to retract
    • Speculative Execution, e.g., interrupts
  – Need to be able to restore the mapping state
Register Renaming Support Structures
• Register Rename Table
  – f(aR) = pR
  – one entry per architectural Register
• Free Register List
  – Lists not used Physical Registers
• At Decode
  – grab a new register from the free list
  – Change mapping in rename table
• At Commit
  – Release Register? Not… Why?
  – Could release previous version
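A rename table and a free list fit in a short Python sketch. Everything named here (`Renamer`, `rename`, `commit`, the initial identity mapping, and the physical register numbering) is an assumption for illustration. Sources are translated before the target mapping changes, and commit releases the previous version of the target rather than the register just written.

```python
class Renamer:
    def __init__(self, arch_regs, num_phys):
        # Assumed initial state: arch. register i lives in physical register i.
        self.table = {a: i for i, a in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, tgt, srcs):
        phys_srcs = [self.table[s] for s in srcs]  # most recent producers
        if not self.free:
            return None                            # no free register: decode stalls
        prev = self.table[tgt]                     # previous version of the target
        new = self.free.pop(0)                     # grab a new register
        self.table[tgt] = new                      # change the mapping
        return new, phys_srcs, prev

    def commit(self, prev):
        # The previous version can no longer be named by later instructions.
        self.free.append(prev)


r = Renamer(["F0", "F2", "F4", "F7"], num_phys=8)
b_dst, b_srcs, b_prev = r.rename("F2", [])            # B: LF   F2, 45(R3)
c_dst, c_srcs, c_prev = r.rename("F0", ["F2", "F4"])  # C: MULF F0, F2, F4
e_dst, e_srcs, e_prev = r.rename("F2", ["F7", "F4"])  # E: ADDF F2, F7, F4
print(b_dst, c_srcs, e_dst)                           # 4 [4, 2] 6
```

C reads physical register 4 (B's version of F2) while E writes physical register 6, so E no longer has a WAR with C or a WAW with B.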
How Many Physical Registers?
• Correctness:
  – At least as many as architectural, plus?
• Performance:
  – As many as possible
  – Not correctness
  – Recall not all instructions produce register results
    • stores and branches
Dynamic Scheduling
A: DIVF F3, F1, F0 r1, -, -
B: SUBF F2, F1, F0 r2, -, -
C: MULF F0, F2, F4 r3, r2, -
D: SUBF F6, F2, F3 r4, r2, r1
E: ADDF F2, F5, F4 r5, -, -
F: ADDF F0, F0, F2 r6, r3, r5
[Figure: results travel as (Name, Value) pairs]
- Values and Names flow together
- Writeback specifies both value and name
- A waiting instruction inspects all results
- It is allowed to execute when all inputs are available
Physical Registers
• Physical register file is just one option
• What we need is separate storage
  – Consumers could keep values in their reservation station
  – Tomasulo’s next
Tomasulo’s Algorithm
• IBM 360/91 - Fast 360 for scientific code
  – Completed in 1967
  – Dynamic scheduling
  – Predates cache memories
• Pipelined FUs
  – Adder up to 3 instructions
  – Multiplier up to 2 instructions
• Tomasulo vs. Scoreboard
  – Distributed hazard detection and control
  – Results are bypassed to FUs
  – Common Data Bus (CDB) for results
    • All results visible to all instead of via a register
DLX w/ Tomasulo
• Tomasulo’s Algorithm
  – Use “tags” to identify data values
  – Reservation stations: distributed control
  – CDB broadcasts all results to all RSs
• Extend DLX as example
  – Assume multiple FUs rather than pipelined ones
  – Main difference is Register-Memory Insts.
    • I.e., DLX does not have them
    • But that’s really a detail :-)
• Physical Registers?
  – Not really. What we need is different storage and name for every version.
  – Here it’s the producing reservation station
Dynamic DLX
[Figure: Dynamic DLX datapath with the Operation Stack, Registers, Load buffers, Store buffers, reservation stations (RS) in front of the adders and Mults, and the CDB]
Tomasulo’s Algorithm
• 3 major steps
  – Dispatch
    • Get instruction from fetch queue
    • ALU op: check for available RS
    • Load: Check for available load buffer
    • If available: dispatch and copy read regs to RS or load buffer
    • If not: stall - structural hazard
  – Issue
    • If all ops are available: issue
    • If not monitor CDB for operands
  – Complete
    • If CDB available, broadcast result
    • else stall
Tomasulo’s Algorithm contd.
• Reservation stations
  – Handle distributed hazard detection and instruction control
• Everything receiving data gets a tag
  – 4-bit tag specifies reservation station or load buffer
  – Also which FU will produce result
• Register specifier is used to assign tags
  – Then they are discarded
  – Input register specifiers are ONLY used in dispatch. (Rename table)
• Common Data Bus:
  – value + “tag” = where this comes from
  – vs. typical bus: value + “tag” = where this goes to
Tomasulo’s Algorithm Contd.
• Reservation Stations
  – Op: Opcode
  – Qj, Qk: Tag Fields (source ops)
  – Vj, Vk: Operand values (source ops)
  – Busy: Currently in use
• Register file and Store Buffer
  – Qi: Tag field
  – Busy: Currently in use
  – Vi: Value
• Load Buffers
  – Busy: Currently in Use
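The tag fields above can be exercised with a toy Python model of dispatch and CDB broadcast. This is a sketch under assumptions: the class `TomasuloState`, the dictionary layout, and the computed result values are invented for the example, and functional units, load buffers and Busy bits are omitted. A tag of `None` stands for "the register file has the value".

```python
class TomasuloState:
    def __init__(self, regs):
        self.value = dict(regs)              # Vi: register values
        self.tag = {r: None for r in regs}   # Qi: producing RS, None = value is here
        self.rs = {}                         # RS name -> {"op", "Qj", "Qk", "Vj", "Vk"}

    def dispatch(self, rs_name, op, tgt, src1, src2):
        st = {"op": op, "Qj": None, "Qk": None, "Vj": None, "Vk": None}
        for q, v, src in (("Qj", "Vj", src1), ("Qk", "Vk", src2)):
            if isinstance(src, str) and self.tag[src] is not None:
                st[q] = self.tag[src]        # not yet available: remember the tag
            elif isinstance(src, str):
                st[v] = self.value[src]      # copy the register value into the RS
            else:
                st[v] = src                  # immediate operand
        self.rs[rs_name] = st
        self.tag[tgt] = rs_name              # rename the target to this RS

    def broadcast(self, rs_name, result):
        # CDB: value + tag saying where the value comes from.
        for st in self.rs.values():
            if st["Qj"] == rs_name:
                st["Qj"], st["Vj"] = None, result
            if st["Qk"] == rs_name:
                st["Qk"], st["Vk"] = None, result
        for reg, q in self.tag.items():
            if q == rs_name:                 # only the last version is written
                self.tag[reg], self.value[reg] = None, result
        del self.rs[rs_name]


t = TomasuloState({"r1": 1, "r2": 2, "r3": 3, "r4": 4})
t.dispatch("RS0", "add", "r1", "r2", 10)     # add r1, r2, 10
t.dispatch("RS1", "sub", "r4", "r1", 20)     # sub r4, r1, 20 waits on RS0
t.dispatch("RS2", "add", "r1", "r3", 30)     # add r1, r3, 30 renames r1 again
t.broadcast("RS0", 12)                       # first add completes: 2 + 10
print(t.rs["RS1"]["Vj"], t.tag["r1"], t.value["r1"])   # 12 RS2 1
```

The sub captures the first add's result from the bus, while the register file keeps waiting for RS2, so the older version of r1 is never written.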
Tomasulo’s: Understanding Speculative vs. Architectural State
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
Register file
Arch. Reg. Name   Where is the register?   Value
r1                I have it                Value of r1
r2                I have it                Value of r2
r3                I have it                Value of r3
r4                I have it                Value of r4
“Where is the register?” can be: “I have it”, “reservation station id”
Reservation Stations
tgt (Reg Arch. name)   src1 (where / value)   src2 (where / value)
NA                     NA / Value of Src1     NA / Value of Src2
NA                     NA / Value of Src1     NA / Value of Src2
Renaming 1st Instruction
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
Register file
r1   RS0         -----
r2   I have it   Value of r2
r3   I have it   Value of r3
r4   I have it   Value of r4
Reservation Stations
RS    tgt   src1 (where / value)      src2 (where / value)
RS0   r1    I have it / Value of R2   I have it / 10
RS1   NA    NA / Value of Src1        NA / Value of Src2
RS2   NA    NA / Value of Src1        NA / Value of Src2
• Read sources (r2)
• Rename r1 to RS0
Renaming 2nd Instruction
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
Register file
r1   RS0         -----
r2   I have it   Value of r2
r3   I have it   Value of r3
r4   RS1         ----
Reservation Stations
RS    tgt   src1 (where / value)      src2 (where / value)
RS0   r1    I have it / Value of R2   I have it / 10
RS1   r4    RS0 / ----                I have it / 20
RS2   NA    NA / Value of Src1        NA / Value of Src2
• Sources: r1 in RS0 NYA
• Rename r4 to RS1
Renaming 3rd Instruction
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
Register file
r1   RS2         -----
r2   I have it   Value of r2
r3   I have it   Value of r3
r4   RS1         ----
Reservation Stations
RS    tgt   src1 (where / value)      src2 (where / value)
RS0   r1    I have it / Value of R2   I have it / 10
RS1   r4    RS0 / ----                I have it / 20
RS2   r1    I have it / Value of R3   I have it / 30
• Sources: r3 Avail.
• Rename r1 to RS2
Example: cycle 0
Instruction status
Instruction   j    k    Issue  Execution complete  Write Result
LD    F6      34+  R2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Reservation Stations
Time  Name   Busy  Op  Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No
Register result status
     F0  F2  F4  F6  F8  F10  ...
FU
Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No
Example: cycle 1
Instruction status
Instruction   j    k    Issue  Execution complete  Write Result
LD    F6      34+  R2   1
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Reservation Stations
Time  Name  Busy  Op  Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     A1    No
0     A2    No
0     A3    No
0     M1    No
0     M2    No
Register result status
     F0  F2  F4  F6  F8  F10  ...
FU               L1
Load buffers
Name  Busy  Address
L1    Yes   34+R2
L2    No
L3    No
Example: cycle 3
Instruction status
Instruction   j    k    Issue  Execution complete  Write Result
LD    F6      34+  R2   1      3
LD    F2      45+  R3   2
MULTD F0      F2   F4   3
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Reservation Stations
Time  Name  Busy  Op   Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     A1    No
0     A2    No
0     A3    No
0     M1    Yes   Mul           R(F4)    L2
0     M2    No
Register result status
     F0  F2  F4  F6  F8  F10  ...
FU   M1  L2      L1
Load buffers
Name  Busy  Address
L1    Yes   34+R2
L2    Yes   45+R3
L3    No
- Mul is issued vs. scoreboard
- What’s waiting for L1?
Example…
• Check the web site…
• Too much for in-class
• Summary:
  – Execution proceeds in any order that does not violate RAW dependences
  – WAR and WAW are removed
Tomasulo’s vs. Scoreboard
Tomasulo’s:
Instruction   j    k    Issue  Execution complete  Write Result
LD    F6      34+  R2   1      3                   4
LD    F2      45+  R3   2      4                   5
MULTD F0      F2   F4   3      15                  16
SUBD  F8      F6   F2   4      7                   8
DIVD  F10     F0   F6   5      56                  57
ADDD  F6      F8   F2   6      10                  11
- In-order issue
- Out-of-order execution
- Out-of-order completion
Scoreboard:
Instruction   j    k    Issue  Read operands  Execution complete  Write Result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             61                  62
ADDD  F6      F8   F2   13     14             16                  22
Tomasulo’s
• Out-of-order loads and stores?
  – What about WAW, RAW and WAR?
  – Compare all load addresses against the addresses of all preceding store buffers
  – Stall if they match
• CDB is a bottleneck
  – One write per cycle
  – Could duplicate
  – But it comes at a cost
  – Datapath + duplicated tags and control
• Complex Implementation
  – Scalability?
  – All results to all sources
  – What if we want 128 instrs?
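The load/store check in the first bullet amounts to an associative address comparison. A minimal Python sketch, assuming every preceding store address is already known and using a hypothetical helper name `load_must_wait`:

```python
def load_must_wait(load_addr, preceding_store_addrs):
    """Stall the load if any preceding buffered store targets the same address."""
    return any(load_addr == store_addr for store_addr in preceding_store_addrs)


store_buffer = [0x100, 0x208]                  # addresses of older, pending stores
conflict = load_must_wait(0x208, store_buffer)
no_conflict = load_must_wait(0x300, store_buffer)
print(conflict, no_conflict)                   # True False
```

Every load compares against every buffered store, which is the same all-to-all cost the slide raises for the CDB.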
Tomasulo’s
• Advantages
  – Distribution of hazard detection
  – Elimination of WAR and WAW stalls
• Common Data Bus
  – Broadcasts result to multiple instrs (+)
  – Bottleneck
• Register Renaming
  – Removes WAR and WAW hazards
  – More interesting when same code appears twice
    • Think of loops
    • More on this later
  – BUT: Associative lookups
  – RECALL: direct map is faster
In Summary
Feature       Scoreboarding (CDC6600)    Tomasulo’s (IBM 360)
Structural    Stall in Issue for FU      Stall in Dispatch for RS; Stall in RS for FU
RAW           Via Registers              From CDB
WAR           Stall in WB                Copy Value to RS
WAW           Stall in Issue             Register Renaming
Logic         Centralized                Distributed
Bottlenecks   No Register Bypass;        One CDB
              Stall in issue block