Date post: | 02-Jun-2018 |
Category: |
Documents |
Upload: | arezo-shafiee |
8/10/2019 08-speculation_2 [Compatibility Mode].pdf
Dynamic ILP
Speculation
Outline
Speculation
Re-order buffers
Limits to ILP
Speculation
Branch Prediction and Out-of-Order Execution
Branch Prediction and Speculative Execution
Speculation means running instructions based on a prediction; predictions could be wrong.
Branch prediction: cannot be avoided, could be very accurate
Misprediction is a less frequent event, but can we ignore it?
Example:
for (i=0; i
Exception Behavior
Preserving exception behavior: exceptions must be raised exactly as in sequential execution
  Same sequence as sequential
  No extra exceptions
Example:
  DADDU R2,R3,R4
  BEQZ R2,L1
  LW R1,0(R2)
L1:
Problem with moving LW before BEQZ?
Again, a dynamic execution must look like a sequential execution at any time when it is stopped
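One way to see the problem with moving LW before BEQZ is a minimal Python sketch of the three instructions. The memory image, the MemoryFault exception, and the function names are assumptions made for illustration; they are not part of the slides. When R2 is zero, the sequential order skips the load, while the hoisted order performs it and may raise an exception that sequential execution never would.

```python
class MemoryFault(Exception):
    """Raised when a load touches an address that is not mapped."""

# Hypothetical memory image: only address 8 is mapped.
MEMORY = {8: 42}

def load(addr):
    if addr not in MEMORY:
        raise MemoryFault(f"bad address {addr}")
    return MEMORY[addr]

def run_sequential(r3, r4):
    """DADDU R2,R3,R4 ; BEQZ R2,L1 ; LW R1,0(R2) ; L1:"""
    r1 = None                 # None stands for "R1 left unchanged"
    r2 = r3 + r4              # DADDU R2,R3,R4
    if r2 != 0:               # BEQZ R2,L1 skips the load when R2 == 0
        r1 = load(r2)         # LW R1,0(R2)
    return r1

def run_hoisted(r3, r4):
    """Same code with LW moved above BEQZ and no recovery mechanism."""
    r2 = r3 + r4              # DADDU R2,R3,R4
    loaded = load(r2)         # LW executes even when the branch is taken
    if r2 != 0:
        return loaded
    return None
```

With R3 + R4 = 8 both versions return the same value. With R3 + R4 = 0 the sequential version finishes quietly, and the hoisted version raises MemoryFault, which is an extra exception.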
Exceptions in Order
Solutions:
  Early detection of FP exceptions
  The use of software mechanisms to restore a precise exception state before resuming execution
  Delaying instruction completion until we know an exception is impossible
Precise Interrupts
An interrupt is precise if the saved process state corresponds with a sequential model of program execution where one instruction completes before the next begins.
Tomasulo had:
  In-order issue, out-of-order execution, and out-of-order completion
Need to fix the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream.
Short Seminar: Precise Exceptions
1. 01277582 (Implementation of precise exception in a 5-stage pipeline embedded processor - CNF03).pdf
2. 01354393 (A 0.18-μm CMOS implementation of an area efficient precise exception handling unit for processing-in-memory systems - CNF04).pdf
3. 00004607 (Implementing precise interrupts in pipelined processors - JNL88).pdf
Branch Prediction vs. Precise Interrupt
A misprediction is an "exception" on the branch instruction
Execution branches out on exceptions
  Every instruction is predicted not to take the branch to the interrupt handler
The same technique handles both issues:
  In-order completion or commit: change register/memory only in program order (sequential)
How does it ensure correctness?
HW Support for More ILP
Speculation: allow an instruction to issue that is dependent on a branch predicted to be taken, without any consequences (including exceptions) if the branch is not actually taken ("HW undo")
Combine branch prediction with dynamic scheduling to execute before branches are resolved
Separate speculative bypassing of results from real bypassing of results
  When an instruction is no longer speculative, write boosted results (instruction commit) or discard boosted results
Execute out-of-order but commit in-order to prevent any irrevocable action (update state or exception) until the instruction commits
Reorder Buffer Implementation
Result Shift Register
A "Result Shift Register" is used to control the result bus
N is the length of the longest functional unit pipeline
An instruction that takes i clock periods reserves stage i
  If the stage already contains valid control information, then issue is held until the next clock period
An issuing instruction places control information in the result shift register:
  the functional unit that will be supplying the result
  the destination register
This control information is also marked "valid"
Each clock period, the control information is shifted down one stage toward stage one
When it reaches stage one, it is used during the next clock period to control the result bus
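The reservation and shifting rules above can be sketched in a few lines of Python. This is an illustrative model only: the class name, method names, and the use of None to mean "not valid" are assumptions, not something the slides specify.

```python
class ResultShiftRegister:
    def __init__(self, n_stages):
        # stages[0] is stage one; stages[i - 1] is stage i.
        # None means the stage holds no valid control information.
        self.stages = [None] * n_stages

    def try_issue(self, latency, unit, dest_reg):
        """Reserve stage `latency`; return False if issue must be held."""
        if self.stages[latency - 1] is not None:
            return False                      # stage already reserved
        self.stages[latency - 1] = (unit, dest_reg)
        return True

    def tick(self):
        """Advance one clock period.

        Returns the entry that was in stage one, i.e. the (unit, dest_reg)
        pair that controls the result bus during this period, or None.
        """
        on_bus = self.stages[0]
        self.stages = self.stages[1:] + [None]   # shift toward stage one
        return on_bus
```

In this model two instructions with the same latency cannot issue in the same clock period, because both would need the same stage; the second one is held and retried after the next tick.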
The Hardware: Reorder Buffer
If instructions write results in program order, registers/memory always get the correct values
The reorder buffer (ROB) reorders out-of-order instructions into program order at the time of writing registers/memory (commit)
If some instruction goes wrong, handle it at the time of commit: just flush the instructions after it
An instruction cannot write registers/memory immediately after execution, so the ROB also buffers the results
  There is no such place in the original Tomasulo design
[Figure: datapath block diagram. Labels: Fetch Unit, IM, Decode, Rename, Regfile, Reorder Buffer, RS, RS, FU1, FU2, L-buf, S-buf, DM]
Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called "dispatch")
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called "issue")
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
   When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called "graduation")
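The issue, write-result, and commit steps can be sketched as a small Python model of a reorder buffer. It leaves out reservation stations and the CDB, and every name in it (RobEntry, ReorderBuffer, the field names) is a hypothetical choice for illustration. The point it shows is that results may arrive out of order, but registers and memory change only when the head entry commits, and a mispredicted branch at the head discards everything behind it.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class RobEntry:
    kind: str                  # "alu", "store", or "branch"
    dest: object = None        # register name, or memory address for a store
    value: object = None
    ready: bool = False
    mispredicted: bool = False

class ReorderBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # head is entries[0], in program order
        self.regs = {}          # architectural registers
        self.mem = {}           # architectural memory

    def issue(self, kind, dest=None):
        """Allocate a slot at the tail; None means the ROB is full (stall)."""
        if len(self.entries) >= self.capacity:
            return None
        entry = RobEntry(kind, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Out-of-order completion: buffer the result, leave regs/mem alone."""
        entry.value = value
        entry.mispredicted = mispredicted
        entry.ready = True

    def commit(self):
        """Retire from the head, in program order, while results are present."""
        retired = 0
        while self.entries and self.entries[0].ready:
            head = self.entries.popleft()
            retired += 1
            if head.kind == "alu":
                self.regs[head.dest] = head.value
            elif head.kind == "store":
                self.mem[head.dest] = head.value
            elif head.kind == "branch" and head.mispredicted:
                self.entries.clear()   # flush all younger instructions
                break
        return retired
```

For example, if an ALU instruction after a branch finishes first, commit() retires nothing until the older instructions are ready; if the branch then turns out to be mispredicted, the younger result is dropped without ever reaching the register file.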
Reorder Buffer Details
Holds instruction type: branch, store, ALU register operation
Holds branch valid and exception bits
  Flush pipeline when any bit is set
Holds dest, result and PC
  Write results to dest at the time of commit
  Which PC to hold?
A ready bit indicates if the instruction has completed execution and the value is ready
Supplies operands between execution complete and commit
ROB also replaces the store buffer
[Figure: reorder buffer entry fields. Labels: Dest reg, Result, Exceptions?, Program Counter, Branch or L/W?, Ready?]
Speculative Execution Recovery
Flush the pipeline on misprediction
  MIPS 5-stage pipeline used flushing on taken branches
  Where is the flush signal from?
  When to flush?
[Figure: datapath block diagram. Labels: Fetch Unit, IM, Decode, Rename, Regfile, Reorder Buffer, RS, RS, FU1, FU2, L-buf, S-buf, DM]
Changes to Other Components
Use ROB index as tag
  Why not RS index any more?
  Why is ROB index a valid choice?
Renaming table maps architecture registers to ROB index if the register is renamed
Reservation stations now use ROB index for tracking dependence and for wakeup
Again, tag (now ROB index) and data are broadcast on CDB at writeback
Inst may receive values from reg/mem, data broadcasting, or ROB
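The three operand sources listed above (register file, ROB, broadcast) can be sketched in Python. The data layout is an assumption chosen for brevity: the rename table is a dict from register name to ROB index, each ROB entry is a dict with "ready" and "value" keys, and a waiting operand is a ("tag", rob_index) pair. None of these names come from the slides.

```python
def read_operand(reg, rename_table, rob, regfile):
    """Return ("value", v) if available now, else ("tag", rob_index)."""
    tag = rename_table.get(reg)
    if tag is None:
        return ("value", regfile[reg])        # not renamed: read register file
    if rob[tag]["ready"]:
        return ("value", rob[tag]["value"])   # completed but not yet committed
    return ("tag", tag)                       # wait for the CDB broadcast

def broadcast(tag, value, rob, waiting):
    """Writeback: send (ROB index, data) to the ROB and to waiting operands.

    `waiting` is a list of operand lists, one per reservation station.
    """
    rob[tag]["value"] = value
    rob[tag]["ready"] = True
    for operands in waiting:
        for i, (kind, payload) in enumerate(operands):
            if kind == "tag" and payload == tag:
                operands[i] = ("value", value)   # wake up the operand
```

A reservation station in this model can start executing once all of its operands are ("value", ...) pairs.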
Complexity of ROB
Assume dual-issue superscalar
  Load/Store machine with three-operand instructions
  64 registers
  16-entry circular buffer
Hardware support needed for ROB
  Two write ports
  Four read ports (two source operands of two instructions)
  Four 6-bit comparators for associative lookup
Limited capacity of ROB is a structural hazard
Repeated writes to same register actually happen
  This is not the case in classical Tomasulo
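The port and comparator figures follow from the stated machine parameters. The arithmetic below is a sketch; the constant names are made up for illustration, the reading that each source operand needs one comparator is an assumption, and the ROB index width on the last line is an extra derived number that the slide does not list.

```python
NUM_REGISTERS = 64
ISSUE_WIDTH = 2               # dual-issue
SOURCES_PER_INSTRUCTION = 2   # three-operand format: two sources, one destination
RESULTS_PER_INSTRUCTION = 1
ROB_ENTRIES = 16

# Bits needed to name one of 64 registers; this is the comparator width.
reg_specifier_bits = (NUM_REGISTERS - 1).bit_length()

# Assumption: one read port and one comparator per source operand per cycle.
read_ports = ISSUE_WIDTH * SOURCES_PER_INSTRUCTION
comparators = read_ports

# One result per issued instruction.
write_ports = ISSUE_WIDTH * RESULTS_PER_INSTRUCTION

# Not on the slide: bits needed to index a 16-entry ROB.
rob_index_bits = (ROB_ENTRIES - 1).bit_length()
```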
Code Example
Loop: LD     R2, 0(R1)
      DADDIU R2, R2, #1
      SD     R2, 0(R1)
      DADDIU R1, R1, #4
      BNE    R2, R3, Loop
How would this code be executed?
Inst   Issue   Exec   Memory read   Write results   Commit
LD     1       2      3             4               5
Summary
Reservation stations: implicit register renaming to a larger set of registers + buffering source operands
  Prevents registers from becoming a bottleneck
  Avoids WAR, WAW hazards of Scoreboard
Not limited to basic blocks when compared to static scheduling (integer unit gets ahead, beyond branches)
Today, helps cache misses as well
  Don't stall for L1 data cache miss
  Can support memory-level parallelism
Lasting Contributions
  Dynamic scheduling
  Register renaming
  Load/store disambiguation
360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
HW determines address conflicts
HW has better branch prediction
HW maintains precise exception model
Works across multiple implementations
SW speculation is much easier for HW design