ILP: speculation and multiple-instruction issue
ESE 545 Computer Architecture
Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW
Review from Last Lecture
Leverage implicit parallelism for performance: instruction-level parallelism
- Loop unrolling by the compiler to increase ILP
- Branch prediction to increase ILP
- Dynamic HW exploiting ILP
  - Works when dependences can't be known at compile time
  - Can hide L1 cache misses
  - Code compiled for one machine runs well on another
Review from Last Lecture
Reservation stations: renaming to a larger set of registers + buffering of source operands
- Prevents registers from becoming a bottleneck
- Avoids WAR and WAW hazards
- Allows loop unrolling in HW
- Not limited to basic blocks (integer units get ahead, beyond branches)
- Helps with cache misses as well
Lasting contributions:
- Dynamic scheduling
- Register renaming
- Load/store disambiguation
360/91 descendants include the Pentium 4, Power 5, AMD Athlon/Opteron, …
Speculation to Greater ILP
Greater ILP: overcome control dependence by speculating in hardware on the outcome of branches and executing the program as if the guesses were correct
- Speculation: fetch, issue, and execute instructions as if branch predictions were always correct
- Dynamic scheduling alone only fetches and issues such instructions
Essentially a data-flow execution model: operations execute as soon as their operands are available
Speculation to Greater ILP
Three components of HW-based speculation:
1. Dynamic branch prediction to choose which instructions to execute
2. Speculation to allow execution of instructions before control dependences are resolved, plus the ability to undo the effects of an incorrectly speculated sequence
3. Dynamic scheduling to deal with scheduling of different combinations of basic blocks
Adding Speculation to Tomasulo
Must separate execution from allowing an instruction to finish, or "commit"
- This additional step is called instruction commit
- When an instruction is no longer speculative, allow it to update the register file or memory
- Requires an additional set of buffers to hold results of instructions that have finished execution but have not committed
- This reorder buffer (ROB) is also used to pass results among instructions that may be speculated
Phases of Instruction Execution
- Fetch: instruction bits retrieved from the instruction cache (I-cache) into the fetch buffer
- Decode/Rename: instructions dispatched to the appropriate issue buffer
- Execute: instructions and operands issued to functional units; when execution completes, all results and exception flags are available
- Commit: instruction irrevocably updates architectural state (aka "graduation"), or takes a precise trap/interrupt
[Pipeline diagram: PC -> I-cache -> Fetch Buffer -> Decode/Rename -> Issue Buffer -> Functional Units -> Reorder Buffer -> Commit -> Architectural State]
Reorder Buffer (ROB)
In Tomasulo's algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file
With speculation, the register file is not updated until the instruction commits (when we know definitively that the instruction should execute)
- Thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit
- The ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo's algorithm
- The ROB extends the architectural registers in the same way the RS do
Separating Completion from Commit
The reorder buffer holds register results from completion until commit
- Entries are allocated in program order during decode
- Buffers completed values and exception state until the in-order commit point
- Completed values can be used by dependents before they are committed (bypassing)
- Each entry holds the program counter, instruction type, destination register specifier and value if any, and exception status (info often compressed to save hardware)
Memory reordering needs special data structures:
- Speculative store address and data buffers
- Speculative load address and data buffers
Reorder Buffer Entry
Each entry in the ROB contains four fields:
1. Instruction type: a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has a register destination)
2. Destination: register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written
3. Value: value of the instruction result until the instruction commits
4. Ready: indicates that the instruction has completed execution and the value is ready
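The four fields above map naturally onto a small data structure. Below is a minimal Python sketch (field and function names are illustrative, not from any real design) of a ROB that retires completed entries strictly in order from the head:

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    itype: str            # 'branch', 'store', or 'register' (ALU op / load)
    dest: Optional[object]  # register name or memory address; None for branches
    value: object = None  # result, held here until commit
    ready: bool = False   # set when execution completes

rob = deque()             # FIFO: entries allocated in program order

def commit(regfile, memory):
    """Retire completed instructions strictly from the head of the ROB."""
    while rob and rob[0].ready:
        e = rob.popleft()
        if e.itype == 'register':
            regfile[e.dest] = e.value
        elif e.itype == 'store':
            memory[e.dest] = e.value
        # branches update no architectural state on commit
```

Note how an entry that completes out of order simply waits in the buffer: nothing behind an unready head can commit, which is what makes undoing speculation possible.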
Reorder Buffer Operation
- Holds instructions in FIFO order, exactly as issued
- When instructions complete, results are placed into the ROB
- Supplies operands to other instructions between execution complete and commit: more registers, like the RS
- Results are tagged with the ROB entry number instead of a reservation station number
- Instructions commit: values at the head of the ROB are placed in registers
- As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
[Diagram: FP Op Queue -> Reservation Stations -> FP adders/multipliers -> Reorder Buffer -> FP Regs, with the commit path running from the ROB to the register file and memory]
4 Steps of the Speculative Tomasulo Algorithm
1. Issue — get instruction from the FP Op Queue. If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called "dispatch")
2. Execution — operate on operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; checks RAW (sometimes called "issue")
3. Write result — finish execution (WB). Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available
4. Commit — update the register with the reorder result. When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called "graduation")
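The flush behavior in step 4 can be sketched as a toy Python model (the entries and field layout are illustrative, not the actual hardware):

```python
from collections import deque

# Program-order ROB; each entry is (name, is_branch). Fields are illustrative.
rob = deque([
    ('DIVD F2,F10,F6', False),
    ('BNE  F2,...',    True),   # speculated branch
    ('LD   F4,0(R3)',  False),  # fetched down the predicted path
    ('ADDD F0,F4,F6',  False),
])

def commit_head(branch_mispredicted):
    """Retire the head entry; a mispredicted branch squashes all younger entries."""
    name, is_branch = rob.popleft()
    if is_branch and branch_mispredicted:
        rob.clear()              # undo: discard every speculated instruction
    return name
```

Because the flush happens at commit, every instruction older than the branch has already retired in order, so architectural state is exactly what a non-speculative machine would have produced.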
Tomasulo With Reorder Buffer: Example
[Figure sequence: successive snapshots of the FP Op Queue, reservation stations, reorder buffer (ROB1–ROB7, oldest at the head), FP registers, and FP adders/multipliers as the following code is issued and executed:

  LD   F0,10(R2)
  ADDD F10,F4,F0
  DIVD F2,F10,F6
  BNE  F2,<…>
  LD   F4,0(R3)
  ADDD F0,F4,F6
  ST   0(R3),F4

The load, add, and divide issue into ROB1–ROB3, each result tagged with its ROB entry number. The branch and the instructions after it (ROB4–ROB7) issue speculatively down the predicted path. Both loads later complete (Done = Y, returning M[10]) and the speculative ADDD executes while the older DIVD is still pending, but nothing past the branch may commit until the branch resolves at the head of the ROB.]
What about memory hazards???
Avoiding Memory Hazards
WAW and WAR hazards through memory are eliminated with speculation, because the actual updating of memory occurs in order, when a store is at the head of the ROB; hence no earlier loads or stores can still be pending
RAW hazards through memory are maintained by two restrictions:
1. Not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load
2. Maintaining program order for the computation of the effective address of a load with respect to all earlier stores
These restrictions ensure that any load that accesses a memory location written by an earlier store cannot perform the memory access until the store has written the data
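Restriction 1 amounts to an associative search over the store entries older than the load. A hedged Python sketch (the dictionary field names are illustrative):

```python
# Sketch of the RAW-through-memory check: a load may access memory only if no
# earlier, still-active store in the ROB targets the same address.
def load_may_proceed(load_addr, older_rob_entries):
    for e in older_rob_entries:          # entries older than the load
        if e['type'] == 'store' and e['dest'] == load_addr:
            return False                 # must wait for the store's data
    return True
```

A real machine would also forward the store's data to the load on a match rather than simply stalling, but the conservative check above already guarantees correctness.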
Exceptions and Interrupts
Interrupts and exceptions either interrupt the current instruction or happen between instructions
- Possibly large quantities of state must be saved before interrupting
Machines with precise exceptions provide one single point in the program from which to restart execution
- All instructions before that point have completed
- No instructions after or including that point have completed
The IBM 360/91 invented "imprecise interrupts"
- The computer stopped at this PC; it's likely close to the faulting address
- Not so popular with programmers
- Also, what about virtual memory? (Not in the IBM 360)
Technique for both precise interrupts/exceptions and speculation: in-order issue and in-order commit
- If we speculate and are wrong, we need to back up and restart execution at the point where we predicted incorrectly
- This is exactly what we need to do for precise exceptions
Exceptions are handled by not recognizing the exception until the instruction that caused it is ready to commit in the ROB
- If a speculated instruction raises an exception, the exception is recorded in the ROB
- This is why reorder buffers are used in all new processors!
"Data-in-ROB" Design (HP PA8000, Pentium Pro, Core2Duo, Nehalem)
- Managed as a circular buffer in program order; new instructions are dispatched to free slots, and the oldest instruction is committed/reclaimed when done ("p" bit set on its result)
- The tag is given by the index in the ROB (the Free pointer value)
- In dispatch, non-busy source operands are read from the architectural register file and copied to Src1 and Src2 with the presence bit "p" set; busy operands copy the tag of the producer and clear the "p" bit
- The valid bit "v" is set on dispatch; the issued bit "i" is set on issue
- On completion, search the source tags; on a tag match, set the "p" bit and copy the data into the matching source; write the result and exception flags to the ROB
- On commit, check the exception status and copy the result into the architectural register file if there is no trap; on a trap, flush the machine and the ROB, set free = oldest, and jump to the handler
Each entry: | v | i | Opcode | p Tag Src1 | p Tag Src2 | p Reg Result | Except? |  (the oldest and free pointers walk the circular buffer of such entries)
Managing Rename for Data-in-ROB
- A rename table with one entry <p, Tag, Value> per architectural register is associated with the architectural registers and managed in decode/dispatch
- If the "p" bit is set, use the value in the architectural register file; else, the tag field indicates the instruction that will produce (or has produced) the value
- For dispatch, read source operands <p, tag, value> from the architectural regfile, then also read <p, result> from the producing instruction in the ROB at the tag index, bypassing as needed; copy the operands to the ROB
- Write the destination architectural register entry with <0, Free, _> to assign the tag to this instruction's ROB index
- On commit, update the architectural regfile with <1, _, Result>
- On a trap, reset the table (all p = 1)
Data Movement in the Data-in-ROB Design
[Diagram: the Architectural Register File supplies source operands read during decode (with newer values bypassed at dispatch, and the sources written into the ROB in dispatch); the ROB supplies operands read at issue to the Functional Units; result data is written to the ROB at completion, read from it for commit, and written back to the architectural register file at commit]
Unified Physical Register File (MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy/Ivy Bridge)
- Rename all architectural registers into a single physical register file during decode; no register values are read at rename
- Functional units read and write a single unified register file, holding both committed and temporary registers, during execute
- Commit only updates the mapping of architectural register to physical register; no data movement
[Diagram: decode-stage and committed register mappings both point into the Unified Physical Register File; the functional units read operands at issue and write results at completion]
Lifetime of Physical Registers
Original code:            After rename:
  ld   x1, (x3)             ld   P1, (Px)
  addi x3, x1, #4           addi P2, P1, #4
  sub  x6, x7, x9           sub  P3, Py, Pz
  add  x3, x3, x6           add  P4, P2, P3
  ld   x6, (x1)             ld   P5, (P1)
  add  x6, x6, x3           add  P6, P5, P4
  sd   x6, (x1)             sd   P6, (P1)
  ld   x6, (x11)            ld   P7, (Pw)
- The physical regfile holds committed and speculative values
- Physical registers are decoupled from ROB entries (no data in the ROB)
When can we reuse a physical register? When the next writer of the same architectural register commits
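The reuse rule can be sketched as a free-list allocator in Python (the register names follow the slide's example; the function names are illustrative):

```python
# Minimal rename sketch: map architectural -> physical registers from a free
# list; the previous mapping (LPRd) is freed only when the next writer commits.
free_list = ['P0', 'P1', 'P2', 'P3', 'P4']
rename    = {'x1': 'P8', 'x3': 'P7', 'x6': 'P5'}   # current mappings

def rename_dest(arch_reg):
    """Allocate a fresh physical register for a new writer of arch_reg."""
    new_pr = free_list.pop(0)
    old_pr = rename[arch_reg]     # LPRd: to be freed when this writer commits
    rename[arch_reg] = new_pr
    return new_pr, old_pr

def commit_writer(old_pr):
    """At commit, the previous value can no longer be needed: recycle it."""
    free_list.append(old_pr)
```

Keeping the old mapping until commit is what makes recovery possible: on a misprediction or trap, the old physical register still holds the committed value.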
Physical Register Management
[Figure sequence: rename table (x0–x7), physical registers P0–Pn, free list, and ROB entries of the form <op, p1, PR1, p2, PR2, ex, use, Rd, PRd, LPRd> as the code below is renamed step by step (note: the LPRd field requires a third read port on the rename table for each instruction):

  ld   x1, 0(x3)
  addi x3, x1, #4
  sub  x6, x7, x6
  add  x3, x3, x6
  ld   x6, 0(x1)

Initially x1 -> P8, x3 -> P7, x6 -> P5, x7 -> P6, with P0–P4 on the free list. Each instruction pops a free physical register for its destination (the ld gets P0 for x1, the addi P1 for x3, the sub P3 for x6, the add P2 for x3, the second ld P4 for x6) and records the previous mapping in its LPRd field. As instructions execute and commit in order, each commit returns its LPRd register to the free list (P8 after the first ld commits, P7 after the addi, …), since the previous value of that architectural register can no longer be needed.]
Review: Improving the 5-Stage Pipeline Performance
• Higher clock frequency (lower CCT): deeper pipelines
  – Overlap more instructions
• Higher CPI_ideal: wider pipelines
  – Insert multiple instructions in parallel into the pipeline
• Lower CPI_stall:
  – Diversified pipelines for different functional units
  – Out-of-order execution
• Balance conflicting goals
  – Deeper & wider pipelines -> more control hazards
  – Branch prediction
• It all works because of instruction-level parallelism (ILP)
Instruction-level Parallelism (ILP)
Deeper Pipelines
Deeper Pipelines Summary
Advantages: higher clock frequency
– The workhorse behind multi-GHz processors
– Stage counts: Opteron: 11, UltraSparc: 14, Power5: 17, Pentium4: 22/34
Cost
– Complexity: more forwarding & stall cases
Disadvantages
– More overlapping -> more dependencies -> more stalls
  • CPI_stall grows due to data and control hazards
– Clock overhead becomes increasingly important
– Power consumption
Limits to Pipeline Depth
Each pipeline stage introduces some overhead (O)
– Delay of pipeline registers
– Inequalities in work per stage
  • Cannot break up work into stages at arbitrary points
– Clock skew
  • Clocks to different registers may not be perfectly aligned
If the original CCT was T, with N stages CCT is T/N + O
– As N -> ∞, speedup = T / (T/N + O) -> T/O
  • Assuming that IC and CPI stay constant
– Eventually the overhead dominates and deeper pipelines have diminishing returns
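The diminishing returns are easy to see numerically. A small Python sketch of the speedup formula, with an assumed unpipelined delay T = 10 ns and per-stage overhead O = 0.5 ns (the values are illustrative):

```python
# Pipelining-depth tradeoff: with unpipelined delay T and per-stage overhead O,
# an N-stage pipeline has CCT = T/N + O, so speedup = T / (T/N + O).
# As N grows, the speedup saturates at T/O.
def speedup(T, O, N):
    return T / (T / N + O)

# With T=10, O=0.5: N=10 gives ~6.7x, N=20 gives 10x, but the bound is T/O = 20x.
```

Doubling the depth from 10 to 20 stages here yields only a 1.5x gain, and no depth can exceed T/O: this is why stage counts stopped climbing past the Pentium 4 era.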
Wide or Superscalar Pipelines
Wide Pipelines Summary
Advantages: lower CPI_ideal (1/N)
– Issue widths: Opteron: 3, UltraSparc: 4, Power5: 4, Pentium4: 3
Cost
– Need a wider path to the instruction cache
– Need more ALUs, more register file ports, …
– Complexity: more forwarding & stall cases to check
Disadvantages
– Parallel execution -> more dependencies -> more stalls
  • CPI_stall grows due to data and control hazards
Diversified Pipelines w/o Reorder Buffers
A Modern Superscalar Out-of-Order Processor
Branch Penalty in Superscalar OOO Processors
The Challenges for Superscalar Processors
Clock frequency: getting close to pipelining limits?
– Clocking overheads, CPI degradation
Branch prediction & memory latency
– Limit the practical benefits of out-of-order execution
Power consumption
– Gets worse with higher clock & more OOO logic
Design complexity
– Grows exponentially with issue width
Limited ILP?
• Alternative: single-chip (multicore) multiprocessors
Multiple-Instruction Issue: Getting CPI Below 1
CPI ≥ 1 if we issue only 1 instruction every clock cycle
Multiple-issue processors come in 3 flavors:
1. Statically scheduled superscalar processors
2. Dynamically scheduled superscalar processors
3. VLIW (very long instruction word) processors
The two types of superscalar processors issue varying numbers of instructions per clock, and use
– in-order execution if they are statically scheduled (e.g., Cell SPU), or
– out-of-order execution if they are dynamically scheduled (Intel Pentium)
VLIW processors, in contrast, issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)
VLIW: Very Long Instruction Word
Each "instruction" has explicit coding for multiple operations
– In IA-64, the grouping is called a "packet"
– In Transmeta, the grouping is called a "molecule" (with "atoms" as ops)
Tradeoff: instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
  • 16 to 24 bits per field => 7×16 = 112 bits to 7×24 = 168 bits wide
Needs a compiling technique that schedules across several branches
Recall: Unrolled Loop that Minimizes Stalls for Scalar
1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16   ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Latencies: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles
Loop Unrolling in VLIW
Memory ref 1    | Memory ref 2    | FP op 1          | FP op 2          | Int. op/branch   | Clock
L.D F0,0(R1)    | L.D F6,-8(R1)   |                  |                  |                  | 1
L.D F10,-16(R1) | L.D F14,-24(R1) |                  |                  |                  | 2
L.D F18,-32(R1) | L.D F22,-40(R1) | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |                  | 3
L.D F26,-48(R1) |                 | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |                  | 4
                |                 | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |                  | 5
S.D 0(R1),F4    | S.D -8(R1),F8   | ADD.D F28,F26,F2 |                  |                  | 6
S.D -16(R1),F12 | S.D -24(R1),F16 |                  |                  |                  | 7
S.D -32(R1),F20 | S.D -40(R1),F24 |                  |                  | DSUBUI R1,R1,#48 | 8
S.D -0(R1),F28  |                 |                  |                  | BNEZ R1,LOOP     | 9
Unrolled 7 times to avoid delays
7 iterations in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: need more registers in VLIW (15 vs. 6 in SS)
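The summary numbers follow from simple arithmetic over the schedule: 23 operations (7 loads, 7 adds, 7 stores, the DSUBUI, and the BNEZ) packed into 9 cycles of 5 issue slots each. A quick Python check:

```python
# Arithmetic behind the VLIW schedule's summary numbers:
# 23 operations fill 9 issue cycles, each cycle offering 5 slots.
ops, cycles, slots_per_cycle, iterations = 23, 9, 5, 7

clocks_per_iter = cycles / iterations              # ~1.3 clocks/iteration
ops_per_clock   = ops / cycles                     # ~2.5 ops/clock
efficiency      = ops / (cycles * slots_per_cycle) # ~50% of slots filled
```

The roughly half-empty slots are exactly the code-size waste the next slide complains about in first-generation VLIW machines.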
Problems with 1st-Generation VLIW
Increase in code size
– Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
– Whenever VLIW instructions are not full, unused functional units translate to wasted bits in the instruction encoding
Operated in lock-step; no hazard detection HW
– A stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
– The compiler might predict functional unit latencies, but caches are hard to predict
Binary code compatibility
– Pure VLIW => different numbers of functional units and unit latencies require different versions of the code
Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
IA-64: instruction set architecture
– 128 64-bit integer regs + 128 82-bit floating-point regs
– Not separate register files per functional unit as in old VLIW
Hardware checks dependencies (interlocks => binary compatibility over time)
Predicated execution (select 1 out of 64 1-bit flags) => fewer mispredictions?
Itanium™ was the first implementation (2001)
– Highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
Itanium 2™ is the name of the 2nd implementation (2005)
– 6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
– Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3
Itanium™ EPIC Design (Copyright: Intel at Hotchips '00)
[Block diagram of the Itanium micro-architecture. Features in hardware: instruction cache & branch predictors with branch hints and memory hints; a fetch/memory subsystem with three levels of cache (L1, L2, L3); register handling via 128 GR & 128 FR with register remap & stack engine (register stack & rotation); fast, simple 6-issue control exploiting explicit parallelism; bypasses & dependencies; and parallel resources: 4 integer + 4 MMX units, 2 FMACs (4 for SSE), 2 LD/ST units, a 32-entry ALAT, and speculation deferral management. Architecture features programmed by the compiler: predication, data & control speculation]
Increasing Instruction Fetch Bandwidth
• A branch-target buffer (BTB) predicts the next instruction address and sends it out before decoding the instruction
• The PC of the branch is sent to the BTB
• When a match is found, the Predicted PC is returned
• If the branch is predicted taken, instruction fetch continues at the Predicted PC
IF BW: Return Address Predictor
A small buffer of return addresses acts as a stack
– Caches the most recent return addresses
– Call: push a return address onto the stack
– Return: pop an address off the stack & predict it as the new PC
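The push/pop behavior can be sketched in Python (the class and method names are illustrative):

```python
# Return-address-stack sketch: calls push their return PC, returns pop the
# predicted target. A small fixed depth captures most call chains; when the
# stack overflows, the oldest entry is simply overwritten.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth, self.stack = depth, []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)        # discard the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        # Predicted PC for the return; None means no prediction available.
        return self.stack.pop() if self.stack else None
```

Mispredictions come from overflow (deep recursion) or from call/return mismatches, which is why accuracy in the graph below improves quickly with even a handful of entries.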
[Graph: misprediction frequency (0%–70%) vs. number of return address buffer entries (0, 1, 2, 4, 8, 16) for go, m88ksim, cc1, compress, xlisp, ijpeg, perl, and vortex; misprediction frequency drops rapidly with even a few entries]
More Instruction Fetch Bandwidth
Integrated branch prediction: the branch predictor is part of the instruction fetch unit and is constantly predicting branches
Instruction prefetch: instruction fetch units prefetch to deliver multiple instructions per clock, integrating it with branch prediction
Instruction memory access and buffering: fetching multiple instructions per cycle
– May require accessing multiple cache blocks (prefetch to hide the cost of crossing cache blocks)
– Provides buffering, acting as an on-demand unit to provide instructions to the issue stage as needed and in the quantity needed
Speculation: Register Renaming vs. ROB
An alternative to the ROB is a larger physical set of registers combined with register renaming
– The extended registers replace the function of both the ROB and the reservation stations
Instruction issue maps names of architectural registers to physical register numbers in the extended register set
– On issue, a new unused register is allocated for the destination (which avoids WAW and WAR hazards)
– Speculation recovery is easy because a physical register holding an instruction destination does not become the architectural register until the instruction commits
Most out-of-order processors today use extended registers with renaming
Register Renaming with Extended Set of Physical Registers
First Pentium-4: Willamette
[Die photo, shown with heat sink]
The Intel Pentium 4 Microarchitecture
Pentium-4 Pipeline
20 pipeline stages, including Drive stages for wire delay
– Trace cache: caching paths through the code for quick decoding
– Renaming: similar to the Tomasulo architecture
– Branch prediction and DATA prefetching!
[Pipeline diagrams compared: Pentium (the original 586) and Pentium-II/III (the original 686)]
32nm Intel Quad-Core Sandy Bridge (2011)
Value Prediction
Attempts to predict the value produced by an instruction
– E.g., a load of a value that changes infrequently
Value prediction is useful only if it significantly increases ILP
– The focus of research has been on loads; results have been so-so, and no processor uses value prediction
A related topic is address aliasing prediction
– RAW for a load and store, or WAW for 2 stores
– Address alias prediction is both more stable and simpler, since it need not actually predict the address values, only whether such values conflict
– It has been used by a few processors
Perspective
Interest in multiple issue arose because we wanted to improve performance without affecting the uniprocessor programming model
Taking advantage of ILP is conceptually simple, but the design problems are amazingly complex in practice
Conservative in ideas: just a faster clock and bigger structures
Recent processors (Intel Pentium, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the first dynamically scheduled, multiple-issue processors announced in 1995
– Clocks 10 to 20X faster, caches 5 to 20X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units, yet performance only 8 to 16X
– The gap between peak and delivered performance keeps increasing
Limits to ILP
Conflicting studies of the amount of ILP, depending on:
– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication
How much ILP is available using existing mechanisms with increasing HW budgets?
Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?
– Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
– Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
– Motorola AltiVec: 128-bit ints and FPs
– SuperSPARC multimedia ops, etc.
Limits to ILP
Initial HW model here; MIPS compilers. Assumptions for an ideal/perfect machine to start:
1. Register renaming — infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction — perfect; no mispredictions
3. Jump prediction — all jumps perfectly predicted (returns, case statements)
   2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis — addresses known & a load can be moved before a store provided the addresses are not equal; 1 & 4 eliminate all but RAW
Also: perfect caches; 1-cycle latency for all instructions (including FP *,/); unlimited instructions issued per clock cycle
Limits to ILP: HW Model Comparison
                                Model      Power 5
Instructions issued per clock   Infinite   4
Instruction window size         Infinite   200
Renaming registers              Infinite   48 integer + 40 Fl. Pt.
Branch prediction               Perfect    2% to 6% misprediction (tournament branch predictor)
Cache                           Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias analysis           Perfect    ??
Upper Limit to ILP: Ideal Machine
[Graph: instructions per clock on the ideal machine, by program:
  gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1
Integer programs: 18–60 IPC; FP programs: 75–150 IPC]
Limits to ILP HW Model Comparison

                                 New Model                    Model      Power 5
Instructions issued per clock    Infinite                     Infinite   4
Instruction window size          Infinite, 2K, 512, 128, 32   Infinite   200
Renaming registers               Infinite                     Infinite   48 integer + 40 Fl. Pt.
Branch prediction                Perfect                      Perfect    2% to 6% misprediction (Tournament Branch Predictor)
Cache                            Perfect                      Perfect    64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias analysis            Perfect                      Perfect    ??
More Realistic HW: Window Impact
Change from infinite window to 2048, 512, 128, 32

[Chart: IPC per program as the instruction window shrinks]
Window     gcc   espresso   li   fpppp   doduc   tomcatv
Infinite   55    63         18   75      119     150
2048       36    41         15   61      59      60
512        10    15         12   49      16      45
128        10    13         11   35      15      34
32         8     8          9    14      9       14
Integer: 8 - 63
FP: 9 - 150
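The window effect above can be mimicked by letting an instruction issue only once it fits in a sliding window over the most recent unissued instructions. A hedged sketch (the toy trace format, 1-cycle latencies, and the "window advances as its oldest entry issues" rule are simplifying assumptions):

```python
def windowed_ilp(trace, window):
    """IPC when only `window` consecutive instructions are visible at once.
    trace: list of (dest, src1, src2) register names; None = unused."""
    ready = {}          # register -> cycle its value becomes available
    issue = []          # issue cycle of each instruction, in program order
    for i, (dest, *srcs) in enumerate(trace):
        start = max((ready.get(s, 0) for s in srcs if s is not None), default=0)
        if i >= window:                                # instruction i cannot enter
            start = max(start, issue[i - window] + 1)  # until i-window has issued
        issue.append(start)
        if dest is not None:
            ready[dest] = start + 1                    # ideal 1-cycle latency
    return len(trace) / (max(issue) + 1)

# Two independent 3-long RAW chains: a big window overlaps them (IPC 2.0),
# a 1-entry window serializes everything (IPC 1.0).
chains = [("r1", "r0", None), ("r2", "r1", None), ("r3", "r2", None),
          ("r4", "r0", None), ("r5", "r4", None), ("r6", "r5", None)]
print(windowed_ilp(chains, 32), windowed_ilp(chains, 1))  # 2.0 1.0
```

Shrinking the window loses exactly the parallelism that sits between far-apart independent instructions, which is why the FP codes (long independent loop iterations) fall hardest in the chart.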
Limits to ILP HW Model Comparison

                                 New Model                                                     Model      Power 5
Instructions issued per clock    64                                                            Infinite   4
Instruction window size          2048                                                          Infinite   200
Renaming registers               Infinite                                                      Infinite   48 integer + 40 Fl. Pt.
Branch prediction                Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none  Perfect    2% to 6% misprediction (Tournament Branch Predictor)
Cache                            Perfect                                                       Perfect    64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias analysis            Perfect                                                       Perfect    ??
More Realistic HW: Branch Impact
Change to a 2048-entry window and a maximum issue of 64 instructions per clock cycle

[Chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv under five predictors:
Perfect, Tournament, BHT (512), Profile, No prediction]
Integer: 6 - 12
FP: 15 - 45
Misprediction Rates

[Chart: branch misprediction rate (0% to 35%) for tomcatv, doduc, fpppp, li,
espresso, gcc under profile-based, 2-bit counter, and tournament predictors]
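The 512-entry 2-bit scheme in the comparison is easy to sketch: each branch indexes a table of saturating counters, the high bit gives the prediction, and the counter nudges toward the actual outcome. A minimal model (the branch stream below is a hypothetical loop pattern, not the SPEC data behind the chart):

```python
def simulate_2bit(outcomes, table_size=512):
    """outcomes: list of (pc, taken) pairs; returns the misprediction rate."""
    counters = [1] * table_size            # 2-bit counters, start weakly not-taken
    mispredicts = 0
    for pc, taken in outcomes:
        c = counters[pc % table_size]
        if (c >= 2) != taken:              # states 2,3 predict taken; 0,1 not taken
            mispredicts += 1
        # Saturating update toward the actual outcome.
        counters[pc % table_size] = min(3, c + 1) if taken else max(0, c - 1)
    return mispredicts / len(outcomes)

# A loop branch taken 9 times then not taken, repeated 20 times: the counter
# locks onto "taken" and mispredicts essentially once per loop exit.
stream = [(0x40, t) for t in ([True] * 9 + [False]) * 20]
print(f"{simulate_2bit(stream):.2%}")  # 10.50%
```

The two-bit hysteresis is the point: a single loop exit flips the counter from 3 to 2, so the next loop entry is still predicted taken and costs no second misprediction.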
Limits to ILP HW Model Comparison

                                 New Model                           Model      Power 5
Instructions issued per clock    64                                  Infinite   4
Instruction window size          2048                                Infinite   200
Renaming registers               Infinite v. 256, 128, 64, 32, none  Infinite   48 integer + 40 Fl. Pt.
Branch prediction                8K 2-bit                            Perfect    Tournament Branch Predictor
Cache                            Perfect                             Perfect    64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias analysis            Perfect                             Perfect    Perfect
More Realistic HW: Renaming Register Impact (N int + N fp)
Change: 2048 instruction window, 64 instruction issue, 8K 2-level prediction

[Chart: IPC per program (gcc, espresso, li, fpppp, doduc, tomcatv) with
Infinite, 256, 128, 64, 32, and no renaming registers]
Integer: 5 - 15
FP: 11 - 45
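What the renaming registers buy is shown by the renaming transformation itself: every write is redirected to a fresh physical register, so WAW and WAR hazards on architectural names disappear and only RAW chains remain. A small sketch (the triple format and `pN` naming are illustrative):

```python
def rename(trace):
    """Rewrite (dest, src1, src2) architectural-register triples so every
    write gets a fresh physical register (the 'infinite renaming' case)."""
    mapping = {}                       # architectural -> current physical name
    fresh = iter(range(1_000_000))
    renamed = []
    for dest, *srcs in trace:
        # Sources read the current mapping; live-ins get a name on first use.
        phys_srcs = [mapping.setdefault(s, f"p{next(fresh)}") if s else None
                     for s in srcs]
        phys_dest = None
        if dest is not None:
            phys_dest = f"p{next(fresh)}"      # a fresh register per write
            mapping[dest] = phys_dest
        renamed.append((phys_dest, *phys_srcs))
    return renamed

# r1 is written twice: after renaming, the two writes target different
# physical registers, so the WAW/WAR hazards on r1 disappear while the
# RAW link from the first write to its reader is preserved.
trace = [("r1", "r2", None), ("r3", "r1", None), ("r1", "r4", None)]
out = rename(trace)
print(out[0][0] != out[2][0], out[1][1] == out[0][0])  # True True
```

A finite pool simply caps how many such in-flight fresh names can exist at once, which is why IPC degrades gracefully from Infinite down to 32 and collapses with none.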
Limits to ILP HW Model Comparison

                                 New Model                            Model      Power 5
Instructions issued per clock    64                                   Infinite   4
Instruction window size          2048                                 Infinite   200
Renaming registers               256 Int + 256 FP                     Infinite   48 integer + 40 Fl. Pt.
Branch prediction                8K 2-bit                             Perfect    Tournament
Cache                            Perfect                              Perfect    64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias analysis            Perfect v. Stack v. Inspect v. none  Perfect    Perfect
More Realistic HW: Memory Address Alias Impact
Change: 2048 instruction window, 64 instruction issue, 8K 2-level prediction, 256 renaming registers

[Chart: IPC per program (gcc, espresso, li, fpppp, doduc, tomcatv) under
Perfect, Global/stack perfect (heap conflicts), Inspection, and no alias analysis]
Integer: 4 - 9
FP: 4 - 45 (Fortran, no heap)
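The alias-analysis effect comes down to how far a load may move past earlier stores. A deliberately tiny sketch comparing just the two extremes in the chart, perfect analysis versus none (the instant-store timing model and two-policy comparison are simplifying assumptions):

```python
def mem_ilp(trace, perfect_alias):
    """IPC for a toy stream of ('load'|'store', addr) memory ops, 1 cycle each.
    Without disambiguation every load waits for all earlier stores; with
    perfect alias analysis it waits only for stores to its own address."""
    store_done = {}     # addr -> cycle the last store to addr completes
    last_store = 0      # completion cycle of the latest store so far
    finish = 1
    for op, addr in trace:
        if op == "store":
            done = 1                     # toy model: stores fire immediately
            store_done[addr] = done
            last_store = max(last_store, done)
        else:
            start = store_done.get(addr, 0) if perfect_alias else last_store
            done = start + 1
        finish = max(finish, done)
    return len(trace) / finish

# A store followed by loads to *different* addresses: perfect analysis lets
# the loads bypass the store; no analysis serializes them behind it.
ops = [("store", 0x10), ("load", 0x20), ("load", 0x30)]
print(mem_ilp(ops, True), mem_ilp(ops, False))  # 3.0 1.5
```

This also explains the chart's Fortran note: with no heap, "global/stack perfect" is as good as perfect, while pointer-heavy C code loses most of its ILP without real disambiguation.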
Limits to ILP HW Model Comparison

                                 New Model                      Model      Power 5
Instructions issued per clock    64 (no restrictions)           Infinite   4
Instruction window size          Infinite vs. 256, 128, 64, 32  Infinite   200
Renaming registers               64 Int + 64 FP                 Infinite   48 integer + 40 Fl. Pt.
Branch prediction                1K 2-bit                       Perfect    Tournament
Cache                            Perfect                        Perfect    64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias analysis            HW disambiguation              Perfect    Perfect
Realistic HW: Window Impact
Perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 renaming registers, issue as many instructions as the window allows

[Chart: IPC per program (gcc, espresso, li, fpppp, doduc, tomcatv) for window
sizes Infinite, 256, 128, 64, 32, 16, 8, 4]
Integer: 6 - 12
FP: 8 - 45
HW vs. SW to Increase ILP
Memory disambiguation: HW best
- Register renaming eliminated WAW and WAR hazards through registers, but not through memory
- Conflicts can arise via allocation of stack frames, as a called procedure reuses the memory addresses of a previous frame on the stack
Speculation: HW best when dynamic branch prediction beats compile-time prediction
- Exceptions are easier to handle in HW
- HW needs no bookkeeping or compensation code
- SW speculation is very complicated to get right
Scheduling: SW can look further ahead to schedule better
Compiler independence: HW approaches require no new compiler, or recompilation, to run well
In Conclusion
Limits to ILP (power efficiency, compilers, dependencies, ...) seem to cap practical designs at 3 to 6 issue
These are not laws of physics; just practical limits for today, perhaps overcome via research
Compiler and ISA advances could change the results
Acknowledgements
These slides contain material developed and copyrighted by:
Morgan Kaufmann (Elsevier, Inc.)
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
Christos Kozyrakis (Stanford)
David Patterson (UCB)