INSTRUCTION LEVEL PARALLELISM AND ITS EXPLOITATION (PART 2)
Chapter 3
Appendix H
CPE731 - Dr. Iyad Jafar
OUTLINE
Dynamic Scheduling (3.4 and 3.5)
HW-Based Speculation (3.6)
Multiple Issue and Static Scheduling (3.7)
DYNAMIC SCHEDULING
Scheduling instructions at compile time is beneficial, but limited!
  The compiler cannot resolve all dependences
  The pipeline stalls and no further instructions are issued until the dependence is cleared
Dynamic scheduling is a hardware approach that rearranges (reschedules) instructions while preserving correct data flow and exception behavior
  No need to recompile code when the program moves to a different pipeline
  Handles dependences that are unknown until run time
  Lets the processor tolerate unpredictable delays
Potential improvement, but at a hardware cost
  Complicates exception handling
DYNAMIC SCHEDULING
Idea and Challenges
  Simple pipelining issues and executes instructions in order
  Why not start execution as soon as there is no hazard?
    Issue in order, but let each instruction execute as soon as its data is available
  This implies:
    Out-of-order execution
    Out-of-order completion
    WAR and WAW hazards
Stalled for no reason!
DIV.D F0,F2,F4
ADD.D F6,F0,F8
SUB.D F8,F10,F14
MUL.D F6,F10,F8
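The dependences in this sequence can be listed mechanically. The following is a minimal sketch, assuming a simple pairwise comparison of register names; the `Instr` tuple and `find_hazards` function are hypothetical names introduced only for illustration, and the check ignores intervening writes.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def find_hazards(program: List[Instr]) -> List[Tuple[str, str, str, str]]:
    """Return (kind, earlier op, later op, register) for every instruction pair."""
    found = []
    for i, early in enumerate(program):
        for late in program[i + 1:]:
            if early.dest in late.srcs:       # later reads what earlier writes
                found.append(("RAW", early.op, late.op, early.dest))
            if late.dest in early.srcs:       # later writes what earlier reads
                found.append(("WAR", early.op, late.op, late.dest))
            if late.dest == early.dest:       # both write the same register
                found.append(("WAW", early.op, late.op, late.dest))
    return found


program = [
    Instr("DIV.D", "F0", ("F2", "F4")),
    Instr("ADD.D", "F6", ("F0", "F8")),
    Instr("SUB.D", "F8", ("F10", "F14")),
    Instr("MUL.D", "F6", ("F10", "F8")),
]
hazards = find_hazards(program)
for hazard in hazards:
    print(hazard)
```

SUB.D has no RAW dependence on DIV.D or ADD.D, so holding it behind the stalled ADD.D is unnecessary. The WAR on F8 and the WAW on F6 are the name dependences that out-of-order execution has to deal with.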
DYNAMIC SCHEDULING
To allow out-of-order execution
  The fetch stage precedes the issue stage and places the instruction into the instruction register or a queue of pending instructions
  Split the ID stage into issue and read-operands stages
  Instructions are issued from the register or the queue in order, but they can stall or bypass each other in the read-operands stage and so enter execution out of order
  Multiple instructions can be in execution at once! This requires multiple functional units and/or pipelined units
How to deal with data dependences? Algorithms
  Scoreboarding (C.7)
  Tomasulo
DYNAMIC SCHEDULING - TOMASULO
Robert Tomasulo, 1967, IBM
Minimize RAW hazards by tracking when operands are available
  Instructions start execution only when their operands are available
Eliminate WAW and WAR hazards by register renaming
  Rename all destination registers, including those with a pending read or write by an earlier instruction
Most modern high-performance processors use a derivative of Tomasulo's algorithm
Before renaming:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8

After renaming (S and T are temporary registers):
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
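Renaming can be sketched in a few lines. This is a minimal illustration, assuming every destination gets a fresh temporary name (T0, T1, ...) and that later sources read the newest name. The `rename` function and the tuple layout are hypothetical, and the names it produces differ from the S and T used in the slide, which renames only the two conflicting destinations.

```python
import itertools


def rename(program):
    """Give each destination a fresh name; redirect later reads to it."""
    latest = {}                                   # architectural register -> newest name
    fresh = (f"T{n}" for n in itertools.count())  # endless supply of temporaries
    renamed = []
    for op, dest, srcs in program:
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:                      # stores have no register destination
            new_dest = next(fresh)
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed = rename(program)
for row in renamed:
    print(row)
```

After renaming, every destination is unique, so no WAW remains, and SUB.D no longer writes a name that ADD.D reads, so the WAR is gone. The RAW dependences (DIV.D to ADD.D, ADD.D to S.D, SUB.D to MUL.D) are preserved.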
DYNAMIC SCHEDULING - TOMASULO
In Tomasulo's scheme, register renaming is provided by reservation stations (RS)
  Each RS holds the instruction (operation)!
  RS buffer the operands of instructions waiting to issue (no need to fetch them from registers later, which removes WAR hazards)
  Pending instructions designate the reservation station that will provide their input(s)
  For successive writes to the same register, only the last write updates the register (removes WAW hazards)
  As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation stations that will produce the results
Two important properties
  Hazard detection and execution control are distributed
  Results are passed directly to functional units from the reservation stations rather than through the registers (Common Data Bus, CDB)
DYNAMIC SCHEDULING - TOMASULO
Three basic steps are invoked
Issue
  Get the next instruction from the head of the instruction queue
  If a matching RS is free, issue the instruction, sending along any operands that are already available in registers (removes WAR!). If no RS is free, there is a structural hazard and the instruction stalls.
  If an operand is not available in registers, keep track of the functional unit that will produce it (RAW!)
Execute
  If one or more operands are not yet available, monitor the CDB!
  When an operand is produced, it is placed on the CDB and copied to all RS waiting for it (RAW!)
  When all operands are ready, execute the instruction!
  Loads and stores require a two-step execution process: computing the effective address, then accessing memory.
    When the base register is available, compute the address and place it in the load/store buffer.
    A load accesses memory as soon as the memory unit is available, while a store waits for the value to be stored.
  No instruction is allowed to initiate execution until all branches that precede it in program order have completed!
DYNAMIC SCHEDULING - TOMASULO
Three basic steps are invoked
Write
  When the result is available, write it on the CDB to update the waiting RS and the registers.
  Stores are buffered in the store buffer until both the value to be stored and the address are available. The result is written as soon as the memory unit is free.
Notes
  Hazard detection and elimination hardware is attached to the registers, RS, and load/store buffers
  Once an instruction has been issued and is waiting for a source operand, it refers to the operand by the number of the reservation station assigned to the instruction that will write the register
  The Tomasulo scheme refers to the buffer or unit that will produce a result; the register names are discarded when an instruction is issued to a reservation station.
DYNAMIC SCHEDULING - TOMASULO
Fields in RS
  Op: the operation to perform on source operands S1 and S2.
  Qj, Qk: the number of the RS that will produce the corresponding source operand. A value of zero indicates that the source operand is already available in Vj or Vk, or is unnecessary.
  Vj, Vk: the values of the source operands. For loads, the Vk field is used to hold the offset field.
  A: holds information for the memory address calculation for a load or store. Initially, the immediate field of the instruction is stored here; after the address calculation, the effective address is stored here.
  Busy: indicates that this reservation station and its accompanying functional unit are occupied.
Fields in Register File
  Qi: the number of the RS that will produce the result to be written into this register. A blank value means that no active instruction is writing to this register and its value is ready to be read.
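The fields above can be exercised with a small model of the issue and write steps. This is a sketch under stated assumptions, not a full Tomasulo implementation: `ReservationStation`, `issue`, `broadcast`, and `ready` are hypothetical names, `None` stands in for the zero or blank tag, the A field and load/store buffers are omitted, and the register values are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None   # None plays the role of the "zero" tag
    qk: Optional[str] = None


def issue(rs, op, dest, src_j, src_k, regs, qi):
    """Issue step: copy ready operands, record producers for the others."""
    rs.busy, rs.op = True, op
    rs.qj = qi.get(src_j)
    rs.vj = regs[src_j] if rs.qj is None else None
    rs.qk = qi.get(src_k)
    rs.vk = regs[src_k] if rs.qk is None else None
    qi[dest] = rs.name         # the newest writer wins, which handles WAW


def broadcast(tag, value, stations, regs, qi):
    """Write step: put (tag, value) on the CDB for all waiting consumers."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
        if rs.name == tag:
            rs.busy = False    # the producing station is released
    for reg in [r for r, producer in qi.items() if producer == tag]:
        regs[reg] = value
        del qi[reg]


def ready(rs):
    return rs.busy and rs.qj is None and rs.qk is None


regs: Dict[str, float] = {"F0": 0.0, "F2": 6.0, "F4": 2.0, "F6": 1.0, "F8": 0.0}
qi: Dict[str, str] = {}
mult1, add1 = ReservationStation("Mult1"), ReservationStation("Add1")
stations: List[ReservationStation] = [mult1, add1]

issue(mult1, "MUL.D", "F0", "F2", "F4", regs, qi)   # both operands ready
issue(add1, "ADD.D", "F8", "F0", "F6", regs, qi)    # F0 pending on Mult1
waiting_before = ready(add1)
broadcast("Mult1", 12.0, stations, regs, qi)
print(waiting_before, ready(add1), add1.vj, qi)
```

ADD.D never names F0 after issue; it waits on the tag Mult1, which is the renaming the slides describe.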
DYNAMIC SCHEDULING - TOMASULO
Example - Show what the information tables look like for the following code sequence when
(a) all instructions have issued but only the first load has completed and written its result
(b) the MUL.D is ready to write its result.
Assume the following latencies.
Operation   Execute Latency (cycles)
LOAD        1
FP ADD      2
MULTIPLY    6
DIVIDE      12
DYNAMIC SCHEDULING - TOMASULO
All instructions issued, first load has written
DYNAMIC SCHEDULING - TOMASULO
MUL.D is about to write!
Instruction         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  … 24 25
L.D   F6,32(R2)     F  I  E  W
L.D   F2,44(R3)        F  I  E  W
MUL.D F0,F2,F4            F  I  S  E  E  E  E  E  E  W
SUB.D F8,F2,F6               F  I  E  E  W
DIV.D F10,F0,F6                 F  I  S  S  S  S  S  S  E  E  E  …  E  W
ADD.D F6,F8,F2                     F  I  S  E  E  W
(F = fetch, I = issue, S = stall waiting for operands, E = execute, W = write result)
Compare to pipelined implementation with no dynamic scheduling!
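The cycle numbers in this timing diagram can be reproduced with a short calculation. The sketch below assumes one fetch and one issue per cycle, that a result written in cycle n can be used from cycle n+1, and that a reservation station, functional unit, and the CDB are always free (no structural hazards). The `schedule` function and the tuple layout are hypothetical names for illustration.

```python
# Execute latencies from the example (cycles).
LATENCY = {"L.D": 1, "ADD.D": 2, "SUB.D": 2, "MUL.D": 6, "DIV.D": 12}

program = [
    ("L.D", "F6", ("R2",)),
    ("L.D", "F2", ("R3",)),
    ("MUL.D", "F0", ("F2", "F4")),
    ("SUB.D", "F8", ("F2", "F6")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6", ("F8", "F2")),
]


def schedule(program):
    """Return (op, fetch, issue, exec_start, exec_end, write) per instruction."""
    last_write = {}   # register -> cycle in which its newest producer writes
    rows = []
    for n, (op, dest, srcs) in enumerate(program):
        fetch = n + 1
        issue = fetch + 1
        # Operands are usable the cycle after their producer writes the CDB.
        operands_ready = max((last_write.get(s, 0) for s in srcs), default=0)
        start = max(issue, operands_ready) + 1
        end = start + LATENCY[op] - 1
        write = end + 1
        rows.append((op, fetch, issue, start, end, write))
        last_write[dest] = write
    return rows


rows = schedule(program)
for row in rows:
    print(row)
```

Sources are looked up in program order, so DIV.D sees the F6 written by the first load rather than the later ADD.D, which mirrors what happens at issue time.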
HW-BASED SPECULATION
Exploiting more ILP makes it harder to deal with control dependences
In dynamic scheduling
  An instruction is allowed to initiate execution only when all preceding branches in program order have completed! Performance!?
Overcome control dependences in dynamic scheduling using SW or HW speculation!
  Fetch, issue, and execute instructions assuming the prediction is correct
  Need a mechanism to revert upon an incorrect prediction
HW-BASED SPECULATION
Extending Tomasulo!?
  Separate the bypassing of results among instructions from the actual completion of an instruction, so that results can be bypassed before the outcome of the speculation is known (instruction commit stage!)
  Basically, instructions execute out of order but commit in order to prevent irreversible actions!
  The commit step requires hardware to
    Buffer the results of instructions that have finished execution but have not committed
    Pass results among instructions that may be speculated
  Reorder buffer (ROB)!
HW-BASED SPECULATION
ROB
  Provides additional registers that hold a result between the time the instruction completes and the time it commits
  Supplies operands to other instructions, like RS, between completion and commit
  The result is not written to the register file or memory until the instruction commits
  Each ROB entry has
    Instruction type (branch, register, store)
    Destination field: supplies the register number for loads and ALU operations, or the memory address for stores
    Value field: holds the value of the instruction result until it commits
    Ready field: indicates whether the instruction has finished execution and the value is ready
  Renaming is provided by the ROB. Results are tagged using ROB entry numbers. RS still buffer the operands.
HW-BASED SPECULATION
Execution Steps
Issue
  If an RS and a ROB entry are available, issue the instruction. Otherwise, stall.
  Send the operands to the RS if they are available in the registers or the ROB. The number of the assigned ROB entry is also sent to the RS.
Execute
  If one or more operands are not available, monitor the CDB for the value to be computed!
  Execute when all operands are available
Write
  When the result is available, write it on the CDB with the ROB tag assigned to the instruction.
  The result is stored in the designated ROB entry as well as in any RS waiting for it
HW-BASED SPECULATION
Execution Steps
Commit - the final stage, with three different cases
  Normal commit - when an ALU or load instruction reaches the head of the ROB and its result is available, the result is written to the register file and the instruction is removed from the ROB
  Store commit - similar to normal commit, except that memory is updated rather than a register!
  Branch commit - if the instruction at the head of the ROB is a branch with an incorrect prediction, the ROB is flushed and execution restarts from the correct address. If the prediction was correct, the branch is finished.
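The three commit cases can be sketched with the ROB modeled as a queue. This is a minimal illustration rather than a complete speculative pipeline: `ROBEntry`, `commit`, and the `mispredicted` flag are hypothetical names, the entry fields follow the ROB slide (type, destination, value, ready), and restarting fetch at the correct address is left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    kind: str                      # "register", "store", or "branch"
    dest: object = None            # register name, or memory address for a store
    value: Optional[float] = None
    ready: bool = False
    mispredicted: bool = False     # assumed flag, meaningful for branches only


def commit(rob, regs, memory):
    """Try to commit the entry at the head of the ROB; report what happened."""
    if not rob or not rob[0].ready:
        return "wait"              # in-order commit: a pending head blocks everyone
    head = rob.popleft()
    if head.kind == "branch":
        if head.mispredicted:
            rob.clear()            # discard all speculative work behind the branch
            return "flush"
        return "branch"
    if head.kind == "store":
        memory[head.dest] = head.value
        return "store"
    regs[head.dest] = head.value
    return "register"


regs, memory = {"F0": 0.0}, {}
rob = deque([
    ROBEntry("register", "F0", 12.0, ready=True),
    ROBEntry("store", 100, 12.0, ready=True),
    ROBEntry("branch", ready=True, mispredicted=True),
    ROBEntry("register", "F8", 3.0, ready=True),   # speculative, will be flushed
])
results = [commit(rob, regs, memory) for _ in range(4)]
print(results, regs, memory)

# A finished instruction cannot pass an unfinished one at the head.
blocked = deque([ROBEntry("register", "F2"),
                 ROBEntry("register", "F4", 1.0, ready=True)])
blocked_result = commit(blocked, regs, memory)
```

The write to F8 never reaches the register file because its entry is discarded with the mispredicted branch, which is the irreversible action that in-order commit is meant to prevent.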
HW-BASED SPECULATION
Example. Show what the status tables look like when the MUL.D is ready to commit. Assume the same latencies for the FP units as in the previous example.
HW-BASED SPECULATION
MUL.D is about to write!
Instruction         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  … 24 25 26 27
L.D   F6,32(R2)     F  I  E  W  C
L.D   F2,44(R3)        F  I  E  W  C
MUL.D F0,F2,F4            F  I  S  E  E  E  E  E  E  W  C
SUB.D F8,F2,F6               F  I  E  E  W  -  -  -  -  -  C
DIV.D F10,F0,F6                 F  I  S  S  S  S  S  S  E  E  E  …  E  W  C
ADD.D F6,F8,F2                     F  I  S  E  E  W  -  -  -  -  -  -  -  -  C
(F = fetch, I = issue, S = stall waiting for operands, E = execute, W = write result, - = waiting to commit, C = commit)
Compare to pipelined implementation with no dynamic scheduling!
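The commit column follows from the write cycles and the in-order rule. A minimal sketch, assuming at most one commit per cycle and that an instruction can commit no earlier than the cycle after it writes its result; `commit_cycles` is a hypothetical helper, and the write cycles are read off the diagram.

```python
def commit_cycles(write_cycles):
    """In-order commit: each instruction commits after its own write
    and after the commit of the instruction before it."""
    commits = []
    previous = 0
    for write in write_cycles:
        commit = max(write, previous) + 1
        commits.append(commit)
        previous = commit
    return commits


# Write cycles in program order: L.D, L.D, MUL.D, SUB.D, DIV.D, ADD.D
writes = [4, 5, 12, 8, 25, 11]
commits = commit_cycles(writes)
print(commits)
```

SUB.D writes in cycle 8 but has to wait for MUL.D to commit in cycle 13, and ADD.D writes in cycle 11 but waits for DIV.D, which accounts for the runs of dashes in the diagram.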
HW-BASED SPECULATION
Notes - comparison with the example without speculation
  Instructions after the earliest uncompleted instruction (MUL.D) may finish execution, but none of them is allowed to complete (commit)
  This allows for precise exception treatment. Instructions following the offending instruction are flushed.
  This is not possible in the basic Tomasulo scheme, since a later instruction (here SUB.D or ADD.D) can update the state before it is known that an earlier instruction (MUL.D) causes an exception.
  Check example on p. 189
MULTIPLE ISSUE AND STATIC SCHEDULING
Techniques presented so far attempt to achieve a CPI of 1. They issue one instruction per cycle.
A CPI of less than 1 requires issuing multiple instructions per cycle: multiple-issue processors
Three basic types of multiple-issue processors
  Statically scheduled superscalar processor
    Variable number of instructions per cycle and in-order execution. Commonly 2 instructions!
    Statically scheduled by the compiler
  Very long instruction word (VLIW) processor
    Fixed number of instructions per cycle, issued as one large instruction or issue packet
    Statically scheduled by the compiler
  Dynamically scheduled superscalar processor
    Variable number of instructions per cycle and out-of-order execution
MULTIPLE ISSUE AND STATIC SCHEDULING
THE BASIC VLIW APPROACH
Uses multiple independent functional units
Instead of issuing multiple independent instructions, a VLIW packages multiple operations into one long instruction, or requires that the instructions in an issue packet satisfy some constraints
Simple VLIW example
  VLIW processor with instructions that contain five operations (one integer/branch, two FP, two memory)
  The instruction has a set of bits for each unit, 16-24 bits per unit (instruction size 80-120 bits)
  It is assumed that there is enough parallelism among the operations; exposing it is the compiler's job, through loop unrolling and code scheduling
    Local / global scheduling (Appendix H)
THE BASIC VLIW APPROACH
Example. Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s for such a processor. Unroll as many times as necessary to eliminate any stalls (empty issue cycles). Ignore delayed branches.
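At the source level, unrolling this loop looks as follows. This is only a sketch of the transformation the compiler applies before packing operations into long instructions, not the VLIW schedule itself. The unroll factor of 7 is taken from the result quoted for this example, and `add_scalar_unrolled` is a hypothetical name.

```python
def add_scalar_unrolled(x, s):
    """Compute x[i] = x[i] + s with the loop body unrolled 7 times."""
    n = len(x)
    i = 0
    # Main loop: seven independent updates per iteration, one branch.
    while i + 7 <= n:
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        x[i + 4] += s
        x[i + 5] += s
        x[i + 6] += s
        i += 7
    # Cleanup loop for the elements left over when n is not a multiple of 7.
    while i < n:
        x[i] += s
        i += 1
    return x


result = add_scalar_unrolled(list(range(16)), 2)
print(result)
```

The seven updates in the main loop do not depend on one another, which is what gives the compiler enough independent loads, adds, and stores to fill the memory and FP slots of each long instruction.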
THE BASIC VLIW APPROACH
9 cycles are required for 7 elements: 9/7 ≈ 1.29 cycles per element. Compare to 14 cycles for 4 elements (3.5 cycles per element)!
23 operations / 9 cycles ≈ 2.5 operations per cycle (an issue rate, not a CPI)
THE BASIC VLIW APPROACH
VLIW processors might be less efficient (compared to superscalar)
  Increased code size
    The need to unroll loops
    Whenever instructions are not full, unused functional units translate to wasted bits (addressed by compression and decoding)
  Operate in lockstep, with no hazard detection in HW
    A stall in any functional unit causes the entire pipeline to stall
    Although the compiler can schedule the deterministic functional units to prevent stalls, predicting which data accesses will cause a cache stall is difficult (blocking caches!)
  Binary code compatibility!
    Migration between successive implementations of a VLIW requires different versions of the code!
Check Appendix H for techniques used in the EPIC IA-64 architecture to address these issues