INSTRUCTION LEVEL PARALLELISM AND ITS EXPLOITATION (PART 2)

Chapter 3

Appendix H

CPE731 - Dr. Iyad Jafar

OUTLINE

Dynamic Scheduling (3.4 and 3.5)

HW-Based Speculation (3.6)

Multiple Issue and Static Scheduling (3.7)


DYNAMIC SCHEDULING

Scheduling instructions at compile time is beneficial, but limited!
  Compiler scheduling cannot resolve all dependences.
  The pipeline stalls, and no further instructions are issued, until the dependence is cleared.

Dynamic scheduling is a hardware approach that rearranges (reschedules) instructions at run time while preserving correct data flow and exception behavior.
  No need to recompile the code when the program moves to a different pipeline implementation.
  Handles dependences that are known only at run time.
  Allows the processor to tolerate unpredictable delays (e.g., cache misses).

Potential improvement, but at the cost of:
  Additional hardware
  More complicated exception handling


DYNAMIC SCHEDULING

Idea and Challenges
  Simple pipelining uses in-order issue and in-order execution of instructions.
  Why not start execution as soon as there is no hazard?
    Issue in order, but let each instruction execute as soon as its data is available.
  This implies:
    Out-of-order execution
    Out-of-order completion
    WAR and WAW hazards


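Stalled for no reason! In an in-order pipeline, SUB.D waits behind ADD.D (which is stalled on DIV.D) even though SUB.D uses neither result:

DIV.D  F0,F2,F4
ADD.D  F6,F0,F8
SUB.D  F8,F10,F14
MUL.D  F6,F10,F8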
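The hazards in this four-instruction sequence can be listed mechanically. The following is a minimal sketch, not part of the original slides: the names Instr and find_hazards are hypothetical, and the pairwise check is deliberately naive (it compares every earlier/later pair and ignores intervening writes).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def find_hazards(program):
    """Return (kind, earlier_index, later_index, register) for each pair."""
    hazards = []
    for j, later in enumerate(program):
        for i in range(j):
            earlier = program[i]
            if earlier.dest in later.srcs:      # later reads what earlier writes
                hazards.append(("RAW", i, j, earlier.dest))
            if later.dest in earlier.srcs:      # later overwrites what earlier reads
                hazards.append(("WAR", i, j, later.dest))
            if later.dest == earlier.dest:      # both write the same register
                hazards.append(("WAW", i, j, later.dest))
    return hazards


program = [
    Instr("DIV.D", "F0", ("F2", "F4")),
    Instr("ADD.D", "F6", ("F0", "F8")),
    Instr("SUB.D", "F8", ("F10", "F14")),
    Instr("MUL.D", "F6", ("F10", "F8")),
]

for kind, i, j, reg in find_hazards(program):
    print(f"{kind} on {reg}: {program[i].op} -> {program[j].op}")
```

Only the RAW hazards are true data dependences; the WAR hazard on F8 and the WAW hazard on F6 arise from register reuse, which is what register renaming removes.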

DYNAMIC SCHEDULING

To allow out-of-order execution
  The fetch stage precedes the issue stage and places the instruction into the instruction register or a queue of pending instructions.
  Split the ID stage into two stages: issue and read operands.
  Instructions are issued from the register or the queue in order, but they may stall or bypass one another in the read-operands stage before execution.
  Multiple instructions can be in execution at once! This requires multiple functional units and/or pipelined units.

How to deal with data dependences? Algorithms:
  Scoreboarding (C.7)
  Tomasulo


DYNAMIC SCHEDULING - TOMASULO

Developed by Robert Tomasulo at IBM (1967)

Minimize RAW hazards by tracking when operands become available.
  Instructions start execution only when their operands are available.

Eliminate WAW and WAR hazards by register renaming.
  Rename all destination registers, including those with a pending read or write by an earlier instruction.

Most modern high-performance processors use a derivative of Tomasulo's algorithm.


Before renaming:

DIV.D  F0,F2,F4
ADD.D  F6,F0,F8
S.D    F6,0(R1)
SUB.D  F8,F10,F14
MUL.D  F6,F10,F8

After renaming (S and T are temporary registers):

DIV.D  F0,F2,F4
ADD.D  S,F0,F8
S.D    S,0(R1)
SUB.D  T,F10,F14
MUL.D  F6,F10,T
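Renaming of this kind can be sketched in a few lines. The sketch below is illustrative only: the function rename and the temporary names T0, T1, ... are hypothetical, and it renames every destination (as Tomasulo's scheme does) rather than only the two registers renamed by hand above.

```python
def rename(program):
    """Give every destination a fresh name and patch later sources.

    program: list of (op, dest_or_None, [source registers]) in program order.
    """
    current = {}   # architectural register -> most recent new name
    counter = 0
    renamed = []
    for op, dest, srcs in program:
        # Sources read the newest name of each register (true dependences kept).
        new_srcs = [current.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = f"T{counter}"   # fresh name: no WAR/WAW with older code
            counter += 1
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D", None, ["F6", "R1"]),     # a store has no register destination
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]

renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, ADD.D still reads the DIV.D result and MUL.D still reads the SUB.D result, but no two instructions write the same name, so only RAW dependences remain.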


In Tomasulo's scheme, register renaming is provided by reservation stations (RS)
  Each RS holds the instruction (operation)!
  RS buffer the operands of instructions waiting to issue, so the operands need not be fetched from the registers later (eliminates WAR).
  Pending instructions designate the reservation station that will provide their input(s).
  For successive writes to the same register, only the last write updates the register (eliminates WAW).
  As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation stations that will produce the results.

Two important properties
  Hazard detection and execution control are distributed.
  Results are passed directly to the functional units from the reservation stations, rather than going through the registers, over the Common Data Bus (CDB).


DYNAMIC SCHEDULING - TOMASULO

Three basic steps are invoked

Issue
  Get the next instruction from the head of the instruction queue.
  If a matching RS is available, issue the instruction to it together with any operand values that are currently in the registers (this removes WAR hazards!). If no RS is available, there is a structural hazard and the instruction stalls.
  If an operand is not available in the registers, keep track of the functional unit that will produce it (RAW!).

Execute
  If one or more of the operands is not yet available, monitor the CDB!
  When an operand is ready, it is placed on the CDB and copied to all RS waiting for it (RAW!).
  When all operands are ready, execute the instruction!
  Loads and stores require a two-step execution process: computing the effective address, then accessing memory.
    When the base register is available, compute the address and place it in the load/store buffer.
    A load accesses memory as soon as the memory unit is available, while a store waits for the value to be stored.
  No instruction is allowed to initiate execution until all branches that precede it in program order have completed!


DYNAMIC SCHEDULING - TOMASULO

Three basic steps are invoked (continued)

Write
  When the result is available, write it on the CDB to update the waiting RS and the registers.
  Stores are buffered in the store buffer until both the value to be stored and the address are available. The result is then written as soon as the memory unit is free.

Notes
  Hazard detection and elimination hardware is attached to the registers, RS, and load/store buffers.
  Once an instruction has been issued and is waiting for a source operand, it refers to the operand by the number of the reservation station to which the instruction that will write the register has been assigned.
  The Tomasulo scheme refers to the buffer or unit that will produce a result; the register names are discarded when an instruction is issued to a reservation station.



Fields in RS
  Op: the operation to perform on source operands S1 and S2.
  Qj, Qk: the numbers of the RS that will produce the corresponding source operands. A value of zero indicates that the source operand is already available in Vj or Vk, or is unnecessary.
  Vj, Vk: the values of the source operands. For loads, the Vk field is used to hold the offset field.
  A: holds information for the memory address calculation of a load or store. Initially, the immediate field of the instruction is stored here; after the address calculation, the effective address is stored here.
  Busy: indicates that this reservation station and its accompanying functional unit are occupied.

Fields in Register File
  Qi: the number of the RS that will produce the result to be written into this register. A blank value implies that no active instruction is writing to this register and its value is ready to be read.
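The way these fields interact during the issue and write steps can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class and function names, the station names Mult1 and Add1, the register values, and the use of an empty string in place of the zero/blank tag. The A field and load/store buffers are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: str = ""   # "" stands in for the zero/blank tag: operand is in vj
    qk: str = ""


def issue(rs, op, dest, src_j, src_k, regs, qi):
    """Issue step: copy ready operands, otherwise record the producing RS."""
    rs.busy, rs.op = True, op
    for side, src in (("j", src_j), ("k", src_k)):
        tag = qi.get(src, "")
        if tag:    # value still pending: remember its producer (RAW)
            setattr(rs, "q" + side, tag)
            setattr(rs, "v" + side, None)
        else:      # value ready: buffer it now, so later writes cannot hurt (WAR)
            setattr(rs, "q" + side, "")
            setattr(rs, "v" + side, regs[src])
    qi[dest] = rs.name   # only the last writer updates the register (WAW)


def broadcast(tag, value, stations, regs, qi):
    """Write step: put (tag, value) on the CDB."""
    for rs in stations:
        if rs.name == tag:
            rs.busy = False              # producing station is released
        if rs.qj == tag:
            rs.vj, rs.qj = value, ""
        if rs.qk == tag:
            rs.vk, rs.qk = value, ""
    for reg, producer in list(qi.items()):
        if producer == tag:              # register still waits for this RS
            regs[reg] = value
            qi[reg] = ""


regs = {"F0": 0.0, "F2": 6.0, "F4": 3.0, "F6": 0.0, "F8": 1.0}
qi = {}                                  # register -> producing RS ("" = ready)
mult1 = ReservationStation("Mult1")
add1 = ReservationStation("Add1")

issue(mult1, "DIV.D", "F0", "F2", "F4", regs, qi)
issue(add1, "ADD.D", "F6", "F0", "F8", regs, qi)   # F0 is pending on Mult1
broadcast("Mult1", mult1.vj / mult1.vk, [mult1, add1], regs, qi)
```

After the broadcast, Add1 holds both operand values and could begin execution, and F0 is marked ready in the register status.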



Example - Show what the information tables look like for the following code sequence when
  (a) all instructions have issued but only the first load has completed and written its result
  (b) the MUL.D is ready to write its result.

Assume the following latencies.


Operation    Execute Latency (cycles)
LOAD         1
FP ADD       2
MULTIPLY     6
DIVIDE       12

DYNAMIC SCHEDULING - TOMASULO

All instructions issued, first load has written


DYNAMIC SCHEDULING - TOMASULO

MUL.D is about to write!


DYNAMIC SCHEDULING - TOMASULO

Instruction         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 ...  24  25
L.D   F6,32(R2)     F   I   E   W
L.D   F2,44(R3)         F   I   E   W
MUL.D F0,F2,F4              F   I   S   E   E   E   E   E   E   W
SUB.D F8,F2,F6                  F   I   E   E   W
DIV.D F10,F0,F6                     F   I   S   S   S   S   S   S   E   E   E ...   E   W
ADD.D F6,F8,F2                          F   I   S   E   E   W

F = fetch, I = issue, S = stall (waiting for operands), E = execute, W = write result

Compare to a pipelined implementation with no dynamic scheduling!
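The write cycles in the timing table can be reproduced with a very simplified model. This is a sketch under stated assumptions, not a full Tomasulo simulator: one instruction is fetched per cycle and issued in the next, reservation stations and the CDB are never a bottleneck, a result written in cycle c can feed an execution that starts in cycle c + 1, and SUB.D uses the FP adder latency. The function name and data layout are hypothetical.

```python
# Execute latencies from the table in the example (SUB.D assumed = FP ADD).
LATENCY = {"L.D": 1, "ADD.D": 2, "SUB.D": 2, "MUL.D": 6, "DIV.D": 12}


def tomasulo_write_cycles(program):
    """program: list of (op, dest, [source registers]) in program order.

    Returns the cycle in which each instruction writes its result.
    """
    producer_write = {}   # register -> write cycle of its latest producer so far
    writes = []
    for i, (op, dest, srcs) in enumerate(program):
        fetch = i + 1                      # one fetch per cycle
        issue = fetch + 1                  # in-order issue, one cycle later
        ready = max((producer_write.get(s, 0) for s in srcs), default=0)
        start = max(issue, ready) + 1      # first execute cycle
        write = start + LATENCY[op]        # cycle after the last execute cycle
        producer_write[dest] = write       # later readers wait for this write
        writes.append(write)
    return writes


program = [
    ("L.D", "F6", ["R2"]),
    ("L.D", "F2", ["R3"]),
    ("MUL.D", "F0", ["F2", "F4"]),
    ("SUB.D", "F8", ["F2", "F6"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6", ["F8", "F2"]),
]

print(tomasulo_write_cycles(program))
```

Because each source is looked up when the instruction issues, DIV.D waits for the F6 produced by the first load rather than for the later ADD.D, which is the effect of renaming. ADD.D then writes in cycle 11, long before DIV.D finishes in cycle 25 (out-of-order completion).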


HW-BASED SPECULATION

Exploiting more ILP makes it harder to deal with control dependences.

In dynamic scheduling
  An instruction is allowed to initiate execution only when all preceding branches in program order have completed! Performance!?

Overcome control dependences in dynamic scheduling using SW or HW speculation!
  Fetch, issue, and execute instructions assuming the branch prediction is correct.
  Need a mechanism to revert upon an incorrect prediction.


HW-BASED SPECULATION

Extending Tomasulo!?
  Separate the bypassing of results among instructions from the actual completion of an instruction, so that results can be bypassed before the outcome of the speculation is known (instruction commit stage!).
  Basically, instructions execute out of order but commit in order to prevent irreversible actions!
  The commit step requires hardware to
    Buffer the results of instructions that have finished execution but have not committed
    Pass results among instructions that may be speculated
  Reorder buffer (ROB)!


HW-BASED SPECULATION

ROB
  Provides additional registers that hold the result of an instruction between the time it completes execution and the time it commits.
  Supplies operands to other instructions, like RS, between completion and commit.
  The result is not written to the register file or memory until the instruction commits.
  Each ROB entry has
    Instruction type (branch, register operation, store)
    Destination field: supplies the register number for loads and ALU operations, or the memory address for stores
    Value field: holds the value of the instruction result until it commits
    Ready field: indicates whether the instruction has finished execution and the value is ready
  Renaming is provided by the ROB. Results are tagged using ROB entry numbers. RS still buffer the operands.


Execution Steps

Issue
  If an RS and a ROB entry are available, issue the instruction. Otherwise, stall.
  Send the operands to the RS if they are available in the registers or the ROB.
  The number of the assigned ROB entry is sent to the RS.

Execute
  If one or more operands are not available, monitor the CDB for the register to be computed!
  Execute when all operands are available.

Write
  When the result is available, write it on the CDB with the ROB tag assigned to the instruction.
  The result is stored in the designated ROB entry as well as in any RS waiting for the result.



Execution Steps

Commit - the final stage, with three different cases
  Normal commit: when an ALU or load instruction reaches the head of the ROB and its result is available, the result is written to the register file and the instruction is removed from the ROB.
  Store commit: similar to normal commit, except that memory is updated instead of the register file!
  Branch commit: if the instruction at the head of the ROB is a branch with an incorrect prediction, the ROB is flushed and execution is restarted from the correct address. If the prediction was correct, the branch is finished.
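The three commit cases can be sketched with a queue that only ever retires its head entry. This is an illustrative model, not the hardware organization: ROBEntry, commit, and the field names (modeled on the ROB entry fields listed earlier) are hypothetical, and the mispredicted flag stands in for whatever the branch unit records.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any


@dataclass
class ROBEntry:
    kind: str                  # "register", "store", or "branch"
    dest: Any = None           # register name, or memory address for a store
    value: Any = None          # result; for a branch, the correct target
    ready: bool = False        # has the instruction finished execution?
    mispredicted: bool = False


def commit(rob, regs, memory):
    """Try to commit the head entry; return a redirect target or None."""
    if not rob or not rob[0].ready:
        return None                        # head not finished: nothing commits
    head = rob.popleft()
    if head.kind == "register":            # normal commit
        regs[head.dest] = head.value
    elif head.kind == "store":             # store commit
        memory[head.dest] = head.value
    elif head.kind == "branch" and head.mispredicted:
        rob.clear()                        # flush all younger speculative work
        return head.value                  # restart fetch at the correct address
    return None


regs, memory = {}, {}
rob = deque([
    ROBEntry("register", "F0", 2.5, ready=False),   # still executing
    ROBEntry("register", "F8", 1.0, ready=True),    # done, but not at the head
])
commit(rob, regs, memory)      # nothing happens: the head is not ready
rob[0].ready = True
commit(rob, regs, memory)      # F0 commits
commit(rob, regs, memory)      # F8 commits, in program order
print(regs)
```

The finished F8 entry cannot update the register file until the older F0 entry has committed, which is what keeps the architectural state precise.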



Example. Show what the status tables look like when the MUL.D is ready to commit. Assume the same latencies for the FP units as in the previous example.

HW-BASED SPECULATION

MUL.D is about to write!


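Instruction         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 ...  24  25  26  27
L.D   F6,32(R2)     F   I   E   W   C
L.D   F2,44(R3)         F   I   E   W   C
MUL.D F0,F2,F4              F   I   S   E   E   E   E   E   E   W   C
SUB.D F8,F2,F6                  F   I   E   E   W   -   -   -   -   -   C
DIV.D F10,F0,F6                     F   I   S   S   S   S   S   S   E   E   E ...   E   W   C
ADD.D F6,F8,F2                          F   I   S   E   E   W   -   -   -   -   -   -   -   -   C

F = fetch, I = issue, S = stall (waiting for operands), E = execute, W = write result, C = commit, - = result written, waiting in the ROB to commit

Compare to a pipelined implementation with no dynamic scheduling!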
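The commit column follows from the write cycles by enforcing program order. A minimal sketch, assuming at most one commit per cycle and that an instruction can commit no earlier than the cycle after it writes its result; the function name is hypothetical and the write cycles are read off the table.

```python
def commit_cycles(write_cycles):
    """Return the in-order commit cycle for each instruction.

    write_cycles: write-result cycle of each instruction, in program order.
    """
    commits = []
    last_commit = 0
    for write in write_cycles:
        # Wait for our own result and for the previous instruction to commit.
        cycle = max(write + 1, last_commit + 1)
        commits.append(cycle)
        last_commit = cycle
    return commits


# Write cycles of L.D, L.D, MUL.D, SUB.D, DIV.D, ADD.D from the table.
writes = [4, 5, 12, 8, 25, 11]
print(commit_cycles(writes))
```

SUB.D writes in cycle 8 but cannot commit until MUL.D has committed in cycle 13, and ADD.D waits for DIV.D in the same way.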



Notes - comparison with the no-speculation example
  No instruction after the earliest uncompleted instruction (MUL.D) is allowed to complete, i.e., to commit.
  This allows precise exception handling: the instructions following the offending instruction are flushed.
  This is not possible in plain Tomasulo, since a later instruction (e.g., SUB.D or ADD.D) can update the state before it is known that an earlier instruction (MUL.D) causes an exception.

Check example on p. 189

MULTIPLE ISSUE AND STATIC SCHEDULING


Techniques presented so far attempt to achieve a CPI of 1: they issue one instruction per cycle.

Achieving a CPI below 1 requires issuing multiple instructions per cycle: multiple-issue processors.

Three basic types of multiple-issue processors
  Statically scheduled superscalar processor
    Variable number of instructions per cycle and in-order execution. Commonly 2 instructions!
    Statically scheduled by the compiler
  Very long instruction word (VLIW) processor
    Fixed number of instructions per cycle, issued as one large instruction or issue packet
    Statically scheduled by the compiler
  Dynamically scheduled superscalar processor
    Variable number of instructions per cycle and out-of-order execution


MULTIPLE ISSUE AND STATIC SCHEDULING

THE BASIC VLIW APPROACH


Uses multiple independent functional units

Instead of issuing multiple independent instructions, VLIW packages multiple operations into one long instruction, or requires that the instructions in an issue packet satisfy some constraints.

Simple VLIW example
  VLIW processor with instructions that contain five operations (one integer/branch, two FP, two memory).
  The instruction has a set of bits for each unit, 16-24 bits per unit (instruction size 80-120 bits).
  It is assumed that there is enough parallelism among operations, which is taken care of by the compiler through loop unrolling and code scheduling.
    Local / global scheduling (Appendix H)



Example. Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s for such a processor. Unroll as many times as necessary to eliminate any stalls (empty issue cycles). Ignore delayed branches.



9 cycles are required for 7 elements: 9/7 (about 1.29) cycles per element. Compare to 14 cycles for 4 elements (3.5 cycles per element)!

23 operations / 9 cycles: about 2.5 operations per cycle (a CPI of about 0.39)
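The figures for this example are simple ratios, worked out below as a quick check. The variable names are illustrative; the slot utilization line assumes five operation slots in each of the 9 long instructions, as in the simple VLIW described earlier.

```python
cycles = 9                   # long instructions issued for the unrolled loop
elements = 7                 # loop iterations covered by the unrolled body
operations = 23              # useful operations placed in the 9 instructions
slots_per_instruction = 5    # two memory, two FP, one integer/branch

cycles_per_element = cycles / elements            # about 1.29
ops_per_cycle = operations / cycles               # about 2.56 (rounded down to 2.5 above)
cpi = cycles / operations                         # about 0.39
slot_utilization = operations / (cycles * slots_per_instruction)   # about 0.51

print(f"{cycles_per_element:.2f} cycles per element")
print(f"{ops_per_cycle:.2f} operations per cycle, CPI {cpi:.2f}")
print(f"{slot_utilization:.0%} of the operation slots are used")
```

Roughly half of the operation slots stay empty, which is one source of the code size overhead of VLIW.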



VLIW processors might be less efficient (compared to superscalar)

Increased code size
  Loops must be unrolled.
  Whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding (mitigated by compression and decoding).

Operated in lockstep, with no hazard detection in HW
  A stall in any functional unit causes the entire pipeline to stall.
  Although the compiler can schedule the deterministic functional units to prevent stalls, predicting which data accesses will cause a cache stall is difficult (blocking caches!).

Binary code compatibility!
  Migration between successive implementations of a VLIW requires different versions of the code!

Check Appendix H for techniques used in the EPIC IA-64 architecture to address these issues
