Please see “portrait orientation” PowerPoint file for Chapter 8

Figure 8.1. Basic idea of instruction pipelining.


Figure 8.2. A 4-stage pipeline.

Figure 8.3. Effect of an execution operation taking more than one clock cycle.

[Timing diagram: instructions I1 to I5 passing through stages F, D, E and W; horizontal axis: time in clock cycles 1 to 9.]
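The effect named in the caption of Figure 8.3 can be approximated with a small timing model. The following is a minimal sketch, not the book's method: it assumes a four-stage in-order pipeline (F, D, E, W), one instruction per stage, F, D and W each taking one cycle, and an instruction holding its stage until the next stage is free. The names `schedule` and `total_cycles` are hypothetical.

```python
STAGES = ("F", "D", "E", "W")


def schedule(execute_cycles):
    """Return, for each instruction, the first clock cycle it spends in each stage.

    execute_cycles[i] is the assumed number of cycles instruction i needs in E.
    Cycles are numbered from 1.
    """
    last = len(STAGES) - 1
    starts = []
    for i, e_len in enumerate(execute_cycles):
        duration = {"F": 1, "D": 1, "E": e_len, "W": 1}
        row = {}
        ready = 1  # earliest cycle in which this instruction could enter the stage
        for s, stage in enumerate(STAGES):
            if i == 0:
                free = 1  # nothing ahead of the first instruction
            elif s < last:
                # The previous instruction vacates this stage when it enters the next one.
                free = starts[i - 1][STAGES[s + 1]]
            else:
                # W always takes one cycle in this model.
                free = starts[i - 1][stage] + 1
            row[stage] = max(ready, free)
            ready = row[stage] + duration[stage]
        starts.append(row)
    return starts


def total_cycles(execute_cycles):
    """Clock cycle in which the last instruction performs its W step."""
    return schedule(execute_cycles)[-1]["W"]
```

Under these assumptions, four single-cycle instructions finish in cycle 7 (`total_cycles([1, 1, 1, 1])`), while a three-cycle execute step in the second of five instructions pushes completion to cycle 10 (`total_cycles([1, 3, 1, 1, 1])`). The exact cycle in which each stalled instruction is fetched depends on buffering details that this sketch does not try to match.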


Figure 8.4. Pipeline stall caused by a cache miss in F2.

Figure 8.5. Effect of a Load instruction on pipeline timing.

[Timing diagram: instructions I1, I2 (Load), I3, I4 and I5 passing through stages F, D, E and W, with an additional memory access step M2 for the Load instruction; horizontal axis: time in clock cycles 1 to 7.]

Figure 8.6. Pipeline stalled by data dependency between D2 and W1.

[Timing diagram: instructions I1 (Mul), I2 (Add), I3 and I4 passing through stages F, D, E and W; the decode step of I2 appears as D2 followed by D2A; horizontal axis: time in clock cycles 1 to 9.]


Figure 8.7. Operand forwarding in a pipelined processor.
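The dependency that stalls the pipeline in Figure 8.6, and that operand forwarding addresses, is a read-after-write relationship between nearby instructions. The sketch below shows one way such pairs could be detected. It is illustrative only: the `Instr` type, the `raw_hazards` function, the `window` parameter and the register names in the example are all assumptions, since the slides give only the opcodes Mul and Add.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    srcs: Tuple[str, ...]  # source registers
    dest: str              # destination register


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each read-after-write
    pair in which the consumer follows the producer by at most `window` instructions."""
    hazards = []
    for i, producer in enumerate(program):
        for j in range(i + 1, min(i + 1 + window, len(program))):
            consumer = program[j]
            if producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
            if consumer.dest == producer.dest:
                # The register is overwritten; later readers depend on the newer value.
                break
    return hazards


# Hypothetical operands chosen for illustration.
program = [
    Instr("Mul", ("R2", "R3"), "R4"),
    Instr("Add", ("R5", "R4"), "R6"),
    Instr("Sub", ("R1", "R7"), "R8"),
]
hazards = raw_hazards(program)  # the Add reads R4, which the Mul writes
```

A pipeline without forwarding would have to hold the consumer in its decode stage until the producer has written its result; with forwarding, the result can be passed to the consumer directly from the ALU output.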

Figure 8.8. An idle cycle caused by a branch instruction.

[Timing diagram: instructions I1, I2 (Branch), I3, Ik and Ik+1 passing through stages F and E; one entry is marked X and one cycle is labeled "Execution unit idle"; horizontal axis: time in clock cycles 1 to 6.]


Figure 8.9. Branch timing.

Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b.

[Block diagram: instruction fetch unit (F: Fetch instruction), instruction queue, dispatch/decode unit (D), E: Execute instruction, W: Write results.]

Figure 8.11. Branch timing in the presence of an instruction queue. Branch target address is computed in the D stage.

[Timing diagram: instructions I1 to I4, I5 (Branch), I6, Ik and Ik+1 passing through stages F, D, E and W; the execute step of I1 occupies three cycles (E1 E1 E1); one entry is marked X; horizontal axis: time in clock cycles 1 to 10.]

Queue length in clock cycles 1 to 10: 1 1 1 1 2 3 2 1 1 1

(a) Original program loop

LOOP    Shift_left   R1
        Decrement    R2
        Branch=0     LOOP
NEXT    Add          R1,R3

(b) Reordered instructions

LOOP    Decrement    R2
        Branch=0     LOOP
        Shift_left   R1
NEXT    Add          R1,R3

Figure 8.12. Reordering of instructions for a delayed branch.


Figure 8.13. Execution timing showing the delay slot being filled during the last two passes through the loop in Figure 8.12.


Figure 8.14. Timing when a branch decision has been incorrectly predicted as not taken.


Figure 8.15. State-machine representation of branch prediction algorithms.
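A state-machine branch predictor can be sketched in a few lines. The version below is the widely used two-bit saturating counter, offered as an assumption-laden illustration; the transitions drawn in Figure 8.15 may differ from it. The class name `TwoBitPredictor` and the helper `count_mispredictions` are hypothetical.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        """Return True if the branch is predicted taken."""
        return self.state >= 2

    def update(self, taken):
        """Move one state toward 3 on a taken branch, one state toward 0 otherwise."""
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    """Run the predictor over a sequence of actual outcomes and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop-closing branch: taken four times, then not taken, for three runs of the loop.
outcomes = ([True] * 4 + [False]) * 3
misses = count_mispredictions(outcomes, TwoBitPredictor())
```

Starting from state 0, this predictor mispredicts twice while warming up and then once per loop exit, because a single not-taken outcome is not enough to flip its prediction for the next run of the loop.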


Figure 8.16. Equivalent operations using complex and simple addressing modes.

(a) A program fragment

        Add          R1,R2
        Compare      R3,R4
        Branch=0     . . .

(b) Instructions reordered

        Compare      R3,R4
        Add          R1,R2
        Branch=0     . . .

Figure 8.17. Instruction reordering.


Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU.

Figure 8.19. A processor with two execution units.

[Block diagram: F: Instruction fetch unit, instruction queue, dispatch unit, floating-point unit, integer unit, W: Write results.]

[Timing diagram; horizontal axis: time in clock cycles 1 to 7. Stage sequence for each instruction:]

I1 (Fadd)    F1  D1  E1A  E1B  E1C  W1
I2 (Add)     F2  D2  E2   W2
I3 (Fsub)    F3  D3  E3   E3   E3   W3
I4 (Sub)     F4  D4  E4   W4

Figure 8.20. An example of instruction execution flow in the processor of Figure 8.19, assuming no hazards are encountered.


Figure 8.21. Instruction completion in program order.

(a) Desired program loop

            LDX      R3, 0, R6         Load number of items in the list.
            OR       R0, R0, R4        R4 to be used as offset in the list.
            OR       R0, R0, R7        Clear R7 to be used as accumulator.
LOOPSTART   LDX      R3, R4, R5        Load list item into R5.
            ADD      R5, R7, R7        Add number to accumulator.
            ADD      R4, 8, R4         Point to the next entry.
            SUBcc    R6, 1, R6         Decrement R6 and set condition flags.
            BG       xcc, LOOPSTART    Loop if more items in the list.
NEXT        ...

(b) Instructions reorganized to use the delay slot

            LDX      R3, 0, R6
            OR       R0, R0, R4
            OR       R0, R0, R7
LOOPSTART   LDX      R3, R4, R5
            ADD      R4, 8, R4
            SUBcc    R6, 1, R6
            BG,pt    xcc, LOOPSTART    Predicted taken, Annul bit = 0
            ADD      R5, R7, R7
NEXT        ...

Figure 8.22. An addition loop showing the use of the branch delay slot and branch prediction.
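The two versions of the loop in Figure 8.22 can be compared with a small Python model. This is a sketch under stated assumptions, not a SPARC simulator: the list is passed as a Python list holding only the items, the count is taken from its length, R4 is treated as a byte offset over 8-byte entries, and the list is assumed to be non-empty. The function names are hypothetical.

```python
def loop_original(items):
    """Model of version (a): accumulate, advance, decrement, then branch."""
    r6 = len(items)  # number of items (assumed to be at least 1)
    r4 = 0           # byte offset into the list
    r7 = 0           # accumulator
    while True:
        r5 = items[r4 // 8]  # LDX: load list item
        r7 += r5             # ADD R5, R7, R7
        r4 += 8              # ADD R4, 8, R4
        r6 -= 1              # SUBcc R6, 1, R6
        if not r6 > 0:       # BG falls through when no items remain
            break
    return r7


def loop_delay_slot(items):
    """Model of version (b): the accumulate step sits in the branch delay slot."""
    r6 = len(items)
    r4 = 0
    r7 = 0
    while True:
        r5 = items[r4 // 8]  # LDX: load list item
        r4 += 8              # ADD R4, 8, R4
        r6 -= 1              # SUBcc R6, 1, R6
        taken = r6 > 0       # BG,pt decides here
        r7 += r5             # delay slot: runs whether or not the branch is taken (annul bit = 0)
        if not taken:
            break
    return r7
```

Both functions return the same sum for any non-empty list, for example 60 for `[10, 20, 30]`. The delay-slot instruction must execute on the final pass as well, which is why the model applies it before testing `taken`.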


Figure 8.23. Main building blocks of the UltraSPARC II processor.

[Pipeline diagram, with an instruction buffer:]

Common stages:                         F (Fetch)  D (Decode)  G (Group)
Two integer pipelines (each):          E (Execute)  C (Cache)  N1 (Delay)  N2 (Delay)  N3 (Check)  W (Write)
Two floating-point pipelines (each):   R (Register)  X1 (Execute)  X2 (Execute)  X3 (Execute)  N3 (Check)  W (Write)

Figure 8.24. Pipeline organization of the UltraSPARC II processor.


Figure 8.25. Example of instruction grouping.

(a) Instructions with common destination

ADD      R3, R5, R6    G  E  C  N1  N2  N3  W
LDSW     R4, R7, R6    G  E  C  N1  N2  N3  W

(b) Delay caused by MOVR instruction

MOVRZ    R1, R6, R7    G  E  C  N1  N2  N3  W
OR       R7, R8, R9    G  E  C  N1  N2  N3  W

Figure 8.26. Dispatch delays due to hazards.

Figure 8.27. Integer execution unit.

[Block diagram: integer register file, annex, IEU0, IEU1, ALU, interstage buffers.]

I1 (Icc)     G  E  C
I2 (BRcc)    G  E  C
I3           G  E  C
I4           G  E  C
I5              G  E
I6              G  E
I7              G  E
I8              G  E
I9                 G
I10                G
I11                G
I12                G
                      Abort

Figure 8.28. Worst-case timing for an incorrectly predicted branch.

Figure 8.29. Load and store unit.

[Block diagram: integer register file/annex, D-Cache data, D-Cache tags, dTLB, compare logic, load/store queue, and a miss path to the E-Cache; pipeline stages G, E, C, N1.]


Figure 8.30. Execution flow.


Table 8.1. Examples of SPARC instructions.