Please see “portrait orientation” PowerPoint file for Chapter 8
Figure 8.1. Basic idea of instruction pipelining.
Figure 8.2. A 4-stage pipeline.
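As a rough illustration of the overlap in a 4-stage pipeline, the following Python sketch builds the ideal timing table, in which instruction i enters stage F in clock cycle i and advances one stage per cycle. The stage names F, D, E, and W follow the labels used in this chapter's figures; the function name pipeline_timing and the assumption of no stalls are illustrative only.

```python
STAGES = ["F", "D", "E", "W"]  # fetch, decode, execute, write


def pipeline_timing(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage label in cycle c+1.

    Assumes an ideal pipeline: one instruction issued per cycle, no stalls.
    """
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            # Instruction i+1 occupies stage s in clock cycle i+s+1.
            row[i + s] = f"{name}{i + 1}"
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(pipeline_timing(4), start=1):
        print(f"I{i}: " + " ".join(cell.ljust(2) for cell in row))
```

With four instructions the table spans 4 + 4 - 1 = 7 clock cycles, against 16 if each instruction had to finish before the next one started.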
Figure 8.3. Effect of an execution operation taking more than one clock cycle.
(Timing diagram: instructions I1 to I5 on the vertical axis, clock cycles 1 to 9 along the time axis; each instruction passes through stages F, D, E, and W.)
Figure 8.4. Pipeline stall caused by a cache miss in F2.
Figure 8.5. Effect of a Load instruction on pipeline timing.
(Timing diagram: instructions I1, I2 (Load), I3, I4, and I5 on the vertical axis, clock cycles 1 to 7 along the time axis; the Load instruction I2 includes a stage M2 in addition to F2, D2, E2, and W2.)
Figure 8.6. Pipeline stalled by data dependency between D2 and W1.
(Timing diagram: instructions I1 (Mul), I2 (Add), I3, and I4 on the vertical axis, clock cycles 1 to 9 along the time axis; the decode of I2 appears as D2 and D2A.)
Figure 8.7. Operand forwarding in a pipelined processor.
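Figure 8.8. An idle cycle caused by a branch instruction.
(Timing diagram: instructions I1, I2 (Branch), I3, Ik, and Ik+1 on the vertical axis, clock cycles 1 to 6 along the time axis; each instruction has a fetch step F and an execute step E. The step after F3 is marked X, and the execution unit is idle for one cycle.)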
Figure 8.9. Branch timing.
Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b.
(Block diagram: an instruction fetch unit (F: Fetch instruction) and an instruction queue, with D: Dispatch/Decode unit, E: Execute instruction, and W: Write results.)
Figure 8.11. Branch timing in the presence of an instruction queue. Branch target address is computed in the D stage.
(Timing diagram: instructions I1 to I4, I5 (Branch), I6, Ik, and Ik+1 on the vertical axis, clock cycles 1 to 10 along the time axis; E1 occupies three cycles, and one step is marked X. Queue length by clock cycle: 1 1 1 1 2 3 2 1 1 1.)
LOOP    Shift_left  R1
        Decrement   R2
        Branch=0    LOOP
NEXT    Add         R1,R3
(a) Original program loop

LOOP    Decrement   R2
        Branch=0    LOOP
        Shift_left  R1
NEXT    Add         R1,R3
(b) Reordered instructions
Figure 8.12. Reordering of instructions for a delayed branch.
Figure 8.13. Execution timing showing the delay slot being filled during the last two passes through the loop in Figure 8.12.
Figure 8.14. Timing when a branch decision has been incorrectly predicted as not taken.
Figure 8.15. State-machine representation of branch prediction algorithms.
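One widely used state machine of this kind is the 2-bit saturating counter, sketched below in Python. This is offered only as an example of a prediction state machine and is an assumption, not necessarily the algorithm drawn in Figure 8.15; the class name TwoBitPredictor, the helper count_mispredictions, and the state numbering 0 to 3 are hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter (assumed example, not taken from the figure).

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        # True means the branch is predicted taken.
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, state=0):
    """Run the predictor over a list of actual outcomes (True = taken)."""
    predictor = TwoBitPredictor(state)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # A loop branch taken four times, then not taken, executed twice.
    outcomes = ([True] * 4 + [False]) * 2
    print(count_mispredictions(outcomes, state=3))
```

Because two wrong guesses in a row are needed to flip the prediction, a loop-closing branch that starts in the strongly-taken state is mispredicted only on each loop exit in this sketch.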
Figure 8.16. Equivalent operations using complex and simple addressing modes.
Add       R1,R2
Compare   R3,R4
Branch=0  . . .
(a) A program fragment

Compare   R3,R4
Add       R1,R2
Branch=0  . . .
(b) Instructions reordered
Figure 8.17. Instruction reordering.
Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU.
Figure 8.19. A processor with two execution units.
(Block diagram: F: Instruction fetch unit, instruction queue, dispatch unit, a floating-point unit and an integer unit, and W: Write results.)
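Figure 8.20. An example of instruction execution flow in the processor of Figure 8.19, assuming no hazards are encountered.
(Timing diagram: instructions I1 (Fadd), I2 (Add), I3 (Fsub), and I4 (Sub) on the vertical axis, clock cycles 1 to 7 along the time axis. The floating-point instructions occupy three execute cycles each (E1A, E1B, E1C and E3, E3, E3); the integer instructions occupy one (E2, E4).)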
Figure 8.21. Instruction completion in program order.
            LDX    R3, 0, R6        Load number of items in the list.
            OR     R0, R0, R4       R4 to be used as offset in the list.
            OR     R0, R0, R7       Clear R7 to be used as accumulator.
LOOPSTART   LDX    R3, R4, R5       Load list item into R5.
            ADD    R5, R7, R7       Add number to accumulator.
            ADD    R4, 8, R4        Point to the next entry.
            SUBcc  R6, 1, R6        Decrement R6 and set condition flags.
            BG     xcc, LOOPSTART   Loop if more items in the list.
NEXT        ...
(a) Desired program loop

            LDX    R3, 0, R6
            OR     R0, R0, R4
            OR     R0, R0, R7
LOOPSTART   LDX    R3, R4, R5
            ADD    R4, 8, R4
            SUBcc  R6, 1, R6
            BG,pt  xcc, LOOPSTART   Predicted taken, Annul bit = 0
            ADD    R5, R7, R7
NEXT        ...
(b) Instructions reorganized to use the delay slot
Figure 8.22. An addition loop showing the use of the branch delay slot and branch prediction.
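To check that moving ADD R5, R7, R7 into the delay slot leaves the result unchanged, the two listings can be mirrored step by step in Python. This is a minimal sketch under stated assumptions: memory is modeled as a dict keyed by byte offset from the address in R3, R0 is taken to read as zero, and the delay-slot instruction is executed whether or not the branch is taken (annul bit = 0). The function names run_loop_a and run_loop_b and the sample memory contents are hypothetical.

```python
def run_loop_a(mem):
    """Mirror of listing (a); returns the final value of R7."""
    r6 = mem[0]            # LDX   R3, 0, R6
    r4 = 0                 # OR    R0, R0, R4
    r7 = 0                 # OR    R0, R0, R7
    while True:
        r5 = mem[r4]       # LDX   R3, R4, R5
        r7 = r5 + r7       # ADD   R5, R7, R7
        r4 = r4 + 8        # ADD   R4, 8, R4
        r6 = r6 - 1        # SUBcc R6, 1, R6
        if not r6 > 0:     # BG    xcc, LOOPSTART
            break
    return r7


def run_loop_b(mem):
    """Mirror of listing (b), with the ADD in the branch delay slot."""
    r6 = mem[0]            # LDX   R3, 0, R6
    r4 = 0                 # OR    R0, R0, R4
    r7 = 0                 # OR    R0, R0, R7
    while True:
        r5 = mem[r4]       # LDX   R3, R4, R5
        r4 = r4 + 8        # ADD   R4, 8, R4
        r6 = r6 - 1        # SUBcc R6, 1, R6
        taken = r6 > 0     # BG,pt xcc, LOOPSTART (decision made here)
        r7 = r5 + r7       # delay slot: runs on both branch outcomes
        if not taken:
            break
    return r7


if __name__ == "__main__":
    # Hypothetical memory image: 8-byte words at byte offsets 0, 8, 16, 24.
    mem = {0: 3, 8: 10, 16: 20, 24: 30}
    print(run_loop_a(mem), run_loop_b(mem))
```

The reordering is safe in this sketch because the ADD in the delay slot neither changes R4 or R6 nor depends on the branch outcome, so R7 accumulates the same values in both versions.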
Figure 8.23. Main building blocks of the UltraSPARC II processor.
Figure 8.24. Pipeline organization of the UltraSPARC II processor.
(Pipeline diagram: front-end stages F (Fetch), D (Decode), and G (Group), with an instruction buffer. Two integer pipelines: E (Execute), C (Cache), N1 (Delay), N2 (Delay), N3 (Check), W (Write). Two floating-point pipelines: R (Register), X1 (Execute), X2 (Execute), X3 (Execute), N3 (Check), W (Write).)
Figure 8.25. Example of instruction grouping.
ADD    R3, R5, R6    stages: G E C N1 N2 N3 W
LDSW   R4, R7, R6    stages: G E C N1 N2 N3 W
(a) Instructions with common destination

MOVRZ  R1, R6, R7    stages: G E C N1 N2 N3 W
OR     R7, R8, R9    stages: G E C N1 N2 N3 W
(b) Delay caused by MOVR instruction
Figure 8.26. Dispatch delays due to hazards.
Figure 8.27. Integer execution unit.
(Block diagram: integer register file, annex, IEU0, IEU1, ALU, and interstage buffers.)
I1 (Icc)    G E C
I2 (BRcc)   G E C
I3          G E C
I4          G E C
I5            G E
I6            G E
I7            G E
I8            G E
I9              G
I10             G
I11             G
I12             G
Abort
Figure 8.28. Worst-case timing for an incorrectly predicted branch.
Figure 8.29. Load and store unit.
(Block diagram across stages G, E, C, and N1: integer register file/annex, dTLB, D-Cache data, D-Cache tags, compare, load/store queue, and a miss path to the E-Cache.)
Figure 8.30. Execution flow.
Table 8.1. Examples of SPARC instructions.