CSC 4250 Computer Architectures
October 17, 2006
Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation
MIPS FP Unit using Tomasulo’s Algorithm
MIPS Processor with Scoreboard
Three Steps in Execution for Tomasulo’s Algorithm
1. Issue ─ if no structural hazards
2. Execute ─ if both operands are available
3. Write result on the CDB (from there into the registers and into any reservation stations waiting for the result)
Recall that for Scoreboard: Four Steps in Execution
1. Issue ─ if no structural or WAW hazards
2. Read operands ─ if no RAW hazards
3. Execute ─ if both operands are received
4. Write result ─ if no WAR hazards
How Hazards are Handled
Structural Hazards ─ Reservation stations allow more instructions to be issued
RAW Hazards ─ An instruction is executed only when its operands are available
WAR and WAW Hazards ─ Register renaming eliminates these hazards by renaming all destination registers, including those with a pending read or write for an earlier instruction, so that the out-of-order write does not affect any instruction that depends on an earlier value of an operand
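The renaming idea can be sketched in Python using the six-instruction sequence from the example slides. This is only an illustration, not the hardware mechanism: the helper `rename` and the names `T0`, `T1`, ... are assumptions, standing in for the reservation-station tags that do the renaming in Tomasulo's algorithm.

```python
def rename(program):
    """Give every destination a fresh name so that only true (RAW) dependences remain.

    program: list of (op, dest, sources) tuples.
    Hypothetical helper; the names T0, T1, ... stand in for station tags.
    """
    latest = {}      # architectural register -> most recent new name
    renamed = []
    for n, (op, dest, sources) in enumerate(program):
        # Sources read the newest name of each register, preserving RAW dependences.
        new_sources = [latest.get(src, src) for src in sources]
        # The destination always gets a fresh name, so WAR and WAW conflicts vanish.
        latest[dest] = f"T{n}"
        renamed.append((op, latest[dest], new_sources))
    return renamed


program = [
    ("L.D",   "F6",  ["R2"]),
    ("L.D",   "F2",  ["R3"]),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F2", "F6"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6",  ["F8", "F2"]),   # WAW with the first L.D, WAR with SUB.D and DIV.D
]
renamed = rename(program)
for op, dest, sources in renamed:
    print(op, dest, sources)
```

After renaming, DIV.D reads `T0` while ADD.D writes `T5`, so ADD.D may finish first without disturbing the value that DIV.D needs.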
Tags
A tag is a 4-bit quantity that denotes one of the five reservation stations or one of the six load buffers
Tag fields are found in the reservation stations, the register file, and the store buffers
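A quick check of the bit count, as a sketch: five reservation stations plus six load buffers give 11 producers, and one more code is needed to mean "no pending producer", so 12 codes fit in 4 bits. The numbering below (and the use of 0 for "no tag") is an assumption for illustration; the slide does not give the actual encoding.

```python
import math

# Assumed numbering (not from the slide): 0 means "value already available".
STATIONS = ["Add1", "Add2", "Add3", "Mult1", "Mult2"]
LOAD_BUFFERS = [f"Load{i}" for i in range(1, 7)]

TAG = {name: code for code, name in enumerate(STATIONS + LOAD_BUFFERS, start=1)}
NO_TAG = 0

codes_needed = len(TAG) + 1                      # 11 producers plus the "no tag" code
bits_needed = math.ceil(math.log2(codes_needed))
print(codes_needed, bits_needed)                 # 12 4
```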
Example
L.D F6,34(R2)
L.D F2,45(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
Three Tables
(1st table is not part of hardware; 2nd and 3rd tables are distributed)
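1. Instruction status ─ indicates which of the three steps each instruction has reached
2. Reservation stations ─ busy, op, Vj, Vk, Qj, Qk, A (V = operand value; Q = reservation station that will produce the operand; A = address information for a load or store)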
3. Register status ─ indicates which reservation station will write this register
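One possible way to model the second and third tables in software is sketched below. The class and field names are assumptions chosen to mirror the column headings; real hardware holds these as fields in each station, not as objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None   # value of the first source operand, if available
    vk: Optional[str] = None   # value of the second source operand, if available
    qj: Optional[str] = None   # station that will produce the first operand
    qk: Optional[str] = None   # station that will produce the second operand
    a: Optional[str] = None    # address information (loads and stores only)

    def ready(self) -> bool:
        """An instruction may execute once neither operand is pending."""
        return self.busy and self.qj is None and self.qk is None


# Register status: which station will write each FP register (None = no pending write).
register_status = {f"F{i}": None for i in range(0, 32, 2)}

# Mult1 after MUL.D F0,F2,F4 issues while L.D F2,45(R3) is still in Load2.
mult1 = ReservationStation("Mult1", busy=True, op="Mult", vk="Reg[F4]", qj="Load2")
register_status["F0"] = "Mult1"
print(mult1.ready())   # False: still waiting on Load2
```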
Figure 0.0
Instruction Issue Execute Write Result
L.D F6,34(R2) √ √
L.D F2,45(R3) √ √
MUL.D F0,F2,F4 √
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
Name    Busy  Op     Vj  Vk        Qj     Qk     A
Load1   Yes   Load                               34+Reg[R2]
Load2   Yes   Load                               45+Reg[R3]
Add1    No
Add2    No
Add3    No
Mult1   Yes   Mult       Reg[F4]   Load2
Mult2   No
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Mult1  Load2         Load1
Figure 0.1
Instruction Issue Execute Write Result
L.D F6,34(R2) √ √
L.D F2,45(R3) √ √
MUL.D F0,F2,F4 √
SUB.D F8,F2,F6 √
DIV.D F10,F0,F6
ADD.D F6,F8,F2
Name    Busy  Op     Vj  Vk        Qj     Qk     A
Load1   Yes   Load                               34+Reg[R2]
Load2   Yes   Load                               45+Reg[R3]
Add1    Yes   Sub                  Load2  Load1
Add2    No
Add3    No
Mult1   Yes   Mult       Reg[F4]   Load2
Mult2   No
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Mult1  Load2         Load1  Add1
Figure 0.2 (Suppose LD is slow)
Instruction Issue Execute Write Result
L.D F6,34(R2) √ √
L.D F2,45(R3) √ √
MUL.D F0,F2,F4 √
SUB.D F8,F2,F6 √
DIV.D F10,F0,F6 √
ADD.D F6,F8,F2
Name    Busy  Op     Vj  Vk        Qj     Qk     A
Load1   Yes   Load                               34+Reg[R2]
Load2   Yes   Load                               45+Reg[R3]
Add1    Yes   Sub                  Load2  Load1
Add2    No
Add3    No
Mult1   Yes   Mult       Reg[F4]   Load2
Mult2   Yes   Div                  Mult1  Load1
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Mult1  Load2         Load1  Add1   Mult2
Figure 0.3 (Suppose LD is slow)
Instruction Issue Execute Write Result
L.D F6,34(R2) √ √
L.D F2,45(R3) √ √
MUL.D F0,F2,F4 √
SUB.D F8,F2,F6 √
DIV.D F10,F0,F6 √
ADD.D F6,F8,F2 √
Name    Busy  Op     Vj  Vk        Qj     Qk     A
Load1   Yes   Load                               34+Reg[R2]
Load2   Yes   Load                               45+Reg[R3]
Add1    Yes   Sub                  Load2  Load1
Add2    Yes   Add                  Add1   Load2
Add3    No
Mult1   Yes   Mult       Reg[F4]   Load2
Mult2   Yes   Div                  Mult1  Load1
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Mult1  Load2         Add2   Add1   Mult2
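The Qj/Qk and Qi entries in Figure 0.3 follow mechanically from the issue rule: a source that has a pending producer gets that producer's tag, otherwise its value is copied. The sketch below replays only the issue step, assuming the loads are slow so that nothing has written back; the function `issue_all` and its dictionary layout are illustrative assumptions, and the station assignment is copied from the slides rather than computed.

```python
def issue_all(program):
    """Issue instructions in order, assuming nothing has written back yet.

    program: list of (station, op, dest, src_j, src_k).
    Returns (stations, qi), where qi maps a register to its producing station.
    """
    qi = {}          # register status table
    stations = {}    # reservation stations and load buffers
    for station, op, dest, src_j, src_k in program:
        entry = {"Op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
        for src, v, q in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
            if src is None:
                continue
            if src in qi:                  # operand still pending: record the tag
                entry[q] = qi[src]
            else:                          # operand available: copy the value
                entry[v] = f"Reg[{src}]"
        stations[station] = entry
        qi[dest] = station                 # later readers of dest wait on this station
    return stations, qi


program = [
    ("Load1", "Load", "F6",  None, None),   # L.D   F6,34(R2)
    ("Load2", "Load", "F2",  None, None),   # L.D   F2,45(R3)
    ("Mult1", "Mult", "F0",  "F2", "F4"),   # MUL.D F0,F2,F4
    ("Add1",  "Sub",  "F8",  "F2", "F6"),   # SUB.D F8,F2,F6
    ("Mult2", "Div",  "F10", "F0", "F6"),   # DIV.D F10,F0,F6
    ("Add2",  "Add",  "F6",  "F8", "F2"),   # ADD.D F6,F8,F2
]
stations, qi = issue_all(program)
for name, entry in stations.items():
    print(name, entry)
print(qi)
```

Note that F6 ends up pointing at Add2 rather than Load1: the later write has taken over the register, which is how the WAW hazard disappears.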
Figure 3.3
Instruction Issue Execute Write Result
L.D F6,34(R2) √ √ √
L.D F2,45(R3) √ √
MUL.D F0,F2,F4 √
SUB.D F8,F2,F6 √
DIV.D F10,F0,F6 √
ADD.D F6,F8,F2 √
Name    Busy  Op     Vj               Vk               Qj     Qk     A
Load1   No
Load2   Yes   Load                                                   45+Reg[R3]
Add1    Yes   Sub                     Mem[34+Reg[R2]]  Load2
Add2    Yes   Add                                      Add1   Load2
Add3    No
Mult1   Yes   Mult                    Reg[F4]          Load2
Mult2   Yes   Div                     Mem[34+Reg[R2]]  Mult1
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Mult1  Load2         Add2   Add1   Mult2
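The change from the all-issued state to Figure 3.3 is a single CDB broadcast from Load1. A minimal sketch of that write-result step follows; the function `broadcast` and the dictionary layout are assumptions made for illustration, with the starting state taken from the slides.

```python
def broadcast(stations, qi, tag, value):
    """Write-result step: put (tag, value) on the CDB.

    Every station waiting on `tag` captures the value. A register is updated
    only if it still names `tag` as its producer.
    Returns the list of registers that were written.
    """
    for entry in stations.values():
        if entry["Qj"] == tag:
            entry["Vj"], entry["Qj"] = value, None
        if entry["Qk"] == tag:
            entry["Vk"], entry["Qk"] = value, None
    written = [reg for reg, producer in qi.items() if producer == tag]
    for reg in written:
        del qi[reg]          # no pending writer remains for this register
    return written


# State after all six instructions have issued and before any load completes.
stations = {
    "Add1":  {"Vj": None, "Vk": None,      "Qj": "Load2", "Qk": "Load1"},
    "Add2":  {"Vj": None, "Vk": None,      "Qj": "Add1",  "Qk": "Load2"},
    "Mult1": {"Vj": None, "Vk": "Reg[F4]", "Qj": "Load2", "Qk": None},
    "Mult2": {"Vj": None, "Vk": None,      "Qj": "Mult1", "Qk": "Load1"},
}
qi = {"F0": "Mult1", "F2": "Load2", "F6": "Add2", "F8": "Add1", "F10": "Mult2"}

written = broadcast(stations, qi, "Load1", "Mem[34+Reg[R2]]")
print(written)   # []: F6 already names Add2, so the load value goes only to waiting stations
```

Add1 and Mult2 pick up the loaded value in Vk, while register F6 is left for Add2 to write.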
Figure 0.4 (2nd load just completes)
Instruction Issue Execute Write Result
L.D F6,34(R2) √ √ √
L.D F2,45(R3) √ √ √
MUL.D F0,F2,F4 √ √
SUB.D F8,F2,F6 √ √
DIV.D F10,F0,F6 √
ADD.D F6,F8,F2 √
Name    Busy  Op     Vj               Vk               Qj     Qk     A
Load1   No
Load2   No
Add1    Yes   Sub    Mem[45+Reg[R3]]  Mem[34+Reg[R2]]
Add2    Yes   Add                     Mem[45+Reg[R3]]  Add1
Add3    No
Mult1   Yes   Mult   Mem[45+Reg[R3]]  Reg[F4]
Mult2   Yes   Div                     Mem[34+Reg[R2]]  Mult1
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Mult1                Add2   Add1   Mult2
Figure 3.4
Instruction Issue Execute Write Result
L.D F6,34(R2) √ √ √
L.D F2,45(R3) √ √ √
MUL.D F0,F2,F4 √ √
SUB.D F8,F2,F6 √ √ √
DIV.D F10,F0,F6 √
ADD.D F6,F8,F2 √ √ √
Name    Busy  Op     Vj               Vk               Qj     Qk     A
Load1   No
Load2   No
Add1    No
Add2    No
Add3    No
Mult1   Yes   Mult   Mem[45+Reg[R3]]  Reg[F4]
Mult2   Yes   Div                     Mem[34+Reg[R2]]  Mult1
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Mult1                              Mult2
Loop-Based Example
Loop: L.D F0,0(R1)
MUL.D F4,F0,F2
S.D F4,0(R1)
DADDIU R1,R1,#−8
BNE R1,R2,Loop
Figure 0.5. One active iteration of loop
Instruction Iteration Issue Execute Write Result
L.D F0,0(R1) 1 √ √
MUL.D F4,F0,F2 1 √
S.D F4,0(R1) 1 √
L.D F0,0(R1) 2
MUL.D F4,F0,F2 2
S.D F4,0(R1) 2
Name    Busy  Op     Vj  Vk        Qj     Qk     A
Load1   Yes   Load                               Reg[R1]
Load2   No
Add1    No
Add2    No
Add3    No
Mult1   Yes   Mult       Reg[F2]   Load1
Mult2   No
Store1  Yes   Store                       Mult1  Reg[R1]
Store2  No
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Load1         Mult1
Figure 0.6. One+ active iteration of loop
Instruction Iteration Issue Execute Write Result
L.D F0,0(R1) 1 √ √
MUL.D F4,F0,F2 1 √
S.D F4,0(R1) 1 √
L.D F0,0(R1) 2 √
MUL.D F4,F0,F2 2
S.D F4,0(R1) 2
Name    Busy  Op     Vj  Vk        Qj     Qk     A
Load1   Yes   Load                               Reg[R1]
Load2   Yes   Load                               Reg[R1]-8
Add1    No
Add2    No
Add3    No
Mult1   Yes   Mult       Reg[F2]   Load1
Mult2   No
Store1  Yes   Store                       Mult1  Reg[R1]
Store2  No
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Load2         Mult1
Figure 0.7. One++ active iteration of loop
Instruction Iteration Issue Execute Write Result
L.D F0,0(R1) 1 √ √
MUL.D F4,F0,F2 1 √
S.D F4,0(R1) 1 √
L.D F0,0(R1) 2 √ √
MUL.D F4,F0,F2 2 √
S.D F4,0(R1) 2
Name    Busy  Op     Vj  Vk        Qj     Qk     A
Load1   Yes   Load                               Reg[R1]
Load2   Yes   Load                               Reg[R1]-8
Add1    No
Add2    No
Add3    No
Mult1   Yes   Mult       Reg[F2]   Load1
Mult2   Yes   Mult       Reg[F2]   Load2
Store1  Yes   Store                       Mult1  Reg[R1]
Store2  No
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Load2         Mult2
Figure 3.6. Two active iterations of loop
Instruction Iteration Issue Execute Write Result
L.D F0,0(R1) 1 √ √
MUL.D F4,F0,F2 1 √
S.D F4,0(R1) 1 √
L.D F0,0(R1) 2 √ √
MUL.D F4,F0,F2 2 √
S.D F4,0(R1) 2 √
Name    Busy  Op     Vj  Vk        Qj     Qk     A
Load1   Yes   Load                               Reg[R1]
Load2   Yes   Load                               Reg[R1]-8
Add1    No
Add2    No
Add3    No
Mult1   Yes   Mult       Reg[F2]   Load1
Mult2   Yes   Mult       Reg[F2]   Load2
Store1  Yes   Store                       Mult1  Reg[R1]
Store2  Yes   Store                       Mult2  Reg[R1]-8
Field  F0     F2     F4     F6     F8     F10    F12  …  F30
Qi     Load2         Mult2
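The loop figures show the hardware in effect unrolling the loop: each iteration's F0 and F4 are renamed to a different station. The sketch below replays just the floating-point issue steps for two iterations, assuming the branch is predicted taken and the integer instructions complete immediately. The helper `issue` is hypothetical and tracks only which tag each instruction waits on.

```python
def issue(qi, station, dest=None, sources=()):
    """Issue one instruction: return the tags it must wait on, then claim dest."""
    waits = {src: qi[src] for src in sources if src in qi}
    if dest is not None:
        qi[dest] = station       # later readers of dest wait on this station
    return waits


qi = {}       # register status for the FP registers
waits = {}    # station -> {source register: producing station}
for load, mult, store in [("Load1", "Mult1", "Store1"), ("Load2", "Mult2", "Store2")]:
    waits[load] = issue(qi, load, dest="F0")                          # L.D   F0,0(R1)
    waits[mult] = issue(qi, mult, dest="F4", sources=("F0", "F2"))    # MUL.D F4,F0,F2
    waits[store] = issue(qi, store, sources=("F4",))                  # S.D   F4,0(R1)

print(waits)
print(qi)
```

Each multiply waits on the load from its own iteration and each store on its own multiply, so the two iterations can proceed independently even though both name F0 and F4.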
IBM 360/91
Great ideas:
Data tagging
Register renaming
Dynamic detection of memory hazards
Generalized forwarding
Ideas broadly used now in microprocessors
Was 360/91 successful commercially?
IBM 360/85 (1968)
First commercial computer with a cache
Compared with the 360/91:
Slower clock cycle (80 ns versus 60 ns)
Less memory interleaving (4 versus 16)
Slower main memory (1.04 μs versus 0.75 μs)
Cheaper in price
Which machine was faster on applications?