
CSC 4250 Computer Architectures

October 17, 2006

Chapter 3. Instruction-Level Parallelism

& Its Dynamic Exploitation


MIPS FP Unit using Tomasulo’s Algorithm


MIPS Processor with Scoreboard


Three Steps in Execution for Tomasulo’s Alg.

1. Issue ─ if no structural hazards

2. Execute ─ if both operands are available

3. Write result ─ broadcast on the CDB, and from there into the registers and any reservation stations (including store buffers) waiting for the result

Recall that for Scoreboard: Four Steps in Execution

1. Issue ─ if no structural nor WAW hazards

2. Read operands ─ if no RAW hazards

3. Execute ─ if both operands are received

4. Write result ─ if no WAR hazards
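
The three Tomasulo steps can be sketched as three small checks. This is a minimal illustration only: the `Station` class and the function names are hypothetical, operand values are omitted, and only the pending tags (Qj, Qk) are modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Station:
    busy: bool = False
    qj: Optional[str] = None  # tag still awaited for operand j (None = ready)
    qk: Optional[str] = None  # tag still awaited for operand k (None = ready)


def can_issue(stations: List[Station]) -> bool:
    """Step 1: issue only if some station is free (no structural hazard)."""
    return any(not s.busy for s in stations)


def can_execute(s: Station) -> bool:
    """Step 2: execute only when neither operand is waiting on a tag."""
    return s.busy and s.qj is None and s.qk is None


def write_result(tag: str, stations: List[Station]) -> None:
    """Step 3: broadcast a tag on the CDB; waiting stations clear it."""
    for s in stations:
        if s.qj == tag:
            s.qj = None
        if s.qk == tag:
            s.qk = None


# One adder station waits on two loads; two adder stations are still free.
adders = [Station(busy=True, qj="Load2", qk="Load1"), Station(), Station()]
issue_ok = can_issue(adders)
before = can_execute(adders[0])
write_result("Load1", adders)
write_result("Load2", adders)
after = can_execute(adders[0])
print(issue_ok, before, after)  # True False True
```

Unlike the scoreboard, there is no separate "read operands" step here: operands arrive through the CDB broadcast in `write_result`.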


How Hazards are Handled

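Structural Hazards ─ An instruction issues only when a matching reservation station is free; providing several stations per functional unit lets more instructions issue before a stall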

RAW Hazards ─ An instruction is executed only when its operands are available

WAR and WAW Hazards ─ Register renaming eliminates these hazards by renaming all destination registers, including those with a pending read or write for an earlier instruction, so that the out-of-order write does not affect any instruction that depends on an earlier value of an operand
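
A small sketch of the renaming idea, using the arithmetic instructions from this lecture's example. The fresh names `T0`, `T1`, ... are hypothetical stand-ins; in Tomasulo's algorithm the reservation-station tags play that role.

```python
def rename(program):
    """Give every destination a fresh name; sources read the latest name."""
    latest = {}  # architectural register -> most recent renamed destination
    renamed = []
    for count, (op, dst, src1, src2) in enumerate(program):
        s1 = latest.get(src1, src1)
        s2 = latest.get(src2, src2)
        fresh = f"T{count}"  # hypothetical fresh name
        latest[dst] = fresh
        renamed.append((op, fresh, s1, s2))
    return renamed


program = [
    ("MUL.D", "F0", "F2", "F4"),
    ("SUB.D", "F8", "F2", "F6"),
    ("DIV.D", "F10", "F0", "F6"),
    ("ADD.D", "F6", "F8", "F2"),
]
renamed = rename(program)
for ins in renamed:
    print(ins)
# ('MUL.D', 'T0', 'F2', 'F4')
# ('SUB.D', 'T1', 'F2', 'F6')
# ('DIV.D', 'T2', 'T0', 'F6')
# ('ADD.D', 'T3', 'T1', 'F2')
```

After renaming, ADD.D writes `T3` rather than F6, so it can complete before DIV.D reads the old F6 without causing a WAR hazard.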


Tags

A tag is a 4-bit quantity that denotes one of the five reservation stations or one of the six load buffers

Tag fields are found in the reservation stations, the register file, and the store buffers
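
The 4-bit width follows from counting the possible result sources. A quick check, where `tag_bits` is a hypothetical helper and the extra "no tag" encoding is an assumption about how an empty tag might be represented:

```python
STATIONS = 5
LOAD_BUFFERS = 6


def tag_bits(sources: int) -> int:
    """Smallest number of bits that can encode `sources` distinct values."""
    return max(1, (sources - 1).bit_length())


sources = STATIONS + LOAD_BUFFERS
print(sources, tag_bits(sources))  # 11 4
print(tag_bits(sources + 1))       # 4, even with one assumed "no tag" code
```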


Example

L.D F6,34(R2)

L.D F2,45(R3)

MUL.D F0,F2,F4

SUB.D F8,F2,F6

DIV.D F10,F0,F6

ADD.D F6,F8,F2


Three Tables

(1st table is not part of hardware; 2nd and 3rd tables are distributed)

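1. Instruction status ─ indicates which of the three steps each instruction is in

2. Reservation stations ─ busy, op, Vj, Vk, Qj, Qk, A (V = operand value; Q = tag of the reservation station that will produce the operand; A = address information for a load or store)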

3. Register status ─ indicates which reservation station will write this register
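
One way to picture the second and third tables is as plain records. The sketch below is illustrative only: the class and field names are hypothetical, values are kept as strings, and the sample entry mirrors a MUL.D F0,F2,F4 that is waiting for a load in buffer Load2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None  # value of first operand, once known
    vk: Optional[str] = None  # value of second operand, once known
    qj: Optional[str] = None  # tag of the station producing operand j
    qk: Optional[str] = None  # tag of the station producing operand k
    a: Optional[str] = None   # address information for loads and stores

    def ready(self) -> bool:
        """Both operands are available when no tag is pending."""
        return self.busy and self.qj is None and self.qk is None


# Register status: which station (if any) will write each FP register.
register_status = {f"F{i}": None for i in range(0, 32, 2)}

mult1 = ReservationStation("Mult1", busy=True, op="Mult",
                           vk="Reg[F4]", qj="Load2")
register_status["F0"] = mult1.name

print(mult1.ready())          # False: still waiting on Load2
print(register_status["F0"])  # Mult1
```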


Figure 0.0

Instruction      Issue  Execute  Write Result
L.D F6,34(R2)    √      √        -
L.D F2,45(R3)    √      √        -
MUL.D F0,F2,F4   √      -        -
SUB.D F8,F2,F6   -      -        -
DIV.D F10,F0,F6  -      -        -
ADD.D F6,F8,F2   -      -        -

Name    Busy  Op    Vj  Vk       Qj     Qk     A
Load1   Yes   Load  -   -        -      -      34+Reg[R2]
Load2   Yes   Load  -   -        -      -      45+Reg[R3]
Add1    No    -     -   -        -      -      -
Add2    No    -     -   -        -      -      -
Add3    No    -     -   -        -      -      -
Mult1   Yes   Mult  -   Reg[F4]  Load2  -      -
Mult2   No    -     -   -        -      -      -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Mult1  Load2  -      Load1  -      -      -      …    -


Figure 0.1

Instruction      Issue  Execute  Write Result
L.D F6,34(R2)    √      √        -
L.D F2,45(R3)    √      √        -
MUL.D F0,F2,F4   √      -        -
SUB.D F8,F2,F6   √      -        -
DIV.D F10,F0,F6  -      -        -
ADD.D F6,F8,F2   -      -        -

Name    Busy  Op    Vj  Vk       Qj     Qk     A
Load1   Yes   Load  -   -        -      -      34+Reg[R2]
Load2   Yes   Load  -   -        -      -      45+Reg[R3]
Add1    Yes   Sub   -   -        Load2  Load1  -
Add2    No    -     -   -        -      -      -
Add3    No    -     -   -        -      -      -
Mult1   Yes   Mult  -   Reg[F4]  Load2  -      -
Mult2   No    -     -   -        -      -      -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Mult1  Load2  -      Load1  Add1   -      -      …    -


Figure 0.2 (Suppose L.D is slow)

Instruction      Issue  Execute  Write Result
L.D F6,34(R2)    √      √        -
L.D F2,45(R3)    √      √        -
MUL.D F0,F2,F4   √      -        -
SUB.D F8,F2,F6   √      -        -
DIV.D F10,F0,F6  √      -        -
ADD.D F6,F8,F2   -      -        -

Name    Busy  Op    Vj  Vk       Qj     Qk     A
Load1   Yes   Load  -   -        -      -      34+Reg[R2]
Load2   Yes   Load  -   -        -      -      45+Reg[R3]
Add1    Yes   Sub   -   -        Load2  Load1  -
Add2    No    -     -   -        -      -      -
Add3    No    -     -   -        -      -      -
Mult1   Yes   Mult  -   Reg[F4]  Load2  -      -
Mult2   Yes   Div   -   -        Mult1  Load1  -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Mult1  Load2  -      Load1  Add1   Mult2  -      …    -


Figure 0.3 (Suppose L.D is slow)

Instruction      Issue  Execute  Write Result
L.D F6,34(R2)    √      √        -
L.D F2,45(R3)    √      √        -
MUL.D F0,F2,F4   √      -        -
SUB.D F8,F2,F6   √      -        -
DIV.D F10,F0,F6  √      -        -
ADD.D F6,F8,F2   √      -        -

Name    Busy  Op    Vj  Vk       Qj     Qk     A
Load1   Yes   Load  -   -        -      -      34+Reg[R2]
Load2   Yes   Load  -   -        -      -      45+Reg[R3]
Add1    Yes   Sub   -   -        Load2  Load1  -
Add2    Yes   Add   -   -        Add1   Load2  -
Add3    No    -     -   -        -      -      -
Mult1   Yes   Mult  -   Reg[F4]  Load2  -      -
Mult2   Yes   Div   -   -        Mult1  Load1  -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Mult1  Load2  -      Add2   Add1   Mult2  -      …    -
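
The issue-time bookkeeping that produces this state can be sketched in a few lines. The function below is a simplified illustration, not the hardware algorithm in full: names such as `issue_all` and `free_stations` are hypothetical, it assumes a free station always exists (no stall modeling), and it starts from the point where both loads have already issued.

```python
def issue_all(program, free_stations, reg_status):
    """Issue instructions in order, recording values (V) or tags (Q)."""
    stations = {}
    for unit, op, dst, src1, src2 in program:
        name = free_stations[unit].pop(0)  # sketch: assumes one is free
        entry = {"Op": op}
        for v, q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
            tag = reg_status.get(src)
            if tag is None:
                entry[v] = f"Reg[{src}]"  # operand already in the register
            else:
                entry[q] = tag            # operand still being produced
        stations[name] = entry
        reg_status[dst] = name            # rename the destination
    return stations


reg_status = {"F6": "Load1", "F2": "Load2"}  # state after the two loads issue
free_stations = {"add": ["Add1", "Add2", "Add3"], "mult": ["Mult1", "Mult2"]}
program = [
    ("mult", "Mult", "F0", "F2", "F4"),
    ("add", "Sub", "F8", "F2", "F6"),
    ("mult", "Div", "F10", "F0", "F6"),
    ("add", "Add", "F6", "F8", "F2"),
]
stations = issue_all(program, free_stations, reg_status)
for name in sorted(stations):
    print(name, stations[name])
print(reg_status)
```

Note that DIV.D captures the tag Load1 for F6 before ADD.D renames F6 to Add2, which is how the WAR hazard on F6 disappears.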


Figure 3.3

Instruction      Issue  Execute  Write Result
L.D F6,34(R2)    √      √        √
L.D F2,45(R3)    √      √        -
MUL.D F0,F2,F4   √      -        -
SUB.D F8,F2,F6   √      -        -
DIV.D F10,F0,F6  √      -        -
ADD.D F6,F8,F2   √      -        -

Name    Busy  Op    Vj               Vk               Qj     Qk     A
Load1   No    -     -                -                -      -      -
Load2   Yes   Load  -                -                -      -      45+Reg[R3]
Add1    Yes   Sub   -                Mem[34+Reg[R2]]  Load2  -      -
Add2    Yes   Add   -                -                Add1   Load2  -
Add3    No    -     -                -                -      -      -
Mult1   Yes   Mult  -                Reg[F4]          Load2  -      -
Mult2   Yes   Div   -                Mem[34+Reg[R2]]  Mult1  -      -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Mult1  Load2  -      Add2   Add1   Mult2  -      …    -
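
The step from the all-issued state to Figure 3.3 is a single CDB broadcast from Load1. A minimal sketch, assuming dictionary-based tables and a hypothetical `broadcast` helper (the register file's stored values are not modeled):

```python
def broadcast(tag, value, stations, reg_status):
    """Deliver a result on the CDB to every waiting station and register."""
    for entry in stations.values():
        for q, v in (("Qj", "Vj"), ("Qk", "Vk")):
            if entry.get(q) == tag:
                del entry[q]       # operand no longer pending
                entry[v] = value   # capture the broadcast value
    for reg in reg_status:
        if reg_status[reg] == tag:
            reg_status[reg] = None  # register would receive the value here


# State with all six instructions issued and no load completed yet.
stations = {
    "Add1": {"Op": "Sub", "Qj": "Load2", "Qk": "Load1"},
    "Add2": {"Op": "Add", "Qj": "Add1", "Qk": "Load2"},
    "Mult1": {"Op": "Mult", "Vk": "Reg[F4]", "Qj": "Load2"},
    "Mult2": {"Op": "Div", "Qj": "Mult1", "Qk": "Load1"},
}
reg_status = {"F0": "Mult1", "F2": "Load2", "F6": "Add2",
              "F8": "Add1", "F10": "Mult2"}

broadcast("Load1", "Mem[34+Reg[R2]]", stations, reg_status)
print(stations["Add1"])   # Vk now holds Mem[34+Reg[R2]]; Qj is still Load2
print(stations["Mult2"])  # Vk now holds Mem[34+Reg[R2]]; Qj is still Mult1
print(reg_status["F6"])   # Add2: F6 was renamed, so the load does not write it
```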


Figure 0.4 (2nd load just completes)

Instruction      Issue  Execute  Write Result
L.D F6,34(R2)    √      √        √
L.D F2,45(R3)    √      √        √
MUL.D F0,F2,F4   √      √        -
SUB.D F8,F2,F6   √      √        -
DIV.D F10,F0,F6  √      -        -
ADD.D F6,F8,F2   √      -        -

Name    Busy  Op    Vj               Vk               Qj     Qk     A
Load1   No    -     -                -                -      -      -
Load2   No    -     -                -                -      -      -
Add1    Yes   Sub   Mem[45+Reg[R3]]  Mem[34+Reg[R2]]  -      -      -
Add2    Yes   Add   -                Mem[45+Reg[R3]]  Add1   -      -
Add3    No    -     -                -                -      -      -
Mult1   Yes   Mult  Mem[45+Reg[R3]]  Reg[F4]          -      -      -
Mult2   Yes   Div   -                Mem[34+Reg[R2]]  Mult1  -      -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Mult1  -      -      Add2   Add1   Mult2  -      …    -


Figure 3.4

Instruction      Issue  Execute  Write Result
L.D F6,34(R2)    √      √        √
L.D F2,45(R3)    √      √        √
MUL.D F0,F2,F4   √      √        -
SUB.D F8,F2,F6   √      √        √
DIV.D F10,F0,F6  √      -        -
ADD.D F6,F8,F2   √      √        √

Name    Busy  Op    Vj               Vk               Qj     Qk     A
Load1   No    -     -                -                -      -      -
Load2   No    -     -                -                -      -      -
Add1    No    -     -                -                -      -      -
Add2    No    -     -                -                -      -      -
Add3    No    -     -                -                -      -      -
Mult1   Yes   Mult  Mem[45+Reg[R3]]  Reg[F4]          -      -      -
Mult2   Yes   Div   -                Mem[34+Reg[R2]]  Mult1  -      -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Mult1  -      -      -      -      Mult2  -      …    -


Loop-Based Example

Loop: L.D F0,0(R1)

MUL.D F4,F0,F2

S.D F4,0(R1)

DADDIU R1,R1,#−8

BNE R1,R2,Loop
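
Read sequentially, the loop multiplies each doubleword it visits by the scalar in F2 and stores the product back in place. The sketch below mirrors it instruction by instruction; the dictionary-as-memory model, the function name, and the sample addresses are assumptions made for illustration.

```python
def run_loop(mem, r1, r2, f2):
    """Sequential meaning of the loop; mem maps byte addresses to doubles."""
    while True:
        f0 = mem[r1]     # L.D    F0,0(R1)
        f4 = f0 * f2     # MUL.D  F4,F0,F2
        mem[r1] = f4     # S.D    F4,0(R1)
        r1 -= 8          # DADDIU R1,R1,#-8
        if r1 == r2:     # BNE    R1,R2,Loop (fall through when equal)
            break
    return mem


mem = {8: 1.0, 16: 2.0, 24: 3.0}
run_loop(mem, r1=24, r2=0, f2=2.0)
print(mem)  # {8: 2.0, 16: 4.0, 24: 6.0}
```

Each iteration touches a different address and is independent of the others, which is what lets the hardware overlap iterations in the figures that follow.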


Figure 0.5. One active iteration of loop

Instruction     Iteration  Issue  Execute  Write Result
L.D F0,0(R1)    1          √      √        -
MUL.D F4,F0,F2  1          √      -        -
S.D F4,0(R1)    1          √      -        -
L.D F0,0(R1)    2          -      -        -
MUL.D F4,F0,F2  2          -      -        -
S.D F4,0(R1)    2          -      -        -

Name    Busy  Op     Vj  Vk       Qj     Qk     A
Load1   Yes   Load   -   -        -      -      Reg[R1]
Load2   No    -      -   -        -      -      -
Add1    No    -      -   -        -      -      -
Add2    No    -      -   -        -      -      -
Add3    No    -      -   -        -      -      -
Mult1   Yes   Mult   -   Reg[F2]  Load1  -      -
Mult2   No    -      -   -        -      -      -
Store1  Yes   Store  -   -        -      Mult1  Reg[R1]
Store2  No    -      -   -        -      -      -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Load1  -      Mult1  -      -      -      -      …    -


Figure 0.6. One+ active iteration of loop

Instruction     Iteration  Issue  Execute  Write Result
L.D F0,0(R1)    1          √      √        -
MUL.D F4,F0,F2  1          √      -        -
S.D F4,0(R1)    1          √      -        -
L.D F0,0(R1)    2          √      -        -
MUL.D F4,F0,F2  2          -      -        -
S.D F4,0(R1)    2          -      -        -

Name    Busy  Op     Vj  Vk       Qj     Qk     A
Load1   Yes   Load   -   -        -      -      Reg[R1]
Load2   Yes   Load   -   -        -      -      Reg[R1]-8
Add1    No    -      -   -        -      -      -
Add2    No    -      -   -        -      -      -
Add3    No    -      -   -        -      -      -
Mult1   Yes   Mult   -   Reg[F2]  Load1  -      -
Mult2   No    -      -   -        -      -      -
Store1  Yes   Store  -   -        -      Mult1  Reg[R1]
Store2  No    -      -   -        -      -      -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Load2  -      Mult1  -      -      -      -      …    -


Figure 0.7. One++ active iteration of loop

Instruction     Iteration  Issue  Execute  Write Result
L.D F0,0(R1)    1          √      √        -
MUL.D F4,F0,F2  1          √      -        -
S.D F4,0(R1)    1          √      -        -
L.D F0,0(R1)    2          √      √        -
MUL.D F4,F0,F2  2          √      -        -
S.D F4,0(R1)    2          -      -        -

Name    Busy  Op     Vj  Vk       Qj     Qk     A
Load1   Yes   Load   -   -        -      -      Reg[R1]
Load2   Yes   Load   -   -        -      -      Reg[R1]-8
Add1    No    -      -   -        -      -      -
Add2    No    -      -   -        -      -      -
Add3    No    -      -   -        -      -      -
Mult1   Yes   Mult   -   Reg[F2]  Load1  -      -
Mult2   Yes   Mult   -   Reg[F2]  Load2  -      -
Store1  Yes   Store  -   -        -      Mult1  Reg[R1]
Store2  No    -      -   -        -      -      -

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Load2  -      Mult2  -      -      -      -      …    -


Figure 3.6. Two active iterations of loop

Instruction     Iteration  Issue  Execute  Write Result
L.D F0,0(R1)    1          √      √        -
MUL.D F4,F0,F2  1          √      -        -
S.D F4,0(R1)    1          √      -        -
L.D F0,0(R1)    2          √      √        -
MUL.D F4,F0,F2  2          √      -        -
S.D F4,0(R1)    2          √      -        -

Name    Busy  Op     Vj  Vk       Qj     Qk     A
Load1   Yes   Load   -   -        -      -      Reg[R1]
Load2   Yes   Load   -   -        -      -      Reg[R1]-8
Add1    No    -      -   -        -      -      -
Add2    No    -      -   -        -      -      -
Add3    No    -      -   -        -      -      -
Mult1   Yes   Mult   -   Reg[F2]  Load1  -      -
Mult2   Yes   Mult   -   Reg[F2]  Load2  -      -
Store1  Yes   Store  -   -        -      Mult1  Reg[R1]
Store2  Yes   Store  -   -        -      Mult2  Reg[R1]-8

    F0     F2     F4     F6     F8     F10    F12    …    F30
Qi  Load2  -      Mult2  -      -      -      -      …    -
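
The two-iteration state can be reproduced by issuing the loop body twice with renaming. This sketch makes several simplifying assumptions that are not stated on the slides: the branch is predicted taken, the integer instructions are not modeled (the address is written symbolically), iteration i simply takes stations Load i, Mult i, and Store i, and the function name `issue_iterations` is hypothetical.

```python
def issue_iterations(n):
    """Issue n iterations of the loop body; return stations and Qi table."""
    reg_status = {}
    stations = {}
    for i in range(1, n + 1):
        offset = -8 * (i - 1)
        addr = "Reg[R1]" if offset == 0 else f"Reg[R1]{offset}"
        load, mult, store = f"Load{i}", f"Mult{i}", f"Store{i}"
        # L.D F0,0(R1): F0 is renamed to this iteration's load buffer
        stations[load] = {"Op": "Load", "A": addr}
        reg_status["F0"] = load
        # MUL.D F4,F0,F2: F0 comes by tag, F2 is read from the register file
        stations[mult] = {"Op": "Mult", "Vk": "Reg[F2]",
                          "Qj": reg_status["F0"]}
        reg_status["F4"] = mult
        # S.D F4,0(R1): the value to store comes by tag from the multiplier
        stations[store] = {"Op": "Store", "Qk": reg_status["F4"], "A": addr}
    return stations, reg_status


stations, reg_status = issue_iterations(2)
print(stations["Store1"])  # {'Op': 'Store', 'Qk': 'Mult1', 'A': 'Reg[R1]'}
print(stations["Store2"])  # {'Op': 'Store', 'Qk': 'Mult2', 'A': 'Reg[R1]-8'}
print(reg_status)          # {'F0': 'Load2', 'F4': 'Mult2'}
```

Although both iterations name F0 and F4, each store waits on its own multiplier tag, so the second iteration cannot overwrite a value the first one still needs. In effect the hardware unrolls the loop dynamically.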


IBM 360/91

Great ideas: data tagging, register renaming, dynamic detection of memory hazards, and generalized forwarding

These ideas are now broadly used in microprocessors

Was the 360/91 successful commercially?


IBM 360/85 (1968)

First commercial computer with a cache. Compared with the 360/91:

Slower clock cycle time (80 ns versus 60 ns)

Less memory interleaving (4 versus 16)

Slower main memory (1.04 μs versus 0.75 μs)

Cheaper in price

Which machine was faster on applications?

