
EE457 Lab 6 Design of a Pipelined CPU, Lab 6 Part 4


ee457_Lab6_Part4_r3_for_lecture.fm

10/29/06 1 / 32 C Copyright 2006 Gandhi Puvvada

EE457

Lab 6 Design of a Pipelined CPU

Lab 6 Part 4 Questions

10/29/2006

Lecture by Gandhi Puvvada

University of Southern California


Thanks to Ray Madani and Binh Tran of DEN for their technical support

Ideally: Do this Part 4 after lab 6 parts 1, 2, 3

Ideally:

1. Read the question

2. Attempt to solve it by yourself

3. Verify by viewing this lecture

Practically: View the lecture before solving by yourself


1. [Based on question #4.1 of Summer 95 Final] Pipelined Ripple_Carry Adder:

Array addition. The A(i), B(i), and C(i) are all in the register file.

The single-bit opcode is a "1" for ADD and a "0" for NOP.

Instruction      Opcode   rs = Source Reg 1   rt = Source Reg 2   rd = Destination Reg
size of fields   1 bit    3 bits              3 bits              3 bits
add rd, rs, rt   1        rs2 rs1 rs0         rt2 rt1 rt0         rd2 rd1 rd0
nop              0        rs2 rs1 rs0         rt2 rt1 rt0         rd2 rd1 rd0
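The 10-bit instruction word in the table above can be packed and unpacked with a few shifts and masks. The following is only a sketch: the helper names `encode` and `decode` are hypothetical, and it assumes the opcode is the most significant bit, followed by rs, rt, and rd, in the order the fields are listed.

```python
# Hypothetical helpers for the 10-bit instruction word described above.
# Assumed bit layout (MSB first): opcode | rs[2:0] | rt[2:0] | rd[2:0].

ADD, NOP = 1, 0


def encode(opcode, rs, rt, rd):
    """Pack the four fields into a 10-bit instruction word."""
    if opcode not in (ADD, NOP):
        raise ValueError("opcode is a single bit")
    for reg in (rs, rt, rd):
        if not 0 <= reg <= 7:
            raise ValueError("register numbers are 3 bits wide")
    return (opcode << 9) | (rs << 6) | (rt << 3) | rd


def decode(word):
    """Unpack a 10-bit word into (opcode, rs, rt, rd)."""
    return (word >> 9) & 1, (word >> 6) & 7, (word >> 3) & 7, word & 7


# add $3, $1, $2  (rd = 3, rs = 1, rt = 2)
word = encode(ADD, rs=1, rt=2, rd=3)
print(f"{word:010b}")
```

Decoding the same word returns the original four fields, which is a quick check that the field widths add up to the 10 bits of the IF/ID register.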


Do NOT be misled by Miss Bruin’s design below!

[Figure: Miss Bruin's design of the pipelined ripple-carry adder. Stages: IF, ID, EX1, EX2, EX3, EX4, WB. The IF/ID register (size = 10 bits) holds opcode, rs2 rs1 rs0, rt2 rt1 rt0, rd2 rd1 rd0 (rs = Source Reg 1, rt = Source Reg 2, rd = Destination Reg). The register file, clocked by SYS_CLK, has Read_Address_1 (R1A2-R1A0), Read_Address_2 (R2A2-R2A0), Write_Address (WA2-WA0), Read_Data_1 (R1D3-R1D0), Read_Data_2 (R2D3-R2D0), Write_Data (WD3-WD0), and WRITE. Four full adders (A, B, Ci, Co, S) sit in EX1 through EX4. The stage registers ID/EX1, EX1/EX2, EX2/EX3, EX3/EX4, and EX4/WB carry size annotations; the sizes written in Miss Bruin's design are 12, 12, 11, 10, and 8 bits.]


[Figure: IF and ID stages only. The IF/ID register (size = 10 bits) holds opcode, rs2 rs1 rs0, rt2 rt1 rt0, rd2 rd1 rd0 (rs = Source Reg 1, rt = Source Reg 2, rd = Destination Reg). The register file, clocked by SYS_CLK, has Read_Address_1 (R1A2-R1A0), Read_Address_2 (R2A2-R2A0), Write_Address (WA2-WA0), Read_Data_1 (R1D3-R1D0), Read_Data_2 (R2D3-R2D0), Write_Data (WD3-WD0), and WRITE. OpCode: ADD = 1, NOP = 0.]


[Figure: Miss Bruin's design, IF and ID stages. The IF/ID register (size = 10 bits: opcode, rs2 rs1 rs0, rt2 rt1 rt0, rd2 rd1 rd0) and the register file with Read_Address_1 (R1A2-R1A0), Read_Address_2 (R2A2-R2A0), Write_Address (WA2-WA0), Read_Data_1 (R1D3-R1D0), Read_Data_2 (R2D3-R2D0), Write_Data (WD3-WD0), WRITE, and CLK (SYS_CLK).]


[Figure: the full pipeline (IF, ID, EX1, EX2, EX3, EX4, WB). The IF/ID register (size = 10 bits) holds opcode, rs2 rs1 rs0, rt2 rt1 rt0, rd2 rd1 rd0 (rs = Source Reg 1, rt = Source Reg 2, rd = Destination Reg). The register file, clocked by SYS_CLK, has Read_Address_1, Read_Address_2, Write_Address, Read_Data_1 (R1D3-R1D0), Read_Data_2 (R2D3-R2D0), Write_Data (WD3-WD0), and WRITE. Four full adders (A, B, Ci, Co, S) sit in EX1 through EX4. The stage registers ID/EX1, EX1/EX2, EX2/EX3, EX3/EX4, and EX4/WB are each marked "Size = " with the value left blank.]
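To see what each EX stage contributes, the 4-bit addition can be modelled one bit per stage. This is a minimal sketch, assuming EX1 handles bit 0 with a carry-in of 0 and each later stage handles the next more significant bit; the function names are hypothetical and the model says nothing about how wide each stage register must be.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def staged_add(a, b, width=4):
    """Add two `width`-bit values one bit per stage.

    Stage 0 stands for EX1, stage 1 for EX2, and so on (an assumption).
    Returns (sum truncated to `width` bits, final carry out).
    """
    carry = 0
    total = 0
    for stage in range(width):
        a_bit = (a >> stage) & 1
        b_bit = (b >> stage) & 1
        s, carry = full_adder(a_bit, b_bit, carry)
        total |= s << stage
    return total, carry


print(staged_add(0b0101, 0b0110))
```

Each loop iteration needs only one bit of each operand, the carry from the previous iteration, and the sum bits already produced, which is the observation the stage-register sizing question builds on.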


2. [Based on question 5 of Summer 2003 Midterm and question 8 of Spring 1994 Final] Pipeline Design (Stalling / Flushing / Forwarding):

2.1 Bubbles are produced ________________________________________________________ (in stalling only/in flushing only/in stalling as well as in flushing/in neither stalling nor flushing).

2.2 In the early-branch design of the pipeline CPU (current lab6 based on 3rd ed.), flushing and stalling ___________________ (never occur in the same clock cycle/may sometimes occur in the same clock cycle/always occur in the same clock cycle).

In a late-branch design (based on the first edition), if the branch below is successful, do flushing and stalling both occur together, or would one prevent the other? Explain.

beq $1, $2, TARGET
lw $4, 40($5)
or $8, $4, $6

2.3 There are 9 (1+4+2+2) control lines generated by the control unit. Eight of these (8 out of 9) are going from the ID stage to the EX stage. Do you need to convert all the 8 signals to zero when you stall an instruction in the ID stage?


2.4 To ___________ (stall/flush) an instruction in ID stage, you inhibit (prevent) updating of the following register(s). (circle as many of the following as you wish) PC , IF/ID , ID/EX , EX/MEM , MEM/WB

You never inhibit (prevent) updating of a stage register if you are currently ___________________ (flushing / stalling / cannot fill this blank with either of the previous two choices).

2.5 Late Branch design of the first edition with one HDU in ID stage and one FU in EX stage, and an internally forwarding register file.

All three streams use the same 3 instructions in different order.

Stream #1           Stream #2           Stream #3
add $3, $3, $1;     lw  $3, 40($5);     lw  $3, 40($5);
or  $6, $5, $4;     or  $6, $5, $4;     add $3, $3, $1;
lw  $3, 40($5);     add $3, $3, $1;     or  $6, $5, $4;

For each stream above, the following occur(s): (circle all correct choices)
(i) hazard detection and stalling by HDU
(ii) forwarding by FU
(iii) internal forwarding in the reg. file
(iv) none of these

Now reconsider the above three streams in the context of the early-branch design based on the current lab 6. Explain any differences or striking resemblances to your three answers above.


2.6 In this question we consider the early-branch design of our current lab 6 with two HDUs (HDU and HDU_Br) and two FUs (FU and FU_Br). Of course the register file is an internally forwarding register file. Identify the dependencies in the following instruction streams and how they should be resolved:

Stream #1               Stream #2
add $2, $2, $2;         add $2, $3, $4;
sub $1, $2, $3;         sub $5, $6, $7;
beq $2, $0, loop1;      beq $5, $2, loop1;

Stream #3               Stream #4
lw  $4, 40($3);         lw  $4, 40($3);
beq $4, $0, loop1;      sub $5, $6, $7;
                        beq $4, $0, loop1;

For each stream, the following occur(s): (circle all correct choices)
(i) HDU_Br initiated stalling
(ii) HDU initiated stalling
(iii) forwarding by FU_Br
(iv) forwarding by FU
(v) internal forwarding in the reg. file
(vi) none of these

Summary: In the lab #6 design for the early-branch,

beq dependent on an R-Type

beq dependent on lw

No forwarding at the end of the clock.


2.7 Dependent instruction after a lw instruction.

HDU => STALL

To avoid the HDU, a simple-minded compiler => NOP

No gain, no loss: TRUE or FALSE?

beq in early branch

To avoid flushing hardware, can a compiler put a NOP after every beq? Feasible?

Performance? No gain, no loss, or any gain or loss?


2.8 specific point of tapping of the branch control signal in the ID stage for (a) ANDing with the equality inference and (b) for HDU_Br to produce STALL_BEQ.

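[Figure: control path in the ID stage. Labels: opcode, Control, PC, the EX, MEM, and WB control groups, a mux with inputs labelled 0 and 1 and a constant 0, the Branch signal, the equality comparator (=), HDU_Br producing STALL_BEQ, the hazard detection unit producing STALL_LW, the combined STALL signal, and three candidate tap points A, B, and C for the branch control signal.]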


Mr. Bruin claims that he discovered a problem in this design. He argues that the branch control signal for the AND gate should be taken after the flush mux (Point C) in the design to avoid erroneous branching.

lw $4, 40($3) ;
beq $4, $0, loop1 ;

He further offers a solution by moving the tapping of branch control signal from point B to point C instead. Evaluate the proposed solution by answering the following:

It is _______________________________ (a must / a feasible change but does not make any difference / a feasible change that improves the design / a sin) to move the tapping of branch control signal for the AND gate from point B to point C.

It is _______________________________ (a must / a feasible change but does not make any difference / a feasible change that improves the design / a sin) to move the tapping of branch control signal for the HDU_Br from point B to point C.


Another person suggests .... ..... identify BEQ instruction by inspection of a single bit among the six-bit OPCODE field. ..... get branch control signal from point A in the figure.

Is this a good suggestion or a bad one?

Notice that in case the first BEQ is taken, the second BEQ should be flushed.

beq $0, $1, loop1 ;
beq $4, $2, loop2 ;


2.9 Forwarding muxes in the EX stage:

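[Figure: forwarding muxes on the rs leg in the EX stage, shown for two designs, each with mux inputs labelled 0 and 1.
Original lab design: select signals FW_RS_WB and FW_RS_MEM; data inputs, in the order drawn: original read data, forwarded help from WB stage, forwarded help from MEM stage.
Modified design: select signals FW_RS_MEM_new and FW_RS_WB_new; data inputs, in the order drawn: original read data, forwarded help from MEM stage, forwarded help from WB stage.]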


add $10, $11, $12 ;
add $3 , $3 , $3 ;
or $6 , $3 , $4 ;

In the original design, FW_RS_WB= (0/1/X), FW_RS_MEM= (0/1/X)
In the modified design, FW_RS_WB= (0/1/X), FW_RS_MEM= (0/1/X)

add $3 , $3 , $3 ;
add $10, $11, $12 ;
or $6 , $3 , $4 ;

In the original design, FW_RS_WB= (0/1/X), FW_RS_MEM= (0/1/X)
In the modified design, FW_RS_WB= (0/1/X), FW_RS_MEM= (0/1/X)

add $3 , $3 , $3 ; <====================
add $3 , $5 , $2 ; <====================
or $6 , $3 , $4 ;

In the original design, FW_RS_WB= (0/1/X), FW_RS_MEM= (0/1/X)
In the modified design, FW_RS_WB= (0/1/X), FW_RS_MEM= (0/1/X)

From the observations made in the above instruction sequences, can we generate the 2 forwarding control signals independent of each other (a) in the original design and (b) in the modified design?
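One way to explore the question is to model each design as two cascaded 2-to-1 muxes and try different select values. The sketch below is built on an assumption about the figure, namely that in the original design the WB mux comes first and the MEM mux sits closest to the ALU, and that the modified design swaps that order. The function names and the string placeholders for data values are hypothetical.

```python
def mux2(sel, in0, in1):
    """2-to-1 mux: returns in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


def original_design(read_data, wb_data, mem_data, fw_rs_wb, fw_rs_mem):
    # Assumed order: WB mux first, MEM mux last (closest to the ALU).
    return mux2(fw_rs_mem, mux2(fw_rs_wb, read_data, wb_data), mem_data)


def modified_design(read_data, wb_data, mem_data, fw_rs_mem_new, fw_rs_wb_new):
    # Assumed order: MEM mux first, WB mux last (closest to the ALU).
    return mux2(fw_rs_wb_new, mux2(fw_rs_mem_new, read_data, mem_data), wb_data)


# Third sequence: both earlier adds write $3 while `or` is in EX.
stale, older, newer = "regfile $3", "add in WB", "add in MEM"

print(original_design(stale, older, newer, fw_rs_wb=1, fw_rs_mem=1))
print(modified_design(stale, older, newer, fw_rs_mem_new=1, fw_rs_wb_new=1))
print(modified_design(stale, older, newer, fw_rs_mem_new=1, fw_rs_wb_new=0))
```

Under these assumptions, whichever mux sits last wins when both selects are 1, so the last mux's partner signal is where any dependence between the two control signals would show up.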


3. Modified Pipeline Design (7-stage pipeline):

[Figure: 7-stage pipelined version of the late-branch design of the 1st edition. Stages: IF1, IF2, ID, EX, MEM1, MEM2, WB. Labelled blocks: PC, Instr. TLB, Instr. cache, control, Reg, HDU, FU, Data TLB, Data cache. Labelled signals: Zero, BRANCH, BR.]


[Figure: 7-stage pipelined version of the early-branch design of the 3rd ed. and our lab 6. Stages: IF1, IF2, ID, EX, MEM1, MEM2, WB. Labelled blocks: PC, Instr. TLB, Instr. cache, control, Reg, HDU, HDU_Br, FU, FU_Br, Data TLB, Data cache. Labelled signals: Zero, BRANCH, BR.]


All 4 pipelines:

[Figure: 5-stage pipeline of the late-branch design of the 1st edition. Stages: IF, ID, EX, MEM, WB. Labelled blocks: PC, Instr., control, Reg, HDU, FU, Data. Labelled signals: Zero, BRANCH, BR.]

[Figure: 5-stage pipeline of the early-branch design of the 3rd ed. and our lab 6. Stages: IF, ID, EX, MEM, WB. Labelled blocks: PC, Instr., control, Reg, HDU, HDU_Br, FU, FU_Br, Data. Labelled signals: Zero, BRANCH, BR.]

[Figure: 7-stage pipelined version of the late-branch design of the 1st edition. Stages: IF1, IF2, ID, EX, MEM1, MEM2, WB. Labelled blocks: PC, Instr. TLB, Instr. cache, control, Reg, HDU, FU, Data TLB, Data cache. Labelled signals: Zero, BRANCH, BR.]

[Figure: 7-stage pipelined version of the early-branch design of the 3rd ed. and our lab 6. Stages: IF1, IF2, ID, EX, MEM1, MEM2, WB. Labelled blocks: PC, Instr. TLB, Instr. cache, control, Reg, HDU, HDU_Br, FU, FU_Br, Data TLB, Data cache. Labelled signals: Zero, BRANCH, BR.]


Dependency of a R-type instruction on a load word instruction, stalling by HDU to resolve the dependency problem:

Each design item below is answered for four designs: 5-stage late-branch, 5-stage early-branch, 7-stage late-branch, and 7-stage early-branch.

i     lw  $1, 60($2)
i+1   add $4, $1, $6

Any bubbles? How many? Where are they inserted? Complete the Time-Space diagrams. This example is completed by us.

In 5-stage late-branch:   Bubbles = 1 (0/1/2/3)
In 5-stage early-branch:  Bubbles = 1 (0/1/2/3). Time-space diagram: same as 5-stage late-branch.
In 7-stage late-branch:   Bubbles = 2 (0/1/2/3)
In 7-stage early-branch:  Bubbles = 2 (0/1/2/3). Time-space diagram: same as 7-stage late-branch.

[Time-space diagrams: lw followed by add, one per design.]

i     lw  $1, 60($2)
i+1   sub $10, $11, $12
i+2   add $4, $1, $6

Any bubbles? How many? Where are they inserted?

In 5-stage late-branch:   Bubbles = ___________ (0/1/2/3)
In 5-stage early-branch:  Bubbles = ___________ (0/1/2/3)
In 7-stage late-branch:   Bubbles = ___________ (0/1/2/3)
In 7-stage early-branch:  Bubbles = ___________ (0/1/2/3)

[Time-space diagrams: lw, sub, add, one per design.]

How many comparators does the HDU (not HDU_Br) have? Where do the destination register addr. inputs to the comparators come from?

For each of the four designs:
# of comparators = _____
Destination reg. addr. input(s) come(s) from: _____

Delay slots for lw: To avoid the use of HDU, how many delay slots should we declare for lw?

For each of the four designs: # of Delay slots = ______
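The bubble counts in the completed example (1 for the 5-stage designs, 2 for the 7-stage designs) can be reproduced with a small timing model. This is a sketch under stated assumptions: the loaded word is available at the end of MEM in the 5-stage pipeline and at the end of MEM2 in the 7-stage pipeline, and it is forwarded to a consumer that needs it at the start of EX. The function name and parameters are hypothetical.

```python
FIVE_STAGE = ["IF", "ID", "EX", "MEM", "WB"]
SEVEN_STAGE = ["IF1", "IF2", "ID", "EX", "MEM1", "MEM2", "WB"]


def load_use_bubbles(stages, data_ready_stage, distance=1, use_stage="EX"):
    """Bubbles needed between a lw and a dependent instruction.

    distance: how many instructions after the lw the consumer sits
              (1 means immediately after).
    The consumer may enter `use_stage` no earlier than the clock after
    the lw completes `data_ready_stage` (assumed forwarding model).
    """
    ready = stages.index(data_ready_stage)
    use = stages.index(use_stage)
    return max(0, ready + 1 - use - distance)


print(load_use_bubbles(FIVE_STAGE, "MEM"))    # lw followed directly by add
print(load_use_bubbles(SEVEN_STAGE, "MEM2"))  # same pair, 7-stage pipeline
```

The `distance` parameter lets the same model be tried on sequences with independent instructions between the lw and its consumer.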


Dependency of a R-type instruction on another R-type instruction; Forwarding:

The design item below is answered for four designs: 5-stage late-branch, 5-stage early-branch, 7-stage late-branch, and 7-stage early-branch.

i     add $5, $7, $9
i+1   xor $1, $2, $3
i+2   or  $10, $11, $12
i+3   sub $3, $5, $1

Explain forwarding to instruction (i+3).

In 5-stage late-branch:
sub receives latest $1 from xor when sub is in ______ stage and xor is in ______ stage under the control of _________________ (FU/internal forwarding in register file).
sub receives latest $5 from ________________________________ due to ______________________________ (FU/internal forwarding in register file).

In 5-stage early-branch:
sub receives latest $1 from xor first time when sub is in ______ stage and xor is in ______ stage under the control of ________ ________ (FU_Br/FU/internal forwarding in register file). It receives the same value again second time when sub is in ______ stage and xor is in ______ stage under the control of ________ (FU_Br/FU).
sub receives latest $5 from ________________________________ due to ______________________________ (FU_Br/FU/internal forwarding in register file).

In 7-stage late-branch:
sub receives latest $1 from xor when sub is in ______ stage and xor is in ______ stage under the control of _________________ (FU/internal forwarding in register file).
sub receives latest $5 from add when sub is in ______ stage and add is in ______ stage under the control of _________________ (FU/internal forwarding in register file).

In 7-stage early-branch:
sub receives latest $1 from xor first time when sub is in ______ stage and xor is in ______ stage under the control of ________ (FU_Br/FU). It receives the same value again second time when sub is in ______ stage and xor is in ______ stage under the control of ________ (FU_Br/FU).
sub receives latest $5 from add when sub is in ______ stage and add is in ______ stage under the control of _________________ (FU_Br/FU/internal forwarding in register file).


FU_Br, FU details:

Each design item below is answered for four designs: 5-stage late-branch, 5-stage early-branch, 7-stage late-branch, and 7-stage early-branch.

How many comparators does the forwarding unit in ID stage (FU_Br, not FU) have? How big are the forwarding muxes (n-bit wide m-to-1 mux)? How many? Where do the data inputs to the muxes come from?

In 5-stage late-branch: Not applicable

In 5-stage early-branch:
# of comparators in FU_Br = __________________
Forwarding mux(es) in the A-leg of equality checker (size and number (which is same for the B-leg)) = ______________________________
Data inputs for this/these come from ______________________________

In 7-stage late-branch: Not applicable

In 7-stage early-branch:
# of comparators in FU_Br = __________________
Forwarding mux(es) in the A-leg of equality checker (size and number (which is same for the B-leg)) = ______________________________
Data inputs for this/these come from ______________________________

How many comparators does the forwarding unit in EX stage (FU, not FU_Br) have? How big are the forwarding muxes (n-bit wide m-to-1 mux)? How many? Where do the data inputs to the muxes come from?

In 5-stage late-branch:
# of comparators in FU = ___________________
Forwarding mux(es) in the A-leg of ALU (size and number (which is same for the B-leg)) = ______________________________
Data inputs for this/these come from ______________________________

In 5-stage early-branch: Same as 5-stage late-branch. TRUE / FALSE

In 7-stage late-branch:
# of comparators in FU = ___________________
Forwarding mux(es) in the A-leg of ALU (size and number (which is same for the B-leg)) = ______________________________
Data inputs for this/these come from ______________________________

In 7-stage early-branch: Same as 7-stage late-branch. TRUE / FALSE


Priority in FU and FU_Br:

Each design item below is answered for four designs: 5-stage late-branch, 5-stage early-branch, 7-stage late-branch, and 7-stage early-branch.

Priority in FU (FU, not FU_Br): Forwarding to a dependent instruction standing in EX stage. Opt to forward from the nearer than the farther.

In 5-stage late-branch:
The FU prefers to accept forwarding help from the __________ (MEM/WB) over ________________ (MEM/WB).

In 5-stage early-branch: Same as in the 5-stage late-branch. TRUE / FALSE

In 7-stage late-branch:
The FU prefers to accept forwarding help from the __________ (MEM1/MEM2/WB) over ________________ (MEM1/MEM2/WB) as well as ______________ (MEM1/MEM2/WB). Further ______________________________.

In 7-stage early-branch: Same as in the 7-stage late-branch. TRUE / FALSE

Priority in FU_Br (FU_Br, not FU): Forwarding to a BEQ instruction standing in ID stage. Opt to forward from the nearer than the farther.

In 5-stage late-branch: Not applicable

In 5-stage early-branch:
No priority needs to be implemented in FU_Br. TRUE / FALSE
Explain: ______________________________

In 7-stage late-branch: Not applicable

In 7-stage early-branch:
The FU_Br prefers to accept forwarding help from a ______________ (R-Type/lw) instruction in the _______________ (MEM1/MEM2/WB) over a ______________ (R-Type/lw) instruction in the ______________ (MEM1/MEM2/WB).


Dependency of a BEQ instruction on a R-type instruction; Stalling through HDU_Br, Forwarding through FU_Br/FU:

Each design item below is answered for four designs: 5-stage late-branch, 5-stage early-branch, 7-stage late-branch, and 7-stage early-branch.

i     beq $2, $4, Target

How many instructions following a successful branch are flushed?

For each of the four designs:
# of instructions that need to be flushed = ___________________

i     add $1, $2, $3
i+1   beq $1, $0, loop

How many clock cycles does the BEQ have to be stalled?

For each of the four designs:
# of clock cycles beq needs to be stalled = ___________________
beq receives latest $1 from add when beq is in _______ stage and add is in _________ stage under the control of _________________ (late-branch designs: FU/internal forwarding in register file; early-branch designs: FU_Br/FU/internal forwarding in register file).

i     add $1, $2, $3
i+1   xor $11, $12, $13
i+2   beq $1, $0, loop

How many clock cycles does the BEQ have to be stalled?

For each of the four designs:
# of clock cycles beq needs to be stalled = ___________________
beq receives latest $1 from add when beq is in _______ stage and add is in _________ stage under the control of _________________ (late-branch designs: FU/internal forwarding in register file; early-branch designs: FU_Br/FU/internal forwarding in register file).


Dependency of a BEQ instruction on a lw instruction; Stalling through HDU_Br, Forwarding through FU_Br/FU:

Each design item below is answered for four designs: 5-stage late-branch, 5-stage early-branch, 7-stage late-branch, and 7-stage early-branch.

i     lw  $1, 40($2)
i+1   beq $1, $0, loop

How many clock cycles does the BEQ have to be stalled?

For each of the four designs:
# of clock cycles beq needs to be stalled = ___________________
beq receives latest $1 from lw when beq is in _______ stage and lw is in _________ stage under the control of _________________ (late-branch designs: FU/internal forwarding in register file; early-branch designs: FU_Br/FU/internal forwarding in register file).

i     lw  $1, 40($2)
i+1   add $6, $5, $4
i+2   beq $1, $0, loop

How many clock cycles does the BEQ have to be stalled?

For each of the four designs:
# of clock cycles beq needs to be stalled = ___________________
beq receives latest $1 from lw when beq is in _______ stage and lw is in _________ stage under the control of _________________ (late-branch designs: FU/internal forwarding in register file; early-branch designs: FU_Br/FU/internal forwarding in register file).

i     lw  $1, 40($2)
i+1   add $6, $5, $4
i+2   or  $16, $15, $14
i+3   beq $1, $0, loop

How many clock cycles does the BEQ have to be stalled?

For each of the four designs:
# of clock cycles beq needs to be stalled = ___________________
beq receives latest $1 from lw when beq is in _______ stage and lw is in _________ stage under the control of _________________ (late-branch designs: FU/internal forwarding in register file; early-branch designs: FU_Br/FU/internal forwarding in register file).


Miscellaneous:

Table columns: Design item | In 5-stage late-branch | In 5-stage early-branch | In 7-stage late-branch | In 7-stage early-branch

Design item: How many comparators does the HDU_Br have? Destination register addr.(s) come(s) to HDU_Br from .....

# of comparators in HDU_Br = _________
Dest. Reg. addr.(s) come(s) from ________________________________________________________

# of comparators in HDU_Br = _________
Dest. Reg. addr.(s) come(s) from ________________________________________________________

Design item: Though it is not desirable to "delay" the BEQ execution, how late in the pipeline can you execute the BEQ instr.?

5-stage: The latest stage for executing BEQ is ________ (EX/MEM/WB).

7-stage: The latest stage for executing BEQ is ________ (EX/MEM1/MEM2/WB).

Design item: The earliest a BEQ can be executed from is:

5-stage: The earliest stage for executing BEQ is ________ (IF/ID/EX).

7-stage: The earliest stage for executing BEQ is ________ (IF1/IF2/ID/EX).

The six remaining cells of the table read: Not applicable


3.2 Flushing of the two instructions in the IF1 and IF2 stages in the case of the 7-stage pipeline:

Note: This part of the design is common to both branch implementations (late or early).

[Figure: front end of the 7-stage pipeline, drawn three times. Labels in each panel: PC control, Instr. TLB, Instr. cache, stages IF1, IF2, ID, and BR1. Two of the panels also show RESET inputs and carry the caption "Assistant #2's design of flush".]


4. Modified Pipeline Design (4-stage pipeline)

EX and MEM =====> EXMEM

No forwarding from EXMEM

If BEQ is dependent on an instruction in EXMEM, it is stalled until the dependency is resolved. So there is no forwarding into the ID stage and no FU_Br.

The HDU is not needed in this design and is removed.

The input connections to the FU and HDU_Br are reduced.
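The stall rule stated above for BEQ can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lab's reference design: the function name `stall_beq` and its parameters are hypothetical, the signal names `RegWrite_EX` and `WriteRegister_EX` are borrowed from the datapath figure, and the special case of register $0 is left out.

```python
def stall_beq(is_beq: bool, rs: int, rt: int,
              reg_write_ex: bool, write_register_ex: int) -> bool:
    """Sketch of an HDU_Br decision for the 4-stage pipeline.

    is_beq            -- the instruction in ID is a beq (hypothetical input)
    rs, rt            -- source register numbers of the beq in ID
    reg_write_ex      -- RegWrite_EX: the instruction in EXMEM writes a register
    write_register_ex -- WriteRegister_EX: its destination register number
    Assumption: $0 is not treated specially here.
    """
    if not (is_beq and reg_write_ex):
        # No beq in ID, or the EXMEM instruction writes nothing: no stall.
        return False
    # Stall when either beq source equals the EXMEM destination register.
    return write_register_ex in (rs, rt)


# Stream #3 from question 4.2.1: add $4, $3, $2 ; beq $10, $4, loop1
print(stall_beq(True, 10, 4, True, 4))
```

The call above models the cycle in which the add is in EXMEM and the beq is in ID; the comparison against both `rs` and `rt` is the part of the logic that question 4.2.4 asks you to count and size.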

4.1 Complete the forwarding paths to FU.


[Figure: datapath of the 4-stage pipeline. Stages: IF-Stage, ID-Stage, EXMEM-Stage, WB-Stage; pipeline registers: IF/ID, ID/EXMEM, EXMEM/WB. Labeled blocks: PC, instruction memory, Registers, Control, sign extend, Shift Left 2, ALU, ALU control, data memory, Forwarding Unit, HDU_Br. Labeled signals: ALUSrc, ALUOp, RegDst, RegWrite_EX, WriteRegister_EX, MemRead, MemWrite, MemtoReg, RegWrite, IF.Flush, Branch, Zero, STALL, STALL_BEQ, forwarding_mux_control, MEM_data, REG_data. The forwarding paths to the FU are left for the student to complete.]


4.2 Compare and contrast the 5-stage pipeline design of lab #6 with this 4-stage pipeline design.

4.2.1 We do not need the regular HDU for LW dependency in the 4-stage pipeline because .....

However, we still need HDU_Br to stall the BEQ instructions.

For each stream below, ________ clock cycle(s) are needed for stalling.

Stream #1:
lw $4, $3(40) ;
add $10, $4, $6 ;

Stream #2:
lw $4, $3(40) ;
beq $10, $4, loop1 ;

Stream #3:
add $4, $3, $2 ;
beq $10, $4, loop1 ;


4.2.2 In the 5-stage, In the 4-stage

the PCWrite is under the control of ____________________________ _____________________________________ (HDU/HDU_Br/FU/FU_Br/Successful Branch/Successful Jump/Combination of these/none of these/none, no need to control, activated all the time).

4.2.3 In the 5-stage, In the 4-stage

number and size of comparators in the forwarding unit (FU)

The FU in the case of the 4-stage pipeline produces ____________________ (one/two) outputs, of size __________ (1-bit / each 1-bit / 2-bit / each 2-bit) to control the forwarding muxes.

4.2.4 In the 5-stage, In the 4-stage

The HDU_Br (Hazard Detection Unit assisting beq)

number and size of comparators in the HDU_Br

4.2.5 _______ (Like / Unlike) in the case of the 5-stage pipeline, we ____________ (need / don't need) prioritization in the 4-stage pipeline in providing forwarding help to the instr #3 in the sequence of adds below.

instr #1 add $2, $2, $2
instr #2 add $2, $2, $2
instr #3 add $2, $2, $2
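For the 5-stage side of this comparison, the usual forwarding rule can be sketched as below. It is a minimal illustration, not the lab's FU: the function name, the parameter names, and the 2-bit select encodings (10 = forward from EX/MEM, 01 = forward from MEM/WB, 00 = use the register file value) follow a common textbook convention and are assumptions here, and the $0 special case is omitted.

```python
def forward_select_5stage(src: int,
                          ex_mem_regwrite: bool, ex_mem_rd: int,
                          mem_wb_regwrite: bool, mem_wb_rd: int) -> int:
    """Return a 2-bit forwarding mux select for one ALU source register.

    Assumed encodings: 0b10 = EX/MEM, 0b01 = MEM/WB, 0b00 = no forwarding.
    The EX/MEM check comes first, so the younger producer wins when both
    senior instructions write the same register.
    """
    if ex_mem_regwrite and ex_mem_rd == src:
        return 0b10
    if mem_wb_regwrite and mem_wb_rd == src:
        return 0b01
    return 0b00


# instr #3 (add $2, $2, $2) is in EX; instr #2 is in MEM and instr #1 is in WB.
# Both senior instructions write $2, so the order of the checks matters.
print(format(forward_select_5stage(2, True, 2, True, 2), "02b"))
```

Whether the 4-stage pipeline can ever present two matching senior instructions at the same time is the point of the question, so the sketch deliberately stops at the 5-stage case.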


4.2.6 If the clock frequency is the same for the two pipelines and we ignore the control (branch) hazard, the performance of the 4-stage pipeline is ________________________________________ (better than / equal to / worse than / sometimes better than and sometimes worse than) the 5-stage pipeline performance.

4.2.7 In the 4-stage pipeline, since the ALU and the Memory are both in one stage, they can work simultaneously and this merging of ALU with Memory in a single stage does not call for extending the clock period (even if we use the original ALU and Data memory which are NOT fast). TRUE / FALSE
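One way to reason about a statement like 4.2.7 is to compute the minimum clock period as the delay of the slowest stage. The sketch below does that arithmetic with made-up delays in picoseconds; every number is hypothetical. It also assumes that a lw uses the ALU to compute its address and then accesses data memory within the same EXMEM cycle, so the two delays add rather than overlap. If your datapath lets them overlap, the merged figure would differ.

```python
def clock_period(stage_delays: dict) -> int:
    """Minimum clock period = delay of the slowest pipeline stage."""
    return max(stage_delays.values())


# Hypothetical per-stage delays in picoseconds (illustrative only).
five_stage = {"IF": 200, "ID": 150, "EX": 200, "MEM": 200, "WB": 100}

# Assumption: in EXMEM the ALU result feeds the data memory address,
# so the ALU and memory delays are in series within one stage.
four_stage = {
    "IF": five_stage["IF"],
    "ID": five_stage["ID"],
    "EXMEM": five_stage["EX"] + five_stage["MEM"],
    "WB": five_stage["WB"],
}

print(clock_period(five_stage), clock_period(four_stage))
```

With these illustrative numbers the merged stage becomes the slowest one, which is the effect the TRUE / FALSE question asks you to think about.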
