Chap.6: Enhancing Performance with Pipelining Jen-Chang Liu, Spring 2006 Parts of the slides are duplicated from inst.eecs.berkeley.edu/~cs61c
Page 1: Chap.6: Enhancing Performance with Pipelining Jen-Chang Liu, Spring 2006 Parts of the slides are duplicated from inst.eecs.berkeley.edu/~cs61c.

Chap.6: Enhancing Performance with Pipelining

Jen-Chang Liu, Spring 2006

Parts of the slides are duplicated from inst.eecs.berkeley.edu/~cs61c

Page 2:

Review: Datapath (1/3)

The datapath is the hardware that performs the operations necessary to execute programs.

Control instructs the datapath on what to do next.

The datapath needs:
  access to storage (general-purpose registers and memory)
  computational ability (ALU)
  helper hardware (local registers and PC)

Page 3:

Review: Datapath (2/3)

Five stages of the datapath (executing an instruction):
1. Instruction Fetch (Increment PC)
2. Instruction Decode (Read Registers)
3. ALU (Computation)
4. Memory Access
5. Write to Registers

ALL instructions must go through ALL five stages.

Page 4:

Review: Datapath (3/3)

[Figure: five-stage datapath with PC (+4), instruction memory, registers (rs, rt, rd), imm, ALU, and data memory. Stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back]

Page 5:

Review

Single-cycle datapath

Multi-cycle datapath

[Figure: both datapaths shown as the same sequence of steps: instruction fetch, data/register read, instruction execution, memory/register read/write, register write]

Page 6:

Outline
  Overview
  A pipelined datapath
  Pipelined control
  Data hazards
    Forwarding
    Stalls
  Branch hazards
  Superscalar and dynamic pipelining

Page 7:

What's pipelining?

[Figure: laundry analogy. Four loads (A, B, C, D) in task order against a time axis from 6 PM to 2 AM; each load goes through wash, dry, fold, and put away. One panel is non-pipelined, the other pipelined.]

Use different resources at the same time

Page 8:

Pipelining

Definition: an implementation technique in which multiple instructions are overlapped in execution

How to achieve pipelining?
  An instruction is divided into steps (stages)
  We have separate resources for each stage

Page 9:

[Figure: three loads, lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), in program execution order against time in ns. Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg. Single-cycle implementation: a new instruction starts every 8 ns. Pipelined implementation: a new instruction starts every 2 ns.]
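The timing in the figure can be checked with a little arithmetic. The sketch below is only an illustration: the function names are made up here, and the 8 ns cycle, 2 ns stage time, and five stages are read off the figure labels.

```python
def single_cycle_time_ns(num_instructions, cycle_ns=8):
    """Each instruction occupies one long clock cycle."""
    return num_instructions * cycle_ns


def pipelined_time_ns(num_instructions, stage_ns=2, num_stages=5):
    """The first instruction takes num_stages stage times to drain;
    every later instruction finishes one stage time after the previous one."""
    if num_instructions == 0:
        return 0
    return (num_stages + num_instructions - 1) * stage_ns


for n in (3, 1000):
    single = single_cycle_time_ns(n)
    piped = pipelined_time_ns(n)
    print(n, single, piped, round(single / piped, 2))
```

For the three lw instructions this gives 24 ns against 14 ns. For long instruction streams the ratio approaches 8 / 2 = 4 rather than 5, because the stages in this example are not perfectly balanced.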

Page 10:

What does pipelining improve?

Pipelining improves performance by
  increasing the instruction throughput (more instructions complete per unit time),
  not by decreasing the execution time of an individual instruction (the time for a single instruction is unchanged)

Under ideal conditions:

$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{non-pipelined}}}{\text{Number of pipe stages}}$$

Page 11:

Design instruction set for pipelining

MIPS has been designed for pipelining
1. Instructions are of the same length
   Easier to fetch and decode
   Ex. 80x86 instructions range from 1 to 17 bytes
2. A few instruction formats, with the source register fields located in the same place
   The registers can be read before the instruction type has been determined

Page 12:

Design instruction set for pipelining (cont.)

3. Memory operands only appear for loads and stores
   Calculate memory address in execution stage
4. Operands are aligned in memory
   One memory access for data transfer

[Figure: pipeline steps: instruction fetch; instruction decoding, data/register read; instruction execution; memory/register read/write; register write]

[Figure: byte offsets 0 1 2 3 within a word, showing an aligned operand and a not-aligned operand]
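Why alignment gives "one memory access for data transfer" can be shown with a small sketch. It assumes 4-byte words, as in MIPS; the helper names are made up for this example.

```python
WORD_BYTES = 4  # a MIPS word is 4 bytes


def is_word_aligned(address):
    """True when the address is a multiple of the word size."""
    return address % WORD_BYTES == 0


def words_touched(address, size=WORD_BYTES):
    """Number of distinct memory words that an access of `size` bytes overlaps."""
    first = address // WORD_BYTES
    last = (address + size - 1) // WORD_BYTES
    return last - first + 1


print(is_word_aligned(8), words_touched(8))    # aligned word access
print(is_word_aligned(10), words_touched(10))  # unaligned word access
```

An aligned word sits inside a single memory word, so the MEM stage needs only one access. An unaligned word straddles two memory words and would need two accesses, which does not fit in a single pipeline stage.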

Page 13:

Pipeline hazards

Hazards: the situations in pipelining when the next instruction cannot execute in the following clock cycle

[Figure: the single-cycle and pipelined timing diagrams for the three lw instructions from Page 9, annotated "hazard"]

When can the next instruction not execute?

Page 14:

1. Structural hazards

The hardware cannot support the combination of instructions that we want to execute in the same clock cycle (a hardware structure problem)

[Figure: the single-cycle and pipelined timing diagrams for the three lw instructions from Page 9]

Two memory accesses in the same clock cycle?

Page 15:

Structural Hazard #1: Single Memory

Solution:
  A second memory is infeasible and inefficient to create
  Two Level 1 caches
    Cache: a temporary, smaller copy of memory (usually of the most recently used contents)
  Have both an L1 Instruction Cache and an L1 Data Cache
  Need more complex hardware to control when both caches miss
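The single-memory conflict can be located mechanically. The following is a minimal sketch, assuming one instruction issues per clock cycle, a five-stage pipeline, and one memory with a single port shared by instruction fetch and data access; the function name and tuple layout are made up for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
MEMORY_DATA_OPS = {"lw", "sw"}  # only loads and stores access data memory


def memory_conflicts(opcodes):
    """Return (cycle, fetching_index, data_index) for each cycle in which one
    instruction is in IF while a load/store is in MEM."""
    conflicts = []
    for i, op in enumerate(opcodes):
        if op not in MEMORY_DATA_OPS:
            continue
        # Instruction i (0-based) is in IF during cycle i + 1.
        mem_cycle = i + 1 + STAGES.index("MEM")
        fetch_index = mem_cycle - 1  # the instruction being fetched in that cycle
        if fetch_index < len(opcodes):
            conflicts.append((mem_cycle, fetch_index, i))
    return conflicts


print(memory_conflicts(["lw", "lw", "lw"]))        # too short to collide
print(memory_conflicts(["lw", "lw", "lw", "lw"]))  # 4th fetch meets 1st data access
```

With separate instruction and data caches, the fetch and the data access in that cycle go to different units, so the conflict disappears.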

Page 16:

1. Structural hazard #2: single register file

[Figure: the single-cycle and pipelined timing diagrams for the three lw instructions from Page 9]

Can't read and write to registers simultaneously?

Page 17:

Structural Hazard #2: Registers

Fact: Register access is VERY fast: takes less than half the time of ALU stage

Solution: introduce a convention
  always Write to Registers during first half of each clock cycle
  always Read from Registers during second half of each clock cycle

Result: can perform Read and Write during same clock cycle
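The write-then-read convention can be modeled in a few lines. This is a behavioral sketch only, not a hardware description; the class and method names are invented for the example.

```python
class RegisterFile:
    """32 registers; within one clock cycle the write happens before the reads."""

    def __init__(self):
        self.regs = [0] * 32

    def clock_cycle(self, write=None, reads=()):
        """write: optional (register_number, value); reads: register numbers."""
        if write is not None:              # first half of the cycle: write
            number, value = write
            if number != 0:                # $0 is hard-wired to zero in MIPS
                self.regs[number] = value
        return [self.regs[n] for n in reads]   # second half: read


rf = RegisterFile()
# One instruction writes $3 in WB while a later one reads $3 in ID, same cycle.
print(rf.clock_cycle(write=(3, 42), reads=(3, 0)))
```

Because the read sees the value written earlier in the same cycle, an instruction in ID does not have to wait for an instruction that is in WB during that cycle.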

Page 18:

2. Control hazards

The need to make a decision based on the results of one instruction while others are executing
  Ex. Branch instruction
  The next instruction must wait for the result of an earlier instruction

    add $4, $5, $6
    beq $1, $2, 40
    lw $3, 300($0)
    ...
40: or $7, $8, $9

Which instruction comes next cannot be determined yet

Page 19:

Solution to control hazards #1: stall

stall (bubble): pause execution of the next instruction

[Figure: pipelined timing diagram for add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), time axis 2 to 16 ns. nop (no operation) slots follow the beq until the completion of the branch.]

Filling two nops after the branch is inefficient!

Page 20:

Solution to control hazards #1: stall

Stall (bubble): pause execution of the next instruction

[Figure: pipelined timing diagram for add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), time axis 2 to 16 ns, with a 4 ns gap before the lw and 2 ns gaps elsewhere]

Assume that the branch address calculation and the comparison can both be completed in this stage by additional hardware

Page 21:

Solution to control hazards #2: delayed branch

    add $4, $5, $6
    beq $1, $2, 40
    lw $3, 300($0)
    ...
40: or $7, $8, $9

[Figure: pipelined timing diagram, time axis 2 to 14 ns, with instructions issued every 2 ns in the order beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0)]

Page 22:

Solution to control hazards #2: delayed branch (cont.)

Redefine branches
  Old definition: if we take the branch, none of the instructions after the branch get executed by accident
  New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

The term "Delayed Branch" means we always execute the instruction after the branch

Page 23:

Solution to control hazards #2: delayed branch (cont.)

Notes on Branch-Delay Slot
  Worst-case scenario: can always put a no-op in the branch-delay slot
  Better case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting the flow of the program
    re-ordering instructions is a common method of speeding up programs
    the compiler must be very smart in order to find instructions to do this
    usually can find such an instruction at least 50% of the time

Page 24:

Solution to control hazards #3: predict

Predict which instruction will execute next. Ex. guess that the branch is always not taken

[Figure: two pipelined timing diagrams, time axis 2 to 14 ns. Not taken: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), issued every 2 ns. Taken: add $4, $5, $6; beq $1, $2, 40; or $7, $8, $9, with a bubble and a 4 ns gap before the or.]

Page 25:

Solution to control hazards: predict (cont.)

Example: easy to predict

Loop: ...
      ...
      beq $1, $2, Loop

Dynamic prediction hardware
  Record the history of branch results
  90% accuracy
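A loop-closing branch shows why recording history helps. The sketch below models a simple predictor that guesses the most recent outcome and compares it with the static "always not taken" guess. The outcome pattern (taken nine times, then not taken once) is a made-up illustration, not data from the slides.

```python
def last_outcome_accuracy(outcomes, initial_prediction=True):
    """Predict that the branch does whatever it did last time."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # record the most recent branch result
    return correct / len(outcomes)


def always_not_taken_accuracy(outcomes):
    """Static guess: the branch is never taken."""
    return sum(1 for taken in outcomes if not taken) / len(outcomes)


# Hypothetical loop: the branch back to Loop is taken 9 times, then falls through.
loop_branch = [True] * 9 + [False]
print(last_outcome_accuracy(loop_branch))
print(always_not_taken_accuracy(loop_branch))
```

For this pattern the history-based guess is wrong only on loop exit, while "always not taken" is wrong on every iteration except the last.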

Page 26:

3. Data hazard

An instruction depends on the results of a previous instruction still in the pipeline
  The data needed by the next instruction must wait for the result of the previous instruction

Solution 1: compiler (assembler)

Example:

add $s0, $t0, $t1
sub $t2, $s0, $t3

Page 27:

Standard notation for steps in MIPS

[Figure: add $s0, $t0, $t1 on a time axis from 2 to 10 ns, drawn as IF ID EX MEM WB, with the register read and the register write marked]

IF: instruction fetch
ID: instruction decoding, data/register read
EX: instruction execution
MEM: memory/register read/write
WB: register write

Page 28:

Solution to data hazards: forwarding

Forwarding: getting the missing item early from the internal resources

add $s0, $t0, $t1
sub $t2, $s0, $t3

Wait for the instruction to finish executing?

[Figure: add and sub in program execution order on a time axis from 2 to 10 ns, each drawn as IF ID EX MEM WB; the result $t0+$t1 is forwarded from the add to the EX stage of the sub]
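Before forwarding can be applied, the dependence has to be found. The sketch below scans for read-after-write dependences between nearby instructions. The `Instr` tuple and the two-instruction window are assumptions for illustration (the window reflects the register file writing in the first half of a cycle and reading in the second half), and the sketch ignores the case where an intermediate instruction overwrites the same register.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    sources: Tuple[str, ...]


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each read-after-write
    dependence where the consumer is within `window` instructions of the producer."""
    hazards = []
    for i, producer in enumerate(program):
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].sources:
                hazards.append((i, j, producer.dest))
    return hazards


program = [
    Instr("add", "$s0", ("$t0", "$t1")),
    Instr("sub", "$t2", ("$s0", "$t3")),
]
print(raw_hazards(program))
```

Each reported pair is a place where the consumer would read a stale register unless the value is forwarded or the pipeline stalls.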

Page 29:

Another example of forwarding

[Figure: lw $s0, 20($t1) followed by sub $t2, $s0, $t3 on a time axis from 2 to 14 ns, each drawn as IF ID EX MEM WB. The loaded value Mem[20($t1)] is available only after the MEM stage of the lw, so a bubble is inserted before the sub even with forwarding.]

Page 30:

Example: Reordering code

Swap 0($t1) and 4($t1):

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

What's wrong with pipelining?

[Figure: pipeline diagram (IF ID EX MEM WB, time axis 2 to 10 ns) highlighting lw $t2, 4($t1) followed immediately by sw $t2, 0($t1)]

Page 31:

Example: Reordering code

Reordered:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)

[Figure: pipeline diagrams (IF ID EX MEM WB, time axis 2 to 10 ns) showing lw $t2, 4($t1) and sw $t2, 0($t1), with sw $t0, 4($t1) moved between them]
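The benefit of the reordering can be counted. The sketch below assumes full forwarding, so the only remaining stall is a lw immediately followed by an instruction that reads the loaded register, and it treats the sw data register as such a read, as the slides do. The tuple layout and function name are made up for the example.

```python
def load_use_stalls(program):
    """Count one-cycle bubbles: a lw immediately followed by a reader of its
    destination register. Each entry is (op, destination or None, sources)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1
    return stalls


original = [
    ("lw", "$t0", ("$t1",)),         # lw $t0, 0($t1)
    ("lw", "$t2", ("$t1",)),         # lw $t2, 4($t1)
    ("sw", None, ("$t2", "$t1")),    # sw $t2, 0($t1)
    ("sw", None, ("$t0", "$t1")),    # sw $t0, 4($t1)
]
# Swap the two stores so that sw $t2 no longer directly follows lw $t2.
reordered = [original[0], original[1], original[3], original[2]]

print(load_use_stalls(original), load_use_stalls(reordered))
```

The original order has one load-use pair (lw $t2 then sw $t2); after swapping the two stores there is none, and the program still exchanges the two memory words.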

Page 32:

Peer Instruction

A. Thanks to pipelining, I have reduced the time it took me to wash my shirt.

B. Longer pipelines are always a win (since less work per stage & a faster clock).

C. We can rely on compilers to help us avoid data hazards by reordering instrs.

   ABC
1: FFF
2: FFT
3: FTF
4: FTT
5: TFF
6: TFT
7: TTF
8: TTT

Page 33:

Outline
  Overview
  A pipelined datapath: How to build?
  Pipelined control
  Data hazards
    Forwarding
    Stalls
  Branch hazards
  Superscalar and dynamic pipelining

Page 34:

Multiple instructions execute using pipelining

[Figure: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) in program execution order against time in clock cycles CC 1 to CC 7; each instruction passes through IM, Reg, ALU, DM, Reg]

How to combine these multiple datapaths?
  Shared units during one clock cycle
  Add buffers (registers) to hold data (as we do in multi-cycle)

Page 35:

Divide single-cycle datapath into stages

[Figure: single-cycle datapath (PC, instruction memory, registers, sign extend, ALU, data memory, adders, muxes) divided into five stages, with the data flow marked]

IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back

Page 36:

Pipelined datapath with pipeline registers (Fig 6.11)

[Figure: the datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]

Pipeline register widths: IF/ID 64 bits, ID/EX 128 bits, EX/MEM 97 bits, MEM/WB 64 bits
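The four widths can be reproduced by adding up the values each pipeline register has to carry. The field breakdown below is an assumption based on the usual textbook version of this datapath (data only, before control bits and the destination register number are added); the slide itself gives only the totals.

```python
WORD = 32  # bits in a MIPS word

# Assumed contents of each pipeline register (data fields only).
fields = {
    "IF/ID": {"PC+4": WORD, "instruction": WORD},
    "ID/EX": {"PC+4": WORD, "read data 1": WORD, "read data 2": WORD,
              "sign-extended immediate": WORD},
    "EX/MEM": {"branch target": WORD, "Zero": 1, "ALU result": WORD,
               "read data 2": WORD},
    "MEM/WB": {"memory read data": WORD, "ALU result": WORD},
}

widths = {name: sum(parts.values()) for name, parts in fields.items()}
print(widths)
```

The odd 97-bit total for EX/MEM comes from the single Zero bit of the ALU, which is kept for the branch decision.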

Page 37:

Example: (1) instruction fetch

[Figure: pipelined datapath with the instruction fetch stage highlighted for lw, followed by the same datapath with the instruction decode stage highlighted]

Page 38:

Example: (2) instruction decode

[Figure: pipelined datapath with the instruction fetch stage highlighted for lw, followed by the same datapath with the instruction decode stage highlighted]

Page 39:

Example: (3) execute

[Figure: pipelined datapath with the execution stage highlighted for lw]

Page 40:

Example: (4) memory access

[Figure: pipelined datapath with the memory stage highlighted for lw, followed by the same datapath with the write back stage highlighted (97108/Patterson Figure 06.15)]

Page 41:

Example: (5) write back

[Figure: pipelined datapath with the memory stage highlighted for lw, followed by the same datapath with the write back stage highlighted (97108/Patterson Figure 06.15)]

Is the register number used for the write back correct?

Page 42:

Preserve the Destination Register number

[Figure: pipelined datapath in which the write register number is passed along through the pipeline registers to the write back stage]

Add a 5-bit buffer

Page 43:

Trace another pipelining example

Tracing helps you understand how pipelining works!!!

Single-clock-cycle pipeline diagram example:

lw $10, 20($1)
sub $11, $2, $3

Page 44:

[Figure: single-clock-cycle pipeline diagrams.
Clock 1: lw $10, 20($1) in instruction fetch.
Clock 2: sub $11, $2, $3 in instruction fetch; lw $10, 20($1) in instruction decode.]

Page 45:

[Figure: single-clock-cycle pipeline diagrams.
Clock 1: lw $10, 20($1) in instruction fetch.
Clock 2: sub $11, $2, $3 in instruction fetch; lw $10, 20($1) in instruction decode.]

Page 46:

[Figure: single-clock-cycle pipeline diagrams.
Clock 3: sub $11, $2, $3 in instruction decode; lw $10, 20($1) in execution.
Clock 4: sub $11, $2, $3 in execution; lw $10, 20($1) in memory.]

Page 47:

[Figure: single-clock-cycle pipeline diagrams.
Clock 3: sub $11, $2, $3 in instruction decode; lw $10, 20($1) in execution.
Clock 4: sub $11, $2, $3 in execution; lw $10, 20($1) in memory.]

Page 48:

[Figure: single-clock-cycle pipeline diagrams.
Clock 5: sub $11, $2, $3 in memory; lw $10, 20($1) in write back.
Clock 6: sub $11, $2, $3 in write back.]

Page 49:

[Figure: single-clock-cycle pipeline diagrams.
Clock 5: sub $11, $2, $3 in memory; lw $10, 20($1) in write back.
Clock 6: sub $11, $2, $3 in write back.]

Page 50:

Multiple-clock-cycle pipeline diagram
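A multiple-clock-cycle diagram puts the instructions in rows and the clock cycles in columns. The sketch below builds one for the lw/sub trace, assuming an ideal five-stage pipeline with no stalls; the function names are invented for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_rows(instructions):
    """Place instruction i in stage s during clock cycle i + s (0-based)."""
    total_cycles = len(instructions) + len(STAGES) - 1 if instructions else 0
    rows = []
    for i, text in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage
        rows.append((text, cells))
    return rows


def render(instructions):
    """Format the rows as a plain-text table with one column per clock cycle."""
    rows = pipeline_rows(instructions)
    total_cycles = len(rows[0][1]) if rows else 0
    lines = [" " * 18 + "".join(f"CC{c + 1:<4}" for c in range(total_cycles))]
    for text, cells in rows:
        lines.append(f"{text:<18}" + "".join(f"{cell:<6}" for cell in cells))
    return "\n".join(lines)


print(render(["lw $10, 20($1)", "sub $11, $2, $3"]))
```

Reading one column of the table gives the single-clock-cycle view: in clock 4, for example, lw is in MEM while sub is in EX.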

Page 51:

Outline
  Overview
  A pipelined datapath
  Pipelined control
  Data hazards
    Forwarding
    Stalls
  Branch hazards
  Superscalar and dynamic pipelining

Page 52:

Label the control lines (Fig 6.22)

[Figure 6.22: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with the control lines labeled: PCSrc, RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg, RegWrite]

written during each clock cycle

Page 53:

Divide the control lines into groups

Fig. 6.25

How to set the control lines at each stage? Extend the pipeline registers to include control information (store the control signals in the pipeline registers).

Page 54:

Propagation of control (Fig.6.26)

[Figure 6.26: the Control unit decodes the instruction held in IF/ID and produces the EX, M, and WB signal groups. ID/EX holds EX, M, and WB; EX/MEM holds M and WB; MEM/WB holds WB.]
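The propagation of control can be sketched in a few lines of Python. This is only an illustration, not the slides' own material. The bit strings for lw and R-format are the ones that appear in the clock-by-clock figures of this chapter. The bit ordering inside each group (EX = RegDst, ALUOp1, ALUOp0, ALUSrc; M = Branch, MemRead, MemWrite; WB = RegWrite, MemtoReg) and the names `CONTROL` and `propagate` are assumptions made for the sketch.

```python
# Control values grouped by the pipeline stage that consumes them.
# Assumed bit order: EX = RegDst, ALUOp1, ALUOp0, ALUSrc
#                    M  = Branch, MemRead, MemWrite
#                    WB = RegWrite, MemtoReg
CONTROL = {
    "R-format": {"EX": "1100", "M": "000", "WB": "10"},
    "lw":       {"EX": "0001", "M": "010", "WB": "11"},
}


def propagate(kind):
    """Return the control groups each pipeline register holds for one instruction."""
    c = CONTROL[kind]
    # All three groups are written into ID/EX at the end of the ID stage.
    id_ex = {"EX": c["EX"], "M": c["M"], "WB": c["WB"]}
    # The EX group is used up in the EX stage, so only M and WB move on.
    ex_mem = {"M": id_ex["M"], "WB": id_ex["WB"]}
    # The M group is used up in the MEM stage, so only WB moves on.
    mem_wb = {"WB": ex_mem["WB"]}
    return {"ID/EX": id_ex, "EX/MEM": ex_mem, "MEM/WB": mem_wb}


if __name__ == "__main__":
    for name, groups in propagate("lw").items():
        print(name, groups)
```

Each pipeline register carries only the groups that later stages still need, which is why the control portion of the registers shrinks from ID/EX to MEM/WB.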

Page 55:

[Figure 6.27: pipelined datapath with the control signals (EX, M, WB groups) connected to the control portions of the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

Page 56:

[Figure (old book Fig 6.31): pipelined datapath with control, clock cycles 1 and 2]

Clock 1: IF: lw $10, 20($1) | ID: before<1> | EX: before<2> | MEM: before<3> | WB: before<4>
Clock 2: IF: sub $11, $2, $3 | ID: lw $10, 20($1) | EX: before<1> | MEM: before<2> | WB: before<3>

Control values shown for lw in ID (clock 2): EX = 0001, M = 010, WB = 11.

Page 57:

[Figure: pipelined datapath with control, clock cycles 1 and 2]

Clock 1: IF: lw $10, 20($1) | ID: before<1> | EX: before<2> | MEM: before<3> | WB: before<4>
Clock 2: IF: sub $11, $2, $3 | ID: lw $10, 20($1) | EX: before<1> | MEM: before<2> | WB: before<3>

Control values shown for lw in ID (clock 2): EX = 0001, M = 010, WB = 11.

Page 58:

[Figure: pipelined datapath with control, clock cycles 3 and 4]

Clock 3: IF: and $12, $4, $5 | ID: sub $11, $2, $3 | EX: lw $10, ... | MEM: before<1> | WB: before<2>
Clock 4: IF: or $13, $6, $7 | ID: and $12, $4, $5 | EX: sub $11, ... | MEM: lw $10, ... | WB: before<1>

Control values shown for sub in ID (clock 3): EX = 1100, M = 000, WB = 10.

Page 59:

[Figure: pipelined datapath with control, clock cycles 3 and 4]

Clock 3: IF: and $12, $4, $5 | ID: sub $11, $2, $3 | EX: lw $10, ... | MEM: before<1> | WB: before<2>
Clock 4: IF: or $13, $6, $7 | ID: and $12, $4, $5 | EX: sub $11, ... | MEM: lw $10, ... | WB: before<1>

Control values shown for sub in ID (clock 3): EX = 1100, M = 000, WB = 10.

Page 60:

[Figure: pipelined datapath with control, clock cycles 5 and 6]

Clock 5: IF: add $14, $8, $9 | ID: or $13, $6, $7 | EX: and $12, ... | MEM: sub $11, ... | WB: lw $10, ...
Clock 6: IF: after<1> | ID: add $14, $8, $9 | EX: or $13, ... | MEM: and $12, ... | WB: sub $11, ...
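The clock-by-clock figures above follow a simple rule: the instruction fetched in clock n reaches ID in clock n + 1, EX in clock n + 2, and so on. A minimal Python sketch of that rule is given here, assuming no hazards or stalls. The names `STAGES`, `PROGRAM`, and `occupancy` are made up for the illustration, and the `before<k>` / `after<k>` labels imitate the figures.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# The five-instruction sequence traced in the figures.
PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "and $12, $4, $5",
    "or $13, $6, $7",
    "add $14, $8, $9",
]


def occupancy(program, clock):
    """Map each stage to the instruction it holds in the given clock cycle (1-based)."""
    result = {}
    for depth, stage in enumerate(STAGES):
        # Instruction number i (0-based) is in IF during clock i + 1.
        index = clock - 1 - depth
        if index < 0:
            result[stage] = f"before<{-index}>"
        elif index >= len(program):
            result[stage] = f"after<{index - len(program) + 1}>"
        else:
            result[stage] = program[index]
    return result


if __name__ == "__main__":
    for clock in range(1, 7):
        print(f"Clock {clock}:", occupancy(PROGRAM, clock))
```

Printing clocks 1 through 6 gives the same stage contents as the six figures, for example lw in WB during clock 5.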

Page 61:

Outline
  Overview
  A pipeline datapath
  Pipelined control
  Data hazards
    Forwarding
    Stalls
  Branch hazards
  Superscalar and dynamic pipelining

Page 62:

Example: data hazards

[Figure: multiple-clock-cycle pipeline diagram (CC 1 to CC 9), stages IM, Reg, ALU, DM, Reg]

Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/-20  -20   -20   -20   -20

Annotations: "Both read and write are OK!" (register read and written in the same cycle); two dependences are marked "?".
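Which of the dependences on $2 are real hazards can be checked mechanically. The sketch below assumes the ideal five-stage pipeline of the figure: instruction number i (counting from 0) reads its registers in cycle i + 2 and writes its result in cycle i + 5, and a write and a read of the same register in the same cycle are fine. The tuple encoding of the program and the function name `data_hazards` are assumptions for the illustration. The sketch also ignores the case where a later instruction overwrites the register in between, which does not occur in this sequence.

```python
# (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2, $1, $3",  2,    (1, 3)),
    ("and $12, $2, $5", 12,   (2, 5)),
    ("or $13, $6, $2",  13,   (6, 2)),
    ("add $14, $2, $2", 14,   (2, 2)),
    ("sw $15, 100($2)", None, (15, 2)),
]


def data_hazards(program):
    """Return (writer, reader) pairs where the reader reads before the write happens."""
    hazards = []
    for j, (writer, dest, _) in enumerate(program):
        if dest is None or dest == 0:      # no register written, or $0
            continue
        write_cycle = j + 5                # WB stage of instruction j
        for i in range(j + 1, len(program)):
            reader, _, sources = program[i]
            read_cycle = i + 2             # ID stage of instruction i
            if dest in sources and read_cycle < write_cycle:
                hazards.append((writer, reader))
    return hazards


if __name__ == "__main__":
    for writer, reader in data_hazards(PROGRAM):
        print(f"hazard: '{reader}' needs the result of '{writer}'")
```

Under these assumptions only and and or are reported. add reads $2 in CC 5, the same cycle in which sub writes it, and sw reads it later still.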

Page 63:

Solution 1: software
Insert independent instructions or nops (no operation)

sub $2, $1, $3
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
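A compiler-style pass that inserts the nops could look like the following Python sketch. It is a simplified illustration that uses the same timing assumptions as the example (register write in WB, register read in ID, same-cycle write and read allowed), so a reader must be at least three slots behind the instruction that produces its operand. The program encoding and the name `insert_nops` are hypothetical. Here the and instruction is written with $5, as in the data hazard example.

```python
# (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2, $1, $3",  2,    (1, 3)),
    ("and $12, $2, $5", 12,   (2, 5)),
    ("or $13, $6, $2",  13,   (6, 2)),
    ("add $14, $2, $2", 14,   (2, 2)),
    ("sw $15, 100($2)", None, (15, 2)),
]

MIN_DISTANCE = 3  # slots between a writer and the first instruction that may read


def insert_nops(program):
    """Return the instruction stream with nops added so no data hazard remains."""
    out = []          # scheduled stream, nops included
    last_write = {}   # register number -> slot of its most recent writer
    for text, dest, sources in program:
        earliest = len(out)
        for reg in sources:
            if reg in last_write:
                earliest = max(earliest, last_write[reg] + MIN_DISTANCE)
        out.extend(["nop"] * (earliest - len(out)))
        out.append(text)
        if dest not in (None, 0):
            last_write[dest] = len(out) - 1
    return out


if __name__ == "__main__":
    print("\n".join(insert_nops(PROGRAM)))
```

For this sequence the pass adds exactly two nops after sub, matching the slide. A real compiler would first try to fill those slots with independent instructions.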

Page 64:

Hardware approach: forwarding

[Figure: multiple-clock-cycle pipeline diagram (CC 1 to CC 9) with forwarding paths]

Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM:         X     X     X     -20   X       X     X     X     X
Value of MEM/WB:         X     X     X     X     -20     X     X     X     X

Annotations: 1 = EX hazard (value taken from EX/MEM); 2 = MEM hazard (value taken from MEM/WB).

Page 65:

[Figure: ALU with pipeline registers ID/EX, EX/MEM, MEM/WB. a. No forwarding. b. With forwarding: a forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and drives the ForwardA and ForwardB multiplexors at the ALU inputs.]

Page 66:

Control for forwarding
Set ForwardA and ForwardB

EX hazard:

If ( EX/MEM.RegWrite                                // the previous instruction writes a register
     and (EX/MEM.RegisterRd <> 0)                   // exclude writes to $0
     and (EX/MEM.RegisterRd = ID/EX.RegisterRs))    // the register written by the previous instruction is the one about to be used
   ForwardA = 10

If ( EX/MEM.RegWrite                                // the previous instruction writes a register
     and (EX/MEM.RegisterRd <> 0)                   // exclude writes to $0
     and (EX/MEM.RegisterRd = ID/EX.RegisterRt))    // the register written by the previous instruction is the one about to be used
   ForwardB = 10
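The two EX-hazard conditions translate directly into code. The following Python sketch models only the EX hazard shown on this slide. The dictionary layout of the pipeline registers, the function name `forward_controls`, and the use of "00" as the no-forwarding default are assumptions for the illustration. The MEM hazard (forwarding from MEM/WB) is deliberately left out.

```python
def forward_controls(ex_mem, id_ex):
    """Return (ForwardA, ForwardB) for the EX hazard as 2-bit strings."""
    forward_a = "00"   # assumed default: operand comes from the register file
    forward_b = "00"
    # The previous instruction writes a register, and it is not $0.
    writes = bool(ex_mem["RegWrite"]) and ex_mem["RegisterRd"] != 0
    if writes and ex_mem["RegisterRd"] == id_ex["RegisterRs"]:
        forward_a = "10"   # first ALU operand taken from EX/MEM
    if writes and ex_mem["RegisterRd"] == id_ex["RegisterRt"]:
        forward_b = "10"   # second ALU operand taken from EX/MEM
    return forward_a, forward_b


if __name__ == "__main__":
    # sub $2, $1, $3 is in EX/MEM while and $12, $2, $5 is in ID/EX.
    ex_mem = {"RegWrite": 1, "RegisterRd": 2}
    id_ex = {"RegisterRs": 2, "RegisterRt": 5}
    print(forward_controls(ex_mem, id_ex))
```

With sub $2, $1, $3 one stage ahead of and $12, $2, $5, only ForwardA is set, because $2 is the Rs operand of the and.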

Page 67:

Outline
  Overview
  A pipeline datapath
  Pipelined control
  Data hazards
    Forwarding
    Stalls
  Branch hazards
  Superscalar and dynamic pipelining

Page 68:

Unsolved data hazards

[Figure: multiple-clock-cycle pipeline diagram (CC 1 to CC 9), stages IM, Reg, ALU, DM, Reg; one dependence is marked "?"]

Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

Page 69:

Stall

[Figure: multiple-clock-cycle pipeline diagram (CC 1 to CC 10) with a bubble inserted]

Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

Page 70:

Hazard detection unit
When to stall?

If ( ID/EX.MemRead                                     // the previous instruction is a load
     and ( (ID/EX.RegisterRt = IF/ID.RegisterRs)
           or (ID/EX.RegisterRt = IF/ID.RegisterRt) ) )
   Stall the pipeline
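The stall condition can be written as a small predicate. This Python sketch assumes the pipeline registers are modelled as dictionaries with the field names used on the slide. The function name `must_stall` is made up for the illustration.

```python
def must_stall(id_ex, if_id):
    """Load-use check: True when the instruction in ID must wait one cycle."""
    is_load = bool(id_ex["MemRead"])   # the instruction now in EX is a load
    uses_loaded_register = id_ex["RegisterRt"] in (
        if_id["RegisterRs"],
        if_id["RegisterRt"],
    )
    return is_load and uses_loaded_register


if __name__ == "__main__":
    # lw $2, 20($1) is in ID/EX while and $4, $2, $5 is in IF/ID.
    lw = {"MemRead": 1, "RegisterRt": 2}
    and_instr = {"RegisterRs": 2, "RegisterRt": 5}
    print(must_stall(lw, and_instr))
```

For lw $2, 20($1) followed immediately by and $4, $2, $5 the predicate is true, which corresponds to the single bubble in the stall diagram.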

Page 71:

Outline
  Overview
  A pipeline datapath
  Pipelined control
  Data hazards
    Forwarding
    Stalls
  Branch hazards
  Advanced pipelining

Page 72:

Advanced technology
  Single-cycle implementation
  Multi-cycle implementation
  Pipelining (instruction-level parallelism)

* How to build fast processors?

1. Increase the depth of the pipeline: superpipelining

2. Replicate the internal components of the computer: multiple issue
   Dynamic multiple issue (hardware): superscalar
   Static multiple issue (compiler): VLIW
   How to schedule?

Page 73:

Superpipelining (longer pipelines)

Ideal improvement of pipelining

Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipe stages

* Divide instruction execution into 8 or more stages
* How to balance the length of each stage?
=> The clock cycle is set by the slowest stage

[Figure: execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Top: nonpipelined, 8 ns per instruction (instruction fetch, Reg, ALU, data access, Reg), time axis 2 to 18 ns. Bottom: pipelined, a new instruction starts every 2 ns, time axis 2 to 14 ns.]
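The numbers in the figure can be worked through with a short calculation. This is a sketch assuming a five-stage pipeline, with the 8 ns and 2 ns values read off the figure. The constant and function names are made up for the illustration.

```python
STAGES = 5               # IF, Reg, ALU, data access, Reg
NONPIPELINED_NS = 8      # time per instruction without pipelining (from the figure)
PIPELINED_CYCLE_NS = 2   # clock cycle of the pipelined design (from the figure)


def nonpipelined_total(n):
    """Total time for n instructions executed one after another."""
    return n * NONPIPELINED_NS


def pipelined_total(n):
    """Total time for n instructions: fill the pipeline, then one finishes per cycle."""
    return (STAGES + n - 1) * PIPELINED_CYCLE_NS


# Ideal time between instructions if all stages were perfectly balanced.
ideal_gap_ns = NONPIPELINED_NS / STAGES

if __name__ == "__main__":
    print("three loads, nonpipelined:", nonpipelined_total(3), "ns")
    print("three loads, pipelined:   ", pipelined_total(3), "ns")
    print("ideal gap:", ideal_gap_ns, "ns; actual gap:", PIPELINED_CYCLE_NS, "ns")
```

The ideal formula gives 8 / 5 = 1.6 ns between instructions, but the clock must fit the slowest stage, so the actual gap is 2 ns. For the three loads this means 14 ns instead of 24 ns, and for long instruction streams the speedup approaches 8 / 2 = 4 rather than 5.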

Page 74:

Multiple-issue pipeline
Two problems:
  How to package instructions into issue slots
    Static issue processor: compiler
    Dynamic issue processor: at runtime by hardware
  Dealing with data and control hazards
    Static issue processor: compiler
    Dynamic issue processor: at runtime by hardware

Page 75:

Static multiple issue

[Figure: two copies of add $s0, $t0, $t1 passing through IF, ID, EX, MEM, WB together (time axis 2 to 10), labeled as one issue packet]

• The mix of instructions that can be initiated in a given clock cycle is usually restricted
• An issue packet can be viewed as a single instruction allowing several operations: very long instruction word (VLIW)

Page 76:

Ex. Two-issue MIPS
2 instructions are issued per clock cycle
  One for integer ALU or branch
  The other for load or store

* One slot can be nop

Page 77:

A static two-issue datapath

[Figure: two-issue datapath with PC, instruction memory, registers, two sign-extend units, two ALUs (one computing the data memory address), and data memory. Annotations: "64-bit", "1", "2", "avoid structural hazard".]

Page 78:

Superscalar: Dynamic pipeline scheduling

Dynamic pipeline scheduling
  Hardware support for reordering the order of instruction execution so as to avoid stalls

Ex. Unsolved pipeline stall

lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20

Stall (memory is slow)

Page 79:

Dynamic pipeline scheduling

[Figure: dynamically scheduled pipeline, with three parts numbered 1 to 3]
Instruction fetch and decode unit: in-order issue
Reservation stations: buffers to hold operands and operation
Functional units (Integer, Integer, ..., Floating point, Load/Store): out-of-order execute
Commit unit: in-order commit; reorder buffer, decides when it's safe to put the result back, in the order of program execution

Page 80:

dynamic pipelining

Static pipelining
Advantage of dynamic pipelining
  Hide memory latency
  Avoid the stalls that the compiler could not schedule
  Speculatively execute instructions (combined with branch prediction) while waiting for hazards to be resolved

Page 81:

Fallacies and pitfalls
  Pipelining is easy
  Pipelining ideas can be implemented independent of technology
    As the number of transistors on chip increases, additional hardware makes sense
    Ex. Extra hardware in dynamic pipelining
  Increasing the depth of pipelining always increases performance

Page 82:

Depth of pipelining

3 factors:
  Data hazards: a large percentage of cycles become stalls
  Control hazards: slower branches
  Pipeline register overhead

[Figure: relative performance (0.0 to 3.0) versus pipeline depth (1, 2, 4, 8, 16); Fp pipelining, 1986]

