Chap.6: Enhancing Performance with Pipelining Jen-Chang Liu, Spring 2006 Parts of the slides are...

Chap.6: Enhancing Performance with Pipelining

Jen-Chang Liu, Spring 2006

Parts of the slides are duplicated from inst.eecs.berkeley.edu/~cs61c

Review Datapath (1/3)

Datapath is the hardware that performs the operations necessary to execute programs.

Control instructs the datapath on what to do next.

Datapath needs:
  access to storage (general-purpose registers and memory)
  computational ability (ALU)
  helper hardware (local registers and PC)

Review Datapath (2/3)

Five stages of the datapath (executing an instruction):
1. Instruction Fetch (Increment PC)
2. Instruction Decode (Read Registers)
3. ALU (Computation)
4. Memory Access
5. Write to Registers

ALL instructions must go through ALL five stages.

Review Datapath (3/3)

[Figure: the five-stage datapath. Components: PC, +4 adder, instruction memory, registers (rs, rt, rd), imm, ALU, data memory. Stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back]

Review: single-cycle datapath and multi-cycle datapath

[Figure: steps of the multi-cycle datapath: instruction fetch, data/register read, instruction execution, memory/register read/write, register write]

Outline
  Overview
  A pipeline datapath
  Pipelined control
  Data hazards
    Forwarding
    Stalls
  Branch hazards
  Superscalar and dynamic pipelining

What's pipelining?

[Figure: laundry analogy. Four loads (tasks A, B, C, D, in task order) each go through washing, drying, folding, and putting away. Time axis: 6 PM to 2 AM. One timeline is non-pipelined, the other pipelined.]

Use different resources at the same time.

Pipelining
  Definition: an implementation technique in which multiple instructions are overlapped in execution
  How to achieve pipelining?
    An instruction is divided into steps (stages)
    We have separate resources for each stage

[Figure: execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) in program execution order. Single-cycle implementation: each instruction (instruction fetch, Reg, ALU, data access, Reg) takes 8 ns, so a new instruction starts every 8 ns. Pipelined implementation: each stage takes 2 ns, so a new instruction starts every 2 ns.]



What does pipelining improve?

Pipelining improves performance by
  increasing the instruction throughput (more instructions complete per unit time),
  not by decreasing the execution time of an individual instruction (the time for a single instruction is unchanged).

Ideal conditions for saving:

  Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipe stages
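The numbers in the timing figure can be reproduced with a small calculation. This is only a sketch: the function names are made up for illustration, and it assumes a five-stage pipeline with a 2 ns stage time and an 8 ns single-cycle instruction, as in the lw example.

```python
def single_cycle_time_ns(n_instructions, instruction_ns):
    """Total time when each instruction finishes before the next one starts."""
    return n_instructions * instruction_ns


def pipelined_time_ns(n_instructions, n_stages, stage_ns):
    """Total time when a new instruction enters the pipeline every stage_ns.

    The first instruction needs n_stages cycles; each later one adds one cycle.
    """
    return (n_stages + n_instructions - 1) * stage_ns


# Three lw instructions, as in the timing figure.
single = single_cycle_time_ns(3, 8)      # 24 ns
pipelined = pipelined_time_ns(3, 5, 2)   # 14 ns

# With many instructions the ratio approaches 8 / 2 = 4, not the ideal 5,
# because the five stages are not perfectly balanced.
long_run_speedup = single_cycle_time_ns(1_000_000, 8) / pipelined_time_ns(1_000_000, 5, 2)
```

For a short program the fill time of the pipeline dominates, which is why three instructions only improve from 24 ns to 14 ns.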

Design instruction set for pipelining

MIPS has been designed for pipelining:

1. Instructions are all the same length
   Easier to fetch and decode
   Ex. 80x86 instructions range from 1 to 17 bytes

2. A few instruction formats, with the source register fields located in the same place
   Registers can be read before the instruction type has been determined

Design instruction set for pipelining (cont.)

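3. Memory operands appear only in loads and stores
   The memory address is calculated in the execution stage

4. Operands are aligned in memory
   One memory access is enough for a data transfer

[Figure: pipeline stages (instruction fetch; instruction decoding and data/register read; ...; register write) and a memory layout with byte offsets 0 to 3 contrasting an aligned operand with an unaligned one]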

Pipeline hazards

Hazards: situations in pipelining when the next instruction cannot execute in the following clock cycle

[Figure: single-cycle vs. pipelined timing for the three lw instructions, annotated "hazard"]

When can the next instruction not execute?



1. Structural hazards

The hardware cannot support the combination of instructions that we want to execute in the same clock cycle (a hardware structure problem).

[Figure: single-cycle vs. pipelined timing for the three lw instructions]

Two memory accesses in the same clock cycle?

Structural Hazard #1: Single Memory

Solution:
  A second memory: infeasible and inefficient to create
  Two Level 1 caches instead
    Cache: a temporary, smaller copy of memory (usually of the most recently used data)
    Have both an L1 Instruction Cache and an L1 Data Cache
    Need more complex hardware to control when both caches miss




1. Structural hazard #2: single register file

[Figure: single-cycle vs. pipelined timing for the three lw instructions]

Can't read and write registers simultaneously?

Structural Hazard #2: Registers

Fact: register access is VERY fast; it takes less than half the time of the ALU stage

Solution: introduce a convention
  always Write to Registers during the first half of each clock cycle
  always Read from Registers during the second half of each clock cycle

Result: can perform Read and Write during the same clock cycle

2. Control hazards

The need to make a decision based on the results of one instruction while others are executing
  Ex. branch instruction
  The next instruction must wait for the result of an earlier instruction

      add $4, $5, $6
      beq $1, $2, 40
      lw  $3, 300($0)
      ...
  40: or  $7, $8, $9

After the beq, which instruction comes next? It cannot be determined yet...

Solution to control hazards #1: stall

Stall (bubble): pause execution of the next instruction

[Figure: pipeline timing diagram with 2 ns stages for beq $1, $2, 40, add $4, $5, $6, and lw $3, 300($0); nops (nop: no operation) are inserted after the branch until the completion of the branch; intervals of 2 ns and 4 ns are marked]

Filling two nops after a branch is inefficient!

Solution to control hazards #1: stall (cont.)

Stall (bubble): pause execution of the next instruction

[Figure: pipeline timing diagram for beq $1, $2, 40, add $4, $5, $6, and lw $3, 300($0), with 2 ns and 4 ns intervals marked]

Assume that the branch address calculation and the comparison can both be completed in this stage (the one marked in the figure) using additional hardware.

Solution to control hazards #2: delayed branch

      add $4, $5, $6
      beq $1, $2, 40
      lw  $3, 300($0)
      ...
  40: or  $7, $8, $9

[Figure: pipeline timing with 2 ns stages for beq $1, $2, 40; add $4, $5, $6 (in the delayed branch slot); lw $3, 300($0)]


delayed branch (cont.)

Redefine branches
  Old definition: if we take the branch, none of the instructions after the branch get executed by accident
  New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

The term "Delayed Branch" means we always execute the instruction after the branch

delayed branch (cont.)

Notes on Branch-Delay Slot
  Worst-case scenario: can always put a no-op in the branch-delay slot
  Better case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting the flow of the program
    re-ordering instructions is a common method of speeding up programs
    the compiler must be very smart in order to find instructions to do this
    it can usually find such an instruction at least 50% of the time

Solution to control hazards #3: predict

Predict the instruction most likely to execute next. Ex. guess that a branch is always not taken.

[Figure: pipeline timing with 2 ns stages. Branch not taken: add $4, $5, $6, beq $1, $2, 40, and lw $3, 300($0) proceed without a stall. Branch taken: add $4, $5, $6, beq $1, $2, 40, then or $7, $8, $9 after a bubble (4 ns gap).]



Solution to control hazards: predict (cont.)

Example: a loop branch is easy to predict

  Loop: ...
        ...
        beq $1, $2, Loop

Dynamic prediction hardware
  Records the history of branch results
  90% accuracy
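One common way to "record the history of branch results" is a 2-bit saturating counter per branch. The slide does not say which scheme it has in mind, so the sketch below is an assumption for illustration; the function name and the test pattern (a loop branch taken nine times, then not taken once, repeated ten times) are also made up.

```python
def simulate_two_bit_predictor(outcomes, initial_state=0):
    """Count correct predictions of a 2-bit saturating counter.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    Each taken branch moves the state up by one, each not-taken branch
    moves it down by one, saturating at 0 and 3.
    """
    state = initial_state
    hits = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            hits += 1
        if taken:
            state = min(state + 1, 3)
        else:
            state = max(state - 1, 0)
    return hits


# Hypothetical loop branch: taken 9 times, then falls through, 10 times over.
loop_branch = ([True] * 9 + [False]) * 10
hits = simulate_two_bit_predictor(loop_branch)
```

After a short warm-up the counter only mispredicts the loop exit, which is why loop branches are considered easy to predict.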

3. Data hazard

An instruction depends on the results of a previous instruction still in the pipeline
  The data needed by the next instruction must wait for the result of the previous instruction

Solution 1: compiler (assembler)

Example:

  add $s0, $t0, $t1
  sub $t2, $s0, $t3

Standard notation for steps in MIPS

  Time               2    4    6    8    10
  add $s0, $t0, $t1  IF   ID   EX   MEM  WB

  IF: instruction fetch
  ID: instruction decoding and data/register read
  WB: register write

[Figure legend: read / write]

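Solution to data hazards: forwarding

Forwarding: getting the missing item early from the internal resources

  add $s0, $t0, $t1
  sub $t2, $s0, $t3

Wait for the instruction to finish executing?

  Time               2    4    6    8    10
  add $s0, $t0, $t1  IF   ID   EX   MEM  WB
  sub $t2, $s0, $t3       IF   ID   EX   MEM  WB

[Figure: the result $t0+$t1 is forwarded from the output of add's EX stage to the input of sub's EX stage]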

Another example of forwarding

  lw  $s0, 20($t1)
  sub $t2, $s0, $t3

[Figure: pipeline diagram (IF ID EX MEM WB, time axis 2 to 14) for the two instructions. The loaded value Mem[20($t1)] is only available after lw's MEM stage: can it still be forwarded to sub in time?]

Example: Reordering code

Swap 0($t1) and 4($t1):

  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)

What's wrong with pipelining this code?

[Figure: pipeline diagram showing lw $t2, 4($t1) immediately followed by sw $t2, 0($t1), which needs the value being loaded]

Reordered code:

  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t0, 4($t1)
  sw $t2, 0($t1)
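The effect of swapping the two sw instructions can be checked by counting load-use pairs. The sketch below is a simplified model, not a full hazard analysis: it only looks at adjacent instructions, only understands lw, sw, and three-register instructions, and its helper names are invented for this example.

```python
import re


def parse(line):
    """Return (opcode, registers) for a line such as 'lw $t0, 0($t1)'."""
    opcode, operands = line.split(None, 1)
    return opcode, re.findall(r"\$\w+", operands)


def source_registers(opcode, registers):
    """sw reads every register it names; other instructions write the first one."""
    return registers if opcode == "sw" else registers[1:]


def count_load_use_hazards(program):
    """Count instructions that read a register loaded by the lw just before them."""
    hazards = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_regs = parse(previous)
        curr_op, curr_regs = parse(current)
        if prev_op == "lw" and prev_regs[0] in source_registers(curr_op, curr_regs):
            hazards += 1
    return hazards


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]

original_hazards = count_load_use_hazards(original)
reordered_hazards = count_load_use_hazards(reordered)
```

In the original order the first sw follows the lw of $t2 directly; after reordering, an independent instruction sits between them.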

Peer Instruction

A. Thanks to pipelining, I have reduced the time it took me to wash my shirt.

B. Longer pipelines are always a win (since less work per stage & a faster clock).

C. We can rely on compilers to help us avoid data hazards by reordering instrs.

     ABC
  1: FFF
  2: FFT
  3: FTF
  4: FTT
  5: TFF
  6: TFT
  7: TTF
  8: TTT

Outline
  Overview
  A pipeline datapath: How to build?
  Pipelined control
  Data hazards
    Forwarding
    Stalls


Multiple instructions execute using pipelining

[Figure: multiple-clock-cycle diagram, time in clock cycles CC 1 to CC 7, for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through IM, Reg, ALU, DM, Reg]

How to combine these multiple datapaths?
  Shared units during one clock cycle
  Add buffers (registers) to hold data, as we do in the multi-cycle datapath

Divide single-cycle datapath into stages

[Figure: single-cycle datapath (PC, instruction memory, registers, sign extend, ALU, data memory, adders, multiplexers) divided into five stages: IF: instruction fetch; ID: instruction decode/register file read; EX: execute/address calculation; MEM: memory access; WB: write back. Two data flows are marked 1 and 2.]

Pipelined datapath with pipeline registers (Fig 6.11)

[Figure 6.11: the datapath with pipeline registers between the stages. Register widths: IF/ID 64 bits, ID/EX 128 bits, EX/MEM 97 bits, MEM/WB 64 bits.]
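The four widths can be checked by adding up the fields each pipeline register has to hold. The slide gives only the totals, so the field lists below are an assumption based on the basic datapath (before control bits and the destination register number are added); the dictionary and function names are illustrative.

```python
WORD = 32  # width of a MIPS word in bits

# Assumed contents of each pipeline register in the basic datapath.
PIPELINE_REGISTER_FIELDS = {
    "IF/ID": {"pc_plus_4": WORD, "instruction": WORD},
    "ID/EX": {
        "pc_plus_4": WORD,
        "read_data_1": WORD,
        "read_data_2": WORD,
        "sign_extended_immediate": WORD,
    },
    "EX/MEM": {
        "branch_target": WORD,
        "alu_zero": 1,
        "alu_result": WORD,
        "read_data_2": WORD,
    },
    "MEM/WB": {"memory_read_data": WORD, "alu_result": WORD},
}


def register_width(name):
    """Total number of bits stored in the named pipeline register."""
    return sum(PIPELINE_REGISTER_FIELDS[name].values())


widths = {name: register_width(name) for name in PIPELINE_REGISTER_FIELDS}
```

The odd 97-bit total for EX/MEM comes from the single Zero bit of the ALU next to three 32-bit values.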

Example: (1) instruction fetch

[Figure: pipelined datapath with lw in the instruction fetch stage, followed by lw in the instruction decode stage]

Example: (2) instruction decode

[Figure: pipelined datapath with lw in the instruction fetch stage, followed by lw in the instruction decode stage]

Example: (3) execute

[Figure: pipelined datapath with lw in the execution stage]

Example: (4) memory access

[Figure (Patterson, Figure 6.15): pipelined datapath with lw in the memory stage, followed by lw in the write-back stage]

Example: (5) write back

[Figure (Patterson, Figure 6.15): pipelined datapath with lw in the memory stage, followed by lw in the write-back stage]

Is the register number used for the write-back correct?

Preserve the Destination Register number

[Figure: corrected pipelined datapath. A 5-bit buffer is added to the pipeline registers so that the write register number travels with the instruction to the write-back stage.]

Trace another pipelining

Tracing helps you understand how pipelining works!

Single-clock-cycle pipeline diagram example:

  lw  $10, 20($1)
  sub $11, $2, $3

[Figures: single-clock-cycle pipeline diagrams of the pipelined datapath for clocks 1 to 6]

  Clock 1: lw in instruction fetch
  Clock 2: sub in instruction fetch, lw in instruction decode
  Clock 3: sub in instruction decode, lw in execution
  Clock 4: sub in execution, lw in memory
  Clock 5: sub in memory, lw in write back
  Clock 6: sub in write back

Multiple-clock-cycle pipeline diagram


Forwarding Stalls


Label the control lines (Fig 6.22)

[Figure 6.22: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with the control lines labeled: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg. Instruction[20-16] and Instruction[15-0] are carried into the EX stage.]

Note: the PC and the pipeline registers need no write-control lines because they are written during each clock cycle.

Divide the control lines into groups

Fig. 6.25

How to set control lines at each stage?
  Extend the pipeline registers to include control information (store the control signals in the pipeline registers)
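The idea of carrying control information in the pipeline registers can be sketched as follows. The bit strings for lw and for R-format instructions are the ones printed in the trace figures of these slides; the order of the signals inside each group (EX = RegDst, ALUOp1, ALUOp0, ALUSrc; M = Branch, MemRead, MemWrite; WB = RegWrite, MemtoReg) is assumed from the textbook, and the function name is invented.

```python
# Control values per group, generated once in the ID stage.
# Signal order inside each group is an assumption (see the text above).
CONTROL = {
    "lw":       {"EX": "0001", "M": "010", "WB": "11"},
    "R-format": {"EX": "1100", "M": "000", "WB": "10"},
}


def propagate(instruction_class):
    """Show which control groups each pipeline register still carries."""
    control = CONTROL[instruction_class]
    id_ex = dict(control)                              # EX, M and WB groups
    ex_mem = {group: id_ex[group] for group in ("M", "WB")}  # EX group used up
    mem_wb = {"WB": ex_mem["WB"]}                      # M group used up
    return {"ID/EX": id_ex, "EX/MEM": ex_mem, "MEM/WB": mem_wb}


lw_trace = propagate("lw")
```

Each stage consumes its own group and passes the rest along, so the control bits arrive in the same clock cycle as the data they steer.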

Propagation of control (Fig.6.26)

[Figure 6.26: pipelined datapath with the control unit. Control generates the EX, M, and WB groups in the ID stage; ID/EX carries EX, M, and WB; EX/MEM carries M and WB; MEM/WB carries WB.]

Fig. 6.27 (old book: Fig 6.31)

[Figures: single-clock-cycle diagrams of the pipelined datapath with control, clocks 1 to 6, for the sequence lw $10, 20($1); sub $11, $2, $3; and $12, $4, $5; or $13, $6, $7; add $14, $8, $9]

           IF                 ID                 EX             MEM            WB
  Clock 1  lw $10, 20($1)     before<1>          before<2>      before<3>      before<4>
  Clock 2  sub $11, $2, $3    lw $10, 20($1)     before<1>      before<2>      before<3>
  Clock 3  and $12, $4, $5    sub $11, $2, $3    lw $10, ...    before<1>      before<2>
  Clock 4  or $13, $6, $7     and $12, $4, $5    sub $11, ...   lw $10, ...    before<1>
  Clock 5  add $14, $8, $9    or $13, $6, $7     and $12, ...   sub $11, ...   lw $10, ...
  Clock 6  after<1>           add $14, $8, $9    or $13, ...    and $12, ...   sub $11, ...

Control values shown in the figures: for lw, EX = 0001, M = 010, WB = 11; for sub, EX = 1100, M = 000, WB = 10.


Forwarding Stalls


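Example: data hazards

[Figure: multiple-clock-cycle diagram, CC 1 to CC 9, for the sequence below; each instruction passes through IM, Reg, ALU, DM, Reg]

  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

  Clock cycle:           CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
  Value of register $2:  10    10    10    10    10/-20  -20   -20   -20   -20

The and and or instructions read $2 too early (marked "?"). For add, reading and writing $2 in the same cycle (CC 5) is OK.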

Solution 1: software

Insert independent instructions or nops (no operation)

  sub $2, $1, $3
  nop
  nop
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
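The software fix can be sketched as a small pass that pads the code with nops. It assumes no forwarding and a register file that is written in the first half of a cycle and read in the second half, so a consumer must sit at least three instruction slots after its producer. The tuple format and the function name are made up for this example.

```python
MIN_DISTANCE = 3  # slots between a register write and the first safe read


def insert_nops(program):
    """program: list of (text, destination register or None, source registers)."""
    scheduled = []
    last_write = {}  # register -> position in scheduled of its latest writer
    for text, dest, sources in program:
        for reg in sources:
            if reg in last_write:
                # Pad until the producer is far enough ahead of this reader.
                while len(scheduled) - last_write[reg] < MIN_DISTANCE:
                    scheduled.append("nop")
        if dest is not None and dest != "$0":
            last_write[dest] = len(scheduled)
        scheduled.append(text)
    return scheduled


program = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None, ["$15", "$2"]),
]
padded = insert_nops(program)
```

A real compiler would first try to move independent instructions into those slots and fall back to nops only when none are available.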

Hardware approach: forwarding

[Figure: the same instruction sequence (sub, and, or, add, sw) in program execution order, with forwarding paths]

  Clock cycle:           CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
  Value of register $2:  10    10    10    10    10/-20  -20   -20   -20   -20
  Value of EX/MEM:       X     X     X     -20   X       X     X     X     X
  Value of MEM/WB:       X     X     X     X     -20     X     X     X     X

  1: EX hazard (forward from EX/MEM)
  2: MEM hazard (forward from MEM/WB)

[Figure: a. No forwarding: registers, ID/EX, ALU, EX/MEM, data memory, MEM/WB. b. With forwarding: a forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and drives ForwardA and ForwardB, the selects of the multiplexers at the ALU inputs.]

Control for forwarding

Set ForwardA and ForwardB

EX hazard:

  If ( EX/MEM.RegWrite                              // the previous instruction writes a register
  and (EX/MEM.RegisterRd <> 0)                      // exclude writes to $0
  and (EX/MEM.RegisterRd = ID/EX.RegisterRs))       // the register it writes is the one needed now
      ForwardA = 10

  If ( EX/MEM.RegWrite                              // the previous instruction writes a register
  and (EX/MEM.RegisterRd <> 0)                      // exclude writes to $0
  and (EX/MEM.RegisterRd = ID/EX.RegisterRt))       // the register it writes is the one needed now
      ForwardB = 10
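The EX-hazard conditions translate almost directly into code. The sketch below also includes the MEM-hazard case (forward from MEM/WB, select value 01), which the slide names in its figure but does not spell out; that part follows the usual textbook condition and should be read as an assumption. The function and parameter names are illustrative.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_reg_write, ex_mem_rd,
                    mem_wb_reg_write, mem_wb_rd):
    """Return (ForwardA, ForwardB) as 2-bit multiplexer selects."""

    def select(source):
        # EX hazard: the instruction one ahead writes the register we need.
        if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == source:
            return 0b10
        # MEM hazard (assumed condition): the instruction two ahead writes it.
        # Checked second so that the most recent result wins.
        if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == source:
            return 0b01
        return 0b00  # no hazard: use the value read from the register file

    return select(id_ex_rs), select(id_ex_rt)


# and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM.
and_selects = forwarding_unit(2, 5, True, 2, False, 0)

# or $13, $6, $2 in EX while and is in MEM and sub is in WB.
or_selects = forwarding_unit(6, 2, True, 12, True, 2)
```

Register $0 is excluded because it always reads as zero, so a "result" destined for it must never be forwarded.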


Forwarding Stalls


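Unsolved data hazards

[Figure: multiple-clock-cycle diagram, up to CC 9, for the sequence below. The and needs $2 in its EX stage before lw has read it from data memory (marked "?"), so forwarding alone cannot solve this hazard.]

  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7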

Stall

[Figure: the same sequence (lw, and, or, add, slt), time in clock cycles CC 1 to CC 10, with a bubble inserted after lw: the and and the instructions behind it are delayed by one cycle]

Hazard detection unit

When to stall?

  If ( ID/EX.MemRead                                  // the previous instruction is a load
  and (( ID/EX.RegisterRt = IF/ID.RegisterRs)
    or ( ID/EX.RegisterRt = IF/ID.RegisterRt)) )
      Stall the pipeline
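The stall condition is small enough to write as a single predicate. This is a sketch with invented names; register numbers are plain integers, and the inputs correspond to the ID/EX and IF/ID fields named in the condition above.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load still in EX."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


# lw $2, 20($1) is in EX while and $4, $2, $5 is in ID: stall one cycle.
stall_for_and = must_stall(True, 2, 2, 5)

# The same load followed by an instruction that does not use $2: no stall.
stall_for_slt = must_stall(True, 2, 6, 7)
```

When the predicate is true, the hardware holds the PC and IF/ID and sends a bubble down the pipeline, after which forwarding can supply the loaded value.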


Forwarding Stalls

Branch hazards Advanced pipelining

Advanced technology
  Single-cycle implementation
  Multi-cycle implementation
  Pipelining (instruction-level parallelism)

* How to build fast processors?

1. Increase the depth of the pipeline: superpipelining

2. Replicate the internal components of the computer: multiple issue
   Dynamic multiple issue (hardware): superscalar
   Static multiple issue (compiler): VLIW
   How to schedule?

Superpipelining (longer pipelines)

Ideal improvement of pipelining:

  Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipe stages

* Divide an instruction into 8 or more stages
* How to balance the length of each stage?
  => The clock cycle is set by the slowest stage
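Why balance matters can be seen from a short calculation. The per-stage latencies below are assumed values (the slides give only the 8 ns total for lw and the 2 ns stage time), chosen to be consistent with those two numbers; the names are illustrative.

```python
# Assumed per-stage latencies in ns for lw; they sum to the 8 ns in the figure.
LW_STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}

# A hypothetical perfectly balanced five-stage pipeline for comparison.
BALANCED_STAGE_NS = {"IF": 2, "ID": 2, "EX": 2, "MEM": 2, "WB": 2}


def clock_cycle_ns(stage_ns):
    """The clock must be long enough for the slowest stage."""
    return max(stage_ns.values())


def steady_state_speedup(stage_ns):
    """Non-pipelined instruction time divided by the pipelined clock cycle."""
    return sum(stage_ns.values()) / clock_cycle_ns(stage_ns)


lw_speedup = steady_state_speedup(LW_STAGE_NS)
balanced_speedup = steady_state_speedup(BALANCED_STAGE_NS)
```

Only the balanced pipeline reaches a speedup equal to the number of stages; with unequal stages the faster ones sit idle for part of every cycle.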




Multiple-issue pipeline

Two problems:
  How to package instructions into issue slots
    Static issue processor: by the compiler
    Dynamic issue processor: at runtime by hardware
  Dealing with data and control hazards
    Static issue processor: by the compiler
    Dynamic issue processor: at runtime by hardware

Static multiple issue

[Figure: pipeline timing diagram (time axis 2 to 10) showing instructions grouped into issue packets]

• The mix of instructions that can be initiated in a given clock cycle is usually restricted
  An issue packet can be viewed as a single instruction allowing several operations: a very long instruction word (VLIW)

Ex. Two-issue MIPS
  2 instructions are issued per clock cycle
    One for integer ALU or branch
    The other for load or store

* One slot can be a nop
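The slot restriction of the two-issue MIPS example can be illustrated with a toy packer. It pairs neighbouring instructions only when one is an ALU/branch operation and the other a load/store, and fills the empty slot with a nop otherwise. It deliberately ignores data and control hazards, which a real compiler would have to check before pairing; all names and the sample program are made up.

```python
MEMORY_OPS = {"lw", "sw"}


def is_memory_op(instruction):
    """True for loads and stores, which go in the second issue slot."""
    return instruction.split()[0] in MEMORY_OPS


def pack_two_issue(program):
    """Greedy in-order packing into (ALU/branch slot, load/store slot) pairs."""
    packets = []
    i = 0
    while i < len(program):
        first = program[i]
        second = program[i + 1] if i + 1 < len(program) else None
        if second is not None and is_memory_op(first) != is_memory_op(second):
            # One instruction of each class: both fit in the same packet.
            alu, mem = (second, first) if is_memory_op(first) else (first, second)
            i += 2
        elif is_memory_op(first):
            alu, mem = "nop", first
            i += 1
        else:
            alu, mem = first, "nop"
            i += 1
        packets.append((alu, mem))
    return packets


program = [
    "add $4, $5, $6",
    "lw $3, 300($0)",
    "or $7, $8, $9",
    "and $12, $4, $5",
    "sw $15, 100($2)",
]
packets = pack_two_issue(program)
```

Five instructions fit into three packets here; the nop in the second packet shows the cost of the restricted instruction mix.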

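A static two-issue datapath

[Figure: static two-issue datapath: PC, instruction memory delivering 64 bits (two instructions) per fetch, registers, two sign-extend units, an ALU for integer/branch operations, a second ALU computing the address for data memory, and multiplexers. The duplicated units, marked 1 and 2, avoid structural hazards.]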

Superscalar: Dynamic pipeline scheduling

Dynamic pipeline scheduling
  Hardware support for reordering the order of instruction execution so as to avoid stalls

Ex. unsolved pipeline stall

  lw   $t0, 20($s2)
  addu $t1, $t0, $t2
  sub  $s4, $s4, $t3
  slti $t5, $s4, 20

Stall: addu waits for the lw result (memory is slow)

Dynamic pipeline scheduling

[Figure: dynamically scheduled pipeline, in three parts marked 1, 2, 3.
 Instruction fetch and decode unit: in-order issue.
 Reservation stations (buffers to hold operands and the operation) feeding the functional units (integer, integer, ..., floating point, load/store): out-of-order execute.
 Commit unit (reorder buffer; decides when it is safe to put the result back): in-order commit, in the order of program execution.]

dynamic pipelining

Static pipelining
Advantages of dynamic pipelining
  Hide memory latency
  Avoid the stalls that the compiler could not schedule
  Speculatively execute instructions (combined with branch prediction) while waiting for hazards to be resolved

Fallacies and pitfalls
  Pipelining is easy
  Pipelining ideas can be implemented independent of technology
    As the number of transistors on a chip increases, additional hardware makes sense
    Ex. extra hardware in dynamic pipelining
  Increasing the depth of pipelining always increases performance

Depth of pipelining

[Figure: relative performance (0.0 to 3.0) versus pipeline depth (1, 2, 4, 8, 16); FP pipelining, 1986]

3 factors:
  Data hazards: a large percentage of cycles become stalls
  Control hazards: slower branches
  Pipeline register overhead
