4304-6-pipe

The University of Texas at Dallas Erik Jonsson School of Engineering & Computer Science

© C. D. Cantrell (12/1999)

PIPELINING: A CONTINUATION OF PROCESSOR DESIGN

• We found that the single-cycle implementation wastes time

! All instructions take as long as the instruction with the longest delay (lw)

• In the multicycle implementation:

! The clock period is much shorter than in the single-cycle implementation

! Instructions take only as many clock periods as they need

! BUT: Each functional unit is used only once or twice in executing an instruction

◦ We need an implementation in which each functional unit is busy in every clock period

◦ This is possible if we cut the execution of an instruction into stages, and then overlap the execution of different instructions


© C. D. Cantrell (09/2011)

STEPS IN EXECUTING AN INSTRUCTION

Instruction fetch (all instructions):
    IR = M[PC]
    PC = PC + 4

Instruction decode, register fetch (all instructions):
    A = Reg[IR[25–21]]
    B = Reg[IR[20–16]]
    ALUOut = PC + (sign-extend(IR[15–0]) << 2)

Execution, address computation, branch/jump completion:
    R-type:            ALUOut = A op B
    Memory reference:  ALUOut = A + sign-extend(IR[15–0])
    Branches:          If A == B then PC = ALUOut
    Jumps:             PC = PC[31–28] concatenated with (IR[25–0] << 2)

Memory access or R-type completion:
    R-type:            Reg[IR[15–11]] = ALUOut
    Load:              MDR = M[ALUOut]
    Store:             M[ALUOut] = B

Memory read completion:
    Load:              Reg[IR[20–16]] = MDR



PIPELINING

• In a pipelined computer architecture, a single processor can execute several instructions concurrently, reducing the CPI

. Execution of one instruction uses several hardware functional units (instruction memory, register file, ALU, data memory, etc.)

. The functional units are organized into stages

◦ Execution at each stage takes 1 clock period

◦ Stages are separated by clock-controlled pipeline registers that preserve the state of execution for the duration of a clock period

. The pipeline is subject to hazards

◦ Data hazards: Write/read conflicts or timing problems

◦ Control hazards: Exceptions and branches

• The MIPS R2000 pipeline design strongly influenced the design of all subsequent processors



PIPELINING: PLUSES AND MINUSES

• What makes pipelining easy in the MIPS ISA:

. All instructions are the same length

. There are only a few instruction formats

. Memory operands occur only in loads and stores

• What makes pipelining hard in any ISA:

. Structural hazards (e.g., contention for the same functional unit)

. Control (branch & exception) hazards

. Data hazards (e.g., trying to read a register before it’s written)

We will build a simple pipeline to illustrate these issues

• Pipelining is even more difficult in modern general-purpose microprocessors

. Exception handling is a challenge

. Performance improvements such as simultaneous instruction issue, out-of-order execution, etc., create lots of complications



PIPELINE DESIGN APPROACH

• Begin with the multicycle implementation

! Different functional units are all executing the same instruction, although the units are active in different clock periods

! Control information does not need to be stored in the temporary registers

• Identify the changes that need to be made in the pipelined design

! Different functional units are executing different instructions

! All information needed for execution of a given instruction must propagate through the pipeline with the instruction

! Control information must be stored between stages, because the control signals are different for different instructions

◦ Control design is a source of complexity in pipeline design (think about what happens when a branch is taken)

! Results of execution may differ from the multicycle implementation

◦ Data hazards are another source of complexity


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

MULTICYCLE DATAPATH AND CONTROL

[Figure: multicycle datapath and control. The PC, a single memory, the instruction register, the memory data register, the register file, the A and B registers, the sign-extend and shift-left-2 units, the ALU with its ALU control, and ALUOut are steered by the control unit, which decodes Op[5–0] (Instruction[31–26]) and drives the outputs PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, and RegDst. The jump address [31–0] is formed from PC[31–28] and Instruction[25–0] shifted left by 2.]


SINGLE-CYCLE DATAPATH

[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders, and multiplexers) divided into five stages:]

IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition

SEQUENTIAL vs. PIPELINED EXECUTION

[Figure: execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) in program order, on a time axis in ps. Top: sequential execution, with a new instruction starting every 800 ps. Bottom: pipelined execution, with each instruction passing through instruction fetch, Reg, ALU, data access, and Reg in stages of 200 ps, and a new instruction starting every 200 ps.]
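The totals in the figure can be checked with a few lines of arithmetic. The sketch below assumes only the figure's numbers (800 ps per instruction without pipelining, five stages of 200 ps with pipelining); the function and constant names are illustrative and do not come from the slides.

```python
# Timing of the three-lw example, using the numbers shown in the figure.
SINGLE_CYCLE_TIME_PS = 800   # time per instruction without pipelining
STAGE_TIME_PS = 200          # pipelined clock period
NUM_STAGES = 5               # instruction fetch, Reg, ALU, data access, Reg


def sequential_time_ps(num_instructions):
    """Each instruction starts only after the previous one has finished."""
    return num_instructions * SINGLE_CYCLE_TIME_PS


def pipelined_time_ps(num_instructions):
    """The first instruction needs NUM_STAGES periods; every later
    instruction finishes one clock period after its predecessor."""
    return (NUM_STAGES + (num_instructions - 1)) * STAGE_TIME_PS


print(sequential_time_ps(3), pipelined_time_ps(3))  # 2400 1400
```

For only three instructions the gain is well below the number of stages, because filling and draining the pipeline dominates.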



PIPELINING (2)

• Pipeline speedup:

! A pipelined processor with s stages can execute n instructions in

$$ET_P = s + (n - 1) \text{ clock periods}$$

(assuming no hazards)

! A serial processor executes the same n instructions in

$$ET_S = ns \text{ clock periods}$$

! The ideal pipeline speedup equals the number of stages:

$$S_P = \frac{ET_S}{ET_P} = \frac{ns}{s + (n - 1)} \xrightarrow{\; n \gg s \;} s$$

• Amdahl’s law applies to pipelining
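To see how quickly the speedup approaches its limit, the formulas for the pipelined and serial execution times can be evaluated directly. This is a minimal sketch; the function names are illustrative, and hazards are ignored, as in the formulas.

```python
def pipelined_periods(s, n):
    """Clock periods for n instructions on an s-stage pipeline, no hazards."""
    return s + (n - 1)


def serial_periods(s, n):
    """Clock periods for the same n instructions executed one at a time."""
    return n * s


def speedup(s, n):
    """Ideal pipeline speedup: serial time divided by pipelined time."""
    return serial_periods(s, n) / pipelined_periods(s, n)


for n in (1, 5, 100, 10_000):
    print(n, round(speedup(5, n), 3))
```

With s = 5 the speedup is exactly 1 for a single instruction and stays below 5 for every finite n.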



PROGRAMMING IMPLICATIONS OF PIPELINING

• Avoid function or subprogram calls in an inner loop

! Jumps force the pipeline to be flushed

• Avoid recursion in an inner loop

! Recursion on the elements of an array generally causes data hazards because the value of v[n] has not been written before it is needed for the computation of v[n+1]

• Avoid scalar temporary variables in an inner loop

! Reading a memory-resident scalar variable may cause a data hazard

• Avoid case and switch statements in an inner loop

! Conditional branches cause control hazards, and the use of a jump table may cause data hazards



MIPS PIPELINES (1)

• MIPS R2000 integer unit pipeline stages (Patterson & Hennessy, Chapter 6)

1. Instruction Fetch (IF)

2. Instruction Decode (ID) and Register Fetch

3. Execute (EX or ALU)

! ALU operations, condition evaluation, address computation

4. Memory access (MEM)

5. Write back (WB) to register file

Clock periods:   1    2    3    4     5
Stage:           IF   ID   EX   MEM   WB



R4000 PIPELINE

[Figure: the eight-stage pipeline of the R4000. The stages IF, IS, RF, EX, DF, DS, TC, and WB are drawn over the functional units instruction memory, Reg, ALU, data memory, and Reg.]



MIPS PIPELINES (2)

• MIPS R2000 floating-point unit pipeline stages

1. Instruction Fetch (IF)

2. Register Fetch and Instruction Decode (RD)

! FPU decodes instruction on bus to see if it’s floating-point

! FPU reads data from its registers

3. Execute (EX or ALU)

4. Memory access (MEM)

5. Exception processing (stage called WB for correspondence with integer pipeline)

6. Write back (FWB)

Clock periods:   1    2    3    4     5    6
Stage:           IF   ID   EX   MEM   WB   FWB




PIPELINED EXECUTION IN SINGLE-CYCLE DATAPATH

[Figure: pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) over clock cycles CC 1 to CC 7. Each instruction passes through IM, Reg, ALU, DM, and Reg, starting one clock cycle after its predecessor.]

SINGLE-CYCLE DATAPATH WITH PIPELINE REGISTERS

[Figure: the single-cycle datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages. Each pipeline register is labeled with an input side and an output side.]

Because the state of a D flip-flop changes only on clock edges, new data can be asserted on the inputs of the pipeline registers while the data written in the previous clock period is still valid on the outputs



MASTER-SLAVE D FLIP-FLOP

• The master latch (on the left) receives the D and clock (C) inputs

. When the clock is asserted, the Q output of the master latch follows the data (D)

. When the clock is deasserted, the master latch is closed, but the second (slave) latch is open

◦ The output of the slave latch follows its input, which is the output of the master latch

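[Figure: two D latches in series. The master latch receives D and C; its Q output drives the D input of the slave latch. The outputs of the slave latch are the Q and Q-bar outputs of the flip-flop.]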
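The behavior described above can be modeled in a few lines. The sketch below is a behavioral model only (no gate delays); the class and method names are illustrative. It assumes, as the bullets state, that the master latch is open while the clock is asserted and the slave latch is open while it is deasserted, so the flip-flop output changes when the clock falls.

```python
class DLatch:
    """Level-sensitive latch: the output follows D while enabled, else holds."""

    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q


class MasterSlaveDFlipFlop:
    """Two latches in series, enabled on opposite clock levels."""

    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def apply(self, d, clock):
        # Master is open while the clock is asserted (1).
        master_q = self.master.update(d, clock == 1)
        # Slave is open while the clock is deasserted (0).
        return self.slave.update(master_q, clock == 0)


ff = MasterSlaveDFlipFlop()
print(ff.apply(d=1, clock=1))  # master captures 1, output still 0
print(ff.apply(d=0, clock=0))  # master closed, slave passes the stored 1
```

The second call shows why new data on the input does not disturb the output: the value presented while the clock is deasserted is ignored by the closed master latch.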



COMBINATIONAL LOGIC AND STATE ELEMENTS

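[Figure: state element 1 feeds combinational logic, whose outputs are written into state element 2; the path from one state element to the next spans one clock cycle.]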

• Every state element has 2 control inputs: Clock signal and write enable


STAGE 1 OF A LOAD INSTRUCTION

[Figure: pipelined datapath with the instruction fetch stage active for lw]

[Figure: pipelined datapath with the instruction decode stage active for lw]

[Figure: pipelined datapath with the execution stage active for lw]

[Figure: pipelined datapath with the memory stage active for lw]

[Figure: pipelined datapath with the write back stage active for lw]


STAGE 3 OF A STORE INSTRUCTION

[Figure: pipelined datapath with the execution stage active for sw]

[Figure: pipelined datapath with the memory stage active for sw]

[Figure: pipelined datapath with the write back stage active for sw]



DATAPATH MODIFICATIONS FOR PIPELINING

• The number of the register that an instruction must write to is read in the ID stage

. The name of the signal is WriteReg

• Consider two instructions:

lw $10, 20($1)
sub $11, $2, $3

. WriteReg signal values are 10 (for lw) and 11 (for sub)

. The WriteReg signal is read only in the WB stage

. The sub’s ID stage modifies WriteReg before lw can read it

. Therefore the value of the WriteReg signal is part of the instruction’s state, and must be passed along in pipeline registers as the instruction executes
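The argument above can be replayed clock by clock. The sketch below is a simplified model, not the actual datapath: it assumes instruction k (counting from 0) is in ID during clock period k + 2 and in WB during clock period k + 5, and it compares a single shared WriteReg wire with a copy carried through ID/EX, EX/MEM, and MEM/WB. All names are illustrative.

```python
PROGRAM = [("lw $10, 20($1)", 10), ("sub $11, $2, $3", 11)]


def write_targets(carry_in_pipeline_registers):
    """Return {instruction: register number it sees in its WB stage}."""
    last_decoded = None          # value on a single shared WriteReg wire
    pipe = [None, None, None]    # WriteReg copies in ID/EX, EX/MEM, MEM/WB
    seen = {}
    n = len(PROGRAM)
    for clock in range(1, n + 5):
        in_id = clock - 2        # index of the instruction in ID
        in_wb = clock - 5        # index of the instruction in WB
        if 0 <= in_wb < n:
            text = PROGRAM[in_wb][0]
            seen[text] = pipe[2] if carry_in_pipeline_registers else last_decoded
        decoded = PROGRAM[in_id][1] if 0 <= in_id < n else None
        if decoded is not None:
            last_decoded = decoded
        # Clock edge: every pipeline register latches the value to its left.
        pipe = [decoded, pipe[0], pipe[1]]
    return seen


print(write_targets(False))  # lw would write register 11: wrong
print(write_targets(True))   # lw writes 10, sub writes 11
```

Without the pipelined copy, lw reaches WB after sub has been decoded and would write its loaded value into register 11.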

DATAPATH MODIFIED TO HANDLE A LOAD

[Figure: pipelined datapath in which WriteReg is carried through the ID/EX, EX/MEM, and MEM/WB pipeline registers and fed back to the write register input of the register file.]

PIPELINE STAGES USED BY A LOAD INSTRUCTION

[Figure: pipelined datapath showing the portions of all five stages that a load instruction uses.]


TWO REPRESENTATIONS OF PIPELINED EXECUTION

[Figure: two diagrams of lw $10, 20($1) followed by sub $11, $2, $3 over clock cycles CC 1 to CC 6. The first labels each stage by the resource it uses (IM, Reg, ALU, DM, Reg); the second labels each stage by name (instruction fetch, instruction decode, execution, data access, write back).]



PIPELINED EXECUTION OF TWO INSTRUCTIONS

• Exercise: Show the signal values in the datapath in the pipelined execution of the instructions

lw $10, 20($1)
sub $11, $2, $3

in each of the following six slides

. Assume the following register and memory contents:

($1) = 0x1000 0000
(M[0x1000 0014]) = 0x7fff fffc
($2) = 0x0000 000e
($3) = 0x0000 0008

. Also show the values of the WriteReg signal in each stage
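As a check on the values that should appear in the exercise, the two instructions can be evaluated functionally, ignoring the pipeline. This sketch assumes 32-bit wraparound arithmetic and uses plain dictionaries for the register file and memory; the variable names are illustrative.

```python
MASK = 0xFFFF_FFFF  # 32-bit registers

regs = {1: 0x1000_0000, 2: 0x0000_000E, 3: 0x0000_0008}
memory = {0x1000_0014: 0x7FFF_FFFC}

# lw $10, 20($1): the ALU adds the base register and the immediate.
lw_address = (regs[1] + 20) & MASK
regs[10] = memory[lw_address]

# sub $11, $2, $3: the ALU subtracts the second source from the first.
regs[11] = (regs[2] - regs[3]) & MASK

print(hex(lw_address), hex(regs[10]), hex(regs[11]))
```

The address is the ALU result of lw in its EX stage, the loaded word is what the data memory returns in MEM, and the difference is the ALU result of sub.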

CLOCK PERIOD 1

[Figure: pipelined datapath. Instruction fetch: lw $10, 20($1)]

CLOCK PERIOD 2

[Figure: pipelined datapath. Instruction decode: lw $10, 20($1). Instruction fetch: sub $11, $2, $3]

CLOCK PERIOD 3

[Figure: pipelined datapath. Execution: lw $10, 20($1). Instruction decode: sub $11, $2, $3]

CLOCK PERIOD 4

[Figure: pipelined datapath. Memory: lw $10, 20($1). Execution: sub $11, $2, $3]

CLOCK PERIOD 5

[Figure: pipelined datapath. Write back: lw $10, 20($1). Memory: sub $11, $2, $3]

CLOCK PERIOD 6

[Figure: pipelined datapath. Write back: sub $11, $2, $3]



PIPELINE STAGES: DATAPATH

IF (all instructions):
    IF/ID Instruction = IM[PC]
    IF/ID PC = PC + 4

ID (all instructions):
    ID/EX PC = IF/ID PC
    ID/EX A = Reg[IF/ID Instruction[25–21]]
    ID/EX B = Reg[IF/ID Instruction[20–16]]
    ID/EX Immediate = sign-extend(IF/ID Instruction[15–0])

EX:
    R-type:            EX/MEM ALUOut = A op B
                       WriteReg = ID/EX Inst[15–11]
    Memory reference:  EX/MEM ALUOut = A + ID/EX Immediate
                       WriteReg = ID/EX Inst[20–16]
                       EX/MEM B = ID/EX B
    Branches:          EX/MEM PC = ID/EX PC + ((ID/EX Imm) << 2)

MEM:
    R-type:            MEM/WB ALUOut = EX/MEM ALUOut
    Memory reference:  Address = EX/MEM ALUOut
                       Load: MEM/WB ReadData = DM[Address]
                       Store: DM[Address] = EX/MEM B
    R-type and memory reference:  MEM/WB WriteReg = EX/MEM WriteReg
    Branches:          PC = EX/MEM PC

WB:
    R-type:            Reg[MEM/WB WriteReg] = MEM/WB ALUOut
    Load:              Reg[MEM/WB WriteReg] = MEM/WB ReadData

PIPELINED DATAPATH WITH CONTROL SIGNALS

[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB and the control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg. The pipeline registers carry PC, A, B, Imm, ALUOut, and WriteReg from stage to stage.]



SETTINGS OF CONTROL LINES

              Execution/address calculation     Memory access                Write back
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite    RegWrite  MemtoReg
R-type          1       1       0       0         0       0        0           1         0
lw              0       0       0       1         0       1        0           1         1
sw              d       0       0       1         0       0        1           0         d
beq             d       0       1       0         1       0        0           0         d

(d = don't care)
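The table above can be written down as data and split into the three groups of control lines used by the execution, memory access, and write back stages. This is only a sketch: the dictionary layout and the function name are illustrative, and "d" stands for a don't-care value as in the table.

```python
FIELDS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc",   # execution/address calc.
          "Branch", "MemRead", "MemWrite",           # memory access
          "RegWrite", "MemtoReg")                    # write back

# One character per field, in the order of FIELDS.
SETTINGS = {
    "R-type": "110000010",
    "lw":     "000101011",
    "sw":     "d0010010d",
    "beq":    "d0101000d",
}


def stage_groups(kind):
    """Split an instruction type's control bits by the stage that uses them."""
    bits = SETTINGS[kind]
    return {"EX": bits[0:4], "M": bits[4:7], "WB": bits[7:9]}


def named_settings(kind):
    """Return the settings as a {signal name: value} dictionary."""
    return dict(zip(FIELDS, SETTINGS[kind]))


print(stage_groups("lw"))
print(named_settings("beq")["Branch"])
```

Grouping the bits this way mirrors how each stage consumes only its own group and passes the remaining groups on.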



PIPELINED CONTROL

• The control signals for an instruction are determined in the ID stage

. The next instruction’s ID stage asserts new values of the control signals

. The current instruction’s control signals must be preserved for all stages after ID

. Control signals are part of the state of the instruction, and therefore must be passed along from stage to stage in pipeline registers, just like data

. Some instruction fields (such as Immediate) must be preserved until they are needed in later stages

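CONTROL LINES FOR THE THREE FINAL STAGES

[Figure: the control unit decodes the instruction held in IF/ID and writes three groups of control lines into ID/EX. The EX group is used in the execution stage; the M and WB groups are copied into EX/MEM; the WB group is copied into MEM/WB.]

EX: RegDst, ALUOp, ALUSrc
M: Branch, MemRead, MemWrite
WB: RegWrite, MemtoReg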

PIPELINED DATAPATH WITH CONTROL LOGIC AND SIGNALS

[Figure: pipelined datapath with the control unit and the EX, M, and WB control fields stored in the ID/EX, EX/MEM, and MEM/WB pipeline registers, alongside PC, A, B, Imm, and WriteReg.]



PIPELINED EXECUTION OF FIVE INSTRUCTIONS

• We’ll follow what happens in the instruction sequence

[40000024] lw $10, 20($1)
[40000028] sub $11, $2, $3
[4000002c] and $12, $4, $5
[40000030] or $13, $6, $7
[40000034] add $14, $8, $9

. For each clock period, note the values of the following signals in the ID/EX, EX/MEM, and MEM/WB pipeline registers:

◦ WB: RegWrite, MemtoReg

◦ M: Branch, MemRead, MemWrite

◦ EX: RegDst, ALUOp, ALUSrc
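The bookkeeping asked for here can be sketched as a small simulation. It assumes that the instructions before and after the sequence behave like bubbles with all control lines at 0, that instruction k (counting from 0) is in ID during clock period k + 2, and that each row reports register contents just after the clock edge that ends the period. The names are illustrative.

```python
# Control groups per instruction type, as (EX, M, WB) bit strings:
# EX = RegDst ALUOp1 ALUOp0 ALUSrc, M = Branch MemRead MemWrite,
# WB = RegWrite MemtoReg.
CONTROL = {"lw": ("0001", "010", "11"), "R-type": ("1100", "000", "10")}
BUBBLE = ("0000", "000", "00")

PROGRAM = [
    ("lw $10, 20($1)", "lw"),
    ("sub $11, $2, $3", "R-type"),
    ("and $12, $4, $5", "R-type"),
    ("or $13, $6, $7", "R-type"),
    ("add $14, $8, $9", "R-type"),
]


def trace(num_clocks):
    """Return rows (clock, ID/EX, EX/MEM, MEM/WB) of stored control groups."""
    id_ex, ex_mem, mem_wb = BUBBLE, BUBBLE[1:], BUBBLE[2:]
    rows = []
    for clock in range(1, num_clocks + 1):
        index = clock - 2  # instruction currently in ID
        in_range = 0 <= index < len(PROGRAM)
        decoded = CONTROL[PROGRAM[index][1]] if in_range else BUBBLE
        # Clock edge: each register latches from its left neighbor,
        # dropping the group that has just been used.
        mem_wb = ex_mem[1:]   # keeps WB only
        ex_mem = id_ex[1:]    # keeps M and WB
        id_ex = decoded       # EX, M, and WB
        rows.append((clock, id_ex, ex_mem, mem_wb))
    return rows


for row in trace(9):
    print(row)
```

After clock period 4 the MEM/WB register holds the WB group "11" of lw, so RegWrite and MemtoReg are both asserted when lw performs its write back in clock period 5.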

[Figures: nine copies of the pipelined datapath with control, one per clock period, annotated with the register numbers and the EX, M, and WB control bits held in each pipeline register. The instruction occupying each stage is:]

Clock   IF                 ID                 EX                MEM               WB
1       lw $10, 20($1)     before<1>          before<2>         before<3>         before<4>
2       sub $11, $2, $3    lw $10, 20($1)     before<1>         before<2>         before<3>
3       and $12, $4, $5    sub $11, $2, $3    lw $10, ...       before<1>         before<2>
4       or $13, $6, $7     and $12, $4, $5    sub $11, ...      lw $10, ...       before<1>
5       add $14, $8, $9    or $13, $6, $7     and $12, ...      sub $11, ...      lw $10, ...
6       after<1>           add $14, $8, $9    or $13, ...       and $12, ...      sub $11, ...
7       after<2>           after<1>           add $14, ...      or $13, ...       and $12, ...
8       after<3>           after<2>           after<1>          add $14, ...      or $13, ...
9       after<4>           after<3>           after<2>          after<1>          add $14, ...



DATA HAZARDS (1)

• Data hazards occur when the order of read and write actions is not the order in strictly sequential execution

! Hazards are named by the ordering in the program that must be preserved in the course of pipelined execution

◦ In the following, the order of execution should be i, then j

! RAW (read after write): j reads a source before i has written it

◦ j incorrectly gets the old value

◦ Most common kind of data hazard

! WAR (write after read): j writes a destination before i reads it

◦ i incorrectly gets the new value

! WAW (write after write): j should write an operand after i writes it, but the writes are performed in the wrong order, incorrectly leaving the value written by i

• Hazards limit pipeline speedup and complicate design
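The three definitions above reduce to set intersections between the registers that i and j read and write. The sketch below reports which kinds of hazard are possible for a pair of instructions in program order i, then j; whether a hazard actually occurs depends on the pipeline's timing. The function name and argument layout are illustrative.

```python
def possible_hazards(i_reads, i_writes, j_reads, j_writes):
    """Return the set of hazard names possible for i followed by j."""
    hazards = set()
    if set(i_writes) & set(j_reads):
        hazards.add("RAW")   # j must read after i writes
    if set(i_reads) & set(j_writes):
        hazards.add("WAR")   # j must write after i reads
    if set(i_writes) & set(j_writes):
        hazards.add("WAW")   # j must write after i writes
    return hazards


# i = sub $2,$1,$3 and j = and $12,$2,$5
print(possible_hazards(["$1", "$3"], ["$2"], ["$2", "$5"], ["$12"]))
```

For the pair shown, only a RAW hazard is possible, because j reads $2 and i writes it.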



DATA HAZARDS (2)

• RAW hazards are generated by the instructions

sub $2,$1,$3

and $12,$2,$5

or $13,$6,$2

add $14,$2,$2

sw $15,100($2)

[Figure: pipeline diagram over clock periods 1 to 9 for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). Each instruction passes through instruction fetch, Reg, ALU, data access, and Reg, starting one clock period after its predecessor.]



DATA HAZARDS (3)

• RAW hazards with i = sub $2,$1,$3 and source = $2 are generated by the instructions

sub $2,$1,$3
and $12,$2,$5
or $13,$6,$2
add $14,$2,$2
sw $15,100($2)

! The instruction j = and $12,$2,$5 reads $2 before i writes it

! The instruction j = or $13,$6,$2 reads $2 before i writes it

! The instruction j = add $14,$2,$2 reads $2 in the same clock period in which i writes it

◦ Generates a hazard if the register file’s outputs change only on the edge of the main processor clock
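The three cases above follow from simple clock-period arithmetic. The sketch below assumes the five-stage pipeline without stalls, in which instruction k (counting from 0) reads its registers in clock period k + 2 and writes its result in clock period k + 5; the names are illustrative.

```python
# (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2,$1,$3", "$2", ["$1", "$3"]),
    ("and $12,$2,$5", "$12", ["$2", "$5"]),
    ("or $13,$6,$2", "$13", ["$6", "$2"]),
    ("add $14,$2,$2", "$14", ["$2"]),
    ("sw $15,100($2)", None, ["$15", "$2"]),
]


def read_timing(program, producer_index=0):
    """Classify when each later reader of the producer's result reads it."""
    dest = program[producer_index][1]
    write_clock = producer_index + 5          # WB stage of the producer
    report = {}
    for k in range(producer_index + 1, len(program)):
        text, _, sources = program[k]
        if dest in sources:
            read_clock = k + 2                # ID stage of the reader
            if read_clock < write_clock:
                report[text] = "before the write"
            elif read_clock == write_clock:
                report[text] = "same clock period as the write"
            else:
                report[text] = "after the write"
    return report


for text, when in read_timing(PROGRAM).items():
    print(text, "reads $2", when)
```

Only sw reads $2 in a clock period after the write, so it is the only dependent instruction that is safe regardless of how the register file is clocked.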



SIMPLE QUESTIONS ABOUT TIMING

• How many clock periods are required to execute this program segment with various pipelined designs?

lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label # Assume branch not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...

. What happens during clock period 8?

. In what clock period does the addition of $t2 and $t3 actually take place?

• This segment takes 21 clock periods in the multicycle implementation
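The figure of 21 clock periods can be reproduced by counting, for each instruction class, the steps it uses in the multicycle table of execution steps given earlier in these notes: five for a load, four for a store or an R-type instruction, and three for a branch or jump. The dictionary and names below are illustrative.

```python
# Clock periods per instruction class in the multicycle implementation,
# counted from the steps each class uses.
MULTICYCLE_PERIODS = {"lw": 5, "sw": 4, "R-type": 4, "beq": 3, "j": 3}

# The program segment, by instruction class (add is an R-type instruction).
SEGMENT = ["lw", "lw", "beq", "R-type", "sw"]

total = sum(MULTICYCLE_PERIODS[kind] for kind in SEGMENT)
print(total)  # 21
```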



PROGRAM SEGMENT TIMING (1)

• Pipeline hazards in the same program segment as for multicycle

[Figure: pipeline diagram of lw $t2, 0($t3); lw $t3, 4($t3); beq $t2, $t3, Label; add $t5, $t2, $t3; sw $t5, 8($t3). Each instruction is drawn as IM, Reg, ALU, DM, Reg with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages.]




• In the multicycle implementation, this 5-instruction segment takes 21 clocks

• If there were no hazards, pipelined execution of the segment would take 9 clocks

• The hazard on register $t3 causes a delay of 3 clocks and the hazard on $t5 causes an additional delay of 3 clocks, making the pipelined execution time 15 clock periods, which is 71% of the multicycle execution time!

• Pipeline design improvements can reduce execution times
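The counts of 9 and 15 clock periods can be reproduced with a small scheduling sketch. It assumes an in-order five-stage pipeline with no forwarding, in which an instruction may read a register in its ID stage only in a clock period after the producer's WB stage, and in which a stalled instruction holds up all instructions behind it. The function and variable names are illustrative.

```python
# (text, destination register or None, source registers)
SEGMENT = [
    ("lw $t2, 0($t3)", "$t2", ["$t3"]),
    ("lw $t3, 4($t3)", "$t3", ["$t3"]),
    ("beq $t2, $t3, Label", None, ["$t2", "$t3"]),
    ("add $t5, $t2, $t3", "$t5", ["$t2", "$t3"]),
    ("sw $t5, 8($t3)", None, ["$t5", "$t3"]),
]


def schedule(program, stall_on_hazards=True):
    """Return rows (text, ID clock, EX clock, WB clock)."""
    wb_of_writer = {}   # register -> clock period of its latest writer's WB
    previous_id = 1     # the first instruction is fetched in clock period 1
    rows = []
    for text, dest, sources in program:
        id_clock = previous_id + 1
        if stall_on_hazards:
            for reg in sources:
                if reg in wb_of_writer:
                    # Wait until the clock period after the producer's WB.
                    id_clock = max(id_clock, wb_of_writer[reg] + 1)
        wb_clock = id_clock + 3
        if dest is not None:
            wb_of_writer[dest] = wb_clock
        rows.append((text, id_clock, id_clock + 1, wb_clock))
        previous_id = id_clock
    return rows


for row in schedule(SEGMENT):
    print(row)
print("with stalls:", schedule(SEGMENT)[-1][3])
print("no hazards: ", schedule(SEGMENT, stall_on_hazards=False)[-1][3])
```

Under these assumptions beq decodes in clock period 7 instead of 4, the addition of $t2 and $t3 takes place in clock period 9, and sw completes in clock period 15.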

[Figure: lw $t3, 4($t3) followed by beq $t2, $t3, Label, with beq delayed by 3 clock periods (3 cp).]





[Figure: clock periods CP6 to CP15 for beq $t2, $t3, Label; add $t5, $t2, $t3; sw $t5, 8($t3), showing the delay between add and sw caused by the hazard on $t5.]



DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (1)

• Follows ideas of K. Kennedy et al.

• Definition: If control flow within a program can reach statement T after passing through statement S, then T depends on S

! Dependence is always defined by reference to the results of serial execution

• Assume a loop

! Dependence analysis outside a loop is trivial

• Assume an array

! Dependences that affect pipelining arise from references to the same memory location M[n] in an array




• Distinguish between statements (in a program) and instances of those statements (in loop instances or threads)

! Let i = loop induction variable

◦ Example: A for loop in C

¶ Syntax: for ( i=0; i<n; i++ ) · · ·

¶ The induction variable is i

! Let $S_i$ = instance of statement S that occurs on the value i of the induction variable

• Flow dependence (RAW): $S_{i_S}$ writes M[n] and $T_{i_T}$ reads M[n]

S: X[f_S[i]] = · · ·
T: · · · = F[X[f_T[i]]]




• Anti-dependence (WAR): $S_{i_S}$ reads M[n] and $T_{i_T}$ writes M[n]

S: · · · = F[X[f_S[i]]]
T: X[f_T[i]] = · · ·

• Output dependence (WAW): $S_{i_S}$ and $T_{i_T}$ both write M[n]

S: X[f_S[i]] = · · ·
T: X[f_T[i]] = · · ·


© C. D. Cantrell (10/2010)

LOOP OVERHEAD EXAMPLE: DOT PRODUCT

# The arguments are in registers $a0 through $a3
# The first argument POINTS to the first element of vector v1
# The second argument POINTS to the first element of vector v2
# Third argument = value of veclen (the dimension of the vectors)
# Fourth argument = data size in bytes

        .text

__start:
dotpro: nop                     # The dot product function
        ori   $v0,$0,0          # Initialize the dot product to 0
        blez  $a2,beamup        # Return if veclen <= 0
        or    $t1,$0,$a2        # Register t1 will be a counter; initialized
                                # to veclen
        or    $t3,$0,$a0        # Register t3 points to the component of v1
        or    $t4,$0,$a1        # Register t4 points to the component of v2

loop2:  lw    $t5,0($t3)        # Load word pointed to by reg. t3
        lw    $t6,0($t4)        # Load word pointed to by reg. t4
        mul   $t2,$t5,$t6       # Multiply regs. t5 and t6, product in reg. t2
        add   $v0,$v0,$t2       # Add product to running sum in reg. v0
        add   $t3,$a3,$t3       # Increment the pointer to the component of v1
        add   $t4,$a3,$t4       # Increment the pointer to the component of v2
        addi  $t1,$t1,-1        # Decrement register t1
        bgtz  $t1,loop2         # Loop again if t1>0
beamup: jr    $ra               # Beam me up....




• Requirement for the existence of an instance of dependence: A real memory location M[n] exists such that

$$M[n] = f_S[i_S] = f_T[i_T]$$

where S is executed before T and both values of the induction variable are in the range of the loop:

$$p \le i_S \le i_T \le q$$

• Example: $f_S$ and $f_T$ are linear functions

$$f_S[i] = a_S i + b_S, \qquad f_T[i] = a_T i + b_T$$

The requirement for dependence implies that

$$a_S i_S + b_S = a_T i_T + b_T \;\Rightarrow\; a_S i_S - a_T i_T + (b_S - b_T) = 0$$




• Example:

      do 100 i=2,100
T:      b(i)=a(i-1)
S:      a(i)=c(i)

! Here, $f_S(i) = i$, $f_T(i) = i - 1$

! Condition $f_S(i_S) = f_T(i_T)$ is $i_S = i_T - 1 \Rightarrow i_S < i_T$ (hence T depends on S)

! The equation $i_S = i_T - 1$ has lots of solutions such that $2 \le i_S < i_T \le 100$
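For linear subscript functions the dependence condition can be tested mechanically. The sketch below shows two checks: a brute-force enumeration of the instances inside the loop bounds, and the standard divisibility (GCD) test, which says the linear equation has integer solutions only if gcd(a_S, a_T) divides b_T − b_S. The GCD test ignores the loop bounds, so it can only rule dependence out. The function names are illustrative and the GCD test is not taken from the slides.

```python
from math import gcd


def may_depend(a_s, b_s, a_t, b_t):
    """GCD test: False means a_s*i_s + b_s == a_t*i_t + b_t has no integer
    solution at all; True means a solution may exist (bounds not checked)."""
    g = gcd(a_s, a_t)
    if g == 0:
        return b_s == b_t
    return (b_t - b_s) % g == 0


def dependence_instances(a_s, b_s, a_t, b_t, p, q):
    """All (i_s, i_t) with p <= i_s <= i_t <= q that touch the same element."""
    return [
        (i_s, i_t)
        for i_s in range(p, q + 1)
        for i_t in range(i_s, q + 1)
        if a_s * i_s + b_s == a_t * i_t + b_t
    ]


# The loop above: f_S(i) = i, f_T(i) = i - 1, with i running from 2 to 100.
pairs = dependence_instances(1, 0, 1, -1, 2, 100)
print(may_depend(1, 0, 1, -1), len(pairs), pairs[0], pairs[-1])
```

Every pair found has i_T = i_S + 1, which is the loop-carried flow dependence of T on S described above.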




• In serial execution, T3 reads from a(2) after S2 writes to a(2):

i = 2:  T2: b(2)=a(1)
        S2: a(2)=c(2)

i = 3:  T3: b(3)=a(2)
        S3: a(3)=c(3)

• In vector execution, T3 reads from a(2) before S2 writes to it:

b(2)=a(1)
b(3)=a(2)
...
a(2)=c(2)
a(3)=c(3)



CONTROL HAZARDS

• A control hazard occurs because a branch instruction needs to make a decision based on the results of operations or instructions that are still pending

. A beq instruction cannot update the PC before the test for equality has completed

◦ We'll see later that it's possible to make the decision at the Instruction Decode/Register Fetch stage instead of the ALU stage

. In the MIPS ISA, only the instruction immediately following the branch is executed or not executed, depending on the branch decision

◦ This isn't true in longer pipelines

. Methods for dealing with control hazards

◦ Stall
◦ Predict the branch
◦ Always execute the instruction following the branch



STALLING AS A SOLUTION FOR CONTROL HAZARDS

• After a conditional branch (beq) there is a one-stage pipeline stall (bubble), even if we are able to compare the inputs to beq in the ID/RF stage

[Figure: pipeline timing diagram for beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0). Horizontal axis: time (2 to 16 ns); vertical axis: program execution order (in instructions). Instruction start times are 2 ns apart except for one 4 ns gap, which marks the bubble.]



PREDICTING BRANCHES NOT TAKEN AS A SOLUTION


[Figure: two pipeline timing diagrams (time axis 2 to 14 ns). Top: beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), with instructions starting 2 ns apart. Bottom: beq $1, $2, 40; add $4, $5, $6; or $7, $8, $9, with gaps of 2 ns and 4 ns and a row of five bubbles replacing one instruction's stages.]




PIPELINE DELAYED BRANCH AS A SOLUTION

• After the conditional branch (beq), we insert an add instruction (which can do useful work) instead of a stall (bubble), which does nothing

[Figure: pipeline timing diagram (time axis 2 to 14 ns) for beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0), with instructions starting 2 ns apart.]




FORWARDING AS A SOLUTION FOR DATA HAZARDS (1)

[Figure: pipeline diagram (time axis 2 to 10 ns; stages IF, ID, EX, MEM, WB) for add $s0, $t0, $t1 followed by sub $t2, $s0, $t3.]



FORWARDING AS A SOLUTION FOR DATA HAZARDS (2)

[Figure: pipeline diagram (time axis 2 to 14 ns; stages IF, ID, EX, MEM, WB) for lw $s0, 20($t1) followed by sub $t2, $s0, $t3, with a row of bubbles between the two instructions.]



MIPS PIPELINES (3)

• MIPS R2000 integer unit pipeline hazards, named for pipeline registers:

Notation: IF/ID.ReadRegisterj, where IF/ID is the pipeline register and ReadRegisterj is the name of the register field

1. ID/EX.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)

2. EX/MEM.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)

3. MEM/WB.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)



MIPS PIPELINES (4)

• MIPS R2000 integer pipeline hazards generated by the instructions

sub $2,$1,$3
and $12,$2,$5
or  $13,$6,$2
add $14,$2,$2
sw  $15,100($2)

! The instruction and $12,$2,$5 results in hazard 1a,
  ID/EX.WriteRegister = IF/ID.ReadRegister1 = 2

! The instruction or $13,$6,$2 results in hazard 2b,
  EX/MEM.WriteRegister = IF/ID.ReadRegister2 = 2

! The instruction add $14,$2,$2 results in hazards 3a and 3b,
  MEM/WB.WriteRegister = IF/ID.ReadRegister1 = 2
  MEM/WB.WriteRegister = IF/ID.ReadRegister2 = 2
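The three numbered hazard conditions can be expressed as a small classifier. This is a sketch under stated assumptions: the function name is hypothetical, a value of -1 stands for "this pipeline register holds no destination register", and register $0 is assumed never to cause a hazard.

```c
/* Returns 1, 2 or 3 according to which pipeline register (ID/EX, EX/MEM or
   MEM/WB) holds a pending write to read_reg, or 0 if there is no hazard.
   Pass -1 for a pipeline register whose instruction writes no register. */
int hazard_type(int idex_wr, int exmem_wr, int memwb_wr, int read_reg)
{
    if (read_reg == 0)
        return 0;              /* assumption: $0 is never a hazard source */
    if (idex_wr == read_reg)
        return 1;              /* hazard 1: writer is in ID/EX */
    if (exmem_wr == read_reg)
        return 2;              /* hazard 2: writer is in EX/MEM */
    if (memwb_wr == read_reg)
        return 3;              /* hazard 3: writer is in MEM/WB */
    return 0;
}
```

Applied to the sequence above while each instruction is in IF/ID, the sketch reports type 1 for and, type 2 for the second operand of or, type 3 for both operands of add, and no hazard for sw.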



PIPELINED DEPENDENCIES IN AN INSTRUCTION SEQUENCE

[Figure: multiple-clock-cycle pipeline diagrams (CC 1 to CC 9; stages IM, Reg, ALU, DM, Reg; vertical axis: program execution order in instructions) for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 9. In the second diagram, the value –20 appears in EX/MEM in CC 4 and in MEM/WB in CC 5.]

[Figure: PIPELINED DATAPATH WITHOUT FORWARDING. Registers, ID/EX, ALU, EX/MEM, data memory, MEM/WB and the write-back multiplexor.]

[Figure: PIPELINED DATAPATH WITH FORWARDING. The same datapath with multiplexors on both ALU inputs, controlled by ForwardA and ForwardB from a forwarding unit whose inputs are Rs and Rt from ID/EX, EX/MEM.RegisterRd and MEM/WB.RegisterRd.]

[Figure: DATAPATH MODIFIED TO RESOLVE HAZARDS BY FORWARDING. Full pipelined datapath (PC, instruction memory, IF/ID, registers, control, ID/EX, ALU, EX/MEM, data memory, MEM/WB) with a forwarding unit. IF/ID.RegisterRs, IF/ID.RegisterRt and IF/ID.RegisterRd are carried into ID/EX; the forwarding unit compares Rs and Rt with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]



PIPELINED EXECUTION WITH FORWARDING


[40000028] sub $2, $1, $3
[4000002c] and $4, $2, $5
[40000030] or  $4, $4, $2
[40000034] add $9, $4, $2

! Without forwarding, there would be RAW hazards on register $2 in the and instruction and on register $4 in the or and add instructions
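The selection made by a forwarding unit for one ALU input can be sketched as follows. This is a minimal model, not the exact control logic of the slides: the function name, the RegWrite flags and the numeric mux encoding are assumptions, and the EX/MEM result is given priority because it is the more recent one.

```c
#include <stdbool.h>

/* Assumed mux select encoding: 0 = value read from the register file,
   2 = forward from EX/MEM, 1 = forward from MEM/WB. */
enum { FWD_NONE = 0, FWD_MEMWB = 1, FWD_EXMEM = 2 };

/* idex_src is the source register (Rs or Rt) of the instruction in EX. */
int forward_select(bool exmem_regwrite, int exmem_rd,
                   bool memwb_regwrite, int memwb_rd,
                   int idex_src)
{
    /* The most recent producer wins, so EX/MEM is tested first. */
    if (exmem_regwrite && exmem_rd != 0 && exmem_rd == idex_src)
        return FWD_EXMEM;
    if (memwb_regwrite && memwb_rd != 0 && memwb_rd == idex_src)
        return FWD_MEMWB;
    return FWD_NONE;
}
```

For the sequence above, when or $4, $4, $2 is in EX the first operand ($4) is taken from EX/MEM (written by and) and the second ($2) from MEM/WB (written by sub). When add $9, $4, $2 is in EX, both and and or have written $4, and the EX/MEM copy (from or) is the one selected.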


[Figures: snapshots of the pipelined datapath with forwarding in clock cycles 3 to 6. Instructions occupying each stage:]

Clock   IF               ID               EX               MEM          WB
3       or $4,$4,$2      and $4,$2,$5     sub $2,$1,$3     before<1>    before<2>
4       add $9,$4,$2     or $4,$4,$2      and $4,$2,$5     sub $2,...   before<1>
5       after<1>         add $9,$4,$2     or $4,$4,$2      and $4,...   sub $2,...
6       after<2>         after<1>         add $9,$4,$2     or $4,...    and $4,...

Clock 3: ID/EX.WriteRegister = IF/ID.ReadRegister1 = 2 (RAW data hazard)

Clock 4: ID/EX.WriteRegister = IF/ID.ReadRegister1 = 4 and EX/MEM.WriteRegister = IF/ID.ReadRegister2 = 2 (RAW data hazards);
         EX/MEM.WriteRegister = ID/EX.ReadRegister1 = 2 (test by which the need for forwarding to ALUIn1 is actually detected)

Clock 5: EX/MEM.WriteRegister = IF/ID.ReadRegister1 = 4 and MEM/WB.WriteRegister = IF/ID.ReadRegister2 = 2;
         EX/MEM.WriteRegister = ID/EX.ReadRegister1 = 4 and MEM/WB.WriteRegister = ID/EX.ReadRegister2 = 2

Clock 6: EX/MEM.WriteRegister = ID/EX.ReadRegister2 = 4 (figure annotation: "Who wrote this value?")

[Figure: ADDITION OF A MULTIPLEXOR TO CHOOSE THE IMMEDIATE VALUE. Forwarding datapath with an extra multiplexor, controlled by ALUSrc, on the second ALU input.]




[Figure: A DATA HAZARD THAT CANNOT BE RESOLVED BY FORWARDING. Pipeline diagram for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. Annotation: data dependence goes backward in time.]

[Figure: HOW STALLS ARE INSERTED INTO A PIPELINE. The same instruction sequence over CC 1 to CC 10 (time in clock cycles), with a bubble inserted after the lw instruction.]



HAZARD DETECTION UNIT

• The control logic for the hazard detection unit is:

If (ID/EX.MemRead and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
     (ID/EX.RegisterRt = IF/ID.RegisterRt)))
Then stall the pipeline

. The first line tests whether the instruction in the EX stage is a load

◦ The next two lines check whether the destination register of the load is the same as either of the source registers of the instruction that is in the ID stage

• For a stall, all control signals are deasserted in the EX stage
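The stall condition translates directly into a Boolean function. The sketch below mirrors the control logic above; the function and parameter names are illustrative assumptions.

```c
#include <stdbool.h>

/* True when the instruction in EX is a load whose destination (Rt)
   matches a source register of the instruction in ID. */
bool must_stall(bool idex_memread, int idex_rt, int ifid_rs, int ifid_rt)
{
    return idex_memread &&
           (idex_rt == ifid_rs || idex_rt == ifid_rt);
}
```

For example, with lw $2, 20($1) in EX and and $4, $2, $5 in ID, `must_stall(true, 2, 2, 5)` is true; if the instruction in EX is not a load, the function returns false regardless of the register numbers.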


[Figure: OVERVIEW OF PIPELINED CONTROL. Pipelined datapath with control, forwarding unit and hazard detection unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite and a multiplexor that substitutes 0 for the control signals.]



PIPELINED EXECUTION WITH A STALL


[40000028] lw  $2, 20($1)
[4000002c] and $4, $2, $5
[40000030] or  $4, $4, $2
[40000034] add $9, $4, $2

. The hardware inserts a stall after the lw instruction

◦ The stall creates the same effect as a nop

◦ For a stall, all control signals are deasserted in the EX stage
◦ Deasserted control signals are forwarded to the MEM and WB stages
◦ Nothing is written to memory or the register file

. After the stall, forwarding resolves the RAW hazards on register $2 in the and instruction and on register $4 in the or and add instructions

[Figures: snapshots of the pipelined datapath with hazard detection and forwarding units in clock cycles 2 to 7. Instructions occupying each stage:]

Clock   IF               ID               EX               MEM          WB
2       and $4,$2,$5     lw $2,20($1)     before<1>        before<2>    before<3>
3       or $4,$4,$2      and $4,$2,$5     lw $2,20($1)     before<1>    before<2>
4       or $4,$4,$2      and $4,$2,$5     bubble           lw $2,...    before<1>
5       add $9,$4,$2     or $4,$4,$2      and $4,$2,$5     bubble       lw $2,...
6       after<1>         add $9,$4,$2     or $4,$4,$2      and $4,...   bubble
7       after<2>         after<1>         add $9,$4,$2     or $4,...    and $4,...

[Figure: EFFECT OF A PIPELINE ON A BRANCH INSTRUCTION. Pipeline diagram over CC 1 to CC 9 for 40 beq $1, $3, 7; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7).]



MAKE THE BRANCH DECISION EARLY

• In the unoptimized example, taking the branch costs 3 clock periods

! The cost is higher in modern pipelines, which are much deeper

• The branch target calculation PC = (PC + 4) + offset*4 for the instruction beq Rs, Rt, offset can be done in the ID/RF stage

• There is a faster way than a sub to compare the contents of Rs and Rt

! With a small combinational logic block, take the bitwise XOR of the register contents

◦ This produces a word with 1 bits wherever the operands differ

! Then OR all of the bits of the resulting word

◦ The result is 1 if, and only if, the operands differ ⇒ branch not taken

! This can also be done in the ID/RF stage

• Result: In the MIPS ISA, there is only a one-cycle delay after a branch
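The XOR-then-OR comparison and the branch target calculation can be modeled in C. This is a software sketch of the combinational logic, with hypothetical function names; the explicit loop stands in for a 32-input OR gate.

```c
#include <stdint.h>

/* Returns 1 if rs == rt, computed as NOT(OR-reduce(rs XOR rt)). */
int regs_equal(uint32_t rs, uint32_t rt)
{
    uint32_t diff = rs ^ rt;                  /* 1 bits wherever the operands differ */
    int any_difference = 0;
    for (int bit = 0; bit < 32; bit++)
        any_difference |= (int)((diff >> bit) & 1u);   /* OR of all the bits */
    return !any_difference;                   /* beq is taken when this is 1 */
}

/* Branch target: (PC + 4) + offset*4, where offset is the sign-extended
   16-bit immediate field of the branch instruction. */
uint32_t branch_target(uint32_t pc, int16_t offset)
{
    return pc + 4u + ((uint32_t)(int32_t)offset << 2);
}
```

As a check, a beq at address 40 with offset 7 gives a target of (40 + 4) + 7*4 = 72, and a negative offset produces a backward branch.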


[Figure: PIPELINED DATAPATH INCLUDING SUPPORT FOR BRANCHES. The branch adder (sign extend, shift left 2), the register comparator (=) and the IF.Flush signal are located in the ID stage, alongside the hazard detection and forwarding units.]



A PIPELINED BRANCH


[40000024] sub $10, $4, $8
[40000028] beq $1, $3, 7     # PC-relative branch to offset
[4000002c] and $12, $2, $5   # (40 + 4) + 7*4 = 72 = 0x48
[40000030] or  $14, $2, $6
[40000034] add $14, $4, $2
[40000038] slt $15, $6, $7
. . .
[40000048] lw  $14, 50($7)


[Figures: snapshots of the pipelined datapath with branch support in clock cycles 3 and 4. Instructions occupying each stage:]

Clock   IF                ID              EX               MEM           WB
3       and $12,$2,$5     beq $1,$3,7     sub $10,$4,$8    before<1>     before<2>
4       lw $4,50($7)      bubble (nop)    beq $1,$3,7      sub $10,...   before<1>

In clock 3 the branch target 72 is computed in the ID stage and selected as the next PC; in clock 4 the instruction fetched after the branch has been replaced by a bubble (nop).

[Figure: EFFECT OF AN OPTIMIZED PIPELINE ON A BRANCH INSTRUCTION. Pipeline diagram over CC 1 to CC 9 for 40 beq $1, $3, 7; 44 and $12, $2, $5; 72 lw $4, 50($7).]



STATIC BRANCH PREDICTION

• Predicted behavior is based only on the branch instruction itself

. The early SPARC and MIPS architectures predicted that a branch would not be taken

. A more sophisticated static prediction scheme would base the prediction on a comparison of the target address with the current value of the PC

◦ If the branch goes to a later instruction (i.e., to a higher address), then it is predicted not taken
◦ If the branch goes to an earlier instruction, then it is predicted taken

• Problems with static prediction

. Static schemes predict correctly only for certain types of branches

. Example: If beq is the only available branch instruction, then it must be taken to exit from a loop

. If the beq target is a later instruction, then the branch is almost always mispredicted in the "not taken to a later instruction" approach



DYNAMIC BRANCH PREDICTION (1)

• Base the predicted branch behavior on the history of the branch

• A common branch prediction scheme uses a branch history table

. Each entry in the memory is indexed by the lower 16 bits of the address of the branch instruction

. Each entry consists of a bit that is set if the branch was recently taken

. If the prediction turns out to be wrong, the bit is toggled

. Performance shortcoming: Even if a branch is almost always taken (or not taken), the bit gets toggled on a wrong prediction, and the next execution of the branch is likely to be mispredicted

◦ Example: A loop that is executed 10 times, using a branch to the head
◦ The branch is mispredicted at the beginning and end (80% accuracy)
◦ Here, branch frequency (90% taken) ≠ prediction accuracy (80%)




• A 2-bit branch prediction scheme uses a branch history table in which each entry contains 2 bits to indicate the state of a branch prediction FSM (next slide)

. This scheme mispredicts only once if a branch almost always goes one way
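The difference between the 1-bit and 2-bit schemes can be checked with a small simulation of a loop-closing branch. This is a sketch under stated assumptions: the function name is hypothetical, and the 2-bit predictor is modeled as a saturating counter (states 0 to 3, predict taken when the counter is 2 or 3), which is one common realization and may differ in detail from the FSM in the figure.

```c
/* Counts mispredictions during the last of `runs` executions of a loop
   whose closing branch is taken (trip_count - 1) times and then falls
   through once.
   bits == 1: a single history bit that remembers the last outcome.
   bits == 2: a saturating counter 0..3, predict taken when >= 2. */
int mispredictions_last_run(int bits, int trip_count, int runs)
{
    int max_state = (bits == 1) ? 1 : 3;
    int threshold = (bits == 1) ? 1 : 2;
    int state = 0;                 /* start in "predict not taken" */
    int misses = 0;

    for (int r = 0; r < runs; r++) {
        misses = 0;                /* only the last run is reported */
        for (int k = 0; k < trip_count; k++) {
            int taken = (k < trip_count - 1);
            int predicted_taken = (state >= threshold);
            if (predicted_taken != taken)
                misses++;
            if (taken) {
                if (state < max_state) state++;
            } else {
                if (state > 0) state--;
            }
        }
    }
    return misses;
}
```

For a loop executed 10 times and entered repeatedly, the 1-bit scheme mispredicts twice per loop execution (80% accuracy), while the 2-bit counter settles to one misprediction per loop execution (90% accuracy).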

[Figure: A branch-target buffer. The PC of the instruction to fetch is looked up in a table whose entries hold a predicted PC and a "branch predicted taken or untaken" field. On a match (=), the instruction is a branch and the predicted PC should be used as the next PC; otherwise the instruction is not predicted to be a branch and execution proceeds normally.]

[Figure: STATES IN A 2-BIT BRANCH PREDICTION SCHEME. Four states (two "Predict taken", two "Predict not taken") with transitions labeled "Taken" and "Not taken".]


After John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition


• Branch prediction accuracy for a 4096-entry, 2-bit prediction buffer

[Figure: SCHEDULING THE BRANCH DELAY SLOT. Three panels, each showing a code fragment and what it becomes after scheduling: (a) from before the branch, (b) from the branch target, (c) from the fall-through path. The fragments use add $s1, $s2, $s3, sub $t4, $t5, $t6 and a branch of the form "if $s1 = 0 then" or "if $s2 = 0 then", with the delay slot marked in each panel.]



EXCEPTIONS (1)

• In the MIPS ISA, an exception is a synchronous (clocked) event that causes a process to stop executing

! System call (explicit instruction, e.g. for I/O)

◦ Stopping execution permits another process to execute while the process that made the syscall waits for I/O

! Exception associated with execution of the current instruction

◦ Bus error (I/O timeout, load/store kernel physical address)
◦ Protection exception
◦ Attempt to execute a reserved instruction
◦ Cache/TLB miss
◦ Floating-point arithmetic exception

• An interrupt is an asynchronous event, external to the current instruction, that stops the execution of the current process



EXCEPTIONS (2)

• In the R2000 ISA, exceptions are handled by coprocessor 0

• How the R2000 processor and the UNIX kernel perform exception handling:

1. Processor exits user mode & is forced into kernel mode.

2. The address of an exception vector (exception handling program) is loaded into the program counter (PC).

! Reset exception (reboot): the processor transfers control to the Reset exception vector at address 0xbfc00000

! UTLB Miss: Control is transferred to the exception vector pointed to by the contents of address 0x80000000

! All other exceptions are handled by the kernel

◦ The general exception handler pointed to by the contents of address 0x80000080 takes control, gets the cause from the Cause register and transfers control to the correct exception handler

[Figure: MIPS R2000 CPU AND COPROCESSORS. CPU (registers $0 to $31, arithmetic unit, multiply/divide unit with Lo and Hi), Coprocessor 1 (FPU: registers $0 to $31, arithmetic unit), Coprocessor 0 (traps and memory: BadVAddr, Status, Cause, EPC), memory and PC.]

[Figure: MIPS CP0 and Exception Handling Registers. TLB (Translation Lookaside Buffer) with "safe" entries; registers EntryHi, EntryLo, Index, Random, Context, BadVAddr, EPC, PRId, Status and Cause, grouped into those used with virtual memory and those used for exception processing.]



MIPS R2000 COPROCESSOR 0

• BadVaddr register (coprocessor 0, register 8)

! Memory address at which an addressing exception occurred

• Status register (coprocessor 0, register 12)

! Interrupt mask and interrupt enable bits

! Kernel/user bits for old, previous and current processes

• Cause register (coprocessor 0, register 13)

! Holds a code for the cause of an exception

• Exception program counter (EPC) (coprocessor 0, register 14)

! Holds address of instruction that caused an exception



MIPS KERNEL CONVENTIONS (1)

• The MIPS kernel recognizes 4 memory segments: kuseg, kseg0, kseg1 and kseg2

. Addresses between 0x00400000 and 0x7fffffff belong to kuseg

◦ User address space

. Virtual addresses between 0x80000000 and 0x9fffffff belong to kseg0

◦ Addresses between 0x80000000 and 0x8fffffff are used for kernel text (.ktext; executable instructions)
◦ Addresses between 0x90000000 and 0x9fffffff are used for kernel data (.kdata)
◦ Addresses in this range are translated to physical memory by clearing the high bit and mapping contiguously into the low 512 MB of memory



MIPS KERNEL CONVENTIONS (2)

• The MIPS kernel recognizes 4 memory segments: kuseg, kseg0, kseg1 and kseg2

. Addresses between 0xa0000000 and 0xbfffffff belong to kseg1

◦ Typically used for I/O registers, memory-resident ROM code and disk buffers
◦ Direct-mapped, uncached

. Addresses above 0xbfffffff belong to kseg2

◦ Process structures (remapped on context switches)
◦ User page table entries
◦ Caching and remapping via paging, not via swapping entire processes
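The segment boundaries can be summarized in a short classifier. This is an illustrative sketch with hypothetical names; it treats every address below 0x80000000 as kuseg, and it assumes that kseg1, like kseg0, maps contiguously onto the low 512 MB of physical memory (the slides state this explicitly only for kseg0).

```c
#include <stdint.h>

typedef enum { KUSEG, KSEG0, KSEG1, KSEG2 } segment_t;

/* Classify a 32-bit virtual address by segment. */
segment_t segment_of(uint32_t vaddr)
{
    if (vaddr < 0x80000000u) return KUSEG;
    if (vaddr < 0xa0000000u) return KSEG0;
    if (vaddr < 0xc0000000u) return KSEG1;
    return KSEG2;
}

/* For the unmapped segments, store the physical address in *paddr and
   return 1. Return 0 for kuseg and kseg2, which need page-table lookup. */
int unmapped_to_physical(uint32_t vaddr, uint32_t *paddr)
{
    switch (segment_of(vaddr)) {
    case KSEG0:
        *paddr = vaddr & 0x7fffffffu;     /* clear the high bit */
        return 1;
    case KSEG1:
        *paddr = vaddr - 0xa0000000u;     /* assumed: low 512 MB, uncached */
        return 1;
    default:
        return 0;
    }
}
```

With these assumptions, the general exception vector 0x80000080 lies in kseg0 at physical address 0x80, and the reset vector 0xbfc00000 lies in kseg1 at physical address 0x1fc00000.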

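MIPS R2000 Memory Map

Segment   Virtual address range        Mode     Mapping     Caching
kuseg     0x00000000 - 0x7fffffff      User     Mapped      Cacheable
kseg0     0x80000000 - 0x9fffffff      Kernel   Unmapped    Cached
kseg1     0xa0000000 - 0xbfffffff      Kernel   Unmapped    Uncached
kseg2     0xc0000000 - 0xffffffff      Kernel   Mapped      Cacheable

On the physical side, the figure marks the boundary between 0x1fffffff and 0x20000000 (the end of the low 512 MB).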

# SPIM TRAP HANDLER DATA
        .kdata
__m1_:  .asciiz " Exception "
__m2_:  .asciiz " caught by trap handler.\n"
__m3_:  .asciiz "Continuing. . .\n"
__m4_:  .asciiz "Halting.\n"
__e0_:  .asciiz " [Interrupt]"
__e1_:  .asciiz " [TLB modification !BUG!]"
__e2_:  .asciiz " [TLB miss !BUG!]"
__e3_:  .asciiz " [TLB miss !BUG!]"
__e4_:  .asciiz " [Unaligned address in inst/data fetch]"
__e5_:  .asciiz " [Unaligned address in store]"
__e6_:  .asciiz " [Bad address in text read]"
__e7_:  .asciiz " [Bad address in data/stack read]"
__e8_:  .asciiz " [Error in syscall]"
__e9_:  .asciiz " [Breakpoint]"
__e10_: .asciiz " [Reserved instruction]"
__e11_: .asciiz " [Syscall exception !BUG!]"
__e12_: .asciiz " [Arithmetic overflow]"
__e13_: .asciiz " [Inexact floating point result]"
__e14_: .asciiz " [Invalid floating point result]"
__e15_: .asciiz " [Divide by 0]"
__e16_: .asciiz " [Floating point overflow]"
__e17_: .asciiz " [Floating point underflow]"
__excp: .word __e0_,__e1_,__e2_,__e3_,__e4_,__e5_,__e6_,__e7_,__e8_,__e9_
        .word __e10_,__e11_,__e12_,__e13_,__e14_,__e15_,__e16_,__e17_
s1:     .word 0
s2:     .word 0

# SPIM TRAP HANDLER CODE
        .ktext
        .space 0x80             # Put trap handler at 0x80000080
        sw    $v0 s1            # Not re-entrant
        sw    $a0 s2            # Don't need to save k0/k1
        mfc0  $k0 $13           # Cause
        and   $k0 $k0 0xff      # Use just ExcCode field
        mfc0  $k1 $14           # EPC
        li    $v0 4             # Print " Exception "
        la    $a0 __m1_
        syscall
        li    $v0 1             # Print exception number
        srl   $a0 $k0 2
        syscall
        li    $v0 4             # Print type of exception
        lw    $a0 __excp($k0)
        syscall
        li    $v0 4             # Print " caught by trap handler.\n"
        la    $a0 __m2_
        syscall
        srl   $a0 $k0 2
        beq   $a0 12 ret        # continue on overflow
        beq   $a0 13 ret        # continue on inexact fp result
        beq   $a0 14 ret        # continue on invalid fp result
        beq   $a0 16 ret        # continue on fp overflow
        beq   $a0 17 ret        # continue on fp underflow
        li    $v0 4             # Print "Halting.\n"
        la    $a0 __m4_
        syscall
        li    $v0 10            # Exit on all other exceptions
        syscall                 # syscall 10 (exit)

ret:    li    $v0 4             # Print "Continuing. . .\n"
        la    $a0 __m3_
        syscall

        lw    $v0 s1
        lw    $a0 s2
        addiu $k1 $k1 4         # Return to next instruction
        rfe                     # Return from exception handler
        jr    $k1

        .text
        .globl __start



EXCEPTIONS IN A PIPELINED PROCESSOR

• Five instructions are active in any given clock period

. Multiple exceptions can occur simultaneously

. If execution is not stopped soon enough, the value in the register that helped cause the exception may be overwritten in the WB stage

. To flush the instructions that follow the instruction that caused the exception, we add two new signals, ID.Flush and EX.Flush

. ID.Flush is ORed with the stall signal from the hazard detection unit to flush an instruction during its ID stage

. EX.Flush zeroes the control signals of an instruction in its EX stage; to start fetching the exception handler, we add an input to the PC multiplexor that sends 0x80000080 to the PC


[Figure: DATAPATH WITH CONTROLS TO HANDLE EXCEPTIONS. Pipelined datapath with branch support, extended with the signals IF.Flush, ID.Flush and EX.Flush (each selecting 0 through a multiplexor), the Cause and ExceptPC registers, and an additional PC multiplexor input carrying 80000080.]



A PIPELINED EXCEPTION


[40000040] sub $11, $2, $4
[40000044] and $12, $2, $5
[40000048] or  $13, $2, $6
[4000004c] add $1, $2, $1    # overflow exception occurs here
[40000050] slt $15, $6, $7
[40000054] lw  $16, 50($7)
. . .

given that the instructions to execute when an exception occurs are

[80000080] lui $1, -28672    # -28672 (base 10) = 0x9000
[80000084] sw  $2, 592($1)   # sw $v0 s1
. . .



A PIPELINED EXCEPTION

• The value of the PC when an instruction is issued is part of the instruction's control state, and must be passed along in pipeline registers

! Otherwise, you'd never know exactly where you were clobbered

! Some architectures have imprecise exceptions (e.g., the IBM 360/91)

◦ In these cases, it's usually the address that is in the PC when the exception occurs that is reported, not the address of the instruction that actually caused the exception
◦ This is especially annoying when a branch is taken immediately after the exception-causing instruction!

• In the example, an integer overflow occurs, asserting the Overflow signal

! The Overflow signal must be routed to the Control block, which then asserts the EX.Flush, ID.Flush and IF.Flush signals, and asserts a control signal that causes the PC Source multiplexor to load 0x80000080 into the PC

[Figures: snapshots of the exception-handling datapath in clock cycles 5 and 6. Instructions occupying each stage:]

Clock   IF                ID               EX              MEM           WB
5       lw $16,50($7)     slt $15,$6,$7    add $1,$2,$1    or $13,...    and $12,...
6       lui $1,-28672     bubble (nop)     bubble          bubble        or $13,...

In clock 5 the ALU asserts Overflow, which is routed to Cause and Control, and 0x80000080 is selected as the next PC; in clock 6 the first instruction of the handler is fetched and the PC advances to 80000084.

[Figure: PIPELINED DATAPATH WITH CONTROL. Complete datapath with labeled ports (instruction memory Address/Read data; register file Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2; data memory Address, Write data, Read data), the instruction fields [25–21], [20–16] and [15–11], the 16-to-32-bit sign extension, ALU control, and the control signals RegWrite, ALUOp, ALUSrc, RegDst, MemWrite, MemRead, MemtoReg and Branch.]



PIPELINE ENHANCEMENTS

• Design a deeper pipeline (many stages)

! The Pentium 4 “Willamette” pipeline had 20 stages; the “Prescott”, 31

◦ This permitted high clock frequencies ⇒ high power consumption

! The Core 2 Duo pipeline has 14 stages

• Issue more than one instruction per clock period (“superscalar” architecture)

! Theoretically, this divides the CPI by the number issued per clock

! We will study the modifications needed for issuing 2 instructions/clock

◦ Double the number of ALUs and read/write ports on the register file

• Schedule the pipeline dynamically

! Find useful instructions to schedule during a stall

! Major pipeline units:

◦ Instruction fetch/issue unit
◦ Execution unit
◦ Commit unit



DYNAMIC PIPELINE SCHEDULING (1)

• Major limitation of the statically scheduled pipeline that we have studied: In-order instruction issue and execution

! This permits head-of-line blocking of instructions that could execute

• One approach is to allow in-order issue and out-of-order execution

! For in-order issue, the IF/ID unit must check for structural hazards

! For out-of-order execution:

◦ Need multiple functional units
¶ Execution occurs whenever there are no data dependences or hazards

◦ A new kind of unit, a scoreboard, must check for data hazards

! Out-of-order completion means that there are imprecise exceptions

! This is the design approach used in the CDC 6600



BOOKKEEPING FOR DYNAMIC SCHEDULING

• The scoreboarding technique was introduced in the CDC 6600 (1963)

! Goal: Maintain a low CPI by executing as early as possible

! The scoreboard maintains several status tables

◦ Status of each instruction: Issued, Operands Read, Execution Completed, Results Written

◦ Once an instruction has issued, the functional unit table keeps a record of the operands
¶ Functional unit status: Busy, Operation Underway, Destination Register Name, Source Register Names, Units Producing Source Register Operands, Flags (indicating when the source register operands are ready)

◦ Register result status
¶ Indicates which functional unit will write to the register



PIPELINE DEPTH vs. SPEEDUP

• The graph on the following page shows the speedup achieved by increasing the number of stages, assuming:

! Constant clock frequency

! A single instruction queue with in-order issue and completion

• The speedup achieved under these circumstances is much less than the number of stages

! This is a result of data and control hazards ⇒ pipeline stalls

• In reality, increasing the number of stages so that there are fewer levels of logic per stage makes it possible to increase the clock frequency

! This shifts the curve toward the right (higher number of stages)

• To achieve a much higher speedup than is shown in the graph, designers resort to multiple issue and dynamic scheduling

[Figure: PIPELINE DEPTH vs. SPEEDUP. Horizontal axis: pipeline depth (1, 2, 4, 8, 16); vertical axis: relative performance (0.0 to 3.0).]



SUPERSCALAR EXAMPLE


[40000040] lw   $8, 0($17)      # $17=pointer
[40000044] addu $8, $8, $18     # $8=array element
[40000048] sw   $8, 0($17)
[4000004c] addi $17, $17, -4    # decrement pointer
[40000050] bne  $17, $19, -5

assuming a static 2-issue MIPS pipeline

. Note that a comparison with $0 would be invalid, since a null pointer is not a valid address (this corrects the code in the textbook)

• The new hardware for the example is a second pipeline for data transfer (load and store) instructions

. The original pipeline is used for ALU and branch instructions


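[Figure: SUPERSCALAR DATAPATH. Two-issue datapath with a shared register file, two sign-extension units, one ALU for ALU and branch instructions and a second ALU for address calculation feeding the data memory (Address, Write data).]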



SUPERSCALAR EXAMPLE: SCHEDULING

        ALU or branch instruction     Data transfer instruction     Clock cycle
Loop:                                 lw   $t0, 0($s1)              1
        addi $s1,$s1,–4                                             2
        addu $t0,$t0,$s2                                            3
        bne  $s1,$s3,Loop             sw   $t0, 4($s1)              4

• The resulting CPI is 0.8 instead of the theoretical value, 0.5

! Speedup = 1.25 instead of 2.0

• The problem is that we are taking 4 clocks to execute 5 instructions

! Two of these instructions (addi and bne) are loop overhead

• In some memory architectures there may be a hardware conflict between the store in one loop instance and the load in the next instance



LOOP UNROLLING

• The goal of loop unrolling is to minimize the performance impact of loop overhead by executing several instances of the loop for one set of overhead instructions

! If a loop is unrolled in hardware, then the original register targets of data-transfer and computational instructions must be renamed

! Register renaming is especially important in executing x86 instructions, because there are very few general-purpose x86 architectural registers

◦ The architectural registers can be renamed into a larger set of physical registers

! Loops can also be unrolled in software

◦ Compiler unrolling (often performed by optimizing compilers)
◦ Unrolling in a higher-level language



SUPERSCALAR EXAMPLE: LOOP UNROLLING

        ALU or branch instruction     Data transfer instruction     Clock cycle
Loop:   addi $s1,$s1,–16              lw   $t0, 0($s1)              1
                                      lw   $t1,12($s1)              2
        addu $t0,$t0,$s2              lw   $t2, 8($s1)              3
        addu $t1,$t1,$s2              lw   $t3, 4($s1)              4
        addu $t2,$t2,$s2              sw   $t0,16($s1)              5
        addu $t3,$t3,$s2              sw   $t1,12($s1)              6
                                      sw   $t2, 8($s1)              7
        bne  $s1,$s3,Loop             sw   $t3, 4($s1)              8

• The loop in this example is unrolled to a depth of 4

• The resulting CPI is 0.57, much closer to the theoretical value of 0.5

! Speedup = 1.75, much closer to 2.0
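The CPI and speedup figures follow from counting instructions and clock cycles in each schedule. A minimal sketch of the arithmetic, with hypothetical helper names and a single-issue baseline CPI of 1.0 assumed for the speedup:

```c
/* Clock cycles per instruction for one pass through a scheduled loop. */
double cpi(int clock_cycles, int instructions)
{
    return (double)clock_cycles / (double)instructions;
}

/* Speedup relative to a baseline CPI at the same clock rate. */
double speedup(double baseline_cpi, double new_cpi)
{
    return baseline_cpi / new_cpi;
}
```

The original loop issues 5 instructions in 4 clocks, so `cpi(4, 5)` is 0.8 and the speedup over a CPI of 1.0 is 1.25. The unrolled loop issues 14 instructions in 8 clocks, so `cpi(8, 14)` is about 0.57 and the speedup is 1.75.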



LOOP UNROLLING IN SOFTWARE

• Computation of one component of a matrix-vector product y = Ax in C:

for (j=0; j<n; j++){
    y[i] = y[i] + a[i][j] * x[j];
}

! The vector y must be initialized to 0 in a previous loop

• The same loop, unrolled to a depth of 4:

for (j=0; j<n; j+=4){
    y[i] = (((y[i] + a[i][j]*x[j]) + a[i][j+1]*x[j+1])
           + a[i][j+2]*x[j+2]) + a[i][j+3]*x[j+3];
}

! The programmer has to ensure that n is a multiple of 4
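The restriction that n be a multiple of 4 can be lifted with a cleanup loop. The following is a self-contained sketch for one row of the matrix-vector product; the function names are hypothetical, and the additions are kept in the same order as in the rolled loop so that both versions produce the same floating-point result.

```c
/* One component of y = Ax: the straightforward loop. */
double row_product(const double *a_row, const double *x, int n)
{
    double yi = 0.0;
    for (int j = 0; j < n; j++)
        yi = yi + a_row[j] * x[j];
    return yi;
}

/* The same computation unrolled to a depth of 4, followed by a cleanup
   loop that handles the last n % 4 elements. */
double row_product_unrolled(const double *a_row, const double *x, int n)
{
    double yi = 0.0;
    int j = 0;
    for (; j + 3 < n; j += 4) {
        yi = (((yi + a_row[j] * x[j]) + a_row[j + 1] * x[j + 1])
              + a_row[j + 2] * x[j + 2]) + a_row[j + 3] * x[j + 3];
    }
    for (; j < n; j++)              /* leftover iterations */
        yi = yi + a_row[j] * x[j];
    return yi;
}
```

The unrolled version executes one loop test and one index update per four multiply-adds, at the cost of a slightly longer function body.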



DYNAMIC PIPELINE SCHEDULING (2)

• When several instructions are issued in a clock period, it is possible to reorder their execution to minimize pipeline stalls

• There are multiple pipelines, divided into four major types of unit:

! Instruction fetch/instruction decode unit

! Reservation stations

◦ These are buffers that hold the instructions’ operands and control state

! Integer and floating-point out-of-order execution units

◦ Execution occurs whenever there are no data dependences or hazards

! Commit unit

◦ Maintains a reorder buffer

◦ Buffers the results of execution until it is safe to write the results

◦ The reorder buffer can also provide operands, like the forwarding units in a statically scheduled pipeline
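The in-order commit behavior of the reorder buffer can be sketched in C as a circular queue: entries are allocated in program order, may be marked finished in any order, and are retired only from the head. This is a minimal sketch under assumed names (ReorderBuffer, rob_issue, rob_complete, rob_commit) and an assumed capacity ROB_SIZE; it does not model the buffered result values or operand forwarding.

```c
#include <stdbool.h>

#define ROB_SIZE 8   /* assumed capacity */

typedef struct {
    bool done[ROB_SIZE];  /* has this entry finished executing? */
    int head;             /* oldest entry not yet committed */
    int tail;             /* next slot to allocate at issue */
    int count;            /* number of entries in flight */
} ReorderBuffer;

void rob_init(ReorderBuffer *r) {
    r->head = 0;
    r->tail = 0;
    r->count = 0;
}

/* In-order issue: allocate the next entry. Returns its index, or -1 if full. */
int rob_issue(ReorderBuffer *r) {
    if (r->count == ROB_SIZE) {
        return -1;
    }
    int id = r->tail;
    r->done[id] = false;
    r->tail = (r->tail + 1) % ROB_SIZE;
    r->count++;
    return id;
}

/* Out-of-order execution: any entry may finish at any time. */
void rob_complete(ReorderBuffer *r, int id) {
    r->done[id] = true;
}

/* In-order commit: retire finished entries from the head only.
   Returns the number of entries committed in this call. */
int rob_commit(ReorderBuffer *r) {
    int committed = 0;
    while (r->count > 0 && r->done[r->head]) {
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        committed++;
    }
    return committed;
}
```

If the third of three issued instructions finishes first, rob_commit retires nothing until the first one is also finished, which is what makes it safe to write results in program order.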



A DYNAMICALLY SCHEDULED PIPELINE

[Figure: the instruction fetch and decode unit issues instructions in order to the reservation stations; each reservation station feeds a functional unit (integer, integer, …, floating point, load/store) that executes out of order; the commit unit commits results in order]



TOMASULO’S ALGORITHM (1)

• Focuses on floating-point execution

! Originally designed for the IBM 360/91, with long memory-access and floating-point execution times

! Can support overlapping execution of multiple loop instances

• Tomasulo addressed limitations of the scoreboard approach

! Hazard detection and control of execution are distributed to the reservation stations

! Results are forwarded directly to the functional units instead of going through registers

◦ Results are broadcast on a common data bus

IMPLEMENTATION OF TOMASULO’S ALGORITHM

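[Figure: the floating-point operation queue (fed from the instruction unit), the FP registers, and the load buffers (fed from memory) supply the reservation stations of the FP adders and FP multipliers over the operation bus and operand buses; results return on the common data bus (CDB) to the reservation stations, the FP registers, and the store buffers (to memory)]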



TOMASULO’S ALGORITHM (2)

• Control fields in each reservation station:

! Operation

! The IDs of the reservation stations that will produce the operands

◦ The reservation stations can rename registers

◦ This enables overlapping different loop iterations

! The values of the operands

◦ Note that values are available sooner than if the functional units had to contend for access to write to a register

! A “busy” flag

• Control fields for each register and store buffer:

! The number of the functional unit that will produce the value to be written

! A “busy” flag
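The control fields listed above can be sketched as C structures, together with a broadcast on the common data bus. This is only an illustration: the field names (qj, qk, vj, vk, q), the sizes NUM_RS and NUM_REGS, and the convention that tag 0 means "value already available" are assumptions, and the register status here is tagged with the ID of the producing reservation station.

```c
#include <stdbool.h>

#define NUM_RS   5   /* assumed number of reservation stations */
#define NUM_REGS 8   /* assumed number of FP registers */

typedef struct {
    bool busy;       /* the "busy" flag */
    char op;         /* operation, e.g. '+' or '*' in this sketch */
    int qj, qk;      /* IDs of the stations that will produce the operands;
                        0 means the value is already in vj or vk */
    double vj, vk;   /* operand values */
} ReservationStation;

typedef struct {
    bool busy;       /* a write to this register is pending */
    int q;           /* ID of the unit that will produce the value */
    double value;
} RegisterStatus;

/* Broadcast a result on the common data bus: every busy station and every
   busy register waiting on this tag captures the value at the same time. */
void cdb_broadcast(ReservationStation rs[], RegisterStatus regs[],
                   int tag, double result)
{
    for (int i = 0; i < NUM_RS; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == tag) { rs[i].vj = result; rs[i].qj = 0; }
        if (rs[i].qk == tag) { rs[i].vk = result; rs[i].qk = 0; }
    }
    for (int r = 0; r < NUM_REGS; r++) {
        if (regs[r].busy && regs[r].q == tag) {
            regs[r].value = result;
            regs[r].busy = false;
            regs[r].q = 0;
        }
    }
}

/* A station may start executing once both operand values are present. */
bool rs_ready(const ReservationStation *s) {
    return s->busy && s->qj == 0 && s->qk == 0;
}
```

Because a waiting station captures the value directly from the bus, it does not have to wait for the producer to write the register file first.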


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 3rd Edition

PIPELINING IMPROVES THROUGHPUT

[Figure: clock rate (slower to faster) plotted against instruction throughput (instructions per clock cycle, or 1/CPI; slower to faster) for the single-cycle, multicycle, pipelined, deeply pipelined, multiple-issue pipelined, and multiple-issue deep-pipeline datapaths]


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 3rd Edition

PIPELINING DOES NOT IMPROVE LATENCY

[Figure: hardware (shared to specialized) plotted against clock cycles of latency for an instruction (1 to several) for the single-cycle, multicycle, pipelined, deeply pipelined, multiple-issue pipelined, and multiple-issue deep-pipeline datapaths]



MICROPROCESSOR PIPELINES

Microprocessor                Year   Clock Rate   Pipeline   Issue   Out-of-Order/   Cores/   Power
                                                  Stages     Width   Speculation     Chip

Intel 486                     1989     25 MHz      5         1       No              1          5 W
Intel Pentium                 1993     66 MHz      5         2       No              1         10 W
Intel Pentium Pro             1997    200 MHz     10         3       Yes             1         29 W
Intel Pentium 4 Willamette    2001   2000 MHz     20         3       Yes             1         75 W
Intel Pentium 4 Prescott      2004   3600 MHz     31         3       Yes             1        103 W
Intel Core                    2006   2930 MHz     14         4       Yes             2         75 W
Sun UltraSPARC III            2003   1950 MHz     14         4       No              1         90 W
Sun UltraSPARC T1 (Niagara)   2005   1200 MHz      6         1       No              8         70 W


© Intel

PENTIUM 4 “WILLAMETTE” CHIP LAYOUT

[Figure: chip layout with labeled regions: 400 MHz system bus, advanced transfer cache, hyperpipeline (20 stages), enhanced floating point & multimedia, execution trace cache, rapid execution, advanced dynamic execution (A.D.E.), data cache]



PENTIUM 4 FUNCTIONAL UNITS

• “400 MHz” system bus

! 100 MHz clock, 4 data transfers per clock period (“quad-pumped”)

• Advanced transfer cache (L2 cache, 256 kB, instructions + data)

• Execution trace cache

! L1 instruction cache; stores decoded CISC instructions

• Hyperpipelined unit

! Used for uniform-length microinstructions (“micro-ops,” in Intel-speak)

• Enhanced floating-point & multimedia unit

• Rapid execution engine

! Parallel, partly double-clocked execution of microinstructions

• Advanced dynamic execution

! Deep, out-of-order speculative execution & branch prediction



AMD OPTERON X4 MICROARCHITECTURE

[Figure: block diagram with instruction cache, instruction prefetch and decode, branch prediction, RISC-operation queue, dispatch and register renaming, register file, integer and floating-point operation queue, load/store queue, three integer ALUs (one with multiplier), floating-point adder/SSE, floating-point multiplier/SSE, floating-point misc, data cache, and commit unit]



AMD OPTERON X4 PIPELINE

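[Figure: pipeline stages with number of clock cycles per stage: instruction fetch; decode and translate; RISC-operation queue; reorder buffer allocation + register renaming; scheduling + dispatch unit; execution; data cache/commit. The cycle counts survive in the extracted text only as 3, 2, 2, 2, 2, 1, and their assignment to stages is not recoverable]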

[Figure: pipelined datapath with control, showing the IF, ID, EX, MEM, and WB stages, the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, and the control signals (RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg) carried through them]
