4304-6-pipe

Page 1: 4304-6-pipe

The University of Texas at Dallas
Erik Jonsson School of Engineering & Computer Science

© C. D. Cantrell (12/1999)

PIPELINING: A CONTINUATION OF PROCESSOR DESIGN

• We found that the single-cycle implementation wastes time
  ◦ All instructions take as long as the instruction with the longest delay (lw)

• In the multicycle implementation:
  ◦ The clock period is much shorter than in the single-cycle implementation
  ◦ Instructions take only as many clock periods as they need
  ◦ BUT: Each functional unit is used only once or twice in executing an instruction
    - We need an implementation in which each functional unit is busy in every clock period
    - This is possible if we cut the execution of an instruction into stages, and then overlap the execution of different instructions

Page 2: 4304-6-pipe

© C. D. Cantrell (09/2011)

STEPS IN EXECUTING AN INSTRUCTION

Instruction fetch (all instruction classes)
    IR = M[PC]
    PC = PC + 4

Instruction decode, register fetch (all instruction classes)
    A = Reg[IR[25–21]]
    B = Reg[IR[20–16]]
    ALUOut = PC + (sign-extend(IR[15–0]) << 2)

Execution, address computation, branch/jump completion
    R-type:            ALUOut = A op B
    Memory reference:  ALUOut = A + sign-extend(IR[15–0])
    Branches:          If A == B then PC = ALUOut
    Jumps:             PC = PC[31–28] concatenated with (IR[25–0] << 2)

Memory access or R-type completion
    R-type:            Reg[IR[15–11]] = ALUOut
    Memory reference:  Load: MDR = M[ALUOut]
                       Store: M[ALUOut] = B

Memory read completion
    Memory reference:  Load: Reg[IR[20–16]] = MDR

Page 3: 4304-6-pipe

© C. D. Cantrell (10/2011)

PIPELINING

• In a pipelined computer architecture, a single processor can execute several instructions concurrently, reducing the CPI
  ◦ Execution of one instruction uses several hardware functional units (instruction memory, register file, ALU, data memory, etc.)
  ◦ The functional units are organized into stages
    - Execution at each stage takes 1 clock period
    - Stages are separated by clock-controlled pipeline registers that preserve the state of execution for the duration of a clock period
  ◦ The pipeline is subject to hazards
    - Data hazards: Write/read conflicts or timing problems
    - Control hazards: Exceptions and branches

• The MIPS R2000 pipeline design strongly influenced the design of all subsequent processors

Page 4: 4304-6-pipe

© C. D. Cantrell (10/2011)

PIPELINING: PLUSES AND MINUSES

• What makes pipelining easy in the MIPS ISA:
  ◦ All instructions are the same length
  ◦ There are only a few instruction formats
  ◦ Memory operands occur only in loads and stores

• What makes pipelining hard in any ISA:
  ◦ Structural hazards (e.g., contention for the same functional unit)
  ◦ Control (branch & exception) hazards
  ◦ Data hazards (e.g., trying to read a register before it's written)

  We will build a simple pipeline to illustrate these issues

• Pipelining is even more difficult in modern general-purpose microprocessors
  ◦ Exception handling is a challenge
  ◦ Performance improvements such as simultaneous instruction issue, out-of-order execution, etc., create lots of complications

Page 5: 4304-6-pipe

© C. D. Cantrell (09/2010)

PIPELINE DESIGN APPROACH

• Begin with the multicycle implementation
  ◦ Different functional units are all executing the same instruction, although the units are active in different clock periods
  ◦ Control information does not need to be stored in the temporary registers

• Identify the changes that need to be made in the pipelined design
  ◦ Different functional units are executing different instructions
  ◦ All information needed for execution of a given instruction must propagate through the pipeline with the instruction
  ◦ Control information must be stored between stages, because the control signals are different for different instructions
    - Control design is a source of complexity in pipeline design (think about what happens when a branch is taken)
  ◦ Results of execution may differ from the multicycle implementation
    - Data hazards are another source of complexity

Page 6: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

MULTICYCLE DATAPATH AND CONTROL

[Figure: multicycle datapath (PC, memory, instruction register, memory data register, register file, A and B registers, ALU, ALUOut) with the control unit and its outputs: PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst.]

Page 7: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

SINGLE-CYCLE DATAPATH

[Figure: single-cycle datapath divided into five stages. IF: instruction fetch; ID: instruction decode/register file read; EX: execute/address calculation; MEM: memory access; WB: write back.]

Page 8: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition

SEQUENTIAL vs. PIPELINED EXECUTION

[Figure: execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), in program execution order, against a time axis in picoseconds. Top: sequential execution, with 800 ps between the starts of successive instructions. Bottom: pipelined execution, in which each stage (instruction fetch, Reg, ALU, data access, Reg) takes 200 ps and successive instructions start 200 ps apart.]

Page 9: 4304-6-pipe

© C. D. Cantrell (11/1999)

PIPELINING (2)

• Pipeline speedup:
  ◦ A pipelined processor with $s$ stages can execute $n$ instructions in

      $ET_P = s + (n - 1)$ clock periods

    (assuming no hazards)
  ◦ A serial processor executes the same $n$ instructions in

      $ET_S = ns$ clock periods

  ◦ The ideal pipeline speedup equals the number of stages:

      $$S_P = \frac{ET_S}{ET_P} = \frac{ns}{s + (n - 1)} \xrightarrow{\;n \gg s\;} s$$

• Amdahl's law applies to pipelining
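The speedup formula can be checked numerically. The following is a minimal Python sketch, assuming the hazard-free model above; the function names are illustrative and do not come from the slides.

```python
def pipelined_periods(n: int, s: int) -> int:
    """Clock periods for n instructions on an s-stage pipeline (no hazards)."""
    return s + (n - 1)


def serial_periods(n: int, s: int) -> int:
    """Clock periods for n instructions when each takes all s stages serially."""
    return n * s


def speedup(n: int, s: int) -> float:
    """Ideal pipeline speedup S_P = ET_S / ET_P."""
    return serial_periods(n, s) / pipelined_periods(n, s)


if __name__ == "__main__":
    # With s = 5 stages the speedup approaches 5 only when n is much larger than s.
    for n in (1, 5, 100, 1_000_000):
        print(f"n = {n:>9}: ET_P = {pipelined_periods(n, 5):>9}, speedup = {speedup(n, 5):.3f}")
```

For n = 5 and s = 5 the model gives 9 clock periods, so a short instruction sequence gets well under the ideal speedup of 5.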

Page 10: 4304-6-pipe

© C. D. Cantrell (05/1999)

PROGRAMMING IMPLICATIONS OF PIPELINING

• Avoid function or subprogram calls in an inner loop
  ◦ Jumps force the pipeline to be flushed

• Avoid recursion in an inner loop
  ◦ Recursion on the elements of an array generally causes data hazards because the value of v[n] has not been written before it is needed for the computation of v[n+1]

• Avoid scalar temporary variables in an inner loop
  ◦ Reading a memory-resident scalar variable may cause a data hazard

• Avoid case and switch statements in an inner loop
  ◦ Conditional branches cause control hazards, and the use of a jump table may cause data hazards

Page 11: 4304-6-pipe

© C. D. Cantrell (05/1999)

MIPS PIPELINES (1)

• MIPS R2000 integer unit pipeline stages (Patterson & Hennessy, Chapter 6)
  1. Instruction Fetch (IF)
  2. Instruction Decode (ID) and Register Fetch
  3. Execute (EX or ALU)
     ◦ ALU operations, condition evaluation, address computation
  4. Memory access (MEM)
  5. Write back (WB) to register file

  Clock period:  1    2    3    4    5
  Stage:         IF   ID   EX   MEM  WB

Page 12: 4304-6-pipe

© C. D. Cantrell (05/1999)

R4000 PIPELINE

[Figure: the eight-stage pipeline of the R4000 (IF, IS, RF, EX, DF, DS, TC, WB), drawn over the functional units instruction memory, Reg, ALU, data memory, and Reg.]

Page 13: 4304-6-pipe

© C. D. Cantrell (05/1999)

MIPS PIPELINES (2)

• MIPS R2000 floating-point unit pipeline stages
  1. Instruction Fetch (IF)
  2. Register Fetch and Instruction Decode (RD)
     ◦ FPU decodes instruction on bus to see if it's floating-point
     ◦ FPU reads data from its registers
  3. Execute (EX or ALU)
  4. Memory access (MEM)
  5. Exception processing (stage called WB for correspondence with integer pipeline)
  6. Write back (FWB)

  Clock period:  1    2    3    4    5    6
  Stage:         IF   ID   EX   MEM  WB   FWB

Page 14: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

SINGLE-CYCLE DATAPATH

[Figure: the single-cycle datapath, shown again with the five stages IF, ID, EX, MEM, and WB marked.]

Page 15: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PIPELINED EXECUTION IN SINGLE-CYCLE DATAPATH

Time (in clock cycles) runs left to right; program execution order (in instructions) runs downward.

                    CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7
  lw $1, 100($0)    IM    Reg   ALU   DM    Reg
  lw $2, 200($0)          IM    Reg   ALU   DM    Reg
  lw $3, 300($0)                IM    Reg   ALU   DM    Reg

Page 16: 4304-6-pipe

SINGLE-CYCLE DATAPATH WITH PIPELINE REGISTERS

[Figure: the single-cycle datapath divided by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB; each register is labeled with an input side and an output side.]

Because the state of a D flip-flop changes only on clock edges, new data can be asserted on the inputs of the pipeline registers while the data written in the previous clock period is still valid on the outputs

Page 17: 4304-6-pipe

© C. D. Cantrell (09/2011)

MASTER-SLAVE D FLIP-FLOP

• The master latch (on the left) receives the D and clock (C) inputs
  ◦ When the clock is asserted, the Q output of the master latch follows the data (D)
  ◦ When the clock is deasserted, the master latch is closed, but the second (slave) latch is open
    - The output of the slave latch follows its input, which is the output of the master latch

[Figure: two D latches in series (master on the left, slave on the right), with inputs D and C and outputs Q and its complement.]

Page 18: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

COMBINATIONAL LOGIC AND STATE ELEMENTS

[Figure: state element 1 feeds combinational logic, which feeds state element 2, within one clock cycle.]

• Every state element has 2 control inputs: Clock signal and write enable

Page 19: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STAGE 1 OF A LOAD INSTRUCTION

[Figure: pipelined datapath with lw in the instruction fetch stage.]

Page 20: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STAGE 2 OF A LOAD INSTRUCTION

[Figure: pipelined datapath with lw in the instruction decode stage.]

Page 21: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STAGE 3 OF A LOAD INSTRUCTION

[Figure: pipelined datapath with lw in the execution stage.]

Page 22: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STAGE 4 OF A LOAD INSTRUCTION

[Figure: pipelined datapath with lw in the memory stage.]

Page 23: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STAGE 5 OF A LOAD INSTRUCTION

[Figure: pipelined datapath with lw in the write back stage.]

Page 24: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STAGE 3 OF A STORE INSTRUCTION

[Figure: pipelined datapath with sw in the execution stage.]

Page 25: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STAGE 4 OF A STORE INSTRUCTION

[Figure: pipelined datapath with sw in the memory stage.]

Page 26: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STAGE 5 OF A STORE INSTRUCTION

[Figure: pipelined datapath with sw in the write back stage.]

Page 27: 4304-6-pipe

© C. D. Cantrell (10/2011)

DATAPATH MODIFICATIONS FOR PIPELINING

• The number of the register that an instruction must write to is read in the ID stage
  ◦ The name of the signal is WriteReg

• Consider two instructions:

    lw  $10, 20($1)
    sub $11, $2, $3

  ◦ WriteReg signal values are 10 (for lw) and 11 (for sub)
  ◦ The WriteReg signal is read only in the WB stage
  ◦ The sub's ID stage modifies WriteReg before lw can read it
  ◦ Therefore the value of the WriteReg signal is part of the instruction's state, and must be passed along in pipeline registers as the instruction executes
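The WriteReg argument can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the actual datapath: it tracks only which register number reaches the register file in WB, either from a single shared WriteReg wire driven by the ID stage or from a copy carried along with each instruction. The names `Instr` and `run` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    write_reg: int  # destination register number decoded in the ID stage


def run(program, carry_write_reg):
    """Return {instruction name: register number used when it reaches WB}."""
    stages = [None] * 5    # instruction in IF, ID, EX, MEM, WB
    carried = [None] * 5   # WriteReg copy travelling with each instruction
    shared_write_reg = None  # one wire, overwritten by whoever is in ID
    written = {}
    pending = list(program)
    while pending or any(stages):
        # One clock period: every instruction moves one stage to the right.
        stages = [pending.pop(0) if pending else None] + stages[:-1]
        carried = [None] + carried[:-1]
        in_id = stages[1]
        if in_id is not None:
            shared_write_reg = in_id.write_reg
            carried[1] = in_id.write_reg
        in_wb = stages[4]
        if in_wb is not None:
            written[in_wb.name] = carried[4] if carry_write_reg else shared_write_reg
    return written


program = [Instr("lw", 10), Instr("sub", 11)]
print(run(program, carry_write_reg=False))  # lw uses the value sub left on the wire
print(run(program, carry_write_reg=True))   # each instruction keeps its own WriteReg
```

Without the carried copy, lw reaches WB after sub has already passed through ID, so it would write register 11 instead of register 10.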

Page 28: 4304-6-pipe

DATAPATH MODIFIED TO HANDLE A LOAD

[Figure: pipelined datapath in which the WriteReg value is carried through the ID/EX, EX/MEM, and MEM/WB pipeline registers.]

Page 29: 4304-6-pipe

PIPELINE STAGES USED BY A LOAD INSTRUCTION

[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB, showing the stages used by a load instruction.]

Page 30: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

TWO REPRESENTATIONS OF PIPELINED EXECUTION

Representation 1: functional units used in each clock cycle (time in clock cycles runs left to right; program execution order runs downward)

                     CC 1  CC 2  CC 3  CC 4  CC 5  CC 6
  lw  $10, 20($1)    IM    Reg   ALU   DM    Reg
  sub $11, $2, $3          IM    Reg   ALU   DM    Reg

Representation 2: stage names in each clock cycle

  lw  $10, 20($1):  CC 1 instruction fetch, CC 2 instruction decode, CC 3 execution, CC 4 data access, CC 5 write back
  sub $11, $2, $3:  CC 2 instruction fetch, CC 3 instruction decode, CC 4 execution, CC 5 data access, CC 6 write back

Page 31: 4304-6-pipe

© C. D. Cantrell (10/2011)

PIPELINED EXECUTION OF TWO INSTRUCTIONS

• Exercise: Show the signal values in the datapath in the pipelined execution of the instructions

    lw  $10, 20($1)
    sub $11, $2, $3

  in each of the following six slides
  ◦ Assume the following register and memory contents:

      ($1) = 0x1000 0000
      (M[0x1000 0014]) = 0x7fff fffc
      ($2) = 0x0000 000e
      ($3) = 0x0000 0008

  ◦ Also show the values of the WriteReg signal in each stage

Page 32: 4304-6-pipe

CLOCK PERIOD 1

[Figure: pipelined datapath, clock 1. Instruction fetch: lw $10, 20($1).]

Page 33: 4304-6-pipe

CLOCK PERIOD 2

[Figure: pipelined datapath, clock 2. Instruction decode: lw $10, 20($1); instruction fetch: sub $11, $2, $3.]

Page 34: 4304-6-pipe

CLOCK PERIOD 3

[Figure: pipelined datapath, clock 3. Execution: lw $10, 20($1); instruction decode: sub $11, $2, $3.]

Page 35: 4304-6-pipe

CLOCK PERIOD 4

[Figure: pipelined datapath, clock 4. Memory: lw $10, 20($1); execution: sub $11, $2, $3.]

Page 36: 4304-6-pipe

CLOCK PERIOD 5

[Figure: pipelined datapath, clock 5. Write back: lw $10, 20($1); memory: sub $11, $2, $3.]

Page 37: 4304-6-pipe

CLOCK PERIOD 6

[Figure: pipelined datapath, clock 6. Write back: sub $11, $2, $3.]

Page 38: 4304-6-pipe

© C. D. Cantrell (10/2011)

PIPELINE STAGES: DATAPATH

IF (all instruction classes)
    IF/ID Instruction = IM[PC]
    IF/ID PC = PC + 4

ID (all instruction classes)
    ID/EX PC = IF/ID PC
    ID/EX A = Reg[IF/ID Instruction[25–21]]
    ID/EX B = Reg[IF/ID Instruction[20–16]]
    ID/EX Immediate = sign-extend(IF/ID Instruction[15–0])

EX
    R-type:            EX/MEM ALUOut = A op B
                       WriteReg = ID/EX Inst[15–11]
    Memory reference:  EX/MEM ALUOut = A + ID/EX Immediate
                       WriteReg = ID/EX Inst[20–16]
                       EX/MEM B = ID/EX B
    Branches:          EX/MEM PC = ID/EX PC + ((ID/EX Imm) << 2)

MEM
    R-type:            MEM/WB ALUOut = EX/MEM ALUOut
    Memory reference:  Address = EX/MEM ALUOut
                       Load: MEM/WB ReadData = DM[Address]
                       Store: DM[Address] = EX/MEM B
    Branches:          PC = EX/MEM PC
    MEM/WB WriteReg = EX/MEM WriteReg

WB
    R-type:            Reg[MEM/WB WriteReg] = MEM/WB ALUOut
    Memory reference:  Load: Reg[MEM/WB WriteReg] = MEM/WB ReadData
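The register transfers in the table can be followed for a single load. This is a rough Python sketch, assuming each pipeline register is modelled as a dictionary and that only one lw is in flight; the function `run_lw`, the field names `rs`, `rt`, and `imm16`, and the starting PC are illustrative choices, not part of the slides. The data values are those of the two-instruction exercise (lw $10, 20($1)).

```python
MASK32 = 0xFFFF_FFFF


def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python int."""
    value &= 0xFFFF
    return value - 0x1_0000 if value & 0x8000 else value


def run_lw(rs, rt, imm, regs, dmem, pc=0):
    """Step one lw through IF, ID, EX, MEM, WB and return the pipeline registers."""
    # IF: the fetched instruction is represented by its decoded fields.
    if_id = {"PC": pc + 4, "rs": rs, "rt": rt, "imm16": imm & 0xFFFF}
    # ID: read both registers and sign-extend the immediate.
    id_ex = {
        "PC": if_id["PC"],
        "A": regs[if_id["rs"]],
        "B": regs[if_id["rt"]],
        "Immediate": sign_extend16(if_id["imm16"]),
        "rt": if_id["rt"],  # Inst[20-16], needed later as WriteReg
    }
    # EX: address computation; WriteReg for a load is Inst[20-16].
    ex_mem = {
        "ALUOut": (id_ex["A"] + id_ex["Immediate"]) & MASK32,
        "B": id_ex["B"],
        "WriteReg": id_ex["rt"],
    }
    # MEM: read data memory at the computed address.
    mem_wb = {"ReadData": dmem[ex_mem["ALUOut"]], "WriteReg": ex_mem["WriteReg"]}
    # WB: write the loaded value into the register file.
    regs[mem_wb["WriteReg"]] = mem_wb["ReadData"]
    return if_id, id_ex, ex_mem, mem_wb


regs = [0] * 32
regs[1] = 0x1000_0000
dmem = {0x1000_0014: 0x7FFF_FFFC}
_, _, ex_mem, mem_wb = run_lw(rs=1, rt=10, imm=20, regs=regs, dmem=dmem)
print(hex(ex_mem["ALUOut"]), hex(mem_wb["ReadData"]), mem_wb["WriteReg"])
```

The sketch executes the stages one after another for one instruction; it does not model overlap, hazards, or control signals.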

Page 39: 4304-6-pipe

PIPELINED DATAPATH WITH CONTROL SIGNALS

[Figure: pipelined datapath with the control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg; the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB carry PC, A, B, Imm, ALUOut, and WriteReg.]

Page 40: 4304-6-pipe

© C. D. Cantrell (05/1999)

SETTINGS OF CONTROL LINES

                 Execution/address calculation     Memory access               Write back
  Instruction    RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
  R-type         1       1       0       0         0       0        0          1         0
  lw             0       0       0       1         0       1        0          1         1
  sw             d       0       0       1         0       0        1          0         d
  beq            d       0       1       0         1       0        0          0         d

  (d = don't care)
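The control-line table can also be written as a lookup structure. The following is a minimal Python sketch, assuming `None` stands for the table's "d" entries; the names `CONTROL`, `STAGE_GROUPS`, and `control_groups` are hypothetical.

```python
# None stands for a "d" (don't care) entry in the table.
CONTROL = {
    "R-type": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0,
               "Branch": 0, "MemRead": 0, "MemWrite": 0,
               "RegWrite": 1, "MemtoReg": 0},
    "lw":     {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1,
               "Branch": 0, "MemRead": 1, "MemWrite": 0,
               "RegWrite": 1, "MemtoReg": 1},
    "sw":     {"RegDst": None, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1,
               "Branch": 0, "MemRead": 0, "MemWrite": 1,
               "RegWrite": 0, "MemtoReg": None},
    "beq":    {"RegDst": None, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0,
               "Branch": 1, "MemRead": 0, "MemWrite": 0,
               "RegWrite": 0, "MemtoReg": None},
}

# Control lines grouped by the stage that consumes them.
STAGE_GROUPS = {
    "EX": ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc"),
    "M": ("Branch", "MemRead", "MemWrite"),
    "WB": ("RegWrite", "MemtoReg"),
}


def control_groups(kind):
    """Split one instruction's control settings into its EX, M, and WB groups."""
    settings = CONTROL[kind]
    return {group: {name: settings[name] for name in names}
            for group, names in STAGE_GROUPS.items()}


print(control_groups("lw"))
```

Grouping the lines by consuming stage mirrors the three column groups of the table: execution/address calculation, memory access, and write back.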

Page 41: 4304-6-pipe

© C. D. Cantrell (10/2011)

PIPELINED CONTROL

• The control signals for an instruction are determined in the ID stage
  ◦ The next instruction's ID stage asserts new values of the control signals
  ◦ The current instruction's control signals must be preserved for all stages after ID
  ◦ Control signals are part of the state of the instruction, and therefore must be passed along from stage to stage in pipeline registers, just like data
  ◦ Some instruction fields (such as Immediate) must be preserved until they are needed in later stages

Page 42: 4304-6-pipe

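CONTROL LINES FOR THE THREE FINAL STAGES

[Figure: the control unit decodes the instruction held in IF/ID and writes three groups of control lines into ID/EX. EX group: RegDst, ALUOp, ALUSrc. M group: Branch, MemRead, MemWrite, carried on through EX/MEM. WB group: RegWrite, MemtoReg, carried on through EX/MEM and MEM/WB.]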

Page 43: 4304-6-pipe

PIPELINED DATAPATH WITH CONTROL LOGIC AND SIGNALS

[Figure: pipelined datapath with the control unit; the EX, M, and WB control groups are stored in the ID/EX, EX/MEM, and MEM/WB pipeline registers alongside PC, A, B, Imm, and WriteReg.]

Page 44: 4304-6-pipe

© C. D. Cantrell (10/2011)

PIPELINED EXECUTION OF FIVE INSTRUCTIONS

• We'll follow what happens in the instruction sequence

    [40000024] lw  $10, 20($1)
    [40000028] sub $11, $2, $3
    [4000002c] and $12, $4, $5
    [40000030] or  $13, $6, $7
    [40000034] add $14, $8, $9

  ◦ For each clock period, note the values of the following signals in the ID/EX, EX/MEM, and MEM/WB pipeline registers:
    - WB: RegWrite, MemtoReg
    - M: Branch, MemRead, MemWrite
    - EX: RegDst, ALUOp, ALUSrc
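Which instruction occupies which stage in a given clock period follows directly from the issue order. The sketch below is a small Python illustration, assuming one instruction enters IF per clock period with no stalls; `stage_occupancy` is a hypothetical helper, and `None` marks a stage holding an instruction from before or after the five-instruction sequence.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "and $12, $4, $5",
    "or $13, $6, $7",
    "add $14, $8, $9",
]


def stage_occupancy(program, clock):
    """Map each stage to the program instruction it holds in clock period `clock` (1-based)."""
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        index = clock - 1 - depth  # instruction index that has reached this stage
        occupancy[stage] = program[index] if 0 <= index < len(program) else None
    return occupancy


if __name__ == "__main__":
    for clock in range(1, 10):
        print(f"Clock {clock}: {stage_occupancy(PROGRAM, clock)}")
```

In clock period 5 every stage holds one of the five instructions, and the last of them leaves WB in clock period 9.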

Page 45: 4304-6-pipe

[Figure: pipelined datapath with control, clock 1. IF: lw $10, 20($1); ID: before<1>; EX: before<2>; MEM: before<3>; WB: before<4>.]

Page 46: 4304-6-pipe

[Figure: pipelined datapath with control, clock 2. IF: sub $11, $2, $3; ID: lw $10, 20($1); EX: before<1>; MEM: before<2>; WB: before<3>.]

Page 47: 4304-6-pipe

[Figure: pipelined datapath with control, clock 3. IF: and $12, $4, $5; ID: sub $11, $2, $3; EX: lw $10, ...; MEM: before<1>; WB: before<2>.]

Page 48: 4304-6-pipe

[Figure: pipelined datapath with control, clock 4. IF: or $13, $6, $7; ID: and $12, $4, $5; EX: sub $11, ...; MEM: lw $10, ...; WB: before<1>.]

Page 49: 4304-6-pipe

[Figure: pipelined datapath with control, clock 5. IF: add $14, $8, $9; ID: or $13, $6, $7; EX: and $12, ...; MEM: sub $11, ...; WB: lw $10, ....]

Page 50: 4304-6-pipe

[Figure: pipelined datapath with control, clock 6. IF: after<1>; ID: add $14, $8, $9; EX: or $13, ...; MEM: and $12, ...; WB: sub $11, ....]

Page 51: 4304-6-pipe

[Figure: pipelined datapath with control, clock 7. IF: after<2>; ID: after<1>; EX: add $14, ...; MEM: or $13, ...; WB: and $12, ....]

Page 52: 4304-6-pipe

[Figure: pipelined datapath with control, clock 8. IF: after<3>; ID: after<2>; EX: after<1>; MEM: add $14, ...; WB: or $13, ....]

Page 53: 4304-6-pipe

[Figure: pipelined datapath with control, clock 9. IF: after<4>; ID: after<3>; EX: after<2>; MEM: after<1>; WB: add $14, ....]

Page 54: 4304-6-pipe

© C. D. Cantrell (05/1999)

DATA HAZARDS (1)

• Data hazards occur when the order of read and write actions differs from the order in strictly sequential execution
  ▸ Hazards are named by the ordering in the program that must be preserved in the course of pipelined execution
    ◦ In the following, the order of execution should be i, then j
  ▸ RAW (read after write): j reads a source before i has written it
    ◦ j incorrectly gets the old value
    ◦ Most common kind of data hazard
  ▸ WAR (write after read): j writes a destination before i reads it
    ◦ i incorrectly gets the new value
  ▸ WAW (write after write): j should write an operand after i writes it, but the writes are performed in the wrong order, incorrectly leaving the value written by i

• Hazards limit pipeline speedup and complicate design
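The three hazard names can be made concrete with a small sketch. It is illustrative only: the `Instr` record and the `dependences` helper are hypothetical names, not part of the course material, and registers are represented simply as integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """An instruction reduced to the register numbers it reads and writes."""
    text: str
    reads: frozenset
    writes: frozenset


def dependences(i, j):
    """Orderings that must be preserved when i precedes j in program order."""
    kinds = set()
    if i.writes & j.reads:
        kinds.add("RAW")  # j must not read before i has written
    if i.reads & j.writes:
        kinds.add("WAR")  # j must not write before i has read
    if i.writes & j.writes:
        kinds.add("WAW")  # j's write must be the one that survives
    return kinds


sub = Instr("sub $2,$1,$3", frozenset({1, 3}), frozenset({2}))
and_ = Instr("and $12,$2,$5", frozenset({2, 5}), frozenset({12}))
print(dependences(sub, and_))
```

Swapping the two arguments turns the same register overlap into a WAR ordering, which shows that the name depends on program order, not on the instructions alone.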

Page 55: 4304-6-pipe

© C. D. Cantrell (05/1999)

DATA HAZARDS (2)

• RAW hazards are generated by the instructions

    sub $2,$1,$3
    and $12,$2,$5
    or  $13,$6,$2
    add $14,$2,$2
    sw  $15,100($2)

[Figure: pipeline diagram in program execution order (IF = instruction fetch, DM = data access)]

  Clock period       1    2    3    4    5    6    7    8    9
  sub $2, $1, $3     IF   Reg  ALU  DM   Reg
  and $12, $2, $5         IF   Reg  ALU  DM   Reg
  or  $13, $6, $2              IF   Reg  ALU  DM   Reg
  add $14, $2, $2                   IF   Reg  ALU  DM   Reg
  sw  $15, 100($2)                       IF   Reg  ALU  DM   Reg

Page 56: 4304-6-pipe

© C. D. Cantrell (05/1999)

DATA HAZARDS (3)

• RAW hazards with i = sub $2,$1,$3 and source = $2 are generated by the instructions

    sub $2,$1,$3
    and $12,$2,$5
    or  $13,$6,$2
    add $14,$2,$2
    sw  $15,100($2)

  ▸ The instruction j = and $12,$2,$5 reads $2 before i writes it
  ▸ The instruction j = or $13,$6,$2 reads $2 before i writes it
  ▸ The instruction j = add $14,$2,$2 reads $2 in the same clock period in which i writes it
    ◦ Generates a hazard if the register file’s outputs change only on the edge of the main processor clock
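The timing behind these three cases can be checked with a short sketch. It assumes a five-stage pipeline that starts one instruction per clock period, reads registers in the second stage and writes them in the fifth; the helper names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_cycle(issue_cycle, stage):
    """Clock period in which an instruction fetched in issue_cycle occupies stage."""
    return issue_cycle + STAGES.index(stage)


def classify(read_cycle, write_cycle):
    if read_cycle < write_cycle:
        return "reads before the write"
    if read_cycle == write_cycle:
        return "reads in the same clock period as the write"
    return "reads after the write"


program = ["sub $2,$1,$3", "and $12,$2,$5", "or $13,$6,$2",
           "add $14,$2,$2", "sw $15,100($2)"]

# sub is fetched in clock period 1, so it writes $2 in clock period 5.
write_cycle = stage_cycle(1, "WB")

report = {}
for issue, text in enumerate(program[1:], start=2):
    report[text] = classify(stage_cycle(issue, "ID"), write_cycle)

for text, verdict in report.items():
    print(f"{text:18s} {verdict}")
```

Under these assumptions only sw reads $2 strictly after the write; add falls into the same-clock-period case that depends on how the register file is clocked.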

Page 57: 4304-6-pipe

© C. D. Cantrell (10/2012)

SIMPLE QUESTIONS ABOUT TIMING

• How many clock periods are required to execute this program segment with various pipelined designs?

          lw  $t2, 0($t3)
          lw  $t3, 4($t3)
          beq $t2, $t3, Label   # Assume branch not taken
          add $t5, $t2, $t3
          sw  $t5, 8($t3)
  Label:  ...

  ▸ What happens during clock period 8?
  ▸ In what clock period does the addition of $t2 and $t3 actually take place?

• This segment takes 21 clock periods in the multicycle implementation
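The figure of 21 clock periods can be reproduced with a few lines of arithmetic. The per-instruction clock counts below are an assumption (the usual textbook multicycle values: 5 for lw, 4 for sw and R-type, 3 for beq), not something stated on this slide.

```python
# Assumed clock periods per instruction in the multicycle implementation.
MULTICYCLE_CLOCKS = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

segment = ["lw", "lw", "beq", "add", "sw"]
total = sum(MULTICYCLE_CLOCKS[op] for op in segment)
print(total)
```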

Page 58: 4304-6-pipe

© C. D. Cantrell (10/2012)

PROGRAM SEGMENT TIMING (1)

• Pipeline hazards in the same program segment as for multicycle

[Figure: pipeline diagram (IM, Reg, DM, Reg, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) for the five instructions lw $t2, 0($t3); lw $t3, 4($t3); beq $t2, $t3, Label; add $t5, $t2, $t3; sw $t5, 8($t3)]

Page 59: 4304-6-pipe

© C. D. Cantrell (10/2012)

PROGRAM SEGMENT TIMING (2)

• In the multicycle implementation, this 5-instruction segment takes 21 clocks
• If there were no hazards, pipelined execution of the segment would take 9 clocks
• The hazard on register $t3 causes a delay of 3 clocks and the hazard on $t5 causes an additional delay of 3 clocks, making the pipelined execution time 15 clock periods, which is 71% of the multicycle execution time!
• Pipeline design improvements can reduce execution times

[Figure: pipeline diagram for lw $t3, 4($t3) followed by beq $t2, $t3, Label, with the beq delayed by 3 clock periods (3 cp)]
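The clock counts quoted for the pipelined segment follow from a one-line formula. The sketch below assumes a five-stage pipeline and takes the stall counts (3 + 3) and the multicycle total (21) from the slide; the function name is hypothetical.

```python
def pipelined_clocks(n_instructions, n_stages=5, stall_clocks=0):
    """Clocks to drain n instructions through an n_stages-deep pipeline."""
    return n_stages + (n_instructions - 1) + stall_clocks


ideal = pipelined_clocks(5)                           # no hazards
with_hazards = pipelined_clocks(5, stall_clocks=3 + 3)  # $t3 and $t5 hazards
percent_of_multicycle = round(100 * with_hazards / 21)

print(ideal, with_hazards, percent_of_multicycle)
```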

Page 60: 4304-6-pipe

© C. D. Cantrell (10/2012)

PROGRAM SEGMENT TIMING (3)

• In the multicycle implementation, this 5-instruction segment takes 21 clocks
• If there were no hazards, pipelined execution of the segment would take 9 clocks
• The hazard on register $t5 causes an additional delay of 3 clocks, making the pipelined execution time 15 clock periods, which is 71% of the multicycle execution time!

[Figure: pipeline diagram over clock periods CP6 to CP15 for beq $t2, $t3, Label; add $t5, $t2, $t3; sw $t5, 8($t3)]

Page 61: 4304-6-pipe

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (1)

• Follows ideas of K. Kennedy et al.
• Definition: If control flow within a program can reach statement T after passing through statement S, then T depends on S
  ▸ Dependence is always defined by reference to the results of serial execution
• Assume a loop
  ▸ Dependence analysis outside a loop is trivial
• Assume an array
  ▸ Dependences that affect pipelining arise from references to the same memory location M[n] in an array

Page 62: 4304-6-pipe

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (2)

• Distinguish between statements (in a program) and instances of those statements (in loop instances or threads)
  ▸ Let i = loop induction variable
    ◦ Example: A for loop in C
      · Syntax: for ( i=0; i<n; i++ ) · · ·
      · The induction variable is i
  ▸ Let \(S_i\) = instance of statement S that occurs on the value i of the induction variable

• Flow dependence (RAW): \(S_{i_S}\) writes M[n] and \(T_{i_T}\) reads M[n]

    S: X[f_S[i]] = · · ·
    T: · · · = F[X[f_T[i]]]

Page 63: 4304-6-pipe

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (3)

• Anti-dependence (WAR): \(S_{i_S}\) reads M[n] and \(T_{i_T}\) writes M[n]

    S: · · · = F[X[f_S[i]]]
    T: X[f_T[i]] = · · ·

• Output dependence (WAW): \(S_{i_S}\) and \(T_{i_T}\) both write M[n]

    S: X[f_S[i]] = · · ·
    T: X[f_T[i]] = · · ·

Page 64: 4304-6-pipe
Page 65: 4304-6-pipe

© C. D. Cantrell (10/2010)

LOOP OVERHEAD EXAMPLE: DOT PRODUCT

# The arguments are in registers $a0 through $a3
# The first argument POINTS to the first element of vector v1
# The second argument POINTS to the first element of vector v2
# Third argument = value of veclen (the dimension of the vectors)
# Fourth argument = data size in bytes

        .text

__start:
dotpro: nop                     # The dot product function
        ori  $v0,$0,0           # Initialize the dot product to 0
        blez $a2,beamup         # Return if veclen <= 0
        or   $t1,$0,$a2         # Register t1 will be a counter; initialized to veclen
        or   $t3,$0,$a0         # Register t3 points to the component of v1
        or   $t4,$0,$a1         # Register t4 points to the component of v2

loop2:  lw   $t5,0($t3)         # Load word pointed to by reg. t3
        lw   $t6,0($t4)         # Load word pointed to by reg. t4
        mul  $t2,$t5,$t6        # Multiply regs. t5 and t6, product in reg. t2
        add  $v0,$v0,$t2        # Add product to running sum in reg. v0
        add  $t3,$a3,$t3        # Increment the pointer to the component of v1
        add  $t4,$a3,$t4        # Increment the pointer to the component of v2
        addi $t1,$t1,-1         # Decrement register t1
        bgtz $t1,loop2          # Loop again if t1>0
beamup: jr   $ra                # Beam me up....

Page 66: 4304-6-pipe

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (4)

• Requirement for the existence of an instance of dependence: A real memory location M[n] exists such that

\[ n = f_S[i_S] = f_T[i_T] \]

where S is executed before T and both values of the induction variable are in the range of the loop:

\[ p \le i_S \le i_T \le q \]

• Example: \(f_S\) and \(f_T\) are linear functions

\[ f_S[i] = a_S i + b_S, \qquad f_T[i] = a_T i + b_T \]

The requirement for dependence implies that

\[ a_S i_S + b_S = a_T i_T + b_T \;\Rightarrow\; a_S i_S - a_T i_T + (b_S - b_T) = 0 \]

Page 67: 4304-6-pipe

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (5)

• Example:

        do 100 i=2,100
    T:    b(i)=a(i-1)
    S:    a(i)=c(i)

  ▸ Here, \(f_S(i) = i\), \(f_T(i) = i - 1\)
  ▸ Condition \(f_S(i_S) = f_T(i_T)\) is \(i_S = i_T - 1 \Rightarrow i_S < i_T\) (hence T depends on S)
  ▸ The equation \(i_S = i_T - 1\) has lots of solutions such that \(2 \le i_S < i_T \le 100\)
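For linear index functions the dependence condition can simply be enumerated. The following is a brute-force sketch, not an efficient dependence test; `dependence_instances` and its parameter names are hypothetical, with a_s, b_s, a_t, b_t standing for the coefficients of f_S and f_T and p, q for the loop bounds.

```python
def dependence_instances(a_s, b_s, a_t, b_t, p, q):
    """All (i_S, i_T) with p <= i_S <= i_T <= q and f_S(i_S) == f_T(i_T)."""
    return [
        (i_s, i_t)
        for i_s in range(p, q + 1)
        for i_t in range(i_s, q + 1)
        if a_s * i_s + b_s == a_t * i_t + b_t
    ]


# The loop above: f_S(i) = i, f_T(i) = i - 1, with i running from 2 to 100.
pairs = dependence_instances(1, 0, 1, -1, 2, 100)
print(len(pairs), pairs[0], pairs[-1])
```

Every solution pairs an iteration with the one immediately after it, which is the "lots of solutions" case: each write to a(i) is read one iteration later.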

Page 68: 4304-6-pipe

© C. D. Cantrell (02/1999)

DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (6)

• In serial execution, T3 reads from a(2) after S2 writes to a(2):

    i = 2:  T2: b(2)=a(1)
            S2: a(2)=c(2)
    i = 3:  T3: b(3)=a(2)
            S3: a(3)=c(3)

• In vector execution, T3 reads from a(2) before S2 writes to it:

    b(2)=a(1)
    b(3)=a(2)
    ...
    a(2)=c(2)
    a(3)=c(3)
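The difference between the two execution orders shows up directly in the values stored in b. The sketch below uses small made-up arrays (index 0 is unused so that the indices match the 1-based Fortran example); the function names and data are illustrative assumptions.

```python
def serial(a, c, lo, hi):
    """Run T then S for each i in turn, as the original loop does."""
    a, b = list(a), [None] * len(a)
    for i in range(lo, hi + 1):
        b[i] = a[i - 1]  # T: b(i) = a(i-1)
        a[i] = c[i]      # S: a(i) = c(i)
    return a, b


def vector(a, c, lo, hi):
    """Run every instance of T first, then every instance of S."""
    a, b = list(a), [None] * len(a)
    for i in range(lo, hi + 1):
        b[i] = a[i - 1]
    for i in range(lo, hi + 1):
        a[i] = c[i]
    return a, b


a0 = [0, 10, 20, 30]  # a(1)=10, a(2)=20, a(3)=30
c0 = [0, 11, 21, 31]  # c(1)=11, c(2)=21, c(3)=31

print(serial(a0, c0, 2, 3)[1])  # b(3) holds the new a(2) = c(2)
print(vector(a0, c0, 2, 3)[1])  # b(3) holds the old a(2)
```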

Page 69: 4304-6-pipe

© C. D. Cantrell (10/2012)

CONTROL HAZARDS

• A control hazard occurs because a branch instruction needs to make a decision based on the results of operations or instructions that are still pending
  ▸ A beq instruction cannot update the PC before the test for equality has completed
    ◦ We’ll see later that it’s possible to make the decision at the Instruction Decode/Register Fetch stage instead of the ALU stage
  ▸ In the MIPS ISA, only the instruction immediately following the branch is executed or not executed, depending on the branch decision
    ◦ This isn’t true in longer pipelines
  ▸ Methods for dealing with control hazards
    ◦ Stall
    ◦ Predict the branch
    ◦ Always execute the instruction following the branch

Page 70: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

STALLING AS A SOLUTION FOR CONTROL HAZARDS

• After a conditional branch (beq) there is a one-stage pipeline stall (bubble), even if we are able to compare the inputs to beq in the ID/RF stage

[Figure: pipeline diagram in program execution order for beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0). Stages: instruction fetch, Reg, ALU, data access, Reg. Time axis from 2 to 16 ns; the intervals between instruction starts are labelled 2 ns and 4 ns]

Page 71: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PREDICTING BRANCHES NOT TAKEN AS A SOLUTION

[Figure: two pipeline diagrams in program execution order, each with a time axis from 2 to 14 ns and stages instruction fetch, Reg, ALU, data access, Reg. Top panel: beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), with 2 ns between instruction starts. Bottom panel: beq $1, $2, 40; add $4, $5, $6; or $7, $8, $9, with a row of five bubbles and intervals labelled 2 ns and 4 ns]

Page 72: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PIPELINE DELAYED BRANCH AS A SOLUTION

• After the conditional branch (beq), we insert an add instruction (which can do useful work) instead of a stall (bubble), which does nothing

[Figure: pipeline diagram in program execution order for beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0). Stages: instruction fetch, Reg, ALU, data access, Reg. Time axis from 2 to 14 ns, with 2 ns between instruction starts]

Page 73: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

FORWARDING AS A SOLUTION FOR DATA HAZARDS (1)

[Figure: pipeline diagram (stages IF, ID, EX, MEM, WB; time axis from 2 to 10 ns) in program execution order for add $s0, $t0, $t1 followed by sub $t2, $s0, $t3]

Page 74: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

FORWARDING AS A SOLUTION FOR DATA HAZARDS (2)

[Figure: pipeline diagram (stages IF, ID, EX, MEM, WB; time axis from 2 to 14 ns) in program execution order for lw $s0, 20($t1) followed by sub $t2, $s0, $t3, with a row of five bubbles between the two instructions]

Page 75: 4304-6-pipe

© C. D. Cantrell (02/1999)

MIPS PIPELINES (3)

• MIPS R2000 integer unit pipeline hazards, named for pipeline registers:

  Notation: in IF/ID.ReadRegisterj, IF/ID is the pipeline register and ReadRegisterj is the name of the register field

  1. ID/EX.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)
  2. EX/MEM.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)
  3. MEM/WB.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)

Page 76: 4304-6-pipe

© C. D. Cantrell (11/1999)

MIPS PIPELINES (4)

• MIPS R2000 integer pipeline hazards generated by the instructions

    sub $2,$1,$3
    and $12,$2,$5
    or  $13,$6,$2
    add $14,$2,$2
    sw  $15,100($2)

  ▸ The instruction and $12,$2,$5 results in hazard 1a,
      ID/EX.WriteRegister = IF/ID.ReadRegister1 = 2
  ▸ The instruction or $13,$6,$2 results in hazard 2b,
      EX/MEM.WriteRegister = IF/ID.ReadRegister2 = 2
  ▸ The instruction add $14,$2,$2 results in hazards 3a and 3b,
      MEM/WB.WriteRegister = IF/ID.ReadRegister1 = 2
      MEM/WB.WriteRegister = IF/ID.ReadRegister2 = 2
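The hazard labels can be generated mechanically from the pipeline-register fields. In this sketch the number is the position of the pipeline register (1 = ID/EX, 2 = EX/MEM, 3 = MEM/WB) and the letter is assumed to encode the read port (a for ReadRegister1, b for ReadRegister2), a convention inferred from the examples above; `pipeline_hazards` is a hypothetical helper.

```python
PIPELINE_REGISTERS = ["ID/EX", "EX/MEM", "MEM/WB"]


def pipeline_hazards(write_regs, read_regs):
    """Label hazards for the instruction currently held in IF/ID.

    write_regs maps a pipeline register name to its WriteRegister number
    (missing or None if that stage writes nothing); read_regs is the pair
    (ReadRegister1, ReadRegister2) of the instruction in IF/ID.
    """
    labels = []
    for number, name in enumerate(PIPELINE_REGISTERS, start=1):
        w = write_regs.get(name)
        for letter, r in zip("ab", read_regs):
            if w is not None and w == r:
                labels.append(f"{number}{letter}")
    return labels


# and $12,$2,$5 in IF/ID while sub $2,... is in ID/EX
print(pipeline_hazards({"ID/EX": 2}, (2, 5)))
# or $13,$6,$2 in IF/ID: and in ID/EX, sub in EX/MEM
print(pipeline_hazards({"ID/EX": 12, "EX/MEM": 2}, (6, 2)))
# add $14,$2,$2 in IF/ID: or in ID/EX, and in EX/MEM, sub in MEM/WB
print(pipeline_hazards({"ID/EX": 13, "EX/MEM": 12, "MEM/WB": 2}, (2, 2)))
```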

Page 77: 4304-6-pipe

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

PIPELINED DEPENDENCIES IN AN INSTRUCTION SEQUENCE

[Figure: pipeline diagram (IM, Reg, DM, Reg) over clock cycles CC 1 to CC 9, in program execution order, for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9]

Page 78: 4304-6-pipe

[Figure: the same pipeline diagram (CC 1 to CC 9; sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2)) with the pipeline register contents added. Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9. Value of EX/MEM: -20 in CC 4, X otherwise. Value of MEM/WB: -20 in CC 5, X otherwise]

Page 79: 4304-6-pipe

PIPELINED DATAPATH WITHOUT FORWARDING

[Figure: Registers, ID/EX, ALU, EX/MEM, data memory, MEM/WB and a multiplexor]

Page 80: 4304-6-pipe

PIPELINED DATAPATH WITH FORWARDING

[Figure: Registers, ID/EX, ALU, EX/MEM, data memory, MEM/WB and multiplexors, with a forwarding unit. Signals labelled at the forwarding unit: Rs, Rt, Rd, EX/MEM.RegisterRd, MEM/WB.RegisterRd, ForwardA, ForwardB]

Page 81: 4304-6-pipe

DATAPATH MODIFIED TO RESOLVE HAZARDS BY FORWARDING

[Figure: PC, instruction memory, IF/ID, Control (EX, M and WB control fields carried through ID/EX, EX/MEM and MEM/WB), Registers, ALU, data memory, multiplexors and forwarding unit. Signals labelled: IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, Rs, Rt, Rd, EX/MEM.RegisterRd, MEM/WB.RegisterRd]

Page 82: 4304-6-pipe

© C. D. Cantrell (05/1999)

PIPELINED EXECUTION WITH FORWARDING

• We’ll follow what happens in the instruction sequence

    [40000028] sub $2, $1, $3
    [4000002c] and $4, $2, $5
    [40000030] or  $4, $4, $2
    [40000034] add $9, $4, $2

  ▸ Without forwarding, there would be RAW hazards on register $2 in the and instruction and on register $4 in the or and add instructions
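A forwarding unit for this sequence can be sketched as a pure function of the pipeline-register fields. The version below follows the usual textbook conditions; the RegWrite qualifiers, the check against register $0 and the priority of EX/MEM over MEM/WB are assumptions added here, and the function and return values are hypothetical names.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_rd, ex_mem_regwrite,
                    mem_wb_rd, mem_wb_regwrite):
    """Choose the source of each ALU input for the instruction in EX."""
    def select(src):
        # The most recent producer (EX/MEM) wins over the older one (MEM/WB).
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
            return "EX/MEM"
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
            return "MEM/WB"
        return "REG"  # value read from the register file is current
    return select(id_ex_rs), select(id_ex_rt)


# and $4,$2,$5 in EX; sub $2,... in MEM; nothing relevant in WB
print(forwarding_unit(2, 5, 2, True, 0, False))
# or $4,$4,$2 in EX; and $4,... in MEM; sub $2,... in WB
print(forwarding_unit(4, 2, 4, True, 2, True))
# add $9,$4,$2 in EX; or $4,... in MEM; and $4,... in WB
print(forwarding_unit(4, 2, 4, True, 4, True))
```

In the last call both EX/MEM and MEM/WB hold a value for $4; giving priority to EX/MEM selects the result of or, the later writer.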

Page 83: 4304-6-pipe

[Figure: forwarding datapath at Clock 3. IF: or $4, $4, $2; ID: and $4, $2, $5; EX: sub $2, $1, $3; MEM: before<1>; WB: before<2>. Annotation: ID/EX.WriteRegister = IF/ID.ReadRegister1 = 2 (RAW data hazard)]

Page 84: 4304-6-pipe

[Figure: forwarding datapath at Clock 4. IF: add $9, $4, $2; ID: or $4, $4, $2; EX: and $4, $2, $5; MEM: sub $2, . . .; WB: before<1>. Annotations: ID/EX.WriteRegister = IF/ID.ReadRegister1 = 4 and EX/MEM.WriteRegister = IF/ID.ReadRegister2 = 2 (RAW data hazards); EX/MEM.WriteRegister = ID/EX.ReadRegister1 = 2 (test by which the need for forwarding to ALUIn1 is actually detected)]

Page 85: 4304-6-pipe

[Figure: forwarding datapath at Clock 5. IF: after<1>; ID: add $9, $4, $2; EX: or $4, $4, $2; MEM: and $4, . . .; WB: sub $2, . . . Annotations: EX/MEM.WriteRegister = IF/ID.ReadRegister1 = 4 and MEM/WB.WriteRegister = IF/ID.ReadRegister2 = 2; EX/MEM.WriteRegister = ID/EX.ReadRegister1 = 4 and MEM/WB.WriteRegister = ID/EX.ReadRegister2 = 2]

Page 86: 4304-6-pipe

[Figure: forwarding datapath at Clock 6. IF: after<2>; ID: after<1>; EX: add $9, $4, $2; MEM: or $4, . . .; WB: and $4, . . . Annotations: EX/MEM.WriteRegister = ID/EX.ReadRegister2 = 4; "Who wrote this value?"]

Page 87: 4304-6-pipe

ADDITION OF A MULTIPLEXOR TO CHOOSE THE IMMEDIATE VALUE

[Figure: forwarding datapath (Registers, ID/EX, ALU, EX/MEM, data memory, MEM/WB, forwarding unit, multiplexors) with an additional multiplexor and the ALUSrc signal]

Page 88: 4304-6-pipe

© C. D. Cantrell (02/1999)

MIPS PIPELINES (3)

• MIPS R2000 integer unit pipeline hazards, named for pipeline registers:

  Notation: in IF/ID.ReadRegisterj, IF/ID is the pipeline register and ReadRegisterj is the name of the register field

  1. ID/EX.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)
  2. EX/MEM.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)
  3. MEM/WB.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)

Page 89: 4304-6-pipe

A DATA HAZARD THAT CANNOT BE RESOLVED BY FORWARDING

[Figure: pipeline diagram (IM, Reg, DM, Reg) over clock cycles CC 1 to CC 9, in program execution order, for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. Annotation: data dependence goes backward in time]

Page 90: 4304-6-pipe

HOW STALLS ARE INSERTED INTO A PIPELINE

[Figure: pipeline diagram (IM, Reg, DM, Reg) over clock cycles CC 1 to CC 10, in program execution order, for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7, with a bubble inserted after the lw]

Page 91: 4304-6-pipe

© C. D. Cantrell (10/2012)

HAZARD DETECTION UNIT

• The control logic for the hazard detection unit is:

    If (ID/EX.MemRead and
        ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
         (ID/EX.RegisterRt = IF/ID.RegisterRt)))
    Then stall the pipeline

  ▸ The first line tests whether the instruction in the EX stage is a load
    ◦ The next two lines check whether the destination register of the load is the same as either of the source registers of the instruction that is in the ID stage

• For a stall, all control signals are deasserted in the EX stage
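The stall condition above translates directly into a boolean function. This is a minimal sketch with hypothetical argument names that mirror the pipeline-register fields; register numbers are plain integers.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when a load in EX produces a register that the instruction in ID reads."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2,20($1) in EX, and $4,$2,$5 in ID: load-use hazard
print(must_stall(True, 2, 2, 5))
# sub $2,$1,$3 in EX (not a load), and $4,$2,$5 in ID: forwarding suffices
print(must_stall(False, 2, 2, 5))
# lw $2,20($1) in EX, slt $1,$6,$7 in ID: no dependence
print(must_stall(True, 2, 6, 7))
```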

Page 92: 4304-6-pipe

OVERVIEW OF PIPELINED CONTROL

[Figure: PC, instruction memory, IF/ID, Control (EX, M and WB control fields carried through ID/EX, EX/MEM and MEM/WB), Registers, ALU, data memory, multiplexors, hazard detection unit and forwarding unit. Signals labelled at the hazard detection unit: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, PCWrite, IF/IDWrite, and a multiplexor with a 0 input. Signals labelled at the forwarding unit: Rs, Rt, Rd, IF/ID.RegisterRd, EX/MEM.RegisterRd, MEM/WB.RegisterRd]

Page 93: 4304-6-pipe

© C. D. Cantrell (10/2011)

PIPELINED EXECUTION WITH A STALL

• We’ll follow what happens in the instruction sequence

    [40000028] lw  $2, 20($1)
    [4000002c] and $4, $2, $5
    [40000030] or  $4, $4, $2
    [40000034] add $9, $4, $2

  ▸ The hardware inserts a stall after the lw instruction
    ◦ The stall creates the same effect as a nop
    ◦ For a stall, all control signals are deasserted in the EX stage
    ◦ Deasserted control signals are forwarded to the MEM and WB stages
    ◦ Nothing is written to memory or the register file
  ▸ After the stall, forwarding resolves the RAW hazards on register $2 in the and instruction and on register $4 in the or and add instructions
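The cost of the stall for this sequence can be counted with a short sketch. It assumes a five-stage pipeline in which forwarding covers every RAW hazard except a load followed immediately by a reader of the loaded register; the dictionary layout and the name `load_use_stalls` are illustrative.

```python
def load_use_stalls(program):
    """Count one-clock stalls caused by a load immediately followed by a user."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev["op"] == "lw" and prev["dest"] in cur["srcs"]:
            stalls += 1
    return stalls


program = [
    {"op": "lw",  "dest": 2, "srcs": (1,)},    # lw  $2, 20($1)
    {"op": "and", "dest": 4, "srcs": (2, 5)},  # and $4, $2, $5
    {"op": "or",  "dest": 4, "srcs": (4, 2)},  # or  $4, $4, $2
    {"op": "add", "dest": 9, "srcs": (4, 2)},  # add $9, $4, $2
]

stalls = load_use_stalls(program)
total_clocks = 5 + (len(program) - 1) + stalls
print(stalls, total_clocks)
```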

Page 94: 4304-6-pipe

[Figure: datapath with hazard detection unit at Clock 2. IF: and $4, $2, $5; ID: lw $2, 20($1); EX: before<1>; MEM: before<2>; WB: before<3>]

Page 95: 4304-6-pipe

[Figure: datapath with hazard detection unit at Clock 3. IF: or $4, $4, $2; ID: and $4, $2, $5; EX: lw $2, 20($1); MEM: before<1>; WB: before<2>]

Page 96: 4304-6-pipe

[Figure: datapath with hazard detection unit at Clock 4. IF: or $4, $4, $2; ID: and $4, $2, $5; EX: bubble; MEM: lw $2, . . .; WB: before<1>]

Page 97: 4304-6-pipe

[Figure: datapath with hazard detection unit at Clock 5. IF: add $9, $4, $2; ID: or $4, $4, $2; EX: and $4, $2, $5; MEM: bubble; WB: lw $2, . . .]

Page 98: 4304-6-pipe

[Figure: datapath with hazard detection unit at Clock 6. IF: after<1>; ID: add $9, $4, $2; EX: or $4, $4, $2; MEM: and $4, . . .; WB: bubble]

Page 99: 4304-6-pipe

[Figure: datapath with hazard detection unit at Clock 7. IF: after<2>; ID: after<1>; EX: add $9, $4, $2; MEM: or $4, . . .; WB: and $4, . . .]

Page 100: 4304-6-pipe

EFFECT OF A PIPELINE ON A BRANCH INSTRUCTION

[Figure: pipeline diagram (IM, Reg, DM, Reg) over clock cycles CC 1 to CC 9, in program execution order, for the instructions at addresses 40 beq $1, $3, 7; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7)]

Page 101: 4304-6-pipe

© C. D. Cantrell (10/2010)

MAKE THE BRANCH DECISION EARLY

• In the unoptimized example, taking the branch costs 3 clock periods
  ▸ The cost is higher in modern pipelines, which are much deeper
• The branch target calculation PC = (PC + 4) + offset*4 for the instruction beq Rs, Rt, offset can be done in the ID/RF stage
• There is a faster way than a sub to compare the contents of Rs and Rt
  ▸ With a small combinational logic block, take the bitwise XOR of the register contents
    ◦ This produces a word with 1 bits wherever the operands differ
  ▸ Then OR all of the bits of the resulting word
    ◦ The result is 1 if, and only if, the operands differ ⇒ branch not taken
  ▸ This can also be done in the ID/RF stage
• Result: In the MIPS ISA, there is only a one-cycle delay after a branch
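Both ID-stage computations are small enough to sketch in software. The following models 32-bit registers with Python integers; the function names are hypothetical, and the example values (a beq at address 40 with offset 7) are chosen for illustration.

```python
MASK32 = 0xFFFFFFFF


def operands_differ(rs_value, rt_value):
    """XOR the operands, then OR-reduce the 32 result bits to a single bit."""
    diff = (rs_value ^ rt_value) & MASK32
    result = 0
    for k in range(32):
        result |= (diff >> k) & 1
    return result  # 1 means the operands differ, so beq is not taken


def branch_target(pc, offset):
    """Branch target: (PC + 4) + offset * 4, wrapped to 32 bits."""
    return ((pc + 4) + offset * 4) & MASK32


print(operands_differ(5, 5), operands_differ(5, 7))
print(branch_target(40, 7))
print(hex(branch_target(0x40000028, 7)))
```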

Page 102: 4304-6-pipe

PIPELINED DATAPATH INCLUDING SUPPORT FOR BRANCHES

[Figure: PC, adder (+4), instruction memory, IF/ID with IF.Flush, Control, Registers with an equality comparator (=), sign extend, shift left 2, ID/EX, ALU, EX/MEM, data memory, MEM/WB, multiplexors, hazard detection unit and forwarding unit]

Page 103: 4304-6-pipe

© C. D. Cantrell (10/2010)

A PIPELINED BRANCH

• We’ll follow what happens in the instruction sequence

    [40000024] sub $10, $4, $8
    [40000028] beq $1, $3, 7        # PC-relative branch to offset
    [4000002c] and $12, $2, $5      # (40 + 4) + 7*4 = 72 = 0x48
    [40000030] or  $14, $2, $6
    [40000034] add $14, $4, $2
    [40000038] slt $15, $6, $7
    . . .
    [40000048] lw  $14, 50($7)

Page 104: 4304-6-pipe

[Figure: datapath with branch support at Clock 3. IF: and $12, $2, $5; ID: beq $1, $3, 7; EX: sub $10, $4, $8; MEM: before<1>; WB: before<2>. Address values shown include 44, 48 and 72]

Page 105: 4304-6-pipe

[Figure: datapath with branch support at Clock 4. IF: lw $4, 50($7); ID: bubble (nop); EX: beq $1, $3, 7; MEM: sub $10, . . .; WB: before<1>. Address values shown include 72 and 76]

Page 106: 4304-6-pipe
Page 107: 4304-6-pipe

EFFECT OF AN OPTIMIZED PIPELINE ON A BRANCH INSTRUCTION

[Figure: pipeline diagram (IM, Reg, DM, Reg) over clock cycles CC 1 to CC 9, in program execution order, for the instructions at addresses 40 beq $1, $3, 7; 44 and $12, $2, $5; 72 lw $4, 50($7)]

Page 108: 4304-6-pipe

© C. D. Cantrell (10/2011)

STATIC BRANCH PREDICTION

• Predicted behavior is based only on the branch instruction itself
  ▸ The early SPARC and MIPS architectures predicted that a branch would not be taken
  ▸ A more sophisticated static prediction scheme would base the prediction on a comparison of the target address with the current value of the PC
    ◦ If the branch goes to a later instruction (i.e., to a higher address), then it is predicted never taken
    ◦ If the branch goes to an earlier instruction, then it is predicted always taken

• Problems with static prediction
  ▸ Predicts correctly only for certain types of branches
  ▸ Example: If beq is the only available branch instruction, then it must be taken to exit from a loop
  ▸ If the beq target is a later instruction, then the branch is almost always mispredicted in the “not taken to a later instruction” approach
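The address-comparison scheme reduces to a single comparison. A minimal sketch, with a hypothetical function name and addresses chosen for illustration:

```python
def predict_taken_static(branch_pc, target_pc):
    """Predict taken only for branches that go to an earlier instruction."""
    return target_pc < branch_pc


# Forward branch: predicted not taken
print(predict_taken_static(0x40000028, 0x40000048))
# Backward branch (e.g., to the head of a loop): predicted taken
print(predict_taken_static(0x40000048, 0x40000028))
```

Because the prediction depends only on the two addresses, it never adapts to how the branch actually behaves at run time.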

Page 109: 4304-6-pipe

© C. D. Cantrell (10/2011)

DYNAMIC BRANCH PREDICTION (1)

• Base the predicted branch behavior on the history of the branch
• A common branch prediction scheme uses a branch history table
  ▸ Each entry in the memory is indexed by the lower 16 bits of the address of the branch instruction
  ▸ Each entry consists of a bit that is set if the branch was recently taken
  ▸ If the prediction is wrong (e.g., the bit is set but the branch is not taken), the bit is toggled
  ▸ Performance shortcoming: If a branch is almost always taken (or not taken), then the bit gets toggled on a wrong prediction, and the next branch is likely to be mispredicted
    ◦ Example: A loop that is executed 10 times, using a branch to the head
    ◦ The branch is mispredicted at the beginning and end (80% accuracy)
    ◦ Here, branch frequency (90% taken) ≠ predicted frequency (80%)
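The 80% figure can be reproduced by simulating a single history bit. The sketch assumes one table entry, a loop branch that is taken 9 times and then falls through, and the whole loop being run repeatedly; `one_bit_accuracy` is a hypothetical name.

```python
def one_bit_accuracy(outcomes, initial_bit=False):
    """Fraction of correct predictions from a single history bit."""
    bit = initial_bit  # True means "predict taken"
    correct = 0
    for taken in outcomes:
        if bit == taken:
            correct += 1
        else:
            bit = not bit  # toggle on a wrong prediction
    return correct / len(outcomes)


loop = [True] * 9 + [False]  # 10 executions of the loop branch
outcomes = loop * 100        # the loop itself is run 100 times
print(one_bit_accuracy(outcomes))
```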

Page 110: 4304-6-pipe

© C. D. Cantrell (10/2011)

DYNAMIC BRANCH PREDICTION (2)

• A 2-bit branch prediction scheme uses a branch history table in which each entry contains 2 bits to indicate the state of a branch prediction FSM (next slide)
  ▸ This scheme mispredicts only once if a branch almost always goes one way

Page 111: 4304-6-pipe

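[Figure: A branch-target buffer. The PC of the instruction to fetch is looked up in a table (its height is the number of entries in the branch-target buffer) whose entries hold a predicted PC and whether the branch is predicted taken or untaken. If the comparison (=) succeeds: Yes, then the instruction is a branch and the predicted PC should be used as the next PC. No: the instruction is not predicted to be a branch; proceed normally]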

Page 112: 4304-6-pipe

STATES IN A 2-BIT BRANCH PREDICTION SCHEME

[Figure: finite-state machine with four states, two labelled "Predict taken" and two labelled "Predict not taken"; each state has one outgoing transition labelled "Taken" and one labelled "Not taken"]
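A 2-bit scheme can be simulated on the same kind of loop branch (taken 9 times, then not taken, repeated). The sketch below uses a saturating counter, which is one common realization and an assumption here: the exact transitions between the two middle states in the figure may differ, although that does not change the result for this branch pattern.

```python
def two_bit_accuracy(outcomes, initial_state=3):
    """Fraction of correct predictions from one 2-bit saturating counter."""
    state = initial_state  # 0 and 1 predict not taken; 2 and 3 predict taken
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


loop = [True] * 9 + [False]
outcomes = loop * 100
print(two_bit_accuracy(outcomes))
```

Only the loop exit is mispredicted: a single not-taken outcome moves the counter one step, so the first branch of the next run is still predicted taken.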

Page 113: 4304-6-pipe

After John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition

DYNAMIC BRANCH PREDICTION (3)

• Branch prediction accuracy for a 4096-entry, 2-bit prediction buffer

Page 114: 4304-6-pipe

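SCHEDULING THE BRANCH DELAY SLOT

[Figure: three ways of filling the branch delay slot, each shown before and after ("Becomes")]

  a. From before
       Before:   add $s1, $s2, $s3;  if $s2 = 0 then;  (delay slot)
       Becomes:  if $s2 = 0 then;  add $s1, $s2, $s3 (in the delay slot)

  b. From target
       Before:   sub $t4, $t5, $t6 (at the branch target);  add $s1, $s2, $s3;  if $s1 = 0 then;  (delay slot)
       Becomes:  add $s1, $s2, $s3;  if $s1 = 0 then;  sub $t4, $t5, $t6 (in the delay slot)

  c. From fall through
       Before:   add $s1, $s2, $s3;  if $s1 = 0 then;  (delay slot);  sub $t4, $t5, $t6
       Becomes:  add $s1, $s2, $s3;  if $s1 = 0 then;  sub $t4, $t5, $t6 (in the delay slot)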

Page 115: 4304-6-pipe

© C. D. Cantrell (10/2010)

EXCEPTIONS (1)

• In the MIPS ISA, an exception is a synchronous (clocked) event that causes a process to stop executing
  ▸ System call (explicit instruction, e.g. for I/O)
    ◦ Stopping execution permits another process to execute while the process that made the syscall waits for I/O
  ▸ Exception associated with execution of the current instruction
    ◦ Bus error (I/O timeout, load/store kernel physical address)
    ◦ Protection exception
    ◦ Attempt to execute a reserved instruction
    ◦ Cache/TLB miss
    ◦ Floating-point arithmetic exception

• An interrupt is an asynchronous event, external to the current instruction, that stops the execution of the current process
  ▸ Example: Hardware controller signals end of I/O

Page 116: 4304-6-pipe

© C. D. Cantrell (04/1999)

EXCEPTIONS (2)

• In the R2000 ISA, exceptions are handled by coprocessor 0
• How the R2000 processor and the UNIX kernel perform exception handling:

  1. Processor exits user mode & is forced into kernel mode.
  2. The address of an exception vector (exception handling program) is loaded into the program counter (PC).
     ▸ Reset exception (reboot): the processor transfers control to the Reset exception vector at address 0xbfc00000
     ▸ UTLB Miss: Control is transferred to the exception vector pointed to by the contents of address 0x80000000
     ▸ All other exceptions are handled by the kernel
       ◦ The general exception handler pointed to by the contents of address 0x80000080 takes control, gets the cause from the Cause register and transfers control to the correct exception handler
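The choice of vector address in step 2 can be summarized as a three-way selection. The addresses are the ones listed above; the string labels for the exception kinds and the function name are hypothetical, introduced only for this sketch.

```python
RESET_VECTOR = 0xBFC00000      # Reset exception (reboot)
UTLB_MISS_VECTOR = 0x80000000  # UTLB miss
GENERAL_VECTOR = 0x80000080    # every other exception


def exception_vector(kind):
    """Address associated with the given kind of exception."""
    if kind == "reset":
        return RESET_VECTOR
    if kind == "utlb_miss":
        return UTLB_MISS_VECTOR
    return GENERAL_VECTOR


for kind in ("reset", "utlb_miss", "syscall", "bus_error"):
    print(f"{kind:10s} {exception_vector(kind):#010x}")
```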

Page 117: 4304-6-pipe

MIPS R2000 CPU AND COPROCESSORS

[Figure: block diagram. CPU: registers $0 to $31, arithmetic unit, multiply/divide unit with Lo and Hi registers. Coprocessor 1 (FPU): registers $0 to $31, arithmetic unit. Coprocessor 0 (traps and memory): registers BadVAddr, Status, Cause, EPC. Also labelled: Memory, PC]

Page 118: 4304-6-pipe

[Figure: MIPS CP0 and Exception Handling Registers. Shows the TLB (Translation Lookaside Buffer) with its "safe" entries and the EntryHi, EntryLo, Index, Random, Context, BadVAddr, EPC, PRId, Status and Cause registers, divided into those used with virtual memory and those used for exception processing.]

Page 119: 4304-6-pipe

© C. D. Cantrell (10/2010)

MIPS R2000 COPROCESSOR 0

• BadVAddr register (coprocessor 0, register 8)

! Memory address at which an addressing exception occurred

• Status register (coprocessor 0, register 12)

! Interrupt mask and interrupt enable bits

! Kernel/user bits for old, previous and current processes

• Cause register (coprocessor 0, register 13)

! Holds a code for the cause of an exception

• Exception program counter (EPC) (coprocessor 0, register 14)

! Holds address of instruction that caused an exception
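
As a hedged illustration of how a handler might read the Cause register listed above, the C sketch below extracts the exception code. The field position (starting at bit 2) matches the shift-right-by-2 used in the SPIM trap handler later in these slides; the 5-bit mask and the names exc_code, EXC_CODE_SHIFT and EXC_CODE_MASK are assumptions made for this example, and the field width may differ between MIPS implementations.

```c
#include <stdint.h>

/* Assumed layout: the exception code starts at bit 2 of Cause.
   The 5-bit width is an assumption for this sketch. */
#define EXC_CODE_SHIFT 2
#define EXC_CODE_MASK  0x1fu

/* Return the exception code held in a Cause register value. */
static unsigned exc_code(uint32_t cause)
{
    return (unsigned)((cause >> EXC_CODE_SHIFT) & EXC_CODE_MASK);
}
```

With this layout, a Cause value whose code field holds 12 would be reported as exception 12, which the SPIM handler's message table labels as arithmetic overflow.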

Page 120: 4304-6-pipe

© C. D. Cantrell (09/2012)

MIPS KERNEL CONVENTIONS (1)

• The MIPS kernel recognizes 4 memory segments: kuseg, kseg0, kseg1 and kseg2

! Addresses between 0x00400000 and 0x7fffffff belong to kuseg

◦ User address space

! Virtual addresses between 0x80000000 and 0x9fffffff belong to kseg0

◦ Addresses between 0x80000000 and 0x8fffffff are used for kernel text (.ktext; executable instructions)
◦ Addresses between 0x90000000 and 0x9fffffff are used for kernel data (.kdata)
◦ Addresses in this range are translated to physical memory by clearing the high bit and mapping contiguously into the low 512 MB of memory

Page 121: 4304-6-pipe

© C. D. Cantrell (09/2012)

MIPS KERNEL CONVENTIONS (2)

• The MIPS kernel recognizes 4 memory segments: kuseg, kseg0, kseg1 and kseg2

! Addresses between 0xa0000000 and 0xbfffffff belong to kseg1

◦ Typically used for I/O registers, memory-resident ROM code and disk buffers
◦ Direct-mapped, uncached

! Addresses above 0xbfffffff belong to kseg2

◦ Process structures (remapped on context switches)
◦ User page table entries
◦ Caching and remapping via paging, not via swapping entire processes
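
The segment boundaries above can be sketched as a small C classifier. This is only an illustration: the names segment_t, mips_segment and unmapped_to_physical are hypothetical, every address below 0x80000000 is treated as kuseg (following the memory map figure), and the translation covers only the two unmapped segments, which both land in the low 512 MB of physical memory.

```c
#include <stdint.h>

typedef enum { KUSEG, KSEG0, KSEG1, KSEG2 } segment_t;

/* Classify a 32-bit virtual address by the segment boundaries on the slides. */
static segment_t mips_segment(uint32_t vaddr)
{
    if (vaddr < 0x80000000u) return KUSEG;
    if (vaddr < 0xa0000000u) return KSEG0;
    if (vaddr < 0xc0000000u) return KSEG1;
    return KSEG2;
}

/* kseg0 and kseg1 are unmapped: the physical address is the low 29 bits.
   Returns 1 and stores the result, or 0 if the address is in a mapped
   segment and would need a TLB lookup (not modelled here). */
static int unmapped_to_physical(uint32_t vaddr, uint32_t *paddr)
{
    segment_t s = mips_segment(vaddr);
    if (s == KSEG0 || s == KSEG1) {
        *paddr = vaddr & 0x1fffffffu;
        return 1;
    }
    return 0;
}
```

For example, the Reset vector 0xbfc00000 falls in kseg1 and would translate to physical address 0x1fc00000 under this sketch.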

Page 122: 4304-6-pipe

[Figure: MIPS R2000 Memory Map (virtual to physical)
kuseg   0x00000000 to 0x7fffffff   User     Mapped     Cacheable
kseg0   0x80000000 to 0x9fffffff   Kernel   Unmapped   Cached
kseg1   0xa0000000 to 0xbfffffff   Kernel   Unmapped   Uncached
kseg2   0xc0000000 to 0xffffffff   Kernel   Mapped     Cacheable
On the physical side the figure marks 0x00000000, 0x1fffffff and 0x20000000.]

Page 123: 4304-6-pipe

# SPIM TRAP HANDLER DATA
        .kdata
__m1_:  .asciiz " Exception "
__m2_:  .asciiz " caught by trap handler.\n"
__m3_:  .asciiz "Continuing. . .\n"
__m4_:  .asciiz "Halting.\n"
__e0_:  .asciiz " [Interrupt]"
__e1_:  .asciiz " [TLB modification !BUG!]"
__e2_:  .asciiz " [TLB miss !BUG!]"
__e3_:  .asciiz " [TLB miss !BUG!]"
__e4_:  .asciiz " [Unaligned address in inst/data fetch]"
__e5_:  .asciiz " [Unaligned address in store]"
__e6_:  .asciiz " [Bad address in text read]"
__e7_:  .asciiz " [Bad address in data/stack read]"
__e8_:  .asciiz " [Error in syscall]"
__e9_:  .asciiz " [Breakpoint]"
__e10_: .asciiz " [Reserved instruction]"
__e11_: .asciiz " [Syscall exception !BUG!]"
__e12_: .asciiz " [Arithmetic overflow]"
__e13_: .asciiz " [Inexact floating point result]"
__e14_: .asciiz " [Invalid floating point result]"
__e15_: .asciiz " [Divide by 0]"
__e16_: .asciiz " [Floating point overflow]"
__e17_: .asciiz " [Floating point underflow]"
__excp: .word __e0_,__e1_,__e2_,__e3_,__e4_,__e5_,__e6_,__e7_,__e8_,__e9_
        .word __e10_,__e11_,__e12_,__e13_,__e14_,__e15_,__e16_,__e17_
s1:     .word 0
s2:     .word 0

Page 124: 4304-6-pipe

# SPIM TRAP HANDLER CODE
        .ktext
        .space 0x80             # Put trap handler at 0x80000080
        sw $v0 s1               # Not re-entrant
        sw $a0 s2               # Don't need to save k0/k1
        mfc0 $k0 $13            # Cause
        and $k0 $k0 0xff        # Use just ExcCode field
        mfc0 $k1 $14            # EPC
        li $v0 4                # Print " Exception "
        la $a0 __m1_
        syscall
        li $v0 1                # Print exception number
        srl $a0 $k0 2
        syscall
        li $v0 4                # Print type of exception
        lw $a0 __excp($k0)
        syscall
        li $v0 4                # Print " caught by trap handler.\n"
        la $a0 __m2_
        syscall
        srl $a0 $k0 2
        beq $a0 12 ret          # continue on overflow
        beq $a0 13 ret          # continue on inexact fp result
        beq $a0 14 ret          # continue on invalid fp result
        beq $a0 16 ret          # continue on fp overflow
        beq $a0 17 ret          # continue on fp underflow
        li $v0 4                # Print "Halting.\n"
        la $a0 __m4_
        syscall
        li $v0 10               # Exit on all other exceptions
        syscall                 # syscall 10 (exit)

ret:    li $v0 4                # Print "Continuing. . .\n"
        la $a0 __m3_
        syscall

        lw $v0 s1
        lw $a0 s2
        addiu $k1 $k1 4         # Return to next instruction
        rfe                     # Return from exception handler
        jr $k1

        .text
        .globl __start

Page 125: 4304-6-pipe

© C. D. Cantrell (10/2011)

EXCEPTIONS IN A PIPELINED PROCESSOR

• Five instructions are active in any given clock period

! Multiple exceptions can occur simultaneously

! If execution is not stopped soon enough, the value in the register that helped cause the exception may be overwritten in the WB stage

! To flush the instructions that follow the instruction that caused the exception, we add two new signals, ID.Flush and EX.Flush

! ID.Flush is ORed with the stall signal from the hazard detection unit to flush an instruction during its ID stage

! EX.Flush zeroes the control signals of an instruction in its EX stage, and a new input to the PC multiplexor sends 0x80000080 to the PC

Page 126: 4304-6-pipe

[Figure: DATAPATH WITH CONTROLS TO HANDLE EXCEPTIONS. The pipelined datapath with hazard detection and forwarding units, extended with the IF.Flush, ID.Flush and EX.Flush signals (each selecting 0 through a multiplexor), the Cause and ExceptPC registers, and a PC multiplexor input of 0x80000080.]

Page 127: 4304-6-pipe

© C. D. Cantrell (05/1999)

A PIPELINED EXCEPTION

• We’ll follow what happens in the instruction sequence

[40000040] sub $11, $2, $4
[40000044] and $12, $2, $5
[40000048] or $13, $2, $6
[4000004c] add $1, $2, $1 # overflow exception occurs here
[40000050] slt $15, $6, $7
[40000054] lw $16, 50($7)
. . .

given that the instructions to execute when an exception occurs are

[80000080] lui $1, -28672 # -28672 (base 10) = 0x9000
[80000084] sw $2, 592($1) # sw $v0 s1
. . .

Page 128: 4304-6-pipe

© C. D. Cantrell (10/2010)

A PIPELINED EXCEPTION

• The value of the PC when an instruction is issued is part of the instruction’s control state, and must be passed along in pipeline registers

! Otherwise, you’d never know exactly where you were clobbered

! Some architectures have imprecise exceptions (e.g., the IBM 360/91)

◦ In these cases, it’s usually the address that is in the PC when the exception occurs that is reported, not the address of the instruction that actually caused the exception
◦ This is especially annoying when a branch is taken immediately after the exception-causing instruction!

• In the example, an integer overflow occurs, asserting the Overflow signal

! The Overflow signal must be routed to the Control block, which then asserts the EX.Flush, ID.Flush and IF.Flush signals, and asserts a control signal that causes the PC Source multiplexor to load 0x80000080 into the PC
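
The effect of the three flush signals on the example can be sketched in C. This is a simplified model rather than the datapath itself: each stage is represented only by the address of the instruction it holds, a bubble is encoded as address 0, and the names pipe_t, clock_with_overflow, BUBBLE and EXCEPTION_VECTOR are hypothetical. Following the coprocessor 0 slide, EPC is taken to hold the address of the instruction that caused the exception.

```c
#include <stdint.h>

enum { IF, ID, EX, MEM, WB, NSTAGES };

#define EXCEPTION_VECTOR 0x80000080u
#define BUBBLE 0u   /* hypothetical encoding: address 0 marks a flushed slot */

typedef struct {
    uint32_t stage_pc[NSTAGES];  /* address of the instruction in each stage */
    uint32_t epc;                /* exception program counter */
} pipe_t;

/* Advance one clock in which the instruction in EX raises Overflow. */
static void clock_with_overflow(pipe_t *p)
{
    p->epc = p->stage_pc[EX];            /* remember the offending instruction */
    p->stage_pc[WB]  = p->stage_pc[MEM]; /* older instruction completes normally */
    p->stage_pc[MEM] = BUBBLE;           /* EX.Flush: offender becomes a bubble */
    p->stage_pc[EX]  = BUBBLE;           /* ID.Flush */
    p->stage_pc[ID]  = BUBBLE;           /* IF.Flush */
    p->stage_pc[IF]  = EXCEPTION_VECTOR; /* handler fetch begins */
}
```

Starting from the clock-5 state of the example (lw in IF, slt in ID, add in EX, or in MEM, and in WB), one call leaves the handler's first instruction in IF, three bubbles behind it, and the or instruction in WB, which is the clock-6 state shown in the figures.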

Page 129: 4304-6-pipe

[Figure: Pipelined datapath in clock 5. IF: lw $16, 50($7); ID: slt $15, $6, $7; EX: add $1, $2, $1 (the ALU asserts Overflow, which is routed to Cause and Control); MEM: or $13, . . .; WB: and $12, . . . The value 0x80000080 is selected as the next PC.]

Page 130: 4304-6-pipe

[Figure: Pipelined datapath in clock 6. IF: lui $1, -28672 (fetched from 0x80000080; the PC now holds 0x80000084); ID, EX and MEM: bubbles (nops); WB: or $13, . . .]

Page 131: 4304-6-pipe

[Figure: PIPELINED DATAPATH WITH CONTROL. The five-stage datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers, the hazard detection and forwarding units, the flush signals, and the control signals RegWrite, ALUOp, ALUSrc, RegDst, MemWrite, MemRead, MemtoReg and Branch.]

Page 132: 4304-6-pipe

© C. D. Cantrell (10/2010)

PIPELINE ENHANCEMENTS

• Design a deeper pipeline (many stages)

! The Pentium 4 “Willamette” pipeline had 20 stages; the “Prescott”, 31

◦ This permitted high clock frequencies ⇒ high power consumption

! The Core 2 Duo pipeline has 14 stages

• Issue more than one instruction per clock period (“superscalar” architecture)

! Theoretically, this divides the CPI by the number issued per clock

! We will study the modifications needed for issuing 2 instructions/clock

◦ Double the number of ALUs and read/write ports on the register file

• Schedule the pipeline dynamically

! Find useful instructions to schedule during a stall

! Major pipeline units:

◦ Instruction fetch/issue unit
◦ Execution unit
◦ Commit unit

Page 133: 4304-6-pipe

© C. D. Cantrell (10/2010)

DYNAMIC PIPELINE SCHEDULING (1)

• Major limitation of the statically scheduled pipeline that we have studied: In-order instruction issue and execution

! This permits head-of-line blocking of instructions that could execute

• One approach is to allow in-order issue and out-of-order execution

! For in-order issue, the IF/ID unit must check for structural hazards

! For out-of-order execution:

◦ Need multiple functional units
¶ Execution occurs whenever there are no data dependences or hazards

◦ A new kind of unit, a scoreboard, must check for data hazards

! Out-of-order completion means that there are imprecise exceptions

! This is the design approach used in the CDC 6600

Page 134: 4304-6-pipe

© C. D. Cantrell (10/2010)

BOOKKEEPING FOR DYNAMIC SCHEDULING

• The scoreboarding technique was introduced in the CDC 6600 (1963)

! Goal: Maintain a low CPI by executing as early as possible

! The scoreboard maintains several status tables

◦ Status of each instruction: Issued, Operands Read, Execution Completed, Results Written

◦ Once an instruction has issued, the functional unit table keeps a record of the operands
¶ Functional unit status: Busy, Operation Underway, Destination Register Name, Source Register Names, Units Producing Source Register Operands, Flags (indicating when the source register operands are ready)

◦ Register result status
¶ Indicates which functional unit will write to the register
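
The functional unit and register result tables can be sketched as C structures. The field names mirror the list above, but everything else is an assumption for illustration: the type and function names, the table sizes, and the issue rule (refuse to issue when the unit is busy or when another active instruction already targets the same destination register), which is the usual textbook description of scoreboard issue rather than something stated on the slide.

```c
#include <stdbool.h>

#define NUM_REGS  32
#define NUM_UNITS 4
#define NO_UNIT   (-1)

typedef struct {
    bool busy;
    int  dest;            /* destination register name */
    int  src1, src2;      /* source register names */
    int  prod1, prod2;    /* units producing the sources, NO_UNIT if none */
    bool ready1, ready2;  /* flags: source operands ready */
} fu_status_t;

typedef struct {
    fu_status_t unit[NUM_UNITS];   /* functional unit status */
    int result_unit[NUM_REGS];     /* register result status */
} scoreboard_t;

static void scoreboard_init(scoreboard_t *sb)
{
    for (int u = 0; u < NUM_UNITS; u++) sb->unit[u].busy = false;
    for (int r = 0; r < NUM_REGS; r++) sb->result_unit[r] = NO_UNIT;
}

/* Try to issue an instruction to unit u; returns false if it must wait. */
static bool scoreboard_issue(scoreboard_t *sb, int u, int dest, int src1, int src2)
{
    if (sb->unit[u].busy) return false;                 /* unit in use */
    if (sb->result_unit[dest] != NO_UNIT) return false; /* pending write to dest */
    fu_status_t *f = &sb->unit[u];
    f->busy = true;
    f->dest = dest;
    f->src1 = src1;
    f->src2 = src2;
    f->prod1 = sb->result_unit[src1];  /* record producers before claiming dest */
    f->prod2 = sb->result_unit[src2];
    f->ready1 = (f->prod1 == NO_UNIT);
    f->ready2 = (f->prod2 == NO_UNIT);
    sb->result_unit[dest] = u;
    return true;
}
```

Recording the producers before claiming the destination keeps an instruction that reads and writes the same register from appearing to wait on itself.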

Page 135: 4304-6-pipe

© C. D. Cantrell (10/2010)

PIPELINE DEPTH vs. SPEEDUP

• The graph on the following page shows the speedup achieved by increasing the number of stages, assuming:

! Constant clock frequency

! A single instruction queue with in-order issue and completion

• The speedup achieved under these circumstances is much less than the number of stages

! This is a result of data and control hazards ⇒ pipeline stalls

• In reality, increasing the number of stages so that there are fewer levels of logic per stage makes it possible to increase the clock frequency

! This shifts the curve toward the right (higher number of stages)

• To achieve a much higher speedup than is shown in the graph, designers resort to multiple issue and dynamic scheduling

Page 136: 4304-6-pipe

[Figure: PIPELINE DEPTH vs. SPEEDUP. Relative performance (vertical axis, 0.0 to 3.0) plotted against pipeline depth (horizontal axis: 1, 2, 4, 8, 16).]

Page 137: 4304-6-pipe

© C. D. Cantrell (10/2011)

SUPERSCALAR EXAMPLE

• We’ll follow what happens in the instruction sequence

[40000040] lw $8, 0($17) # $17=pointer
[40000044] addu $8, $8, $18 # $8=array element
[40000048] sw $8, 0($17)
[4000004c] addi $17, $17, -4 # decrement pointer
[40000050] bne $17, $19, -5

assuming a static 2-issue MIPS pipeline

! Note that a comparison with $0 would be invalid, since a null pointer is not a valid address (this corrects the code in the textbook)

• The new hardware for the example is a second pipeline for data transfer (load and store) instructions

! The original pipeline is used for ALU and branch instructions

Page 138: 4304-6-pipe

[Figure: SUPERSCALAR DATAPATH. The two-issue datapath, with a second ALU used for address calculation (ADDRESS CALC.) and a second sign-extend unit alongside the original ALU.]

Page 139: 4304-6-pipe


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition

SUPERSCALAR EXAMPLE: SCHEDULING

ALU or branch instruction     Data transfer instruction     Clock cycle
Loop: (nop)                   lw $t0, 0($s1)                1
      addi $s1,$s1,-4         (nop)                         2
      addu $t0,$t0,$s2        (nop)                         3
      bne $s1,$s3,Loop        sw $t0, 4($s1)                4

• The resulting CPI is 0.8 instead of the theoretical value, 0.5

! Speedup = 1.25 instead of 2.0

• The problem is that we are taking 4 clocks to execute 5 instructions

! Two of these instructions (addi and bne) are loop overhead

• In some memory architectures there may be a hardware conflict between the store in one loop instance and the load in the next instance
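
The CPI and speedup quoted on this slide follow from the schedule: 5 instructions complete in 4 clock cycles. As a worked check, taking a single-issue pipeline with an ideal CPI of 1 as the baseline (an assumption made explicit here):

```latex
\[
\mathrm{CPI} = \frac{4\ \text{clock cycles}}{5\ \text{instructions}} = 0.8,
\qquad
\text{Speedup} = \frac{\mathrm{CPI}_{\text{single issue}}}{\mathrm{CPI}_{\text{2-issue}}}
               = \frac{1}{0.8} = 1.25
\]
```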

Page 140: 4304-6-pipe

© C. D. Cantrell (10/2010)

LOOP UNROLLING

• The goal of loop unrolling is to minimize the performance impact of loop overhead by executing several instances of the loop for one set of overhead instructions

! If a loop is unrolled in hardware, then the original register targets of data-transfer and computational instructions must be renamed

! Register renaming is especially important in executing x86 instructions, because there are very few general-purpose x86 architectural registers

◦ The architectural registers can be renamed into a larger set of physical registers

! Loops can also be unrolled in software

◦ Compiler unrolling (often performed by optimizing compilers)
◦ Unrolling in a higher-level language

Page 141: 4304-6-pipe


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition

SUPERSCALAR EXAMPLE: LOOP UNROLLING

ALU or branch instruction     Data transfer instruction     Clock cycle
Loop: addi $s1,$s1,-16        lw $t0, 0($s1)                1
      (nop)                   lw $t1,12($s1)                2
      addu $t0,$t0,$s2        lw $t2, 8($s1)                3
      addu $t1,$t1,$s2        lw $t3, 4($s1)                4
      addu $t2,$t2,$s2        sw $t0, 16($s1)               5
      addu $t3,$t3,$s2        sw $t1,12($s1)                6
      (nop)                   sw $t2, 8($s1)                7
      bne $s1,$s3,Loop        sw $t3, 4($s1)                8

• The loop in this example is unrolled to a depth of 4

• The resulting CPI is 0.57, much closer to the theoretical value of 0.5

! Speedup = 1.75, much closer to 2.0
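
As a worked check of these figures: the unrolled schedule issues 14 instructions (1 addi, 4 lw, 4 addu, 4 sw and 1 bne) in 8 clock cycles. The baseline of a single-issue pipeline with an ideal CPI of 1 is an assumption made explicit here:

```latex
\[
\mathrm{CPI} = \frac{8\ \text{clock cycles}}{14\ \text{instructions}} \approx 0.57,
\qquad
\text{Speedup} = \frac{1}{\mathrm{CPI}} = \frac{14}{8} = 1.75
\]
```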

Page 142: 4304-6-pipe

© C. D. Cantrell (10/2010)

LOOP UNROLLING IN SOFTWARE

• Computation of one component of a matrix-vector product y = Ax in C:

for (j=0; j<n; j++) {
    y[i] = y[i] + a[i][j] * x[j];
}

! The vector y must be initialized to 0 in a previous loop

• The same loop, unrolled to a depth of 4:

for (j=3; j<n; j+=4) {
    y[i] = (((y[i] + a[i][j-3]*x[j-3]) + a[i][j-2]*x[j-2])
            + a[i][j-1]*x[j-1]) + a[i][j]*x[j];
}

! The programmer has to ensure that n is a multiple of 4
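
One common way to lift the multiple-of-4 restriction is a short cleanup loop after the unrolled body. The sketch below is an illustration under assumptions, not the slide's code: the function name row_dot_unrolled is hypothetical, the matrix row is passed as a plain array, and the sum is accumulated in a local variable instead of y[i].

```c
#include <stddef.h>

/* Dot product of one matrix row with x, unrolled to a depth of 4.
   A cleanup loop handles the last n % 4 elements. */
static double row_dot_unrolled(const double *row, const double *x, size_t n)
{
    double sum = 0.0;
    size_t j = 0;

    /* Unrolled body: four multiply-adds per loop-overhead sequence. */
    for (; j + 4 <= n; j += 4) {
        sum = (((sum + row[j] * x[j]) + row[j + 1] * x[j + 1])
               + row[j + 2] * x[j + 2]) + row[j + 3] * x[j + 3];
    }

    /* Cleanup: remaining 0 to 3 elements. */
    for (; j < n; j++) {
        sum += row[j] * x[j];
    }
    return sum;
}
```

A caller would then write y[i] = row_dot_unrolled(a[i], x, n) for each row i, which also removes the need to zero y in a separate loop.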

Page 143: 4304-6-pipe

© C. D. Cantrell (10/2010)

DYNAMIC PIPELINE SCHEDULING (2)

• When several instructions are issued in a clock period, it is possible to reorder the executions to minimize pipeline stalls

• There are multiple pipelines, divided into three major types of unit:

! Instruction fetch/instruction decode unit

! Reservation stations

◦ These are buffers that hold the instructions’ operands and control state

! Integer and floating-point out-of-order execution units

◦ Execution occurs whenever there are no data dependences or hazards

! Commit unit

◦ Maintains a reorder buffer
◦ Buffers the results of execution until it is safe to write the results
◦ The reorder buffer can also provide operands, like the forwarding units in a statically scheduled pipeline

Page 144: 4304-6-pipe


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition

A DYNAMICALLY SCHEDULED PIPELINE

[Figure: The instruction fetch and decode unit issues in order to reservation stations, which feed the functional units (integer, integer, …, floating point, load/store) that execute out of order; the commit unit commits in order.]

Page 145: 4304-6-pipe

© C. D. Cantrell (10/2010)

TOMASULO’S ALGORITHM (1)

• Focuses on floating-point execution

! Originally designed for the IBM 360/91, with long memory-access and floating-point execution times

! Can support overlapping execution of multiple loop instances

• Tomasulo addressed limitations of the scoreboard approach

! Hazard detection and control of execution are distributed to the reservation stations

! Results are forwarded directly to the functional units instead of going through registers

◦ Results are broadcast on a common data bus

Page 146: 4304-6-pipe

[Figure: IMPLEMENTATION OF TOMASULO’S ALGORITHM. Instructions from the instruction unit enter the floating-point operation queue; the load buffers (from memory), FP registers and store buffers (to memory) connect through the operation bus and operand buses to the reservation stations of the FP adders and FP multipliers; results return on the common data bus (CDB).]

Page 147: 4304-6-pipe

© C. D. Cantrell (10/2010)

TOMASULO’S ALGORITHM (2)

• Control fields in each reservation station:

! Operation

! The IDs of the reservation stations that will produce the operands

◦ The reservation stations can rename registers
◦ This enables overlapping different loop iterations

! The values of the operands

◦ Note that values are available sooner than if the functional units had to contend for access to write to a register

! A “busy” flag

• Control fields for each register and store buffer:

! The number of the functional unit that will produce the value to be written

! A “busy” flag
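
The reservation station fields listed above map naturally onto a C structure, and the common data bus broadcast becomes a loop over the stations. This is a minimal sketch under assumptions: the names station_t, cdb_broadcast and station_ready are hypothetical, producer IDs (tags) are taken to be nonzero integers, and tag 0 is reserved to mean that the operand value is already present.

```c
#include <stdbool.h>

#define NUM_STATIONS 5
#define NO_TAG 0   /* assumed convention: 0 means the value is already available */

typedef struct {
    bool   busy;     /* the "busy" flag */
    char   op;       /* operation, e.g. '+' or '*' */
    int    qj, qk;   /* IDs of the stations that will produce the operands */
    double vj, vk;   /* operand values, valid once the matching ID is NO_TAG */
} station_t;

/* A result appears on the common data bus: every waiting station that
   named this producer (tag must be nonzero) captures the value. */
static void cdb_broadcast(station_t rs[], int n, int tag, double value)
{
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = NO_TAG; }
        if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = NO_TAG; }
    }
}

/* A station may begin execution once both operands have arrived. */
static bool station_ready(const station_t *s)
{
    return s->busy && s->qj == NO_TAG && s->qk == NO_TAG;
}
```

Because waiting stations name a producer rather than a register, two loop iterations that write the same architectural register do not interfere, which is the renaming effect noted above.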

Page 148: 4304-6-pipe


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 3rd Edition

PIPELINING IMPROVES THROUGHPUT

[Figure: Clock rate (slower to faster) plotted against instruction throughput (instructions per clock cycle, or 1/CPI; slower to faster) for the single-cycle, multicycle, pipelined, deeply pipelined, multiple-issue pipelined, and multiple-issue deep-pipeline datapaths.]

Page 149: 4304-6-pipe


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 3rd Edition

PIPELINING DOES NOT IMPROVE LATENCY

[Figure: Hardware (shared to specialized) plotted against clock cycles of latency for an instruction (1 to several) for the single-cycle, multicycle, pipelined, deeply pipelined, multiple-issue pipelined, and multiple-issue deep-pipeline datapaths.]

Page 150: 4304-6-pipe


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition

MICROPROCESSOR PIPELINES

Microprocessor                Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-Order/Speculation  Cores/Chip  Power
Intel 486                     1989  25 MHz      5                1            No                        1           5 W
Intel Pentium                 1993  66 MHz      5                2            No                        1           10 W
Intel Pentium Pro             1997  200 MHz     10               3            Yes                       1           29 W
Intel Pentium 4 Willamette    2001  2000 MHz    20               3            Yes                       1           75 W
Intel Pentium 4 Prescott      2004  3600 MHz    31               3            Yes                       1           103 W
Intel Core                    2006  2930 MHz    14               4            Yes                       2           75 W
Sun UltraSPARC III            2003  1950 MHz    14               4            No                        1           90 W
Sun UltraSPARC T1 (Niagara)   2005  1200 MHz    6                1            No                        8           70 W

Page 151: 4304-6-pipe

© Intel

PENTIUM 4 “WILLAMETTE” CHIP LAYOUT

[Figure: Chip layout showing the 400 MHz system bus, advanced transfer cache, hyperpipeline (20 stages), enhanced floating point & multimedia unit, execution trace cache, rapid execution engine, advanced dynamic execution (A.D.E.) blocks and data cache.]

Page 152: 4304-6-pipe

© C. D. Cantrell (09/2010)

PENTIUM 4 FUNCTIONAL UNITS

• “400 MHz” system bus

! 100 MHz clock, 4 data transfers per clock period

• Advanced transfer cache (L2 cache, 256 kB, instructions + data)

• Execution trace cache

! L1 instruction cache; stores decoded CISC instructions

• Hyperpipelined unit

! Used for uniform-length microinstructions (“micro-ops,” in Intel-speak)

• Enhanced floating-point & multimedia unit

• Rapid execution engine

! Parallel, partly double-clocked execution of microinstructions

• Advanced dynamic execution

! Deep, out-of-order speculative execution & branch prediction

Page 153: 4304-6-pipe


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition

AMD OPTERON X4 MICROARCHITECTURE

[Figure: Block diagram showing the instruction cache, instruction prefetch and decode, branch prediction, RISC-operation queue, dispatch and register renaming, register file, integer and floating-point operation queue, load/store queue, functional units (integer ALU with multiplier, two further integer ALUs, floating-point adder/SSE, floating-point multiplier/SSE, floating-point misc), data cache and commit unit.]

Page 154: 4304-6-pipe


After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition

AMD OPTERON X4 PIPELINE

[Figure: Pipeline stages, each labeled with its number of clock cycles: instruction fetch; decode and translate; reorder buffer allocation + register renaming; scheduling + dispatch unit; execution; data cache/commit. The RISC-operation queue and the reorder buffer sit between stages.]

Page 155: 4304-6-pipe

[Figure: Pipelined datapath with control signals, annotated with the instruction and control values in each stage (IF, ID, EX, MEM, WB).]

