A Pipelined Processor

Taken from Digital Design and Computer Architecture by Harris and Harris

Introduction

• Microarchitecture: how to implement an architecture in hardware

• Processor:
– Datapath: functional blocks
– Control: control signals

Levels of abstraction (highest to lowest), with examples:
• Application Software: programs
• Operating Systems: device drivers
• Architecture: instructions, registers
• Micro-architecture: datapaths, controllers
• Logic: adders, memories
• Digital Circuits: AND gates, NOT gates
• Analog Circuits: amplifiers, filters
• Devices: transistors, diodes
• Physics: electrons

Microarchitecture

• Multiple implementations for a single architecture:
– Single-cycle
  • Each instruction executes in a single cycle
– Multicycle
  • Each instruction is broken up into a series of shorter steps
– Pipelined
  • Each instruction is broken up into a series of steps
  • Multiple instructions execute at once
– Microcode

Architectural State

• Determines everything about a processor:
– PC
– 32 registers
– Memory

Instruction Formats

R-Type: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
  e.g. add, sub, or, and, …
I-Type: op (6 bits) | rs (5 bits) | rt (5 bits) | imm (16 bits)
  e.g. lw, sw, beq, bne, …
J-Type: op (6 bits) | addr (26 bits)
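The field layouts above can be illustrated with a small Python sketch that slices a 32-bit instruction word into its fields. The function name `decode` and the dictionary keys are illustrative choices, not part of the slides; all fields are extracted regardless of format, and the caller uses only those that apply.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its bit fields.

    Every field is extracted unconditionally; which ones are meaningful
    depends on the format (R-, I-, or J-Type) selected by op.
    """
    return {
        "op":    (instr >> 26) & 0x3F,   # bits 31:26, all formats
        "rs":    (instr >> 21) & 0x1F,   # bits 25:21, R- and I-Type
        "rt":    (instr >> 16) & 0x1F,   # bits 20:16, R- and I-Type
        "rd":    (instr >> 11) & 0x1F,   # bits 15:11, R-Type
        "shamt": (instr >> 6) & 0x1F,    # bits 10:6,  R-Type
        "funct": instr & 0x3F,           # bits 5:0,   R-Type
        "imm":   instr & 0xFFFF,         # bits 15:0,  I-Type
        "addr":  instr & 0x3FFFFFF,      # bits 25:0,  J-Type
    }


# add $s0, $s1, $s2 encodes as 0x02328020 ($s0 = 16, $s1 = 17, $s2 = 18)
fields = decode(0x02328020)
print(fields["op"], fields["rs"], fields["rt"], fields["rd"], fields["funct"])
```

For an I-Type word such as `lw $s3, 1($0)` (0x8C130001), the same call yields op = 35, rs = 0, rt = 19, and imm = 1.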

State Elements: PC, 32 registers, and Memory

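[Figure: the state elements.
PC: 32-bit register with input PC' and clock CLK.
Instruction Memory: 32-bit address input A, 32-bit read data output RD.
Register File: 5-bit address inputs A1, A2, A3; 32-bit read outputs RD1, RD2; 32-bit write data input WD3; write enable WE3; clock CLK.
Data Memory: 32-bit address input A, 32-bit read output RD, 32-bit write data input WD, write enable WE, clock CLK.]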

Single-Cycle Datapath: lw fetch

• Executing lw: lw rt, imm(rs), where rs is the index (base) register, rt is the destination register, and imm is the immediate offset
• Example: lw $s3, 1($0) # read memory word 1 into $s3
  rt <- DataMemory[rs + imm]
• STEP 1: Fetch instruction

[Figure: single-cycle datapath, step 1. PC drives the instruction memory address input A; the instruction memory output RD is Instr. lw uses the I-Type format.]

Single-Cycle Datapath: lw register read

• STEP 2: Read source operands from register file

[Figure: single-cycle datapath, step 2. Instr[25:21] (rs) drives the register file address input A1.]

Single-Cycle Datapath: lw immediate

• STEP 3: Sign-extend the immediate

[Figure: single-cycle datapath, step 3. Instr[15:0] passes through the Sign Extend unit to produce SignImm.]

Single-Cycle Datapath: lw address

• STEP 4: Compute the memory address

[Figure: single-cycle datapath, step 4. The ALU adds SrcA (register file output RD1) and SrcB (SignImm) with ALUControl2:0 = 010 to produce ALUResult, the memory address. The ALU also outputs Zero.]

Single-Cycle Datapath: lw memory read

• STEP 5: Read data from memory and write it back to register file

[Figure: single-cycle datapath, step 5. ALUResult drives the data memory address A; ReadData returns to the register file input WD3; Instr[20:16] (rt) drives A3; RegWrite = 1; ALUControl2:0 = 010.]

Single-Cycle Datapath: lw PC increment

• STEP 6: Determine the address of the next instruction

[Figure: single-cycle datapath, step 6. An adder computes PCPlus4 = PC + 4, which drives PC'.]
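The six lw steps can be traced in software. The following Python sketch is only an illustration under stated assumptions: memories are modeled as dictionaries keyed by the address the datapath computes (no alignment checks), the register file is a 32-entry list, and the names `step_lw` and `sign_extend16` are hypothetical helpers rather than anything defined in the slides.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def step_lw(pc: int, instr_mem: dict, regs: list, data_mem: dict) -> int:
    """Execute one lw instruction and return the next PC."""
    instr = instr_mem[pc]                         # STEP 1: fetch instruction
    rs = (instr >> 21) & 0x1F                     # STEP 2: Instr[25:21] -> A1
    src_a = regs[rs]                              #         RD1 becomes SrcA
    sign_imm = sign_extend16(instr & 0xFFFF)      # STEP 3: sign-extend Instr[15:0]
    alu_result = (src_a + sign_imm) & 0xFFFFFFFF  # STEP 4: ALU add (ALUControl = 010)
    read_data = data_mem[alu_result]              # STEP 5: read data memory ...
    rt = (instr >> 16) & 0x1F                     #         Instr[20:16] -> A3
    if rt != 0:                                   #         $0 is hard-wired to zero
        regs[rt] = read_data                      #         ... and write back (RegWrite = 1)
    return (pc + 4) & 0xFFFFFFFF                  # STEP 6: PC' = PC + 4


# lw $s3, 1($0) encodes as 0x8C130001; $s3 is register 19
regs = [0] * 32
next_pc = step_lw(0, {0: 0x8C130001}, regs, {1: 0xDEADBEEF})
print(hex(regs[19]), next_pc)
```

In hardware all six steps happen within one clock cycle; the sequential order here only mirrors the order in which the slides build up the datapath.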

Single-Cycle Datapath: sw

• sw: sw rt, imm(rs), where rs is the index (base) register, rt is the source register, and imm is the immediate offset
• Write data in rt to memory

[Figure: single-cycle datapath extended for sw (I-Type). Instr[20:16] (rt) drives the register file address A2; RD2 is WriteData, connected to the data memory input WD. For sw: MemWrite = 1, RegWrite = 0, ALUControl2:0 = 010.]

Single-Cycle Datapath: R-type instructions

• Example: add $s0, $s1, $s2 # rd <- rs + rt
• Read from rs and rt
• Write ALUResult to register file
• Write to rd (instead of rt)

[Figure: single-cycle datapath extended for R-Type instructions. Multiplexers select SrcB (RD2 or SignImm, controlled by ALUSrc), WriteReg4:0 (Instr[20:16] or Instr[15:11], controlled by RegDst), and Result (ALUResult or ReadData, controlled by MemtoReg). For R-Type: RegWrite = 1, RegDst = 1, ALUSrc = 0, ALUControl2:0 varies, MemWrite = 0, MemtoReg = 0.]

Single-Cycle Datapath: beq

• Determine whether values in rs and rt are equal
  beq $s0, $s1, target
  …
  target: # label
• Calculate branch target address: BTA = (sign-extended immediate << 2) + (PC + 4)

[Figure: single-cycle datapath extended for beq (I-Type). SignImm << 2 is added to PCPlus4 to form PCBranch; PCSrc (Branch AND Zero) selects between PCPlus4 and PCBranch for PC'. For beq: RegWrite = 0, RegDst = x, ALUSrc = 0, ALUControl2:0 = 110, MemWrite = 0, MemtoReg = x, Branch = 1.]
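The branch target calculation can be checked with a few lines of Python. This is a sketch, not the datapath itself: `branch_target` and `next_pc_beq` are hypothetical helper names, and the equality test stands in for the ALU subtraction that sets Zero.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm16: int) -> int:
    """BTA = (sign-extended immediate << 2) + (PC + 4), wrapped to 32 bits."""
    return ((sign_extend16(imm16) << 2) + pc + 4) & 0xFFFFFFFF


def next_pc_beq(pc: int, imm16: int, rs_val: int, rt_val: int) -> int:
    """Select PC' for beq: PCBranch if the operands are equal, else PCPlus4."""
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0   # ALU subtract sets Zero
    pc_src = zero                                  # Branch = 1 for beq, so PCSrc = Zero
    return branch_target(pc, imm16) if pc_src else (pc + 4) & 0xFFFFFFFF


print(hex(branch_target(0x100, 3)))       # forward branch, 3 instructions past PC + 4
print(hex(branch_target(0x100, 0xFFFE)))  # backward branch, immediate = -2
```

The shift by 2 converts the immediate from a count of instructions to a count of bytes, since each instruction occupies 4 bytes.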

Complete Single-Cycle Processor

[Figure: complete single-cycle processor. The Control Unit takes Op (Instr[31:26]) and Funct (Instr[5:0]) and generates MemtoReg, MemWrite, Branch, ALUControl2:0, ALUSrc, RegDst, and RegWrite. PCSrc is formed from Branch and Zero.]
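As a rough sketch, the control unit can be modeled as a lookup from Op and Funct to the control signals listed for each instruction on the preceding slides. Several things here are assumptions rather than content from the slides: the two-level split into a main decoder and an ALU decoder with an intermediate `ALUOp`, the standard MIPS opcode and funct encodings, and the ALUControl codes for and (000) and or (001). Don't-care values (x) are represented as `None`.

```python
# Main decoder, keyed by opcode (standard MIPS encodings, assumed here).
MAIN_DECODER = {
    0b000000: {"RegWrite": 1, "RegDst": 1, "ALUSrc": 0, "Branch": 0,
               "MemWrite": 0, "MemtoReg": 0, "ALUOp": 0b10},        # R-Type
    0b100011: {"RegWrite": 1, "RegDst": 0, "ALUSrc": 1, "Branch": 0,
               "MemWrite": 0, "MemtoReg": 1, "ALUOp": 0b00},        # lw
    0b101011: {"RegWrite": 0, "RegDst": None, "ALUSrc": 1, "Branch": 0,
               "MemWrite": 1, "MemtoReg": None, "ALUOp": 0b00},     # sw
    0b000100: {"RegWrite": 0, "RegDst": None, "ALUSrc": 0, "Branch": 1,
               "MemWrite": 0, "MemtoReg": None, "ALUOp": 0b01},     # beq
}

# ALU decoder for R-Type instructions, keyed by funct (assumed encodings).
FUNCT_TO_ALUCONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
}


def control(op: int, funct: int) -> dict:
    """Return the control signals for one instruction."""
    signals = dict(MAIN_DECODER[op])      # copy so the table is not mutated
    alu_op = signals.pop("ALUOp")
    if alu_op == 0b00:                    # lw / sw: add to form the address
        signals["ALUControl"] = 0b010
    elif alu_op == 0b01:                  # beq: subtract to compare
        signals["ALUControl"] = 0b110
    else:                                 # R-Type: operation chosen by funct
        signals["ALUControl"] = FUNCT_TO_ALUCONTROL[funct]
    return signals


def pc_src(branch: int, zero: int) -> int:
    """PCSrc = Branch AND Zero."""
    return branch & zero


print(control(0b100011, 0))  # control signals for lw
```

A table-driven decoder like this keeps each instruction's signal values on one line, which makes it easy to compare against the per-instruction figures.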

Review: Processor Performance

Program Execution Time

= (# instructions)(cycles/instruction)(seconds/cycle)

= # instructions × CPI × Tc

Single-Cycle Performance

• Tc is limited by the critical path (lw)

[Figure: complete single-cycle processor with the lw critical path highlighted and the lw control signal values shown: RegWrite = 1, RegDst = 0, ALUSrc = 1, ALUControl2:0 = 010, Branch = 0, MemWrite = 0, MemtoReg = 1, PCSrc = 0.]

Single-Cycle Performance

• Single-cycle critical path:
  Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup
• In most implementations, the limiting paths are memory, ALU, and register file:
  Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup

Single-Cycle Performance Example

Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     30
Register setup        tsetup      20
Multiplexer           tmux        25
ALU                   tALU        200
Memory read           tmem        250
Register file read    tRFread     150
Register file setup   tRFsetup    20

Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
   = [30 + 2(250) + 150 + 25 + 200 + 20] ps = 925 ps


• For a program with 100 billion instructions executing on a single-cycle MIPS processor:

Execution Time = # instructions × CPI × Tc
= (100 × 10^9)(1)(925 × 10^-12 s) = 92.5 seconds
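The arithmetic in this example can be reproduced with a short Python sketch. The delay values are the ones from the table; the dictionary and function names are illustrative, and only the simplified critical-path formula (memory, ALU, and register file limiting) is modeled.

```python
# Element delays in picoseconds, taken from the example table.
DELAYS_PS = {
    "tpcq_PC": 30,
    "tsetup": 20,
    "tmux": 25,
    "tALU": 200,
    "tmem": 250,
    "tRFread": 150,
    "tRFsetup": 20,
}


def single_cycle_tc_ps(d: dict) -> int:
    """Tc = tpcq_PC + 2*tmem + tRFread + tmux + tALU + tRFsetup."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tmux"] + d["tALU"] + d["tRFsetup"])


def execution_time_s(n_instructions: float, cpi: float, tc_ps: float) -> float:
    """Execution time = # instructions x CPI x Tc, with Tc given in ps."""
    return n_instructions * cpi * tc_ps * 1e-12


tc = single_cycle_tc_ps(DELAYS_PS)
print(tc, "ps")
print(execution_time_s(100e9, 1, tc), "s")
```

Changing a single entry in `DELAYS_PS` (for example a faster memory) shows directly how much each element contributes to the cycle time.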

Pipelined Processor

• Temporal parallelism
• Divide single-cycle processor into 5 ROUGHLY EQUIVALENT stages:
– Fetch
– Decode
– Execute
– Memory
– Writeback
• Each stage includes one “slow step”
• Add pipeline registers between stages
• 5 stages => ideally ~5 times higher throughput (less in practice because of hazards and pipeline register overhead)
• All modern high-performance processors are pipelined.

Single-Cycle vs. Pipelined Performance

[Figure: timing diagrams for single-cycle and pipelined execution, with a time axis from 0 to 1900 ps. Each instruction passes through Fetch Instruction, Decode/Read Reg, Execute ALU, Memory Read/Write, and Write Reg. The single-cycle diagram shows instructions 1 and 2 one after the other; the pipelined diagram shows instructions 1, 2, and 3 overlapping.]

The length of all pipeline stages is set by the slowest stage

The instruction latency is 5 × 250 ps = 1250 ps
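The latency figure above, and the time to complete a run of instructions, can be worked out with a small Python sketch. It assumes an ideal pipeline: every stage takes the slowest stage's time, one instruction enters per cycle, and there are no stalls or flushes. The function names are illustrative.

```python
def instruction_latency_ps(n_stages: int, stage_ps: int) -> int:
    """Time for one instruction to pass through every stage."""
    return n_stages * stage_ps


def pipelined_time_ps(n_instructions: int, n_stages: int, stage_ps: int) -> int:
    """Total time for a run of instructions in an ideal pipeline.

    The first instruction needs n_stages cycles; each later instruction
    finishes one cycle after the one before it.
    """
    return (n_stages + n_instructions - 1) * stage_ps


print(instruction_latency_ps(5, 250))   # latency of a single instruction
print(pipelined_time_ps(3, 5, 250))     # three overlapped instructions
```

For long instruction sequences the fill time of the first instruction becomes negligible, so the ideal throughput approaches one instruction per stage time.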

Pipelining Abstraction

[Figure: pipeline abstraction diagram, time in cycles 1 to 10. Each instruction passes through IM, RF (read), ALU, DM, and RF (write) in successive cycles, and each instruction starts one cycle after the previous one.]

lw $s2, 40($0)
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw $s6, 20($s1)
or $s7, $t3, $t4
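The stage-per-cycle layout of this diagram can be generated with a few lines of Python. This is a sketch of the abstraction only, assuming no hazards; the stage label WB is used here for the register file write that the figure draws as a second RF block, and `schedule` and `render` are hypothetical helper names.

```python
STAGES = ["IM", "RF", "ALU", "DM", "WB"]  # WB = register file write


def schedule(program: list) -> list:
    """Pair each instruction with a {cycle: stage} map.

    Instruction i (0-based) enters the pipeline in cycle i + 1 and
    advances one stage per cycle.
    """
    rows = []
    for i, instr in enumerate(program):
        start = i + 1
        rows.append((instr, {start + k: stage for k, stage in enumerate(STAGES)}))
    return rows


def render(rows: list) -> str:
    """Format the schedule as a text table, one row per instruction."""
    last_cycle = max(max(cycles) for _, cycles in rows)
    lines = []
    for instr, cycles in rows:
        cells = [f"{cycles.get(c, ''):>4}" for c in range(1, last_cycle + 1)]
        lines.append(f"{instr:<20}" + "".join(cells))
    return "\n".join(lines)


program = [
    "lw $s2, 40($0)",
    "add $s3, $t1, $t2",
    "sub $s4, $s1, $s5",
    "and $s5, $t5, $t6",
    "sw $s6, 20($s1)",
    "or $s7, $t3, $t4",
]
print(render(schedule(program)))
```

With six instructions and five stages the last instruction writes back in cycle 10, matching the cycle axis of the figure.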

Single-Cycle and Pipelined Datapath

[Figure: the single-cycle datapath and the pipelined datapath, both divided into Fetch, Decode, Execute, Memory, and Writeback stages. In the pipelined datapath, pipeline registers separate the stages, and signal names carry a stage suffix (F, D, E, M, W), e.g. PCF, InstrD, SrcAE, ALUOutM, ResultW.]

Corrected Pipelined Datapath

[Figure: corrected pipelined datapath (Fetch, Decode, Execute, Memory, Writeback). WriteReg is carried through the pipeline registers as WriteRegE, WriteRegM, and WriteRegW.]

• WriteReg address must arrive at the same time as Result

Pipelined Control

[Figure: pipelined processor with control. The Control Unit decodes Op (InstrD[31:26]) and Funct (InstrD[5:0]) in the Decode stage, producing RegWriteD, MemtoRegD, MemWriteD, BranchD, ALUControlD, ALUSrcD, and RegDstD. The signals travel through the pipeline registers with the instruction: RegWrite and MemtoReg to the Writeback stage, MemWrite and Branch to the Memory stage, and ALUControl, ALUSrc, and RegDst to the Execute stage. PCSrcM is formed from BranchM and ZeroM.]

• Same control unit as single-cycle processor
• Control delayed to proper pipeline stage

Pipeline Hazard

• Occurs when an instruction depends on results from a previous instruction that has not completed.
• Types of hazards:
– Data hazard: register value not yet written back to the register file
– Control hazard: next instruction not decided yet (caused by branches)

Data Hazard

[Figure: pipeline abstraction diagram, time in cycles 1 to 8. add writes $s0 in its Writeback stage, while the following and, or, and sub instructions read $s0 in their Decode stages.]

add $s0, $s2, $s3
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
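A read-after-write dependence check over this sequence can be sketched in Python. The sketch rests on assumptions not stated on the slide: there is no forwarding, and the register file is written in the first half of a cycle and read in the second half, so only the two instructions immediately after a producer can read a stale value. The tuple layout `(text, dest, sources)` and the name `raw_hazards` are illustrative.

```python
def raw_hazards(program: list, window: int = 2) -> list:
    """List (producer, consumer, register) triples that form data hazards.

    program is a list of (text, dest, sources) tuples. A consumer is at
    risk if it reads dest within `window` instructions after the producer.
    """
    hazards = []
    for i, (text, dest, _) in enumerate(program):
        if dest is None or dest == "$0":   # no result, or writes to $0 are ignored
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            consumer_text, _, consumer_srcs = program[j]
            if dest in consumer_srcs:
                hazards.append((text, consumer_text, dest))
    return hazards


program = [
    ("add $s0, $s2, $s3", "$s0", ("$s2", "$s3")),
    ("and $t0, $s0, $s1", "$t0", ("$s0", "$s1")),
    ("or $t1, $s4, $s0",  "$t1", ("$s4", "$s0")),
    ("sub $t2, $s0, $s5", "$t2", ("$s0", "$s5")),
]
for producer, consumer, reg in raw_hazards(program):
    print(f"{consumer!r} reads {reg} before {producer!r} has written it")
```

Under these assumptions and and or are flagged, while sub reads $s0 in the same cycle that add writes it and is therefore safe. With a register file that cannot write and read in the same cycle, `window` would need to be 3.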

Handling Data Hazards

• Insert nops in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time
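The first option, inserting nops at compile time, can be illustrated with a small Python sketch. It assumes no forwarding and a register file that is written in the first half of a cycle and read in the second half, so a consumer must sit at least three instruction slots after its producer. The tuple layout `(text, dest, sources)`, the `NOP` constant, and `insert_nops` are hypothetical names for illustration.

```python
NOP = ("nop", None, ())   # writes no register, reads no register
MIN_DISTANCE = 3          # slots needed between producer and consumer (assumed)


def insert_nops(program: list) -> list:
    """Return a copy of program with nops added to remove data hazards."""
    out = []
    for text, dest, srcs in program:
        needed = 0
        # Look at the most recent slots already emitted (nops included).
        for dist in range(1, MIN_DISTANCE):
            if len(out) >= dist:
                prev_dest = out[-dist][1]
                if prev_dest is not None and prev_dest != "$0" and prev_dest in srcs:
                    needed = max(needed, MIN_DISTANCE - dist)
        out.extend([NOP] * needed)
        out.append((text, dest, srcs))
    return out


program = [
    ("add $s0, $s2, $s3", "$s0", ("$s2", "$s3")),
    ("and $t0, $s0, $s1", "$t0", ("$s0", "$s1")),
    ("or $t1, $s4, $s0",  "$t1", ("$s4", "$s0")),
    ("sub $t2, $s0, $s5", "$t2", ("$s0", "$s5")),
]
for text, _, _ in insert_nops(program):
    print(text)
```

Two nops after add are enough here: they push and far enough back, and every later reader of $s0 moves with it. The cost is two wasted cycles, which is why forwarding at run time is usually preferred when the hardware supports it.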

Control Hazards

• beq:
– Branch is not determined until the fourth stage of the pipeline
– Instructions after the branch are fetched before the branch occurs
– These instructions must be flushed if the branch happens
• Branch misprediction penalty:
– Number of instructions flushed when the branch is taken
– May be reduced by determining the branch earlier

Branch Prediction

• Guess whether branch will be taken
– Backward branches are usually taken (loops)
– Perhaps consider history of whether branch was previously taken to improve the guess
• Good prediction reduces the fraction of branches requiring a flush
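Both ideas can be sketched in Python. The static rule predicts taken when the branch offset is negative (a backward branch); the history-based variant shown here simply remembers each branch's last outcome, which is one simple scheme chosen for illustration and not necessarily what any particular processor uses. All names are hypothetical.

```python
def predict_static(sign_imm: int) -> bool:
    """Predict taken for backward branches (negative offset), else not taken."""
    return sign_imm < 0


class LastOutcomePredictor:
    """Predict that a branch repeats its previous outcome.

    Falls back to the static backward-taken rule the first time a
    branch address is seen.
    """

    def __init__(self):
        self.history = {}   # branch PC -> last observed outcome

    def predict(self, pc: int, sign_imm: int) -> bool:
        if pc in self.history:
            return self.history[pc]
        return predict_static(sign_imm)

    def update(self, pc: int, taken: bool) -> None:
        self.history[pc] = taken


def count_mispredictions(predictor, pc: int, sign_imm: int, outcomes: list) -> int:
    """Run one branch through a sequence of actual outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc, sign_imm) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A loop-closing branch: taken four times, then falls through on exit.
loop_outcomes = [True, True, True, True, False]
print(count_mispredictions(LastOutcomePredictor(), 0x20, -3, loop_outcomes))
```

For this loop only the final, not-taken iteration is mispredicted, so only one of the five branches requires a flush.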