Chapter 6 Slides - University of Arizona · 2 Pipelining • Improve performance by increasing...


CHAPTER 6

Pipelining

• Improve performance by increasing instruction throughput

Instruction class   Instruction   Register   ALU    Data     Register   Total
                    memory        read              memory   write      (in ps)
Load word           200           100        200    200      100        800
Store word          200           100        200    200      -          700
R-format            200           100        200    -        100        600
Branch              200           100        200    -        -          500

Ideal speedup is number of stages in the pipeline. Do we achieve this?
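The question above can be checked with a little arithmetic. The sketch below is an illustration only: it assumes the five stage latencies of the load word row in the table, a single-cycle clock sized for the slowest instruction, and a pipelined clock sized for the slowest stage. The function names are made up for this example.

```python
# Stage latencies in ps for load word, taken from the table above.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def single_cycle_time(n):
    """n instructions, each taking one long cycle sized for the slowest instruction (lw)."""
    return n * sum(STAGE_PS.values())

def pipelined_time(n):
    """The clock is set by the slowest stage; n instructions need k + n - 1 cycles."""
    cycle = max(STAGE_PS.values())
    k = len(STAGE_PS)
    return (k + n - 1) * cycle

def speedup(n):
    return single_cycle_time(n) / pipelined_time(n)

print(single_cycle_time(3), pipelined_time(3))  # 2400 1400
print(speedup(1_000_000))                       # approaches 4, not 5
```

Under these assumptions the speedup approaches 800/200 = 4 rather than the stage count of 5, because the stages are not perfectly balanced.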

[Figure: three load instructions (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) executed without and with pipelining. Axes: time versus program execution order (in instructions). Stages: instruction fetch, Reg, ALU, data access, Reg. Nonpipelined: a new instruction starts every 8 ns. Pipelined: a new instruction starts every 2 ns.]

Pipelining

• What makes it easy
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores

• What makes it hard?
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction depends on a previous instruction

• We’ll build a simple pipeline and look at these issues

• We’ll talk about modern processors and what really makes it hard:
  – exception handling
  – trying to improve performance with out-of-order execution, etc.

Hazards

A = B + E
C = B + F

lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)

lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)
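The two listings contain the same instructions in a different order. As a rough illustration of why the order matters, the sketch below counts load-use pairs, that is, instructions that read a register loaded by the instruction immediately before them. It assumes a five-stage pipeline with forwarding, where each such pair costs one stall cycle; the helper names `parse` and `load_use_stalls` are made up for this example.

```python
import re

def parse(line):
    """Return (opcode, destination register or None, list of source registers)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":                 # sw rt, off(rs): reads both, writes nothing
        return op, None, regs
    return op, regs[0], regs[1:]   # lw and R-format: first register is written

def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    prev_op, prev_dest = None, None
    for line in program:
        op, dest, srcs = parse(line)
        if prev_op == "lw" and prev_dest in srcs:
            stalls += 1
        prev_op, prev_dest = op, dest
    return stalls

original = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
reordered = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
print(load_use_stalls(original), load_use_stalls(reordered))  # 2 0
```

Moving the third load up separates each load from the instruction that uses its result, so under these assumptions the reordered listing runs without load-use stalls.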

Basic Idea

What do we need to add to actually split the datapath into stages?

Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

Pipelined datapath

Five Stages (lw)

Memory and registers
Left half: write
Right half: read

Five Stages (lw)

What is wrong with this datapath?

• Can help with answering questions like:
  – How many cycles does it take to execute this code?
  – What is the ALU doing during cycle 4?
  – Use this representation to help understand datapaths

Graphically representing pipelines

Pipeline operation

• In a pipeline, one operation begins in every cycle
• Also, one operation completes in each cycle
• Each instruction takes 5 clock cycles
  – k cycles in general, where k is pipeline depth
• When a stage is not used, no control needs to be applied
• In one clock cycle, several instructions are active
• Different stages are executing different instructions
• How to generate control signals for them is an issue

Pipeline control

• We have 5 stages. What needs to be controlled in each stage?
  – Instruction Fetch and PC Increment
  – Instruction Decode / Register Fetch
  – Execution
  – Memory Stage
  – Write Back

• How would control be handled in an automobile plant?
  – A fancy control center telling everyone what to do?
  – Should we use a finite state machine?

[Figure: pipelined datapath with control signals. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Control signals: RegDst, ALUSrc, ALU control, Branch, MemRead, MemWrite, MemtoReg, RegWrite. Other labels: instruction memory, registers, sign extend, shift left 2, ALU, data memory.]

Pipeline control

              Execution/address calculation     Memory access stage         Write-back stage
              stage control lines               control lines               control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format      1       1       0       0         0       0        0          1         0
lw            0       0       0       1         0       1        0          1         1
sw            X       0       0       1         0       0        1          0         X
beq           X       0       1       0         1       0        0          0         X
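One way to read the control table is as a lookup keyed by instruction class, with the nine signals grouped by the stage that consumes them. The sketch below is an illustration only: the signal values are copied from the table, while the dictionary layout and the function name `control_by_stage` are assumptions made for this example.

```python
# Control values copied from the table; "X" marks a don't-care.
SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc",
           "Branch", "MemRead", "MemWrite", "RegWrite", "MemtoReg")

CONTROL = {
    "R-format": dict(zip(SIGNALS, (1, 1, 0, 0, 0, 0, 0, 1, 0))),
    "lw":       dict(zip(SIGNALS, (0, 0, 0, 1, 0, 1, 0, 1, 1))),
    "sw":       dict(zip(SIGNALS, ("X", 0, 0, 1, 0, 0, 1, 0, "X"))),
    "beq":      dict(zip(SIGNALS, ("X", 0, 1, 0, 1, 0, 0, 0, "X"))),
}

# Which signals are used in which stage (the three column groups of the table).
STAGE_FIELDS = {
    "EX":  ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc"),
    "MEM": ("Branch", "MemRead", "MemWrite"),
    "WB":  ("RegWrite", "MemtoReg"),
}

def control_by_stage(instr_class):
    """Split an instruction's control signals into the groups used by EX, MEM, and WB."""
    signals = CONTROL[instr_class]
    return {stage: {name: signals[name] for name in names}
            for stage, names in STAGE_FIELDS.items()}

print(control_by_stage("lw")["WB"])  # {'RegWrite': 1, 'MemtoReg': 1}
```

Grouping the signals this way mirrors the table's three column groups: each group is needed only when the instruction reaches the corresponding stage.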

Pipeline control

[Figure: pipelined datapath with the control unit added. Labels: Control, WB, IF/ID, EX/MEM, MEM/WB, Branch, RegDst, ALUSrc, ALU control, MemRead, registers, sign extend, shift left 2, data memory.]

Datapath with control

• Problem with starting next instruction before first is finished
  – Dependencies that “go backward in time” are data hazards

[Figure: pipeline diagram over clock cycles CC 1 to CC 9 for the sequence below. Axes: time (in clock cycles) versus program execution order (in instructions). Stage labels: IM, Reg, DM, Reg.]

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

                        CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:   10    10    10    10    10/–20  –20   –20   –20   –20

Dependencies

• Use temporary results, don’t wait for them to be written
  – register file forwarding to handle read/write to same register
  – ALU forwarding

Program execution order (in instructions):

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Forwarding

[Figure: pipeline diagram with forwarding paths. Stage labels: IM, Reg, DM, Reg.]

Time (in clock cycles)
                        CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:   10    10    10    10    10/–20  –20   –20   –20   –20
Value of EX/MEM:        X     X     X     –20   X       X     X     X     X
Value of MEM/WB:        X     X     X     X     –20     X     X     X     X

[Figure: pipelined datapath with a forwarding unit. Labeled blocks: PC, instruction memory, registers, control, data memory, forwarding unit, multiplexers; pipeline registers EX/MEM and MEM/WB. Register-number labels: EX/MEM.RegisterRd, MEM/WB.RegisterRd, IF/ID.RegisterRd, IF/ID.RegisterRt, IF/ID.RegisterRs.]

Forwarding

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

• Load word can still cause a hazard:
  – an instruction tries to read a register following a load instruction that writes to the same register.

• Thus, we need a hazard detection unit to “stall” the instruction that follows the load

Can't always forward

[Figure: pipeline diagram over clock cycles CC 1 to CC 9 for the sequence below. Axes: time (in clock cycles) versus program execution order (in instructions). Stage labels: IM, Reg, DM, Reg.]

lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

Forwarding

Forward from EX/MEM registers
  if (EX/MEM.RegWrite)
  and (EX/MEM.Rd != 0)
  and (ID/EX.Rs == EX/MEM.Rd)

Forward from MEM/WB registers
  if (MEM/WB.RegWrite)
  and (MEM/WB.Rd != 0)
  and (ID/EX.Rt == MEM/WB.Rd)

(Each test applies to both source operands, ID/EX.Rs and ID/EX.Rt.)
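The forwarding conditions can be written as a small selection function for one ALU operand. This is a sketch under stated assumptions, not the hardware itself: the pipeline registers are modeled as dictionaries with hypothetical keys "RegWrite" and "Rd", the function name `forward_select` is made up, and the EX/MEM result is given priority because it is the more recent value.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where one ALU operand comes from.

    src_reg: register number read by the instruction in EX (ID/EX.Rs or ID/EX.Rt).
    ex_mem, mem_wb: dicts with keys "RegWrite" (bool) and "Rd" (int).
    Returns "EX/MEM", "MEM/WB", or "REGFILE".
    """
    # Most recent result first: the instruction one ahead in the pipeline.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src_reg:
        return "EX/MEM"
    # Otherwise the instruction two ahead.
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"

# and $12, $2, $5 is in EX while sub $2, $1, $3 is in MEM.
sub_in_mem = {"RegWrite": True, "Rd": 2}
idle = {"RegWrite": False, "Rd": 0}
print(forward_select(2, sub_in_mem, idle))  # EX/MEM
print(forward_select(5, sub_in_mem, idle))  # REGFILE
```

The check against register 0 keeps a write to $0, which always reads as zero in MIPS, from being forwarded.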

[Figure: pipeline diagram over clock cycles CC 1 to CC 10 for the sequence below, with a bubble inserted into the pipeline. Axis: time (in clock cycles). Stage labels: IM, Reg, DM, Reg.]

lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

Stalling

• Hardware detection and no-op insertion is called stalling
• Stall pipeline by keeping instruction in the same stage

Example

Stall logic

• Stall logic
  – If (ID/EX.MemRead) // Load word instruction
    AND
  – If ((ID/EX.Rt == IF/ID.Rs) or (ID/EX.Rt == IF/ID.Rt))

• Insert no-op (no-operation)
  – Deasserting all control signals

• Stall following instruction
  – Not writing program counter
  – Not writing IF/ID registers
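The stall logic above translates almost directly into code. The following is a minimal sketch, assuming the pipeline registers are modeled as dictionaries; `PCWrite` is the signal named on the slide, while `IFIDWrite`, `InsertNop`, and the function name `hazard_detection` are hypothetical names chosen for this example.

```python
def hazard_detection(id_ex, if_id):
    """Decide whether the instruction in ID must wait one cycle for a load in EX.

    id_ex: dict with "MemRead" (bool) and "Rt" (destination of a load).
    if_id: dict with "Rs" and "Rt" (source registers of the instruction in ID).
    """
    stall = bool(id_ex["MemRead"]) and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"])
    return {
        "PCWrite": not stall,    # hold the program counter
        "IFIDWrite": not stall,  # hold the instruction in IF/ID
        "InsertNop": stall,      # deassert all control signals entering ID/EX
    }

# lw $2, 20($1) is in EX; and $4, $2, $5 is in ID.
load = {"MemRead": True, "Rt": 2}
print(hazard_detection(load, {"Rs": 2, "Rt": 5}))
# {'PCWrite': False, 'IFIDWrite': False, 'InsertNop': True}
```

Holding the PC and IF/ID for one cycle repeats the fetch and decode of the dependent instruction, while the no-op travels down the pipeline as the bubble.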

[Figure: hazard detection unit. Labels: PCWrite, IF/ID.Rs, IF/ID.Rt, ID/EX.Rt.]

Pipeline with hazard detection

Summary

Forwarding Case Summary

Multi-cycle

Multi-cycle Pipeline

[Figure: pipelined datapath with the control unit. Labels: Control, WB, IF/ID, EX/MEM, MEM/WB, Branch, RegDst, ALUSrc, ALU control, MemRead, registers, sign extend, shift left 2, data memory.]

Branch Hazards

• When we decide to branch, other instructions are in the pipeline!
• We are predicting “branch not taken”
  – need to add hardware for flushing instructions if we are wrong

[Figure: pipeline diagram over clock cycles CC 2 to CC 9 for the sequence below (instruction addresses on the left). Axis: time (in clock cycles). Stage labels: IM, Reg, DM, Reg.]

40 beq $1, $3, 7
44 and $12, $2, $5
48 or  $13, $6, $2
52 add $14, $2, $2
72 lw  $4, 50($7)

Branch hazards
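The addresses in the branch example follow from how MIPS computes a beq target: the address of the next instruction plus the word offset shifted left by 2. The sketch below reproduces that arithmetic. It also lists the instructions already fetched behind the branch, assuming (as an illustration) that the branch is resolved in the MEM stage, so three instructions have entered the pipeline; the function names are made up for this example.

```python
def branch_target(pc, offset_words):
    """MIPS beq target: (pc + 4) plus the 16-bit word offset shifted left by 2."""
    return pc + 4 + (offset_words << 2)

def fetched_behind_branch(pc, penalty_cycles=3):
    """Addresses fetched under "predict not taken" before the branch is resolved."""
    return [pc + 4 * i for i in range(1, penalty_cycles + 1)]

print(branch_target(40, 7))        # 72
print(fetched_behind_branch(40))   # [44, 48, 52]
```

With beq at address 40 and offset 7, the target is 72 (the lw), and the instructions at 44, 48, and 52 are the ones that must be flushed if the branch is taken.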

Solution to control hazards

• Branch prediction
  – We are predicting “branch not taken”
  – Need to add hardware for flushing instructions if we are wrong

• Reduce branch penalty
  – By advancing the branch decision to ID stage
  – Compare the data read from two registers read in ID stage
  – Comparison for equality is a simpler design! (Why?)
  – Still need to flush instruction in IF stage

• Make the hazard into a feature!
  – Delayed branch slot: always execute instruction following branch

Branch detection in ID stage

Dynamic branch prediction

• Use lower part of instruction address
  – Use one bit to denote branch taken or not taken
  – Disadvantage: poor performance in loops

• Dynamic branch prediction
  – Use two bits instead of one
  – A prediction must be wrong twice before it is changed

• More sophisticated
  – Count the number of times branch is taken

[Figure: 2-bit branch prediction state diagram]
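The loop disadvantage of the 1-bit scheme can be seen by counting mispredictions. The sketch below compares a 1-bit predictor with a 2-bit saturating counter, which is one common realization of a 2-bit scheme (the state diagram on the slide may use different transitions). The initial states and the loop pattern are assumptions chosen for illustration.

```python
def mispredictions_1bit(outcomes, state=False):
    """Predict that the branch does what it did last time."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken
    return misses

def mispredictions_2bit(outcomes, counter=0):
    """Saturating counter in 0..3; predict taken when the counter is 2 or 3."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

# A loop-closing branch: taken 9 times, then not taken once; the loop is entered 10 times.
loop = ([True] * 9 + [False]) * 10
print(mispredictions_1bit(loop), mispredictions_2bit(loop))  # 20 12
```

After warm-up the 1-bit predictor misses twice per loop (on exit and again on re-entry), while the 2-bit counter misses only on the exit.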

Correlating Branches
• Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
• In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
  – The old 2-bit BHT is then a (0,2) predictor

if (aa == 2)
    aa = 0;
if (bb == 2)
    bb = 0;
if (aa != bb)
    do something;

[Figure: correlating predictor. The branch address and a 2-bit global branch history select among 2-bit per-branch predictors to produce the prediction.]

Correlating Branches

(2,2) predictor
  – Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction

[Figure: (2,2) predictor. The branch address and a 2-bit global branch history select among 2-bit per-branch predictors to produce the prediction.]
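An (m,n) predictor can be sketched in a few lines. This is an illustrative model, not a description of any particular processor: the class name, the table size, the indexing by low-order address bits, and the use of saturating counters are all assumptions made for the example.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history pick one of 2**m n-bit counters per entry."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m = m
        self.max_count = (1 << n) - 1
        self.entries = entries
        self.history = 0  # last m branch outcomes, newest in the low bit
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def _row(self, branch_addr):
        # Index with the low-order bits of the word address.
        return self.table[(branch_addr >> 2) % self.entries]

    def predict(self, branch_addr):
        return self._row(branch_addr)[self.history] > self.max_count // 2

    def update(self, branch_addr, taken):
        row = self._row(branch_addr)
        count = row[self.history]
        # Only the counter selected by the current history is updated.
        row[self.history] = min(count + 1, self.max_count) if taken else max(count - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

def count_misses(predictor, branch_addr, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) != taken:
            misses += 1
        predictor.update(branch_addr, taken)
    return misses

alternating = [True, False] * 50
print(count_misses(CorrelatingPredictor(m=2), 40, alternating))  # 3
print(count_misses(CorrelatingPredictor(m=0), 40, alternating))  # 50
```

On a strictly alternating branch, the (2,2) configuration learns the pattern after a short warm-up, while the (0,2) configuration (a plain 2-bit BHT) keeps mispredicting every taken outcome in this run.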

[Figure: comparison of three prediction schemes: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2). Visible data labels: 5%, 6%, 6%.]

Accuracy of Different Schemes

Branch Prediction

• Sophisticated Techniques:
  – A “branch target buffer” to help us look up the destination
  – Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
  – Tournament predictors that use different types of prediction strategies and keep track of which one is performing best.
  – A “branch delay slot” which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA)

• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!

• Modern processors predict correctly 95% of the time!

Branch Target Buffer

• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  – Note: must check for branch match now, since can’t use wrong branch address

• Return instruction addresses predicted with stack

[Figure: BTB lookup producing the predicted PC and the branch prediction (taken or not taken).]
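The lookup described above, including the branch-match check, can be sketched as follows. This is a simplified model under stated assumptions: a direct-mapped buffer, a single taken/not-taken bit per entry, 4-byte instructions, and hypothetical class and method names.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB; each slot holds (branch_pc, target, predict_taken)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.slots = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        """Return the predicted next PC for the instruction fetched at pc."""
        slot = self.slots[self._index(pc)]
        # The stored branch address must match, or the target belongs to another branch.
        if slot is not None and slot[0] == pc and slot[2]:
            return slot[1]
        return pc + 4

    def update(self, pc, target, taken):
        self.slots[self._index(pc)] = (pc, target, taken)

btb = BranchTargetBuffer()
print(btb.lookup(40))   # 44: no entry yet, fall through
btb.update(40, 72, True)
print(btb.lookup(40))   # 72: predicted taken, use the stored target
print(btb.lookup(104))  # 108: same slot as 40, but the branch address does not match
```

The third lookup shows why the match check matters: address 104 maps to the same slot as address 40 in this model, and without the check it would be redirected to the wrong target.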

Scheduling in delayed branching

Other issues in pipelines

• Exceptions
  – Errors in ALU for arithmetic instructions
  – Memory non-availability

• Exceptions lead to a jump in a program
• However, the current PC value must be saved so that the program can return to it after recoverable errors
• Multiple exceptions can occur in a pipeline
• Preciseness of exception location is important in some cases
• I/O exceptions are handled in the same manner

Exceptions

Improving Performance

• Try to avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

• Dynamic Pipeline Scheduling
  – Hardware chooses which instructions to execute next
  – Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!)
  – Speculates on branches and keeps the pipeline full (may need to roll back if prediction incorrect)

• Trying to exploit instruction-level parallelism

Advanced Pipelining

• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• “Superscalar” processors
  – DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
• All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different “pipes”)
• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)

• This class has given you the background you need to learn more!

Superscalar architecture: two instructions executed in parallel

Dynamically scheduled pipeline

Motorola G4e

Intel Pentium 4

IBM PowerPC 970

Important facts to remember

• Pipelined processors divide execution into multiple steps
• However, pipeline hazards reduce performance
  – Structural, data, and control hazards
• Data forwarding helps resolve data hazards
  – But not all hazards can be resolved
  – Some data hazards require bubble or no-op insertion
• Effects of control hazards are reduced by branch prediction
  – Predict always taken, delayed slots, branch prediction table
• Structural hazards are resolved by duplicating resources

• Time to execute n instructions depends on
  – # of stages (k)
  – # of control hazards and the penalty of each
  – # of data hazards and the penalty for each
  – Time (in cycles) = n + k - 1 + (load hazard penalty) + (branch penalty)

• Load hazard penalty is 1 or 0 cycles
  – Depending on data use with forwarding

• Branch penalty is 3, 2, 1, or zero cycles depending on scheme
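The timing formula can be evaluated directly. In the sketch below, the function name and parameter names are made up for illustration, and the totals for the penalty terms are assumed to be (number of hazards) times (penalty per hazard).

```python
def pipeline_cycles(n, k=5, load_stalls=0, load_penalty=1,
                    branch_misses=0, branch_penalty=1):
    """Cycles to run n instructions on a k-stage pipeline.

    Time = n + k - 1 + (load hazard penalty) + (branch penalty), where each
    penalty term is the number of hazards times the cycles lost per hazard.
    """
    return n + k - 1 + load_stalls * load_penalty + branch_misses * branch_penalty

print(pipeline_cycles(3))                                       # 7
print(pipeline_cycles(7, load_stalls=2))                        # 13
print(pipeline_cycles(100, branch_misses=10, branch_penalty=3)) # 134
```

The first call matches three back-to-back instructions with no hazards: the first completes after k = 5 cycles and each later one completes a cycle after its predecessor.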

Design and performance issues with pipelining

• Pipelined processors are not EASY to design

• Technology affects implementation

• Instruction set design affects performance
  – e.g., beq, bne

• More stages do not necessarily lead to higher performance!

Chapter 6 Summary

• Pipelining does not improve latency, but does improve throughput

[Figure, two panels, each placing six designs: Multicycle (Section 5.5), Single-cycle (Section 5.4), Pipelined, Deeply pipelined, Multiple-issue pipelined (Section 6.9), and Multiple issue with deep pipeline (Section 6.10). Axis labels: Slower to Faster; Instructions per clock (IPC = 1/CPI); 1 to Several; Use latency in instructions.]
