A Pipelined Processor
Taken from Digital Design and Computer Architecture by Harris and Harris
Introduction
• Microarchitecture: how to implement an architecture in hardware
• Processor:
  – Datapath: functional blocks
  – Control: control signals
[Figure: levels of abstraction, from lowest to highest, with typical elements: Physics (electrons), Devices (transistors, diodes), Analog Circuits (amplifiers, filters), Digital Circuits (AND gates, NOT gates), Logic (adders, memories), Microarchitecture (datapaths, controllers), Architecture (instructions, registers), Operating Systems (device drivers), Application Software (programs)]
Microarchitecture
• Multiple implementations for a single architecture:
  – Single-cycle: each instruction executes in a single cycle
  – Multicycle: each instruction is broken up into a series of shorter steps
  – Pipelined: each instruction is broken up into a series of steps, and multiple instructions execute at once
  – Microcode
Architectural State
• Determines everything about a processor:
  – PC
  – 32 registers
  – Memory
Instruction Formats
Format   Fields (width in bits)
R-Type   op (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
I-Type   op (6) | rs (5) | rt (5) | imm (16)
J-Type   op (6) | addr (26)

• Example instructions: add, sub, or, and, … (R-type); lw, sw, … and beq, bne, … (I-type)
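The field boundaries above can be illustrated with a small Python sketch. This is only an illustration, not part of the original slides: the function name `decode_fields` is hypothetical, and the sample word 0x02328020 is the standard MIPS encoding of add $s0, $s1, $s2 (register numbers $s0 = 16, $s1 = 17, $s2 = 18).

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into the fields of all three formats.

    Hypothetical helper for illustration: it extracts every field unconditionally;
    which fields are meaningful depends on the instruction format.
    """
    return {
        "op":    (instr >> 26) & 0x3F,   # bits 31:26 (all formats)
        "rs":    (instr >> 21) & 0x1F,   # bits 25:21 (R-type, I-type)
        "rt":    (instr >> 16) & 0x1F,   # bits 20:16 (R-type, I-type)
        "rd":    (instr >> 11) & 0x1F,   # bits 15:11 (R-type)
        "shamt": (instr >> 6) & 0x1F,    # bits 10:6  (R-type)
        "funct": instr & 0x3F,           # bits 5:0   (R-type)
        "imm":   instr & 0xFFFF,         # bits 15:0  (I-type)
        "addr":  instr & 0x3FFFFFF,      # bits 25:0  (J-type)
    }


# add $s0, $s1, $s2  ->  op=0, rs=17 ($s1), rt=18 ($s2), rd=16 ($s0), funct=32
fields = decode_fields(0x02328020)
print(fields)
```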
State Elements: PC, 32 registers, and Memory
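[Figure: state elements. PC register (input PC', output PC, clocked); Instruction Memory (address A, read data RD); Register File (5-bit addresses A1, A2, A3; read data RD1, RD2; write data WD3; write enable WE3; clocked); Data Memory (address A, read data RD, write data WD, write enable WE; clocked). Address and data buses are 32 bits wide.]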
Single-Cycle Datapath: lw fetch
• Executing lw: the instruction specifies an index (base) register rs, a destination register rt, and an immediate offset imm
  – Example: lw $s3, 1($0)   # read memory word 1 into $s3
  – Operation: rt <- DataMemory[rs + imm]
• STEP 1: Fetch instruction
[Figure: the PC drives address input A of the Instruction Memory; read data RD is the 32-bit instruction Instr (I-type: op, rs, rt, imm)]
Single-Cycle Datapath: lw register read
• STEP 2: Read source operands from register file
[Figure: Instr[25:21] (rs) drives Register File address A1; the register value appears on RD1]
Single-Cycle Datapath: lw immediate
• STEP 3: Sign-extend the immediate
[Figure: Instr[15:0] passes through the Sign Extend unit to produce SignImm]
Single-Cycle Datapath: lw address
• STEP 4: Compute the memory address
[Figure: the ALU takes SrcA (RD1) and SrcB (SignImm) with ALUControl[2:0] = 010 and produces ALUResult and Zero; ALUResult drives Data Memory address A]
Single-Cycle Datapath: lw memory read
• STEP 5: Read data from memory and write it back to register file
[Figure: ReadData from the Data Memory drives Register File write data WD3; Instr[20:16] (rt) drives A3; RegWrite = 1 drives WE3; ALUControl[2:0] = 010]
Single-Cycle Datapath: lw PC increment
• STEP 6: Determine the address of the next instruction
[Figure: an adder computes PCPlus4 = PC + 4, which becomes PC'; the rest of the lw datapath is unchanged (RegWrite = 1, ALUControl[2:0] = 010)]
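The six lw steps can be traced in software. The following Python sketch is illustrative only: `execute_lw`, `sign_extend16`, and the dictionary-based memories keyed by address are assumptions made for this example, not part of the slides, and the model ignores timing entirely.

```python
def sign_extend16(imm: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    return imm - 0x10000 if imm & 0x8000 else imm


def execute_lw(pc, instr_mem, regs, data_mem):
    """Walk one lw instruction through the six datapath steps; returns the next PC."""
    instr = instr_mem[pc]                          # STEP 1: fetch instruction
    rs = (instr >> 21) & 0x1F                      # STEP 2: Instr[25:21] -> A1
    src_a = regs[rs]                               #         RD1 -> SrcA
    sign_imm = sign_extend16(instr & 0xFFFF)       # STEP 3: sign-extend Instr[15:0]
    alu_result = (src_a + sign_imm) & 0xFFFFFFFF   # STEP 4: ALU adds SrcA + SignImm
    read_data = data_mem[alu_result]               # STEP 5: read memory ...
    rt = (instr >> 16) & 0x1F                      #         Instr[20:16] -> A3
    regs[rt] = read_data                           #         ... and write back (RegWrite = 1)
    return (pc + 4) & 0xFFFFFFFF                   # STEP 6: PC' = PC + 4


# lw $s3, 1($0) encodes as 0x8C130001 (op = 100011, rs = 0, rt = 19, imm = 1)
instr_mem = {0: 0x8C130001}
data_mem = {1: 0xDEADBEEF}     # hypothetical memory contents
regs = [0] * 32
next_pc = execute_lw(0, instr_mem, regs, data_mem)
print(hex(regs[19]), next_pc)
```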
Single-Cycle Datapath: sw
• sw: the instruction specifies an index (base) register rs, a source register rt, and an immediate offset imm
• Write data in rt to memory: DataMemory[rs + imm] <- rt
[Figure: Instr[20:16] (rt) also drives Register File address A2; RD2 becomes WriteData on Data Memory port WD. For sw: MemWrite = 1, ALUControl[2:0] = 010, RegWrite = 0. Instruction format: I-type (op, rs, rt, imm)]
Single-Cycle Datapath: R-type instructions
• Example: add $s0, $s1, $s2   # rd <- rs + rt
• Read from rs and rt
• Write ALUResult to register file
• Write to rd (instead of rt)
[Figure: datapath extended with three multiplexers: ALUSrc selects SrcB (RD2 or SignImm), RegDst selects WriteReg[4:0] (Instr[20:16] or Instr[15:11]), MemtoReg selects Result (ALUResult or ReadData). For R-type: RegWrite = 1, RegDst = 1, ALUSrc = 0, ALUControl[2:0] varies, MemWrite = 0, MemtoReg = 0. Instruction format: R-type (op, rs, rt, rd, shamt, funct)]
Single-Cycle Datapath: beq
• Determine whether values in rs and rt are equal
      beq $s0, $s1, target
      …
  target:                    # label
• Calculate branch target address: BTA = (sign-extended immediate << 2) + (PC+4)
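The branch target formula can be checked numerically. This is a minimal Python sketch under stated assumptions: the function names `branch_target` and `next_pc_beq` are hypothetical, and the PC is modeled as a 32-bit unsigned integer.

```python
def branch_target(pc: int, imm16: int) -> int:
    """BTA = (sign-extended immediate << 2) + (PC + 4), wrapped to 32 bits."""
    sign_imm = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return ((sign_imm << 2) + (pc + 4)) & 0xFFFFFFFF


def next_pc_beq(pc: int, imm16: int, rs_val: int, rt_val: int) -> int:
    """Select PC' for beq: the branch target if rs == rt, otherwise PC + 4."""
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0   # ALU subtracts and sets Zero
    return branch_target(pc, imm16) if zero else (pc + 4) & 0xFFFFFFFF


print(hex(branch_target(0x100, 3)))        # forward branch, 3 instructions past PC + 4
print(hex(branch_target(0x100, 0xFFFF)))   # immediate of -1: back to the branch itself
```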
[Figure: datapath extended for beq: SignImm is shifted left by 2 and added to PCPlus4 to form PCBranch; PCSrc, derived from Branch and Zero, selects PC' (PCPlus4 or PCBranch). For beq: RegWrite = 0, RegDst = x, ALUSrc = 0, ALUControl[2:0] = 110, Branch = 1, MemWrite = 0, MemtoReg = x. Instruction format: I-type (op, rs, rt, imm)]
Complete Single-Cycle Processor
[Figure: complete single-cycle processor. The Control Unit takes Op (Instr[31:26]) and Funct (Instr[5:0]) and produces MemtoReg, MemWrite, Branch, ALUControl[2:0], ALUSrc, RegDst, and RegWrite; PCSrc is derived from Branch and Zero]
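The control values shown on the preceding datapath slides can be collected into a lookup table. The sketch below is an illustration rather than the control unit's actual implementation: the opcodes are the standard MIPS encodings, `None` stands for a don't-care (x), ALUSrc for sw and PCSrc = Branch AND Zero follow the usual MIPS single-cycle design and are assumptions here, and ALUControl is omitted because for R-type it depends on Funct.

```python
# Main control signals per opcode; None marks a don't-care value.
MAIN_DECODER = {
    0b000000: dict(RegWrite=1, RegDst=1, ALUSrc=0, Branch=0, MemWrite=0, MemtoReg=0),        # R-type
    0b100011: dict(RegWrite=1, RegDst=0, ALUSrc=1, Branch=0, MemWrite=0, MemtoReg=1),        # lw
    0b101011: dict(RegWrite=0, RegDst=None, ALUSrc=1, Branch=0, MemWrite=1, MemtoReg=None),  # sw
    0b000100: dict(RegWrite=0, RegDst=None, ALUSrc=0, Branch=1, MemWrite=0, MemtoReg=None),  # beq
}


def control(op: int, zero: bool) -> dict:
    """Return the control signals for an opcode, plus PCSrc = Branch AND Zero."""
    signals = dict(MAIN_DECODER[op])            # copy so the table is not mutated
    signals["PCSrc"] = signals["Branch"] & int(zero)
    return signals


print(control(0b000100, zero=True))    # beq with equal operands: PCSrc = 1
print(control(0b100011, zero=True))    # lw: Zero is ignored because Branch = 0
```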
Review: Processor Performance
Program Execution Time
= (# instructions)(cycles/instruction)(seconds/cycle)
= # instructions × CPI × Tc
Single-Cycle Performance
• Tc is limited by the critical path (lw)
[Figure: complete single-cycle processor with the control values for lw: RegWrite = 1, RegDst = 0, ALUSrc = 1, ALUControl[2:0] = 010, Branch = 0, MemWrite = 0, MemtoReg = 1]
• Single-cycle critical path:
  Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup
• In most implementations, the limiting paths are memory, ALU, and register file:
  Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
Single-Cycle Performance Example
Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     30
Register setup        tsetup      20
Multiplexer           tmux        25
ALU                   tALU        200
Memory read           tmem        250
Register file read    tRFread     150
Register file setup   tRFsetup    20

Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
   = [30 + 2(250) + 150 + 25 + 200 + 20] ps
   = 925 ps
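The cycle-time arithmetic can be reproduced with a few lines of Python. This is a sketch using the delays from the table; the dictionary and function names are chosen for this example only.

```python
# Element delays in picoseconds, taken from the table above.
delays_ps = {
    "tpcq_PC": 30,
    "tsetup": 20,
    "tmux": 25,
    "tALU": 200,
    "tmem": 250,
    "tRFread": 150,
    "tRFsetup": 20,
}


def single_cycle_tc(d: dict) -> int:
    """Tc = tpcq_PC + 2*tmem + tRFread + tmux + tALU + tRFsetup (simplified lw path)."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tmux"] + d["tALU"] + d["tRFsetup"])


tc_ps = single_cycle_tc(delays_ps)
print(tc_ps, "ps")
```

Changing one entry in `delays_ps` shows how strongly each element contributes; the two memory accesses account for more than half of the cycle.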
Single-Cycle Performance Example
• For a program with 100 billion instructions executing on a single-cycle MIPS processor,
Execution Time = # instructions × CPI × Tc
= (100 × 10^9)(1)(925 × 10^-12 s) = 92.5 seconds
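The same calculation as a reusable function, as a small sketch: the name `execution_time_s` is hypothetical, and the cycle time is passed in picoseconds so that the intermediate product stays an exact integer.

```python
def execution_time_s(instructions: int, cpi: int, tc_ps: int) -> float:
    """Execution time = # instructions x CPI x Tc, with Tc given in picoseconds."""
    total_ps = instructions * cpi * tc_ps   # exact integer arithmetic
    return total_ps / 10**12                # convert picoseconds to seconds


# 100 billion instructions, CPI = 1, Tc = 925 ps
time_s = execution_time_s(100 * 10**9, 1, 925)
print(time_s, "seconds")
```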
Pipelined Processor
• Temporal parallelism
• Divide single-cycle processor into 5 ROUGHLY EQUIVALENT stages:
  – Fetch
  – Decode
  – Execute
  – Memory
  – Writeback
• Each stage includes one “slow step”
• Add pipeline registers between stages
• 5 stages => ideally ~5 times the throughput
• All modern high-performance processors are pipelined.
Single-Cycle vs. Pipelined Performance
[Figure: timing diagram, time axis 0 to 1900 ps. Single-Cycle: instructions 1 and 2 execute back to back, each going through Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read / Write, and Write Reg. Pipelined: instructions 1, 2, and 3 go through the same five stages, overlapped in time]
• The length of every pipeline stage is set by the slowest stage
• The instruction latency is 5 × 250 ps = 1250 ps
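The difference between latency and throughput can be made concrete with a short Python sketch. It assumes an ideal pipeline with no stalls or flushes and the 250 ps stage time from the slide; the function names are hypothetical.

```python
def single_cycle_time_ps(n_instructions: int, tc_ps: int) -> int:
    """Total time when each instruction takes one full cycle of length tc_ps."""
    return n_instructions * tc_ps


def pipelined_time_ps(n_instructions: int, stages: int, stage_ps: int) -> int:
    """Total time for an ideal pipeline: the first instruction needs `stages`
    cycles, and each later instruction completes one cycle after the previous one."""
    return (n_instructions + stages - 1) * stage_ps


print(pipelined_time_ps(1, 5, 250))   # latency of one instruction
print(pipelined_time_ps(3, 5, 250))   # three overlapped instructions
print(single_cycle_time_ps(3, 925))   # same three instructions, single-cycle
```

For long instruction streams the pipelined time approaches one stage time per instruction, which is why throughput improves even though the latency of a single instruction does not.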
Pipelining Abstraction
    lw  $s2, 40($0)
    add $s3, $t1, $t2
    sub $s4, $s1, $s5
    and $s5, $t5, $t6
    sw  $s6, 20($s1)
    or  $s7, $t3, $t4

[Figure: pipeline diagram over cycles 1 to 10. Each instruction passes through IM (instruction memory), RF (register file read), the ALU, DM (data memory), and RF (register file write), starting one cycle after the previous instruction]
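The staircase pattern of the pipelining abstraction can be generated programmatically. The following Python sketch assumes an ideal pipeline with no stalls; the single-letter stage labels and the function names are choices made for this example.

```python
STAGES = ["F", "D", "E", "M", "W"]   # Fetch, Decode, Execute, Memory, Writeback


def pipeline_schedule(instructions, stages=STAGES):
    """For each instruction, map cycle number (1-based) to the stage it occupies."""
    return [{i + s + 1: stage for s, stage in enumerate(stages)}
            for i, _ in enumerate(instructions)]


def render(instructions, stages=STAGES):
    """Draw the schedule as text, one row per instruction, one column per cycle."""
    schedule = pipeline_schedule(instructions, stages)
    total_cycles = len(instructions) + len(stages) - 1
    width = max(len(text) for text in instructions)
    lines = []
    for text, row in zip(instructions, schedule):
        cells = " ".join(row.get(c, ".") for c in range(1, total_cycles + 1))
        lines.append(f"{text:<{width}}  {cells}")
    return "\n".join(lines)


program = [
    "lw $s2, 40($0)",
    "add $s3, $t1, $t2",
    "sub $s4, $s1, $s5",
    "and $s5, $t5, $t6",
    "sw $s6, 20($s1)",
    "or $s7, $t3, $t4",
]
print(render(program))
```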
Single-Cycle and Pipelined Datapath
[Figure: single-cycle datapath and pipelined datapath. The pipelined datapath is divided into Fetch, Decode, Execute, Memory, and Writeback stages by clocked pipeline registers; signal names carry a stage suffix (F, D, E, M, W), e.g. PCF, InstrD, SrcAE, SrcBE, WriteRegE, ALUOutM, ReadDataW, ResultW]
Corrected Pipelined Datapath
[Figure: corrected pipelined datapath. WriteReg is passed through the pipeline registers as WriteRegE, WriteRegM, and WriteRegW before it reaches Register File address A3]
• WriteReg address must arrive at the same time as Result
Pipelined Control
[Figure: pipelined processor with control. The Control Unit decodes Op (InstrD[31:26]) and Funct (InstrD[5:0]) in the Decode stage and produces RegWriteD, MemtoRegD, MemWriteD, BranchD, ALUControlD, ALUSrcD, and RegDstD. The signals travel through the pipeline registers with the instruction: RegWrite and MemtoReg to stages E, M, W; MemWrite and Branch to stages E, M; ALUControl, ALUSrc, and RegDst to stage E. PCSrcM is derived from BranchM and ZeroM]
• Same control unit as single-cycle processor
• Control signals are delayed to the proper pipeline stage
Pipeline Hazard
• Occurs when an instruction depends on results from a previous instruction that hasn’t completed.
• Types of hazards:
  – Data hazard: register value not yet written back to the register file
  – Control hazard: next instruction not decided yet (caused by branches)
Data Hazard
    add $s0, $s2, $s3
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

[Figure: pipeline diagram over cycles 1 to 8. add produces $s0; the and, or, and sub instructions that follow each read $s0 as a source operand]
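Dependencies like the one on $s0 can be found mechanically. The Python sketch below is illustrative only: the tuple representation of instructions and the `window` parameter (how many following instructions to inspect) are assumptions for this example. It reports read-after-write dependencies; whether each one is an actual hazard depends on the pipeline's register-file timing.

```python
# Each instruction: (mnemonic, destination register, source registers)
program = [
    ("add", "$s0", ("$s2", "$s3")),
    ("and", "$t0", ("$s0", "$s1")),
    ("or",  "$t1", ("$s4", "$s0")),
    ("sub", "$t2", ("$s0", "$s5")),
]


def raw_dependencies(program, window=3):
    """List (producer index, consumer index, register) for every read of a
    register written by one of the `window` preceding instructions."""
    deps = []
    for i, (_, dest, _) in enumerate(program):
        for j in range(i + 1, min(i + 1 + window, len(program))):
            _, later_dest, later_srcs = program[j]
            if dest in later_srcs:
                deps.append((i, j, dest))
            if later_dest == dest:
                break   # register is overwritten; later reads depend on the new writer
    return deps


for producer, consumer, reg in raw_dependencies(program):
    print(f"{program[consumer][0]} reads {reg} written by {program[producer][0]}")
```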
Handling Data Hazards
• Insert nops in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time
Control Hazards
• beq:
  – Branch is not determined until the fourth stage of the pipeline
  – Instructions after the branch are fetched before the branch occurs
  – These instructions must be flushed if the branch is taken
• Branch misprediction penalty:
  – Number of instructions flushed when the branch is taken
  – May be reduced by determining the branch earlier
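The cost of flushing can be estimated with a simple model. This Python sketch rests on assumptions that are not in the slides: one instruction is fetched per cycle, the ideal CPI is 1, there are no other stalls, and the workload fractions in the example are hypothetical numbers chosen only for illustration.

```python
def flushed_instructions(resolve_stage: int) -> int:
    """Instructions fetched behind a branch before it is resolved.

    With one fetch per cycle, a branch resolved in stage k (1 = Fetch)
    has k - 1 younger instructions already in the pipeline.
    """
    return resolve_stage - 1


def average_cpi(branch_fraction: float, flush_fraction: float, resolve_stage: int) -> float:
    """Ideal CPI of 1 plus the average flush penalty per instruction."""
    penalty = flushed_instructions(resolve_stage)
    return 1 + branch_fraction * flush_fraction * penalty


# Hypothetical workload: 20% branches, half of them requiring a flush.
print(flushed_instructions(4))          # branch resolved in the fourth stage
print(average_cpi(0.20, 0.50, 4))       # branch resolved in the fourth stage
print(average_cpi(0.20, 0.50, 2))       # branch resolved earlier, in the second stage
```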
Branch Prediction
• Guess whether branch will be taken
  – Backward branches are usually taken (loops)
  – Perhaps consider history of whether branch was previously taken to improve the guess
• Good prediction reduces the fraction of branches requiring a flush
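Both guessing strategies can be compared on a toy example. The Python sketch below is a simplified illustration, not a description of any particular processor: the class names, the one-bit history scheme, and the loop pattern (a backward branch taken 9 times, then not taken, executed twice) are all assumptions for this example.

```python
class StaticPredictor:
    """Predict taken for backward branches (target at or before the branch)."""

    def predict(self, pc: int, target: int) -> bool:
        return target <= pc

    def update(self, pc: int, taken: bool) -> None:
        pass   # static prediction keeps no history


class OneBitPredictor:
    """Predict that a branch repeats its last outcome; fall back to the
    static rule the first time a branch is seen."""

    def __init__(self):
        self.last_outcome = {}

    def predict(self, pc: int, target: int) -> bool:
        if pc in self.last_outcome:
            return self.last_outcome[pc]
        return target <= pc

    def update(self, pc: int, taken: bool) -> None:
        self.last_outcome[pc] = taken


def count_mispredictions(predictor, pc, target, outcomes) -> int:
    """Count wrong guesses for one branch over a sequence of actual outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc, target) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# Loop-closing branch at 0x40 jumping back to 0x20; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
static_misses = count_mispredictions(StaticPredictor(), 0x40, 0x20, outcomes)
one_bit_misses = count_mispredictions(OneBitPredictor(), 0x40, 0x20, outcomes)
print(static_misses, one_bit_misses)
```

In this particular pattern the one-bit history scheme mispredicts once more than the static rule, because it also guesses wrong when the loop is re-entered; history helps most for branches whose direction the static rule gets wrong.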