Date post: | 21-Dec-2015 |
Category: |
Documents |
View: | 216 times |
Download: | 0 times |
1CSE 45432 SUNY New Paltz
Chapter Six
Enhancing Performance with Pipelining
PC
Instructionmemory
Instruction
Add
Instruction[20– 16]
MemtoReg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
Writedata
Readdata
Mux
1
ALUcontrol
Shiftleft 2
RegW
rite
MemRead
Control
ALU
Instruction[15– 11]
6
EX
M
WB
M
WB
WBIF/ID
PCSrc
ID/EX
EX/MEM
MEM/WB
Mux
0
1
MemW
rite
AddressData
memory
Address
2CSE 45432 SUNY New Paltz
Sequential Laundry
• Sequential laundry takes 8 hours for 4 loads
• If they learned pipelining, how long would laundry take?
30Task
Order
B
C
D
ATime
3030 3030 30 3030 3030 3030 3030 3030
6 PM 7 8 9 10 11 12 1 2 AM
3CSE 45432 SUNY New Paltz
Pipelined Laundry• Pipelining doesn’t help latency of
single task, it helps throughput of entire workload
• Multiple tasks operating simultaneously using different resources
• Potential speedup = Number pipe stages
• Pipeline rate limited by slowest pipeline stage
• Unbalanced lengths of pipe stages reduces speedup
• Time to “fill” pipeline and time to “drain” it reduces speedup
• Stall for dependencies
6 PM 7 8 9
Time
B
C
D
A
303030 3030 3030Task
Order
4CSE 45432 SUNY New Paltz
Single Stage VS. Pipeline Performance
Ideal speedup is number of stages in the pipeline.
Instructionfetch
R eg ALU Dataaccess
R eg
8 nsInstruction
fetchR eg ALU
Dataaccess
Reg
8 nsIns truction
fe tch
8 ns
Tim e
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch
Reg ALUDa ta
accessReg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetchReg ALU
Dataaccess
Reg
2 nsInstruction
fetchReg ALU
D ataaccess
Reg
2 ns 2 ns 2 ns 2 ns 2 n s
P rogramexecutionorder(in instruc tions)
InstructionInstr.
MemoryRegister
Read ALU Op.Data
MemoryReg. Write Total
R-format 2 1 2 0 1 6 nslw 2 1 2 1 2 8 nssw 2 1 2 2 7 nsbeq 2 1 2 5 ns
5CSE 45432 SUNY New Paltz
Pipelining
• What makes it easy in MIPS
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
• We’ll build a simple pipeline and look at these issues
6CSE 45432 SUNY New Paltz
The Five Stages of Load
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Ifetch Reg/Dec Exec Mem WrLoad
7CSE 45432 SUNY New Paltz
Single Cycle, Multiple Cycle, vs. Pipeline
Clk
Cycle 1
Multiple Cycle Implementation:Multiple Cycle Implementation:
Ifetch Reg Exec Mem Wr
Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Load Ifetch Reg Exec Mem Wr
Ifetch Reg Exec Mem
Load Store
Pipeline Implementation:Pipeline Implementation:
Ifetch Reg Exec Mem WrStore
Clk
Single Cycle Implementation:Single Cycle Implementation:
Load Store Waste
Ifetch
R-type
Ifetch Reg Exec Mem WrR-type
Cycle 1 Cycle 2
8CSE 45432 SUNY New Paltz
Basic Idea
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Instruction
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
ReaddataAddress
Datamemory
1
ALUresult
Mux
ALUZero
IF: Instruction fetch ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access WB: Write back
9CSE 45432 SUNY New Paltz
Pipelined Datapath
• Walk through lw instruction• Walk through sw instruction• The design
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Datamemory
Address
10CSE 45432 SUNY New Paltz
Corrected Datapath
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0
Address
Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
Datamemory
1
ALUresult
Mux
ALUZero
ID/EX
11CSE 45432 SUNY New Paltz
Graphically Representing Pipelines
• Can help with answering questions like:– how many cycles does it take to execute this code?– what is the ALU doing during cycle 4?– use this representation to help understand datapaths
IM Reg DM Reg
IM Reg DM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
lw $10, 20($1)
Programexecutionorder(in instructions)
sub $11, $2, $3
ALU
ALU
12CSE 45432 SUNY New Paltz
Why Pipeline? Because the resources are there!
Instr.
Order
Time (clock cycles)
Inst 0
Inst 1
Inst 2
Inst 4
Inst 3
AL
UIm Reg Dm Reg
AL
UIm Reg Dm Reg
AL
UIm Reg Dm RegA
LUIm Reg Dm Reg
AL
UIm Reg Dm Reg
13CSE 45432 SUNY New Paltz
Pipeline Control
PC
Instructionmemory
Address
Inst
ruct
ion
Instruction[20– 16]
MemtoReg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0Registers
Writeregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1Write
data
Read
data Mux
1
ALUcontrol
RegWrite
MemRead
Instruction[15– 11]
6
IF/ID ID/EX EX/MEM MEM/WB
MemWrite
Address
Datamemory
PCSrc
Zero
AddAdd
result
Shiftleft 2
ALUresult
ALU
Zero
Add
0
1
Mux
0
1
Mux
14CSE 45432 SUNY New Paltz
• Pass control signals along just like the data
Pipeline Control
Execution/Address Calculation stage control lines
Memory access stage control lines
Write-back stage control
lines
InstructionReg Dst
ALU Op1
ALU Op0
ALU Src Branch
Mem Read
Mem Write
Reg write
Mem to Reg
R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X
Control
EX
M
WB
M
WB
WB
IF/ID ID/EX EX/MEM MEM/WB
Instruction
15CSE 45432 SUNY New Paltz
Datapath with Control
PC
Instructionmemory
Inst
ruct
ion
Add
Instruction[20– 16]
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
Writedata
Readdata
Mux
1
ALUcontrol
Shiftleft 2
RegW
rite
MemRead
Control
ALU
Instruction[15– 11]
6
EX
M
WB
M
WB
WBIF/ID
PCSrc
ID/EX
EX/MEM
MEM/WB
Mux
0
1
Mem
Write
AddressData
memory
Address
16CSE 45432 SUNY New Paltz
Designing a Pipelined Processor
• Go back and examine your datapath and control diagram
• associated resources with states
• ensure that flows do not conflict, or figure out how to resolve
• assert control in appropriate stage
17CSE 45432 SUNY New Paltz
Can pipelining get us into trouble?
• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource two different ways at the same time
– data hazards: attempt to use item before it is ready
• instruction depends on result of prior instruction still in the pipeline
– control hazards: attempt to make a decision before condition is evaluated
• branch instructions• Can always resolve hazards by waiting
– pipeline control must detect the hazard
– take action (or delay action) to resolve hazards
18CSE 45432 SUNY New Paltz
• Problem with starting next instruction before first is finished
• dependencies that “go backward in time” are data hazards
Data Hazards
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecutionorder(in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2:
DM Reg
Reg
Reg
Reg
DM
19CSE 45432 SUNY New Paltz
• Have compiler guarantee no hazards
• Where should compiler insert “nop” instructions?
sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)
• Problem:
– It happens too often to rely on compiler
– It really slows us down!
Software Solution
20CSE 45432 SUNY New Paltz
• Use temporary results, don’t wait for them to be written
• Also, write register file during 1st half of clock and read during 2nd half
Data Hazard Solution: Forwarding
what if this $2 was $13?
IM R e g
IM R e g
C C 1 C C 2 C C 3 C C 4 C C 5 C C 6
T im e ( in c lo c k cy c le s )
s ub $ 2 , $ 1 , $ 3
P r o g ra me xe c u tio n o rde r( in in s tru c tio n s )
a n d $ 1 2 , $ 2 , $5
IM R eg D M R e g
IM D M R e g
IM D M R e g
C C 7 C C 8 C C 9
10 1 0 1 0 1 0 1 0 /– 2 0 – 2 0 – 2 0 – 2 0 – 20
o r $ 1 3 , $ 6 , $2
a dd $ 1 4 , $ 2 , $2
s w $ 15 , 1 0 0 ($ 2 )
V a lue o f re g is te r $ 2 :
D M R eg
R eg
R e g
R eg
X X X – 20 X X X X XV a lu e o f E X /M E M :X X X X – 2 0 X X X XV a lu e o f M E M /W B :
D M
21CSE 45432 SUNY New Paltz
• Steer the result from precious instruction to the ALU
• EX hazardif (EX/MEM.RegWrite
and (EX/MEM.RegisterRd = 0)
and (EX /MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd = 0)
and (EX /MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
• MEM hazardif (MEM/WB.RegWrite
and (MEM/WB.RegisterRd = 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd = 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
Hazard Conditions
22CSE 45432 SUNY New Paltz
Forwarding
PCInstructionmemory
Registers
Mux
Mux
Control
ALU
EX
M
WB
M
WB
WB
ID/EX
EX/MEM
MEM/WB
Datamemory
Mux
Forwardingunit
IF/ID
Inst
ruct
ion
Mux
RdEX/MEM.RegisterRd
MEM/WB.RegisterRd
Rt
Rt
Rs
IF/ID.RegisterRd
IF/ID.RegisterRt
IF/ID.RegisterRt
IF/ID.RegisterRs
00 Register file
01 Mem. or earlier ALU
10 Prior ALU
23CSE 45432 SUNY New Paltz
• lw can still cause a hazard:– an instruction tries to read a register following a load instruction that writes to the same
register.
–
• Thus, we need a hazard detection unit to “stall” the load instruction
Can't always forward
Reg
IM
Reg
Reg
IM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
DM Reg
Reg
Reg
DM
24CSE 45432 SUNY New Paltz
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
• Repeat in clock cycle 4 what they did in clock cycle 3
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Reg
IM
Reg
Reg
IM DM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)
IM Reg DM RegIM
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9 CC 10
DM Reg
RegReg
Reg
bubble
25CSE 45432 SUNY New Paltz
Hazard Detection Unit
• Stall by letting an instruction that won’t write anything go forward
• controls writing of the PC and IF/ID plus MUX
PCInstruction
memory
Registers
Mux
Mux
Mux
Control
ALU
EX
M
WB
M
WB
WB
ID/EX
EX/MEM
MEM/WB
Datamemory
Mux
Hazarddetection
unit
Forwardingunit
0
Mux
IF/ID
Inst
ruct
ion
ID/EX.MemReadIF
/ID
Wri
te
PC
Writ
e
ID/EX.RegisterRt
IF/ID.RegisterRd
IF/ID.RegisterRt
IF/ID.RegisterRt
IF/ID.RegisterRs
RtRs
Rd
Rt EX/MEM.RegisterRd
MEM/WB.RegisterRd
26CSE 45432 SUNY New Paltz
• When we decide to branch, other instructions are in the pipeline!
• We are predicting “branch not taken”– need to add hardware for flushing instructions if we are wrong
Branch Hazards
Reg
Reg
CC 1
Time (in clock cycles)
40 beq $1, $3, 7
Programexecutionorder(in instructions)
IM Reg
IM DM
IM DM
IM DM
DM
DM Reg
Reg Reg
Reg
Reg
RegIM
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
Reg
27CSE 45432 SUNY New Paltz
Flushing Instructions
PCInstruction
memory
4
Registers
Mux
Mux
Mux
ALU
EX
M
WB
M
WB
WB
ID/EX
0
EX/MEM
MEM/WB
Datamemory
Mux
Hazarddetection
unit
Forwardingunit
IF.Flush
IF/ID
Signextend
Control
Mux
=
Shiftleft 2
Mux
• Reduce branch delay
28CSE 45432 SUNY New Paltz
Improving Performance
• Superpipelining: ideal maximum speedup is related to number of stages
• Superscalar: start more than one instruction in the same cycle
• Dynamic pipeline scheduling
– Try and avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)
29CSE 45432 SUNY New Paltz
Dynamic Scheduling
• The hardware performs the “scheduling”
– hardware tries to find instructions to execute
– out of order execution is possible
– speculative execution and dynamic branch prediction
• All modern processors are very complicated
– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
– PowerPC and Pentium: branch history table
– Compiler technology important
30CSE 45432 SUNY New Paltz
Dynamic Scheduling in PowerPC 604 and Pentium Pro
• Both In-order Issue, Out-of-order execution, In-order Commit
Pentium Pro central reservation station for any functional units with one bus shared by a branch and an integer unit
31CSE 45432 SUNY New Paltz
Dynamic Scheduling in Pentium Pro
• PPro doesn’t pipeline 80x86 instructions
• PPro decode unit translates the Intel instructions into 72-bit micro-operations ( MIPS)
• Sends micro-operations to reorder buffer & reservation stations
• Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations
• Most instructions translate to 1 to 4 micro-operations
• Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
32CSE 45432 SUNY New Paltz
FYI: MIPS R3000 clocking discipline
• 2-phase non-overlapping clocks
phi1
phi2
phi1 phi1phi2Edge-triggered