The University of Texas at Dallas Erik Jonsson School of Engineering & Computer Science
© C. D. Cantrell (12/1999)
PIPELINING: A CONTINUATION OF PROCESSOR DESIGN
• We found that the single-cycle implementation wastes time
  - All instructions take as long as the instruction with the longest delay (lw)
• In the multicycle implementation:
  - The clock period is much shorter than in the single-cycle implementation
  - Instructions take only as many clock periods as they need
  - BUT: Each functional unit is used only once or twice in executing an instruction
    ◦ We need an implementation in which each functional unit is busy in every clock period
    ◦ This is possible if we cut the execution of an instruction into stages, and then overlap the execution of different instructions
© C. D. Cantrell (09/2011)
STEPS IN EXECUTING AN INSTRUCTION
Instruction fetch (all instruction classes):
  IR = M[PC]
  PC = PC + 4
Instruction decode, register fetch (all instruction classes):
  A = Reg[IR[25–21]]
  B = Reg[IR[20–16]]
  ALUOut = PC + (sign-extend(IR[15–0]) << 2)
Execution, address computation, branch/jump completion:
  R-type: ALUOut = A op B
  Memory reference: ALUOut = A + sign-extend(IR[15–0])
  Branches: If A == B then PC = ALUOut
  Jumps: PC = PC[31–28] concatenated with (IR[25–0] << 2)
Memory access or R-type completion:
  R-type: Reg[IR[15–11]] = ALUOut
  Load: MDR = M[ALUOut]
  Store: M[ALUOut] = B
Memory read completion:
  Load: Reg[IR[20–16]] = MDR
© C. D. Cantrell (10/2011)
PIPELINING
• In a pipelined computer architecture, a single processor can execute several instructions concurrently, reducing the CPI
  - Execution of one instruction uses several hardware functional units (instruction memory, register file, ALU, data memory, etc.)
  - The functional units are organized into stages
    ◦ Execution at each stage takes 1 clock period
    ◦ Stages are separated by clock-controlled pipeline registers that preserve the state of execution for the duration of a clock period
  - The pipeline is subject to hazards
    ◦ Data hazards: Write/read conflicts or timing problems
    ◦ Control hazards: Exceptions and branches
• The MIPS R2000 pipeline design strongly influenced the design of all subsequent processors
© C. D. Cantrell (10/2011)
PIPELINING: PLUSES AND MINUSES
• What makes pipelining easy in the MIPS ISA:
  - All instructions are the same length
  - There are only a few instruction formats
  - Memory operands occur only in loads and stores
• What makes pipelining hard in any ISA:
  - Structural hazards (e.g., contention for the same functional unit)
  - Control (branch & exception) hazards
  - Data hazards (e.g., trying to read a register before it's written)
We will build a simple pipeline to illustrate these issues
• Pipelining is even more difficult in modern general-purpose microprocessors
  - Exception handling is a challenge
  - Performance improvements such as simultaneous instruction issue, out-of-order execution, etc., create lots of complications
© C. D. Cantrell (09/2010)
PIPELINE DESIGN APPROACH
• Begin with the multicycle implementation
  - Different functional units are all executing the same instruction, although the units are active in different clock periods
  - Control information does not need to be stored in the temporary registers
• Identify the changes that need to be made in the pipelined design
  - Different functional units are executing different instructions
  - All information needed for execution of a given instruction must propagate through the pipeline with the instruction
  - Control information must be stored between stages, because the control signals are different for different instructions
    ◦ Control design is a source of complexity in pipeline design (think about what happens when a branch is taken)
  - Results of execution may differ from the multicycle implementation
    ◦ Data hazards are another source of complexity
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
MULTICYCLE DATAPATH AND CONTROL
[Figure: multicycle datapath (PC, memory, instruction register, memory data register, register file, A and B registers, ALU, ALUOut) with the control unit and its outputs: IorD, MemRead, MemWrite, MemtoReg, PCWriteCond, PCWrite, IRWrite, ALUOp, ALUSrcA, ALUSrcB, RegDst, PCSource, RegWrite]
SINGLE-CYCLE DATAPATH
[Figure: single-cycle datapath divided into five stages. IF: Instruction fetch; ID: Instruction decode/register file read; EX: Execute/address calculation; MEM: Memory access; WB: Write back. After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition.]
SEQUENTIAL vs. PIPELINED EXECUTION
[Figure: execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) plotted against time in ps. Top panel (sequential): each instruction passes through instruction fetch, Reg, ALU, data access, Reg, and a new instruction starts every 800 ps. Bottom panel (pipelined): each stage takes 200 ps and a new instruction starts every 200 ps. After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition.]
© C. D. Cantrell (11/1999)
PIPELINING (2)
• Pipeline speedup:
  - A pipelined processor with s stages can execute n instructions in
    \[ ET_P = s + (n - 1) \ \text{clock periods} \]
    (assuming no hazards)
  - A serial processor executes the same n instructions in
    \[ ET_S = ns \ \text{clock periods} \]
  - The ideal pipeline speedup equals the number of stages:
    \[ S_P = \frac{ET_S}{ET_P} = \frac{ns}{s + (n - 1)} \xrightarrow{\; n \gg s \;} s \]
• Amdahl's law applies to pipelining
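The speedup formula can be checked numerically. The following is a minimal Python sketch, assuming a hazard-free pipeline as on the slide; the function names are illustrative and not part of the original notes.

```python
def pipelined_clocks(s: int, n: int) -> int:
    """Clock periods for n instructions on an s-stage pipeline (no hazards)."""
    return s + (n - 1)


def serial_clocks(s: int, n: int) -> int:
    """Clock periods for n instructions when each one takes all s stages in turn."""
    return n * s


def speedup(s: int, n: int) -> float:
    """Ratio ET_S / ET_P from the slide."""
    return serial_clocks(s, n) / pipelined_clocks(s, n)


if __name__ == "__main__":
    # The speedup approaches s = 5 only when n is much larger than s.
    for n in (1, 3, 10, 1000, 1_000_000):
        print(f"n = {n:>9}: speedup = {speedup(5, n):.4f}")
```

For short instruction sequences the fill time of the pipeline dominates, which is why the ideal speedup of s is only a limit.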
© C. D. Cantrell (05/1999)
PROGRAMMING IMPLICATIONS OF PIPELINING
• Avoid function or subprogram calls in an inner loop
  - Jumps force the pipeline to be flushed
• Avoid recursion in an inner loop
  - Recursion on the elements of an array generally causes data hazards because the value of v[n] has not been written before it is needed for the computation of v[n+1]
• Avoid scalar temporary variables in an inner loop
  - Reading a memory-resident scalar variable may cause a data hazard
• Avoid case and switch statements in an inner loop
  - Conditional branches cause control hazards, and the use of a jump table may cause data hazards
© C. D. Cantrell (05/1999)
MIPS PIPELINES (1)
• MIPS R2000 integer unit pipeline stages (Patterson & Hennessy, Chapter 6)
  1. Instruction Fetch (IF)
  2. Instruction Decode (ID) and Register Fetch
  3. Execute (EX or ALU)
     - ALU operations, condition evaluation, address computation
  4. Memory access (MEM)
  5. Write back (WB) to register file
[Figure: clock periods 1 to 5 occupied by stages IF, ID, EX, MEM, WB]
© C. D. Cantrell (05/1999)
R4000 PIPELINE
[Figure: the eight-stage pipeline of the R4000, with stages IF, IS, RF, EX, DF, DS, TC, WB drawn over instruction memory, Reg, ALU, data memory, Reg]
© C. D. Cantrell (05/1999)
MIPS PIPELINES (2)
• MIPS R2000 floating-point unit pipeline stages
  1. Instruction Fetch (IF)
  2. Register Fetch and Instruction Decode (RD)
     - FPU decodes instruction on bus to see if it's floating-point
     - FPU reads data from its registers
  3. Execute (EX or ALU)
  4. Memory access (MEM)
  5. Exception processing (stage called WB for correspondence with integer pipeline)
  6. Write back (FWB)
[Figure: clock periods 1 to 6 occupied by stages IF, ID, EX, MEM, WB, FWB]
PIPELINED EXECUTION IN SINGLE-CYCLE DATAPATH
[Figure: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) in program execution order, each passing through IM, Reg, ALU, DM, Reg, offset by one clock cycle over CC 1 to CC 7. After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition.]
SINGLE-CYCLE DATAPATH WITH PIPELINE REGISTERS
[Figure: single-cycle datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between stages; each register has an input side and an output side]
Because the state of a D flip-flop changes only on clock edges, new data can be asserted on the inputs of the pipeline registers while the data written in the previous clock period is still valid on the outputs
© C. D. Cantrell (09/2011)
MASTER-SLAVE D FLIP-FLOP
• The master latch (on the left) receives the D and clock (C) inputs
  - When the clock is asserted, the Q output of the master latch follows the data (D)
  - When the clock is deasserted, the master latch is closed, but the second (slave) latch is open
    ◦ The output of the slave latch follows its input, which is the output of the master latch
[Figure: two D latches in series; the master latch takes D and C, and its Q output drives the D input of the slave latch, which produces Q and its complement]
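The behavior described above can be imitated with a small behavioral model. This is only a sketch in Python, assuming level-sensitive latches and ignoring gate delays; the class and method names are hypothetical.

```python
class DLatch:
    """Level-sensitive D latch: Q follows D while the enable input is asserted."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, enable: bool) -> int:
        if enable:
            self.q = d
        return self.q


class MasterSlaveDFlipFlop:
    """Master latch open while clock = 1, slave latch open while clock = 0."""

    def __init__(self) -> None:
        self.master = DLatch()
        self.slave = DLatch()

    def step(self, d: int, clock: int) -> int:
        # The master follows D only while the clock is asserted.
        m = self.master.update(d, enable=(clock == 1))
        # The slave follows the master only while the clock is deasserted.
        return self.slave.update(m, enable=(clock == 0))


if __name__ == "__main__":
    ff = MasterSlaveDFlipFlop()
    for d, c in [(1, 1), (1, 0), (0, 0), (0, 1), (0, 0)]:
        print(f"D={d} C={c} -> Q={ff.step(d, c)}")
```

In this model the visible output Q changes only when the clock goes from asserted to deasserted, which is the property that lets a pipeline register accept new inputs while its old outputs remain valid.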
COMBINATIONAL LOGIC AND STATE ELEMENTS
[Figure: within one clock cycle, state element 1 feeds combinational logic, whose output goes to state element 2. After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition.]
• Every state element has 2 control inputs: clock signal and write enable
STAGE 1 OF A LOAD INSTRUCTION
[Figure: pipelined datapath with the instruction fetch stage of lw highlighted. After Patterson and Hennessy, Computer Organization and Design, 2nd Edition.]
STAGE 2 OF A LOAD INSTRUCTION
[Figure: pipelined datapath with the instruction decode stage of lw highlighted. After Patterson and Hennessy, 2nd Edition.]
STAGE 3 OF A LOAD INSTRUCTION
[Figure: pipelined datapath with the execution stage of lw highlighted. After Patterson and Hennessy, 2nd Edition.]
STAGE 4 OF A LOAD INSTRUCTION
[Figure: pipelined datapath with the memory stage of lw highlighted. After Patterson and Hennessy, 2nd Edition.]
STAGE 5 OF A LOAD INSTRUCTION
[Figure: pipelined datapath with the write-back stage of lw highlighted. After Patterson and Hennessy, 2nd Edition.]
STAGE 3 OF A STORE INSTRUCTION
[Figure: pipelined datapath with the execution stage of sw highlighted. After Patterson and Hennessy, Computer Organization and Design, 2nd Edition.]
STAGE 4 OF A STORE INSTRUCTION
[Figure: pipelined datapath with the memory stage of sw highlighted. After Patterson and Hennessy, 2nd Edition.]
STAGE 5 OF A STORE INSTRUCTION
[Figure: pipelined datapath with the write-back stage of sw highlighted. After Patterson and Hennessy, 2nd Edition.]
© C. D. Cantrell (10/2011)
DATAPATH MODIFICATIONS FOR PIPELINING
• The number of the register that an instruction must write to is read in the ID stage
  - The name of the signal is WriteReg
• Consider two instructions:
    lw  $10, 20($1)
    sub $11, $2, $3
  - WriteReg signal values are 10 (for lw) and 11 (for sub)
  - The WriteReg signal is read only in the WB stage
  - The sub's ID stage modifies WriteReg before lw can read it
  - Therefore the value of the WriteReg signal is part of the instruction's state, and must be passed along in pipeline registers as the instruction executes
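The problem with a single, shared WriteReg signal can be reproduced in a toy model. The sketch below is written in Python and assumes the five-stage timing used in these notes (instruction i is in ID in clock i + 2 and in WB in clock i + 5, counting i from 0); the function name and data layout are illustrative only.

```python
def simulate(program, carry_write_reg: bool):
    """Return (name, register written in WB) for each instruction.

    program is a list of (name, destination register number).
    With carry_write_reg=False, WB uses whatever register number the
    most recent ID stage decoded (one shared, un-pipelined signal).
    With carry_write_reg=True, each instruction keeps its own WriteReg.
    """
    n = len(program)
    shared_write_reg = None
    writes = []
    for clock in range(1, n + 5):
        id_index = clock - 2   # instruction currently in ID
        wb_index = clock - 5   # instruction currently in WB
        if 0 <= id_index < n:
            shared_write_reg = program[id_index][1]
        if 0 <= wb_index < n:
            name, own_reg = program[wb_index]
            writes.append((name, own_reg if carry_write_reg else shared_write_reg))
    return writes


if __name__ == "__main__":
    prog = [("lw", 10), ("sub", 11)]
    print("shared WriteReg: ", simulate(prog, carry_write_reg=False))
    print("carried WriteReg:", simulate(prog, carry_write_reg=True))
```

With the shared signal, lw writes its loaded value into register 11 instead of register 10; carrying WriteReg along with the instruction removes the error.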
DATAPATH MODIFIED TO HANDLE A LOAD
[Figure: pipelined datapath in which WriteReg is carried through the ID/EX, EX/MEM, and MEM/WB pipeline registers back to the write register input of the register file]
PIPELINE STAGES USED BY A LOAD INSTRUCTION
[Figure: pipelined datapath with the portions used by a load instruction highlighted in all five stages]
TWO REPRESENTATIONS OF PIPELINED EXECUTION
[Figure: lw $10, 20($1) followed by sub $11, $2, $3 over clock cycles CC 1 to CC 6. Top panel: graphical representation using the units IM, Reg, ALU, DM, Reg. Bottom panel: the same schedule with stage names (instruction fetch, instruction decode, execution, data access, write back); lw occupies CC 1 to CC 5 and sub occupies CC 2 to CC 6. After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition.]
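The stage-name representation can be generated mechanically when there are no hazards. A minimal Python sketch follows, assuming one instruction enters the pipeline per clock cycle; the helper names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr_index: int, clock: int):
    """Stage occupied by instruction instr_index (0-based) in clock cycle `clock` (1-based).

    Returns None when the instruction is not in the pipeline in that cycle.
    """
    k = clock - 1 - instr_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


def diagram(program):
    """One text row per instruction, one column per clock cycle."""
    total_cycles = len(STAGES) + len(program) - 1
    rows = []
    for i, text in enumerate(program):
        cells = [(stage_at(i, c) or "").ljust(4) for c in range(1, total_cycles + 1)]
        rows.append(f"{text:<18}" + " ".join(cells))
    return rows


if __name__ == "__main__":
    for row in diagram(["lw  $10, 20($1)", "sub $11, $2, $3"]):
        print(row)
```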
© C. D. Cantrell (10/2011)
PIPELINED EXECUTION OF TWO INSTRUCTIONS
• Exercise: Show the signal values in the datapath in the pipelined execution of the instructions
    lw  $10, 20($1)
    sub $11, $2, $3
  in each of the following six slides
  - Assume the following register and memory contents:
      ($1) = 0x1000 0000
      (M[0x1000 0014]) = 0x7fff fffc
      ($2) = 0x0000 000e
      ($3) = 0x0000 0008
  - Also show the values of the WriteReg signal in each stage
CLOCK PERIOD 1
[Figure: pipelined datapath during clock 1. Instruction fetch: lw $10, 20($1)]
CLOCK PERIOD 2
[Figure: pipelined datapath during clock 2. Instruction decode: lw $10, 20($1); instruction fetch: sub $11, $2, $3]
CLOCK PERIOD 3
[Figure: pipelined datapath during clock 3. Execution: lw $10, 20($1); instruction decode: sub $11, $2, $3]
CLOCK PERIOD 4
[Figure: pipelined datapath during clock 4. Memory: lw $10, 20($1); execution: sub $11, $2, $3]
CLOCK PERIOD 5
[Figure: pipelined datapath during clock 5. Write back: lw $10, 20($1); memory: sub $11, $2, $3]
CLOCK PERIOD 6
[Figure: pipelined datapath during clock 6. Write back: sub $11, $2, $3]
© C. D. Cantrell (10/2011)
PIPELINE STAGES: DATAPATH
IF (all instruction classes):
  IF/ID Instruction = IM[PC]
  IF/ID PC = PC + 4
ID (all instruction classes):
  ID/EX PC = IF/ID PC
  ID/EX A = Reg[IF/ID Instruction[25–21]]
  ID/EX B = Reg[IF/ID Instruction[20–16]]
  ID/EX Immediate = sign-extend(IF/ID Instruction[15–0])
EX:
  R-type: EX/MEM ALUOut = A op B
          WriteReg = ID/EX Inst[15–11]
  Memory reference: EX/MEM ALUOut = A + ID/EX Immediate
          WriteReg = ID/EX Inst[20–16]
          EX/MEM B = ID/EX B
  Branches: EX/MEM PC = ID/EX PC + ((ID/EX Imm) << 2)
MEM:
  R-type: MEM/WB ALUOut = EX/MEM ALUOut
  Memory reference: Address = EX/MEM ALUOut
          Load: MEM/WB ReadData = DM[Address]
          Store: DM[Address] = EX/MEM B
  Branches: PC = EX/MEM PC
  MEM/WB WriteReg = EX/MEM WriteReg
WB:
  R-type: Reg[MEM/WB WriteReg] = MEM/WB ALUOut
  Load: Reg[MEM/WB WriteReg] = MEM/WB ReadData
PIPELINED DATAPATH WITH CONTROL SIGNALS
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and the control signals RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, MemtoReg, RegWrite, and PCSrc. The pipeline registers carry PC, A, B, Imm, ALUOut, and WriteReg.]
© C. D. Cantrell (05/1999)
SETTINGS OF CONTROL LINES
              Ex/Address Calc.                  Mem. Access                 Write Back
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-type        1       1       0       0         0       0        0          1         0
lw            0       0       0       1         0       1        0          1         1
sw            d       0       0       1         0       0        1          0         d
beq           d       0       1       0         1       0        0          0         d
(d = don't care)
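The table of control-line settings above can be written as a lookup structure and grouped by the stage that consumes each signal. This is a Python sketch for illustration; the dictionary layout and the use of "X" for a don't-care value are assumptions, not part of the original notes.

```python
# One entry per instruction class; None marks a don't-care value.
CONTROL = {
    "R-type": dict(RegDst=1, ALUOp1=1, ALUOp0=0, ALUSrc=0,
                   Branch=0, MemRead=0, MemWrite=0, RegWrite=1, MemtoReg=0),
    "lw":     dict(RegDst=0, ALUOp1=0, ALUOp0=0, ALUSrc=1,
                   Branch=0, MemRead=1, MemWrite=0, RegWrite=1, MemtoReg=1),
    "sw":     dict(RegDst=None, ALUOp1=0, ALUOp0=0, ALUSrc=1,
                   Branch=0, MemRead=0, MemWrite=1, RegWrite=0, MemtoReg=None),
    "beq":    dict(RegDst=None, ALUOp1=0, ALUOp0=1, ALUSrc=0,
                   Branch=1, MemRead=0, MemWrite=0, RegWrite=0, MemtoReg=None),
}

# Signals grouped by the pipeline stage that uses them.
GROUPS = {
    "EX": ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc"),
    "M":  ("Branch", "MemRead", "MemWrite"),
    "WB": ("RegWrite", "MemtoReg"),
}


def control_groups(instr_class: str) -> dict:
    """Bit strings for the EX, M, and WB control groups of one instruction class."""
    settings = CONTROL[instr_class]
    return {
        group: "".join("X" if settings[name] is None else str(settings[name])
                       for name in names)
        for group, names in GROUPS.items()
    }


if __name__ == "__main__":
    for instr_class in CONTROL:
        print(f"{instr_class:<7}", control_groups(instr_class))
```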
© C. D. Cantrell (10/2011)
PIPELINED CONTROL
• The control signals for an instruction are determined in the ID stage
  - The next instruction's ID stage asserts new values of the control signals
  - The current instruction's control signals must be preserved for all stages after ID
  - Control signals are part of the state of the instruction, and therefore must be passed along from stage to stage in pipeline registers, just like data
  - Some instruction fields (such as Immediate) must be preserved until they are needed in later stages
CONTROL LINES FOR THE THREE FINAL STAGES
[Figure: the control unit decodes the instruction held in IF/ID and writes three groups of control lines into ID/EX. EX: RegDst, ALUOp, ALUSrc. M: Branch, MemRead, MemWrite. WB: RegWrite, MemtoReg. The M and WB groups are copied into EX/MEM, and the WB group is copied into MEM/WB.]
PIPELINED DATAPATH WITH CONTROL LOGIC AND SIGNALS
[Figure: complete pipelined datapath with the control unit, the EX, M, and WB control fields in the pipeline registers, and the signals PCSrc, RegWrite, MemRead, MemWrite, MemtoReg, Branch, RegDst, ALUOp, ALUSrc]
© C. D. Cantrell (10/2011)
PIPELINED EXECUTION OF FIVE INSTRUCTIONS
• We'll follow what happens in the instruction sequence
    [40000024] lw  $10, 20($1)
    [40000028] sub $11, $2, $3
    [4000002c] and $12, $4, $5
    [40000030] or  $13, $6, $7
    [40000034] add $14, $8, $9
  - For each clock period, note the values of the following signals in the ID/EX, EX/MEM, and MEM/WB pipeline registers:
    ◦ WB: RegWrite, MemtoReg
    ◦ M: Branch, MemRead, MemWrite
    ◦ EX: RegDst, ALUOp, ALUSrc
[Figure, clock 1: pipelined datapath with control. IF: lw $10, 20($1); ID: before<1>; EX: before<2>; MEM: before<3>; WB: before<4>]
[Figure, clock 2: IF: sub $11, $2, $3; ID: lw $10, 20($1); EX: before<1>; MEM: before<2>; WB: before<3>]
[Figure, clock 3: IF: and $12, $4, $5; ID: sub $11, $2, $3; EX: lw $10, ...; MEM: before<1>; WB: before<2>]
[Figure, clock 4: IF: or $13, $6, $7; ID: and $12, $4, $5; EX: sub $11, ...; MEM: lw $10, ...; WB: before<1>]
[Figure, clock 5: IF: add $14, $8, $9; ID: or $13, $6, $7; EX: and $12, ...; MEM: sub $11, ...; WB: lw $10, ...]
[Figure, clock 6: IF: after<1>; ID: add $14, $8, $9; EX: or $13, ...; MEM: and $12, ...; WB: sub $11, ...]
[Figure, clock 7: IF: after<2>; ID: after<1>; EX: add $14, ...; MEM: or $13, ...; WB: and $12, ...]
[Figure, clock 8: IF: after<3>; ID: after<2>; EX: after<1>; MEM: add $14, ...; WB: or $13, ...]
[Figure, clock 9: IF: after<4>; ID: after<3>; EX: after<2>; MEM: after<1>; WB: add $14, ...]
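The control values held in the pipeline registers for the five-instruction sequence above can be tabulated with a short script. This Python sketch makes two assumptions that are not stated in the notes: the instructions before and after the sequence are treated as bubbles with all control bits 0, and the ID/EX, EX/MEM, and MEM/WB entries describe the instructions that are in the EX, MEM, and WB stages during the given clock. All names are illustrative.

```python
# (EX, M, WB) control groups; EX = RegDst ALUOp1 ALUOp0 ALUSrc,
# M = Branch MemRead MemWrite, WB = RegWrite MemtoReg.
CONTROL = {"lw": ("0001", "010", "11"), "R-type": ("1100", "000", "10")}
BUBBLE = ("0000", "000", "00")  # assumed contents for before<k> / after<k>

PROGRAM = ["lw", "sub", "and", "or", "add"]


def control_of(index: int):
    """Control groups of instruction number `index` (0-based), or a bubble."""
    if 0 <= index < len(PROGRAM):
        kind = "lw" if PROGRAM[index] == "lw" else "R-type"
        return CONTROL[kind]
    return BUBBLE


def registers_during(clock: int) -> dict:
    """Control fields read from each pipeline register during `clock` (1-based)."""
    ex, m_ex, wb_ex = control_of(clock - 3)   # instruction in EX
    _, m_mem, wb_mem = control_of(clock - 4)  # instruction in MEM
    _, _, wb_wb = control_of(clock - 5)       # instruction in WB
    return {
        "ID/EX": {"EX": ex, "M": m_ex, "WB": wb_ex},
        "EX/MEM": {"M": m_mem, "WB": wb_mem},
        "MEM/WB": {"WB": wb_wb},
    }


if __name__ == "__main__":
    for clock in range(1, 10):
        print(clock, registers_during(clock))
```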
© C. D. Cantrell (05/1999)
DATA HAZARDS (1)
• Data hazards occur when the order of read and write actions is not the order in strictly sequential execution
  - Hazards are named by the ordering in the program that must be preserved in the course of pipelined execution
    ◦ In the following, the order of execution should be i, then j
  - RAW (read after write): j reads a source before i has written it
    ◦ j incorrectly gets the old value
    ◦ Most common kind of data hazard
  - WAR (write after read): j writes a destination before i reads it
    ◦ i incorrectly gets the new value
  - WAW (write after write): j should write an operand after i writes it, but the writes are performed in the wrong order, incorrectly leaving the value written by i
• Hazards limit pipeline speedup and complicate design
© C. D. Cantrell (05/1999)
DATA HAZARDS (2)
• RAW hazards are generated by the instructions
    sub $2,$1,$3
    and $12,$2,$5
    or  $13,$6,$2
    add $14,$2,$2
    sw  $15,100($2)
[Figure: the five instructions in program execution order over clock periods 1 to 9; each passes through instruction fetch, Reg, ALU, data access, Reg, starting one clock period after the previous instruction]
© C. D. Cantrell (05/1999)
DATA HAZARDS (3)
• RAW hazards with i = sub $2,$1,$3 and source = $2 are generated by the instructions
    sub $2,$1,$3
    and $12,$2,$5
    or  $13,$6,$2
    add $14,$2,$2
    sw  $15,100($2)
  - The instruction j = and $12,$2,$5 reads $2 before i writes it
  - The instruction j = or $13,$6,$2 reads $2 before i writes it
  - The instruction j = add $14,$2,$2 reads $2 in the same clock period in which i writes it
    ◦ Generates a hazard if the register file's outputs change only on the edge of the main processor clock
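The three cases listed above follow from the distance between the producing and consuming instructions. A minimal Python sketch, assuming the five-stage pipeline of these notes with no forwarding (the producer writes in clock i + 5, the consumer reads in clock j + 2); the program encoding and the same_cycle_ok flag are illustrative.

```python
# (mnemonic, destination register or None, source registers)
PROGRAM = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),
]


def raw_hazards(program, same_cycle_ok: bool = False):
    """List (producer index, consumer index, register) for every RAW hazard.

    A consumer at distance 1 or 2 reads before the producer writes.
    At distance 3 the read and the write share a clock period, which is a
    hazard unless the register file can write and then read in one period
    (same_cycle_ok=True).
    """
    limit = 2 if same_cycle_ok else 3
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None:
            continue
        for j in range(i + 1, len(program)):
            _, dest_j, sources = program[j]
            if dest in sources and (j - i) <= limit:
                hazards.append((i, j, dest))
            if dest_j == dest:
                break  # a later write replaces this producer
    return hazards


if __name__ == "__main__":
    print(raw_hazards(PROGRAM))
    print(raw_hazards(PROGRAM, same_cycle_ok=True))
```

The sw instruction is far enough behind sub that it reads $2 after the write in either case.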
© C. D. Cantrell (10/2012)
SIMPLE QUESTIONS ABOUT TIMING
• How many clock periods are required to execute this program segment with various pipelined designs?
            lw   $t2, 0($t3)
            lw   $t3, 4($t3)
            beq  $t2, $t3, Label   # Assume branch not taken
            add  $t5, $t2, $t3
            sw   $t5, 8($t3)
    Label:  ...
  - What happens during clock period 8?
  - In what clock period does the addition of $t2 and $t3 actually take place?
• This segment takes 21 clock periods in the multicycle implementation
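The multicycle figure of 21 clock periods can be checked from the number of steps each instruction class needs in the multicycle step table earlier in these notes (5 for lw, 4 for sw and R-type, 3 for beq and jumps). A short Python sketch, with illustrative names:

```python
# Clock periods per instruction class in the multicycle implementation.
MULTICYCLE_CLOCKS = {"lw": 5, "sw": 4, "R-type": 4, "beq": 3, "j": 3}

# Instruction classes of the program segment, in order.
SEGMENT = ["lw", "lw", "beq", "R-type", "sw"]


def multicycle_clocks(segment) -> int:
    """Total clock periods when instructions execute one after another."""
    return sum(MULTICYCLE_CLOCKS[kind] for kind in segment)


if __name__ == "__main__":
    print(multicycle_clocks(SEGMENT))
```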
© C. D. Cantrell (10/2012)
PROGRAM SEGMENT TIMING (1)
• Pipeline hazards in the same program segment as for multicycle
[Figure: lw $t2, 0($t3); lw $t3, 4($t3); beq $t2, $t3, Label; add $t5, $t2, $t3; sw $t5, 8($t3), each drawn as IM, Reg, ALU, DM, Reg separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
© C. D. Cantrell (10/2012)
PROGRAM SEGMENT TIMING (2)
• In the multicycle implementation, this 5-instruction segment takes 21 clocks
• If there were no hazards, pipelined execution of the segment would take 9 clocks
• The hazard on register $t3 causes a delay of 3 clocks and the hazard on $t5 causes an additional delay of 3 clocks, making the pipelined execution time 15 clock periods, which is 71% of the multicycle execution time!
• Pipeline design improvements can reduce execution times
[Figure: lw $t3, 4($t3) followed by beq $t2, $t3, Label, with beq delayed by 3 clock periods]
The University of Texas at Dallas Erik Jonsson School ofEngineering & Computer Science
c� C. D. Cantrell (10/2012)
PROGRAM SEGMENT TIMING (3)
• In the multicycle implementation, this 5-instruction segment takes 21 clocks
• If there were no hazards, pipelined execution of the segment would take 9clocks
• The hazard on register $t5 causes an additional delay of 3 clocks, making thepipelined execution time 15 clock periods—71% of the multicycle executiontime!
[Figure: pipeline diagram for beq $t2, $t3, Label; add $t5, $t2, $t3; sw $t5, 8($t3) over clock periods CP6 to CP15]
© C. D. Cantrell (02/1999)
DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (1)
• Follows ideas of K. Kennedy et al.
• Definition: If control flow within a program can reach statement T after passing through statement S, then T depends on S
! Dependence is always defined by reference to the results of serial execution
• Assume a loop
! Dependence analysis outside a loop is trivial
• Assume an array
! Dependences that affect pipelining arise from references to the same memory location M[n] in an array
© C. D. Cantrell (02/1999)
DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (2)
• Distinguish between statements (in a program) and instances of those statements (in loop instances or threads)
! Let i = loop induction variable
◦ Example: A for loop in C
¶ Syntax: for ( i=0; i<n; i++ ) · · ·
¶ The induction variable is i
! Let S_i = instance of statement S that occurs on the value i of the induction variable
• Flow dependence (RAW): S_{i_S} writes M[n] and T_{i_T} reads M[n]
S: X[f_S[i]] = · · ·
T: · · · = F[X[f_T[i]]]
© C. D. Cantrell (02/1999)
DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (3)
• Anti-dependence (WAR): S_{i_S} reads M[n] and T_{i_T} writes M[n]
S: · · · = F[X[f_S[i]]]
T: X[f_T[i]] = · · ·
• Output dependence (WAW): S_{i_S} and T_{i_T} both write M[n]
S: X[f_S[i]] = · · ·
T: X[f_T[i]] = · · ·
© C. D. Cantrell (10/2010)
LOOP OVERHEAD EXAMPLE: DOT PRODUCT
# The arguments are in registers $a0 through $a3
# The first argument POINTS to the first element of vector v1
# The second argument POINTS to the first element of vector v2
# Third argument = value of veclen (the dimension of the vectors)
# Fourth argument = data size in bytes
        .text
__start:
dotpro: nop                  # The dot product function
        ori  $v0,$0,0        # Initialize the dot product to 0
        blez $a2,beamup      # Return if veclen <= 0
        or   $t1,$0,$a2      # Register t1 will be a counter; initialized to veclen
        or   $t3,$0,$a0      # Register t3 points to the component of v1
        or   $t4,$0,$a1      # Register t4 points to the component of v2
loop2:  lw   $t5,0($t3)      # Load word pointed to by reg. t3
        lw   $t6,0($t4)      # Load word pointed to by reg. t4
        mul  $t2,$t5,$t6     # Multiply regs. t5 and t6, product in reg. t2
        add  $v0,$v0,$t2     # Add product to running sum in reg. v0
        add  $t3,$a3,$t3     # Increment the pointer to the component of v1
        add  $t4,$a3,$t4     # Increment the pointer to the component of v2
        addi $t1,$t1,-1      # Decrement register t1
        bgtz $t1,loop2       # Loop again if t1>0
beamup: jr   $ra             # Beam me up....
© C. D. Cantrell (02/1999)
DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (4)
• Requirement for the existence of an instance of dependence: A real memory location M[n] exists such that
n = f_S[i_S] = f_T[i_T]
where S is executed before T and both values of the induction variable are in the range of the loop:
p ≤ i_S ≤ i_T ≤ q
• Example: f_S and f_T are linear functions
f_S[i] = a_S i + b_S,   f_T[i] = a_T i + b_T
The requirement for dependence implies that
a_S i_S + b_S = a_T i_T + b_T
⇒ a_S i_S − a_T i_T + (b_S − b_T) = 0
© C. D. Cantrell (02/1999)
DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (5)
• Example:
do 100 i=2,100
T: b(i)=a(i-1)
S: a(i)=c(i)
! Here, f_S(i) = i, f_T(i) = i − 1
! Condition f_S(i_S) = f_T(i_T) is i_S = i_T − 1 ⇒ i_S < i_T (hence T depends on S)
! The equation i_S = i_T − 1 has lots of solutions such that 2 ≤ i_S < i_T ≤ 100
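The solutions of the dependence equation for this loop can be enumerated by brute force. This is a minimal sketch: the function name and the exhaustive search are illustrative assumptions (a compiler would solve the linear equation rather than try every pair).

```python
def dependence_instances(f_s, f_t, lo, hi):
    """List (i_S, i_T) with f_s(i_S) == f_t(i_T) and lo <= i_S <= i_T <= hi."""
    return [
        (i_s, i_t)
        for i_s in range(lo, hi + 1)
        for i_t in range(i_s, hi + 1)
        if f_s(i_s) == f_t(i_t)
    ]


# S: a(i) = c(i) writes a(i);  T: b(i) = a(i-1) reads a(i-1)
pairs = dependence_instances(lambda i: i, lambda i: i - 1, 2, 100)
print(len(pairs), pairs[0], pairs[-1])
```

Every pair found has i_T = i_S + 1, which is the statement that T in one iteration reads what S wrote in the previous iteration.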
© C. D. Cantrell (02/1999)
DEPENDENCE ANALYSIS FOR PIPELINED LOOPS (6)
• In serial execution, T3 reads from a(2) after S2 writes to a(2):
i = 2: T2: b(2)=a(1)
       S2: a(2)=c(2)
i = 3: T3: b(3)=a(2)
       S3: a(3)=c(3)
• In vector execution, T3 reads from a(2) before S2 writes to it:
b(2)=a(1)
b(3)=a(2)
...
a(2)=c(2)
a(3)=c(3)
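The difference between the two execution orders can be seen by running both on small arrays. The sketch below is illustrative only: the array contents and function names are assumptions, and "vector execution" is modeled simply as all instances of T followed by all instances of S.

```python
def run_serial(a, c, lo, hi):
    """Execute T then S for each i in turn, as the serial loop does."""
    a, b = dict(a), {}
    for i in range(lo, hi + 1):
        b[i] = a[i - 1]   # T_i: b(i) = a(i-1)
        a[i] = c[i]       # S_i: a(i) = c(i)
    return b


def run_vector(a, c, lo, hi):
    """Execute every instance of T first, then every instance of S."""
    a, b = dict(a), {}
    for i in range(lo, hi + 1):
        b[i] = a[i - 1]   # all reads see the old contents of a
    for i in range(lo, hi + 1):
        a[i] = c[i]
    return b


old_a = {i: 0 for i in range(1, 6)}        # a(1..5) initially 0
new_c = {i: 10 * i for i in range(1, 6)}   # c(i) = 10*i
serial = run_serial(old_a, new_c, 2, 5)
vector = run_vector(old_a, new_c, 2, 5)
print(serial)
print(vector)
```

In the serial result b(3) holds the new value of a(2); in the vector result it holds the old one, so the reordering changes the answer.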
© C. D. Cantrell (10/2012)
CONTROL HAZARDS
• A control hazard occurs because a branch instruction needs to make a decision based on the results of operations or instructions that are still pending
. A beq instruction cannot update the PC before the test for equality has completed
◦ We’ll see later that it’s possible to make the decision at the Instruction Decode/Register Fetch stage instead of the ALU stage
. In the MIPS ISA, only the instruction immediately following the branch is executed or not executed, depending on the branch decision
◦ This isn’t true in longer pipelines
. Methods for dealing with control hazards
◦ Stall
◦ Predict the branch
◦ Always execute the instruction following the branch
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
STALLING AS A SOLUTION FOR CONTROL HAZARDS
• After a conditional branch (beq) there is a one-stage pipeline stall (bubble), even if we are able to compare the inputs to beq in the ID/RF stage
[Figure: pipeline timing diagram (time axis 2 to 16 ns; stages Instruction fetch, Reg, ALU, Data access, Reg) for beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0) in program execution order, with the intervals between instruction starts labeled 2 ns and 4 ns]
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
PREDICTING BRANCHES NOT TAKEN AS A SOLUTION
[Figure: two pipeline timing diagrams (time axis 2 to 14 ns; stages Instruction fetch, Reg, ALU, Data access, Reg). Upper diagram: beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), with intervals labeled 2 ns. Lower diagram: beq $1, $2, 40; add $4, $5, $6; or $7, $8, $9, with a row of five bubbles and intervals labeled 2 ns and 4 ns.]
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
PIPELINE DELAYED BRANCH AS A SOLUTION
• After the conditional branch (beq), we insert an add instruction (which can do useful work) instead of a stall (bubble), which does nothing
[Figure: pipeline timing diagram (time axis 2 to 14 ns) for beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0), with intervals labeled 2 ns]
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
FORWARDING AS A SOLUTION FOR DATA HAZARDS (1)
[Figure: forwarding between add $s0, $t0, $t1 and sub $t2, $s0, $t3 (stages IF, ID, EX, MEM, WB; time axis 2 to 10 ns)]
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
FORWARDING AS A SOLUTION FOR DATA HAZARDS (2)
[Figure: forwarding between lw $s0, 20($t1) and sub $t2, $s0, $t3 (stages IF, ID, EX, MEM, WB; time axis 2 to 14 ns), with a row of five bubbles]
© C. D. Cantrell (02/1999)
MIPS PIPELINES (3)
• MIPS R2000 integer unit pipeline hazards, named for pipeline registers:
Notation: in IF/ID.ReadRegisterj, IF/ID is the pipeline register and ReadRegisterj is the name of the register field
1. ID/EX.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)
2. EX/MEM.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)
3. MEM/WB.WriteRegister = IF/ID.ReadRegisterj (j = 1, 2)
© C. D. Cantrell (11/1999)
MIPS PIPELINES (4)
• MIPS R2000 integer pipeline hazards generated by the instructions
sub $2,$1,$3
and $12,$2,$5
or $13,$6,$2
add $14,$2,$2
sw $15,100($2)
! The instruction and $12,$2,$5 results in hazard 1a,
ID/EX.WriteRegister = IF/ID.ReadRegister1 = 2
! The instruction or $13,$6,$2 results in hazard 2b,
EX/MEM.WriteRegister = IF/ID.ReadRegister2 = 2
! The instruction add $14,$2,$2 results in hazards 3a and 3b,
MEM/WB.WriteRegister = IF/ID.ReadRegister1 = 2
MEM/WB.WriteRegister = IF/ID.ReadRegister2 = 2
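The hazard labels above can be reproduced mechanically by comparing each instruction's source registers with the destinations of the three instructions ahead of it. The sketch below is an illustration under stated assumptions: the tuple encoding of the program and the function name are hypothetical, and the digit in each label is taken to be the distance to the producing instruction.

```python
# (mnemonic, destination register or None, (ReadRegister1, ReadRegister2))
PROGRAM = [
    ("sub", 2, (1, 3)),
    ("and", 12, (2, 5)),
    ("or", 13, (6, 2)),
    ("add", 14, (2, 2)),
    ("sw", None, (2, 15)),   # sw reads its base ($2) and the stored value ($15)
]

# Distance to the producer -> pipeline register that holds the producer
STAGE_REGISTER = {1: "ID/EX", 2: "EX/MEM", 3: "MEM/WB"}


def find_hazards(program):
    """Return (consumer index, hazard label) pairs, e.g. (1, '1a')."""
    hazards = []
    for i, (_, _, sources) in enumerate(program):
        for distance in STAGE_REGISTER:
            if i - distance < 0:
                continue
            dest = program[i - distance][1]
            if dest is None or dest == 0:      # no register is written
                continue
            for j, src in enumerate(sources):
                if src == dest:
                    hazards.append((i, f"{distance}{'ab'[j]}"))
    return hazards


found = find_hazards(PROGRAM)
for index, label in found:
    register = STAGE_REGISTER[int(label[0])]
    print(PROGRAM[index][0], "-> hazard", label, "against", register)
```

The sw instruction is four instructions behind the sub, so it produces no entry: by then register $2 has already been written.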
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
PIPELINED DEPENDENCIES IN AN INSTRUCTION SEQUENCE
[Figure: pipelined dependences in the sequence sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) over clock cycles CC 1 to CC 9. Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 9.]
[Figure: the same sequence and register values, with the value of EX/MEM (–20 in CC 4) and the value of MEM/WB (–20 in CC 5) added]
[Figure: PIPELINED DATAPATH WITHOUT FORWARDING]
[Figure: PIPELINED DATAPATH WITH FORWARDING (forwarding unit with inputs Rs, Rt, EX/MEM.RegisterRd and MEM/WB.RegisterRd; outputs ForwardA and ForwardB)]
[Figure: DATAPATH MODIFIED TO RESOLVE HAZARDS BY FORWARDING]
© C. D. Cantrell (05/1999)
PIPELINED EXECUTION WITH FORWARDING
• We’ll follow what happens in the instruction sequence
[40000028] sub $2, $1, $3
[4000002c] and $4, $2, $5
[40000030] or $4, $4, $2
[40000034] add $9, $4, $2
! Without forwarding, there would be RAW hazards on register $2 in the and instruction and on register $4 in the or and add instructions
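The decision made by a forwarding unit for this sequence can be sketched as below. This is not taken from the slides: it assumes the usual textbook conditions (forward from EX/MEM when it writes the needed register, otherwise from MEM/WB), the return codes 2/1/0 are an arbitrary encoding, and all function names are hypothetical.

```python
def forward_select(id_ex_reg, ex_mem_write, ex_mem_rd, mem_wb_write, mem_wb_rd):
    """Choose the source for one ALU input.

    Returns 2 to forward from EX/MEM, 1 to forward from MEM/WB,
    0 to use the value read from the register file.
    """
    if ex_mem_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return 2          # the most recent result wins
    if mem_wb_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return 1
    return 0


def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """ex_mem and mem_wb are (RegWrite, destination register) pairs."""
    forward_a = forward_select(id_ex_rs, *ex_mem, *mem_wb)
    forward_b = forward_select(id_ex_rt, *ex_mem, *mem_wb)
    return forward_a, forward_b


# and $4, $2, $5 in EX while sub $2, ... is in MEM
# (assumption: the instruction in WB does not write $2 or $5)
clock4 = forwarding_unit(2, 5, ex_mem=(True, 2), mem_wb=(False, 0))
# or $4, $4, $2 in EX while and $4, ... is in MEM and sub $2, ... is in WB
clock5 = forwarding_unit(4, 2, ex_mem=(True, 4), mem_wb=(True, 2))
print(clock4, clock5)
```

Checking EX/MEM before MEM/WB matters when both hold a result for the same register, because the EX/MEM value is the newer one.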
[Figure: Clock 3. Pipeline contents from IF to WB: or $4, $4, $2; and $4, $2, $5; sub $2, $1, $3; before<1>; before<2>. ID/EX.WriteRegister = IF/ID.ReadRegister1 = 2 (RAW data hazard)]
[Figure: Clock 4. Pipeline contents from IF to WB: add $9, $4, $2; or $4, $4, $2; and $4, $2, $5; sub $2, . . .; before<1>. ID/EX.WriteRegister = IF/ID.ReadRegister1 = 4 and EX/MEM.WriteRegister = IF/ID.ReadRegister2 = 2 (RAW data hazards). EX/MEM.WriteRegister = ID/EX.ReadRegister1 = 2 (test by which the need for forwarding to ALUIn1 is actually detected)]
[Figure: Clock 5. Pipeline contents from IF to WB: after<1>; add $9, $4, $2; or $4, $4, $2; and $4, . . .; sub $2, . . . EX/MEM.WriteRegister = IF/ID.ReadRegister1 = 4 and MEM/WB.WriteRegister = IF/ID.ReadRegister2 = 2. EX/MEM.WriteRegister = ID/EX.ReadRegister1 = 4 and MEM/WB.WriteRegister = ID/EX.ReadRegister2 = 2]
[Figure: Clock 6. Pipeline contents from IF to WB: after<2>; after<1>; add $9, $4, $2; or $4, . . .; and $4, . . . EX/MEM.WriteRegister = ID/EX.ReadRegister2 = 4. Annotation: "Who wrote this value?"]
[Figure: ADDITION OF A MULTIPLEXOR TO CHOOSE THE IMMEDIATE VALUE (ALUSrc)]
[Figure: A DATA HAZARD THAT CANNOT BE RESOLVED BY FORWARDING. Sequence lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7 over clock cycles CC 1 to CC 9. Annotation: "Data dependence goes backward in time".]
[Figure: HOW STALLS ARE INSERTED INTO A PIPELINE. The same sequence over clock cycles CC 1 to CC 10, with a bubble inserted.]
© C. D. Cantrell (10/2012)
HAZARD DETECTION UNIT
• The control logic for the hazard detection unit is:
If (ID/EX.MemRead and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
     (ID/EX.RegisterRt = IF/ID.RegisterRt)))
Then stall the pipeline
. The first line tests whether the instruction is a load in the EX stage
◦ The next two lines check whether the destination register of the load is the same as either of the source registers of the instruction that is in the ID stage
• For a stall, all control signals are deasserted in the EX stage
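The control logic above translates directly into a small function. The sketch below is illustrative: PCWrite and IF/IDWrite are the signal names that appear in the datapath figure, while ZeroControl is a hypothetical name for the multiplexor select that sends zeros into the ID/EX control fields.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use test: True when the instruction in ID reads the register
    that the load in EX has not yet fetched from memory."""
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)


def control_outputs(stall):
    """Signals driven by the hazard detection unit."""
    return {
        "PCWrite": not stall,        # hold the PC during a stall
        "IF/IDWrite": not stall,     # hold the instruction in IF/ID
        "ZeroControl": stall,        # hypothetical name: deassert control in EX
    }


# lw $2, 20($1) in EX, and $4, $2, $5 in ID: the and must wait
load_use = must_stall(True, 2, 2, 5)
# sub $2, $1, $3 in EX (not a load, Rt field = 3): no stall
alu_use = must_stall(False, 3, 2, 5)
print(load_use, control_outputs(load_use))
print(alu_use, control_outputs(alu_use))
```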
[Figure: OVERVIEW OF PIPELINED CONTROL (datapath with a hazard detection unit and a forwarding unit; signals shown include ID/EX.MemRead, ID/EX.RegisterRt, IF/IDWrite and PCWrite)]
© C. D. Cantrell (10/2011)
PIPELINED EXECUTION WITH A STALL
• We’ll follow what happens in the instruction sequence
[40000028] lw $2, 20($1)
[4000002c] and $4, $2, $5
[40000030] or $4, $4, $2
[40000034] add $9, $4, $2
. The hardware inserts a stall after the lw instruction
◦ The stall creates the same effect as a nop
◦ For a stall, all control signals are deasserted in the EX stage
◦ Deasserted control signals are forwarded to the MEM and WB stages
◦ Nothing is written to memory or the register file
. After the stall, forwarding resolves the RAW hazards on register $2 in the and instruction and on register $4 in the or and add instructions
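The clock-by-clock contents of the pipeline for this sequence can be traced with a toy model. The sketch below is a simplification under stated assumptions: it tracks only which instruction occupies each stage, it models only the load-use stall (forwarding is assumed to handle every other hazard), and the data layout and names are hypothetical.

```python
BUBBLE = "bubble"

# (text, is_load, destination register, source registers)
PROGRAM = [
    ("lw $2, 20($1)", True, 2, (1,)),
    ("and $4, $2, $5", False, 4, (2, 5)),
    ("or $4, $4, $2", False, 4, (4, 2)),
    ("add $9, $4, $2", False, 9, (4, 2)),
]


def load_use_hazard(in_id, in_ex):
    """True when a load in EX writes a register that the instruction in ID reads."""
    if in_id in (None, BUBBLE) or in_ex in (None, BUBBLE):
        return False
    _, is_load, dest, _ = in_ex
    return is_load and dest in in_id[3]


def simulate(program, clocks):
    """Return, for each clock, the instruction text in IF, ID, EX, MEM, WB."""
    pipe = [None] * 5
    next_fetch = 0
    trace = []
    for _ in range(clocks):
        if load_use_hazard(pipe[1], pipe[2]):
            # Hold IF and ID, send a bubble into EX
            pipe = [pipe[0], pipe[1], BUBBLE, pipe[2], pipe[3]]
        else:
            fetched = program[next_fetch] if next_fetch < len(program) else None
            next_fetch += 1
            pipe = [fetched, pipe[0], pipe[1], pipe[2], pipe[3]]
        trace.append([s[0] if isinstance(s, tuple) else s for s in pipe])
    return trace


trace = simulate(PROGRAM, 7)
for clock, stages in enumerate(trace, start=1):
    print("Clock", clock, stages)
```

In the trace, the or and and instructions stay in IF and ID for one extra clock while the bubble follows the lw down the pipeline.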
[Figure: Clock 2. Pipeline contents from IF to WB: and $4, $2, $5; lw $2, 20($1); before<1>; before<2>; before<3>]
[Figure: Clock 3. Pipeline contents from IF to WB: or $4, $4, $2; and $4, $2, $5; lw $2, 20($1); before<1>; before<2>]
[Figure: Clock 4. Pipeline contents from IF to WB: or $4, $4, $2; and $4, $2, $5; bubble; lw $2, . . .; before<1>]
[Figure: Clock 5. Pipeline contents from IF to WB: add $9, $4, $2; or $4, $4, $2; and $4, $2, $5; bubble; lw $2, . . .]
[Figure: Clock 6. Pipeline contents from IF to WB: after<1>; add $9, $4, $2; or $4, $4, $2; and $4, . . .; bubble]
[Figure: Clock 7. Pipeline contents from IF to WB: after<2>; after<1>; add $9, $4, $2; or $4, . . .; and $4, . . .]
[Figure: EFFECT OF A PIPELINE ON A BRANCH INSTRUCTION. Clock cycles CC 1 to CC 9; instructions 40 beq $1, $3, 7; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7)]
© C. D. Cantrell (10/2010)
MAKE THE BRANCH DECISION EARLY
• In the unoptimized example, taking the branch costs 3 clock periods
! The cost is higher in modern pipelines, which are much deeper
• The branch target calculation PC = (PC + 4) + offset*4 for the instruction beq Rs, Rt, offset can be done in the ID/RF stage
• There is a faster way than a sub to compare the contents of Rs and Rt
! With a small combinational logic block, take the bitwise XOR of the register contents
◦ This produces a word with 1 bits wherever the operands differ
! Then OR all of the bits of the resulting word
◦ The result is 1 if, and only if, the operands differ ⇒ branch not taken
! This can also be done in the ID/RF stage
• Result: In the MIPS ISA, there is only a one-cycle delay after a branch
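The XOR-then-OR comparison and the branch target calculation can be modeled in software. The sketch below is illustrative only: the function names are hypothetical and the loop over 32 bits stands in for what the hardware does with a single wide OR gate.

```python
MASK32 = 0xFFFFFFFF


def registers_differ(rs_value, rt_value):
    """XOR the operands, then OR together all 32 bits of the result."""
    diff = (rs_value ^ rt_value) & MASK32
    any_bit = 0
    for bit in range(32):
        any_bit |= (diff >> bit) & 1
    return any_bit          # 1 if the operands differ, 0 if they are equal


def next_pc(pc, rs_value, rt_value, offset):
    """PC after beq Rs, Rt, offset, decided in the ID/RF stage."""
    fall_through = (pc + 4) & MASK32
    if registers_differ(rs_value, rt_value):
        return fall_through                      # branch not taken
    return (fall_through + offset * 4) & MASK32  # branch taken


taken_pc = next_pc(40, 5, 5, 7)       # equal operands: branch taken
not_taken_pc = next_pc(40, 5, 6, 7)   # different operands: fall through
print(taken_pc, not_taken_pc)
```

With the branch at address 40 and an offset of 7, the taken target is (40 + 4) + 7*4 = 72.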
[Figure: PIPELINED DATAPATH INCLUDING SUPPORT FOR BRANCHES (labels include IF.Flush, Sign extend, Shift left 2 and an equality comparator)]
© C. D. Cantrell (10/2010)
A PIPELINED BRANCH
• We’ll follow what happens in the instruction sequence
[40000024] sub $10, $4, $8
[40000028] beq $1, $3, 7 # PC-relative branch to offset
[4000002c] and $12, $2, $5 # (40 + 4) + 7*4 = 72 = 0x48
[40000030] or $14, $2, $6
[40000034] add $14, $4, $2
[40000038] slt $15, $6, $7
. . .
[40000048] lw $14, 50($7)
[Figure: Clock 3. Pipeline contents from IF to WB: and $12, $2, $5; beq $1, $3, 7; sub $10, $4, $8; before<1>; before<2>]
[Figure: Clock 4. Pipeline contents from IF to WB: lw $4, 50($7); bubble (nop); beq $1, $3, 7; sub $10, . . .; before<1>]
[Figure: EFFECT OF AN OPTIMIZED PIPELINE ON A BRANCH INSTRUCTION. Clock cycles CC 1 to CC 9; instructions 40 beq $1, $3, 7; 44 and $12, $2, $5; 72 lw $4, 50($7)]
© C. D. Cantrell (10/2011)
STATIC BRANCH PREDICTION
• Predicted behavior is based only on the branch instruction itself
. The early SPARC and MIPS architectures predicted that a branch would not be taken
. A more sophisticated static prediction scheme would base the prediction on a comparison of the target address with the current value of the PC
◦ If the branch goes to a later instruction (i.e., to a higher address) then it is predicted never taken
◦ If the branch goes to an earlier instruction, then it is predicted always taken
• Problems with static prediction
. Predict correctly only for certain types of branches
. Example: If beq is the only available branch instruction, then it must be taken to exit from a loop
. If the beq target is a later instruction, then the branch is almost always mispredicted in the “not taken to a later instruction” approach
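The address-comparison rule described above fits in one line of code. This is a minimal sketch: the function name is hypothetical, and the first pair of addresses is invented for illustration (the second pair is the beq and its target from the pipelined branch example).

```python
def predict_taken(branch_pc, target_pc):
    """Static rule: backward branches are predicted taken, forward ones not taken."""
    return target_pc < branch_pc


# A branch back to the head of a loop (hypothetical addresses)
loop_branch = predict_taken(branch_pc=0x40000050, target_pc=0x40000030)
# A forward branch: beq at 0x40000028 with target 0x40000048
skip_branch = predict_taken(branch_pc=0x40000028, target_pc=0x40000048)
print(loop_branch, skip_branch)
```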
© C. D. Cantrell (10/2011)
DYNAMIC BRANCH PREDICTION (1)
• Base the predicted branch behavior on the history of the branch
• A common branch prediction scheme uses a branch history table
. Each entry in the memory is indexed by the lower 16 bits of the address of the branch instruction
. Each entry consists of a bit that is set if the branch was recently taken
. If the branch outcome differs from the stored bit, the bit is toggled
. Performance shortcoming: If a branch is almost always taken (or not taken), then the bit gets toggled on a wrong prediction, and the next branch is likely to be mispredicted
◦ Example: A loop that is executed 10 times, using branch to the head
◦ The branch is mispredicted at the beginning and end (80% accuracy)
◦ Here, branch frequency (90% taken) ≠ predicted frequency (80%)
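The 80% figure in the example can be reproduced by simulating a single history bit. The sketch below makes two assumptions that are not on the slide: the loop is entered repeatedly (100 times here), and the bit starts cleared, as it would be after a previous exit from the loop. The function name is hypothetical.

```python
def one_bit_accuracy(outcomes, bit=False):
    """Predict each branch with the stored bit; flip the bit on a wrong prediction."""
    correct = 0
    for taken in outcomes:
        if bit == taken:
            correct += 1
        else:
            bit = taken          # same as toggling a single bit
    return correct / len(outcomes), bit


loop_run = [True] * 9 + [False]      # 9 branches back to the head, then the exit
accuracy, final_bit = one_bit_accuracy(loop_run * 100)
print(accuracy)
```

Each pass through the loop costs two mispredictions: one on the first iteration (the bit still says "not taken" from the last exit) and one on the exit itself.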
© C. D. Cantrell (10/2011)
DYNAMIC BRANCH PREDICTION (2)
• A 2-bit branch prediction scheme uses a branch history table in which each entry contains 2 bits to indicate the state of a branch prediction FSM (next slide)
. This scheme mispredicts only once if a branch almost always goes one way
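One common way to realize a 2-bit scheme is a saturating counter, sketched below. This is an assumption: the FSM in the slides may use different transitions between its four states, and the function name and initial state are hypothetical. The same 10-iteration loop is used so the result can be compared with the 1-bit case.

```python
def two_bit_accuracy(outcomes, counter=3):
    """2-bit saturating counter: 0 and 1 predict not taken; 2 and 3 predict taken."""
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        # Move one step toward the observed outcome, saturating at 0 and 3
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


loop_run = [True] * 9 + [False]      # 9 branches back to the head, then the exit
accuracy = two_bit_accuracy(loop_run * 100)
print(accuracy)
```

Only the loop exit is mispredicted: a single "not taken" outcome moves the counter from 3 to 2, which still predicts taken on the next entry to the loop.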
[Figure: A branch-target buffer. The PC of the instruction to fetch is looked up and compared (=) against the stored entries; the buffer also holds the predicted PC and whether the branch is predicted taken or untaken. No: the instruction is not predicted to be a branch; proceed normally. Yes: the instruction is a branch and the predicted PC should be used as the next PC.]
[Figure: STATES IN A 2-BIT BRANCH PREDICTION SCHEME (two "Predict taken" states and two "Predict not taken" states, with transitions labeled Taken and Not taken)]
After John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition
DYNAMIC BRANCH PREDICTION (3)
• Branch prediction accuracy for a 4096-entry, 2-bit prediction buffer
[Figure: SCHEDULING THE BRANCH DELAY SLOT. Three cases: a. From before, b. From target, c. From fall through. Each case shows a code fragment built from add $s1, $s2, $s3; sub $t4, $t5, $t6; and a conditional branch (if $s1 = 0 or if $s2 = 0), and what it becomes after an instruction is moved into the delay slot.]
© C. D. Cantrell (10/2010)
EXCEPTIONS (1)
• In the MIPS ISA, an exception is a synchronous (clocked) event that causes a process to stop executing
! System call (explicit instruction, e.g. for I/O)
◦ Stopping execution permits another process to execute while the process that made the syscall waits for I/O
! Exception associated with execution of the current instruction
◦ Bus error (I/O timeout, load/store kernel physical address)
◦ Protection exception
◦ Attempt to execute a reserved instruction
◦ Cache/TLB miss
◦ Floating-point arithmetic exception
• An interrupt is an asynchronous event, external to the current instruction,that stops the execution of the current process
! Example: Hardware controller signals end of I/O
© C. D. Cantrell (04/1999)
EXCEPTIONS (2)
• In the R2000 ISA, exceptions are handled by coprocessor 0
• How the R2000 processor and the UNIX kernel perform exception handling:
1. Processor exits user mode & is forced into kernel mode.
2. The address of an exception vector (exception handling program) is loaded into the program counter (PC).
! Reset exception (reboot): the processor transfers control to the Reset exception vector at address 0xbfc00000
! UTLB Miss: Control is transferred to the exception vector pointed to by the contents of address 0x80000000
! All other exceptions are handled by the kernel
◦ The general exception handler pointed to by the contents of address 0x80000080 takes control, gets the cause from the Cause register and transfers control to the correct exception handler
[Figure: MIPS R2000 CPU AND COPROCESSORS. CPU: registers $0 to $31, arithmetic unit, multiply/divide unit with Lo and Hi. Coprocessor 1 (FPU): registers $0 to $31, arithmetic unit. Coprocessor 0 (traps and memory): BadVAddr, Status, Cause and EPC registers. Memory and PC are also shown.]
[Figure: MIPS CP0 and Exception Handling Registers. Shown: the TLB (Translation Lookaside Buffer) with its "Safe" entries, and the EntryHi, EntryLo, Index, Random, Context, BadVAddr, EPC, PRId, Status and Cause registers, grouped as "Used with virtual memory" and "Used for exception processing".]
© C. D. Cantrell (10/2010)
MIPS R2000 COPROCESSOR 0
• BadVaddr register (coprocessor 0, register 8)
! Memory address at which an addressing exception occurred
• Status register (coprocessor 0, register 12)
! Interrupt mask and interrupt enable bits
! Kernel/user bits for old, previous and current processes
• Cause register (coprocessor 0, register 13)
! Holds a code for the cause of an exception
• Exception program counter (EPC) (coprocessor 0, register 14)
! Holds address of instruction that caused an exception
© C. D. Cantrell (09/2012)
MIPS KERNEL CONVENTIONS (1)
• The MIPS kernel recognizes 4 memory segments: kuseg, kseg0, kseg1 and kseg2
. Addresses between 0x00400000 and 0x7fffffff belong to kuseg
◦ User address space
. Virtual addresses between 0x80000000 and 0x9fffffff belong to kseg0
◦ Addresses between 0x80000000 and 0x8fffffff are used for kernel text (.ktext; executable instructions)
◦ Addresses between 0x90000000 and 0x9fffffff are used for kernel data (.kdata)
◦ Addresses in this range are translated to physical memory by clearing the high bit and mapping contiguously into the low 512 MB of memory
© C. D. Cantrell (09/2012)
MIPS KERNEL CONVENTIONS (2)
• The MIPS kernel recognizes 4 memory segments: kuseg, kseg0, kseg1 and kseg2
. Addresses between 0xa0000000 and 0xbfffffff belong to kseg1
◦ Typically used for I/O registers, memory-resident ROM code and disk buffers
◦ Direct-mapped, uncached
. Addresses above 0xbfffffff belong to kseg2
◦ Process structures (remapped on context switches)
◦ User page table entries
◦ Caching and remapping via paging, not via swapping entire processes
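The segment boundaries and the kseg0 translation rule can be expressed as two small functions. The sketch below is illustrative: the function names are hypothetical, and every address below 0x80000000 is treated as kuseg for simplicity, even though the slides give 0x00400000 as the start of the user address space.

```python
def segment(vaddr):
    """Name of the segment that holds a 32-bit virtual address."""
    if vaddr <= 0x7FFFFFFF:
        return "kuseg"
    if vaddr <= 0x9FFFFFFF:
        return "kseg0"
    if vaddr <= 0xBFFFFFFF:
        return "kseg1"
    return "kseg2"


def kseg0_to_physical(vaddr):
    """Clear the high bit of a kseg0 address to get the physical address."""
    if segment(vaddr) != "kseg0":
        raise ValueError("not a kseg0 address")
    return vaddr & 0x7FFFFFFF


handler = 0x80000080                  # general exception handler address
print(segment(handler), hex(kseg0_to_physical(handler)))
print(segment(0xBFC00000), segment(0x00400000), segment(0xC0000000))
```

The highest kseg0 address, 0x9fffffff, maps to 0x1fffffff, the top of the low 512 MB of physical memory.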
MIPS R2000 Memory Map

Segment  Virtual addresses          Mode    Mapping   Caching
kuseg    0x00000000 to 0x7fffffff   User    Mapped    Cacheable
kseg0    0x80000000 to 0x9fffffff   Kernel  Unmapped  Cached
kseg1    0xa0000000 to 0xbfffffff   Kernel  Unmapped  Uncached
kseg2    0xc0000000 to 0xffffffff   Kernel  Mapped    Cacheable

[Figure: the Physical column of the memory map marks a boundary between 0x1fffffff and 0x20000000]
# SPIM TRAP HANDLER DATA
        .kdata
__m1_:  .asciiz " Exception "
__m2_:  .asciiz " caught by trap handler.\n"
__m3_:  .asciiz "Continuing. . .\n"
__m4_:  .asciiz "Halting.\n"
__e0_:  .asciiz " [Interrupt]"
__e1_:  .asciiz " [TLB modification !BUG!]"
__e2_:  .asciiz " [TLB miss !BUG!]"
__e3_:  .asciiz " [TLB miss !BUG!]"
__e4_:  .asciiz " [Unaligned address in inst/data fetch]"
__e5_:  .asciiz " [Unaligned address in store]"
__e6_:  .asciiz " [Bad address in text read]"
__e7_:  .asciiz " [Bad address in data/stack read]"
__e8_:  .asciiz " [Error in syscall]"
__e9_:  .asciiz " [Breakpoint]"
__e10_: .asciiz " [Reserved instruction]"
__e11_: .asciiz " [Syscall exception !BUG!]"
__e12_: .asciiz " [Arithmetic overflow]"
__e13_: .asciiz " [Inexact floating point result]"
__e14_: .asciiz " [Invalid floating point result]"
__e15_: .asciiz " [Divide by 0]"
__e16_: .asciiz " [Floating point overflow]"
__e17_: .asciiz " [Floating point underflow]"
__excp: .word __e0_,__e1_,__e2_,__e3_,__e4_,__e5_,__e6_,__e7_,__e8_,__e9_
        .word __e10_,__e11_,__e12_,__e13_,__e14_,__e15_,__e16_,__e17_
s1:     .word 0
s2:     .word 0

# SPIM TRAP HANDLER CODE
        .ktext
        .space 0x80             # Put trap handler at 0x80000080
        sw $v0 s1               # Not re-entrant
        sw $a0 s2               # Don't need to save k0/k1
        mfc0 $k0 $13            # Cause
        and $k0 $k0 0xff        # Use just ExcCode field
        mfc0 $k1 $14            # EPC
        li $v0 4                # Print " Exception "
        la $a0 __m1_
        syscall
        li $v0 1                # Print exception number
        srl $a0 $k0 2
        syscall
        li $v0 4                # Print type of exception
        lw $a0 __excp($k0)
        syscall
        li $v0 4                # Print " caught by trap handler.\n"
        la $a0 __m2_
        syscall
        srl $a0 $k0 2
        beq $a0 12 ret          # continue on overflow
        beq $a0 13 ret          # continue on inexact fp result
        beq $a0 14 ret          # continue on invalid fp result
        beq $a0 16 ret          # continue on fp overflow
        beq $a0 17 ret          # continue on fp underflow
        li $v0 4                # Print "Halting.\n"
        la $a0 __m4_
        syscall
        li $v0 10               # Exit on all but overflow exceptions
        syscall                 # syscall 10 (exit)
ret:    li $v0 4                # Print "Continuing. . .\n"
        la $a0 __m3_
        syscall
        lw $v0 s1
        lw $a0 s2
        addiu $k1 $k1 4         # Return to next instruction
        rfe                     # Return from exception handler
        jr $k1

        .text
        .globl __start
© C. D. Cantrell (10/2011)
EXCEPTIONS IN A PIPELINED PROCESSOR
• Five instructions are active in any given clock period
. Multiple exceptions can occur simultaneously
. If execution is not stopped soon enough, the value in the register that helped cause the exception may be overwritten in the WB stage
. To flush the instructions that follow the instruction that caused the exception, we add two new signals, ID.Flush and EX.Flush
. ID.Flush is ORed with the stall signal from the hazard detection unit to flush an instruction during its ID stage
. To flush an instruction in its EX stage, we add an input to the PC multiplexor that sends 0x80000080 to the PC
[Figure: DATAPATH WITH CONTROLS TO HANDLE EXCEPTIONS (adds the ID.Flush and EX.Flush signals, the Cause and Except PC registers, and a PC multiplexor input of 80000080)]
© C. D. Cantrell (05/1999)
A PIPELINED EXCEPTION
• We’ll follow what happens in the instruction sequence
[40000040] sub $11, $2, $4
[40000044] and $12, $2, $5
[40000048] or $13, $2, $6
[4000004c] add $1, $2, $1 # overflow exception occurs here
[40000050] slt $15, $6, $7
[40000054] lw $16, 50($7)
. . .
given that the instructions to execute when an exception occurs are
[80000080] lui $1, -28672 # -28672 (base 10) = 0x9000
[80000084] sw $2, 592($1) # sw $v0 s1
. . .
© C. D. Cantrell (10/2010)
A PIPELINED EXCEPTION
• The value of the PC when an instruction is issued is part of the instruction’s control state, and must be passed along in pipeline registers
! Otherwise, you’d never know exactly where you were clobbered
! Some architectures have imprecise exceptions (e.g., the IBM 360/91)
◦ In these cases, it’s usually the address that is in the PC when the exception occurs that is reported, not the address of the instruction that actually caused the exception
◦ This is especially annoying when a branch is taken immediately after the exception-causing instruction!
• In the example, an integer overflow occurs, asserting the Overflow signal
! The Overflow signal must be routed to the Control block, which then asserts the EX.Flush, ID.Flush and IF.Flush signals, and asserts a control signal that causes the PC Source multiplexor to load 0x80000080 into the PC
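The control response described above can be sketched in C as a behavioral model. This is only an illustration under stated assumptions: the type and function names (FlushControl, ExceptionState, control_on_overflow, clock_edge) are hypothetical, the Cause code 12 is the MIPS32 overflow code rather than a value given on the slides, and the model stores the faulting instruction's own address in ExceptPC, which real hardware may adjust.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HANDLER_ADDRESS 0x80000080u /* exception vector used on these slides */
#define CAUSE_OVERFLOW  12u         /* assumption: MIPS32 ExcCode for overflow */

/* Control outputs for one clock period (field names are illustrative). */
typedef struct {
    bool if_flush;       /* squash the instruction being fetched */
    bool id_flush;       /* zero the control bits of the instruction in ID */
    bool ex_flush;       /* zero the control bits of the instruction in EX */
    bool pc_from_vector; /* PC multiplexor selects HANDLER_ADDRESS */
} FlushControl;

/* State that changes when an exception is taken. */
typedef struct {
    uint32_t pc;
    uint32_t except_pc;  /* ExceptPC register */
    uint32_t cause;      /* Cause register (exception code only) */
} ExceptionState;

/* Combinational part: the ALU's Overflow signal drives every flush. */
FlushControl control_on_overflow(bool overflow)
{
    FlushControl c = { overflow, overflow, overflow, overflow };
    return c;
}

/* Clocked part: update PC, ExceptPC and Cause at the end of a clock period.
 * ex_stage_pc is the instruction address carried along in the ID/EX
 * pipeline register. Branches and stalls are ignored in this sketch. */
void clock_edge(ExceptionState *s, FlushControl c, uint32_t ex_stage_pc)
{
    assert(s != NULL);
    if (c.pc_from_vector) {
        s->except_pc = ex_stage_pc;
        s->cause = CAUSE_OVERFLOW;
        s->pc = HANDLER_ADDRESS;
    } else {
        s->pc += 4;
    }
}
```

Calling clock_edge with the flush controls asserted while the add at 0x4000004c is in EX leaves the PC at 0x80000080, so the lui at that address is fetched in the next clock period.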
(Figure, clock 5: lw $16, 50($7) is in IF, slt $15, $6, $7 in ID, add $1, $2, $1 in EX, or $13, . . . in MEM, and and $12, . . . in WB. The ALU asserts Overflow, which goes to Cause and Control, and the PC multiplexor selects 0x80000080.)
(Figure, clock 6: lui $1, -28672 is fetched from 0x80000080, with PC + 4 = 0x80000084; the ID, EX and MEM stages hold bubbles (nops); or $13, . . . is in WB.)
PIPELINED DATAPATH WITH CONTROL
(Figure: the pipelined datapath with its control signals RegWrite, ALUOp, ALUSrc, RegDst, MemWrite, MemRead, MemtoReg and Branch, plus the hazard detection unit, forwarding unit and exception hardware.)
© C. D. Cantrell (10/2010)
PIPELINE ENHANCEMENTS
• Design a deeper pipeline (many stages)
! The Pentium 4 “Willamette” pipeline had 20 stages; the “Prescott”, 31
◦ This permitted high clock frequencies ⇒ high power consumption
! The Core 2 Duo pipeline has 14 stages
• Issue more than one instruction per clock period (“superscalar” architecture)
! Theoretically, this divides the CPI by the number issued per clock
! We will study the modifications needed for issuing 2 instructions/clock
◦ Double the number of ALUs and read/write ports on the register file
• Schedule the pipeline dynamically
! Find useful instructions to schedule during a stall
! Major pipeline units:
◦ Instruction fetch/issue unit
◦ Execution unit
◦ Commit unit
© C. D. Cantrell (10/2010)
DYNAMIC PIPELINE SCHEDULING (1)
• Major limitation of the statically scheduled pipeline that we have studied: in-order instruction issue and execution
! This permits head-of-line blocking of instructions that could execute
• One approach is to allow in-order issue and out-of-order execution
! For in-order issue, the IF/ID unit must check for structural hazards
! For out-of-order execution:
◦ Need multiple functional units
¶ Execution occurs whenever there are no data dependences or hazards
◦ A new kind of unit, a scoreboard, must check for data hazards
! Out-of-order completion means that there are imprecise exceptions
! This is the design approach used in the CDC 6600
© C. D. Cantrell (10/2010)
BOOKKEEPING FOR DYNAMIC SCHEDULING
• The scoreboarding technique was introduced in the CDC 6600 (1963)
! Goal: Maintain a low CPI by executing as early as possible
! The scoreboard maintains several status tables
◦ Status of each instruction: Issued, Operands Read, Execution Completed, Results Written
◦ Once an instruction has issued, the functional unit table keeps a record of the operands
¶ Functional unit status: Busy, Operation Underway, Destination Register Name, Source Register Names, Units Producing Source Register Operands, Flags (indicating when the source register operands are ready)
◦ Register result status
¶ Indicates which functional unit will write to the register
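The status tables listed above map naturally onto a pair of C structures. The following is a minimal sketch rather than the CDC 6600's actual logic: the sizes, the type and function names (UnitStatus, Scoreboard, scoreboard_issue, scoreboard_write_result) and the integer encoding of operations are all assumptions, and the write-after-read check that a full scoreboard performs before writing results is omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_UNITS 4    /* illustrative sizes, not the CDC 6600's */
#define NUM_REGS  32
#define NO_UNIT  (-1)

/* Functional unit status: one entry per functional unit. */
typedef struct {
    bool busy;
    int  op;             /* operation underway */
    int  dest;           /* destination register name */
    int  src1, src2;     /* source register names */
    int  prod1, prod2;   /* units producing the sources, or NO_UNIT */
    bool ready1, ready2; /* flags: source operands ready */
} UnitStatus;

typedef struct {
    UnitStatus unit[NUM_UNITS];
    int result_unit[NUM_REGS]; /* register result status */
} Scoreboard;

void scoreboard_init(Scoreboard *sb)
{
    for (int u = 0; u < NUM_UNITS; u++) {
        sb->unit[u].busy = false;
        sb->unit[u].prod1 = NO_UNIT;
        sb->unit[u].prod2 = NO_UNIT;
    }
    for (int r = 0; r < NUM_REGS; r++)
        sb->result_unit[r] = NO_UNIT;
}

/* Issue to unit u unless the unit is busy (structural hazard) or another
 * active instruction will write the same destination register. */
bool scoreboard_issue(Scoreboard *sb, int u, int op, int dest, int src1, int src2)
{
    assert(u >= 0 && u < NUM_UNITS);
    if (sb->unit[u].busy || sb->result_unit[dest] != NO_UNIT)
        return false;
    UnitStatus *fu = &sb->unit[u];
    fu->busy = true;
    fu->op = op;
    fu->dest = dest;
    fu->src1 = src1;
    fu->src2 = src2;
    fu->prod1 = sb->result_unit[src1]; /* look up producers before claiming dest */
    fu->prod2 = sb->result_unit[src2];
    fu->ready1 = (fu->prod1 == NO_UNIT);
    fu->ready2 = (fu->prod2 == NO_UNIT);
    sb->result_unit[dest] = u;
    return true;
}

/* Unit u writes its result: wake up waiting operands and free the unit. */
void scoreboard_write_result(Scoreboard *sb, int u)
{
    assert(u >= 0 && u < NUM_UNITS && sb->unit[u].busy);
    for (int i = 0; i < NUM_UNITS; i++) {
        if (sb->unit[i].prod1 == u) { sb->unit[i].prod1 = NO_UNIT; sb->unit[i].ready1 = true; }
        if (sb->unit[i].prod2 == u) { sb->unit[i].prod2 = NO_UNIT; sb->unit[i].ready2 = true; }
    }
    sb->result_unit[sb->unit[u].dest] = NO_UNIT;
    sb->unit[u].busy = false;
}
```

An instruction may begin execution once both of its ready flags are set, which is how the scoreboard lets independent instructions run ahead of a stalled one.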
© C. D. Cantrell (10/2010)
PIPELINE DEPTH vs. SPEEDUP
• The graph on the following page shows the speedup achieved by increasing the number of stages, assuming:
! Constant clock frequency
! A single instruction queue with in-order issue and completion
• The speedup achieved under these circumstances is much less than the number of stages
! This is a result of data and control hazards ⇒ pipeline stalls
• In reality, increasing the number of stages so that there are fewer levels of logic per stage makes it possible to increase the clock frequency
! This shifts the curve toward the right (higher number of stages)
• To achieve a much higher speedup than is shown in the graph, designers resort to multiple issue and dynamic scheduling
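The gap between speedup and stage count can be made concrete with the standard textbook relation for an N-stage pipeline. It is offered as a sketch, not as the model behind the graph. It assumes the same clock period with and without pipelining (as on this slide), an unpipelined machine that needs N clock periods per instruction, an ideal pipelined CPI of 1, and an average of s stall cycles per instruction; the symbols N and s are introduced here.

```latex
\[
\text{Speedup}
  = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}}
  = \frac{N}{1 + s}
\]
```

For instance, with N = 8 and a hypothetical s = 1.5, the speedup is 8/2.5 = 3.2, well below the number of stages.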
PIPELINE DEPTH vs. SPEEDUP
(Graph: relative performance, on a scale from 0.0 to 3.0, versus pipeline depth of 1, 2, 4, 8 and 16 stages.)
© C. D. Cantrell (10/2011)
SUPERSCALAR EXAMPLE
• We’ll follow what happens in the instruction sequence
[40000040] lw $8, 0($17) # $17=pointer
[40000044] addu $8, $8, $18 # $8=array element
[40000048] sw $8, 0($17)
[4000004c] addi $17, $17, -4 # decrement pointer
[40000050] bne $17, $19, -5
assuming a static 2-issue MIPS pipeline
. Note that a comparison with $0 would be invalid, since a null pointer is not a valid address (this corrects the code in the textbook)
• The new hardware for the example is a second pipeline for data transfer (load and store) instructions
. The original pipeline is used for ALU and branch instructions
SUPERSCALAR DATAPATH
(Figure: two-issue datapath with a second sign-extend unit and a second ALU that performs the address calculation for data transfer instructions.)
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition
SUPERSCALAR EXAMPLE: SCHEDULING
      | ALU or branch instruction | Data transfer instruction | Clock cycle
Loop: |                           | lw $t0, 0($s1)            | 1
      | addi $s1,$s1,-4           |                           | 2
      | addu $t0,$t0,$s2          |                           | 3
      | bne $s1,$s3,Loop          | sw $t0, 4($s1)            | 4
• The resulting CPI is 0.8 instead of the theoretical value, 0.5
! Speedup = 1.25 instead of 2.0
• The problem is that we are taking 4 clocks to execute 5 instructions
! Two of these instructions (addi and bne) are loop overhead
• In some memory architectures there may be a hardware conflict between the store in one loop instance and the load in the next instance
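The CPI and speedup figures quoted above follow directly from the schedule: 5 instructions issue in 4 clock periods. A small C helper, with hypothetical function names and with speedup measured against a single-issue pipeline whose CPI is assumed to be exactly 1, makes the arithmetic explicit.

```c
#include <assert.h>

/* Cycles per instruction for a loop body that issues `instructions`
 * instructions in `clocks` clock periods. */
double loop_cpi(int clocks, int instructions)
{
    assert(clocks > 0 && instructions > 0);
    return (double)clocks / (double)instructions;
}

/* Speedup relative to a single-issue pipeline with CPI = 1 (an assumption:
 * no stalls in the single-issue baseline). */
double speedup_vs_single_issue(int clocks, int instructions)
{
    assert(clocks > 0 && instructions > 0);
    return (double)instructions / (double)clocks;
}
```

With clocks = 4 and instructions = 5 these return a CPI of 0.8 and a speedup of 1.25, the values on the slide.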
© C. D. Cantrell (10/2010)
LOOP UNROLLING
• The goal of loop unrolling is to minimize the performance impact of loop overhead by executing several instances of the loop for one set of overhead instructions
! If a loop is unrolled in hardware, then the original register targets of data-transfer and computational instructions must be renamed
! Register renaming is especially important in executing x86 instructions, because there are very few general-purpose x86 architectural registers
◦ The architectural registers can be renamed into a larger set of physical registers
! Loops can also be unrolled in software
◦ Compiler unrolling (often performed by optimizing compilers)
◦ Unrolling in a higher-level language
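Renaming architectural registers into a larger physical set can be sketched with a map table. This is a deliberately simplified illustration: the names (RenameTable, PhysOperands, rename_instruction), the table sizes and the allocator that never recycles physical registers are all assumptions, not a description of any particular processor.

```c
#include <assert.h>

#define NUM_ARCH 8   /* e.g. the eight 32-bit x86 general-purpose registers */
#define NUM_PHYS 32  /* illustrative size of the physical register file */

typedef struct {
    int map[NUM_ARCH]; /* architectural register -> current physical register */
    int next_free;     /* simplistic allocator: physical registers are never recycled */
} RenameTable;

typedef struct { int dest, src1, src2; } PhysOperands;

void rename_init(RenameTable *t)
{
    for (int r = 0; r < NUM_ARCH; r++)
        t->map[r] = r;
    t->next_free = NUM_ARCH;
}

/* Rename one instruction of the form dest = src1 op src2. */
PhysOperands rename_instruction(RenameTable *t, int dest, int src1, int src2)
{
    assert(t->next_free < NUM_PHYS);
    PhysOperands p;
    p.src1 = t->map[src1]; /* read the sources before remapping the destination */
    p.src2 = t->map[src2];
    p.dest = t->next_free++;
    t->map[dest] = p.dest;
    return p;
}
```

Two loop instances that both write architectural register 0 receive different physical destinations, so the second instance does not have to wait for the first to finish with the register.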
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition
SUPERSCALAR EXAMPLE: LOOP UNROLLING
      | ALU or branch instruction | Data transfer instruction | Clock cycle
Loop: | addi $s1,$s1,-16          | lw $t0, 0($s1)            | 1
      |                           | lw $t1,12($s1)            | 2
      | addu $t0,$t0,$s2          | lw $t2, 8($s1)            | 3
      | addu $t1,$t1,$s2          | lw $t3, 4($s1)            | 4
      | addu $t2,$t2,$s2          | sw $t0, 16($s1)           | 5
      | addu $t3,$t3,$s2          | sw $t1,12($s1)            | 6
      |                           | sw $t2, 8($s1)            | 7
      | bne $s1,$s3,Loop          | sw $t3, 4($s1)            | 8
• The loop in this example is unrolled to a depth of 4
• The resulting CPI is 0.57, much closer to the theoretical value of 0.5
! Speedup = 1.75, much closer to 2.0
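The figures above can be checked by counting the unrolled loop body: 4 lw, 4 addu, 4 sw, 1 addi and 1 bne give 14 instructions issued in 8 clock cycles. The speedup is taken relative to a single-issue pipeline with an assumed CPI of 1.

```latex
\[
\text{CPI} = \frac{8\ \text{clock cycles}}{14\ \text{instructions}} \approx 0.57,
\qquad
\text{Speedup} = \frac{1}{\text{CPI}} = \frac{14}{8} = 1.75
\]
```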
© C. D. Cantrell (10/2010)
LOOP UNROLLING IN SOFTWARE
• Computation of one component of a matrix-vector product y = Ax in C:
for (j=0; j<n; j++){
    y[i] = y[i] + a[i][j] * x[j];
}
! The vector y must be initialized to 0 in a previous loop
• The same loop, unrolled to a depth of 4:
for (j=3; j<n; j+=4){   /* start at j=3 so that a[i][j-3] begins at a[i][0] */
    y[i] = (((y[i] + a[i][j-3]*x[j-3]) + a[i][j-2]*x[j-2])
           + a[i][j-1]*x[j-1]) + a[i][j]*x[j];
}
! The programmer has to ensure that n is a multiple of 4
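The requirement that n be a multiple of 4 can be lifted with a short cleanup loop. The sketch below is one common way to do that, not the slides' own code: the function names dot_rolled and dot_unrolled4 are hypothetical, and a single row of the matrix is passed as a flat array to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one multiply-add per loop iteration. */
double dot_rolled(const double *a_row, const double *x, size_t n)
{
    assert(a_row != NULL && x != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        sum += a_row[j] * x[j];
    return sum;
}

/* Unrolled to a depth of 4, with a cleanup loop for the last n % 4 terms. */
double dot_unrolled4(const double *a_row, const double *x, size_t n)
{
    assert(a_row != NULL && x != NULL);
    double sum = 0.0;
    size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        /* Same left-to-right summation order as the rolled loop. */
        sum = (((sum + a_row[j] * x[j]) + a_row[j + 1] * x[j + 1])
               + a_row[j + 2] * x[j + 2]) + a_row[j + 3] * x[j + 3];
    }
    for (; j < n; j++)  /* at most 3 leftover iterations */
        sum += a_row[j] * x[j];
    return sum;
}
```

Because the additions happen in the same order in both versions, the two functions return identical floating-point results, so y[i] = dot_unrolled4(a[i], x, n) can replace the inner loop for any n.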
© C. D. Cantrell (10/2010)
DYNAMIC PIPELINE SCHEDULING (2)
• When several instructions are issued in a clock period, it is possible to reorder the executions to minimize pipeline stalls
• There are multiple pipelines, built from the following major types of unit:
! Instruction fetch/instruction decode unit
! Reservation stations
◦ These are buffers that hold the instructions’ operands and control state
! Integer and floating-point out-of-order execution units
◦ Execution occurs whenever there are no data dependences or hazards
! Commit unit
◦ Maintains a reorder buffer
◦ Buffers the results of execution until it is safe to write the results
◦ The reorder buffer can also provide operands, like the forwarding units in a statically scheduled pipeline
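The commit unit's reorder buffer can be modeled as a circular queue: entries are allocated in program order, marked done in any order, and retired only from the head. The C sketch below illustrates that idea under assumptions of its own; the names (RobEntry, ReorderBuffer, rob_allocate, rob_complete, rob_commit), the buffer size and the integer register file are hypothetical, and operand forwarding from the buffer is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_SIZE 8 /* illustrative capacity */

typedef struct {
    int  dest;  /* destination register */
    int  value; /* result, valid once done is set */
    bool done;
} RobEntry;

typedef struct {
    RobEntry e[ROB_SIZE];
    int head;  /* oldest instruction still in flight */
    int count; /* number of occupied entries */
} ReorderBuffer;

void rob_init(ReorderBuffer *r)
{
    r->head = 0;
    r->count = 0;
}

/* In-order issue: claim the next slot; returns -1 if the buffer is full. */
int rob_allocate(ReorderBuffer *r, int dest)
{
    if (r->count == ROB_SIZE)
        return -1;
    int slot = (r->head + r->count) % ROB_SIZE;
    r->e[slot].dest = dest;
    r->e[slot].value = 0;
    r->e[slot].done = false;
    r->count++;
    return slot;
}

/* Out-of-order completion: a functional unit deposits its result. */
void rob_complete(ReorderBuffer *r, int slot, int value)
{
    assert(slot >= 0 && slot < ROB_SIZE);
    r->e[slot].value = value;
    r->e[slot].done = true;
}

/* In-order commit: write finished results to the register file, stopping
 * at the first instruction that has not finished. Returns the number of
 * instructions committed. */
int rob_commit(ReorderBuffer *r, int regs[])
{
    int committed = 0;
    while (r->count > 0 && r->e[r->head].done) {
        regs[r->e[r->head].dest] = r->e[r->head].value;
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        committed++;
    }
    return committed;
}
```

A younger instruction that finishes first stays buffered until every older instruction has committed, which is what keeps the architectural state precise.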
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
A DYNAMICALLY SCHEDULED PIPELINE
(Figure: the instruction fetch and decode unit issues instructions in order to reservation stations; these feed the integer, floating-point and load/store functional units, which execute out of order; the commit unit then commits results in order.)
© C. D. Cantrell (10/2010)
TOMASULO’S ALGORITHM (1)
• Focuses on floating-point execution
! Originally designed for the IBM 360/91, with long memory-access and floating-point execution times
! Can support overlapping execution of multiple loop instances
• Tomasulo addressed limitations of the scoreboard approach
! Hazard detection and control of execution are distributed to the reservation stations
! Results are forwarded directly to the functional units instead of going through registers
◦ Results are broadcast on a common data bus
IMPLEMENTATION OF TOMASULO’S ALGORITHM
(Figure: a floating-point operation queue fed from the instruction unit, load buffers fed from memory, the FP registers, store buffers leading to memory, reservation stations in front of the FP adders and FP multipliers, the operand and operation buses, and the common data bus (CDB).)
© C. D. Cantrell (10/2010)
TOMASULO’S ALGORITHM (2)
• Control fields in each reservation station:
! Operation
! The IDs of the reservation stations that will produce the operands
◦ The reservation stations can rename registers
◦ This enables overlapping different loop iterations
! The values of the operands
◦ Note that values are available sooner than if the functional units had to contend for access to write to a register
! A “busy” flag
• Control fields for each register and store buffer:
! The number of the functional unit that will produce the value to be written
! A “busy” flag
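The reservation-station fields listed above, together with the common data bus broadcast, can be sketched in C. This is an illustration under stated assumptions, not the IBM 360/91 design: the field names qj, qk, vj and vk follow common textbook notation, tag 0 is chosen here to mean that the operand value is already present, and the function names rs_ready and cdb_broadcast are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_TAG 0 /* assumption: tag 0 means the operand value is already present */

typedef struct {
    bool   busy;
    int    op;     /* operation to perform */
    int    qj, qk; /* IDs of the stations that will produce the operands, or NO_TAG */
    double vj, vk; /* operand values, valid when the matching tag is NO_TAG */
} ReservationStation;

/* A station may start executing once both operands have arrived. */
bool rs_ready(const ReservationStation *rs)
{
    return rs->busy && rs->qj == NO_TAG && rs->qk == NO_TAG;
}

/* Common data bus: broadcast (tag, value) to every station at once, so
 * waiting operands are filled in without a trip through the register file. */
void cdb_broadcast(ReservationStation rs[], int n, int tag, double value)
{
    assert(tag != NO_TAG);
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy)
            continue;
        if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = NO_TAG; }
        if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = NO_TAG; }
    }
}
```

Because an operand is identified by the tag of its producing station rather than by a register name, two loop iterations that reuse the same architectural register wait on different tags, which is the renaming effect noted above.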
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 3rd Edition
PIPELINING IMPROVES THROUGHPUT
(Figure: clock rate, from slower to faster, plotted against instruction throughput in instructions per clock cycle, or 1/CPI, from slower to faster. The points shown are the single-cycle, multicycle, pipelined, deeply pipelined, multiple-issue pipelined, and multiple-issue with deep pipeline datapaths.)
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 3rd Edition
PIPELINING DOES NOT IMPROVE LATENCY
(Figure: hardware, from shared to specialized, plotted against clock cycles of latency for an instruction, from 1 to several. The points shown are the single-cycle, multicycle, pipelined, deeply pipelined, multiple-issue pipelined, and multiple-issue with deep pipeline datapaths.)
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition
MICROPROCESSOR PIPELINES
Microprocessor | Year | Clock Rate | Pipeline Stages | Issue Width | Out-of-Order/Speculation | Cores/Chip | Power
Intel 486 | 1989 | 25 MHz | 5 | 1 | No | 1 | 5 W
Intel Pentium | 1993 | 66 MHz | 5 | 2 | No | 1 | 10 W
Intel Pentium Pro | 1997 | 200 MHz | 10 | 3 | Yes | 1 | 29 W
Intel Pentium 4 Willamette | 2001 | 2000 MHz | 20 | 3 | Yes | 1 | 75 W
Intel Pentium 4 Prescott | 2004 | 3600 MHz | 31 | 3 | Yes | 1 | 103 W
Intel Core | 2006 | 2930 MHz | 14 | 4 | Yes | 2 | 75 W
Sun UltraSPARC III | 2003 | 1950 MHz | 14 | 4 | No | 1 | 90 W
Sun UltraSPARC T1 (Niagara) | 2005 | 1200 MHz | 6 | 1 | No | 8 | 70 W
© Intel
PENTIUM 4 “WILLAMETTE” CHIP LAYOUT
(Figure: chip layout showing the 400 MHz system bus, advanced transfer cache, hyperpipeline (20 stages), enhanced floating point & multimedia unit, execution trace cache, rapid execution engine, advanced dynamic execution (A.D.E.) blocks, and data cache.)
© C. D. Cantrell (09/2010)
PENTIUM 4 FUNCTIONAL UNITS
• “400 MHz” system bus
! 100 MHz clock, quad-pumped (4 data transfers per clock period)
• Advanced transfer cache (L2 cache, 256 kB, instructions + data)
• Execution trace cache
! L1 instruction cache; stores decoded CISC instructions
• Hyperpipelined unit
! Used for uniform-length microinstructions (“micro-ops,” in Intel-speak)
• Enhanced floating-point & multimedia unit
• Rapid execution engine
! Parallel, partly double-clocked execution of microinstructions
• Advanced dynamic execution
! Deep, out-of-order speculative execution & branch prediction
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition
AMD OPTERON X4 MICROARCHITECTURE
(Figure: instruction cache; instruction prefetch and decode with branch prediction; RISC-operation queue; dispatch and register renaming; register file; integer and floating-point operation queue; three integer ALUs, one with a multiplier; floating-point adder/SSE, multiplier/SSE and misc units; load/store queue; data cache; commit unit.)
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 4th Edition
AMD OPTERON X4 PIPELINE
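(Figure: pipeline stages with the number of clock cycles spent in each. The stages and buffers shown are instruction fetch, RISC-operation queue, decode and translate, reorder buffer allocation + register renaming, reorder buffer, scheduling + dispatch unit, execution, and data cache/commit; the clock-cycle counts read 3, 2, 2, 2, 2, 1.)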
(Figure: pipelined datapath with control, annotated with the control-signal values held in the ID/EX, EX/MEM and MEM/WB pipeline registers for the instructions in the IF, ID, EX, MEM and WB stages.)