CPSC614Lec 2.1
Prof. E. J. Kim
CPSC 614: Graduate Computer Architecture
Hazards (Chap. 3)
CPSC614Lec 2.2
Chapter 3
Instruction-Level Parallelism and
Its Dynamic Exploitation
CPSC614Lec 2.3
Instruction-Level Parallelism (ILP)
• Instructions are evaluated in parallel.
• Pipelining
• Two approaches to exploiting ILP
  – Hardware-dependent (chapter 3)
    » Intel Pentium 3 & 4, Athlon, MIPS R10000/12000, Sun UltraSPARC III, PowerPC, …
  – Software-dependent (chapter 4)
    » IA-64, Intel Itanium, embedded processors
CPSC614Lec 2.4
Pipeline CPI =
Ideal CPI + Structural stalls
+ Data hazard stalls + Control stalls
CPSC614Lec 2.5
Techniques to Decrease Pipeline CPI
• Forwarding and Bypassing
• Delayed Branches and Simple Branch Scheduling
• Basic Dynamic Scheduling (Scoreboarding)
• Dynamic Scheduling with Renaming
• Dynamic Branch Prediction
• Issuing Multiple Instructions per Cycle
• Speculation
• Dynamic Memory Disambiguation
CPSC614Lec 2.6
Techniques to Decrease Pipeline CPI
• Loop Unrolling
• Basic Compiler Pipeline Scheduling
• Compiler Dependence Analysis
• Software Pipelining, Trace Scheduling
• Compiler Speculation
CPSC614Lec 2.7
Data Dependences
• If two instructions are parallel, they can execute simultaneously.
• If two instructions are dependent, they must be executed in order.
• How do we determine whether an instruction is dependent on another instruction?
CPSC614Lec 2.8
Data Dependences
• Data dependences (true data dependences)
• Name dependences
• Control dependences
• An instruction j is data dependent on instruction i if either
  – i produces a result that may be used by j, or
  – j is data dependent on instruction k, and k is data dependent on i.
CPSC614Lec 2.9
Loop: L.D     F0, 0(R1)     ;F0=array element
      ADD.D   F4, F0, F2    ;add scalar in F2
      S.D     F4, 0(R1)     ;store the result
      DADDUI  R1, R1, #-8   ;decrement pointer 8 bytes
      BNE     R1, R2, Loop  ;branch R1 != R2
Floating-point data dependences: L.D to ADD.D (through F0), ADD.D to S.D (through F4)
Integer data dependence: DADDUI to BNE (through R1)
CPSC614Lec 2.10
• A dependence
  – indicates the possibility of a hazard,
  – determines the order in which results must be calculated, and
  – sets an upper bound on how much parallelism can possibly be exploited.
CPSC614Lec 2.11
How to Overcome a Dependence
• Maintaining the dependence but avoiding a hazard
• Eliminating a dependence by transforming the code
CPSC614Lec 2.12
Name Dependence
• Occurs when two instructions use the same register or memory location (name), but there is no flow of data between the instructions associated with that name.
• When i precedes j in program order:
  – Antidependence: Instruction j writes a register or memory location that instruction i reads.
  – Output dependence: Instructions i and j write the same register or memory location.
CPSC614Lec 2.13
Register Renaming
• Instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. (Especially for register operands)
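The renaming idea can be illustrated with a minimal Python sketch. It assumes a simplified model that is not in the slides: each instruction is an `(op, dest, src1, src2)` tuple, and fresh register names `P0`, `P1`, ... are hypothetical physical registers handed out one per write.

```python
from itertools import count

def rename(instructions):
    """Give every destination a fresh name; sources read the latest mapping.

    instructions: list of (op, dest, src1, src2) tuples with architectural names.
    Returns the list rewritten with hypothetical physical names P0, P1, ...
    """
    fresh = (f"P{n}" for n in count())
    mapping = {}      # architectural name -> current physical name
    renamed = []
    for op, dest, src1, src2 in instructions:
        # Sources use whichever physical register currently holds the value;
        # names never written in this sequence keep their architectural name.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # Each write gets a new name, so it cannot clash with an earlier
        # reader (antidependence) or an earlier writer (output dependence).
        mapping[dest] = next(fresh)
        renamed.append((op, mapping[dest], s1, s2))
    return renamed

program = [
    ("sub", "r4", "r1", "r3"),   # reads r1
    ("add", "r1", "r2", "r3"),   # writes r1: antidependence with the sub
    ("mul", "r6", "r1", "r7"),   # reads the new r1: true dependence on the add
]
for instr in rename(program):
    print(instr)
```

After renaming, the `add` writes `P1` rather than `r1`, so it no longer conflicts with the `sub` that reads the old `r1`, while the `mul` still reads the value the `add` produces.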
CPSC614Lec 2.14
Data Hazards
• Goal: Preserve the program order only where it affects the outcome of the program to maximize ILP.
• When instruction i occurs before instruction j in program order,
  – RAW (Read after Write): j tries to read a source before i writes it.
  – WAW (Write after Write): j tries to write an operand before it is written by i.
  – WAR (Write after Read): j tries to write a destination before it is read by i.
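The three definitions above can be checked mechanically. The following Python sketch assumes a simplified representation that is not in the slides: each instruction is a `(dest, sources)` pair, and the function reports which hazards are possible when `i` precedes `j`. Whether a hazard actually occurs depends on the pipeline.

```python
def classify_hazards(i, j):
    """Return the set of potential hazards when i precedes j in program order.

    Each instruction is a (dest, sources) pair, e.g. ("r1", {"r2", "r3"}).
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = set()
    if i_dest in j_srcs:
        hazards.add("RAW")   # j reads a name that i writes
    if j_dest in i_srcs:
        hazards.add("WAR")   # j writes a name that i reads
    if i_dest == j_dest:
        hazards.add("WAW")   # i and j write the same name
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r3
print(classify_hazards(("r1", {"r2", "r3"}), ("r4", {"r1", "r3"})))
# sub r4,r1,r3 followed by add r1,r2,r3
print(classify_hazards(("r4", {"r1", "r3"}), ("r1", {"r2", "r3"})))
# sub r1,r4,r3 followed by add r1,r2,r3
print(classify_hazards(("r1", {"r4", "r3"}), ("r1", {"r2", "r3"})))
```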
CPSC614Lec 2.15
Control Dependence
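• Caused by branch instructions
• An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
• An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch.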
CPSC614Lec 2.16
• Control dependence is not the critical property that must be preserved.
• We can violate the control dependences, if we can do so without affecting the correctness of the program. (e.g. branch prediction)
CPSC614Lec 2.17
5 Steps of MIPS Datapath
[Figure: five-stage MIPS datapath. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Components: Memory, Adder (+4), Next SEQ PC, Next PC, Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU, Zero?, data Memory, WB Data. Legend: Datapath, Control Path.]
CPSC614Lec 2.18
5 Steps of MIPS Datapath
[Figure: the same five-stage MIPS datapath, annotated with Inst 1, Inst 2, and Inst 3 occupying the pipeline stages.]
CPSC614Lec 2.19
Review: Visualizing Pipelining
Figure 3.3, Page 133, CA:AQA 2e
[Figure: pipeline diagram, instruction order vs. time (clock cycles 1 to 7); each of four instructions passes through Ifetch, Reg, ALU, DMem, Reg, offset by one cycle.]
CPSC614Lec 2.20
Limits to pipelining
• Hazards: circumstances that would cause incorrect execution if next instruction were launched
– Structural hazards: Attempting to use the same hardware to do two different things at the same time
– Data hazards: Instruction depends on result of prior instruction still in the pipeline
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
CPSC614Lec 2.21
Example: One Memory Port/Structural Hazard
Figure 3.6, Page 142 , CA:AQA 2e
[Figure: pipeline diagram, instruction order vs. time (clock cycles 1 to 7), for Load, Instr 1, Instr 2, Instr 3, Instr 4; each passes through Ifetch, Reg, ALU, DMem, Reg. The structural hazard is marked where the Load's DMem access coincides with a later instruction's Ifetch.]
CPSC614Lec 2.22
Resolving structural hazards
• Defn: attempt to use same hardware for two different things at the same time
• Solution 1: Wait
  – must detect the hazard
  – must have mechanism to stall
• Solution 2: Throw more hardware at the problem
CPSC614Lec 2.23
Detecting and Resolving Structural Hazard
[Figure: pipeline diagram, instruction order vs. time (clock cycles 1 to 7), for Load, Instr 1, Instr 2, Stall, Instr 3; the stall appears as a row of bubbles that delays Instr 3 by one cycle.]
CPSC614Lec 2.24
Eliminating Structural Hazards at Design Time
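[Figure: five-stage MIPS datapath with a separate Instr Cache and Data Cache in place of the single memory. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Components: Adder (+4), Next SEQ PC, Next PC, Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU, Zero?, WB Data. Legend: Datapath, Control Path.]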
CPSC614Lec 2.25
Data Hazards
Figure 3.9, page 147, CA:AQA 2e
[Figure: pipeline diagram, instruction order vs. time (clock cycles); stages IF, ID/RF, EX, MEM, WB (Ifetch, Reg, ALU, DMem, Reg) for the following instructions:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
CPSC614Lec 2.26
Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “Data Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
CPSC614Lec 2.27
Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5
CPSC614Lec 2.28
Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes operand before InstrI writes it.
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• Will see WAR and WAW in later more complicated pipes
CPSC614Lec 2.29
Forwarding to Avoid Data Hazard
Figure 3.10, Page 149, CA:AQA 2e
[Figure: pipeline diagram, instruction order vs. time (clock cycles); stages Ifetch, Reg, ALU, DMem, Reg for the following instructions, with forwarding paths carrying r1:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
CPSC614Lec 2.30
HW Change for Forwarding
Figure 3.20, Page 161, CA:AQA 2e
[Figure: forwarding hardware. Components: ID/EX, EX/MEM, and MEM/WR pipeline registers; Registers; NextPC; Immediate; muxes; ALU; Data Memory.]
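The selection made by the forwarding muxes can be sketched in Python. This is a simplified model under stated assumptions, not the figure's actual control logic: the dictionary fields `reg_write` and `rd` are hypothetical names for "this latch holds an instruction that writes a register" and "its destination register", and `r0` is treated as the hardwired-zero register that is never forwarded.

```python
def forward_source(src_reg, ex_mem, mem_wb):
    """Choose where an ALU operand should come from.

    ex_mem, mem_wb: dicts with hypothetical fields
      "reg_write": True if the instruction in that latch writes a register
      "rd":        name of its destination register (or None)
    Returns "EX/MEM", "MEM/WB", or "REGFILE".
    """
    if src_reg == "r0":
        return "REGFILE"                 # r0 is always zero; nothing to forward
    # The most recent producer wins, so check the EX/MEM latch first.
    if ex_mem["reg_write"] and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"                     # no in-flight producer

# sub r4,r1,r3 is in EX while add r1,r2,r3 sits in the EX/MEM latch.
add_in_ex_mem = {"reg_write": True, "rd": "r1"}
empty = {"reg_write": False, "rd": None}
print(forward_source("r1", add_in_ex_mem, empty))   # operand r1
print(forward_source("r3", add_in_ex_mem, empty))   # operand r3
```

Checking the EX/MEM latch before the later one matters when two in-flight instructions write the same register: the younger result is the one the consumer must see.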
CPSC614Lec 2.31
Data Hazard Even with Forwarding
Figure 3.12, Page 153, CA:AQA 2e
[Figure: pipeline diagram, instruction order vs. time (clock cycles); stages Ifetch, Reg, ALU, DMem, Reg for the following instructions:]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
CPSC614Lec 2.32
Resolving this load hazard
• Adding hardware? ... not
• Detection?
• Compilation techniques?
• What is the cost of load delays?
CPSC614Lec 2.33
Resolving the Load Data Hazard
[Figure: pipeline diagram, instruction order vs. time (clock cycles); stages Ifetch, Reg, ALU, DMem, Reg. A bubble is inserted after the load, delaying sub, and, and or:]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
CPSC614Lec 2.34
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd

Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd
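The difference between the two schedules can be counted with a short Python sketch. It assumes a 5-stage pipeline with forwarding, where the only remaining stall is one cycle when an instruction uses a loaded value in the slot immediately after the load. The `(op, dest, sources)` tuple encoding is an illustrative assumption.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls: a load immediately followed by a user of its result.

    program: list of (op, dest, sources) tuples; dest is None for stores.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
print("slow code stalls:", count_load_use_stalls(slow))
print("fast code stalls:", count_load_use_stalls(fast))
```

Under these assumptions the slow sequence stalls twice (ADD after LW Rc, SUB after LW Rf), while the fast sequence places an independent instruction after every load whose result is needed next.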
CPSC614Lec 2.35
Instruction Set Connection
• What is exposed about this organizational hazard in the instruction set?
• k cycle delay?
  – bad, CPI is not part of ISA
• k instruction slot delay
  – load should not be followed by use of the value in the next k instructions
• Nothing, but code can reduce run-time delays
• MIPS did the transformation in the assembler
CPSC614Lec 2.36
Historical Perspective: Microprogramming
[Figure: Main Memory holds the user program plus data (ADD, SUB, AND, ..., DATA; "this can change!"). The CPU contains the control memory and the execution unit. Annotation: "one of these is mapped into one of these" (a user instruction is mapped into a control-memory routine).]
• Supported complex instructions as a sequence of simple micro-instructions (RTs)
• Pipelined micro-instruction processing, but with a very limited view
• Could not reorganize macroinstructions to enable pipelining
CPSC614Lec 2.37
Control Hazard on Branches => Three Stage Stall
[Figure: pipeline diagram; stages Ifetch, Reg, ALU, DMem, Reg for the following instructions:]
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
CPSC614Lec 2.38
Example: Branch Stall Impact
• If 30% of instructions are branches, a 3-cycle stall is significant
• Two-part solution:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the taken-branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS solution:
  – Move zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3
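To see why a 3-cycle branch stall is significant, here is a small Python sketch. It assumes an ideal CPI of 1 with no other stalls, which the slide does not state explicitly.

```python
def cpi_with_branch_stalls(branch_fraction, penalty_cycles, ideal_cpi=1.0):
    """CPI when every branch adds a fixed number of stall cycles."""
    return ideal_cpi + branch_fraction * penalty_cycles

three_cycle = cpi_with_branch_stalls(0.30, 3)   # branch resolved late in the pipeline
one_cycle = cpi_with_branch_stalls(0.30, 1)     # zero test and adder moved to ID/RF
print(f"3-cycle penalty: CPI = {three_cycle:.2f}")
print(f"1-cycle penalty: CPI = {one_cycle:.2f}")
```

With these inputs the 3-cycle penalty nearly doubles the CPI (about 1.9), while the 1-cycle penalty gives about 1.3.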
CPSC614Lec 2.39
Pipelined MIPS Datapath
Figure 3.22, page 163, CA:AQA 2/e
[Figure: pipelined five-stage MIPS datapath. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Components: Memory, two Adders (one adds 4), Next SEQ PC, Next PC, Reg File (RS1, RS2, RD), Sign Extend (Imm), Zero?, MUXes, ALU, Data Memory, WB Data.]
• Data stationary control
  – local decode for each instruction phase / pipeline stage
CPSC614Lec 2.40
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% MIPS branches not taken on average
  – PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  – 53% MIPS branches taken on average
  – But haven’t calculated branch target address in MIPS
    » MIPS still incurs 1 cycle branch penalty
    » Other machines: branch target known before outcome
CPSC614Lec 2.41
Four Branch Hazard Alternatives
#4: Delayed Branch
  – Define branch to take place AFTER a following instruction

      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n      (branch delay of length n)
      ........
      branch target if taken

  – 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  – MIPS uses this
CPSC614Lec 2.42
Delayed Branch
• Where to get instructions to fill branch delay slot?
  – Before branch instruction
  – From the target address: only valuable when branch taken
  – From fall through: only valuable when branch not taken
  – Canceling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
CPSC614Lec 2.43
Example: Evaluating Branch Alternatives
Assume: conditional and unconditional branches = 14% of instructions; 65% of them change the PC
Scheduling scheme    Branch penalty    CPI     Speedup v. stall
Stall pipeline       3                 1.42    1.0
Predict taken        1                 1.14    1.26
Predict not taken    1                 1.09    1.29
Delayed branch       0.5               1.07    1.31
Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)
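The CPI column can be reproduced with a short Python sketch. Several things here are assumptions rather than statements from the slide: an ideal CPI of 1, a pipeline depth of 5, and the reading that "predict not taken" pays its 1-cycle penalty only on the 65% of branches that change the PC. The helper names are hypothetical.

```python
BRANCH_FREQ = 0.14      # conditional + unconditional branches (from the slide)
CHANGE_PC = 0.65        # fraction of branches that change the PC (from the slide)
PIPELINE_DEPTH = 5      # assumption: the 5-stage MIPS pipeline

def branch_cpi(penalty, paying_fraction=1.0):
    """CPI = 1 + branch frequency x fraction paying the penalty x penalty."""
    return 1.0 + BRANCH_FREQ * paying_fraction * penalty

def pipeline_speedup(cpi):
    """Pipeline depth divided by (1 + branch stall cycles per instruction)."""
    return PIPELINE_DEPTH / cpi

schemes = {
    "Stall pipeline": branch_cpi(3),
    "Predict taken": branch_cpi(1),
    "Predict not taken": branch_cpi(1, paying_fraction=CHANGE_PC),
    "Delayed branch": branch_cpi(0.5),
}
for name, cpi in schemes.items():
    print(f"{name:<18} CPI = {cpi:.2f}  pipeline speedup = {pipeline_speedup(cpi):.2f}")
```

The sketch prints unrounded-ratio speedups against a depth of 5, so it is a cross-check of the CPI column rather than a regeneration of the "Speedup v. stall" column.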