branch.110/14
Branch Prediction
Static, Dynamic Branch prediction techniques
Control Flow Penalty: Why Branch Prediction
[Figure: pipeline from PC and Fetch (I-cache) through Fetch Buffer, Decode, Issue Buffer, Execute (Func. Units), Result Buffer and Commit to Arch. State, marking where the next fetch is started and where the branch is executed]
Modern processors have 10-14 pipeline stages between next-PC calculation and branch resolution!
Work lost if the pipeline makes a wrong prediction:
~ loop length x pipeline width (loop length = number of stages between next-PC calculation and branch resolution)
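As a rough illustration of the "loop length x pipeline width" estimate, the sketch below computes the instruction slots thrown away by one misprediction and the resulting CPI increase. The function names and all numeric values are assumptions chosen for illustration, and the CPI formula is the usual first-order model rather than anything stated on the slide.

```python
def wasted_slots(loop_length: int, pipeline_width: int) -> int:
    """Instruction slots discarded by one misprediction (first-order estimate)."""
    return loop_length * pipeline_width


def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: int) -> float:
    """Base CPI plus the average stall cycles per instruction due to mispredictions."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed machine: 12 stages between next-PC and branch resolution, 4-wide.
print(wasted_slots(12, 4))  # 48 slots lost per misprediction

# Assumed workload: 20% branches, 5% of them mispredicted, ideal CPI of 0.25.
print(round(effective_cpi(0.25, 0.20, 0.05, 12), 2))
```

Even at 95% accuracy the misprediction term is a sizeable fraction of the assumed ideal CPI, which is why deeper and wider pipelines need better predictors.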
Branch Penalties in a Superscalar are extensive
Reducing Control Flow Penalty
Software solutions
• Minimize branches - loop unrolling
– Increases the run length
Hardware solutions
• Find something else to do - delay slots
• Speculate - dynamic branch prediction
– Speculative execution of instructions beyond the branch
Branch Prediction
Motivation:
• Branch penalties limit performance of deeply pipelined processors
• Much worse for superscalar processors
• Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
Dynamic prediction HW:
• Branch history tables, branch target buffers, etc.
Mispredict recovery mechanisms:
• Keep computation result separate from commit
• Kill instructions following branch
• Restore state to state following branch
Static Branch Prediction - review
Overall probability a branch is taken is ~60-70%, but it depends on the branch direction:
• Backward JZ: ~90% taken
• Forward JZ: ~50% taken
ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110:
bne0 (preferred taken), beq0 (not taken)
ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate
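The backward/forward statistics above suggest the common "backward taken, forward not taken" heuristic. The following is a minimal sketch of that heuristic, assuming the branch target is known at prediction time; the function name and the example addresses are hypothetical.

```python
def predict_static(branch_pc: int, target_pc: int) -> bool:
    """Predict taken for backward branches, not taken for forward branches."""
    return target_pc < branch_pc


# Hypothetical loop-closing branch at 0x40 jumping back to 0x20:
# taken 9 times, then not taken when the loop exits.
outcomes = [True] * 9 + [False]
correct = sum(predict_static(0x40, 0x20) == taken for taken in outcomes)
print(correct, "of", len(outcomes), "predicted correctly")  # 9 of 10
```

The prediction never changes at run time, so a backward branch that is usually not taken would be mispredicted every time it falls through.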
Branch Prediction Needs
• Target address generation
– Get register: PC, link reg., GP reg.
– Calculate: +/- offset, auto inc/dec
– Target speculation
• Condition resolution
– Get register: condition code reg., count reg., other reg.
– Compare registers
– Condition speculation
Target address generation takes time
Condition resolution takes time
Solution: Branch speculation
Branch Prediction Schemes
1. 2-bit Branch-Prediction Buffer
2. Branch Target Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. Integrated Instruction Fetch Units
6. Return Address Predictors (for subroutines, Pentium, Core Duo)
7. Predicated Execution (Itanium)
Dynamic Branch Prediction: learning based on past behavior
• Incoming stream of addresses
• Fast outgoing stream of predictions
• Correction information returned from pipeline
[Figure: Branch Predictor block holding history information. Inputs: incoming branches {Address} and corrections {Address, Value}. Output: prediction {Address, Value}.]
Branch History Table (BHT): Table of predictors
• Each branch given its own predictor
• BHT is table of “Predictors”
– Could be 1-bit or more
– Indexed by PC address of branch
• Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
– End of loop case: when it exits the loop
– First time through the loop on the next execution: it predicts exit instead of looping
• Most schemes use at least 2-bit predictors
• Performance = ƒ(accuracy, cost of misprediction)
– Misprediction ⇒ flush reorder buffer
• In Fetch stage of branch:
– Use predictor to make prediction
• When branch completes:
– Update corresponding predictor
[Figure: Branch PC selects one entry from a table of predictors, Predictor 0 through Predictor 7]
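The two mispredictions per loop execution can be checked with a small simulation. This is a sketch under assumptions: a single 1-bit predictor dedicated to one loop-closing branch, a loop body that runs 9 taken iterations before exiting, and a hypothetical helper name `run_one_bit`.

```python
def run_one_bit(outcomes, initial=False):
    """Count mispredictions of a 1-bit predictor that remembers the last outcome."""
    state = initial          # True = predict taken, False = predict not taken
    mispredicts = 0
    for taken in outcomes:
        if state != taken:   # prediction made at fetch disagrees with the outcome
            mispredicts += 1
        state = taken        # update when the branch completes
    return mispredicts


one_execution = [True] * 9 + [False]   # loop back 9 times, then exit
trace = one_execution * 3              # the loop is entered three times

print(run_one_bit(trace))  # 6: two mispredictions for each execution of the loop
```

Each execution of the loop pays once on exit and once more on re-entry, because the exit flipped the single bit to "not taken".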
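Branch History Table Organization
Target PC calculation takes time
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
[Figure: k bits of the Fetch PC (above the two low-order 0 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, which outputs Taken/¬Taken. In parallel the I-cache delivers the instruction; its opcode determines "Branch?" and its offset is added to the PC to form the Target PC.]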
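The indexing in the figure can be sketched as follows, assuming 4-byte aligned instructions (so the two low PC bits are always zero) and a table of 2^k entries with no tags. The function name and the example addresses are hypothetical.

```python
def bht_index(pc: int, k: int) -> int:
    """Drop the two always-zero low bits, then keep the next k bits as the index."""
    return (pc >> 2) & ((1 << k) - 1)


K = 12                                # 2**12 = 4K entries, as on the slide
print(bht_index(0x00401000, K))       # 1024
print(bht_index(0x00405000, K))       # 1024 as well: the two branches alias
```

Because the table has no tags, two branches whose PCs share the same k index bits use the same predictor entry and can disturb each other's history.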
2-bit Dynamic Branch Prediction: more accurate than 1-bit
• Better solution: 2-bit scheme that changes its prediction only after two mispredictions in a row
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process
[Figure: state diagram with two Predict Taken states and two Predict Not Taken states, connected by transitions labeled T (taken) and NT (not taken)]
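One common way to realize a 2-bit scheme is a saturating counter. The sketch below assumes that variant, which may differ in its transitions from the state diagram on the slide; the class and function names are hypothetical. It replays a loop branch that is taken 9 times and then falls through.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 3):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredicts(outcomes, state: int = 3) -> int:
    counter = TwoBitCounter(state)
    wrong = 0
    for taken in outcomes:
        wrong += counter.predict() != taken
        counter.update(taken)
    return wrong


trace = ([True] * 9 + [False]) * 3     # same loop executed three times
print(count_mispredicts(trace))        # 3: only the loop exits are mispredicted
```

A single not-taken outcome moves the counter from state 3 to state 2, which still predicts taken, so re-entering the loop no longer costs a misprediction.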
BTB: Branch Address at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch is used as the index to get the prediction AND the branch target address (if taken)
[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC field of the BTB; each entry also holds a Predicted PC and prediction state bits]
Yes: instruction is a branch; use predicted PC as next PC
No: branch not predicted, proceed normally (Next PC = PC+4)
Only predicted-taken branches and jumps are held in the BTB
Next PC is determined before the branch is fetched and decoded
Later: check the prediction; if wrong, kill the instruction and update the BPb (branch-prediction buffer)
BTB contains only Branch & Jump Instructions
BTB contains information for branch and jump instructions only; it is not updated for other instructions
For all other instructions the next PC is PC+4!
This is achieved without decoding the instruction
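The lookup behaviour described above can be sketched with a dictionary keyed by branch PC. This is a simplified model under assumptions: unlimited capacity, full tags, 4-byte instructions, and no separate prediction state bits (an entry is present only while the branch is predicted taken). The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Maps the PC of a predicted-taken branch or jump to its predicted target."""

    def __init__(self):
        self.entries = {}

    def next_pc(self, pc: int) -> int:
        # Hit: redirect fetch to the predicted target.
        # Miss: any other instruction, or a branch predicted not taken.
        return self.entries.get(pc, pc + 4)

    def update(self, pc: int, taken: bool, target: int) -> None:
        # Called once the branch resolves.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))   # 0x104: miss, fall through
btb.update(0x100, taken=True, target=0x40)
print(hex(btb.next_pc(0x100)))   # 0x40: hit, next PC known without decoding
```

The fetch stage only needs the PC to consult the table, which is why the next PC is available before the instruction itself has been decoded.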
Combining BTB and BHT
• BTB entries are considerably more expensive than BHT entries, but the BTB redirects fetch earlier in the pipeline and can accelerate indirect branches (JR)
• BHT can hold many more entries - more accurate
Pipeline stages:
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional Units
R  Register File Read
E  Integer Execute
[Figure: BTB and BHT attached to the pipeline stages listed above]
BHT in a later pipeline stage corrects when the BTB misses a predicted-taken branch
BTB/BHT only updated after the branch resolves in the E stage
Subroutine Return Stack
• Small stack – accelerates subroutine returns
• More accurate than BTBs
Push return address when function call executed
Pop return address when subroutine return decoded
[Figure: stack holding &nexta, &nextb, &nextc; k entries (typically k = 8-16)]
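The push/pop behaviour can be sketched as a bounded stack. The overflow policy (discard the oldest entry) and the behaviour on an empty stack (no prediction) are assumptions, as are the class and method names; real designs vary.

```python
class ReturnAddressStack:
    """k-entry stack that predicts the target of subroutine returns."""

    def __init__(self, k: int = 8):
        self.k = k
        self.stack = []

    def on_call(self, return_address: int) -> None:
        if len(self.stack) == self.k:
            self.stack.pop(0)          # assumed policy: overwrite the oldest entry
        self.stack.append(return_address)

    def on_return(self):
        # Predicted return target, or None when the stack is empty.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(k=8)
ras.on_call(0x1004)              # call a, return to &nexta
ras.on_call(0x2008)              # call b from inside a, return to &nextb
print(hex(ras.on_return()))      # 0x2008: innermost return first
print(hex(ras.on_return()))      # 0x1004
```

A BTB would remember only the last target of each return instruction, which is wrong whenever the same subroutine is called from a different site; the stack tracks the actual call nesting.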
Mispredict Recovery
In-order execution machines:
– Instructions issued after the branch cannot write back before the branch resolves
– All instructions in the pipeline behind a mispredicted branch are killed
Predicated Execution
• Avoid branch prediction by turning branches into conditionally executed instructions:
if (x) then A = B op C else NOP
– If false, then neither store result nor cause exception
– Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
– IA-64: 64 1-bit condition fields selected, allowing conditional execution of any instruction
– This transformation is called “if-conversion”
• Drawbacks to conditional instructions
– Still takes a clock even if “annulled”
– Stall if condition evaluated late
– Complex conditions reduce effectiveness; condition becomes known late in pipeline
[Figure: a branch on x around A = B op C, replaced by A = B op C guarded by the predicate x]
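If-conversion can be illustrated at the source level by replacing the branch with a select driven by a 0/1 predicate. This is only an analogy for what a conditional move or predicated instruction does in hardware: it assumes `op` is integer addition and uses hypothetical function names, and unlike real predication it always evaluates `B op C`.

```python
def with_branch(x: bool, a: int, b: int, c: int) -> int:
    """Original control flow: if (x) then A = B op C else NOP."""
    if x:
        a = b + c
    return a


def if_converted(x: bool, a: int, b: int, c: int) -> int:
    """Branch-free form: compute the result, then select it with the predicate."""
    p = int(bool(x))                 # predicate as 0 or 1
    result = b + c                   # executed regardless of the predicate
    return p * result + (1 - p) * a  # keep old A when the predicate is false


print(if_converted(True, 7, 2, 3))   # 5
print(if_converted(False, 7, 2, 3))  # 7
```

The operation is always executed, which mirrors the drawback noted on the slide: the instruction still takes a clock even when its result is discarded.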
Accuracy v. Size (SPEC89)
Dynamic Branch Prediction Summary
• Prediction becoming important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
• Tournament Predictor: give more resources to competing predictors and pick between them
• Branch Target Buffer: include branch address & prediction
• Predicated Execution can reduce number of branches, number of mispredicted branches
• Return address stack for prediction of indirect jump