Date post: | 21-Dec-2015 |

Outline
  What are branches?
  Techniques for handling branches
  Branch prediction
  Why do we need branch prediction?
  Branch prediction schemes (static/dynamic)
  “Real” branch predictors

Types of Branches
            Conditional                  Unconditional
  Direct    if-then-else, for loops      procedure calls (JAL), goto (J)
            (BEZ, BNEZ, etc.)
  Indirect  -                            return (JR), virtual function lookup,
                                         function pointers (JALR)

Techniques for handling branches
  Pipeline stages: IF ID EX MEM WB
  Stalling
  Branch delay slots
    Relies on programmer/compiler to fill
    Depends on being able to find suitable instructions
    Ties resolution delay to a particular pipeline
  Predication
    Transform control dependence to data dependence on branch condition

Why aren’t these techniques acceptable?
  Branches are frequent (15-25%)
  Today’s pipelines are deeper and wider
    Higher performance penalty for stalling
    Misprediction penalty = issue width * resolution delay cycles
  A lot of cycles can be wasted!
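As a rough worked example of the penalty formula above, the sketch below multiplies issue width by resolution delay to get the issue slots lost per misprediction. The function name and the sample machine parameters are illustrative assumptions, not figures from the slides.

```python
def wasted_issue_slots(issue_width: int, resolution_delay_cycles: int) -> int:
    """Issue slots lost per mispredicted branch: issue width * resolution delay."""
    return issue_width * resolution_delay_cycles

# Hypothetical machines: a narrow, shallow pipeline vs. a wide, deep one.
print(wasted_issue_slots(1, 3))    # scalar pipeline, 3-cycle resolution
print(wasted_issue_slots(4, 10))   # 4-wide pipeline, 10-cycle resolution
```

The same misprediction costs far more slots on the wider, deeper machine, which is why stalling becomes less acceptable as pipelines grow.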

Branch Prediction
  Predicting the outcome of a branch
  Direction: Taken / Not Taken
    Direction predictors
  Target address: PC+offset (Taken) / PC+4 (Not Taken)
    Target address predictors
    Branch Target Address Cache (BTAC) or Branch Target Buffer (BTB)

Why do we need branch prediction?
  Branch prediction
    Increases the number of instructions available for the scheduler to issue
    Increases instruction-level parallelism (ILP)
    Allows useful work to be completed while waiting for the branch to resolve

A simple example which demonstrates the benefits
  if (x > 0) { a=0; b=1; c=2; } d=3;

  Cycle  Fetch     Decode      Execute     Save
  1      if (x>0)
  2      a=0       if (x>0)
  3      b=1       a=0         if (x>0)
  4      c=2       b=1         a=0         if (x>0)
  5                c=2         b=1         a=0
  6                            c=2         b=1
  7                                        c=2

  Cycle  Fetch     Decode      Execute     Save
  1      if (x>0)
  2      a=0       if (x>0)
  3      b=1       a=0         if (x>0)
  4      d=3       squash b=1  squash a=0  if (x>0)
  5                d=3         squash b=1  squash a=0
  6                            d=3         squash b=1
  7                                        d=3

  Cycle  Fetch     Decode      Execute     Save
  1      if (x>0)
  2      d=3       if (x>0)
  3      d=3                   if (x>0)
  4      d=3                               if (x>0)
  5                d=3

Classification of branch prediction schemes (1): Static schemes
  Decision before runtime (i.e. at compile time)
  Predict Branch Taken / Not Taken
    All-branches-taken scheme: 34% avg. misprediction rate
  Backward Taken/Forward Not Taken (BTFNT)
    Advantage in loops
    Doesn’t work well on programs with irregular branches
    The Ball and Larus enhancement works a little better
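A minimal sketch of the BTFNT rule, assuming the predictor can see both the branch address and its target address at prediction time. The function name and the sample addresses are hypothetical.

```python
def predict_btfnt(branch_pc: int, target_pc: int) -> bool:
    """Backward Taken / Forward Not Taken: predict taken iff the target precedes the branch."""
    return target_pc < branch_pc

# Hypothetical addresses: a loop-closing branch jumps backward,
# an if-then skip jumps forward.
loop_branch = predict_btfnt(branch_pc=0x1020, target_pc=0x1000)
skip_branch = predict_btfnt(branch_pc=0x1020, target_pc=0x1040)
print(loop_branch, skip_branch)
```

Loop-closing branches are backward and usually taken, which is where the advantage in loops comes from.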

Classification of branch prediction schemes (2): Profiling
  Branch prediction based on profiles created by earlier runs
  Key observation: the behavior of branches is bimodally distributed
  Preset static prediction bit in the opcode
  Doesn’t work well when the data sets seen at run time differ from the profiled ones
  Static schemes are useful for
    scheduling when the branch delays are exposed by the architecture
    assisting dynamic predictors
    determining frequent code paths

Classification of branch prediction schemes (3): Dynamic schemes
  Prediction decisions may change during the execution of the program
  Branch Target Buffer (Lee and Smith)
    2-bit saturating up-down counters to collect history information
  Static Training Scheme
    Uses statistics collected from a pre-run of the program and a history pattern consisting of the last N run-time executions

What happens when a branch is mispredicted?
  On mispredict: no speculative state may commit
    Squash instructions in the pipeline
    Must not allow stores in the pipeline to occur
      Cannot allow stores which would not have happened to commit
    Need to handle exceptions appropriately

2-bit branch prediction
  A branch must miss twice before the prediction is changed
  It’s a specialization of the n-bit saturating scheme
  Branch prediction buffer can be implemented as:
    Special cache accessed with the instruction address during IF
    Pair of bits attached to each block in the instruction cache
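The "miss twice" behavior can be sketched with a 2-bit saturating up-down counter. The class name, the state encoding (0 to 3, predict taken at 2 or above) and the strongly-taken starting state are assumptions made for illustration.

```python
class TwoBitCounter:
    """2-bit saturating up-down counter: states 0..3, predict taken when state >= 2."""

    def __init__(self, state: int = 3):
        self.state = state  # 3 = strongly taken (assumed starting state)

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

counter = TwoBitCounter()
predictions = []
for outcome in [False, False, True]:  # two misses in a row, then a taken branch
    predictions.append(counter.predict())
    counter.update(outcome)
print(predictions)
```

Starting from the strongly-taken state, the prediction flips to not taken only after the second consecutive miss.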

Correlating (Two-Level) branch predictors (1)
  Consider the sequence (1):
    if (aa==2)
        aa=0;
    if (bb==2)
        bb=0;
    if (aa!=bb) {
  Consider the sequence (2):
    if (d==0)
        d=1;
    if (d==1)
  MIPS assembly for (2):
        BNEZ   R1,L1      ;branch b1
        DADDIU R1,R0,#1   ;d=1
    L1: DADDIU R3,R1,#-1
        BNEZ   R3,L2      ;branch b2
        …
    L2:

1-bit correlation branch predictor
  In (2), if b1 is NOT taken then b2 is NOT taken too!
  Consider a predictor with 1 bit of correlation to capture the dependence of one branch on another
  2 prediction bits per branch:
    1 assuming the last branch executed was Not Taken
    1 assuming the last branch executed was Taken

  Pred bits  Pred if last branch not taken  Pred if last branch taken
  NT/NT      NT                             NT
  NT/T       NT                             T
  T/NT       T                              NT
  T/T        T                              T

Comparison
  1-bit predictor:
  d=?  b1 pred  b1 act  new b1 pred  b2 pred  b2 act  new b2 pred
  2    NT       T       T            NT       T       T
  0    T        NT      NT           T        NT      NT
  2    NT       T       T            NT       T       T
  0    T        NT      NT           T        NT      NT

  1-bit predictor with 1 bit of correlation:
  d=?  b1 pred  b1 act  new b1 pred  b2 pred  b2 act  new b2 pred
  2    NT/NT    T       T/NT         NT/NT    T       NT/T
  0    T/NT     NT      T/NT         NT/T     NT      NT/T
  2    T/NT     T       T/NT         NT/T     T       NT/T
  0    T/NT     NT      T/NT         NT/T     NT      NT/T
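The two tables can be replayed with a small simulation of branches b1 and b2 from sequence (2), with d alternating between 2 and 0. This is only a sketch: the function name, the all-NT initial bits, and the assumption that the initial "last branch" outcome is not taken are choices made here to match the tables.

```python
def run_sequence(d_values, correlated):
    """Count mispredictions of b1 and b2 for sequence (2) over the given d values."""
    # Each branch keeps prediction bits indexed by the last branch outcome
    # (index 0 = last branch not taken, 1 = last branch taken).
    # Without correlation only index 0 is ever used.
    bits = {"b1": [False, False], "b2": [False, False]}  # all start at NT
    last_taken = False  # assumed initial history: not taken
    misses = 0
    for d in d_values:
        for name in ("b1", "b2"):
            if name == "b1":
                taken = d != 0        # BNEZ R1,L1
                if not taken:
                    d = 1             # fall-through executes d=1
            else:
                taken = d != 1        # BNEZ R3,L2 with R3 = d-1
            idx = int(last_taken) if correlated else 0
            if bits[name][idx] != taken:
                misses += 1
            bits[name][idx] = taken   # 1-bit predictor: remember the last outcome
            last_taken = taken
    return misses

print(run_sequence([2, 0, 2, 0], correlated=False))
print(run_sequence([2, 0, 2, 0], correlated=True))
```

Under these assumptions the plain 1-bit predictor mispredicts every branch, while the correlating version mispredicts only in the first iteration.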

Correlating branch predictors
  2 bits of global history means we look at the T/NT behavior of the last two branches to determine the behavior of THIS branch.
  The buffer can be implemented as a one-dimensional array.
  An (m,n) predictor uses the behavior of the last m branches to choose from 2^m predictors, each being an n-bit predictor. It takes (2^m x n x # of entries selected by the branch address) bits.

Correlating branch predictors
  Q: How can we capture the behavior of the last n branches and adjust the behavior of the current branch accordingly?
  A: We use an n-bit shift register and shift the behavior of each branch into this register as it becomes known.
  [Figure: shift register holding 110, with the last branch outcome shifted in]
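A minimal sketch of the n-bit history shift register described above. The class name is hypothetical, and shifting the newest outcome into the low bit (with 1 meaning taken) is an arbitrary convention chosen here.

```python
class GlobalHistory:
    """n-bit shift register of recent branch outcomes (1 = taken, 0 = not taken)."""

    def __init__(self, n_bits: int):
        self.n_bits = n_bits
        self.value = 0  # assumed to start as all not-taken

    def shift_in(self, taken: bool) -> None:
        mask = (1 << self.n_bits) - 1
        self.value = ((self.value << 1) | int(taken)) & mask

history = GlobalHistory(3)
for outcome in [True, True, False]:  # taken, taken, not taken
    history.shift_in(outcome)
print(format(history.value, "03b"))
```

Older outcomes fall off the high end once more than n branches have been recorded.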

Correlating branch predictors
  Higher prediction rates than the simple 2-bit predictor scheme with only a trivial additional amount of HW (m-bit shift register)
  NOTE: the buffer is NOT a cache, so counters may correspond to different branches at some point in time
  Buffer can be implemented as a linear memory array that is n bits wide
    Indexing is done by concatenating global history bits with the bits from the branch address
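The concatenated index can be sketched as below. The function name, the 4-byte instruction alignment, and the choice to place address bits above history bits are assumptions; the slides do not specify the bit order.

```python
def predictor_index(branch_pc: int, global_history: int,
                    history_bits: int, address_bits: int) -> int:
    """Concatenate low branch-address bits (high part) with global history (low part)."""
    # Assumes 4-byte instructions, so the two lowest PC bits are dropped.
    addr_part = (branch_pc >> 2) & ((1 << address_bits) - 1)
    hist_part = global_history & ((1 << history_bits) - 1)
    return (addr_part << history_bits) | hist_part

# Hypothetical 64-entry buffer: 4 address bits + 2 history bits.
index_a = predictor_index(0x40001C, 0b10, history_bits=2, address_bits=4)
index_b = predictor_index(0x00001C, 0b10, history_bits=2, address_bits=4)
print(index_a, index_b)
```

The two different branch addresses map to the same entry, which illustrates why the buffer is not a cache: nothing checks which branch a counter belongs to.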

Correlating branch predictors
  How many bits are there in a (0,2) predictor that has 4K entries selected from the branch address?
    2^0 x 2 x 4K = 8K bits
  How many bits does the example predictor have?
    2^2 x 2 x 16 = 128 bits
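The two calculations above follow the (m,n) storage formula, which can be checked with a short sketch. The function and parameter names are hypothetical.

```python
def predictor_bits(m: int, n: int, entries_per_history: int) -> int:
    """Storage of an (m, n) predictor: 2^m * n * entries selected by the branch address."""
    return (2 ** m) * n * entries_per_history

print(predictor_bits(0, 2, 4096))  # (0,2) predictor with 4K entries
print(predictor_bits(2, 2, 16))    # example predictor: m=2, n=2, 16 entries
```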

Hybrid predictors
  The basic idea is to use a META predictor to select among multiple predictors
  Example:
    Local predictors are better in some branches
    Global predictors are better in utilizing correlation
    Use a predictor to select the better predictor

Tournament predictors
  n/m means:
    n: left predictor
    m: right predictor
    0: incorrect
    1: correct
  A predictor must be incorrect twice before we switch to the other one
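One way to sketch the selector is as a 2-bit saturating counter that moves only when exactly one of the two predictors is correct. The class name, the state encoding, and the starting state are assumptions for illustration, not details given in the slides.

```python
class TournamentSelector:
    """2-bit selector: states 0-1 use the left predictor, states 2-3 use the right one."""

    def __init__(self):
        self.state = 0  # assumed start: strongly prefer the left predictor

    def chosen(self) -> str:
        return "left" if self.state < 2 else "right"

    def update(self, left_correct: bool, right_correct: bool) -> None:
        # Only the 0/1 and 1/0 cases move the counter; 0/0 and 1/1 leave it alone.
        if right_correct and not left_correct:
            self.state = min(3, self.state + 1)
        elif left_correct and not right_correct:
            self.state = max(0, self.state - 1)

selector = TournamentSelector()
choices = []
for left_ok, right_ok in [(False, True), (False, True), (True, True)]:
    choices.append(selector.chosen())
    selector.update(left_ok, right_ok)
choices.append(selector.chosen())
print(choices)
```

Starting from the strong state, the left predictor has to lose twice (0/1 twice) before the selector switches to the right one.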

Fractions of predictions coming from the local predictor
  The tournament predictor selects between a local 2-bit predictor and a 2-bit Gshare predictor
  Each predictor has 1024 entries, each 2 bits, for a total of 64K bits

Need Address at Same Time as Prediction
  Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  Note: must check for a branch match now, since we can’t use the wrong branch’s address
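A minimal sketch of a BTB lookup with the branch-match check mentioned above. The direct-mapped organization, the 16-entry size, the full-PC tag, and all names are assumptions made for illustration.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry stores a tag (full branch PC) and a predicted target."""

    def __init__(self, num_entries: int = 16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries  # each slot: (branch_pc, target) or None

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.num_entries  # assumes 4-byte instructions

    def lookup(self, pc: int):
        """Return the predicted target, or None if this PC is not a known taken branch."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:  # tag check: must be the same branch
            return entry[1]
        return None

    def record_taken(self, pc: int, target: int) -> None:
        self.entries[self._index(pc)] = (pc, target)

btb = BranchTargetBuffer()
btb.record_taken(0x1000, 0x2000)
print(btb.lookup(0x1000))  # same branch: predicted target is returned
print(btb.lookup(0x1040))  # same slot, different branch: tag mismatch
```

Without the tag comparison, the second lookup would redirect fetch to another branch's target.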

Return Address Predictors
  Return addresses can be predicted with a BTB, but accuracy can be low
    A procedure may be called from multiple sites
  Solution: a small buffer operating as a stack
    If the stack is large enough, it will predict perfectly
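The stack-based return predictor can be sketched as follows. The class name, the default depth, and the drop-oldest overflow policy are assumptions made for illustration.

```python
class ReturnAddressStack:
    """Small fixed-depth stack of predicted return addresses."""

    def __init__(self, depth: int = 4):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address: int) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: drop the oldest entry (assumed policy)
        self.stack.append(return_address)

    def predict_return(self):
        """Pop the most recent return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack(depth=4)
ras.on_call(0x1004)   # call from site A
ras.on_call(0x2008)   # nested call from site B
print(ras.predict_return(), ras.predict_return())
```

Returns are predicted in last-in, first-out order, so the prediction is right regardless of which site called the procedure, as long as the call depth does not exceed the stack depth.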

Alpha 21264
  8-stage pipeline, mispredict penalty 7 cycles
  Hybrid predictor (Fetch)
    12-bit GAg (4K-entry PHT, 2-bit counters)
    10-bit PAg (1K-entry BHT, 1K-entry PHT, 3-bit counters)

Sun UltraSPARC-III
  14-stage pipeline, bpred accessed in instruction fetch stages 2-3
  16K-entry 2-bit counter Gshare predictor
    Bimodal predictor which XORs PC bits with the global history register (except 3 lower-order bits) to reduce aliasing
  Miss queue
    Halves mispredict penalty by providing instructions for immediate use
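A generic gshare index computation, as a sketch of the XOR indexing idea. This is not the UltraSPARC-III circuit: the exclusion of the 3 lower-order bits is not modeled, and the function name, the 4-byte instruction alignment, and the 14-bit index (for 16K entries) are assumptions.

```python
def gshare_index(branch_pc: int, global_history: int, index_bits: int = 14) -> int:
    """Generic gshare: XOR branch-address bits with global history, keep index_bits bits."""
    mask = (1 << index_bits) - 1
    return ((branch_pc >> 2) ^ global_history) & mask  # assumes 4-byte instructions

# The same branch under two different histories lands on two different counters.
index_a = gshare_index(0x4000, 0b1010)
index_b = gshare_index(0x4000, 0b0101)
print(index_a, index_b)
```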

Intel Pentium III
  Dynamic branch prediction
    512-entry BTB predicts direction and target, 4-bit history used with PC to derive direction
  Static branch predictor for BTB misses
  Return Address Stack (RAS), 4/8 entries
  Branch penalties:
    Not taken: no penalty
    Correctly predicted taken: 1 cycle
    Mispredicted: at least 9 cycles, as many as 26, average 10-15 cycles

AMD Athlon K7
  10-stage integer, 15-stage fp pipeline, predictor accessed in fetch
  2K-entry bimodal, 2K-entry BTAC
  12-entry RAS
  Branch penalties:
    Correct predict taken: 1 cycle
    Mispredict penalty: at least 10 cycles