Neural Methods for Dynamic Branch Prediction
Daniel A. Jiménez
Department of Computer Science, Rutgers University
The Context
I'll be discussing the implementation of microprocessors, i.e., microarchitecture.
I study deeply pipelined, high clock frequency CPUs.
The goal is to improve performance: make the program go faster.
How can we exploit program behavior to make it go faster?
Remove control dependences
Increase instruction-level parallelism
An Example
This C++ code computes something useful. The inner loop executes two statements each time through the loop.
int foo (int w[], bool v[], int n) {
    int sum = 0;
    for (int i=0; i<n; i++) {
        if (v[i])
            sum += w[i];
        else
            sum += ~w[i];
    }
    return sum;
}
An Example continued
This C++ code computes the same thing with three statements in the loop, replacing the branch with arithmetic: b is 0 when v[i] is false and -1 (all ones) when it is true, so ~(a ^ b) yields ~a or a respectively.
This version is 55% faster on a Pentium 4, because the previous version had many mispredicted branch instructions.
int foo2 (int w[], bool v[], int n) {
    int sum = 0;
    for (int i=0; i<n; i++) {
        int a = w[i];
        int b = - (int) v[i];
        sum += ~(a ^ b);
    }
    return sum;
}
How an Instruction is Processed
Processing can be divided into five stages:
Instruction fetch
Instruction decode
Execute
Memory access
Write back
Instruction-Level Parallelism
To speed up the process, pipelining overlaps execution of multiple instructions, exploiting parallelism between instructions
Control Hazards: Branches
Conditional branches create a problem for pipelining: the next instruction can't be fetched until the branch has executed, several stages later.
Pipelining and Branches
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle.
Branch Prediction
Instruction fetch
Instruction decode
Execute
Memory access
Write back
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.
Branch predictors must be highly accurate to avoid mispredictions!
Branch Predictors Must Improve
The cost of a misprediction is proportional to pipeline depth. As pipelines deepen, we need more accurate branch predictors.
The Pentium 4 pipeline has 20 stages; future pipelines will have more than 32 stages.
Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage.
In simulations with SimpleScalar/Alpha, decreasing the misprediction rate from 9% to 4% results in a 31% speedup for a 32-stage pipeline.
Overview
Branch prediction background
Applying machine learning to branch prediction
Results and analysis
Circuit-level implementation
Future work and conclusions
Branch Prediction Background
The basic mechanism: 2-level adaptive prediction [Yeh & Patt '91]
Uses correlations between branch history and outcome. Examples:
gshare [McFarling '93]
agree [Sprangle et al. '97]
hybrid predictors [Evers et al. '96]
This scheme is highly accurate in practice.
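For concreteness, gshare can be sketched in a few lines. The XOR indexing and the table of 2-bit saturating counters are the standard scheme from McFarling's paper; the history length and table size here are illustrative, not from any particular machine.

```cpp
#include <cstdint>
#include <vector>

// Illustrative gshare: a table of 2-bit saturating counters indexed by
// the branch PC XORed with the global history register.
struct Gshare {
    static const int HIST_BITS = 14;        // illustrative size
    std::vector<uint8_t> table;             // 2-bit counters, values 0..3
    uint32_t history = 0;                   // global history register

    Gshare() : table(1u << HIST_BITS, 1) {} // init weakly not-taken

    uint32_t index(uint32_t pc) const {
        return (pc ^ history) & ((1u << HIST_BITS) - 1);
    }
    bool predict(uint32_t pc) const {
        return table[index(pc)] >= 2;       // counter >= 2 means taken
    }
    void update(uint32_t pc, bool taken) {
        uint8_t &c = table[index(pc)];
        if (taken && c < 3) c++;            // saturate at 3
        if (!taken && c > 0) c--;           // saturate at 0
        history = ((history << 1) | (taken ? 1 : 0)) & ((1u << HIST_BITS) - 1);
    }
};
```

After enough repetitions of an always-taken branch, the history register stabilizes and the same counter is trained to predict taken.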
Branch Predictor Accuracy
Larger tables and smarter organizations yield better accuracy. Longer histories provide more context for finding correlations.
But table size is exponential in history length, and the cost is increased access delay and chip area.
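To see the exponential cost: a pattern history table needs one 2-bit counter per possible history pattern, i.e. 2 × 2^h bits for an h-bit history. A quick check (the sizes are illustrative arithmetic, not a real design):

```cpp
#include <cstdint>

// Storage needed for a pattern history table of 2-bit counters
// covering every possible h-bit history pattern: 2 bits per entry,
// 2^h entries, so the cost doubles with every extra history bit.
uint64_t pht_bits(int h) {
    return 2ull * (1ull << h);
}
```

For example, pht_bits(14) is 32768 bits (4 KB), while 30 bits of history would already require 2^31 bits (256 MB); a perceptron's storage, by contrast, grows only linearly with history length.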
Applying Machine Learning to Branch Prediction
Branch Prediction is a Machine Learning Problem
So why not apply a machine learning algorithm? Replace the 2-bit counters with a more accurate predictor.
There are tight constraints on the prediction mechanism: it must be fast and small enough to work as a component of a microprocessor.
Artificial neural networks are a simple model of the neural networks in brain cells; they learn to recognize and classify patterns.
Most neural nets are slow and complex relative to tables. For branch prediction, we need a small and fast neural method.
A Neural Method for Branch Prediction
We investigated several neural methods; most were too slow, too big, or not accurate enough.
Our choice: the perceptron [Rosenblatt '62, Block '62]
Very high accuracy for branch prediction
Prediction and update are quick, relative to other neural methods
Sound theoretical foundation: the perceptron convergence theorem
Proven to work well for many classification problems
Branch-Predicting Perceptron
Inputs (x's) come from the branch history register.
Weights (w's) are small integers learned by on-line training.
The output (y) gives the prediction: the dot product of the x's and w's.
Training finds correlations between history and outcome.
Training Algorithm
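The slide's figure is not reproduced here; the training rule can be sketched as follows. The threshold θ and the omission of hardware weight saturation are simplifications of the actual design.

```cpp
#include <cstdlib>
#include <vector>

// Perceptron training: if the prediction was wrong, or the output's
// magnitude did not exceed the threshold theta, nudge each weight
// toward agreement with the branch outcome t (+1 taken, -1 not taken):
// w[i] += t * x[i].
void train(std::vector<int>& w, const std::vector<bool>& hist,
           int y, bool taken, int theta) {
    int t = taken ? 1 : -1;
    bool mispredicted = (y >= 0) != taken;
    if (mispredicted || std::abs(y) <= theta) {
        w[0] += t;                                // bias input is always +1
        for (size_t i = 0; i < hist.size(); i++)
            w[i + 1] += t * (hist[i] ? 1 : -1);
    }
}
```

Training only when the output is wrong or weak (|y| ≤ θ) keeps weights from growing without bound once the perceptron already predicts the branch confidently.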
Organization of the Perceptron Predictor
Keeps a table of perceptrons, indexed by branch address.
Inputs come from the branch history register.
Predict taken if the output ≥ 0, otherwise predict not taken.
Key intuition: table size isn't exponential in history length, so we can consider much longer histories.
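A sketch of the table organization (the table size and the simple modulo indexing are illustrative). Note that storage is (history length + 1) weights per entry, so it grows linearly, not exponentially, with history length:

```cpp
#include <cstdint>
#include <vector>

// Illustrative perceptron predictor table: indexed by a hash of the
// branch address; each entry holds one weight per history bit plus a bias.
struct PerceptronTable {
    static const int N = 512;          // number of perceptrons (illustrative)
    static const int H = 23;           // history length, as in the talk
    std::vector<std::vector<int>> w;   // N entries of H+1 weights each
    std::vector<bool> history;         // global history register

    PerceptronTable() : w(N, std::vector<int>(H + 1, 0)), history(H, false) {}

    int output(uint32_t pc) const {
        const std::vector<int>& e = w[pc % N];   // select by branch address
        int y = e[0];                            // bias weight
        for (int i = 0; i < H; i++)
            y += history[i] ? e[i + 1] : -e[i + 1];
        return y;
    }
    bool predict(uint32_t pc) const { return output(pc) >= 0; }
};
```

With all-zero weights the output is 0, so an untrained entry defaults to predicting taken.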
Results and Analysis for the Perceptron Predictor
Experimental Evaluation
Execution- and trace-driven simulations: measure instruction throughput (IPC) and misprediction rates
SimpleScalar/Alpha [Burger & Austin '97]
Alpha 21264-like configuration:
4-wide issue, 64KB I-cache, 64KB D-cache, 512-entry BTB
SPECint 2000 benchmarks
Technology estimates: HSPICE for circuit delay estimates; modified CACTI 2.0 [Agarwal 2000] for PHT delay estimates
Results: Predictor Accuracy
The perceptron outperforms a competitive hybrid predictor by 36% at ~4KB: a 1.71% vs. 2.66% misprediction rate.
Results: Large Hardware Budgets
The multi-component hybrid was the most accurate fully dynamic predictor known in the literature [Evers 2000]; the perceptron predictor is even more accurate.
Delay-Sensitive Implementation
Even the relatively simple perceptron has high access delay
Our solution: An overriding perceptron predictor
First level is a single-cycle gshare
Second level is a 4KB, 23-bit history perceptron predictor
HSPICE total prediction delay estimates:
2 cycles at 833 MHz (like Alpha 21264)
4 cycles at 1.76 GHz (like Pentium 4)
Compare with 4KB hybrid predictor
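The overriding organization can be sketched as control logic: fetch proceeds immediately on the single-cycle gshare prediction, and when the slower perceptron prediction arrives a few cycles later and disagrees, the speculatively fetched instructions are squashed and fetch restarts down the perceptron's path. The struct and function below are a hypothetical interface, not the actual hardware; the cycle count matches the HSPICE estimate from the slide.

```cpp
// Sketch of overriding-predictor resolution. An override costs only a
// few front-end cycles, far less than a full pipeline-depth misprediction.
struct OverrideResult {
    bool final_prediction;   // the path the front end ends up following
    int  penalty_cycles;     // cycles lost to the override, if any
};

OverrideResult resolve(bool gshare_pred, bool perceptron_pred,
                       int perceptron_latency /* e.g. 4 at 1.76 GHz */) {
    if (gshare_pred == perceptron_pred)
        return {gshare_pred, 0};                  // agreement: no penalty
    return {perceptron_pred, perceptron_latency}; // override: small penalty
}
```

The scheme pays off as long as the perceptron is right often enough that the occasional small override penalty replaces a much larger misprediction penalty.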
Results: IPC with high clock rate
Pentium 4-like: 20-cycle misprediction penalty, 1.76 GHz. 15.8% higher IPC than gshare, 5.7% higher than hybrid.
Analysis: History Length
The fixed-length path branch predictor can also use long histories [Stark, Evers & Patt '98]
Analysis: Training Times
The perceptron “warms up” faster.
Circuit-Level Implementation of a Neural Branch Predictor
Circuit-Level Implementation
Example output computation: 12 weights, a Wallace tree of depth 6 followed by a 14-bit carry-lookahead adder.
Delay is 2-4 cycles for longer histories.
Carry-save adders have O(1) depth; the carry-lookahead adder has O(log n) depth.
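The O(1)-depth claim comes from the 3:2 carry-save step: three addends are compressed into two (a sum word and a carry word) with no carry propagation. A software model of one compressor step, as a sketch rather than the actual circuit:

```cpp
#include <cstdint>

// 3:2 carry-save compressor: reduces three addends to two whose sum is
// unchanged. Each output bit depends on only three input bits, so the
// step has constant depth regardless of word width. A Wallace tree
// applies such steps in layers until two addends remain for the final
// carry-lookahead adder.
void carry_save(uint32_t a, uint32_t b, uint32_t c,
                uint32_t& sum, uint32_t& carry) {
    sum   = a ^ b ^ c;                          // bitwise sum, no carries
    carry = ((a & b) | (a & c) | (b & c)) << 1; // carries, shifted into place
}
```

The invariant is that sum + carry equals a + b + c, which is what lets the tree defer all carry propagation to a single final adder.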
HSPICE Perceptron Simulations
2 cycles at 833 MHz, 4 cycles at 1.76 GHz, 180 nm technology
Future Work and Conclusions
Future Work with Perceptron Predictor
Let's make the best predictor even better
Better representation
Better training algorithm
Latency is a problem
Crazy people are saying that overriding organizations don't work as well as simple but large predictors [Me, HPCA 2003]
How can we eliminate the latency of the perceptron predictor?
Future Work with Perceptron Predictor continued
Value prediction
Predict value of a load to mitigate memory latency
Indirect branch prediction
Virtual dispatch
Switch statements in C
Exit prediction
Predict the taken exit from predicated hyperblocks
Future Work: Characterizing Predictability
Branch predictability, value predictability
How can we characterize algorithms in terms of their predictability?
Given an algorithm, how can we transform it so that its branches and values are easier to predict?
How much predictability is inherent in the algorithm, and how much is an artifact of the program structure?
How can we compare different algorithms' predictability?
Conclusions
Neural predictors can improve performance for deeply pipelined microprocessors.
Perceptron learning is well-suited for microarchitectural implementation.
There is still a lot of work left to be done on the perceptron predictor in particular and on microarchitectural prediction in general.
The End