1
Applying Perceptrons to Speculation in Computer Architecture
Michael Black · Dissertation Defense · April 2, 2007
2
Presentation Outline
Background and Objectives
Perceptron behavior
Local value prediction
Global value prediction
Criticality prediction
Conclusions
3
Motivation: Jimenez’s Perceptron Branch Predictor
27% reduction in mispredictions over gshare
15.8% increase in performance over gshare¹
Why better? It can consider much longer histories.
¹Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” 2002.
4
Problem of Lookup Tables
Size grows exponentially with history
Result: must consider small subset of available data
5
Global vs. Local
Local history: past iterations of same instruction
Global history: all past dynamic instructions
6
Perceptron
Prediction:
1. Compute the dot product of the binary inputs and integer weights
2. Apply a threshold: if positive, predict 1; if negative, predict 0
Learning objective: weight values should reflect each input’s correlation with the outcome
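The two prediction steps above can be sketched in Python (a minimal illustration; the ±1 input encoding and the bias weight follow the usual perceptron-predictor formulation, and the names are mine, not code from the dissertation):

```python
def perceptron_predict(inputs, weights):
    """Predict from history bits encoded as +1/-1 and small integer weights."""
    # Step 1: dot product of inputs and weights, plus a bias weight.
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    # Step 2: threshold -- non-negative output predicts 1, negative predicts 0.
    return 1 if total >= 0 else 0

# Example: three history bits (taken = +1, not taken = -1).
print(perceptron_predict([1, -1, 1], [0, 2, -1, 3]))
```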
7
Training strategies
Training by correlation:
    if actual == input_k: w_k++
    else: w_k--
Training by error:
    error = actual - predicted
    w_k = w_k + input_k * error
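The two update rules above can be sketched as follows (a hedged sketch assuming the common ±1 encoding for inputs and outcomes; the function names are illustrative):

```python
def train_by_correlation(weights, inputs, actual):
    """If an input matched the actual outcome, increment its weight; else decrement."""
    for k, x in enumerate(inputs):
        weights[k] += 1 if x == actual else -1
    return weights

def train_by_error(weights, inputs, actual, predicted):
    """Scale each weight update by the prediction error; no change when correct."""
    error = actual - predicted
    for k, x in enumerate(inputs):
        weights[k] += x * error
    return weights
```

Note the qualitative difference the slides rely on: training-by-correlation moves every weight on every outcome, while training-by-error leaves weights untouched whenever the prediction was already correct.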
9
Dissertation Objectives
Analyze behavior of perceptrons when used to replace tables
Coping with limitations of perceptrons and their implementations
Applying perceptrons to value prediction
Applying perceptrons to criticality prediction
10
Dissertation Contributions
Perceptron Local Value Predictor can consider longer local histories
Perceptron Global-based Local Value Predictor can use global information to choose local values
Two Perceptron Global Value Predictors
Perceptron Global Criticality Predictor
Comparison and analysis of:
    perceptron training approaches
    multiple-bit topologies
    interference reduction strategies
11
Analyses
How perceptrons behave when replacing tables
What effect the training approach has
Design and behavior of different multiple-bit perceptrons
Dealing with history interference
14
What affects perceptron learning?
Noise from uncorrelated inputs
Imbalance between pattern occurrences
False correlations
Effects:
    Perceptron takes longer to learn
    Perceptron never learns
15
Noise
Training by correlation: weights grow large rapidly, so it is less susceptible
Training by error: weights don’t grow until a misprediction occurs, so it is more susceptible
Solution? Exponential weight growth
16
Studying Noise
The perceptron is modeled independently of any application. p random patterns are chosen for each level of correlation:
At n bits correlated, a random correlation direction (direct/inverse) is chosen for each of the n bits
A target is randomly chosen for each pattern; the correlation direction determines the first n bits of each pattern
The remaining bits are chosen randomly for each pattern
The perceptron is trained on each pattern set; the average training time over 1000 random pattern sets is plotted
Pattern set generation for n=4, p=2:
Directions: d d i d
1101xxxx – 1    0010xxxx – 0
11010101 – 1    00101110 – 0
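The generation procedure above can be sketched as follows (an illustration of the described methodology; the function and variable names are mine, not from the dissertation):

```python
import random

def make_pattern_set(n, p, width=8):
    """Build p random patterns whose first n bits correlate with the target."""
    # One random correlation direction per correlated bit: direct or inverse.
    directions = [random.choice("di") for _ in range(n)]
    patterns = []
    for _ in range(p):
        target = random.randint(0, 1)  # target chosen randomly per pattern
        # The direction fixes the first n bits; the rest are random noise.
        bits = [target if d == "d" else 1 - target for d in directions]
        bits += [random.randint(0, 1) for _ in range(width - n)]
        patterns.append((bits, target))
    return directions, patterns
```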
21
Findings
Increasing the history size hurts if the percentage of correlated inputs decreases
Training-by-error must be used when there is poor correlation and imbalance
22
Multibit Perceptron
Predicts values, not single bits
What is a value correlation? An input value implies a particular output value (e.g., 5 → 4)
Approaches: Disjoint, Fully Coupled, Weight per value
23
Disjoint Perceptron
Tradeoff:
+ small size
- can only learn from respective bits
[Figure: disjoint perceptron example; each output bit learns a direct or inverse correlation with the corresponding bit of past values]
24
Fully Coupled Perceptron
Tradeoff:
+ can learn from any past bit
- more weights
[Figure: fully coupled perceptron example; each output bit learns direct correlations with any bit of past values]
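The structural difference between the two topologies can be sketched as follows (a sketch under my own naming; history entries are past values given as bit lists, with ±1 encoding and a zero threshold assumed):

```python
def _sign(total):
    """Threshold: non-negative dot product predicts bit 1, else 0."""
    return 1 if total >= 0 else 0

def predict_disjoint(history, weights, nbits):
    """Disjoint: output bit j sees only bit j of each past value (small, limited)."""
    out = []
    for j in range(nbits):
        inputs = [1 if v[j] else -1 for v in history]
        out.append(_sign(sum(w * x for w, x in zip(weights[j], inputs))))
    return out

def predict_fully_coupled(history, weights, nbits):
    """Fully coupled: output bit j sees every bit of every past value (more weights)."""
    inputs = [1 if b else -1 for v in history for b in v]
    return [_sign(sum(w * x for w, x in zip(weights[j], inputs)))
            for j in range(nbits)]
```

The weight count makes the tradeoff concrete: for h past values of b bits each, disjoint needs h weights per output bit, while fully coupled needs h*b.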
28
How common is interference?
[Figure: average number of interfering branches vs. history size (1–16)]
[Figure: percentage of time the most common branch appears at the input vs. quantity of interfering branches (1–49)]
29
How does interference affect perceptrons?
constructive
destructive
neutral
weight-destructive
value-destructive
30
Interference in Perceptron Branch Prediction
[Figure: per-benchmark breakdown of interference for gzip, gcc, perlbmk, bzip2, twolf, vortex, vpr, and mcf into constructive, neutral, weight-destructive, value-destructive, and completely destructive]
31
Coping: Assigned Seats
Tradeoff:
+ no additional size
- can’t consider multiple iterations of an instruction
[Figure: each branch is assigned a fixed position in the branch history register across iterations]
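The assigned-seats idea can be sketched as follows (a sketch with hypothetical names; each branch hashes to one fixed history slot, so repeated iterations of the same branch overwrite each other, which is exactly the limitation noted above):

```python
def assigned_seat(branch_addr, history_len):
    """Hash a branch address to its fixed slot ("seat") in the history register."""
    return branch_addr % history_len

def record_outcome(history, branch_addr, taken):
    """Write the outcome into the branch's seat, overwriting any earlier
    iteration of the same branch (the scheme's stated drawback)."""
    history[assigned_seat(branch_addr, len(history))] = 1 if taken else -1
    return history
```

Because a given branch always lands at the same input position, the perceptron weight at that position always correlates with the same branch, removing positional interference without any extra storage.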
32
Weight for each interfering branch (“Piecewise Linear”)
Tradeoff:
+ interference is completely removed
- massive size
[Figure: a separate weight is kept per (branch address, history position) pair across iterations]
33
Simulator
New superscalar, cycle-accurate, execution-driven simulator
Can accurately model value prediction and criticality
34
Value Prediction
What is it? Predicting instructions’ data values to overcome data dependencies
Why consider it? It requires a multiple-bit prediction, not a single-bit one
35
Table-based Predictor
Limitations: exponential growth in past values and value history; can only consider local history
Storage: 70kB for 4 values, 34MB for 8 values, 74×10^18 B for 16 values
[Diagram: the instruction address indexes a table of data values and a value history pattern; the pattern indexes a pattern history table of 2^v entries to produce the predicted data value]
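The exponential blow-up can be illustrated with a back-of-the-envelope count (my own illustration; it counts table entries only, so the numbers are not the slide’s storage figures, which also include per-entry value storage):

```python
def pht_entries(v, h):
    """Pattern history table entries when each of h history slots can hold
    any of v distinct values: v**h, exponential in both v and h."""
    return v ** h

# Doubling the number of tracked values explodes the table
# (an illustrative history depth of 8 slots is assumed here).
for v in (4, 8, 16):
    print(v, pht_entries(v, 8))
```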
36
Perceptron in Pattern Table (PPT)
Tradeoff:
+ Few perceptrons needed (for 4 past values); can consider longer histories
- Exponential growth with # of past values
[Diagram: the instruction address indexes data values and a value history pattern of 2^v entries; the pattern selects among perceptrons 0…n, which produce log v output bits]
37
Perceptron in Value Table (PVT)
Tradeoff:
+ Linear growth in both value history and # of past values
- More perceptrons needed
[Diagram: the instruction address indexes data values and selects among perceptrons 0…n, whose log v output bits form the prediction]
41
Global-Local Predictor
[Diagram: the correct data value index is fed globally to perceptrons 0…n, selected by instruction address; their log v output bits index the local data values to form the prediction]
42
Global-Global Prediction
Tradeoff:
+ Less value storage
- More bits needed per perceptron input
[Diagram: the correct data value index is fed to perceptrons 0…n, selected by instruction address; the output indexes a global value cache to form the prediction]
43
Global Bitwise
Tradeoff:
+ No value storage; not limited to past values only
- Many more bits needed per perceptron input
[Diagram: the correct data value bits are fed directly to perceptrons 0…n, selected by instruction address, which predict the value bit by bit]
44
Global Predictors Compared
Global-Local: 3.1% accuracy increase, 1.6% performance increase; 1.2MB storage needed
Global-Global: 7.6% accuracy increase, 6.7% performance increase; 1.3MB storage needed
Bitwise: 12.7% accuracy increase, 5.3% performance increase; 4.2MB storage needed
45
Can Bitwise Predict New Values?
5.0% of all predictions are correct values never seen before
A further 9.8% are correct values not seen in the local history
46
Multibit Topologies Compared
Disjoint: 3.1% accuracy increase, 1.6% performance increase; 1.2MB storage needed
Fully Coupled: 6.8% accuracy decrease, 1.5% performance decrease; 3.8MB storage needed
Weight per Value: 10.7% accuracy increase, 4.4% performance increase; 21.5MB storage needed
49
Final Weight Values: Distribution and Accuracy
[Figure: distribution of final weight values (log-scale percentage of weights at each value, roughly −128 to 125) for five configurations: Error/Disjoint, Correlation/Disjoint, Exponential Growth/Disjoint, Error/Fully Coupled, Correlation/Fully Coupled]
51
Criticality Prediction
What is it? Predicting whether each instruction is on the critical path
Why consider it? Lack of good training information; multiple input factors
52
Counter-based Criticality
Predicts four “criteria” that indicate criticality:
QOLD: oldest waiting instruction
QOLDDEP: parent of a QOLD instruction
ALOLD: oldest instruction in the machine
QCONS: instruction with the most dependencies
[Diagram: the instruction address indexes QOLD, QOLDDEP, ALOLD, and QCONS counters, which together form the prediction]
53
Perceptron-per-Criteria (PEC)
[Diagram: separate tables of QOLD, QOLDDEP, ALOLD, and QCONS perceptrons (0…n each), indexed by instruction address and trained on the correct outcome of their criterion; their outputs combine into the prediction]
Tradeoff:
+ One input per history entry
- Can’t learn relationships between criteria
54
Single Perceptron (SP)
[Diagram: a single table of perceptrons 0…n, indexed by instruction address, trained on the correct QOLD, QOLDDEP, ALOLD, and QCONS outcomes]
Tradeoff:
+ One input per history entry and one perceptron
- Can’t learn effects of individual criteria
55
Single Perceptron with Input for Each Criterion (SPC)
[Diagram: perceptrons 0…n, indexed by instruction address, with separate inputs for the correct QOLD, QOLDDEP, ALOLD, and QCONS outcomes at each history entry]
Tradeoff:
+ Can learn relative relationships of each criterion
- Four inputs per perceptron
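The SPC organization can be sketched as follows (an illustrative sketch with my own names; the ±1 encoding and four weights per history entry are assumptions matching the four-inputs-per-entry description above):

```python
def spc_predict(history, weights):
    """history: one (qold, qolddep, alold, qcons) tuple per entry, flags in {0, 1};
    weights: four integer weights per history entry for a single perceptron."""
    total = 0
    for entry, ws in zip(history, weights):
        for flag, w in zip(entry, ws):
            total += w * (1 if flag else -1)  # encode each flag as +1/-1
    # Threshold: a non-negative sum predicts "critical".
    return 1 if total >= 0 else 0
```

Because every criterion gets its own weight at every history position, training can learn how much each criterion matters relative to the others, at the cost of four inputs per entry.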
56
Accuracy Compared
PEC: 2.9% accuracy increase, 4.2MB storage needed
SP: 4.1% accuracy increase, 1.0MB storage needed
SPC: 6.6% accuracy increase, 4.2MB storage needed
59
Final SPC Weight Distribution
[Figure: final SPC weight distribution (log-scale percentage of weights vs. weight value, −128 to 128), comparing training by error with training by correlation]
60
Conclusions
Perceptron Local Value Predictor: 5.6% accuracy increase with 1.3MB storage
Perceptron Global-based Local Value Predictor: 3.1% accuracy increase with 1.2MB storage; 10.7% increase for 21.5MB storage
Two Perceptron Global Value Predictors: 7.6% accuracy increase with 1.3MB storage; 12.7% increase for 4.2MB storage
Perceptron Global Criticality Predictor: 6.6% accuracy increase with 4.2MB storage
61
Conclusions (continued)
Perceptron training approaches: training-by-error must be used for poorly correlated applications
Multiple-bit topologies:
    Disjoint: best approach if hardware is a concern
    Fully coupled: performs poorly with low correlation
    Weight-per-value: performs very well but requires high hardware costs
Interference reduction:
    Assigned Seats: modest improvement but no additional hardware
    Piecewise: substantially more hardware, significant improvement