_____________________________________________________________ Imperial College London
1
Classifier technology
and the illusion of progress
David J. Hand, Imperial College London
British Classification Society and Royal Statistical Society Statistical Computing Section
May 2005 / 2006
_____________________________________________________________ Imperial College London
2
Supervised classification paradigm
- given a set of people with known descriptor variables and known outcome class membership
- construct a rule which will allow new people to be assigned to an outcome class on the basis of their descriptor variables
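A minimal sketch of this paradigm in code (scikit-learn on synthetic data; the data and numbers are illustrative stand-ins, not from the talk):

```python
# Supervised classification: learn a rule from design data with known
# classes, then assign new cases from their descriptors alone.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Design data: descriptor variables X and known outcome classes y.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(1.5, 1.0, size=(100, 3))])
y = np.repeat([0, 1], 100)

# Construct the rule from the design data...
rule = LinearDiscriminantAnalysis().fit(X, y)

# ...and assign new people to a class on the basis of their descriptors.
X_new = rng.normal(0.75, 1.0, size=(5, 3))
print(rule.predict(X_new))
```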
_____________________________________________________________ Imperial College London
3
Such problems have been examined in several areas:
- statistics
- machine learning
- pattern recognition
- data mining
- operations research
_____________________________________________________________ Imperial College London
4
⇒ many algorithms and models
- linear discriminant analysis
- quadratic discriminant analysis
- regularised discriminant analysis
- the naive Bayes method
- logistic discriminant analysis
- perceptrons
- neural networks
- radial basis function methods
- vector quantization methods
- nearest neighbour and kernel nonparametric methods
- tree classifiers such as CART and C4.5
- support vector machines
- rule-based methods
- random forests
- etc., etc., etc.
_____________________________________________________________ Imperial College London
5
Different methods have different attributes:
- predictive accuracy
- speed of classification
- speed of rule construction
- interpretability
- etc.
_____________________________________________________________ Imperial College London
6
Work continues on:
- optimisation methods, e.g. genetic algorithms
- variable selection
- bagging
- boosting
- assessing classifier performance: error rate and its estimators, ROC curves, etc.
_____________________________________________________________ Imperial College London
7
I’m going to argue:
- this paradigm ignores important aspects of real problems
- uncertainty due to these other aspects often swamps the claimed improvement, so that
- the ‘improvement’ claimed for advanced methods is often illusory
I’m not claiming too much: it doesn’t apply to all problems in all situations, but ‘there is some truth in what I say’
_____________________________________________________________ Imperial College London
8
Part I: Overmodelling
Part II: Marginal improvement
Part III: Some further evidence
Part IV: Conclusion
_____________________________________________________________ Imperial College London
9
Part I: Overmodelling
Contention: much of the apparent improvement arises in the context of a narrowly defined ‘classical’ paradigm which fails to take account of important aspects of the problem
_____________________________________________________________ Imperial College London
10
Related to overfitting: optimising fit to the presented design data, which is merely a sample from the underlying distribution
But distinct. Here: optimising fit to the presented problem, which is merely a single point sample from the space of possible future problems
_____________________________________________________________ Imperial College London
11
I.1 Population drift
The classical paradigm assumes the distributions do not change over time
This is unrealistic, e.g. in commercial applications concerned with human behaviour: customers will change their behaviour
- with price changes
- with changes to products
- with changing competition
- with changing economic conditions
- with changes in marketing and advertising practice
_____________________________________________________________ Imperial College London
12
Example: deterioration of scorecards in credit scoring
- scorecards need updating every few months, because the distributions change
Data: 92,258 customers, unsecured personal loans, 2-year term, 5 years of data, 17 predictor variables, 8.86% bad
Design set: alternate customers 1 to 4,999; LDA and a tree classifier fitted
Applied to alternate customers 2 to 60,000
The tree should model the design-set distributions better than LDA
_____________________________________________________________ Imperial College London
13
Lowess smooths of the misclassification cost (costs inversely proportional to the priors; $\pi_i$ is the prior, i.e. class size, of class $i$)
Note 1: only ‘good’ customers appear here, so in practice the deterioration is normally even worse
Note 2: the true class is known only at the end of the loan: taking the first 5,000 customers as the design set means the models can only be built by the time roughly the 40,000th customer applies
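A sketch of the logic of this experiment on synthetic data (the drifting class mean is an invented stand-in for changing customer behaviour; the slide used lowess smooths on real loan data, a moving average stands in here):

```python
# Fit two classifiers on early cases, score later cases in arrival
# order, and smooth the cost-weighted errors over time to expose drift.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 20_000
t = np.arange(n)
y = (rng.random(n) < 0.0886).astype(int)        # ~8.86% "bad", as on the slide
drift = 0.5 * t / n                              # gradual population drift
X = rng.normal(0, 1, (n, 4)) + np.outer(y, np.ones(4)) + drift[:, None]

design, test = t < 5_000, t >= 5_000
lda = LinearDiscriminantAnalysis().fit(X[design], y[design])
tree = DecisionTreeClassifier(min_samples_leaf=50).fit(X[design], y[design])

# Costs inversely proportional to the priors pi_i, as on the slide.
cost = np.where(y == 1, 1 / 0.0886, 1 / 0.9114)
for name, clf in [("LDA", lda), ("tree", tree)]:
    err = (clf.predict(X[test]) != y[test]) * cost[test]
    smooth = np.convolve(err, np.ones(2000) / 2000, mode="valid")
    print(name, "early cost:", smooth[0].round(3),
          "late cost:", smooth[-1].round(3))
```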
_____________________________________________________________ Imperial College London
14
‘In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists.’
Eric Hoffer
_____________________________________________________________ Imperial College London
15
I.2 Sample selectivity bias
The classical paradigm assumes the design data and new points come from the same distribution
Build a classifier to select ‘good’ customers:
- those accepted yield a known true class
- those rejected yield no class information
⇒ a biased sample from the population of applicants
⇒ difficulties in building improved classifiers and in assessing performance
There is no point in building a highly accurate model for the available data: this is the problem of ‘reject inference’
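A toy simulation of the bias (the accept rule and the numbers are hypothetical):

```python
# Labels are observed only for applicants accepted by the current rule,
# so a model built on accepts sees a biased slice of the population.
import numpy as np
rng = np.random.default_rng(2)

n = 100_000
score = rng.normal(0, 1, n)                    # a single descriptor
p_good = 1 / (1 + np.exp(-(0.5 + 1.5 * score)))
good = rng.random(n) < p_good

accepted = score > 0                           # current accept rule
print("P(good) among all applicants:", good.mean().round(3))
print("P(good) among accepts only: ", good[accepted].mean().round(3))
# A classifier fitted to (and assessed on) the accepts addresses a
# different population from the applicants it will actually score.
```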
_____________________________________________________________ Imperial College London
16
e.g. number of weeks since last CCJ (county court judgment)

Level (weeks)   % G in whole pop.   % G in accepts   Ratio
1-26                  22.7               44.6         1.964
27-52                 30.4               46.9         1.542
53-104                31.4               49.2         1.567
105-208               37.8               55.2         1.460
209-312               42.6               63.1         1.481
> 312                 55.6               69.2         1.245
_____________________________________________________________ Imperial College London
17
I.3 Errors in class labels
The classical paradigm assumes no errors in the class labels
If errors are expected, they can be modelled
If they are not expected, but they occur anyway...
_____________________________________________________________ Imperial College London
18
True posterior odds: $r(\mathbf{x}) = p(1 \mid \mathbf{x}) / p(2 \mid \mathbf{x})$

Suppose a proportion $\delta$ of each class is incorrectly believed to come from the other class at each $\mathbf{x}$

Apparent posterior odds:

$$ r^*(\mathbf{x}) \;=\; \frac{(1-\delta)\,r(\mathbf{x}) + \delta}{\delta\,r(\mathbf{x}) + (1-\delta)} $$
_____________________________________________________________ Imperial College London
19
For small $\delta$, $r^*(\mathbf{x})$ is monotone increasing in $r(\mathbf{x})$
⇒ contours of $r(\mathbf{x})$ map to corresponding contours of $r^*(\mathbf{x})$
⇒ the decision surface $r(\mathbf{x}) = k$ maps to an identically shaped decision surface $r^*(\mathbf{x}) = k^*$
But $r^*(\mathbf{x}) > r(\mathbf{x})$ whenever $r(\mathbf{x}) < 1$
and $r^*(\mathbf{x}) < r(\mathbf{x})$ whenever $r(\mathbf{x}) > 1$
An advantage if the classification threshold is 1
A loss of accuracy otherwise
And a loss of accuracy in estimating the decision surface
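These three properties can be checked numerically from the formula above (a quick sketch; $\delta = 0.05$ is an arbitrary small value):

```python
# Numeric check of the apparent-odds mapping under label noise.
import numpy as np

def apparent_odds(r, delta):
    return ((1 - delta) * r + delta) / (delta * r + (1 - delta))

r = np.linspace(0.1, 10, 1000)
rs = apparent_odds(r, delta=0.05)

assert np.all(np.diff(rs) > 0)        # monotone increasing in r
assert np.all(rs[r < 1] > r[r < 1])   # pulled up when r < 1
assert np.all(rs[r > 1] < r[r > 1])   # pulled down when r > 1
```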
_____________________________________________________________ Imperial College London
20
I.4 Arbitrariness in the class definition
The classical paradigm assumes the classes are well defined, i.e. that there is a clear external criterion for the class labels
Not always true: one may decide to vary the definition of ‘bad’ in response to changing economic circumstances
_____________________________________________________________ Imperial College London
21
e.g. in customer management, Lewis (1994, p. 36): a ‘good’ account in a revolving credit operation is:
- (On the books for a minimum of 10 months)
- and (Activity in six of the most recent 10 months)
- and (Purchases of more than $50 in at least three of the past 24 months)
- and (Not more than once 30 days delinquent in the past 24 months)

A ‘bad’ account is:
- (Delinquent for 90 days or more at any time with an outstanding undisputed balance of $50 or more)
- or (Delinquent three times for 60 days in the past 12 months with an outstanding undisputed balance on each occasion of $50 or more)
- or (Bankrupt while the account was open)
_____________________________________________________________ Imperial College London
22
There is no point in building a very accurate model to predict classes which are different from those of real interest
_____________________________________________________________ Imperial College London
23
I.5 Optimisation criteria and performance assessment
To fit a model, one optimises a measure of goodness of fit (perhaps with a penalisation term)
But different criteria generally lead to different models

[Figure, from Benton (2001), ionosphere data: the error-rate-optimal decision boundary runs from top-left to bottom-right; the Gini-optimal boundary runs from bottom-left to top-right]
_____________________________________________________________ Imperial College London
24
There is little point in finding the model which minimises a performance measure that is irrelevant to one’s aims: marginal ‘improvements’ in the irrelevant criterion could be adverse changes in the relevant criterion
In particular, the likelihood is seldom of direct interest; it can be viewed as a general measure of model fit, for use when the real criterion is unknown
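A small sketch of the point that the two criteria measure different things (synthetic data; which model wins under which criterion will depend on the data, the point is that the rankings need not agree):

```python
# Compare error rate and Gini (= 2*AUC - 1) for two classifiers on the
# same held-out data: different criteria, potentially different winners.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(0, 1, (4000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 1, 4000) > 0.5).astype(int)
Xtr, Xte, ytr, yte = X[:2000], X[2000:], y[:2000], y[2000:]

for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                  ("tree", DecisionTreeClassifier(min_samples_leaf=25))]:
    clf.fit(Xtr, ytr)
    err = 1 - accuracy_score(yte, clf.predict(Xte))
    gini = 2 * roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]) - 1
    print(f"{name}: error rate {err:.3f}, Gini {gini:.3f}")
# The model that minimises error rate need not maximise Gini, so
# optimising an irrelevant criterion can worsen the relevant one.
```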
_____________________________________________________________ Imperial College London
25
I.6 Other sources of uncertainty
• more complicated models often require tuning of their parameters, e.g. the number of hidden nodes in a neural network
• simple models can often be applied successfully by inexperienced users, e.g. logistic regression, perceptrons
⇒ the effectiveness of a model depends on the expertise of the user
_____________________________________________________________ Imperial College London
26
• the improvements to be gained from advanced methods may be subject to large amounts of dataset bias
⇒ different types of data sets, with different structures, may not respond equally well to the advantages of the methods
(e.g. method X may be great on credit scoring data (categorical, poor separability) but not so good on microarray data (continuous, high-dimensional))
_____________________________________________________________ Imperial College London
27
● many of the comparative studies between classification rules draw their conclusions from a very small number of points correctly classified by one method but not by another
⇒ again, doubt must be cast on the generalisability of the conclusions to other data sets, and especially to data sets from different kinds of sources
_____________________________________________________________ Imperial College London
28
I.7 The UCI data repository
Good: methods can be compared on the same data sets
Good: new methods can be compared with historical methods
But:
- evidence of overfitting to the data sets in the repository
- evidence that these data sets are ‘easy’
- evidence that these data sets do not match modern problems well
_____________________________________________________________ Imperial College London
29
Complications of comparisons using UCI data
What’s a data set?
- different preprocessing
- selecting a subset of predictors
- substituting for missing values?
- nominal variables as weights of evidence vs dummies
- different subsets of a data set?
- articles are not always clear about which data set has been used, or how it has been preprocessed
What’s a method?
- does a different k-nn metric mean a different method?
- are ANNs with different numbers of nodes different methods?
- logistic regression with or without interaction variables?
- user input vs completely automatic
- different parameterisations
_____________________________________________________________ Imperial College London
30
Part II: Marginal improvement
The 80/20 law, Pareto’s Principle: about 80% of the wealth of a country is controlled by 20% of the people
Law of diminishing returns: ‘under certain circumstances the returns to a given additional quantity of labour must necessarily diminish’ (Cannan, 1892)
_____________________________________________________________ Imperial College London
31
Statistical modelling
The sequential nature of modelling:
- either just add a refinement to the existing model
- or recompute the existing model with an extra term
Compare the predictive accuracy of the original model with that of the refined model
The marginal improvement will decrease as model complexity increases: initial crude versions of the model will explain the largest portions of the uncertainty in the prediction
Not surprising: one tries to gain as much as one can with the early models
_____________________________________________________________ Imperial College London
32
A simple regression example

Response $y$; predictors $\mathbf{x} = (x_1, \dots, x_p)^T$

$$ C \;=\; V\!\begin{pmatrix} \mathbf{x} \\ y \end{pmatrix} \;=\; \begin{pmatrix} (1-\rho)\mathbf{I} + \rho\,\mathbf{1}\mathbf{1}^T & \tau\mathbf{1} \\ \tau\mathbf{1}^T & 1 \end{pmatrix}, \qquad \rho, \tau \ge 0 $$

i.e. unit-variance predictors with common pairwise correlation $\rho$, each having correlation $\tau$ with $y$
_____________________________________________________________ Imperial College London
33
The linear combination of the $x_i$ with maximum correlation with $y$ has

$$ R^2(\mathbf{x}^T\boldsymbol{\beta}, y) \;=\; \frac{p\,\tau^2}{1 + (p-1)\rho} $$

$R \le 1$ ⇒ $\tau^2 \le \bigl(1 + (p-1)\rho\bigr)/p$
_____________________________________________________________ Imperial College London
34
Let $V(p)$ denote the conditional variance of $y$ given $p$ predictor variables. Then

$$ V(p) \;=\; 1 - \frac{p\,\tau^2}{1 + (p-1)\rho} $$

and

$$ V(p) - V(p+1) \;=\; \frac{\tau^2 (1-\rho)}{\bigl(1 + (p-1)\rho\bigr)\bigl(1 + p\rho\bigr)} $$

with $-(p-1)^{-1} < \rho < 1$
_____________________________________________________________ Imperial College London
35
Case 1: $\rho = 0$ ⇒ $V(p) - V(p+1) = \tau^2$
Case 2: $\rho > 0$ ⇒ plot of $V(p)$ for $\rho$ = 0.9 down to 0.1:

[Figure: conditional variance $V(p)$ against number of predictors (0 to 6), one curve for each of $\rho$ = 0.1, 0.3, 0.5, 0.7, 0.9]
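A short check of these formulas (a sketch; $\tau = 0.3$ is an arbitrary illustrative value):

```python
# Evaluate V(p) and the gain V(p) - V(p+1) from one more predictor.
import numpy as np

def V(p, rho, tau):
    return 1 - p * tau**2 / (1 + (p - 1) * rho)

tau = 0.3                              # common correlation with y
for rho in (0.1, 0.5, 0.9):
    v = np.array([V(p, rho, tau) for p in range(0, 7)])
    print(f"rho={rho}: V(p) =", np.round(v, 3),
          " gains =", np.round(-np.diff(v), 3))
# With rho = 0 every predictor would buy the same tau**2; with rho > 0
# the gains shrink as p grows, as in the figure above.
```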
_____________________________________________________________ Imperial College London
36
When there is reasonably strong mutual correlation between the predictor variables, the earliest ones contribute substantially more to reducing the unexplained variance than the later ones do
The reduction in the conditional variance of the response decreases with each additional predictor we add
- even though each predictor has identical correlation with the response
- provided this correlation is non-zero
This is because much of the predictive power of a new predictor has already been accounted for by the existing predictors, via their correlation
_____________________________________________________________ Imperial College London
37
In real life the effect is even more pronounced, because
- the predictor variables are not identically correlated with the response
- predictors are selected sequentially, beginning with those which maximally reduce the conditional variance
_____________________________________________________________ Imperial College London
38
The flat maximum effect
Predictor variables $\mathbf{x} = (x_1, \dots, x_p)^T$

Define $w = \sum_{i=1}^{p} w_i x_i$, with $w_i \ge 0$ and $\sum_i w_i = 1$,

and $v = \sum_{i=1}^{p} v_i x_i$, with $v_i = 1/p$ for $i = 1, \dots, p$

$$ \Rightarrow \quad r(v, w) \;\ge\; \sum_k w_k \, \frac{1}{p} \sum_i r(x_i, x_k) \;\ge\; \min_k \frac{1}{p} \sum_i r(x_i, x_k) $$

i.e. the correlation between an arbitrary weighted sum of the $x$ variables (with positive weights summing to 1) and the simple combination using equal weights is bounded below by the smallest row average of the entries in the correlation matrix of the $x$ variables
_____________________________________________________________ Imperial College London
39
⇒ if the correlations are all high, the simple average is highly correlated with any other weighted sum, so the choice of weights makes little difference to the scores
⇒ the gain to be made by optimising the weights may not be worth the effort
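A simulation of the flat maximum effect (the dimension $p = 10$ and correlation $\rho = 0.8$ are invented for illustration):

```python
# With strongly intercorrelated predictors, almost any positive
# weighting correlates highly with the equal-weights average.
import numpy as np
rng = np.random.default_rng(4)

p, n, rho = 10, 50_000, 0.8
cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

v = X.mean(axis=1)                           # equal weights 1/p
corrs = []
for _ in range(200):
    w = rng.dirichlet(np.ones(p))            # random positive weights, sum 1
    corrs.append(np.corrcoef(v, X @ w)[0, 1])

row_means = np.corrcoef(X, rowvar=False).mean(axis=1)
print("smallest row average of correlation matrix:", row_means.min().round(3))
print("smallest r(v, w) over 200 random weightings:", min(corrs).round(3))
```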
_____________________________________________________________ Imperial College London
40
In classification: to illustrate, use the misclassification rate $e$
Simplest model, no predictors: $e_0 = m_0 = \pi$, the prior probability of the smallest class
⇒ the scope for further reduction is $m_0$
An improved model reduces $e$ to $e_1 = m_1$
⇒ the scope for further reduction is $m_1 \le m_0$
The scope for improvement shrinks at each step
(in fact, one cannot improve below the Bayes error rate)
_____________________________________________________________ Imperial College London
41
The literature includes (mainly artificial) examples which simple models cannot separate:
- intertwined spirals
- chequerboard patterns
Such problems are exceedingly rare in practice
Conversely (in the two-class case) the class centroids typically differ, so a simple linear surface does surprisingly well as an estimate of the true decision surface
⇒ the dramatic improvements in classifier accuracy are made in the simple first steps
_____________________________________________________________ Imperial College London
42
This phenomenon has been noticed by others: e.g. Holte (1993) compared 1R, a single-level tree with multiple splits, with C4.5 ⇒
‘on most of the datasets studied, 1R’s accuracy is about 3 percentage points lower than C4’s.’
_____________________________________________________________ Imperial College London
43
Our study:
- a random selection of 10 data sets from the literature
- compare LDA with the best result in the literature

Dataset            Best method e.r.   Lindisc e.r.   Default rule   Prop linear
Segmentation            0.0140            0.083          0.760          0.907
Pima                    0.1979            0.221          0.350          0.848
House-votes16           0.0270            0.046          0.386          0.948
Vehicle                 0.1450            0.216          0.750          0.883
Satimage                0.0850            0.160          0.758          0.889
Heart Cleveland         0.1410            0.141          0.560          1.000
Splice                  0.0330            0.057          0.475          0.945
Waveform21              0.0035            0.004          0.667          0.999
Led7                    0.2650            0.265          0.900          1.000
Breast Wisconsin        0.0260            0.038          0.345          0.963
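The final column appears to be the proportion of the achievable error-rate reduction (from the default rule down to the best reported method) already delivered by linear discriminant analysis; recomputing it from the other columns reproduces the table to rounding (a subset of rows, for brevity):

```python
# Recompute "Prop linear" = (default - lindisc) / (default - best).
rows = {
    "Segmentation":     (0.0140, 0.083, 0.760),
    "Pima":             (0.1979, 0.221, 0.350),
    "House-votes16":    (0.0270, 0.046, 0.386),
    "Heart Cleveland":  (0.1410, 0.141, 0.560),
    "Breast Wisconsin": (0.0260, 0.038, 0.345),
}
for name, (best, lda, default) in rows.items():
    prop = (default - lda) / (default - best)
    print(f"{name:<17} proportion achieved by LDA: {prop:.3f}")
```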
_____________________________________________________________ Imperial College London
44
An example: the Pima Indians data
- 268 positive for diabetes (class 1)
- 500 negative (class 0)
- 8 variables
Tree classifier: the curve plateaus because a tree cannot be pruned back to an exact number of nodes; test-set error rate; the starting point is ‘assign everyone to the larger class’

[Figure: test-set error rate against number of leaves (0 to 40)]
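A sketch of the curve behind this slide (synthetic two-class data stand in for the Pima measurements; scikit-learn’s max_leaf_nodes caps the tree size):

```python
# Test-set error rate of a tree as the permitted number of leaves grows.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(0, 1, (1500, 8))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1.5, 1500) > 0.3).astype(int)
Xtr, Xte, ytr, yte = X[:750], X[750:], y[:750], y[750:]

base = max(yte.mean(), 1 - yte.mean())   # 'assign everyone to larger class'
print("default rule: test error", round(1 - base, 3))
for leaves in (2, 5, 10, 20, 40):
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves).fit(Xtr, ytr)
    err = (tree.predict(Xte) != yte).mean()
    print(f"leaves={leaves}: test error {err:.3f}")
```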
_____________________________________________________________ Imperial College London
45
An example: the sonar data
- 111 metal (class 1)
- 97 rock (class 0)
- 60 variables
Neural network (95% confidence intervals); test-set error rate; the starting point is ‘assign everyone to the larger class’

[Figure: test-set error rate against number of hidden nodes (0 to 15)]
_____________________________________________________________ Imperial College London
46
Tree

[Figure: test-set error rate against number of leaves (2 to 8)]
_____________________________________________________________ Imperial College London
47
So:
- simple (early) models provide the largest gains
- these can be over 90% of the predictive power that can be achieved
- simple models are less likely to overfit
_____________________________________________________________ Imperial College London
48
Part III: Some further evidence
Research = progress?
Research in classifier technology ⇒ improving performance
Researchers publish their work because they think their method is an improvement
So, for the traditional test-bed data sets, we would expect a gradual reduction over time in the best error rate attained, as more and more sophisticated classification tools are developed
(bounded below by the Bayes error rate; subject to overfitting)
_____________________________________________________________ Imperial College London
49
Pima Indian data: best reported error rates against article publication date

[Figure: error rate against year of study publication (1994 to 2002)]
_____________________________________________________________ Imperial College London
50
No apparent improvement of the best classifier over time
Explanations?
- too small a time scale? (10 years): but this is enough for researchers to know they have to beat the earlier results
- the less powerful models (at the left of the graph) have already eaten most of the predictive power in the explanatory variables
- the big gains have already been taken, leaving the remainder to disappear in other sources of uncertainty (e.g. the choice of optimisation criterion)
_____________________________________________________________ Imperial College London
51
Part IV: Conclusion
• Large gains in classification accuracy are made by relatively simple models
• Marginal improvement is small beyond very simple methods
• Real problems include elements of uncertainty to the extent that more complex models often fail to improve things
? Favouring the intensive and extensive use of simple models ?
_____________________________________________________________ Imperial College London
52
http://stats.ma.ic.ac.uk/djhand/public_html/