
_____________________________________________________________ Imperial College London

1

Classifier technology and the illusion of progress

David J. Hand Imperial College London

{[email protected]}

British Classification Society and Royal Statistical Society Statistical Computing Section

May 2005 2006


Supervised classification paradigm

- given a set of people with known descriptor variables and known outcome class membership

- construct a rule which will allow new people to be assigned to an outcome class on the basis of their descriptor variables


Such problems have been examined in several areas:

- statistics

- machine learning

- pattern recognition

- data mining

- operations research


⇒ many algorithms and models

- linear discriminant analysis
- quadratic discriminant analysis
- regularised discriminant analysis
- the naive Bayes method
- logistic discriminant analysis
- perceptrons
- neural networks
- radial basis function methods
- vector quantization methods
- nearest neighbour and kernel nonparametric methods
- tree classifiers such as CART and C4.5
- support vector machines
- rule-based methods
- random forests
- etc etc etc


Different methods have different attributes:

- predictive accuracy
- speed of classification
- speed of rule construction
- interpretability
- etc.


Work is continuing to progress

- optimisation methods, e.g. genetic algorithms
- variable selection
- bagging
- boosting
- assessing classifier performance
  - error rate, and its estimators
  - ROC curves
  - etc. etc. etc.


I’m going to argue:

- this paradigm ignores important aspects of real problems
- uncertainty due to these other aspects often swamps the claimed improvement

so that

- the ‘improvement’ claimed for advanced methods is often illusory

I’m not claiming too much

- it doesn’t apply for all problems in all situations
- but ‘there is some truth in what I say’


Part I: Overmodelling
Part II: Marginal improvement
Part III: Some further evidence
Part IV: Conclusion


Part I: Overmodelling

Contention: Much of the apparent improvement is in the context of a narrowly defined ‘classical’ paradigm which fails to take account of important aspects of the problem


Related to overfitting: optimising fit to presented design data, which is merely a sample from the underlying distribution

But distinct

Here: optimising fit to the presented problem, which is merely a single point sample from the space of possible future problems


I.1 Population drift

Classical paradigm assumes the distributions do not change over time

This is unrealistic:

- commercial applications concerned with human behaviour: customers will change their behaviour
  - with price changes
  - with changes to products
  - with changing competition
  - with changing economic conditions
  - with changes in marketing and advertising practice
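
The effect of drift on a rule that is never updated can be sketched numerically. The setup below is purely illustrative and is not taken from the credit data in this talk: it assumes two unit-variance normal classes with equal priors, a threshold fixed at design time, and a hypothetical `shift` applied to both class means afterwards.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def error_rate_after_drift(shift: float, threshold: float = 1.0) -> float:
    """Error rate of a fixed threshold rule once both class means have drifted.

    Assumed toy setup (not from the talk): class 0 ~ N(0 + shift, 1),
    class 1 ~ N(2 + shift, 1), equal priors. The rule "predict class 1 if
    x > threshold" is optimal at shift = 0 and is never updated.
    """
    # Class 0 points that now fall above the old threshold.
    false_positive = 1.0 - normal_cdf(threshold - shift)
    # Class 1 points that still fall below the old threshold.
    false_negative = normal_cdf(threshold - (2.0 + shift))
    return 0.5 * (false_positive + false_negative)


if __name__ == "__main__":
    for shift in (0.0, 0.25, 0.5, 1.0):
        print(f"shift={shift:.2f}  error={error_rate_after_drift(shift):.3f}")
```

Under these assumptions the error rate grows with the shift, although nothing about the rule itself has changed.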


Example:

- deterioration of scorecards in credit scoring
- need updating every few months
- because the distributions change

92,258 customers, unsecured personal loans, 2 year term, 5 years of data, 17 predictor variables, 8.86% bad

Design set: alternate customers 1 to 4,999

LDA and tree

Applied to alternate customers 2 to 60,000

Tree should model design set distributions better than LDA


Lowess smooths of the misclassification cost (costs inversely proportional to priors); $\pi_i$ is the prior (class size) of class $i$

Note 1: only ‘good’ customers here, so normally even worse

Note 2: true class known only at end of loan: taking first 5000 customers as design set means models can only be built when the 40,000th customer applies


‘In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists’.

Eric Hoffer


I.2 Sample selectivity bias

Classical paradigm assumes design data and new points from same distribution

Build a classifier to select ‘good’ customers

Those accepted yield known true class

Those rejected yield no class information

⇒ biased sample from the population of applicants

⇒ difficulties in building improved classifiers and in assessing performance

No point in building a highly accurate model for the available data

The problem of ‘reject inference’


e.g. Number of weeks since last CCJ

Level (weeks)   % G in whole pop.   % G in accepts   Ratio
1-26            22.7                44.6             1.964
27-52           30.4                46.9             1.542
53-104          31.4                49.2             1.567
105-208         37.8                55.2             1.460
209-312         42.6                63.1             1.481
> 312           55.6                69.2             1.245


I.3 Errors in class labels

Classical paradigm assumes no errors in class labels

If expect errors, then can model this

If don’t expect, but they occur....


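True posterior odds: $r(\mathbf{x}) = p(1 \mid \mathbf{x}) / p(2 \mid \mathbf{x})$

Proportion $\delta$ of each class incorrectly believed to come from the other class at each $\mathbf{x}$

Apparent posterior odds:

$$r^*(\mathbf{x}) = \frac{(1 - \delta)\, r(\mathbf{x}) + \delta}{\delta\, r(\mathbf{x}) + (1 - \delta)}$$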


For small $\delta$, $r^*(\mathbf{x})$ is monotonic increasing in $r(\mathbf{x})$

⇒ contours of $r(\mathbf{x})$ map to corresponding contours of $r^*(\mathbf{x})$

⇒ decision surface $r(\mathbf{x}) = k$ maps to identically shaped decision surface $r^*(\mathbf{x}) = k^*$

But

$r^*(\mathbf{x}) > r(\mathbf{x})$ whenever $r(\mathbf{x}) < 1$

$r^*(\mathbf{x}) < r(\mathbf{x})$ whenever $r(\mathbf{x}) > 1$

An advantage if classification threshold = 1

A loss of accuracy otherwise

And loss of accuracy in estimating the decision surface
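
The mapping from true to apparent odds is easy to tabulate. The sketch below assumes the mislabelling model stated above (a proportion δ of each class swapped at every x, with δ < 0.5); the function name and the value δ = 0.1 are illustrative choices, not from the talk.

```python
def apparent_odds(r: float, delta: float) -> float:
    """Apparent posterior odds r* given true odds r and mislabelling rate delta.

    Uses r* = ((1 - delta) * r + delta) / (delta * r + (1 - delta)).
    """
    if r < 0.0:
        raise ValueError("odds must be non-negative")
    if not 0.0 <= delta < 0.5:
        raise ValueError("this sketch assumes 0 <= delta < 0.5")
    return ((1.0 - delta) * r + delta) / (delta * r + (1.0 - delta))


if __name__ == "__main__":
    delta = 0.1  # illustrative mislabelling rate
    for r in (0.1, 0.25, 1.0, 4.0, 10.0):
        print(f"r={r:5.2f}  r*={apparent_odds(r, delta):.3f}")
```

The apparent odds are pulled towards 1 from both sides, so a threshold $k$ on $r$ corresponds to the shifted threshold $k^* = $ `apparent_odds(k, delta)` on $r^*$, and only $k = 1$ is left unchanged.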


I.4 Arbitrariness in the class definition

Classical paradigm assumes the classes are well-defined, i.e. a clear external criterion for the class labels

Not always true

- may decide to vary the definition of bad in response to changing economic circumstances


e.g. in customer management. Lewis (1994, p36): a 'good' account in a revolving credit operation is:

- (On the books for a minimum of 10 months)
- and (Activity in six of the most recent 10 months)
- and (Purchases of more than $50 in at least three of the past 24 months)
- and (Not more than once 30 days delinquent in the past 24 months)

A 'bad' account is:

- (Delinquent for 90 days or more at any time with an outstanding undisputed balance of $50 or more)
- or (Delinquent three times for 60 days in the past 12 months with an outstanding undisputed balance on each occasion of $50 or more)
- or (Bankrupt while account was open)


No point in building a very accurate model to predict classes which are different from those of real interest


I.5 Optimisation criteria, performance assessment

To fit a model: optimise measure of goodness-of-fit (perhaps with a penalisation term)

But different criteria generally lead to different models

[Figure: Benton (2001), ionosphere data. Optimum error rate: top-left to bottom-right. Optimum Gini: bottom-left to top-right]


There is little point in finding the model which minimises a performance measure which is irrelevant to one’s aims: marginal ‘improvements’ of the irrelevant criterion could be adverse changes for the relevant criterion

In particular, likelihood is seldom of direct interest

Can view likelihood as a general measure of model fit, for use in cases where one does not know the real criterion


I.6 Other sources of uncertainty

• more complicated models often require tuning of the parameters, e.g. the number of hidden nodes in a neural network

• simple models can often be applied successfully by inexperienced users, e.g. logistic regression, perceptrons

⇒ effectiveness of model depends on expertise of the user


• improvements to be gained by advanced methods may be subject to large amounts of dataset bias

⇒ different types of data sets, with different structures, may not respond as well to the advantages of the methods

(e.g. method X may be great on credit scoring data (categorical, poor separability) but not so good on microarray data (continuous, high dimensionality))


• many of the comparative studies between classification rules draw their conclusions from a very small number of points correctly classified by one method but not by another

⇒ again doubt must be cast on the generalisability of the conclusions to other data sets, and especially to data sets from different kinds of sources
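
One way to see why a handful of discordant points is weak evidence is an exact sign test on those points (often called McNemar's exact test). This is an illustration chosen here, not a method from the talk, and the counts used are hypothetical.

```python
from math import comb


def exact_sign_test_p_value(only_a_correct: int, only_b_correct: int) -> float:
    """Two-sided exact p-value for 'methods A and B are equally accurate'.

    Only discordant test points matter: those classified correctly by exactly
    one of the two methods. Under the null hypothesis each discordant point
    is equally likely to favour A or B, so the count is Binomial(n, 0.5).
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0
    k = min(only_a_correct, only_b_correct)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the smaller tail and cap at 1 for a two-sided p-value.
    return min(1.0, 2.0 * lower_tail)


if __name__ == "__main__":
    # Hypothetical counts: A is right on 7 points where B is wrong,
    # B is right on 3 points where A is wrong.
    print(exact_sign_test_p_value(7, 3))
```

With only ten discordant points, even a 7 to 3 split is well within what chance alone would produce, whatever the size of the rest of the test set.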


I.7 The UCI data repository

Good: can compare methods on same data sets

Good: can compare new methods with historical methods

But

- Evidence of overfitting to the data sets in the repository
- Evidence that these data sets are ‘easy’
- Evidence that these data sets do not well match modern problems


Complications of comparisons using UCI data

- what’s a dataset?
  - different preprocessing
  - select a subset of predictors
  - substitute for missing values?
  - nominal variables as weights of evidence vs dummies
  - different subsets of a dataset?
  - articles not always clear which data set has been used
  - or how it has been preprocessed

- what’s a method?
  - different k-nn metric means different methods?
  - are ANNs with different number of nodes different?
  - logistic regression with/without interaction variables?
  - user input vs completely automatic
  - different parameterisation


Part II: Marginal improvement

The 80/20 law, Pareto’s Principle: about 80% of the wealth of a country is controlled by 20% of the people

Law of diminishing returns: ‘under certain circumstances the returns to a given additional quantity of labour must necessarily diminish’ (Cannan, 1892).


Statistical modelling

Sequential nature of modelling

- either just add refinement to existing model
- or recompute existing model with extra term

Compare predictive accuracy of original with predictive accuracy of refined model

Marginal improvement will decrease as the model complexity increases: initial crude versions of the model will explain the largest portions of the uncertainty in the prediction

Not surprising: one tries to gain as much as one can with the early models


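A simple regression example

Response $y$

Predictors $\mathbf{x} = (x_1, \ldots, x_p)^T$

$$\operatorname{Cov}\left[(\mathbf{x}^T, y)^T\right] = \begin{pmatrix} (1 - \rho)\,\mathbf{I} + \rho\,\mathbf{1}\mathbf{1}^T & \boldsymbol{\tau} \\ \boldsymbol{\tau}^T & 1 \end{pmatrix}$$

with $\rho, \tau \ge 0$, where $\boldsymbol{\tau} = (\tau, \ldots, \tau)^T$: every pair of predictors has correlation $\rho$ and every predictor has correlation $\tau$ with $y$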


Linear combination of the $x_i$ with maximum correlation with $y$ has

$$R = r(\boldsymbol{\beta}'\mathbf{x}, y) = \frac{p\,\tau}{\left(p\,(1 + (p - 1)\rho)\right)^{1/2}}$$

$R \le 1$ ⇒ $\tau^2 \le \rho + (1 - \rho)/p$


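$V(p)$ = conditional variance of $y$ given $p$ predictor variables

Then

$$V(p) = 1 - \frac{p\,\tau^2}{1 - \rho} + \frac{p^2 \tau^2 \rho}{(1 - \rho)\,(1 + (p - 1)\rho)} = 1 - \frac{p\,\tau^2}{1 + (p - 1)\rho}$$

and

$$V(p) - V(p + 1) = \frac{\tau^2}{1 - \rho} + \frac{\tau^2 \rho}{1 - \rho}\left[\frac{p^2}{1 + (p - 1)\rho} - \frac{(p + 1)^2}{1 + p\rho}\right] = \frac{\tau^2 (1 - \rho)}{(1 + (p - 1)\rho)\,(1 + p\rho)}$$

with $-(p - 1)^{-1} < \rho < 1$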


Case 1: $\rho = 0$ ⇒ $V(p) - V(p + 1) = \tau^2$

Case 2: $\rho > 0$ ⇒ $V(p)$ vs $\rho$ (0.9 ↓ 0.1)

[Figure: conditional variance (y-axis, 0.0 to 1.0) against number of predictors (x-axis, 0 to 6), one curve for each of rho = 0.1, 0.3, 0.5, 0.7, 0.9]


When there is reasonably strong mutual correlation between the predictor variables, the earliest ones contribute substantially more to the reduction in variance remaining unexplained than do the later ones

Reduction in conditional variance of the response variable decreases with each additional predictor we add

- even though each predictor has identical correlation with response
- provided this correlation is non-zero

Because much of the predictive power of a new predictor has already been accounted for by the existing predictors, via their correlation
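
The diminishing reductions can be checked numerically. This sketch assumes the equicorrelated example above, for which the conditional variance simplifies to $V(p) = 1 - p\tau^2 / (1 + (p - 1)\rho)$; the function names and the values $\rho = 0.5$, $\tau = 0.5$ are illustrative choices, not from the talk.

```python
def conditional_variance(p: int, rho: float, tau: float) -> float:
    """V(p): variance of y left unexplained by p equicorrelated predictors.

    rho is the correlation between any two predictors, tau the correlation
    between each predictor and y. V(0) = 1 because y has unit variance.
    """
    if p == 0:
        return 1.0
    explained = p * tau ** 2 / (1.0 + (p - 1) * rho)
    if explained > 1.0:
        raise ValueError("tau is too large for this rho and p (needs R <= 1)")
    return 1.0 - explained


def marginal_reductions(max_p: int, rho: float, tau: float) -> list:
    """V(p) - V(p + 1) for p = 0, ..., max_p - 1."""
    return [
        conditional_variance(p, rho, tau) - conditional_variance(p + 1, rho, tau)
        for p in range(max_p)
    ]


if __name__ == "__main__":
    for step, drop in enumerate(marginal_reductions(5, rho=0.5, tau=0.5), start=1):
        print(f"predictor {step}: variance reduced by {drop:.4f}")
```

With $\rho = 0$ every predictor removes the same amount $\tau^2$; with $\rho > 0$ each later predictor removes less than the one before it.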


Real life even more pronounced because

- predictor variables not identically correlated with the response
- predictors are selected sequentially, beginning with those which maximally reduce the conditional variance


The flat maximum effect

Predictor variables $\mathbf{x} = (x_1, \ldots, x_p)^T$

Define $w = \sum_{i=1}^{p} w_i x_i$, with $w_i \ge 0$, $\sum_i w_i = 1$

and $v = \sum_{i=1}^{p} v_i x_i$, with $v_i = 1/p$, $i = 1, \ldots, p$

⇒ $r(v, w) \ge \min_k \frac{1}{p} \sum_i r(x_i, x_k)$

i.e., correlation between arbitrary weighted sum of the x variables (with positive weights summing to 1) and the simple combination using equal weights is bounded below by the smallest row average of the entries in the correlation matrix of the x variables


⇒ if the correlations are all high, the simple average is highly correlated with any other weighted sum, so that the choice of weights makes little difference to the scores

⇒ gain to be made by optimising the weights may not be worth the effort
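
The bound can be checked directly from a correlation matrix, without simulating data. The sketch below assumes standardised predictors, so that the covariance of two weighted sums is a quadratic form in the correlation matrix; the equicorrelated matrix with correlation 0.8 and the weight vector are hypothetical examples.

```python
import math


def quadratic_form(a, corr, b):
    """a' R b for weight vectors a, b and correlation matrix R (list of rows)."""
    return sum(a[i] * corr[i][k] * b[k] for i in range(len(a)) for k in range(len(b)))


def corr_with_equal_weights(corr, weights):
    """Correlation between the weighted sum w'x and the equal-weight sum v'x."""
    p = len(corr)
    equal = [1.0 / p] * p
    covariance = quadratic_form(equal, corr, weights)
    variances = quadratic_form(equal, corr, equal) * quadratic_form(weights, corr, weights)
    return covariance / math.sqrt(variances)


def smallest_row_average(corr):
    """Lower bound from the flat maximum effect: min over rows of the row mean."""
    return min(sum(row) / len(row) for row in corr)


def equicorrelation_matrix(p, rho):
    """p x p matrix with 1 on the diagonal and rho elsewhere."""
    return [[1.0 if i == k else rho for k in range(p)] for i in range(p)]


if __name__ == "__main__":
    corr = equicorrelation_matrix(4, 0.8)   # hypothetical, highly correlated predictors
    weights = [0.7, 0.1, 0.1, 0.1]          # deliberately far from equal weights
    print("correlation:", corr_with_equal_weights(corr, weights))
    print("lower bound:", smallest_row_average(corr))
```

Even with weights this uneven, the weighted score stays very highly correlated with the simple average, comfortably above the bound.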


In classification:

To illustrate, use misclassification rate: $e$

Simplest model, no predictors: $e = m_0 = \pi_0$, the prior probability of the smallest class

⇒ scope for further reduction is $m_0$

Improved model, reduces $e$ to $e = m_1$

⇒ scope for further reduction is $m_1 \le m_0$

Scope for improvement reduces at each step

(In fact, cannot improve below Bayes error rate)


Literature includes (mainly artificial) examples which simple models cannot separate

- intertwined spirals
- chequerboard patterns

Such problems are exceedingly rare in practice

Conversely (two class case) typically the class centroids are different, so simple linear surface does surprisingly well as an estimate of the true decision surface

⇒ dramatic steps in improvement in classifier accuracy are made in the simple first steps


This phenomenon has been noticed by others:

e.g. Holte (1993) compared 1R, a multiple split single level tree, with C4.5

⇒ ‘on most of the datasets studied, 1R’s accuracy is about 3 percentage points lower than C4’s.’


Our study:

- Random selection of 10 datasets from the literature
- Compare LDA with best result in the literature

Dataset            Best method e.r.   Lindisc e.r.   Default rule   Prop linear
Segmentation       0.0140             0.083          0.760          0.907
Pima               0.1979             0.221          0.350          0.848
House-votes16      0.0270             0.046          0.386          0.948
Vehicle            0.1450             0.216          0.750          0.883
Satimage           0.0850             0.160          0.758          0.889
Heart Cleveland    0.1410             0.141          0.560          1.000
Splice             0.0330             0.057          0.475          0.945
Waveform21         0.0035             0.004          0.667          0.999
Led7               0.2650             0.265          0.900          1.000
Breast Wisconsin   0.0260             0.038          0.345          0.963
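
The ‘Prop linear’ column appears to be the share of the achievable error reduction, from the default rule down to the best published method, that the linear discriminant already captures. That formula is an inference from the numbers rather than something stated on the slide; the sketch below checks that it reproduces the column to within rounding.

```python
# (dataset, best method e.r., lindisc e.r., default rule e.r., reported prop linear)
ROWS = [
    ("Segmentation", 0.0140, 0.083, 0.760, 0.907),
    ("Pima", 0.1979, 0.221, 0.350, 0.848),
    ("House-votes16", 0.0270, 0.046, 0.386, 0.948),
    ("Vehicle", 0.1450, 0.216, 0.750, 0.883),
    ("Satimage", 0.0850, 0.160, 0.758, 0.889),
    ("Heart Cleveland", 0.1410, 0.141, 0.560, 1.000),
    ("Splice", 0.0330, 0.057, 0.475, 0.945),
    ("Waveform21", 0.0035, 0.004, 0.667, 0.999),
    ("Led7", 0.2650, 0.265, 0.900, 1.000),
    ("Breast Wisconsin", 0.0260, 0.038, 0.345, 0.963),
]


def proportion_linear(best: float, linear: float, default: float) -> float:
    """Assumed definition: (default - linear) / (default - best)."""
    achievable = default - best
    if achievable <= 0.0:
        raise ValueError("best method must beat the default rule")
    return (default - linear) / achievable


if __name__ == "__main__":
    for name, best, linear, default, reported in ROWS:
        computed = proportion_linear(best, linear, default)
        print(f"{name:17s} computed={computed:.3f} reported={reported:.3f}")
```

Small differences in the third decimal place are expected, because the error rates in the table are themselves rounded.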


An example: the Pima Indians data

- 268 positive for diabetes (class 1)
- 500 negative (class 0)

8 vars

Tree: plateaus because cannot prune back to exact no. nodes; test set e.r.; starting point is ‘assign everyone to larger class’

[Figure: error rate (test set), y-axis 0.22 to 0.34, against number of leaves, x-axis 0 to 40]


An example: the sonar data

- 111 for metal (class 1)
- 97 for rock (class 0)

60 vars

Neural network (95% CIs) (test set e.r.)

Starting point is ‘assign everyone to larger class’

[Figure: error rate (test set), y-axis 0.20 to 0.50, against number of hidden nodes, x-axis 0 to 15]


Tree

[Figure: error rate (test set), y-axis 0.25 to 0.50, against number of leaves, x-axis 2 to 8]


So:

- simple (early) models provide largest gains
- can be over 90% of the predictive power that can be achieved
- simple models less likely to overfit


Part III: Some further evidence

Research = progress?

Research in classifier technology ⇒ improving performance

Researchers publish their work because they think their method is an improvement

So: for the traditional test-bed datasets expect a gradual reduction in the best error rate attained, over time, as more and more sophisticated classification tools were developed

(Bounded by Bayes error; overfitting)


Pima Indian data

Error rates against article publication date

[Figure: error rate, y-axis 0.20 to 0.50, against year of study publication, x-axis 1994 to 2002]


No apparent improvement of the best classifier over time

Explanations?

- too small a time scale? (10 years)
- this is enough for researchers to know they have to beat the earlier results
- less powerful models (left of graph) have already eaten most of the predictive power in the explanatory variables
- big gains have already been taken, leaving remainder to disappear in other sources of uncertainty (e.g. choice of optimisation criterion)


Part IV: Conclusion

• Large gains in classification accuracy are made by relatively simple models

• Marginal improvement is small beyond very simple methods

• Real problems include elements of uncertainty to the extent that more complex models often fail to improve things

Favouring the intensive and extensive use of simple models?


http://stats.ma.ic.ac.uk/djhand/public_html/

