_____________________________________________________________ Imperial College London
1
Classifier technology
and the illusion of progress
David J. Hand, Imperial College London
British Classification Society and Royal Statistical Society Statistical Computing Section
May 2005 / 2006
_____________________________________________________________ Imperial College London
2
Supervised classification paradigm
- given a set of people with known descriptor variables and known outcome class membership
- construct a rule which will allow new people to be assigned to an outcome class on the basis of their descriptor variables
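A minimal sketch of this paradigm in code (scikit-learn on synthetic data; the data and numbers are illustrative stand-ins, not from the talk):

```python
# Supervised classification: learn a rule from design data with known
# classes, then assign new cases from their descriptors alone.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Design data: descriptor variables X and known outcome classes y.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(1.5, 1.0, size=(100, 3))])
y = np.repeat([0, 1], 100)

# Construct the rule from the design data...
rule = LinearDiscriminantAnalysis().fit(X, y)

# ...and assign new people to a class on the basis of their descriptors.
X_new = rng.normal(0.75, 1.0, size=(5, 3))
print(rule.predict(X_new))
```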
_____________________________________________________________ Imperial College London
3
Such problems have been examined in several areas:
- statistics
- machine learning
- pattern recognition
- data mining
- operations research
_____________________________________________________________ Imperial College London
4
⇒ many algorithms and models
- linear discriminant analysis
- quadratic discriminant analysis
- regularised discriminant analysis
- the naive Bayes method
- logistic discriminant analysis
- perceptrons
- neural networks
- radial basis function methods
- vector quantization methods
- nearest neighbour and kernel nonparametric methods
- tree classifiers such as CART and C4.5
- support vector machines
- rule-based methods
- random forests
- etc., etc., etc.
_____________________________________________________________ Imperial College London
5
Different methods have different attributes:
- predictive accuracy
- speed of classification
- speed of rule construction
- interpretability
- etc.
_____________________________________________________________ Imperial College London
6
Work continues on:
- optimisation methods, e.g. genetic algorithms
- variable selection
- bagging
- boosting
- assessing classifier performance: error rate and its estimators, ROC curves, etc.
_____________________________________________________________ Imperial College London
7
I’m going to argue:
- this paradigm ignores important aspects of real problems
- uncertainty due to these other aspects often swamps the claimed improvement, so that
- the ‘improvement’ claimed for advanced methods is often illusory
I’m not claiming too much: it doesn’t apply to all problems in all situations, but ‘there is some truth in what I say’
_____________________________________________________________ Imperial College London
8
Part I: Overmodelling
Part II: Marginal improvement
Part III: Some further evidence
Part IV: Conclusion
_____________________________________________________________ Imperial College London
9
Part I: Overmodelling
Contention: much of the apparent improvement arises in the context of a narrowly defined ‘classical’ paradigm which fails to take account of important aspects of the problem
_____________________________________________________________ Imperial College London
10
Related to overfitting: optimising fit to the presented design data, which is merely a sample from the underlying distribution
But distinct. Here: optimising fit to the presented problem, which is merely a single point sample from the space of possible future problems
_____________________________________________________________ Imperial College London
11
I.1 Population drift
The classical paradigm assumes the distributions do not change over time
This is unrealistic, e.g. in commercial applications concerned with human behaviour: customers will change their behaviour
- with price changes
- with changes to products
- with changing competition
- with changing economic conditions
- with changes in marketing and advertising practice
_____________________________________________________________ Imperial College London
12
Example: deterioration of scorecards in credit scoring
- scorecards need updating every few months, because the distributions change
Data: 92,258 customers, unsecured personal loans, 2-year term, 5 years of data, 17 predictor variables, 8.86% bad
Design set: alternate customers 1 to 4,999; LDA and a tree classifier fitted
Applied to alternate customers 2 to 60,000
The tree should model the design-set distributions better than LDA
_____________________________________________________________ Imperial College London
13
Lowess smooths of the misclassification cost (costs inversely proportional to the priors; $\pi_i$ is the prior, i.e. class size, of class $i$)
Note 1: only ‘good’ customers appear here, so in practice the deterioration is normally even worse
Note 2: the true class is known only at the end of the loan: taking the first 5,000 customers as the design set means the models can only be built by the time roughly the 40,000th customer applies
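A sketch of the logic of this experiment on synthetic data (the drifting class mean is an invented stand-in for changing customer behaviour; the slide used lowess smooths on real loan data, a moving average stands in here):

```python
# Fit two classifiers on early cases, score later cases in arrival
# order, and smooth the cost-weighted errors over time to expose drift.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 20_000
t = np.arange(n)
y = (rng.random(n) < 0.0886).astype(int)        # ~8.86% "bad", as on the slide
drift = 0.5 * t / n                              # gradual population drift
X = rng.normal(0, 1, (n, 4)) + np.outer(y, np.ones(4)) + drift[:, None]

design, test = t < 5_000, t >= 5_000
lda = LinearDiscriminantAnalysis().fit(X[design], y[design])
tree = DecisionTreeClassifier(min_samples_leaf=50).fit(X[design], y[design])

# Costs inversely proportional to the priors pi_i, as on the slide.
cost = np.where(y == 1, 1 / 0.0886, 1 / 0.9114)
for name, clf in [("LDA", lda), ("tree", tree)]:
    err = (clf.predict(X[test]) != y[test]) * cost[test]
    smooth = np.convolve(err, np.ones(2000) / 2000, mode="valid")
    print(name, "early cost:", smooth[0].round(3),
          "late cost:", smooth[-1].round(3))
```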
_____________________________________________________________ Imperial College London
14
‘In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists.’
Eric Hoffer
_____________________________________________________________ Imperial College London
15
I.2 Sample selectivity bias
The classical paradigm assumes the design data and new points come from the same distribution
Build a classifier to select ‘good’ customers:
- those accepted yield a known true class
- those rejected yield no class information
⇒ a biased sample from the population of applicants
⇒ difficulties in building improved classifiers and in assessing performance
There is no point in building a highly accurate model for the available data: this is the problem of ‘reject inference’
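A toy simulation of the bias (the accept rule and the numbers are hypothetical):

```python
# Labels are observed only for applicants accepted by the current rule,
# so a model built on accepts sees a biased slice of the population.
import numpy as np
rng = np.random.default_rng(2)

n = 100_000
score = rng.normal(0, 1, n)                    # a single descriptor
p_good = 1 / (1 + np.exp(-(0.5 + 1.5 * score)))
good = rng.random(n) < p_good

accepted = score > 0                           # current accept rule
print("P(good) among all applicants:", good.mean().round(3))
print("P(good) among accepts only: ", good[accepted].mean().round(3))
# A classifier fitted to (and assessed on) the accepts addresses a
# different population from the applicants it will actually score.
```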
_____________________________________________________________ Imperial College London
16
e.g. number of weeks since last CCJ (county court judgment)

Level (weeks)   % G in whole pop.   % G in accepts   Ratio
1-26                  22.7               44.6         1.964
27-52                 30.4               46.9         1.542
53-104                31.4               49.2         1.567
105-208               37.8               55.2         1.460
209-312               42.6               63.1         1.481
> 312                 55.6               69.2         1.245
_____________________________________________________________ Imperial College London
17
I.3 Errors in class labels
The classical paradigm assumes no errors in the class labels
If errors are expected, they can be modelled
If they are not expected, but they occur anyway...
_____________________________________________________________ Imperial College London
18
True posterior odds: $r(\mathbf{x}) = p(1 \mid \mathbf{x}) / p(2 \mid \mathbf{x})$

Suppose a proportion $\delta$ of each class is incorrectly believed to come from the other class at each $\mathbf{x}$

Apparent posterior odds:

$$ r^*(\mathbf{x}) \;=\; \frac{(1-\delta)\,r(\mathbf{x}) + \delta}{\delta\,r(\mathbf{x}) + (1-\delta)} $$
_____________________________________________________________ Imperial College London
19
For small $\delta$, $r^*(\mathbf{x})$ is monotone increasing in $r(\mathbf{x})$
⇒ contours of $r(\mathbf{x})$ map to corresponding contours of $r^*(\mathbf{x})$
⇒ the decision surface $r(\mathbf{x}) = k$ maps to an identically shaped decision surface $r^*(\mathbf{x}) = k^*$
But $r^*(\mathbf{x}) > r(\mathbf{x})$ whenever $r(\mathbf{x}) < 1$
and $r^*(\mathbf{x}) < r(\mathbf{x})$ whenever $r(\mathbf{x}) > 1$
An advantage if the classification threshold is 1
A loss of accuracy otherwise
And a loss of accuracy in estimating the decision surface
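These three properties can be checked numerically from the formula above (a quick sketch; $\delta = 0.05$ is an arbitrary small value):

```python
# Numeric check of the apparent-odds mapping under label noise.
import numpy as np

def apparent_odds(r, delta):
    return ((1 - delta) * r + delta) / (delta * r + (1 - delta))

r = np.linspace(0.1, 10, 1000)
rs = apparent_odds(r, delta=0.05)

assert np.all(np.diff(rs) > 0)        # monotone increasing in r
assert np.all(rs[r < 1] > r[r < 1])   # pulled up when r < 1
assert np.all(rs[r > 1] < r[r > 1])   # pulled down when r > 1
```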
_____________________________________________________________ Imperial College London
20
I.4 Arbitrariness in the class definition
The classical paradigm assumes the classes are well defined, i.e. that there is a clear external criterion for the class labels
Not always true: one may decide to vary the definition of ‘bad’ in response to changing economic circumstances
_____________________________________________________________ Imperial College London
21
e.g. in customer management, Lewis (1994, p. 36): a ‘good’ account in a revolving credit operation is:
- (On the books for a minimum of 10 months)
- and (Activity in six of the most recent 10 months)
- and (Purchases of more than $50 in at least three of the past 24 months)
- and (Not more than once 30 days delinquent in the past 24 months)

A ‘bad’ account is:
- (Delinquent for 90 days or more at any time with an outstanding undisputed balance of $50 or more)
- or (Delinquent three times for 60 days in the past 12 months with an outstanding undisputed balance on each occasion of $50 or more)
- or (Bankrupt while the account was open)
_____________________________________________________________ Imperial College London
22
There is no point in building a very accurate model to predict classes which are different from those of real interest
_____________________________________________________________ Imperial College London
23
I.5 Optimisation criteria and performance assessment
To fit a model, one optimises a measure of goodness of fit (perhaps with a penalisation term)
But different criteria generally lead to different models

[Figure, from Benton (2001), ionosphere data: the error-rate-optimal decision boundary runs from top-left to bottom-right; the Gini-optimal boundary runs from bottom-left to top-right]
_____________________________________________________________ Imperial College London
24
There is little point in finding the model which minimises a performance measure that is irrelevant to one’s aims: marginal ‘improvements’ in the irrelevant criterion could be adverse changes in the relevant criterion
In particular, the likelihood is seldom of direct interest; it can be viewed as a general measure of model fit, for use when the real criterion is unknown
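A small sketch of the point that the two criteria measure different things (synthetic data; which model wins under which criterion will depend on the data, the point is that the rankings need not agree):

```python
# Compare error rate and Gini (= 2*AUC - 1) for two classifiers on the
# same held-out data: different criteria, potentially different winners.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(0, 1, (4000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 1, 4000) > 0.5).astype(int)
Xtr, Xte, ytr, yte = X[:2000], X[2000:], y[:2000], y[2000:]

for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                  ("tree", DecisionTreeClassifier(min_samples_leaf=25))]:
    clf.fit(Xtr, ytr)
    err = 1 - accuracy_score(yte, clf.predict(Xte))
    gini = 2 * roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]) - 1
    print(f"{name}: error rate {err:.3f}, Gini {gini:.3f}")
# The model that minimises error rate need not maximise Gini, so
# optimising an irrelevant criterion can worsen the relevant one.
```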
_____________________________________________________________ Imperial College London
25
I.6 Other sources of uncertainty
• more complicated models often require tuning of their parameters, e.g. the number of hidden nodes in a neural network
• simple models can often be applied successfully by inexperienced users, e.g. logistic regression, perceptrons
⇒ the effectiveness of a model depends on the expertise of the user
_____________________________________________________________ Imperial College London
26
• the improvements to be gained from advanced methods may be subject to large amounts of dataset bias
⇒ different types of data sets, with different structures, may not respond equally well to the advantages of the methods
(e.g. method X may be great on credit scoring data (categorical, poor separability) but not so good on microarray data (continuous, high-dimensional))
_____________________________________________________________ Imperial College London
27
● many of the comparative studies between classification rules draw their conclusions from a very small number of points correctly classified by one method but not by another
⇒ again, doubt must be cast on the generalisability of the conclusions to other data sets, and especially to data sets from different kinds of sources
_____________________________________________________________ Imperial College London
28
I.7 The UCI data repository
Good: methods can be compared on the same data sets
Good: new methods can be compared with historical methods
But:
- evidence of overfitting to the data sets in the repository
- evidence that these data sets are ‘easy’
- evidence that these data sets do not match modern problems well
_____________________________________________________________ Imperial College London
29
Complications of comparisons using UCI data
What’s a data set?
- different preprocessing
- selecting a subset of predictors
- substituting for missing values?
- nominal variables as weights of evidence vs dummies
- different subsets of a data set?
- articles are not always clear about which data set has been used, or how it has been preprocessed
What’s a method?
- does a different k-nn metric mean a different method?
- are ANNs with different numbers of nodes different methods?
- logistic regression with or without interaction variables?
- user input vs completely automatic
- different parameterisations
_____________________________________________________________ Imperial College London
30
Part II: Marginal improvement
The 80/20 law, Pareto’s Principle: about 80% of the wealth of a country is controlled by 20% of the people
Law of diminishing returns: ‘under certain circumstances the returns to a given additional quantity of labour must necessarily diminish’ (Cannan, 1892)
_____________________________________________________________ Imperial College London
31
Statistical modelling
The sequential nature of modelling:
- either just add a refinement to the existing model
- or recompute the existing model with an extra term
Compare the predictive accuracy of the original model with that of the refined model
The marginal improvement will decrease as model complexity increases: initial crude versions of the model will explain the largest portions of the uncertainty in the prediction
Not surprising: one tries to gain as much as one can with the early models
_____________________________________________________________ Imperial College London
32
A simple regression example

Response $y$; predictors $\mathbf{x} = (x_1, \dots, x_p)^T$

$$ C \;=\; V\!\begin{pmatrix} \mathbf{x} \\ y \end{pmatrix} \;=\; \begin{pmatrix} (1-\rho)\mathbf{I} + \rho\,\mathbf{1}\mathbf{1}^T & \tau\mathbf{1} \\ \tau\mathbf{1}^T & 1 \end{pmatrix}, \qquad \rho, \tau \ge 0 $$

i.e. unit-variance predictors with common pairwise correlation $\rho$, each having correlation $\tau$ with $y$
_____________________________________________________________ Imperial College London
33
The linear combination of the $x_i$ with maximum correlation with $y$ has

$$ R^2(\mathbf{x}^T\boldsymbol{\beta}, y) \;=\; \frac{p\,\tau^2}{1 + (p-1)\rho} $$

$R \le 1$ ⇒ $\tau^2 \le \bigl(1 + (p-1)\rho\bigr)/p$
_____________________________________________________________ Imperial College London
34
Let $V(p)$ denote the conditional variance of $y$ given $p$ predictor variables. Then

$$ V(p) \;=\; 1 - \frac{p\,\tau^2}{1 + (p-1)\rho} $$

and

$$ V(p) - V(p+1) \;=\; \frac{\tau^2 (1-\rho)}{\bigl(1 + (p-1)\rho\bigr)\bigl(1 + p\rho\bigr)} $$

with $-(p-1)^{-1} < \rho < 1$
_____________________________________________________________ Imperial College London
35
Case 1: $\rho = 0$ ⇒ $V(p) - V(p+1) = \tau^2$
Case 2: $\rho > 0$ ⇒ plot of $V(p)$ for $\rho$ = 0.9 down to 0.1:

[Figure: conditional variance $V(p)$ against number of predictors (0 to 6), one curve for each of $\rho$ = 0.1, 0.3, 0.5, 0.7, 0.9]
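A short check of these formulas (a sketch; $\tau = 0.3$ is an arbitrary illustrative value):

```python
# Evaluate V(p) and the gain V(p) - V(p+1) from one more predictor.
import numpy as np

def V(p, rho, tau):
    return 1 - p * tau**2 / (1 + (p - 1) * rho)

tau = 0.3                              # common correlation with y
for rho in (0.1, 0.5, 0.9):
    v = np.array([V(p, rho, tau) for p in range(0, 7)])
    print(f"rho={rho}: V(p) =", np.round(v, 3),
          " gains =", np.round(-np.diff(v), 3))
# With rho = 0 every predictor would buy the same tau**2; with rho > 0
# the gains shrink as p grows, as in the figure above.
```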
_____________________________________________________________ Imperial College London
36
When there is reasonably strong mutual correlation between the predictor variables, the earliest ones contribute substantially more to reducing the unexplained variance than the later ones do
The reduction in the conditional variance of the response decreases with each additional predictor we add
- even though each predictor has identical correlation with the response
- provided this correlation is non-zero
This is because much of the predictive power of a new predictor has already been accounted for by the existing predictors, via their correlation
_____________________________________________________________ Imperial College London
37
In real life the effect is even more pronounced, because
- the predictor variables are not identically correlated with the response
- predictors are selected sequentially, beginning with those which maximally reduce the conditional variance
_____________________________________________________________ Imperial College London
38
The flat maximum effect
Predictor variables $\mathbf{x} = (x_1, \dots, x_p)^T$

Define $w = \sum_{i=1}^{p} w_i x_i$, with $w_i \ge 0$ and $\sum_i w_i = 1$,

and $v = \sum_{i=1}^{p} v_i x_i$, with $v_i = 1/p$ for $i = 1, \dots, p$

$$ \Rightarrow \quad r(v, w) \;\ge\; \sum_k w_k \, \frac{1}{p} \sum_i r(x_i, x_k) \;\ge\; \min_k \frac{1}{p} \sum_i r(x_i, x_k) $$

i.e. the correlation between an arbitrary weighted sum of the $x$ variables (with positive weights summing to 1) and the simple combination using equal weights is bounded below by the smallest row average of the entries in the correlation matrix of the $x$ variables
_____________________________________________________________ Imperial College London
39
⇒ if the correlations are all high, the simple average is highly correlated with any other weighted sum, so the choice of weights makes little difference to the scores
⇒ the gain to be made by optimising the weights may not be worth the effort
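A simulation of the flat maximum effect (the dimension $p = 10$ and correlation $\rho = 0.8$ are invented for illustration):

```python
# With strongly intercorrelated predictors, almost any positive
# weighting correlates highly with the equal-weights average.
import numpy as np
rng = np.random.default_rng(4)

p, n, rho = 10, 50_000, 0.8
cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

v = X.mean(axis=1)                           # equal weights 1/p
corrs = []
for _ in range(200):
    w = rng.dirichlet(np.ones(p))            # random positive weights, sum 1
    corrs.append(np.corrcoef(v, X @ w)[0, 1])

row_means = np.corrcoef(X, rowvar=False).mean(axis=1)
print("smallest row average of correlation matrix:", row_means.min().round(3))
print("smallest r(v, w) over 200 random weightings:", min(corrs).round(3))
```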
_____________________________________________________________ Imperial College London
40
In classification: to illustrate, use the misclassification rate $e$
Simplest model, no predictors: $e_0 = m_0 = \pi$, the prior probability of the smallest class
⇒ the scope for further reduction is $m_0$
An improved model reduces $e$ to $e_1 = m_1$
⇒ the scope for further reduction is $m_1 \le m_0$
The scope for improvement shrinks at each step
(in fact, one cannot improve below the Bayes error rate)
_____________________________________________________________ Imperial College London
41
The literature includes (mainly artificial) examples which simple models cannot separate:
- intertwined spirals
- chequerboard patterns
Such problems are exceedingly rare in practice
Conversely (in the two-class case) the class centroids typically differ, so a simple linear surface does surprisingly well as an estimate of the true decision surface
⇒ the dramatic improvements in classifier accuracy are made in the simple first steps
_____________________________________________________________ Imperial College London
42
This phenomenon has been noticed by others: e.g. Holte (1993) compared 1R, a single-level tree with multiple splits, with C4.5 ⇒
‘on most of the datasets studied, 1R’s accuracy is about 3 percentage points lower than C4’s.’
_____________________________________________________________ Imperial College London
43
Our study:
- a random selection of 10 data sets from the literature
- compare LDA with the best result in the literature

Dataset            Best method e.r.   Lindisc e.r.   Default rule   Prop linear
Segmentation            0.0140            0.083          0.760          0.907
Pima                    0.1979            0.221          0.350          0.848
House-votes16           0.0270            0.046          0.386          0.948
Vehicle                 0.1450            0.216          0.750          0.883
Satimage                0.0850            0.160          0.758          0.889
Heart Cleveland         0.1410            0.141          0.560          1.000
Splice                  0.0330            0.057          0.475          0.945
Waveform21              0.0035            0.004          0.667          0.999
Led7                    0.2650            0.265          0.900          1.000
Breast Wisconsin        0.0260            0.038          0.345          0.963
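The final column appears to be the proportion of the achievable error-rate reduction (from the default rule down to the best reported method) already delivered by linear discriminant analysis; recomputing it from the other columns reproduces the table to rounding (a subset of rows, for brevity):

```python
# Recompute "Prop linear" = (default - lindisc) / (default - best).
rows = {
    "Segmentation":     (0.0140, 0.083, 0.760),
    "Pima":             (0.1979, 0.221, 0.350),
    "House-votes16":    (0.0270, 0.046, 0.386),
    "Heart Cleveland":  (0.1410, 0.141, 0.560),
    "Breast Wisconsin": (0.0260, 0.038, 0.345),
}
for name, (best, lda, default) in rows.items():
    prop = (default - lda) / (default - best)
    print(f"{name:<17} proportion achieved by LDA: {prop:.3f}")
```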
_____________________________________________________________ Imperial College London
44
An example: the Pima Indians data
- 268 positive for diabetes (class 1)
- 500 negative (class 0)
- 8 variables
Tree classifier: the curve plateaus because a tree cannot be pruned back to an exact number of nodes; test-set error rate; the starting point is ‘assign everyone to the larger class’

[Figure: test-set error rate against number of leaves (0 to 40)]
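A sketch of the curve behind this slide (synthetic two-class data stand in for the Pima measurements; scikit-learn’s max_leaf_nodes caps the tree size):

```python
# Test-set error rate of a tree as the permitted number of leaves grows.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(0, 1, (1500, 8))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1.5, 1500) > 0.3).astype(int)
Xtr, Xte, ytr, yte = X[:750], X[750:], y[:750], y[750:]

base = max(yte.mean(), 1 - yte.mean())   # 'assign everyone to larger class'
print("default rule: test error", round(1 - base, 3))
for leaves in (2, 5, 10, 20, 40):
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves).fit(Xtr, ytr)
    err = (tree.predict(Xte) != yte).mean()
    print(f"leaves={leaves}: test error {err:.3f}")
```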
_____________________________________________________________ Imperial College London
45
An example: the sonar data
- 111 metal (class 1)
- 97 rock (class 0)
- 60 variables
Neural network (95% confidence intervals); test-set error rate; the starting point is ‘assign everyone to the larger class’

[Figure: test-set error rate against number of hidden nodes (0 to 15)]
_____________________________________________________________ Imperial College London
46
Tree

[Figure: test-set error rate against number of leaves (2 to 8)]
_____________________________________________________________ Imperial College London
47
So:
- simple (early) models provide the largest gains
- these can be over 90% of the predictive power that can be achieved
- simple models are less likely to overfit
_____________________________________________________________ Imperial College London
48
Part III: Some further evidence
Research = progress?
Research in classifier technology ⇒ improving performance
Researchers publish their work because they think their method is an improvement
So, for the traditional test-bed data sets, we would expect a gradual reduction over time in the best error rate attained, as more and more sophisticated classification tools are developed
(bounded below by the Bayes error rate; subject to overfitting)
_____________________________________________________________ Imperial College London
49
Pima Indian data: best reported error rates against article publication date

[Figure: error rate against year of study publication (1994 to 2002)]
_____________________________________________________________ Imperial College London
50
No apparent improvement of the best classifier over time
Explanations?
- too small a time scale? (10 years): but this is enough for researchers to know they have to beat the earlier results
- the less powerful models (at the left of the graph) have already eaten most of the predictive power in the explanatory variables
- the big gains have already been taken, leaving the remainder to disappear in other sources of uncertainty (e.g. the choice of optimisation criterion)
_____________________________________________________________ Imperial College London
51
Part IV: Conclusion
• Large gains in classification accuracy are made by relatively simple models
• Marginal improvement is small beyond very simple methods
• Real problems include elements of uncertainty to the extent that more complex models often fail to improve things
? Favouring the intensive and extensive use of simple models ?
_____________________________________________________________ Imperial College London
52
http://stats.ma.ic.ac.uk/djhand/public_html/