1
Classification and regression trees
Pierre Geurts
Stochastic methods (Prof. L. Wehenkel)
University of Liège
2
Outline
► Supervised learning
► Decision tree representation
► Decision tree learning
► Extensions
► Regression trees
► By-products
3
Database
► A collection of objects (rows) described by attributes (columns)

checking account | duration | purpose | amount | savings | years employed | age | good or bad
0<=...<200 DM | 48 | radio/tv | 5951 | ...<100 DM | 1<...<4 | 22 | bad
...<0 DM | 6 | radio/tv | 1169 | unknown | ...>7 | 67 | good
no | 12 | education | 2096 | ...<100 DM | 4<...<7 | 49 | good
...<0 DM | 42 | furniture | 7882 | ...<100 DM | 4<...<7 | 45 | good
...<0 DM | 24 | new car | 4870 | ...<100 DM | 1<...<4 | 53 | bad
no | 36 | education | 9055 | unknown | 1<...<4 | 35 | good
no | 24 | furniture | 2835 | 500<...<1000 DM | ...>7 | 53 | good
0<=...<200 DM | 36 | used car | 6948 | ...<100 DM | 1<...<4 | 35 | good
no | 12 | radio/tv | 3059 | ...>1000 DM | 4<...<7 | 61 | good
0<=...<200 DM | 30 | new car | 5234 | ...<100 DM | unemployed | 28 | bad
0<=...<200 DM | 12 | new car | 1295 | ...<100 DM | ...<1 | 25 | bad
...<0 DM | 48 | business | 4308 | ...<100 DM | ...<1 | 24 | bad
0<=...<200 DM | 12 | radio/tv | 1567 | ...<100 DM | 1<...<4 | 22 | good
4
Supervised learning
► Goal: from the database, find a function f of the inputs that approximates the output as well as possible
► Discrete output → classification problem
► Continuous output → regression problem
A1 | A2 | … | An || Y
2.3 | on | … | 3.4 || C1
1.2 | off | … | 0.3 || C2
… | … | … | … || …

A1,…,An are the inputs, Y is the output
Database = learning sample
Automatic learning produces a model Ŷ = f(A1, A2, …, An)
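As a concrete sketch of this setting, assuming scikit-learn is available (the data and the query object are invented for illustration):

```python
# A minimal sketch of supervised learning as described above, assuming
# scikit-learn is available; the learning sample is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Learning sample: rows = objects, columns = attributes A1, A2, A3; Y = class.
X = [[2.3, 1.0, 3.4],
     [1.2, 0.0, 0.3],
     [2.1, 1.0, 3.1],
     [1.0, 0.0, 0.5]]
Y = ["C1", "C2", "C1", "C2"]

model = DecisionTreeClassifier()          # the automatic learning algorithm
model.fit(X, Y)                           # builds f from the learning sample
print(model.predict([[2.2, 1.0, 3.0]]))   # Ŷ = f(A1, A2, ..., An)
```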
5
Examples of application (1)
► Predict whether a bank client will be a good debtor or not
► Image classification:
  • Handwritten character recognition (e.g. distinguishing the digits 3 and 5)
  • Face recognition
6
Examples of application (2)
► Classification of cancer types from gene expression profiles (Golub et al., 1999)

Patient | Gene 1 | Gene 2 | … | Gene 7129 | Leukemia
1 | -134 | 28 | … | 123 | AML
2 | -123 | 0 | … | 17 | AML
3 | 56 | -123 | … | -23 | ALL
… | … | … | … | … | …
72 | 89 | -123 | … | 12 | ALL
7
Learning algorithm
► It receives a learning sample and returns a function h
► A learning algorithm is defined by:
  • A hypothesis space H (= a family of candidate models)
  • A quality measure for a model
  • An optimisation strategy

(figure: a model h ∈ H obtained by automatic learning, shown as a partition of the [0,1]×[0,1] input space defined by attributes A1 and A2)
8
Decision (classification) trees
► A learning algorithm that can handle:
  • Classification problems (binary or multi-valued)
  • Attributes that may be discrete (binary or multi-valued) or continuous
► Classification trees were invented twice:
  • By statisticians: CART (Breiman et al.)
  • By the AI community: ID3, C4.5 (Quinlan et al.)
9
Hypothesis space
► A decision tree is a tree where:
  • Each interior node tests an attribute
  • Each branch corresponds to an attribute value
  • Each leaf node is labelled with a class

(example tree:)
A1
├─ a11 → A2
│   ├─ a21 → c1
│   └─ a22 → c2
├─ a12 → c1
└─ a13 → A3
    ├─ a31 → c2
    └─ a32 → c1
10
A simple database: playtennis

Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild Normal Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool High Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Hot Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Cool Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
11
A decision tree for playtennis

Outlook
├─ Sunny → Humidity
│   ├─ High → no
│   └─ Normal → yes
├─ Overcast → yes
└─ Rain → Wind
    ├─ Strong → no
    └─ Weak → yes
12
Tree learning
► Tree learning = choose the tree structure and determine the predictions at leaf nodes
► Predictions: to minimize the misclassification error, associate with each leaf the majority class among the learning sample cases reaching that node

(figure: the playtennis tree with class counts at some nodes; e.g. a node with 25 yes / 40 no predicts "no", one with 15 yes / 10 no predicts "yes", one with 14 yes / 2 no predicts "yes")
13
How to generate trees ? (1)
► What properties do we want the decision tree to have ?
1. It should be consistent with the learning sample (for the moment)
  • Trivial algorithm: construct a decision tree that has one path to a leaf for each example
  • Problem: it does not capture useful information from the database
14
How to generate trees ? (2)
► What properties do we want the decision tree to have ?
2. It should at the same time be as simple as possible
  • Trivial algorithm: generate all trees and pick the simplest one that is consistent with the learning sample
  • Problem: intractable, there are too many trees
15
Top-down induction of DTs (1)
► Choose the « best » attribute
► Split the learning sample
► Proceed recursively until each object is correctly classified

(example: splitting on Outlook partitions the learning sample into three subsets)

Outlook = Sunny:
Day | Temp. | Humidity | Wind | Play
D1 | Hot | High | Weak | No
D2 | Hot | High | Strong | No
D8 | Mild | High | Weak | No
D9 | Hot | Normal | Weak | Yes
D11 | Cool | Normal | Strong | Yes

Outlook = Overcast:
Day | Temp. | Humidity | Wind | Play
D3 | Hot | High | Weak | Yes
D7 | Cool | High | Strong | Yes
D12 | Mild | High | Strong | Yes
D13 | Hot | Normal | Weak | Yes

Outlook = Rain:
Day | Temp. | Humidity | Wind | Play
D4 | Mild | Normal | Weak | Yes
D5 | Cool | Normal | Weak | Yes
D6 | Cool | Normal | Strong | No
D10 | Mild | Normal | Strong | Yes
D14 | Mild | High | Strong | No
16
Top-down induction of DTs (2)

Procedure learn_dt(learning sample LS)
► If all objects from LS have the same class:
  • Create a leaf with that class
► Else:
  • Find the « best » splitting attribute A
  • Create a test node for this attribute
  • For each value a of A:
    ► Build LSa = {o ∈ LS | A(o) = a}
    ► Use learn_dt(LSa) to grow a subtree from LSa
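The procedure above can be sketched in a few lines of Python. This is our own illustrative implementation (the names learn_dt, majority handling, etc. are ours, not from a library), and it uses the entropy-based choice of the « best » attribute that the later slides introduce:

```python
# A compact sketch of the learn_dt procedure for discrete attributes.
# The "best" attribute is chosen by information gain (entropy reduction),
# as defined later in the slides; this is illustrative code, not library code.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def learn_dt(LS, attributes):
    """LS is a list of (object, class) pairs; an object is a dict attribute -> value."""
    classes = [y for _, y in LS]
    if len(set(classes)) == 1 or not attributes:      # pure node (or no test left):
        return Counter(classes).most_common(1)[0][0]  # leaf labelled with the majority class

    def gain(A):  # expected reduction of impurity for a split on A
        counts = Counter(o[A] for o, _ in LS)
        remainder = sum(n / len(LS) * entropy([y for o, y in LS if o[A] == a])
                        for a, n in counts.items())
        return entropy(classes) - remainder

    A = max(attributes, key=gain)                     # find the "best" attribute
    rest = [B for B in attributes if B != A]
    return {A: {a: learn_dt([(o, y) for o, y in LS if o[A] == a], rest)
                for a in set(o[A] for o, _ in LS)}}   # one subtree per value a of A
```

Applied to the playtennis sample, this grows the Outlook / Humidity / Wind tree shown earlier.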
17
Properties of TDIDT
► Hill-climbing algorithm in the space of possible decision trees:
  • It adds a sub-tree to the current tree and continues its search
  • It does not backtrack
► Sub-optimal but very fast
► Highly dependent upon the criterion for selecting attributes to test
18
Which attribute is best ?
► We want a small tree
  • We should maximize the class separation at each step, i.e. make successors as pure as possible
  • This will favour short paths in the tree

(example: a sample [29+,35-] split by A1 into [21+,5-] (T) and [8+,30-] (F), or by A2 into [18+,33-] (T) and [11+,2-] (F))
19
Impurity
► Let LS be a sample of objects and pj the proportion of objects of class j (j=1,…,J) in LS.
► Define an impurity measure I(LS) that satisfies:
  • I(LS) is minimum only when pi=1 and pj=0 for j≠i (all objects are of the same class)
  • I(LS) is maximum only when pj=1/J for all j (there is exactly the same number of objects of all classes)
  • I(LS) is symmetric with respect to p1,…,pJ
20
Reduction of impurity
► The « best » split is the split that maximizes the expected reduction of impurity:

  ΔI(LS, A) = I(LS) − Σa (|LSa| / |LS|) · I(LSa)

  where LSa is the subset of objects from LS such that A = a.
► ΔI is called a score measure or a splitting criterion
► There are many other ways to define a splitting criterion that do not rely on an impurity measure
21
Example of impurity measure (1)
► Shannon's entropy: H(LS) = −Σj pj log2 pj
► If two classes: p1 = 1 − p2
► Entropy measures impurity, uncertainty, surprise…
► The reduction of entropy is called the information gain

(figure: I(p1) as a function of p1, zero at p1 = 0 and p1 = 1, maximal at p1 = 0.5)
22
Example of impurity measure (2)
► Which attribute is best ?

A1 splits [29+,35-] (I = 0.99) into [21+,5-] (I = 0.71) and [8+,30-] (I = 0.75)
A2 splits [29+,35-] (I = 0.99) into [18+,33-] (I = 0.94) and [11+,2-] (I = 0.62)

ΔI(LS, A1) = 0.99 − (26/64)·0.71 − (38/64)·0.75 = 0.25
ΔI(LS, A2) = 0.99 − (51/64)·0.94 − (13/64)·0.62 = 0.12
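These numbers can be checked directly (a small sketch, not library code; with exact node entropies the gain of A1 comes out slightly higher, near 0.27, because the slide rounds the intermediate entropies before combining them):

```python
# Verifying the information gains above with base-2 entropy.
from math import log2

def H(pos, neg):
    """Shannon entropy of a two-class sample given positive/negative counts."""
    ps = [c / (pos + neg) for c in (pos, neg) if c > 0]
    return -sum(p * log2(p) for p in ps)

root = H(29, 35)                                        # ≈ 0.99
gain_A1 = root - 26/64 * H(21, 5) - 38/64 * H(8, 30)    # ≈ 0.27 exactly, 0.25 with rounded entropies
gain_A2 = root - 51/64 * H(18, 33) - 13/64 * H(11, 2)   # ≈ 0.12
print(round(gain_A1, 2), round(gain_A2, 2))
```

Either way, A1 yields the larger gain and would be selected.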
23
Other impurity measures
► Gini index: I(LS) = Σj pj (1 − pj)
► Misclassification error rate: I(LS) = 1 − maxj pj

(figure, two-class case: entropy, Gini index and error rate as functions of p1, all zero at p1 = 0 and p1 = 1 and maximal at p1 = 0.5)
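The three two-class curves in the figure can be sketched as follows (illustrative code; entropy is in bits, so the three measures have different maxima at p1 = 0.5):

```python
# The three two-class impurity measures defined above, side by side.
from math import log2

def entropy(p1):
    return 0.0 if p1 in (0.0, 1.0) else -p1 * log2(p1) - (1 - p1) * log2(1 - p1)

def gini(p1):
    return p1 * (1 - p1) + (1 - p1) * p1   # = 2 p1 (1 - p1)

def error_rate(p1):
    return 1 - max(p1, 1 - p1)

for p1 in (0.0, 0.25, 0.5, 1.0):
    print(p1, round(entropy(p1), 3), round(gini(p1), 3), round(error_rate(p1), 3))
```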
24
Playtennis problem
► Which attribute should be tested at the node reached when Outlook = Sunny ?

Day | Temp. | Humidity | Wind | Play
D1 | Hot | High | Weak | No
D2 | Hot | High | Strong | No
D8 | Mild | High | Weak | No
D9 | Hot | Normal | Weak | Yes
D11 | Cool | Normal | Strong | Yes

ΔI(LS,Temp.) = 0.970 − (3/5)·0.918 − (1/5)·0.0 − (1/5)·0.0 = 0.419
ΔI(LS,Hum.) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
ΔI(LS,Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
► The best attribute is Humidity
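The three scores can be reproduced on the Sunny subset (a sketch; the column order Temp., Humidity, Wind, Play is an assumption of this snippet):

```python
# Reproducing the three information gains on the Outlook = Sunny subset.
from collections import Counter
from math import log2

sunny = [("Hot",  "High",   "Weak",   "No"),    # D1
         ("Hot",  "High",   "Strong", "No"),    # D2
         ("Mild", "High",   "Weak",   "No"),    # D8
         ("Hot",  "Normal", "Weak",   "Yes"),   # D9
         ("Cool", "Normal", "Strong", "Yes")]   # D11

def H(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in sunny]
    values = set(row[col] for row in sunny)
    return H(labels) - sum(len(sub) / len(sunny) * H(sub)
                           for sub in ([row[-1] for row in sunny if row[col] == v]
                                       for v in values))

for name, col in [("Temp.", 0), ("Humidity", 1), ("Wind", 2)]:
    print(name, round(gain(col), 3))   # ≈ 0.42, 0.97, 0.02 → Humidity wins
```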
25
Overfitting (1)
► Our trees are perfectly consistent with the learning sample
► But, often, we would like them to be good at predicting classes of unseen data from the same distribution (generalization)
► A tree T overfits the learning sample iff there exists a tree T' such that:
  • ErrorLS(T) < ErrorLS(T')
  • Errorunseen(T) > Errorunseen(T')
26
Overfitting (2)
► In practice, Errorunseen(T) is estimated from a separate test sample

(figure: error versus model complexity; ErrorLS keeps decreasing as complexity grows, while Errorunseen first decreases (underfitting) and then increases again (overfitting))
27
Reasons for overfitting (1)
► Data is noisy or attributes don't completely predict the outcome

Day | Outlook | Temperature | Humidity | Wind | Play Tennis
D15 | Sunny | Mild | Normal | Strong | No

(figure: the playtennis tree predicts "yes" for D15; to stay consistent with the learning sample, a test on Temperature — no for Mild, yes for Cool and Hot — would have to be added under the Sunny/Normal branch)
28
Reasons for overfitting (2)
► Data is incomplete (not all cases covered)
► We do not have enough data in some parts of the learning sample to make a good decision

(figure: a scatter of + and − examples; a sparsely populated region of the input space yields an area with probably wrong predictions)
29
How can we avoid overfitting ?
► Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the learning sample
► Post-pruning: allow the tree to overfit and then post-prune the tree
► Ensemble methods (this afternoon)
30
Pre-pruning
► Stop splitting a node if:
  • The number of objects is too small
  • The impurity is low enough
  • The best test is not statistically significant (according to some statistical test)
► Problem:
  • The optimum value of the parameter (n, Ith, significance level) is problem dependent
  • We may miss the optimum
31
Post-pruning (1)
► Split the learning sample LS into two sets:
  • A growing sample GS to build the tree
  • A validation sample VS to evaluate its generalization error
► Build a complete tree from GS
► Compute a sequence of trees {T1, T2, …} where:
  • T1 is the complete tree
  • Ti is obtained by removing some test nodes from Ti-1
► Select the tree Ti* from the sequence that minimizes the error on VS
32
Post-pruning (2)

(figure: error versus complexity; the error on GS keeps decreasing during tree growing, while the error on VS is minimized at the optimal tree, between underfitting and overfitting; tree growing moves towards higher complexity, tree pruning moves back towards lower complexity)
33
Post-pruning (3)
► How to build the sequence of trees ?
  • Reduced error pruning:
    ► At each step, remove the node that most decreases the error on VS
  • Cost-complexity pruning:
    ► Define a cost-complexity criterion: ErrorGS(T) + α·Complexity(T)
    ► Build the sequence of trees that minimize this criterion for increasing α
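Cost-complexity pruning is available off the shelf, e.g. in scikit-learn, whose ccp_alpha parameter is exactly the α of the criterion above. A sketch, assuming scikit-learn and using its bundled iris data for illustration:

```python
# Cost-complexity pruning: build the nested tree sequence on a growing
# sample GS, then pick the tree that minimizes the error on a validation
# sample VS (assumes scikit-learn; dataset chosen only for illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_gs, X_vs, y_gs, y_vs = train_test_split(X, y, random_state=0)   # GS / VS

# Increasing alpha yields the pruning sequence T1 (full tree), T2, ...
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_gs, y_gs).ccp_alphas

trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_gs, y_gs)
         for a in alphas]
best = max(trees, key=lambda t: t.score(X_vs, y_vs))   # minimize error on VS
print(best.get_n_leaves(), best.score(X_vs, y_vs))
```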
34
Post-pruning (4)

(figure: the pruning sequence for playtennis)
T1: the full tree, with a Temperature test (no for Mild, yes for Cool and Hot) added under Sunny/Normal: ErrorGS = 0%, ErrorVS = 10%
T2: the Outlook / Humidity / Wind tree: ErrorGS = 6%, ErrorVS = 8%
T3: the Humidity subtree pruned to the leaf "no": ErrorGS = 13%, ErrorVS = 15%
T4: the Wind subtree also pruned to the leaf "yes": ErrorGS = 27%, ErrorVS = 25%
T5: a single leaf "yes": ErrorGS = 33%, ErrorVS = 35%
T2 minimizes the error on VS and is selected.
35
Post-pruning (5)
► Problem: requires dedicating one part of the learning sample as a validation set → may be a problem in the case of a small database
► Solution: N-fold cross-validation
  • Split the training set into N parts (often 10)
  • Generate N trees, each leaving out one part among the N
  • Make a prediction for each learning object with the (only) tree built without this case
  • Estimate the error of these predictions
► May be combined with pruning
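The procedure above can be sketched with scikit-learn's cross_val_score, which builds the N trees and scores each held-out part (the dataset is chosen only for illustration):

```python
# 10-fold cross-validation estimate of a decision tree's accuracy
# (assumes scikit-learn; 1 - mean accuracy is the estimated error).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())   # estimated accuracy over the 10 held-out parts
```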
36
How to use decision trees ?
► Large datasets (ideal case):
  • Split the dataset into three parts: GS, VS, TS
  • Grow a tree from GS
  • Post-prune it using VS
  • Test it on TS
► Small datasets (often):
  • Grow a tree from the whole database
  • Pre-prune with default parameters (risky), or post-prune it by 10-fold cross-validation (costly)
  • Estimate its accuracy by 10-fold cross-validation
37
Outline
► Supervised learning
► Tree representation
► Tree learning
► Extensions
  • Continuous attributes
  • Attributes with many values
  • Missing values
► Regression trees
► By-products
38
Continuous attributes (1)
► Example: temperature as a number instead of a discrete value
► Two solutions:
  • Pre-discretize: Cold if Temperature < 70, Mild between 70 and 75, Hot if Temperature > 75
  • Discretize during tree growing:

    Temperature
    ├─ ≤ 65.4 → no
    └─ > 65.4 → yes

► How to find the cut-point ?
39
Continuous attributes (2)

Learning sample (Temp., Play):
80 No, 85 No, 83 Yes, 75 Yes, 68 Yes, 65 No, 64 Yes, 72 No, 75 Yes, 70 Yes, 69 Yes, 72 Yes, 81 Yes, 71 No

Sort on Temp.:
64 Yes, 65 No, 68 Yes, 69 Yes, 70 Yes, 71 No, 72 No, 72 Yes, 75 Yes, 75 Yes, 80 No, 81 Yes, 83 Yes, 85 No

Score every candidate threshold (midpoint between consecutive distinct values):
Temp. < 64.5: I = 0.048
Temp. < 66.5: I = 0.010
Temp. < 68.5: I = 0.000
Temp. < 69.5: I = 0.015
Temp. < 70.5: I = 0.045
Temp. < 71.5: I = 0.001
Temp. < 73.5: I = 0.001
Temp. < 77.5: I = 0.025
Temp. < 80.5: I = 0.000
Temp. < 82: I = 0.010
Temp. < 84: I = 0.113
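The cut-point search above can be sketched directly: sort, then score each midpoint between consecutive distinct values by information gain (illustrative code using base-2 entropy):

```python
# Finding the best cut-point for the Temp. attribute above.
from collections import Counter
from math import log2

data = sorted([(80, "No"), (85, "No"), (83, "Yes"), (75, "Yes"), (68, "Yes"),
               (65, "No"), (64, "Yes"), (72, "No"), (75, "Yes"), (70, "Yes"),
               (69, "Yes"), (72, "Yes"), (81, "Yes"), (71, "No")])

def H(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

labels = [y for _, y in data]

def gain(cut):
    left = [y for t, y in data if t < cut]
    right = [y for t, y in data if t >= cut]
    return (H(labels) - len(left) / len(data) * H(left)
            - len(right) / len(data) * H(right))

values = sorted(set(t for t, _ in data))
cuts = [(v1 + v2) / 2 for v1, v2 in zip(values, values[1:])]
best = max(cuts, key=gain)
print(best, round(gain(best), 3))   # 84.0, the threshold with the largest gain
```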
40
Continuous attributes (3)

Number | A1 | A2 | Colour
1 | 0.58 | 0.75 | Red
2 | 0.78 | 0.65 | Red
3 | 0.89 | 0.23 | Green
4 | 0.12 | 0.98 | Red
5 | 0.17 | 0.26 | Green
6 | 0.50 | 0.48 | Red
7 | 0.45 | 0.16 | Green
8 | 0.80 | 0.75 | Green
… | … | … | …
100 | 0.75 | 0.13 | Green

(figure: the induced tree tests A2 < 0.33, then A1 < 0.91, A1 < 0.23, A2 < 0.91, A2 < 0.49, A2 < 0.65 and A2 < 0.75, with "good"/"bad" leaves, carving the [0,1]×[0,1] input space into rectangles)
41
Attributes with many values (1)
► Problem:
  • Not good splits: they fragment the data too quickly, leaving insufficient data at the next level
  • The reduction of impurity of such a test is often high (example: split on the object id)
► Two solutions:
  • Change the splitting criterion to penalize attributes with many values
  • Consider only binary splits (preferable)

(example: a test on an attribute Letter with one branch per value a, b, c, …, y, z)
42
Attributes with many values (2)
► Modified splitting criterion:
  • Gainratio(LS, A) = H(LS, A) / Splitinformation(LS, A)
  • Splitinformation(LS, A) = −Σa (|LSa| / |LS|) log(|LSa| / |LS|)
  • The split information is high when there are many values
► Example: outlook in the playtennis problem:
  • H(LS, outlook) = 0.246
  • Splitinformation(LS, outlook) = 1.577
  • Gainratio(LS, outlook) = 0.246/1.577 = 0.156 < 0.246
► Problem: the gain ratio favours unbalanced tests
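The outlook example can be checked numerically (a sketch with base-2 logarithms; with exact entropies the gain comes out as 0.247 rather than the slide's rounded 0.246):

```python
# Gain ratio for the outlook attribute of playtennis: information gain
# divided by the split information of the three-way split.
from math import log2

LS = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5   # outlook values
yes = {"Sunny": 2, "Overcast": 4, "Rain": 3}           # Play = Yes counts per value

def H(pos, neg):
    ps = [c / (pos + neg) for c in (pos, neg) if c > 0]
    return -sum(p * log2(p) for p in ps)

n = len(LS)
gain = H(9, 5) - sum(LS.count(v) / n * H(yes[v], LS.count(v) - yes[v])
                     for v in set(LS))
split_info = -sum(LS.count(v) / n * log2(LS.count(v) / n) for v in set(LS))
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))
# ≈ 0.247, 1.577, 0.156
```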
43
Attributes with many values (3)
► Allow binary tests only:

  Letter
  ├─ {a,d,o,m,t}
  └─ all other letters

► There are 2^(N−1) possible subsets for N values
► If N is small, determination of the best subset by enumeration
► If N is large, heuristics exist (e.g. greedy approach)
44
Missing attribute values
► Not all attribute values are known for every object, when learning or when testing

Day | Outlook | Temperature | Humidity | Wind | Play Tennis
D15 | Sunny | Hot | ? | Strong | No

► Three strategies:
  • Assign the most common value in the learning sample
  • Assign the most common value at that node of the tree
  • Assign a probability to each possible value
45
Regression trees (1)
► Tree for regression: exactly the same model, but with a number in each leaf instead of a class

Outlook
├─ Sunny → Humidity
│   ├─ High → 22.3
│   └─ Normal → 45.6
├─ Overcast → 64.4
└─ Rain → Wind
    ├─ Strong → 7.4
    └─ Weak → Temperature
        ├─ < 71 → 1.2
        └─ > 71 → 3.4
46
Regression trees (2)
► A regression tree is a piecewise constant function of the input attributes

(figure: a tree testing X1 ≤ t1 at the root, then X2 ≤ t2 and X1 ≤ t3, then X2 ≤ t4, with leaf predictions r1,…,r5; it partitions the (X1, X2) plane into five rectangles, each with a constant prediction ri)
47
Regression tree growing
► To minimize the square error on the learning sample, the prediction at a leaf is the average output of the learning cases reaching that leaf
► Impurity of a sample is defined by the variance of the output in that sample:

  I(LS) = vary|LS{y} = Ey|LS{(y − Ey|LS{y})²}

► The best split is the one that most reduces the variance:

  ΔI(LS, A) = vary|LS{y} − Σa (|LSa| / |LS|) vary|LSa{y}
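The variance-reduction criterion above can be sketched directly (illustrative code; the sample is invented, with outputs clustering around 1 below the cut and around 10 above it):

```python
# Variance-based splitting for regression trees, as defined above.
def var(ys):
    m = sum(ys) / len(ys)                 # the leaf prediction is this average
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(pairs, cut):
    """pairs = (attribute value, output); score the split 'attribute < cut'."""
    ys = [y for _, y in pairs]
    left = [y for a, y in pairs if a < cut]
    right = [y for a, y in pairs if a >= cut]
    return var(ys) - len(left) / len(ys) * var(left) - len(right) / len(ys) * var(right)

sample = [(1, 1.0), (2, 1.2), (3, 0.9), (7, 10.1), (8, 9.8), (9, 10.0)]
print(variance_reduction(sample, 5))   # the cut at 5 removes almost all the variance
```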
48
Regression tree pruning
► Exactly the same algorithms apply: pre-pruning and post-pruning.
► In post-pruning, the tree that minimizes the squared error on VS is selected.
► In practice, pruning is even more important in regression because full trees are much more complex (often all objects have different output values, and hence the full tree has as many leaves as there are objects in the learning sample)
49
Outline
► Supervised learning
► Tree representation
► Tree learning
► Extensions
► Regression trees
► By-products
  • Interpretability
  • Variable selection
  • Variable importance
50
Interpretability (1)
► Obvious

(figure: the playtennis tree can be read off directly, while a neural network taking Outlook, Humidity, Wind and Temperature as inputs and outputting Play / Don't play is a black box)
51
Interpretability (2)
► A tree may be converted into a set of rules:
  • If (outlook = sunny) and (humidity = high) then PlayTennis = No
  • If (outlook = sunny) and (humidity = normal) then PlayTennis = Yes
  • If (outlook = overcast) then PlayTennis = Yes
  • If (outlook = rain) and (wind = strong) then PlayTennis = No
  • If (outlook = rain) and (wind = weak) then PlayTennis = Yes
52
Attribute selection
► If some attributes are not useful for classification, they will not be selected in the (pruned) tree
► Of practical importance if measuring the value of an attribute is costly (e.g. medical diagnosis)
► Decision trees are often used as a pre-processing step for other learning algorithms that suffer more when there are irrelevant variables
53
Variable importance
► In many applications, all variables do not contribute equally to predicting the output.
► We can evaluate variable importance by computing the total reduction of impurity brought by each variable:

  Imp(A) = Σ(nodes where A is tested) |LSnode| · ΔI(LSnode, A)

(figure: for playtennis, the importance ranking is Outlook, Humidity, Wind, Temperature)
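In practice this quantity is available directly from tree libraries; for example, scikit-learn's feature_importances_ is a normalized variant of the Imp(A) formula above (scaled so the importances sum to 1). A sketch, with a dataset chosen only for illustration:

```python
# Impurity-based variable importance (assumes scikit-learn; its
# feature_importances_ is the total impurity reduction per feature,
# normalized to sum to 1).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)
for name, imp in sorted(zip(data.feature_names, tree.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")
```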
54
When are decision trees useful ?
► Advantages:
  • Very fast: can handle very large datasets with many attributes (complexity O(n·N·log N))
  • Flexible: several attribute types, classification and regression problems, missing values…
  • Interpretability: provide rules and attribute importance
► Disadvantages:
  • Instability of the trees (high variance)
  • Not always competitive with other algorithms in terms of accuracy
55
Further extensions and research
► Costs and unbalanced learning samples
► Oblique trees (tests of the form Σi ci·Ai < ath)
► Using predictive models in leaves (e.g. linear regression)
► Induction graphs
► Fuzzy decision trees (from a crisp partition to a fuzzy partition of the learning sample)
56
Demo
► Illustration with Pepito on two datasets:
  • titanic: http://www.cs.toronto.edu/~delve/data/titanic/desc.html
  • splice junction: http://www.cs.toronto.edu/~delve/data/splice/desc.html
57
References
► About tree algorithms:
  • Classification and regression trees, L. Breiman et al., Wadsworth, 1984
  • C4.5: programs for machine learning, J. R. Quinlan, Morgan Kaufmann, 1993
  • Graphes d'induction, D. Zighed and R. Rakotomalala, Hermes, 2000
► More general textbooks:
  • Artificial intelligence: a modern approach, S. Russell and P. Norvig, Prentice Hall, 2003
  • The elements of statistical learning, T. Hastie et al., Springer, 2001
  • Pattern classification, R. O. Duda et al., John Wiley and Sons, 2001
58
Software
► In R: packages tree and rpart
► C4.5: http://www.cse.unsw.edu.au/~quinlan
► Java applet: http://www.montefiore.ulg.ac.be/~geurts/
► Pepito: http://www.pepite.be
► Weka: http://www.cs.waikato.ac.nz/ml/weka