1
Classification and regression trees
Pierre Geurts
Stochastic methods (Prof. L. Wehenkel)
University of Liège
2
Outline
► Supervised learning
► Decision tree representation
► Decision tree learning
► Extensions
► Regression trees
► By-products
3
Database
► A collection of objects (rows) described by attributes (columns)

checking account | duration | purpose | amount | savings | years employed | age | good or bad
0<=...<200 DM | 48 | radio/tv | 5951 | ...<100 DM | 1<...<4 | 22 | bad
...<0 DM | 6 | radio/tv | 1169 | unknown | ...>7 | 67 | good
no | 12 | education | 2096 | ...<100 DM | 4<...<7 | 49 | good
...<0 DM | 42 | furniture | 7882 | ...<100 DM | 4<...<7 | 45 | good
...<0 DM | 24 | new car | 4870 | ...<100 DM | 1<...<4 | 53 | bad
no | 36 | education | 9055 | unknown | 1<...<4 | 35 | good
no | 24 | furniture | 2835 | 500<...<1000 DM | ...>7 | 53 | good
0<=...<200 DM | 36 | used car | 6948 | ...<100 DM | 1<...<4 | 35 | good
no | 12 | radio/tv | 3059 | ...>1000 DM | 4<...<7 | 61 | good
0<=...<200 DM | 30 | new car | 5234 | ...<100 DM | unemployed | 28 | bad
0<=...<200 DM | 12 | new car | 1295 | ...<100 DM | ...<1 | 25 | bad
...<0 DM | 48 | business | 4308 | ...<100 DM | ...<1 | 24 | bad
0<=...<200 DM | 12 | radio/tv | 1567 | ...<100 DM | 1<...<4 | 22 | good
4
Supervised learning
► Goal: from the database, find a function f of the inputs that approximates the output as well as possible
► Discrete output → classification problem
► Continuous output → regression problem
A1 | A2 | … | An || Y
2.3 | on | … | 3.4 || C1
1.2 | off | … | 0.3 || C2
… | … | … | … || …

A1,…,An are the inputs, Y is the output
Database = learning sample
Automatic learning produces a model Ŷ = f(A1, A2, …, An)
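As a concrete sketch of this setting, assuming scikit-learn is available (the data and the query object are invented for illustration):

```python
# A minimal sketch of supervised learning as described above, assuming
# scikit-learn is available; the learning sample is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Learning sample: rows = objects, columns = attributes A1, A2, A3; Y = class.
X = [[2.3, 1.0, 3.4],
     [1.2, 0.0, 0.3],
     [2.1, 1.0, 3.1],
     [1.0, 0.0, 0.5]]
Y = ["C1", "C2", "C1", "C2"]

model = DecisionTreeClassifier()          # the automatic learning algorithm
model.fit(X, Y)                           # builds f from the learning sample
print(model.predict([[2.2, 1.0, 3.0]]))   # Ŷ = f(A1, A2, ..., An)
```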
5
Examples of application (1)
► Predict whether a bank client will be a good debtor or not
► Image classification:
  • Handwritten character recognition (e.g. distinguishing the digits 3 and 5)
  • Face recognition
6
Examples of application (2)
► Classification of cancer types from gene expression profiles (Golub et al., 1999)

Patient | Gene 1 | Gene 2 | … | Gene 7129 | Leukemia
1 | -134 | 28 | … | 123 | AML
2 | -123 | 0 | … | 17 | AML
3 | 56 | -123 | … | -23 | ALL
… | … | … | … | … | …
72 | 89 | -123 | … | 12 | ALL
7
Learning algorithm
► It receives a learning sample and returns a function h
► A learning algorithm is defined by:
  • A hypothesis space H (= a family of candidate models)
  • A quality measure for a model
  • An optimisation strategy

(figure: a model h ∈ H obtained by automatic learning, shown as a partition of the [0,1]×[0,1] input space defined by attributes A1 and A2)
8
Decision (classification) trees
► A learning algorithm that can handle:
  • Classification problems (binary or multi-valued)
  • Attributes that may be discrete (binary or multi-valued) or continuous
► Classification trees were invented twice:
  • By statisticians: CART (Breiman et al.)
  • By the AI community: ID3, C4.5 (Quinlan et al.)
9
Hypothesis space
► A decision tree is a tree where:
  • Each interior node tests an attribute
  • Each branch corresponds to an attribute value
  • Each leaf node is labelled with a class

(example tree:)
A1
├─ a11 → A2
│   ├─ a21 → c1
│   └─ a22 → c2
├─ a12 → c1
└─ a13 → A3
    ├─ a31 → c2
    └─ a32 → c1
10
A simple database: playtennis

Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild Normal Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool High Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Hot Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Cool Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
11
A decision tree for playtennis

Outlook
├─ Sunny → Humidity
│   ├─ High → no
│   └─ Normal → yes
├─ Overcast → yes
└─ Rain → Wind
    ├─ Strong → no
    └─ Weak → yes
12
Tree learning
► Tree learning = choose the tree structure and determine the predictions at leaf nodes
► Predictions: to minimize the misclassification error, associate with each leaf the majority class among the learning sample cases reaching that node

(figure: the playtennis tree with class counts at some nodes; e.g. a node with 25 yes / 40 no predicts "no", one with 15 yes / 10 no predicts "yes", one with 14 yes / 2 no predicts "yes")
13
How to generate trees ? (1)
► What properties do we want the decision tree to have ?
1. It should be consistent with the learning sample (for the moment)
  • Trivial algorithm: construct a decision tree that has one path to a leaf for each example
  • Problem: it does not capture useful information from the database
14
How to generate trees ? (2)
► What properties do we want the decision tree to have ?
2. It should at the same time be as simple as possible
  • Trivial algorithm: generate all trees and pick the simplest one that is consistent with the learning sample
  • Problem: intractable, there are too many trees
15
Top-down induction of DTs (1)
► Choose the « best » attribute
► Split the learning sample
► Proceed recursively until each object is correctly classified

(example: splitting on Outlook partitions the learning sample into three subsets)

Outlook = Sunny:
Day | Temp. | Humidity | Wind | Play
D1 | Hot | High | Weak | No
D2 | Hot | High | Strong | No
D8 | Mild | High | Weak | No
D9 | Hot | Normal | Weak | Yes
D11 | Cool | Normal | Strong | Yes

Outlook = Overcast:
Day | Temp. | Humidity | Wind | Play
D3 | Hot | High | Weak | Yes
D7 | Cool | High | Strong | Yes
D12 | Mild | High | Strong | Yes
D13 | Hot | Normal | Weak | Yes

Outlook = Rain:
Day | Temp. | Humidity | Wind | Play
D4 | Mild | Normal | Weak | Yes
D5 | Cool | Normal | Weak | Yes
D6 | Cool | Normal | Strong | No
D10 | Mild | Normal | Strong | Yes
D14 | Mild | High | Strong | No
16
Top-down induction of DTs (2)

Procedure learn_dt(learning sample LS)
► If all objects from LS have the same class:
  • Create a leaf with that class
► Else:
  • Find the « best » splitting attribute A
  • Create a test node for this attribute
  • For each value a of A:
    ► Build LSa = {o ∈ LS | A(o) = a}
    ► Use learn_dt(LSa) to grow a subtree from LSa
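The procedure above can be sketched in a few lines of Python. This is our own illustrative implementation (the names learn_dt, majority handling, etc. are ours, not from a library), and it uses the entropy-based choice of the « best » attribute that the later slides introduce:

```python
# A compact sketch of the learn_dt procedure for discrete attributes.
# The "best" attribute is chosen by information gain (entropy reduction),
# as defined later in the slides; this is illustrative code, not library code.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def learn_dt(LS, attributes):
    """LS is a list of (object, class) pairs; an object is a dict attribute -> value."""
    classes = [y for _, y in LS]
    if len(set(classes)) == 1 or not attributes:      # pure node (or no test left):
        return Counter(classes).most_common(1)[0][0]  # leaf labelled with the majority class

    def gain(A):  # expected reduction of impurity for a split on A
        counts = Counter(o[A] for o, _ in LS)
        remainder = sum(n / len(LS) * entropy([y for o, y in LS if o[A] == a])
                        for a, n in counts.items())
        return entropy(classes) - remainder

    A = max(attributes, key=gain)                     # find the "best" attribute
    rest = [B for B in attributes if B != A]
    return {A: {a: learn_dt([(o, y) for o, y in LS if o[A] == a], rest)
                for a in set(o[A] for o, _ in LS)}}   # one subtree per value a of A
```

Applied to the playtennis sample, this grows the Outlook / Humidity / Wind tree shown earlier.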
17
Properties of TDIDT
► Hill-climbing algorithm in the space of possible decision trees:
  • It adds a sub-tree to the current tree and continues its search
  • It does not backtrack
► Sub-optimal but very fast
► Highly dependent upon the criterion for selecting attributes to test
18
Which attribute is best ?
► We want a small tree
  • We should maximize the class separation at each step, i.e. make successors as pure as possible
  • This will favour short paths in the tree

(example: a sample [29+,35-] split by A1 into [21+,5-] (T) and [8+,30-] (F), or by A2 into [18+,33-] (T) and [11+,2-] (F))
19
Impurity
► Let LS be a sample of objects and pj the proportion of objects of class j (j=1,…,J) in LS.
► Define an impurity measure I(LS) that satisfies:
  • I(LS) is minimum only when pi=1 and pj=0 for j≠i (all objects are of the same class)
  • I(LS) is maximum only when pj=1/J for all j (there is exactly the same number of objects of all classes)
  • I(LS) is symmetric with respect to p1,…,pJ
20
Reduction of impurity
► The « best » split is the split that maximizes the expected reduction of impurity:

  ΔI(LS, A) = I(LS) − Σa (|LSa| / |LS|) · I(LSa)

  where LSa is the subset of objects from LS such that A = a.
► ΔI is called a score measure or a splitting criterion
► There are many other ways to define a splitting criterion that do not rely on an impurity measure
21
Example of impurity measure (1)
► Shannon's entropy: H(LS) = −Σj pj log2 pj
► If two classes: p1 = 1 − p2
► Entropy measures impurity, uncertainty, surprise…
► The reduction of entropy is called the information gain

(figure: I(p1) as a function of p1, zero at p1 = 0 and p1 = 1, maximal at p1 = 0.5)
22
Example of impurity measure (2)
► Which attribute is best ?

A1 splits [29+,35-] (I = 0.99) into [21+,5-] (I = 0.71) and [8+,30-] (I = 0.75)
A2 splits [29+,35-] (I = 0.99) into [18+,33-] (I = 0.94) and [11+,2-] (I = 0.62)

ΔI(LS, A1) = 0.99 − (26/64)·0.71 − (38/64)·0.75 = 0.25
ΔI(LS, A2) = 0.99 − (51/64)·0.94 − (13/64)·0.62 = 0.12
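These numbers can be checked directly (a small sketch, not library code; with exact node entropies the gain of A1 comes out slightly higher, near 0.27, because the slide rounds the intermediate entropies before combining them):

```python
# Verifying the information gains above with base-2 entropy.
from math import log2

def H(pos, neg):
    """Shannon entropy of a two-class sample given positive/negative counts."""
    ps = [c / (pos + neg) for c in (pos, neg) if c > 0]
    return -sum(p * log2(p) for p in ps)

root = H(29, 35)                                        # ≈ 0.99
gain_A1 = root - 26/64 * H(21, 5) - 38/64 * H(8, 30)    # ≈ 0.27 exactly, 0.25 with rounded entropies
gain_A2 = root - 51/64 * H(18, 33) - 13/64 * H(11, 2)   # ≈ 0.12
print(round(gain_A1, 2), round(gain_A2, 2))
```

Either way, A1 yields the larger gain and would be selected.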
23
Other impurity measures
► Gini index: I(LS) = Σj pj (1 − pj)
► Misclassification error rate: I(LS) = 1 − maxj pj

(figure, two-class case: entropy, Gini index and error rate as functions of p1, all zero at p1 = 0 and p1 = 1 and maximal at p1 = 0.5)
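The three two-class curves in the figure can be sketched as follows (illustrative code; entropy is in bits, so the three measures have different maxima at p1 = 0.5):

```python
# The three two-class impurity measures defined above, side by side.
from math import log2

def entropy(p1):
    return 0.0 if p1 in (0.0, 1.0) else -p1 * log2(p1) - (1 - p1) * log2(1 - p1)

def gini(p1):
    return p1 * (1 - p1) + (1 - p1) * p1   # = 2 p1 (1 - p1)

def error_rate(p1):
    return 1 - max(p1, 1 - p1)

for p1 in (0.0, 0.25, 0.5, 1.0):
    print(p1, round(entropy(p1), 3), round(gini(p1), 3), round(error_rate(p1), 3))
```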
24
Playtennis problem
► Which attribute should be tested at the node reached when Outlook = Sunny ?

Day | Temp. | Humidity | Wind | Play
D1 | Hot | High | Weak | No
D2 | Hot | High | Strong | No
D8 | Mild | High | Weak | No
D9 | Hot | Normal | Weak | Yes
D11 | Cool | Normal | Strong | Yes

ΔI(LS,Temp.) = 0.970 − (3/5)·0.918 − (1/5)·0.0 − (1/5)·0.0 = 0.419
ΔI(LS,Hum.) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
ΔI(LS,Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
► The best attribute is Humidity
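The three scores can be reproduced on the Sunny subset (a sketch; the column order Temp., Humidity, Wind, Play is an assumption of this snippet):

```python
# Reproducing the three information gains on the Outlook = Sunny subset.
from collections import Counter
from math import log2

sunny = [("Hot",  "High",   "Weak",   "No"),    # D1
         ("Hot",  "High",   "Strong", "No"),    # D2
         ("Mild", "High",   "Weak",   "No"),    # D8
         ("Hot",  "Normal", "Weak",   "Yes"),   # D9
         ("Cool", "Normal", "Strong", "Yes")]   # D11

def H(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in sunny]
    values = set(row[col] for row in sunny)
    return H(labels) - sum(len(sub) / len(sunny) * H(sub)
                           for sub in ([row[-1] for row in sunny if row[col] == v]
                                       for v in values))

for name, col in [("Temp.", 0), ("Humidity", 1), ("Wind", 2)]:
    print(name, round(gain(col), 3))   # ≈ 0.42, 0.97, 0.02 → Humidity wins
```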
25
Overfitting (1)
► Our trees are perfectly consistent with the learning sample
► But, often, we would like them to be good at predicting classes of unseen data from the same distribution (generalization)
► A tree T overfits the learning sample iff there exists a tree T' such that:
  • ErrorLS(T) < ErrorLS(T')
  • Errorunseen(T) > Errorunseen(T')
26
Overfitting (2)
► In practice, Errorunseen(T) is estimated from a separate test sample

(figure: error versus model complexity; ErrorLS keeps decreasing as complexity grows, while Errorunseen first decreases (underfitting) and then increases again (overfitting))
27
Reasons for overfitting (1)
► Data is noisy or attributes don't completely predict the outcome

Day | Outlook | Temperature | Humidity | Wind | Play Tennis
D15 | Sunny | Mild | Normal | Strong | No

(figure: the playtennis tree predicts "yes" for D15; to stay consistent with the learning sample, a test on Temperature — no for Mild, yes for Cool and Hot — would have to be added under the Sunny/Normal branch)
28
Reasons for overfitting (2)
► Data is incomplete (not all cases covered)
► We do not have enough data in some parts of the learning sample to make a good decision

(figure: a scatter of + and − examples; a sparsely populated region of the input space yields an area with probably wrong predictions)
29
How can we avoid overfitting ?
► Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the learning sample
► Post-pruning: allow the tree to overfit and then post-prune the tree
► Ensemble methods (this afternoon)
30
Pre-pruning
► Stop splitting a node if:
  • The number of objects is too small
  • The impurity is low enough
  • The best test is not statistically significant (according to some statistical test)
► Problem:
  • The optimum value of the parameter (n, Ith, significance level) is problem dependent
  • We may miss the optimum
31
Post-pruning (1)
► Split the learning sample LS into two sets:
  • A growing sample GS to build the tree
  • A validation sample VS to evaluate its generalization error
► Build a complete tree from GS
► Compute a sequence of trees {T1, T2, …} where:
  • T1 is the complete tree
  • Ti is obtained by removing some test nodes from Ti-1
► Select the tree Ti* from the sequence that minimizes the error on VS
32
Post-pruning (2)

(figure: error versus complexity; the error on GS keeps decreasing during tree growing, while the error on VS is minimized at the optimal tree, between underfitting and overfitting; tree growing moves towards higher complexity, tree pruning moves back towards lower complexity)
33
Post-pruning (3)
► How to build the sequence of trees ?
  • Reduced error pruning:
    ► At each step, remove the node that most decreases the error on VS
  • Cost-complexity pruning:
    ► Define a cost-complexity criterion: ErrorGS(T) + α·Complexity(T)
    ► Build the sequence of trees that minimize this criterion for increasing α
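Cost-complexity pruning is available off the shelf, e.g. in scikit-learn, whose ccp_alpha parameter is exactly the α of the criterion above. A sketch, assuming scikit-learn and using its bundled iris data for illustration:

```python
# Cost-complexity pruning: build the nested tree sequence on a growing
# sample GS, then pick the tree that minimizes the error on a validation
# sample VS (assumes scikit-learn; dataset chosen only for illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_gs, X_vs, y_gs, y_vs = train_test_split(X, y, random_state=0)   # GS / VS

# Increasing alpha yields the pruning sequence T1 (full tree), T2, ...
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_gs, y_gs).ccp_alphas

trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_gs, y_gs)
         for a in alphas]
best = max(trees, key=lambda t: t.score(X_vs, y_vs))   # minimize error on VS
print(best.get_n_leaves(), best.score(X_vs, y_vs))
```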
34
Post-pruning (4)

(figure: the pruning sequence for playtennis)
T1: the full tree, with a Temperature test (no for Mild, yes for Cool and Hot) added under Sunny/Normal: ErrorGS = 0%, ErrorVS = 10%
T2: the Outlook / Humidity / Wind tree: ErrorGS = 6%, ErrorVS = 8%
T3: the Humidity subtree pruned to the leaf "no": ErrorGS = 13%, ErrorVS = 15%
T4: the Wind subtree also pruned to the leaf "yes": ErrorGS = 27%, ErrorVS = 25%
T5: a single leaf "yes": ErrorGS = 33%, ErrorVS = 35%
T2 minimizes the error on VS and is selected.
35
Post-pruning (5)
► Problem: requires dedicating one part of the learning sample as a validation set → may be a problem in the case of a small database
► Solution: N-fold cross-validation
  • Split the training set into N parts (often 10)
  • Generate N trees, each leaving out one part among the N
  • Make a prediction for each learning object with the (only) tree built without this case
  • Estimate the error of these predictions
► May be combined with pruning
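The procedure above can be sketched with scikit-learn's cross_val_score, which builds the N trees and scores each held-out part (the dataset is chosen only for illustration):

```python
# 10-fold cross-validation estimate of a decision tree's accuracy
# (assumes scikit-learn; 1 - mean accuracy is the estimated error).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())   # estimated accuracy over the 10 held-out parts
```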
36
How to use decision trees ?
► Large datasets (ideal case):
  • Split the dataset into three parts: GS, VS, TS
  • Grow a tree from GS
  • Post-prune it using VS
  • Test it on TS
► Small datasets (often):
  • Grow a tree from the whole database
  • Pre-prune with default parameters (risky), or post-prune it by 10-fold cross-validation (costly)
  • Estimate its accuracy by 10-fold cross-validation
37
Outline
► Supervised learning
► Tree representation
► Tree learning
► Extensions
  • Continuous attributes
  • Attributes with many values
  • Missing values
► Regression trees
► By-products
38
Continuous attributes (1)
► Example: temperature as a number instead of a discrete value
► Two solutions:
  • Pre-discretize: Cold if Temperature < 70, Mild between 70 and 75, Hot if Temperature > 75
  • Discretize during tree growing:

    Temperature
    ├─ ≤ 65.4 → no
    └─ > 65.4 → yes

► How to find the cut-point ?
39
Continuous attributes (2)

Learning sample (Temp., Play):
80 No, 85 No, 83 Yes, 75 Yes, 68 Yes, 65 No, 64 Yes, 72 No, 75 Yes, 70 Yes, 69 Yes, 72 Yes, 81 Yes, 71 No

Sort on Temp.:
64 Yes, 65 No, 68 Yes, 69 Yes, 70 Yes, 71 No, 72 No, 72 Yes, 75 Yes, 75 Yes, 80 No, 81 Yes, 83 Yes, 85 No

Score every candidate threshold (midpoint between consecutive distinct values):
Temp. < 64.5: I = 0.048
Temp. < 66.5: I = 0.010
Temp. < 68.5: I = 0.000
Temp. < 69.5: I = 0.015
Temp. < 70.5: I = 0.045
Temp. < 71.5: I = 0.001
Temp. < 73.5: I = 0.001
Temp. < 77.5: I = 0.025
Temp. < 80.5: I = 0.000
Temp. < 82: I = 0.010
Temp. < 84: I = 0.113
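The cut-point search above can be sketched directly: sort, then score each midpoint between consecutive distinct values by information gain (illustrative code using base-2 entropy):

```python
# Finding the best cut-point for the Temp. attribute above.
from collections import Counter
from math import log2

data = sorted([(80, "No"), (85, "No"), (83, "Yes"), (75, "Yes"), (68, "Yes"),
               (65, "No"), (64, "Yes"), (72, "No"), (75, "Yes"), (70, "Yes"),
               (69, "Yes"), (72, "Yes"), (81, "Yes"), (71, "No")])

def H(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

labels = [y for _, y in data]

def gain(cut):
    left = [y for t, y in data if t < cut]
    right = [y for t, y in data if t >= cut]
    return (H(labels) - len(left) / len(data) * H(left)
            - len(right) / len(data) * H(right))

values = sorted(set(t for t, _ in data))
cuts = [(v1 + v2) / 2 for v1, v2 in zip(values, values[1:])]
best = max(cuts, key=gain)
print(best, round(gain(best), 3))   # 84.0, the threshold with the largest gain
```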
40
Continuous attributes (3)

Number | A1 | A2 | Colour
1 | 0.58 | 0.75 | Red
2 | 0.78 | 0.65 | Red
3 | 0.89 | 0.23 | Green
4 | 0.12 | 0.98 | Red
5 | 0.17 | 0.26 | Green
6 | 0.50 | 0.48 | Red
7 | 0.45 | 0.16 | Green
8 | 0.80 | 0.75 | Green
… | … | … | …
100 | 0.75 | 0.13 | Green

(figure: the induced tree tests A2 < 0.33, then A1 < 0.91, A1 < 0.23, A2 < 0.91, A2 < 0.49, A2 < 0.65 and A2 < 0.75, with "good"/"bad" leaves, carving the [0,1]×[0,1] input space into rectangles)
41
Attributes with many values (1)
► Problem:
  • Not good splits: they fragment the data too quickly, leaving insufficient data at the next level
  • The reduction of impurity of such a test is often high (example: split on the object id)
► Two solutions:
  • Change the splitting criterion to penalize attributes with many values
  • Consider only binary splits (preferable)

(example: a test on an attribute Letter with one branch per value a, b, c, …, y, z)
42
Attributes with many values (2)
► Modified splitting criterion:
  • Gainratio(LS, A) = H(LS, A) / Splitinformation(LS, A)
  • Splitinformation(LS, A) = −Σa (|LSa| / |LS|) log(|LSa| / |LS|)
  • The split information is high when there are many values
► Example: outlook in the playtennis problem:
  • H(LS, outlook) = 0.246
  • Splitinformation(LS, outlook) = 1.577
  • Gainratio(LS, outlook) = 0.246/1.577 = 0.156 < 0.246
► Problem: the gain ratio favours unbalanced tests
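The outlook example can be checked numerically (a sketch with base-2 logarithms; with exact entropies the gain comes out as 0.247 rather than the slide's rounded 0.246):

```python
# Gain ratio for the outlook attribute of playtennis: information gain
# divided by the split information of the three-way split.
from math import log2

LS = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5   # outlook values
yes = {"Sunny": 2, "Overcast": 4, "Rain": 3}           # Play = Yes counts per value

def H(pos, neg):
    ps = [c / (pos + neg) for c in (pos, neg) if c > 0]
    return -sum(p * log2(p) for p in ps)

n = len(LS)
gain = H(9, 5) - sum(LS.count(v) / n * H(yes[v], LS.count(v) - yes[v])
                     for v in set(LS))
split_info = -sum(LS.count(v) / n * log2(LS.count(v) / n) for v in set(LS))
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))
# ≈ 0.247, 1.577, 0.156
```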
43
Attributes with many values (3)
► Allow binary tests only:

  Letter
  ├─ {a,d,o,m,t}
  └─ all other letters

► There are 2^(N−1) possible subsets for N values
► If N is small, determination of the best subset by enumeration
► If N is large, heuristics exist (e.g. greedy approach)
44
Missing attribute values
► Not all attribute values are known for every object, when learning or when testing

Day | Outlook | Temperature | Humidity | Wind | Play Tennis
D15 | Sunny | Hot | ? | Strong | No

► Three strategies:
  • Assign the most common value in the learning sample
  • Assign the most common value at that node of the tree
  • Assign a probability to each possible value
45
Regression trees (1)
► Tree for regression: exactly the same model, but with a number in each leaf instead of a class

Outlook
├─ Sunny → Humidity
│   ├─ High → 22.3
│   └─ Normal → 45.6
├─ Overcast → 64.4
└─ Rain → Wind
    ├─ Strong → 7.4
    └─ Weak → Temperature
        ├─ < 71 → 1.2
        └─ > 71 → 3.4
46
Regression trees (2)
► A regression tree is a piecewise constant function of the input attributes

(figure: a tree testing X1 ≤ t1 at the root, then X2 ≤ t2 and X1 ≤ t3, then X2 ≤ t4, with leaf predictions r1,…,r5; it partitions the (X1, X2) plane into five rectangles, each with a constant prediction ri)
47
Regression tree growing
► To minimize the square error on the learning sample, the prediction at a leaf is the average output of the learning cases reaching that leaf
► Impurity of a sample is defined by the variance of the output in that sample:

  I(LS) = vary|LS{y} = Ey|LS{(y − Ey|LS{y})²}

► The best split is the one that most reduces the variance:

  ΔI(LS, A) = vary|LS{y} − Σa (|LSa| / |LS|) vary|LSa{y}
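The variance-reduction criterion above can be sketched directly (illustrative code; the sample is invented, with outputs clustering around 1 below the cut and around 10 above it):

```python
# Variance-based splitting for regression trees, as defined above.
def var(ys):
    m = sum(ys) / len(ys)                 # the leaf prediction is this average
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(pairs, cut):
    """pairs = (attribute value, output); score the split 'attribute < cut'."""
    ys = [y for _, y in pairs]
    left = [y for a, y in pairs if a < cut]
    right = [y for a, y in pairs if a >= cut]
    return var(ys) - len(left) / len(ys) * var(left) - len(right) / len(ys) * var(right)

sample = [(1, 1.0), (2, 1.2), (3, 0.9), (7, 10.1), (8, 9.8), (9, 10.0)]
print(variance_reduction(sample, 5))   # the cut at 5 removes almost all the variance
```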
48
Regression tree pruning
► Exactly the same algorithms apply: pre-pruning and post-pruning.
► In post-pruning, the tree that minimizes the squared error on VS is selected.
► In practice, pruning is even more important in regression because full trees are much more complex (often all objects have different output values, and hence the full tree has as many leaves as there are objects in the learning sample)
49
Outline
► Supervised learning
► Tree representation
► Tree learning
► Extensions
► Regression trees
► By-products
  • Interpretability
  • Variable selection
  • Variable importance
50
Interpretability (1)
► Obvious

(figure: the playtennis tree can be read off directly, while a neural network taking Outlook, Humidity, Wind and Temperature as inputs and outputting Play / Don't play is a black box)
51
Interpretability (2)
► A tree may be converted into a set of rules:
  • If (outlook = sunny) and (humidity = high) then PlayTennis = No
  • If (outlook = sunny) and (humidity = normal) then PlayTennis = Yes
  • If (outlook = overcast) then PlayTennis = Yes
  • If (outlook = rain) and (wind = strong) then PlayTennis = No
  • If (outlook = rain) and (wind = weak) then PlayTennis = Yes
52
Attribute selection
► If some attributes are not useful for classification, they will not be selected in the (pruned) tree
► Of practical importance if measuring the value of an attribute is costly (e.g. medical diagnosis)
► Decision trees are often used as a pre-processing step for other learning algorithms that suffer more when there are irrelevant variables
53
Variable importance
► In many applications, all variables do not contribute equally to predicting the output.
► We can evaluate variable importance by computing the total reduction of impurity brought by each variable:

  Imp(A) = Σ(nodes where A is tested) |LSnode| · ΔI(LSnode, A)

(figure: for playtennis, the importance ranking is Outlook, Humidity, Wind, Temperature)
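In practice this quantity is available directly from tree libraries; for example, scikit-learn's feature_importances_ is a normalized variant of the Imp(A) formula above (scaled so the importances sum to 1). A sketch, with a dataset chosen only for illustration:

```python
# Impurity-based variable importance (assumes scikit-learn; its
# feature_importances_ is the total impurity reduction per feature,
# normalized to sum to 1).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)
for name, imp in sorted(zip(data.feature_names, tree.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")
```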
54
When are decision trees useful ?
► Advantages:
  • Very fast: can handle very large datasets with many attributes (complexity O(n·N·log N))
  • Flexible: several attribute types, classification and regression problems, missing values…
  • Interpretability: provide rules and attribute importance
► Disadvantages:
  • Instability of the trees (high variance)
  • Not always competitive with other algorithms in terms of accuracy
55
Further extensions and research
► Costs and unbalanced learning samples
► Oblique trees (tests of the form Σi ci·Ai < ath)
► Using predictive models in leaves (e.g. linear regression)
► Induction graphs
► Fuzzy decision trees (from a crisp partition to a fuzzy partition of the learning sample)
56
Demo
► Illustration with Pepito on two datasets:
  • titanic: http://www.cs.toronto.edu/~delve/data/titanic/desc.html
  • splice junction: http://www.cs.toronto.edu/~delve/data/splice/desc.html
57
References
► About tree algorithms:
  • Classification and regression trees, L. Breiman et al., Wadsworth, 1984
  • C4.5: programs for machine learning, J. R. Quinlan, Morgan Kaufmann, 1993
  • Graphes d'induction, D. Zighed and R. Rakotomalala, Hermes, 2000
► More general textbooks:
  • Artificial intelligence: a modern approach, S. Russell and P. Norvig, Prentice Hall, 2003
  • The elements of statistical learning, T. Hastie et al., Springer, 2001
  • Pattern classification, R. O. Duda et al., John Wiley and Sons, 2001
58
Software
► In R: packages tree and rpart
► C4.5: http://www.cse.unsw.edu.au/~quinlan
► Java applet: http://www.montefiore.ulg.ac.be/~geurts/
► Pepito: http://www.pepite.be
► Weka: http://www.cs.waikato.ac.nz/ml/weka