Decision Trees Learning
Outline
Decision tree representation
ID3 learning algorithm
Entropy, information gain
Overfitting
Decision Tree for PlayTennis
Attributes and their values:
Outlook: Sunny, Overcast, Rain
Humidity: High, Normal
Wind: Strong, Weak
Temperature: Hot, Mild, Cool
Target concept - Play Tennis: Yes, No
Decision Tree for PlayTennis
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Decision Tree for PlayTennis
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → ...
  Rain → ...
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
Decision Tree for PlayTennis
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes

New instance to classify:
Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ?
(Outlook=Sunny → Humidity=High → No)
Decision Tree for Conjunction
Outlook=Sunny ∧ Wind=Weak

Outlook
  Sunny → Wind
    Strong → No
    Weak → Yes
  Overcast → No
  Rain → No
Decision Tree for Disjunction
Outlook=Sunny ∨ Wind=Weak

Outlook
  Sunny → Yes
  Overcast → Wind
    Strong → No
    Weak → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Decision Tree for XOR
Outlook=Sunny XOR Wind=Weak

Outlook
  Sunny → Wind
    Strong → Yes
    Weak → No
  Overcast → Wind
    Strong → No
    Weak → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Decision Tree
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes

Decision trees represent disjunctions of conjunctions:
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
When to consider Decision Trees
Instances describable by attribute-value pairs
  e.g. Humidity: High, Normal
Target function is discrete valued
  e.g. PlayTennis: Yes, No
Disjunctive hypothesis may be required
  e.g. Outlook=Sunny ∨ Wind=Weak
Possibly noisy training data
Missing attribute values
Examples:
  Medical diagnosis
  Credit risk analysis
  Object classification for robot manipulator (Tan 1993)
Top-Down Induction of Decision Trees ID3
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute) stop, else iterate over the new leaf nodes.
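A minimal Python sketch of these five steps (the representation is assumed, not from the lecture: examples are dicts with a "label" key, the tree is a nested dict, and the helper names are illustrative):

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy of the class labels of a set of examples."""
    counts = Counter(ex["label"] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Expected reduction in entropy from splitting on `attribute`."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    """Recursively build a decision tree (nested dicts; leaves are class labels)."""
    labels = [ex["label"] for ex in examples]
    if len(set(labels)) == 1:                 # step 5: perfectly classified -> stop
        return labels[0]
    if not attributes:                        # no attributes left -> majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))    # step 1
    tree = {best: {}}                         # step 2: best attribute labels the node
    for value in {ex[best] for ex in examples}:            # step 3: one branch per value
        subset = [ex for ex in examples if ex[best] == value]    # step 4: sort examples
        tree[best][value] = id3(subset, [a for a in attributes if a != best])
    return tree
```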
Which Attribute is best?
A1=? splits [29+, 35-]:
  True  → [21+, 5-]
  False → [8+, 30-]

A2=? splits [29+, 35-]:
  True  → [18+, 33-]
  False → [11+, 2-]
Entropy
S is a sample of training examples
p+ is the proportion of positive examples
p- is the proportion of negative examples
Entropy measures the impurity of S
Entropy(S) = -p+ log2 p+ - p- log2 p-
Entropy
Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)

Information theory: the optimal length code assigns -log2 p bits to a message having probability p.

So the expected number of bits to encode (+ or -) of a random member of S is:
-p+ log2(p+) - p- log2(p-)
(Note: 0·log2 0 = 0)
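A quick numeric check of this formula in Python (the [9+,5-] sample anticipates the PlayTennis data used on the later slides):

```python
import math

def entropy(p_pos):
    """Entropy of a boolean sample with proportion p_pos of positive examples."""
    if p_pos in (0.0, 1.0):
        return 0.0                      # convention: 0 * log2(0) = 0
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

print(entropy(9 / 14))   # sample [9+,5-]      -> ~0.940
print(entropy(0.5))      # maximally impure    -> 1.0
print(entropy(1.0))      # pure sample         -> 0.0
```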
Information Gain
Gain(S,A): expected reduction in entropy due to sorting S on attribute A

Gain(S,A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

A1=? splits [29+, 35-] into True: [21+, 5-], False: [8+, 30-]
A2=? splits [29+, 35-] into True: [18+, 33-], False: [11+, 2-]

Entropy([29+,35-]) = -29/64 log2(29/64) - 35/64 log2(35/64) = 0.99
Information Gain
A1=? splits [29+, 35-] into True: [21+, 5-], False: [8+, 30-]
Entropy([21+,5-]) = 0.71
Entropy([8+,30-]) = 0.74
Gain(S,A1) = Entropy(S) - 26/64 · Entropy([21+,5-]) - 38/64 · Entropy([8+,30-]) = 0.27

A2=? splits [29+, 35-] into True: [18+, 33-], False: [11+, 2-]
Entropy([18+,33-]) = 0.94
Entropy([11+,2-]) = 0.62
Gain(S,A2) = Entropy(S) - 51/64 · Entropy([18+,33-]) - 13/64 · Entropy([11+,2-]) = 0.12
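The A1/A2 comparison above can be reproduced in a few lines of Python working directly on the (positive, negative) counts (helper names are illustrative):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:                       # convention: 0 * log2(0) = 0
            p = count / total
            h -= p * math.log2(p)
    return h

def gain(parent, children):
    """parent and children are (pos, neg) counts; children come from one attribute."""
    total = sum(p + n for p, n in children)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - remainder

print(gain((29, 35), [(21, 5), (8, 30)]))    # A1 -> ~0.27
print(gain((29, 35), [(18, 33), (11, 2)]))   # A2 -> ~0.12
```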
Training Examples
Day  Outlook   Temp.  Humidity  Wind    PlayTennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
Selecting the Next Attribute
S = [9+, 5-], E = 0.940

Humidity
  High   → [3+, 4-], E = 0.985
  Normal → [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151

Wind
  Weak   → [6+, 2-], E = 0.811
  Strong → [3+, 3-], E = 1.0
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048

Humidity provides greater information gain than Wind, w.r.t. the target classification.
Selecting the Next Attribute
S = [9+, 5-], E = 0.940

Outlook
  Sunny    → [2+, 3-], E = 0.971
  Overcast → [4+, 0-], E = 0.0
  Rain     → [3+, 2-], E = 0.971
Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247
Selecting the Next Attribute
S = [9+, 5-], E = 0.940

Temperature
  Hot  → [2+, 2-], E = 1.0
  Mild → [4+, 2-], E = 0.918
  Cool → [3+, 1-], E = 0.811
Gain(S, Temperature) = 0.940 - (4/14)·1.0 - (6/14)·0.918 - (4/14)·0.811 = 0.029
Selecting the Next Attribute
The information gain values for the four attributes are:
  Gain(S, Outlook)     = 0.247
  Gain(S, Humidity)    = 0.151
  Gain(S, Wind)        = 0.048
  Gain(S, Temperature) = 0.029
where S denotes the collection of training examples.
Note: 0·log2 0 = 0
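These four values can be re-derived from the training table on slide 18; a small script, under the assumption that the data is kept as plain tuples:

```python
import math
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, col):
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for name, col in ATTRS.items():
    print(name, round(gain(DATA, col), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (matches the slide values up to rounding; the slide rounds Humidity to 0.151)
```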
ID3 Algorithm
S = [D1, D2, ..., D14], [9+, 5-]

Outlook
  Sunny    → S_Sunny = [D1, D2, D8, D9, D11], [2+, 3-]  → ?
  Overcast → [D3, D7, D12, D13], [4+, 0-]  → Yes
  Rain     → [D4, D5, D6, D10, D14], [3+, 2-]  → ?

Which attribute should test the S_Sunny node?
  Gain(S_Sunny, Humidity)    = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
  Gain(S_Sunny, Temperature) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
  Gain(S_Sunny, Wind)        = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019
Humidity has the highest gain, so it becomes the test for this node.
ID3 Algorithm
Outlook
  Sunny → Humidity
    High   → No   [D1, D2, D8]
    Normal → Yes  [D9, D11]
  Overcast → Yes  [D3, D7, D12, D13]
  Rain → Wind
    Strong → No   [D6, D14]
    Weak   → Yes  [D4, D5, D10]
Hypothesis Space Search ID3
[Figure: ID3's greedy search through the hypothesis space of decision trees, growing candidate trees by adding one attribute test (A1, A2, A3, A4) at a time]
Hypothesis Space Search ID3
Hypothesis space is complete!
  Target function is surely in there
Outputs a single hypothesis
No backtracking on selected attributes (greedy search)
  Local minima (suboptimal splits)
Statistically-based search choices
  Robust to noisy data
Inductive bias (search bias)
  Prefer shorter trees over longer ones
  Place high-information-gain attributes close to the root
Converting a Tree to Rules
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
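The conversion is mechanical: one rule per root-to-leaf path. A sketch, assuming the nested-dict tree representation from the earlier ID3 sketch:

```python
def tree_to_rules(tree, conditions=()):
    """Yield (conditions, label) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):                 # leaf: emit the accumulated rule
        yield list(conditions), tree
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

playtennis = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}
for conds, label in tree_to_rules(playtennis):
    body = " AND ".join(f"({a}={v})" for a, v in conds)
    print(f"If {body} Then PlayTennis={label}")    # prints R1..R5 above
```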
Continuous Valued Attributes
Create a discrete attribute to test the continuous one:
  Temperature = 24.5°C
  (Temperature > 20.0°C) ∈ {true, false}
Where to set the threshold?

Temperature  15°C  18°C  19°C  22°C  24°C  27°C
PlayTennis   No    No    Yes   Yes   Yes   No

(see [Fayyad, Irani 1993])
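One common heuristic (an assumption here, not necessarily the lecture's choice) is to sort by the continuous attribute and take candidate thresholds midway between adjacent values where the class label changes:

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps = [15, 18, 19, 22, 24, 27]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, play))   # [18.5, 25.5]
```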
Attributes with many Values
Problem: if an attribute has many values, maximizing InformationGain will select it.
  E.g.: imagine using Date=12.7.1996 as an attribute: it perfectly splits the data into subsets of size 1.

Use GainRatio instead of information gain as the criterion:
  GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
  SplitInformation(S,A) = - Σ_{i=1..c} (|Si| / |S|) · log2(|Si| / |S|)
where Si is the subset of S for which attribute A has value vi.
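A sketch of the two quantities in Python; the subset sizes are passed in directly, and the sample numbers reuse the Outlook gain from the earlier slides:

```python
import math

def split_information(subset_sizes):
    """SplitInformation(S,A) = -sum_i |Si|/|S| * log2(|Si|/|S|)."""
    total = sum(subset_sizes)
    return -sum(s / total * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

# Splitting the 14 PlayTennis examples on Outlook (subsets of size 5, 4, 5)
# versus on a Date-like attribute that puts every example in its own subset:
print(gain_ratio(0.247, [5, 4, 5]))    # ~0.157
print(split_information([1] * 14))     # log2(14) ~ 3.81 -> the gain is heavily penalised
```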
Attributes with Cost
Consider:
  Medical diagnosis: a blood test costs 1000 SEK
  Robotics: width_from_one_feet has cost 23 secs.
How to learn a consistent tree with low expected cost?
Replace Gain by:
  Gain²(S,A) / Cost(A)   [Tan, Schlimmer 1990]
  (2^Gain(S,A) - 1) / (Cost(A) + 1)^w, where w ∈ [0,1]   [Nunez 1988]
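Both measures simply re-rank the candidate attributes; a sketch with made-up gains and costs to show the effect:

```python
def tan_schlimmer(gain, cost):
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):              # w in [0,1] controls the cost penalty
    return (2 ** gain - 1) / (cost + 1) ** w

# Illustrative numbers only: an expensive test with slightly higher gain
# versus a cheap one with slightly lower gain.
attributes = {"BloodTest": (0.30, 1000.0), "Temperature": (0.25, 1.0)}
for name, (g, c) in attributes.items():
    print(name, tan_schlimmer(g, c), nunez(g, c))
# The cheap attribute scores higher under both criteria despite its lower gain.
```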
Unknown Attribute Values
What if examples are missing values of A?
Use the training example anyway; sort it through the tree:
  If node n tests A, assign the most common value of A among the other examples sorted to node n
  Or assign the most common value of A among the other examples with the same target value
  Or assign probability pi to each possible value vi of A, and assign fraction pi of the example to each descendant in the tree
Classify new examples in the same fashion
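A sketch of the first strategy (substitute the most common observed value at the node); the dict-based examples and the helper name are assumptions:

```python
from collections import Counter

def fill_missing(examples, attribute):
    """Replace None values of `attribute` with its most common observed value."""
    observed = [ex[attribute] for ex in examples if ex[attribute] is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [{**ex, attribute: most_common} if ex[attribute] is None else ex
            for ex in examples]

node_examples = [{"Humidity": "High"}, {"Humidity": None},
                 {"Humidity": "High"}, {"Humidity": "Normal"}]
print(fill_missing(node_examples, "Humidity"))   # the None entry becomes "High"
```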
Occam's Razor
Prefer shorter hypotheses
Why prefer short hypotheses?
Argument in favor:
Fewer short hypotheses than long hypotheses
A short hypothesis that fits the data is unlikely to be a coincidence
A long hypothesis that fits the data might be a coincidence
Argument opposed:
There are many ways to define small sets of hypotheses
What is so special about small sets based on the size of the hypothesis?
Overfitting
Consider the error of hypothesis h over
  the training data: error_train(h)
  the entire distribution D of data: error_D(h)
Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
  error_train(h) < error_train(h')
  and error_D(h) > error_D(h')
Overfitting in Decision Tree Learning
Avoid Overfitting
How can we avoid overfitting?
Stop growing when the data split is not statistically significant
Grow the full tree, then post-prune
Reduced-Error Pruning
Split data into training and validation sets
Do until further pruning is harmful:
  1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  2. Greedily remove the one that most improves validation set accuracy
Produces the smallest version of the most accurate subtree
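A sketch of this loop over the nested-dict trees from the earlier ID3 sketch. For brevity, the pruned node is labelled with the majority class of the whole training set rather than of the examples sorted to that node, so treat it as illustrative rather than the lecture's exact procedure:

```python
import copy
from collections import Counter

def classify(tree, example):
    while isinstance(tree, dict):
        (attribute, branches), = tree.items()
        tree = branches[example[attribute]]
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, ex) == ex["label"] for ex in examples) / len(examples)

def majority_label(examples):
    return Counter(ex["label"] for ex in examples).most_common(1)[0][0]

def prune_candidates(tree, path=()):
    """Yield the path to every internal node (each could be turned into a leaf)."""
    if isinstance(tree, dict):
        yield path
        (attribute, branches), = tree.items()
        for value, subtree in branches.items():
            yield from prune_candidates(subtree, path + ((attribute, value),))

def pruned_copy(tree, path, leaf_label):
    """Copy of `tree` with the node at `path` replaced by a leaf."""
    if not path:
        return leaf_label
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for attribute, value in path[:-1]:
        node = node[attribute][value]
    attribute, value = path[-1]
    node[attribute][value] = leaf_label
    return new_tree

def reduced_error_prune(tree, train, validation):
    while True:
        best_candidate, best_acc = None, accuracy(tree, validation)
        for path in prune_candidates(tree):
            candidate = pruned_copy(tree, path, majority_label(train))
            acc = accuracy(candidate, validation)
            if acc >= best_acc:            # pruning helps (or at least does no harm)
                best_candidate, best_acc = candidate, acc
        if best_candidate is None:         # every possible pruning is harmful: stop
            return tree
        tree = best_candidate              # greedily apply the best pruning found
```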
Reduced-Error Pruning
Split data into training and validation sets.
Pruning a decision node d consists of:
  1. removing the subtree rooted at d,
  2. making d a leaf node,
  3. assigning d the most common classification of the training instances associated with d.
Do until further pruning is harmful:
  1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
  2. Greedily remove the one that most improves validation set accuracy.

Outlook
  sunny → Humidity
    high → no
    normal → yes
  overcast → yes
  rainy → Windy
    false → yes
    true → no
Effect of Reduced-Error Pruning
Rule Post-Pruning
Infer the decision tree from the training set, allowing overfitting
Convert the tree into an equivalent set of rules
Prune each rule by removing preconditions whose removal improves its estimated accuracy
Sort the pruned rules by estimated accuracy and consider them in this order when classifying
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes

If (Outlook = Sunny) ∧ (Humidity = High) Then (PlayTennis = No)
Why convert the decision tree to rules before pruning?
Allows distinguishing among the different contexts in which a decision node is used
Removes the distinction between attribute tests near the root and those near the leaves
Enhances readability
Evaluation
Training accuracy
  How many training instances can be correctly classified based on the available data?
  Is high when the tree is deep/large, or when there is little conflict among the training instances.
  However, higher training accuracy does not mean good generalization.
Testing accuracy
  Given a number of new instances, how many of them can we correctly classify?
  Cross validation
Strengths
Can generate understandable rules
Perform classification without much computation
Can handle continuous and categorical variables
Provide a clear indication of which fields are most important for prediction or classification
Weakness
Not suitable for predicting continuous attributes.
Perform poorly with many classes and little data.
Computationally expensive to train:
  At each node, each candidate splitting field must be sorted before its best split can be found.
  In some algorithms, combinations of fields are used and a search must be made for optimal combining weights.
  Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
Do not handle non-rectangular regions well.
Cross-Validation
Estimate the accuracy of a hypothesis induced by
a supervised learning algorithm
Predict the accuracy of a hypothesis over future
unseen instances
Select the optimal hypothesis from a given set of
alternative hypotheses
Pruning decision trees
Model selection
Feature selection
Combining multiple classifiers (boosting)
Holdout Method
Partition the data set D = {(v1,y1),...,(vn,yn)} into a training set Dt and a validation set Dh = D \ Dt

acc_h = (1/h) Σ_{(vi,yi) ∈ Dh} H( I(Dt, vi), yi )

where I(Dt, vi) is the output, on instance vi, of the hypothesis induced by learner I trained on data Dt, and H(i,j) = 1 if i = j and 0 otherwise.

Problems:
  makes insufficient use of the data
  training and validation set are correlated
Cross-Validation
k-fold cross-validation splits the data set D into k mutually exclusive subsets D1, D2, ..., Dk
Train and test the learning algorithm k times; each time it is trained on D \ Di and tested on Di

  Fold 1: train on D2, D3, D4; test on D1
  Fold 2: train on D1, D3, D4; test on D2
  Fold 3: train on D1, D2, D4; test on D3
  Fold 4: train on D1, D2, D3; test on D4

acc_cv = (1/n) Σ_{(vi,yi) ∈ D} H( I(D \ Di, vi), yi ),  where Di is the fold containing (vi, yi)
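A sketch of the k-fold loop in Python; train_fn and classify_fn stand in for the learner I and the hypothesis it induces:

```python
def k_fold_accuracy(data, k, train_fn, classify_fn):
    """data: list of (x, y) pairs; train_fn(train) -> model; classify_fn(model, x) -> label."""
    folds = [data[i::k] for i in range(k)]      # k roughly equal, disjoint subsets
    correct = 0
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        correct += sum(classify_fn(model, x) == y for x, y in test)
    return correct / len(data)

# e.g. acc = k_fold_accuracy(dataset, k=4, train_fn=my_learner, classify_fn=my_classifier)
```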
Cross-Validation
Uses all the data for training and testing
Complete k-fold cross-validation splits the data set of size m in all (m choose m/k) possible ways (choosing m/k instances out of m)
Leave-n-out cross-validation sets n instances aside for testing and uses the remaining ones for training (leave-one-out is equivalent to m-fold cross-validation, where m is the data set size)
Leave-one-out is widely used
In stratified cross-validation, the folds are stratified so that they contain approximately the same proportion of labels as the original data set
Information Content
A source emits symbols { X1, X2, ..., Xn }, each with probability p(Xi)
A message is a sequence (Xi Xj ... Xk); its information content is I(Xi Xj ... Xk)
If p(X) = 1, then I(X) = 0; if p(X) = 0, then I(X) = ∞
If p(Xi) > p(Xj), then I(Xi) < I(Xj)
If Xi, Xj, Xk are independent: I(Xi Xj Xk) = I(Xi) + I(Xj) + I(Xk)
Self-Information I(X)
I(X) = log2[ 1 / p(X) ] = -log2 p(X)   (bits)

[Example] X = (X1 X2 X3), with probabilities p(X1), p(X2), p(X3); find I(X).
I(X) = I(X1 X2 X3)
     = -log2[ p(X1 X2 X3) ]
     = -log2[ p(X1) p(X2) p(X3) ]        (the Xi are independent)
     = -log2 p(X1) - log2 p(X2) - log2 p(X3)
     = I(X1) + I(X2) + I(X3)

[Example] Message (X1 X2 X3 X4):
  Case 1: p(X1) = 1/2, p(X2) = 1/4, p(X3) = 1/8, p(X4) = 1/8
  Case 2: p(X1) = 1/4, p(X2) = 1/4, p(X3) = 1/4, p(X4) = 1/4
Case 1: I(X1X2X3X4) = I(X1) + I(X2) + I(X3) + I(X4)
                    = -log2(1/2) - log2(1/4) - log2(1/8) - log2(1/8)
                    = 1 + 2 + 3 + 3 = 9 bits
Case 2: I(X1X2X3X4) = I(X1) + I(X2) + I(X3) + I(X4)
                    = -log2(1/4) - log2(1/4) - log2(1/4) - log2(1/4)
                    = 2 + 2 + 2 + 2 = 8 bits  (= I_min)
Entropy (Average Information) H(X)

[Example] Source { X1, X2, X3, X4 } with
p(X1) = 1/2, p(X2) = 1/4, p(X3) = 1/8, p(X4) = 1/8   (compare with Hmax = log2 4 = 2 bits)

H(X) = [ -(1/2) log2(1/2) ] + [ -(1/4) log2(1/4) ] + [ -(1/8) log2(1/8) ] + [ -(1/8) log2(1/8) ]
     = 1/2 + (1/4)·2 + (1/8)·3 + (1/8)·3
     = 1/2 + 1/2 + 3/8 + 3/8
     = 1.75 bits
H(X) = Σ_{i=1}^{n} p(Xi) I(Xi) = -Σ_{i=1}^{n} p(Xi) log2 p(Xi)
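A short numeric check of this example (base-2 logarithms throughout):

```python
import math

p = [1/2, 1/4, 1/8, 1/8]
H = -sum(pi * math.log2(pi) for pi in p)
print(H)                      # 1.75 bits
print(math.log2(len(p)))      # 2.0 bits = Hmax, reached when all four symbols are equiprobable
```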