Basic Learning Methods: 1R, Decision Trees
Supervised Learning
● Aim: Construct a model that is able to predict the class label of a data instance.
  ◆ Classification learning
● Training / Learning
  ◆ Automatically construct the model using training data
● Testing / Operational Usage
  ◆ Use the learned model to predict the class of an unseen data instance
  ◆ Measure the performance of the model
Simplicity first
● Simple algorithms sometimes work well!
● There are many kinds of simple structure, e.g.
  ◆ One attribute does all the work
  ◆ All attributes contribute equally & independently
  ◆ A weighted linear combination might do
  ◆ Instance-based: use a few prototypes
  ◆ Use simple logical rules
● Sometimes the success of a method depends on the domain
Inferring rudimentary rules
● 1R: learns a 1-level decision tree
  ◆ i.e., rules that all test one particular attribute
● Basic version
  ◆ one branch for each value
  ◆ each branch assigns the most frequent class
● Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch
● Choose the attribute with the lowest error rate (assumes nominal attributes)
Input instances with attributes
Outlook    Temp  Humidity  Windy  Play
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No

● The Play attribute has a special role – it is the class attribute
● Learn a model to predict the outcome of the class attribute (i.e., Play)
Rule Template for 1R
Template of the knowledge (simple rule):

If <attribute> is:
  <value1>, then <class> is <outcome1>
  <value2>, then <class> is <outcome2>
  ...
Pseudo-code for 1R

For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: “missing” is treated as a separate attribute value
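To make the pseudo-code concrete, here is a minimal Python sketch of 1R. It assumes instances are represented as dictionaries mapping attribute names to values; the function name one_r and that representation are illustrative choices, not part of the original algorithm description.

```python
from collections import Counter, defaultdict

def one_r(instances, class_attr):
    """1R sketch: one rule per attribute value (its majority class);
    keep the attribute whose rule set makes the fewest errors."""
    best = None
    for attr in (a for a in instances[0] if a != class_attr):
        # Count how often each class appears for every value of this attribute.
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        # The rule assigns the most frequent class to each attribute value.
        rules = {value: classes.most_common(1)[0][0]
                 for value, classes in counts.items()}
        # Errors: instances that do not belong to their branch's majority class.
        errors = sum(sum(classes.values()) - classes.most_common(1)[0][1]
                     for classes in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (chosen attribute, {value: predicted class}, total errors)
```

On the weather data shown earlier, Outlook and Humidity tie at 4/14 errors, so this sketch simply keeps whichever of the two it examines first, mirroring the arbitrary choice noted on the "Output Solution" slide.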
Processing the Attributes
Attribute   Rules            Errors  Total errors
Outlook     Sunny → No       2/5     4/14
            Overcast → Yes   0/4
            Rainy → Yes      2/5
Temp        Hot → No         2/4     5/14
            Mild → Yes       2/6
            Cool → Yes       1/4
Humidity    High → No        3/7     4/14
            Normal → Yes     1/7
Windy       False → Yes      2/8     5/14
            True → No        3/6
Output Solution
● There are two solutions, as shown below; the final solution can be arbitrarily selected from one of them (both make 4/14 errors).
● First solution –
  If Outlook is:
    Sunny, then play is no
    Overcast, then play is yes
    Rainy, then play is yes
● Second solution –
  If Humidity is:
    High, then play is no
    Normal, then play is yes
Dealing with numeric attributes
● Discretize numeric attributes
● Divide each attribute’s range into intervals
● Sort instances according to the attribute’s values
● Place breakpoints where the class changes (majority class)
● This minimizes the total error
● Example: temperature from the weather data

Outlook    Temperature  Humidity  Windy  Play
Sunny      85           85        False  No
Sunny      80           90        True   No
Overcast   83           86        False  Yes
Rainy      75           80        False  Yes
...        ...          ...       ...    ...

Sorted temperature values and their classes, with breakpoints where the class changes:
64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No
The problem of overfitting
● This procedure is very sensitive to noise
● One instance with an incorrect class label will probably produce a separate interval
● Also: a time stamp attribute will have zero errors
● Simple solution:
  ◆ enforce a minimum number of instances in the majority class per interval (see the sketch after the example below)
● Example (with min = 3):

Breakpoints at every class change:
64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

After enforcing min = 3:
64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes  No   Yes  Yes  Yes | No   No   Yes  Yes  Yes | No   Yes  Yes  No
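The following Python sketch shows one way to implement this minimum-bucket constraint for a numeric attribute. The function name partition and the greedy interval-growing strategy are assumptions for illustration; the slides state only the constraint, not a specific algorithm.

```python
from collections import Counter

def partition(pairs, min_bucket=3):
    """Greedy discretization sketch: `pairs` is a list of (value, class)
    tuples.  An interval is closed at a class change only once it already
    contains at least `min_bucket` instances of its majority class."""
    intervals, current = [], []
    for value, cls in sorted(pairs):
        if current:
            majority, count = Counter(c for _, c in current).most_common(1)[0]
            if count >= min_bucket and cls != majority:
                intervals.append(current)   # close the current interval
                current = []
        current.append((value, cls))
    intervals.append(current)
    return intervals
```

On the temperature values above this produces the three intervals shown in the second sequence (64–70, 71–75, 80–85); merging adjacent intervals that share a majority class then leaves the single breakpoint at 77.5 used on the next slide.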
With overfitting avoidance
Attribute    Rules                     Errors  Total errors
Outlook      Sunny → No                2/5     4/14
             Overcast → Yes            0/4
             Rainy → Yes               2/5
Temperature  <= 77.5 → Yes             3/10    5/14
             > 77.5 → No*              2/4
Humidity     <= 82.5 → Yes             1/7     3/14
             > 82.5 and <= 95.5 → No   2/6
             > 95.5 → Yes              0/1
Windy        False → Yes               2/8     5/14
             True → No*                3/6

(* majority class chosen arbitrarily to break a tie)

● The final solution –
  If Humidity is:
    <= 82.5, then play is yes
    > 82.5 and <= 95.5, then play is no
    > 95.5, then play is yes
Discussion of 1R
Robert Holte, “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”, Machine Learning, 11:63-91, 1993.
● 1R was described in a paper by Holte (1993)
● Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
● Minimum number of instances was set to 6 after some experimentation
● 1R’s simple rules performed not much worse than much more complex decision trees
● Simplicity first pays off!
Decision Trees
● Found in various applications such as product recommendation

One example - Netflix
Decision Trees
● Decision tree
  ◆ A flow-chart-like tree structure
  ◆ Internal node denotes a test on an attribute
  ◆ Branch represents an outcome of the test
  ◆ Leaf nodes represent class labels or class distribution
● Use of decision tree: classifying an unknown sample
  ◆ Test the attribute values of the sample against the decision tree (see the example tree and the classification sketch below)
Decision Trees

Example decision tree:

  age?
  ├── <=30   → student?
  │            ├── no  → no
  │            └── yes → yes
  ├── 31..40 → yes
  └── >40    → credit rating?
               ├── excellent → no
               └── fair      → yes
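As a hedged illustration of how the tree is applied, the sketch below encodes the example tree as nested Python structures and walks it for one instance. The encoding and the name classify are assumptions made for this example, not part of the slides.

```python
# Internal nodes are (attribute, {branch value: subtree}); leaves are class labels.
tree = ("age?", {
    "<=30":   ("student?", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":    ("credit rating?", {"excellent": "no", "fair": "yes"}),
})

def classify(node, instance):
    """Test the instance's attribute values against the tree until a leaf is reached."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

print(classify(tree, {"age?": ">40", "credit rating?": "fair"}))  # -> yes
```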
Learning Decision Trees From Data
● Strategy: top down
● Recursive divide-and-conquer fashion
  ◆ First: select attribute for root node; create a branch for each possible attribute value
  ◆ Then: split instances into subsets, one for each branch extending from the node
  ◆ Finally: repeat recursively for each branch, using only instances that reach the branch
● Stop if all instances have the same class (a code sketch of this strategy follows below)
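A compact Python sketch of this recursive divide-and-conquer strategy is given below. Here best_attribute stands for some attribute-selection function (for example, the information-gain criterion discussed on the following slides); its signature, like the dictionary-based instances, is an assumption for illustration. The returned tree uses the same encoding as the classification sketch above.

```python
from collections import Counter

def build_tree(instances, attributes, class_attr, best_attribute):
    """Top-down induction of a decision tree (sketch)."""
    classes = [inst[class_attr] for inst in instances]
    # Stop if all instances have the same class (or no attributes remain):
    # return a leaf labelled with the majority class.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]
    # First: select the attribute for this node.
    attr = best_attribute(instances, attributes, class_attr)
    # Then: split the instances into subsets, one for each attribute value,
    # and repeat recursively using only the instances that reach each branch.
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in sorted({inst[attr] for inst in instances}):
        subset = [inst for inst in instances if inst[attr] == value]
        branches[value] = build_tree(subset, remaining, class_attr, best_attribute)
    return (attr, branches)
```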
Attribute Selection by Information Gain Computation
attribute1  attribute2  class label
high        high        yes
high        high        yes
high        high        yes
high        low         yes
high        low         yes
high        low         yes
high        low         no
low         low         no
low         low         no
low         high        no
low         high        no
low         high        no
Consider attribute1:

  attribute1   yes   no
  high          6     1
  low           0     5

Consider attribute2:

  attribute2   yes   no
  high          3     3
  low           3     3
attribute1 is better than attribute2 for classification purposes!
Which attribute to select?
Criterion for attribute selection
● Which is the best attribute?
● Want to get the smallest tree
● Heuristic: choose the attribute that produces the “purest” nodes
● Popular impurity criterion: information gain
● Information gain increases with the average purity of the subsets
● Strategy: choose the attribute that gives the greatest information gain
Computing Information
● Measure information in bits
● Given a probability distribution, the info required to predict an event is the distribution’s entropy
● Entropy gives the information required in bits (can involve fractions of bits!)
● Formula for computing the entropy:

  entropy(p1, p2, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)
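A minimal helper for this formula, assuming base-2 logarithms so that the result is in bits; the name entropy and the counts-based signature are illustrative.

```python
from math import log2

def entropy(*counts):
    """Entropy in bits of a class distribution given as counts, e.g.
    entropy(2, 3) = -(2/5)*log2(2/5) - (3/5)*log2(3/5) ≈ 0.971 bits."""
    total = sum(counts)
    # Zero counts contribute nothing (0 * log 0 is taken to be 0).
    return -sum(c / total * log2(c / total) for c in counts if c)
```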
Example: attribute Outlook

Outlook = Sunny:
  info([2,3]) = entropy(2/5, 3/5) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971 bits

Outlook = Overcast:
  info([4,0]) = entropy(1, 0) = -1 × log2(1) - 0 × log2(0) = 0 bits
  (Note: 0 × log2(0) is normally undefined; here it is taken to be 0.)

Outlook = Rainy:
  info([3,2]) = entropy(3/5, 2/5) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971 bits

Expected information for the attribute:
  info([2,3],[4,0],[3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Computing Information Gain
Information gain: information before splitting – information after splitting

gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2]) = 0.940 – 0.693 = 0.247 bits

Information gain for the attributes from the weather data:
  gain(Outlook)     = 0.247 bits
  gain(Temperature) = 0.029 bits
  gain(Humidity)    = 0.152 bits
  gain(Windy)       = 0.048 bits
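A hedged sketch of the gain computation, reusing the entropy helper above; the names info_gain and split_by and the dictionary-based instances are assumptions for illustration.

```python
from collections import Counter, defaultdict

def split_by(instances, attr):
    """Group instances by their value for the given attribute."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst[attr]].append(inst)
    return groups

def info_gain(instances, attr, class_attr):
    """gain(A) = info(before splitting) - weighted info(after splitting on A)."""
    def info(subset):
        return entropy(*Counter(inst[class_attr] for inst in subset).values())
    total = len(instances)
    after = sum(len(subset) / total * info(subset)
                for subset in split_by(instances, attr).values())
    return info(instances) - after
```

Applied to the weather data with attr = "Outlook" and class_attr = "Play", this evaluates to 0.940 - 0.693 ≈ 0.247 bits, matching the figures above.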
Continuing to split
For the Outlook = Sunny subset:

gain(Temperature) = 0.571 bits
gain(Humidity)    = 0.971 bits
gain(Windy)       = 0.020 bits
Final Decision Tree
● Note: not all leaves need to be pure; sometimes identical instances have different classes
● Splitting stops when the data can’t be split any further
Wish list for a purity measure
● Properties we require from a purity measure:
  ◆ When a node is pure, the measure should be zero
  ◆ When impurity is maximal (i.e. all classes equally likely), the measure should be maximal
  ◆ The measure should obey the multistage property (i.e. decisions can be made in several stages):

    measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])

● Entropy is the only function that satisfies all three properties!
Properties of the entropy
● The multistage property:

  entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q+r), r/(q+r))

● Simplification of computation:

  info([2,3,4]) = -(2/9) log2(2/9) - (3/9) log2(3/9) - (4/9) log2(4/9)
                = [-2 log2(2) - 3 log2(3) - 4 log2(4) + 9 log2(9)] / 9

● Note: instead of maximizing info gain we could just minimize information
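As a quick sanity check, the multistage example from the previous slide can be verified numerically with the entropy helper sketched earlier:

```python
# measure([2,3,4]) should equal measure([2,7]) + (7/9) * measure([3,4])
lhs = entropy(2, 3, 4)
rhs = entropy(2, 7) + (7 / 9) * entropy(3, 4)
assert abs(lhs - rhs) < 1e-9   # both come to about 1.53 bits
```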
Highly-branching attributes
● Problematic: attributes with a large number of values (extreme case: ID code)
● Subsets are more likely to be pure if there is a large number of values
● Information gain is biased towards choosing attributes with a large number of values
● This may result in overfitting (selection of an attribute that is non-optimal for prediction)
● Another problem: fragmentation
Weather data with ID code
ID code  Outlook   Temp  Humidity  Windy  Play
A        Sunny     Hot   High      False  No
B        Sunny     Hot   High      True   No
C        Overcast  Hot   High      False  Yes
D        Rainy     Mild  High      False  Yes
E        Rainy     Cool  Normal    False  Yes
F        Rainy     Cool  Normal    True   No
G        Overcast  Cool  Normal    True   Yes
H        Sunny     Mild  High      False  No
I        Sunny     Cool  Normal    False  Yes
J        Rainy     Mild  Normal    False  Yes
K        Sunny     Mild  Normal    True   Yes
L        Overcast  Mild  High      True   Yes
M        Overcast  Hot   Normal    False  Yes
N        Rainy     Mild  High      True   No
Tree stump for ID code attribute

Entropy of split:
  info(ID code) = info([0,1]) + info([0,1]) + ... + info([0,1]) = 0 bits

This implies that information gain is maximal for ID code (namely 0.940 bits).
Gain ratio
● Gain ratio: a modification of the information gain that reduces its bias
● Gain ratio takes number and size of branches into account when choosing an attribute
  ◆ It corrects the information gain by taking the intrinsic information of a split into account
● Intrinsic information: entropy of the distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)
Computing the gain ratio
● Example: intrinsic information for ID code

  info([1,1,...,1]) = 14 × (-(1/14) × log2(1/14)) = 3.807 bits

● Value of an attribute decreases as its intrinsic information gets larger
● Definition of gain ratio:

  gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)

● Example:

  gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246
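Continuing the earlier sketches, the gain ratio can be computed as the information gain divided by the entropy of the branch sizes. The function name gain_ratio is an assumed helper, and no guard is included for a zero intrinsic information (an attribute with a single value).

```python
from collections import Counter

def gain_ratio(instances, attr, class_attr):
    """gain_ratio(A) = gain(A) / intrinsic_info(A), where the intrinsic
    information is the entropy of how instances fall into the branches."""
    branch_sizes = Counter(inst[attr] for inst in instances).values()
    intrinsic_info = entropy(*branch_sizes)
    return info_gain(instances, attr, class_attr) / intrinsic_info
```

For the ID code attribute this comes to roughly 0.940 / 3.807 ≈ 0.25, in line with the 0.246 on this slide.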
Gain ratios for weather data
Attribute     Info   Gain                   Split info              Gain ratio
Outlook       0.693  0.940 - 0.693 = 0.247  info([5,4,5]) = 1.577   0.247 / 1.577 = 0.157
Temperature   0.911  0.940 - 0.911 = 0.029  info([4,6,4]) = 1.557   0.029 / 1.557 = 0.019
Humidity      0.788  0.940 - 0.788 = 0.152  info([7,7]) = 1.000     0.152 / 1.000 = 0.152
Windy         0.892  0.940 - 0.892 = 0.048  info([8,6]) = 0.985     0.048 / 0.985 = 0.049
More on the gain ratio
● “Outlook” still comes out top
● However: “ID code” has greater gain ratio
  ◆ Standard fix: ad hoc test to prevent splitting on that type of attribute
● Problem with gain ratio: it may overcompensate
  ◆ May choose an attribute just because its intrinsic information is very low
  ◆ Standard fix: only consider attributes with greater than average information gain
Discussion
● Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
  ◆ Gain ratio just one modification of this basic algorithm
  ◆ C4.5: deals with numeric attributes, missing values, noisy data
● There are many other attribute selection criteria! (But little difference in accuracy of result)