Decision Tree Learning
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 25, 2014
Example: Age, Income and Owning a flat
[Scatter plot: Age (x-axis) vs. monthly income in thousand rupees (y-axis) for the training set; each point is marked "Owns a house" or "Does not own a house", and two splitting lines L1 and L2 are drawn on the plot.]
§ If the training data were as above – could we define some simple rules by observation?
§ Any point above the line L1 → Owns a house
§ Any point to the right of L2 → Owns a house
§ Any other point → Does not own a house
Example: Age, Income and Owning a flat
[The same scatter plot, partitioned by the splitting lines L1 and L2, corresponding to the decision tree below.]
Root node: split at Income = 101
– Income ≥ 101: Label = Yes
– Income < 101: split at Age = 54
  – Age ≥ 54: Label = Yes
  – Age < 54: Label = No
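The small tree above is just a pair of nested threshold tests. As a minimal sketch (the function name is made up for illustration; the thresholds 101 and 54 are the ones from the slide, not learned here), it can be written as:

```python
def owns_house(age, monthly_income):
    """Classify one person with the hand-built tree from the example.

    monthly_income is in thousand rupees; the thresholds 101 and 54
    come from the slide's example tree.
    """
    if monthly_income >= 101:   # root split on income
        return "Yes"
    if age >= 54:               # second split on age
        return "Yes"
    return "No"

# Example: a 40-year-old earning 120 thousand rupees a month
print(owns_house(age=40, monthly_income=120))  # -> "Yes"
```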
In general, the data won't be as clean as above.
Example: Age, Income and Owning a flat
[Scatter plot: Age vs. monthly income (thousand rupees) for the training set; points marked "Owns a house" or "Does not own a house".]
§ Approach: recursively split the data into partitions so that each partition becomes purer, until …
§ How to decide the split? How to measure purity? When to stop?
Approach for Splitting
§ What are the possible lines for splitting?
– For each variable, the midpoints between pairs of consecutive values of that variable
– How many? If N = number of points in the training set and m = number of variables, about O(N × m) candidate splits (see the sketch below)
§ How to choose which line to use for splitting?
– The line which reduces impurity (≈ heterogeneity of composition) the most
§ How to measure impurity?
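A minimal sketch of candidate-split generation under these assumptions (the helper name is mine; values are sorted per variable and the midpoints of consecutive distinct values are taken):

```python
def candidate_splits(values):
    """Midpoints between consecutive distinct values of one variable."""
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

ages = [25, 30, 45, 58, 60]
print(candidate_splits(ages))  # -> [27.5, 37.5, 51.5, 59.0]
```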
Gini Index for Measuring Impurity
§ Suppose there are C classes
§ Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t
§ Gini index:
Gini(t) = 1 − Σ_{i=1}^{C} p(i|t)²
§ If all observations in t belong to one single class Gini(t) = 0
§ When is Gini(t) maximum?
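As a quick illustration (a minimal sketch; the helper name and the use of a plain label list are mine, not from the slides), the Gini impurity can be computed from the class labels that fall into a node:

```python
from collections import Counter

def gini(labels):
    """Gini(t) = 1 - sum_i p(i|t)^2, computed over the labels in node t."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["Y", "Y", "Y", "Y"]))  # pure node              -> 0.0
print(gini(["Y", "Y", "N", "N"]))  # 50/50 mix of 2 classes -> 0.5 (the maximum for C = 2)
```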
Entropy
§ Average amount of information contained
§ From another point of view – the average amount of information expected, hence the amount of uncertainty
– We will study this in more detail later
§ Entropy:
Entropy(t) = − Σ_{i=1}^{C} p(i|t) · log₂ p(i|t)
where 0 · log₂ 0 is defined to be 0
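A matching sketch (hypothetical helper name, same counting idea as the Gini example):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(t) = -sum_i p(i|t) * log2 p(i|t); classes with p = 0 contribute 0."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["Y", "Y", "Y", "Y"]))  # pure node -> 0.0
print(entropy(["Y", "Y", "N", "N"]))  # 50/50 mix -> 1.0 bit (the maximum for C = 2)
```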
Classification Error
§ What if we stop the tree building at a node?
– That is, do not create any further branches for that node
– Make that node a leaf
– Classify the node with the most frequent class present in the node
§ Classification error as a measure of impurity:
[Figure: this rectangle (node) is still impure.]
ClassificationError(t) = 1 − max_i p(i|t)
§ Intuitively – the fraction of observations in the node that do not belong to its most frequent class
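The corresponding sketch (hypothetical helper, same style as the earlier ones):

```python
from collections import Counter

def classification_error(labels):
    """ClassificationError(t) = 1 - max_i p(i|t): the fraction misclassified
    if the node is labelled with its most frequent class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - max(Counter(labels).values()) / n

print(classification_error(["Y", "Y", "Y", "N"]))  # -> 0.25
```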
The Full Blown Tree
§ Recursive splitting
§ Suppose we don't stop until all nodes are pure
§ A large decision tree with leaf nodes having very few data points
– Does not represent the classes well
– Overfitting
§ Solution:
– Stop earlier, or
– Prune back the tree
[Tree diagram: the number of points per node shrinks from 1000 at the root (400 and 600, then 200, 200, 240, 160, …) down to leaves with only 2, 1 or 5 points – statistically not significant.]
Prune back
§ Pruning step: collapse leaf nodes and make their immediate parent a leaf node
§ Effect of pruning
– Lose purity of nodes
– But were they really pure, or was that just noise?
– Too many nodes ≈ noise
§ Trade-off between the loss of purity and the reduction in complexity
[Diagram: a decision node (freq = 7) with two leaves – label Y, freq = 5 and label B, freq = 2 – is pruned into a single leaf with label Y, freq = 7.]
Prune back: cost complexity
§ Cost complexity of a (sub)tree: the classification error (based on training data) plus a penalty for the size of the tree
[Same pruning diagram as on the previous slide.]
Tradeoff(T) = Err(T) + α · L(T)
§ Err(T) is the classification error
§ L(T) = number of leaves in T
§ Penalty factor α is between 0 and 1
– If α = 0, there is no penalty for a bigger tree
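A minimal sketch of this trade-off (the function and the numbers are hypothetical; in practice Err(T) would be measured on the training data):

```python
def cost_complexity(err, n_leaves, alpha):
    """Tradeoff(T) = Err(T) + alpha * L(T): training error plus a size penalty."""
    return err + alpha * n_leaves

# A pruned tree can win by trading a slightly higher error for fewer leaves.
print(cost_complexity(err=0.10, n_leaves=12, alpha=0.01))  # full tree   -> ~0.22
print(cost_complexity(err=0.12, n_leaves=5,  alpha=0.01))  # pruned tree -> ~0.17
```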
Different Decision Tree Algorithms
§ Chi-square Automatic Interaction Detector (CHAID)
– Gordon Kass (1980)
– Stop subtree creation if not statistically significant by the chi-square test
§ Classification and Regression Trees (CART) – Breiman et al.
– Decision tree building by the Gini index
§ Iterative Dichotomizer 3 (ID3) – Ross Quinlan (1986)
– Splitting by information gain (difference in entropy)
§ C4.5 – Quinlan's next algorithm, improved over ID3
– Bottom-up pruning, both categorical and continuous variables
– Handling of incomplete data points
§ C5.0 – Ross Quinlan's commercial version
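For a hands-on reference, scikit-learn's DecisionTreeClassifier is a CART-style implementation; the sketch below (toy data made up for illustration, and scikit-learn must be installed) shows the Gini/entropy criteria and the cost-complexity pruning penalty discussed above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy version of the age/income example (made-up numbers):
# features = [age, monthly income in thousand rupees], label = owns a house?
X = [[25, 40], [30, 150], [60, 60], [45, 90], [58, 120], [22, 30], [65, 45], [35, 200]]
y = ["No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"]

clf = DecisionTreeClassifier(
    criterion="gini",  # or "entropy" for ID3/C4.5-style information gain
    ccp_alpha=0.01,    # cost-complexity pruning penalty, as in Err(T) + alpha * L(T)
)
clf.fit(X, y)

print(export_text(clf, feature_names=["age", "income"]))  # show the learned splits
print(clf.predict([[40, 120]]))                           # classify a new person
```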
Properties of Decision Trees
§ Non-parametric approach
– Does not require any prior assumptions about the probability distribution of the classes and attributes
§ Finding an optimal decision tree is an NP-complete problem
– Heuristics used: greedy, recursive partitioning, top-down construction, bottom-up pruning
§ Fast to generate, fast to classify
§ Easy to interpret or visualize
§ Error propagation
– An error at the top of the tree propagates all the way down
References
§ Introduction to Data Mining, by Tan, Steinbach, Kumar
– Chapter 4 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf