Decision Tree Learning
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 25, 2014
Example: Age, Income and Owning a flat
[Scatter plot: Age (x-axis) vs. monthly income in thousand rupees (y-axis) for the training set; each point is marked "Owns a house" or "Does not own a house", and two splitting lines L1 and L2 are drawn on the plot.]
§ If the training data were as above – could we define some simple rules by observation?
§ Any point above the line L1 → Owns a house
§ Any point to the right of L2 → Owns a house
§ Any other point → Does not own a house
Example: Age, Income and Owning a flat
[The same scatter plot, partitioned by the splitting lines L1 and L2, corresponding to the decision tree below.]
Root node: split at Income = 101
– Income ≥ 101: Label = Yes
– Income < 101: split at Age = 54
  – Age ≥ 54: Label = Yes
  – Age < 54: Label = No
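The small tree above is just a pair of nested threshold tests. As a minimal sketch (the function name is made up for illustration; the thresholds 101 and 54 are the ones from the slide, not learned here), it can be written as:

```python
def owns_house(age, monthly_income):
    """Classify one person with the hand-built tree from the example.

    monthly_income is in thousand rupees; the thresholds 101 and 54
    come from the slide's example tree.
    """
    if monthly_income >= 101:   # root split on income
        return "Yes"
    if age >= 54:               # second split on age
        return "Yes"
    return "No"

# Example: a 40-year-old earning 120 thousand rupees a month
print(owns_house(age=40, monthly_income=120))  # -> "Yes"
```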
In general, the data won't be as clean as above.
Example: Age, Income and Owning a flat
[Scatter plot: Age vs. monthly income (thousand rupees) for the training set; points marked "Owns a house" or "Does not own a house".]
§ Approach: recursively split the data into partitions so that each partition becomes purer, until …
§ How to decide the split? How to measure purity? When to stop?
Approach for Splitting
§ What are the possible lines for splitting?
– For each variable, the midpoints between pairs of consecutive values of that variable
– How many? If N = number of points in the training set and m = number of variables, about O(N × m) candidate splits (see the sketch below)
§ How to choose which line to use for splitting?
– The line which reduces impurity (≈ heterogeneity of composition) the most
§ How to measure impurity?
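A minimal sketch of candidate-split generation under these assumptions (the helper name is mine; values are sorted per variable and the midpoints of consecutive distinct values are taken):

```python
def candidate_splits(values):
    """Midpoints between consecutive distinct values of one variable."""
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

ages = [25, 30, 45, 58, 60]
print(candidate_splits(ages))  # -> [27.5, 37.5, 51.5, 59.0]
```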
Gini Index for Measuring Impurity
§ Suppose there are C classes
§ Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t
§ Gini index:
Gini(t) = 1 − Σ_{i=1}^{C} p(i|t)²
§ If all observations in t belong to one single class Gini(t) = 0
§ When is Gini(t) maximum?
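As a quick illustration (a minimal sketch; the helper name and the use of a plain label list are mine, not from the slides), the Gini impurity can be computed from the class labels that fall into a node:

```python
from collections import Counter

def gini(labels):
    """Gini(t) = 1 - sum_i p(i|t)^2, computed over the labels in node t."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["Y", "Y", "Y", "Y"]))  # pure node              -> 0.0
print(gini(["Y", "Y", "N", "N"]))  # 50/50 mix of 2 classes -> 0.5 (the maximum for C = 2)
```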
Entropy
§ Average amount of information contained
§ From another point of view – the average amount of information expected, hence the amount of uncertainty
– We will study this in more detail later
§ Entropy:
Entropy(t) = − Σ_{i=1}^{C} p(i|t) · log₂ p(i|t)
where 0 · log₂ 0 is defined to be 0
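A matching sketch (hypothetical helper name, same counting idea as the Gini example):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(t) = -sum_i p(i|t) * log2 p(i|t); classes with p = 0 contribute 0."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["Y", "Y", "Y", "Y"]))  # pure node -> 0.0
print(entropy(["Y", "Y", "N", "N"]))  # 50/50 mix -> 1.0 bit (the maximum for C = 2)
```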
Classification Error
§ What if we stop the tree building at a node?
– That is, do not create any further branches for that node
– Make that node a leaf
– Classify the node with the most frequent class present in the node
§ Classification error as a measure of impurity:
[Figure: this rectangle (node) is still impure.]
ClassificationError(t) = 1 − max_i p(i|t)
§ Intuitively – the fraction of observations in the node that do not belong to its most frequent class
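The corresponding sketch (hypothetical helper, same style as the earlier ones):

```python
from collections import Counter

def classification_error(labels):
    """ClassificationError(t) = 1 - max_i p(i|t): the fraction misclassified
    if the node is labelled with its most frequent class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - max(Counter(labels).values()) / n

print(classification_error(["Y", "Y", "Y", "N"]))  # -> 0.25
```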
The Full Blown Tree
§ Recursive splitting
§ Suppose we don't stop until all nodes are pure
§ A large decision tree with leaf nodes having very few data points
– Does not represent the classes well
– Overfitting
§ Solution:
– Stop earlier, or
– Prune back the tree
[Tree diagram: the number of points per node shrinks from 1000 at the root (400 and 600, then 200, 200, 240, 160, …) down to leaves with only 2, 1 or 5 points – statistically not significant.]
Prune back
§ Pruning step: collapse leaf nodes and make their immediate parent a leaf node
§ Effect of pruning
– Lose purity of nodes
– But were they really pure, or was that just noise?
– Too many nodes ≈ noise
§ Trade-off between the loss of purity and the reduction in complexity
[Diagram: a decision node (freq = 7) with two leaves – label Y, freq = 5 and label B, freq = 2 – is pruned into a single leaf with label Y, freq = 7.]
Prune back: cost complexity
§ Cost complexity of a (sub)tree: the classification error (based on training data) plus a penalty for the size of the tree
[Same pruning diagram as on the previous slide.]
Tradeoff(T) = Err(T) + α · L(T)
§ Err(T) is the classification error
§ L(T) = number of leaves in T
§ Penalty factor α is between 0 and 1
– If α = 0, there is no penalty for a bigger tree
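A minimal sketch of this trade-off (the function and the numbers are hypothetical; in practice Err(T) would be measured on the training data):

```python
def cost_complexity(err, n_leaves, alpha):
    """Tradeoff(T) = Err(T) + alpha * L(T): training error plus a size penalty."""
    return err + alpha * n_leaves

# A pruned tree can win by trading a slightly higher error for fewer leaves.
print(cost_complexity(err=0.10, n_leaves=12, alpha=0.01))  # full tree   -> ~0.22
print(cost_complexity(err=0.12, n_leaves=5,  alpha=0.01))  # pruned tree -> ~0.17
```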
Different Decision Tree Algorithms
§ Chi-square Automatic Interaction Detector (CHAID)
– Gordon Kass (1980)
– Stop subtree creation if not statistically significant by the chi-square test
§ Classification and Regression Trees (CART) – Breiman et al.
– Decision tree building by the Gini index
§ Iterative Dichotomizer 3 (ID3) – Ross Quinlan (1986)
– Splitting by information gain (difference in entropy)
§ C4.5 – Quinlan's next algorithm, improved over ID3
– Bottom-up pruning, both categorical and continuous variables
– Handling of incomplete data points
§ C5.0 – Ross Quinlan's commercial version
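For a hands-on reference, scikit-learn's DecisionTreeClassifier is a CART-style implementation; the sketch below (toy data made up for illustration, and scikit-learn must be installed) shows the Gini/entropy criteria and the cost-complexity pruning penalty discussed above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy version of the age/income example (made-up numbers):
# features = [age, monthly income in thousand rupees], label = owns a house?
X = [[25, 40], [30, 150], [60, 60], [45, 90], [58, 120], [22, 30], [65, 45], [35, 200]]
y = ["No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"]

clf = DecisionTreeClassifier(
    criterion="gini",  # or "entropy" for ID3/C4.5-style information gain
    ccp_alpha=0.01,    # cost-complexity pruning penalty, as in Err(T) + alpha * L(T)
)
clf.fit(X, y)

print(export_text(clf, feature_names=["age", "income"]))  # show the learned splits
print(clf.predict([[40, 120]]))                           # classify a new person
```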
Properties of Decision Trees
§ Non-parametric approach
– Does not require any prior assumptions about the probability distribution of the classes and attributes
§ Finding an optimal decision tree is an NP-complete problem
– Heuristics used: greedy, recursive partitioning, top-down construction, bottom-up pruning
§ Fast to generate, fast to classify
§ Easy to interpret or visualize
§ Error propagation
– An error at the top of the tree propagates all the way down
References
§ Introduction to Data Mining, by Tan, Steinbach, Kumar
– Chapter 4 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf