
Seminar Report

On

Decision Trees

Bachelor of Technology
In
Computer Science

Submitted By:

ISHAN DALMIA

Regd. No.: 0911012067

Section: CSE-A

DEPARTMENT OF COMPUTER SCIENCE
INSTITUTE OF TECHNICAL EDUCATION & RESEARCH
(Faculty of Engineering)
SIKSHA O ANUSANDHAN UNIVERSITY
(Declared u/s. 3 of the UGC Act, 1956)
BHUBANESWAR – 751 030

2013-2014

INSTITUTE OF TECHNICAL EDUCATION & RESEARCH
(Faculty of Engineering)
SIKSHA O ANUSANDHAN UNIVERSITY
(Declared u/s. 3 of the UGC Act, 1956)
Jagamohan Nagar, Jagamara, Bhubaneswar – 751030

Certificate

This is to certify that the seminar entitled “Decision Trees”, being submitted by Ishan Dalmia, bearing Registration No. 0911012067, in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Computer Science, is a bona fide work carried out at the Institute of Technical Education & Research under my supervision.

Supervisor:

Mrs. Sashikala Mishra


ACKNOWLEDGEMENT

It is a matter of great pleasure for me to submit this seminar report on “Decision Trees” as a part of the curriculum for the award of the “Bachelor of Technology” degree of Siksha O Anusandhan University, Bhubaneswar, Orissa.

I am thankful to Prof. (Dr) Binod Kumar Pattanayak, Head of Department and Associate Professor in Computer Science & Engg. department for his constant encouragement and able guidance.

I am also thankful to Ms. Sashikala Mishra and Mr. Swadhin Kumar Barisal, Assistant Professors and Seminar In-charge for the Computer Science & Engineering Department, for their valuable support.

I take this opportunity to express my deep sense of gratitude towards those who have helped me in various ways in preparing my seminar. Last but not least, I am thankful to my parents, who encouraged and inspired me with their blessings.

Ishan Dalmia

Date - 04.04.2013

Place - Bhubaneswar


ABSTRACT

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. It is one of the most widely used and practical methods for inductive inference. Decision tree learning algorithms have been used successfully in expert systems for capturing knowledge. The main task performed in these systems is to apply inductive methods to the given attribute values of an unknown object in order to determine an appropriate classification according to decision tree rules.

A decision tree is also a method for making good choices, especially decisions that involve high costs and risks. Decision trees use a graphic approach to compare competing alternatives and assign values to those alternatives by combining uncertainties, costs, and payoffs into specific numerical values. A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node on which users take actions. From this node, users split each node recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of decision and its outcome.

ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. In order to select the attribute that is most useful for classifying a given set, a metric called information gain is introduced.

Keywords: Decision tree, Information Gain, Entropy, ID3 Algorithm.

GUIDED BY: Mrs. Sashikala Mishra, Assistant Professor, ITER (Branch: CSE)

SUBMITTED BY: Ishan Dalmia, 0911012067, CSE-A


CONTENTS

Introduction

What is Decision Tree?

What is decision tree learning algorithm?

Decision Trees Inducers.

Why is Decision Tree Learning an attractive Inductive

Learning method?

ID3 (Iterative Dichotomiser 3) Algorithm

Pruning Methods

Advantages

Disadvantages

References


INTRODUCTION

1.1 What is Decision Tree?

A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.

Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node on which users take actions. From this node, users split each node recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of decision and its outcome.

1.2 What is decision tree learning algorithm?

'Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Decision tree learning is one of the most widely used and practical methods for inductive inference'.

Decision tree learning algorithms have been used successfully in expert systems for capturing knowledge. The main task performed in these systems is to apply inductive methods to the given attribute values of an unknown object in order to determine an appropriate classification according to decision tree rules.


Decision trees classify instances by traversing from the root node to a leaf node. We start at the root node of the decision tree, test the attribute specified by this node, and then move down the tree branch corresponding to the attribute's value in the given instance. This process is then repeated at the sub-tree level. A small sketch of this traversal is given below.
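As an illustration of the traversal just described, here is a minimal Python sketch. The nested-dictionary tree layout and the attribute names (Outlook, Humidity) are invented for this example and are not taken from the report.

# Hypothetical example tree: internal nodes are dicts, leaves are class labels.
example_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": "No",
    },
}

def classify(tree, instance):
    # Walk from the root, testing the attribute named at each node and following
    # the branch for the instance's value, until a leaf (a class label) is reached.
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

print(classify(example_tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # -> Yes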

Decision tree learning is best suited to problems with the following characteristics:

1. Instances are represented as attribute-value pairs.

2. The target function has discrete output values. The method easily handles instances assigned a boolean decision, such as 'true' and 'false', or 'p' (positive) and 'n' (negative).

3. The training data may contain errors. These can be dealt with using pruning techniques.

1.3. Decision Trees Inducers

1.3.1 ID3

The ID3 algorithm is considered a very simple decision tree algorithm (Quinlan, 1986). ID3 uses information gain as its splitting criterion. Growing stops when all instances belong to a single value of the target feature or when the best information gain is not greater than zero. ID3 does not apply any pruning procedure, nor does it handle numeric attributes or missing values.

1.3.2 C4.5

C4.5 is an evolution of ID3, presented by the same author (Quinlan, 1993). It uses gain ratio as its splitting criterion. The splitting ceases when the number of instances to be split is below a certain threshold. Error-based pruning is performed after the growing phase. C4.5 can handle numeric attributes.

1.3.3 CART

CART stands for Classification and Regression Trees (Breiman et al., 1984). It is characterized by the fact that it constructs binary trees, namely each internal node has exactly two outgoing edges. The splits are selected using the twoing criterion, and the obtained tree is pruned by cost-complexity pruning. When provided, CART can consider misclassification costs in the tree induction. It also enables users to provide a prior probability distribution.

1.3.4 QUEST

The QUEST (Quick, Unbiased, Efficient, Statistical Tree) algorithm supports univariate and linear combination splits (Loh and Shih, 1997). For each split, the association between each input attribute and the target attribute is computed using the ANOVA F-test or Levene's test (for ordinal and continuous attributes) or Pearson's chi-square test (for nominal attributes). If the target attribute is multinomial, two-means clustering is used to create two super-classes. The attribute that obtains the highest association with the target attribute is selected for splitting. Quadratic Discriminant Analysis (QDA) is applied to find the optimal splitting point for the input attribute. QUEST has negligible bias and yields binary decision trees. Ten-fold cross-validation is used to prune the trees.

1.4 Why is Decision Tree Learning an attractive Inductive Learning method?

'Purely inductive learning methods formulate general hypotheses by finding empirical regularities over the training examples.'


For inductive learning, decision tree learning is attractive for 3 reasons:

1. Decision trees generalize well to unobserved instances, provided the instances are described in terms of features that are correlated with the target concept.

2. The methods are computationally efficient; the computation required is roughly proportional to the number of observed training instances.

3. The resulting decision tree provides a representation of the concept that appeals to humans because it renders the classification process self-evident.


ID3 (Iterative Dichotomiser 3) Algorithm.

2.1. ID3 Basic

ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. In order to select the attribute that is most useful for classifying a given set, a metric called information gain is introduced.

To find an optimal way to classify a learning set, we need to minimize the number of questions asked (i.e. minimize the depth of the tree). Thus, we need a function that measures which questions provide the most balanced splitting; the information gain metric is such a function.

2.2 Entropy --- measuring homogeneity of a learning set

In order to define information gain precisely, we need to discuss entropy first.

First, let us assume, without loss of generality, that the resulting decision tree classifies instances into two categories, which we will call P (positive) and N (negative).

Given a set S containing these positive and negative targets, the entropy of S relative to this boolean classification is:

Entropy(S) = - P(positive) log2 P(positive) - P(negative) log2 P(negative)

P(positive): proportion of positive examples in S


P(negative): proportion of negative examples in S

For example, if S is (0.5+, 0.5-) then Entropy(S) is 1; if S is (0.67+, 0.33-) then Entropy(S) is 0.92; if S is (1.0+, 0.0-) then Entropy(S) is 0. Note that the more uniform the probability distribution, the greater its information content (entropy).
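The following small Python sketch (not part of the original report) computes this two-class entropy and reproduces the three example values; 0 * log2(0) is treated as 0.

import math

def entropy(p_positive, p_negative):
    # Entropy(S) = -p+ log2(p+) - p- log2(p-), treating 0*log2(0) as 0.
    total = 0.0
    for p in (p_positive, p_negative):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(entropy(0.5, 0.5))    # 1.0
print(entropy(0.67, 0.33))  # about 0.915, rounded to 0.92 in the text
print(entropy(1.0, 0.0))    # 0.0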

2.3 Information Gain --- measuring the expected reduction in Entropy

As mentioned before, to minimize the depth of the decision tree we need to select, at each tree node, the optimal attribute for splitting; the attribute that yields the largest entropy reduction is therefore the best choice.

We define information gain as the expected reduction in entropy obtained by splitting a decision tree node on a specified attribute.

The information gain Gain(S, A) of an attribute A, relative to a set S, is:

Gain(S, A) = Entropy(S) - Sum over the values v of A of (|Sv|/|S|) * Entropy(Sv)

where Sv is the subset of S for which attribute A has value v.
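As a concrete illustration (not from the original report), the following Python sketch computes Gain(S, A) for a data set represented as a list of dictionaries; the attribute and target names passed in are whatever names the caller's data uses.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Gain(S, A) = Entropy(S) - sum over values v of (|Sv|/|S|) * Entropy(Sv),
    # where Sv groups the rows of S by their value of attribute A.
    by_value = {}
    for r in rows:
        by_value.setdefault(r[attribute], []).append(r[target])
    total = len(rows)
    remainder = sum(len(sv) / total * entropy(sv) for sv in by_value.values())
    return entropy([r[target] for r in rows]) - remainder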

We can use this notion of gain to rank attributes and to build decision trees in which each node holds the attribute with the greatest gain among the attributes not yet considered on the path from the root.

The intention of this ordering is:

1. To create small decision trees, so that records can be classified after only a few splits.

2. To match a hoped-for minimalism in the process of decision making.

2.4 ID3 Algorithm


The ID3 algorithm begins with the original set as the root node. On each iteration, it iterates through every unused attribute of the set and calculates the entropy (or information gain) of that attribute. It then selects the attribute with the smallest entropy (or largest information gain). The set is then split by the selected attribute (e.g. age < 30 and age >= 30) to produce subsets of the data, and the algorithm recurses on each subset. When every element in a subset belongs to the same class, that subset is not recursed on any further, and the corresponding node in the decision tree becomes a terminal node labelled with the class shared by all its elements. The ID3 algorithm terminates when every subset is classified. Throughout the algorithm, the decision tree is constructed with each non-terminal node representing the selected attribute on which the data was split, and terminal nodes representing the class label of the final subset of that branch.

ID3 does not guarantee an optimal solution; it can get stuck in local optima. It uses a greedy approach, selecting the best attribute to split the dataset on at each iteration. One possible improvement is to use backtracking during the search for the optimal decision tree.

ID3 can overfit the training data; to avoid overfitting, smaller decision trees should be preferred over larger ones. The algorithm usually produces small trees, but it does not always produce the smallest possible tree.

ID3 is also harder to use on continuous data. If the values of a given attribute are continuous, there are many more places to split the data on that attribute, and searching for the best split value can be time-consuming.

ID3 (Learning Set S, Attribute Set A, Attribute Values V)

Return Decision Tree.

Begin

Load the learning set, create the decision tree root node 'rootNode', and add the learning set S to the root node as its subset.

For rootNode, first compute Entropy(rootNode.subset).

If Entropy(rootNode.subset) == 0, then rootNode.subset consists of records that all have the same value for the categorical attribute; return a leaf node with decision attribute: attribute value.

If Entropy(rootNode.subset) != 0, then compute the information gain for each remaining attribute (those not yet used in splitting) and find the attribute A with Maximum(Gain(S, A)). Create the child nodes of rootNode and add them to rootNode in the decision tree.

For each child of rootNode, apply ID3(S, A, V) recursively until a node with entropy 0 or a leaf node is reached.

End ID3.
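To complement the pseudocode, here is a minimal runnable Python sketch of the same procedure. It is an illustration, not the report's own implementation: data sets are lists of dictionaries, internal nodes are dictionaries, leaves are class labels, and ties in the majority vote are broken arbitrarily.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, attributes, target):
    # Stop on a pure subset (entropy 0) or when no attributes remain;
    # otherwise split on the attribute with maximum information gain.
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                               # pure subset -> leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]    # majority-class leaf

    def gain(attr):
        by_value = {}
        for r in rows:
            by_value.setdefault(r[attr], []).append(r[target])
        remainder = sum(len(v) / len(rows) * entropy(v) for v in by_value.values())
        return entropy(labels) - remainder

    best = max(attributes, key=gain)
    node = {"attribute": best, "branches": {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = id3(subset, remaining, target)
    return node

Run on the animal table of Example 2 below (attributes Warm-blooded, Feathers, Fur, Swims; target Lays Eggs), this sketch reproduces the tree derived there by hand: the root splits on Feathers and the Feathers = No branch splits on Warm-blooded.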

2.5 Example of ID3 Algorithm :

This example shows the construction of a decision tree where P, Q, and C are the predictive attributes and R is the classification attribute.

Line numbers are included in the table for convenient reference; the line number is not an attribute.

Line  P  Q  R  C  Number of instances
 1    Y  Y  1  Y   1
 2    Y  Y  2  N  10
 3    Y  Y  3  Y   3
 4    Y  N  1  Y   2
 5    Y  N  2  Y  11
 6    Y  N  3  Y   0
 7    N  Y  1  Y   2
 8    N  Y  2  N  20
 9    N  Y  3  Y   3
10    N  N  1  Y   1
11    N  N  2  Y  15
12    N  N  3  Y   3

Total number of instances 71

The above table is T0 (referred to simply as T in the trace below).

Create root node N1.
C1: ID3(T,R)
C2: AVG_ENTROPY(P,R,T)
C3: FREQUENCY(P,Y,T) = (1+10+3+2+11+0)/71 = 27/71 = 0.3803
C4: SUBTABLE(P,Y,T) = lines 1,2,3,4,5,6. Call this T1. Size(T1) = 27.
C5: ENTROPY(R,T1)
C6: FREQUENCY(R,1,T1) = (1+2)/27 = 0.1111
C7: FREQUENCY(R,2,T1) = (10+11)/27 = 0.7778
C8: FREQUENCY(R,3,T1) = (3+0)/27 = 0.1111
C5: Return -(0.1111 log(0.1111) + 0.7778 log(0.7778) + 0.1111 log(0.1111)) = 0.9864
C8.1: FREQUENCY(P,N,T) = (2+20+3+1+15+3)/71 = 44/71 = 0.6197
C9: SUBTABLE(P,N,T) = lines 7,8,9,10,11,12. Call this T2. Size(T2) = 44.
C10: ENTROPY(R,T2)
C11: FREQUENCY(R,1,T2) = (2+1)/44 = 0.0682
C12: FREQUENCY(R,2,T2) = (20+15)/44 = 0.7955
C13: FREQUENCY(R,3,T2) = (3+3)/44 = 0.1364
C10: Return -(0.0682 log(0.0682) + 0.7955 log(0.7955) + 0.1364 log(0.1364)) = 0.9188
C2: Return (27/71) * 0.9864 + (44/71) * 0.9188 = 0.9445


C14: AVG_ENTROPY(Q,R,T)
C15: FREQUENCY(Q,Y,T) = (1+10+3+2+20+3)/71 = 39/71 = 0.5493
C16: SUBTABLE(Q,Y,T) = lines 1,2,3,7,8,9. Call this T3. Size(T3) = 39.
C17: ENTROPY(R,T3)
C18: FREQUENCY(R,1,T3) = (1+2)/39 = 0.0769
C19: FREQUENCY(R,2,T3) = (10+20)/39 = 0.7692
C20: FREQUENCY(R,3,T3) = (3+3)/39 = 0.1538
C17: Return -(0.0769 log(0.0769) + 0.7692 log(0.7692) + 0.1538 log(0.1538)) = 0.9914
C21: FREQUENCY(Q,N,T) = (2+11+0+1+15+3)/71 = 32/71 = 0.4507
C21.1: SUBTABLE(Q,N,T) = lines 4,5,6,10,11,12. Call this T4. Size(T4) = 32.
C22: ENTROPY(R,T4)
C23: FREQUENCY(R,1,T4) = (2+1)/32 = 0.0938
C24: FREQUENCY(R,2,T4) = (11+15)/32 = 0.8125
C25: FREQUENCY(R,3,T4) = (0+3)/32 = 0.0938
C22: Return -(0.0938 log(0.0938) + 0.8125 log(0.8125) + 0.0938 log(0.0938)) = 0.8838
C14: Return (39/71) * 0.9914 + (32/71) * 0.8838 = 0.9428

From here on, the trace is abbreviated.

C26: AVG_ENTROPY(C,R,T)
C27: FREQUENCY(C,Y,T) = 41/71 = 0.5775
C28: SUBTABLE(C,Y,T) = all lines but 2 and 8. Call this T5.
C29: ENTROPY(R,T5) = -((6/41) log(6/41) + (26/41) log(26/41) + (9/41) log(9/41)) = 1.3028
C30: FREQUENCY(C,N,T) = 30/71 = 0.4225
C31: SUBTABLE(C,N,T) = lines 2 and 8. Call this T6.
C32: ENTROPY(R,T6) = -(0 log 0 + (30/30) log(30/30) + 0 log 0) = 0
C26: Return (41/71) * 1.3028 = 0.7523

ENTROPY(R,T) = -((6/71) log(6/71) + (56/71) log(56/71) + (9/71) log(9/71)) = 0.9492

Choose AS = C, since C gives the lowest average entropy (0.7523) and hence the highest information gain.

Mark N1 as split on attribute C.
C33: SUBTABLE(C,N,T) is T6 as before.


C34: ID3(T6,R)
Make a new node N2.
In all instances in T6 (lines 2 and 8), X.R = 2. Therefore, this is base case 2.
Label node N2 as "X.R = 2".
C34 returns N2 to C1.

In C1: Make an arc labelled "N" from N1 to N2.

C35: SUBTABLE(C,Y,T) is T5 as above.
C36: ID3(T5,R)
Create new node N3.
C37: AVG_ENTROPY(P,R,T5)
C38: SUBTABLE(P,Y,T5) is lines 1,3,4,5,6. Call this T7.
C39: ENTROPY(R,T7) = -((3/17) log(3/17) + (11/17) log(11/17) + (3/17) log(3/17)) = 1.2898
C40: SUBTABLE(P,N,T5) is lines 7,9,10,11,12. Call this T8.
C41: ENTROPY(R,T8) = -((3/24) log(3/24) + (15/24) log(15/24) + (6/24) log(6/24)) = 1.2988
C37: AVG_ENTROPY(P,R,T5) = (17/41) * 1.2898 + (24/41) * 1.2988 = 1.2951
C42: AVG_ENTROPY(Q,R,T5)
C43: SUBTABLE(Q,Y,T5) is lines 1,3,7,9. Call this T9.
C44: ENTROPY(R,T9) = -((3/9) log(3/9) + 0 log 0 + (6/9) log(6/9)) = 0.9184
C45: SUBTABLE(Q,N,T5) is lines 4,5,6,10,11,12. This is table T4, above (except that the C column has been deleted).
C46: ENTROPY(R,T4) = 0.8838 (see C22 above)
C42: AVG_ENTROPY(Q,R,T5) = (9/41) * 0.9184 + (32/41) * 0.8838 = 0.8914
So we choose AS = Q.
C47: ENTROPY(R,T5) was calculated in C29 above to be 1.3028.

Mark N3 as split on attribute Q.

C48: SUBTABLE(T5,Q,N) is T4 above: lines 4,5,6,10,11,12 (minus columns C and Q).
C49: ID3(T4,R)
Create new node N4.
C50: AVG_ENTROPY(P,R,T4)
C51: SUBTABLE(P,Y,T4) = lines 4,5,6. Call this T10.
C52: ENTROPY(R,T10) = -((2/13) log(2/13) + (11/13) log(11/13) + 0 log 0) = 0.6194
C53: SUBTABLE(P,N,T4) = lines 10,11,12. Call this T11.
C54: ENTROPY(R,T11) = -((1/19) log(1/19) + (15/19) log(15/19) + (3/19) log(3/19)) = 0.9133
C50: AVG_ENTROPY(P,R,T4) = (13/32) * 0.6194 + (19/32) * 0.9133 = 0.7939
Choose AS = P (no other choices).
C55: ENTROPY(R,T4) was calculated in C22 to be 0.8838.

C49 continuing: Mark node N4 as split on P.
C56: SUBTABLE(T4,P,N) is T11 above (lines 10,11,12).
C57: ID3(T11,R)
Make new node N5. No predictive attributes remain.
Label N5: "Prob(X.R=1) = 1/19, Prob(X.R=2) = 15/19, Prob(X.R=3) = 3/19".
Return N5 to C49.
C49 continuing: Make an arc labelled "N" from N4 to N5.
C58: SUBTABLE(T4,P,Y) is T10 above (lines 4,5,6).
C59: ID3(T10,R)
Make new node N6. No predictive attributes remain in T10.
Label N6: "Prob(X.R=1) = 2/13, Prob(X.R=2) = 11/13, Prob(X.R=3) = 0".
Return N6 to C49.
C49 continuing: Make an arc labelled "Y" from N4 to N6.
C49 returns N4 to C36.
C36 continuing: Make an arc labelled "N" from N3 to N4.
C60: SUBTABLE(T5,Q,Y) is T9 above (lines 1,3,7,9).
C61: ID3(T9,R)
Make a new node N7.
C62: AVG_ENTROPY(P,R,T9)
C63: SUBTABLE(P,Y,T9) is lines 1 and 3. Call this T12.
C64: ENTROPY(R,T12) = -((1/4) log(1/4) + (3/4) log(3/4)) = 0.8113
C65: SUBTABLE(P,N,T9) is lines 7 and 9. Call this T13.
C67: ENTROPY(R,T13) = -((2/5) log(2/5) + (3/5) log(3/5)) = 0.9710
C68: AVG_ENTROPY(P,R,T9) = (4/9) * 0.8113 + (5/9) * 0.9710 = 0.9000


AS is P.
C69: ENTROPY(R,T9) was calculated in C44 as 0.9184. The result in C68 is not a substantial improvement over C69, particularly considering the size of table T9, so N7 is made a leaf, labelled "Prob(X.R=1) = 3/9, Prob(X.R=3) = 6/9".
C61 returns N7 to C36.
C36 continuing: Make an arc labelled "Y" from N3 to N7.
C36 returns N3 to C1.
C1 continuing: Make an arc labelled "Y" from N1 to N3.
C1 returns N1.

Final tree:

N1 (split on C)
  N1 --N--> N2: Prob(R=2) = 1
  N1 --Y--> N3 (split on Q)
    N3 --N--> N4 (split on P)
      N4 --N--> N5: Prob(R=1) = 1/19, Prob(R=2) = 15/19, Prob(R=3) = 3/19
      N4 --Y--> N6: Prob(R=1) = 2/13, Prob(R=2) = 11/13, Prob(R=3) = 0
    N3 --Y--> N7: Prob(R=1) = 3/9, Prob(R=3) = 6/9

Note that, in a deterministic tree, there would be no point in the split at N4, since both N5 and N6 predict R=2. This split would be eliminated in post-processing.
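As a cross-check on the first split (not part of the original example), the short Python script below recomputes the first-level average entropies of R for P, Q and C directly from table T0, using the instance counts as weights.

import math

# Table T0 from the example: one tuple (P, Q, R, C, count) per line.
T0 = [
    ("Y", "Y", 1, "Y", 1),  ("Y", "Y", 2, "N", 10), ("Y", "Y", 3, "Y", 3),
    ("Y", "N", 1, "Y", 2),  ("Y", "N", 2, "Y", 11), ("Y", "N", 3, "Y", 0),
    ("N", "Y", 1, "Y", 2),  ("N", "Y", 2, "N", 20), ("N", "Y", 3, "Y", 3),
    ("N", "N", 1, "Y", 1),  ("N", "N", 2, "Y", 15), ("N", "N", 3, "Y", 3),
]
COLS = {"P": 0, "Q": 1, "R": 2, "C": 3}

def entropy_of_R(rows):
    # Entropy of the classification attribute R, weighting each line by its count.
    total = sum(n for *_, n in rows)
    counts = {}
    for p, q, r, c, n in rows:
        counts[r] = counts.get(r, 0) + n
    return -sum(n / total * math.log2(n / total) for n in counts.values() if n)

def avg_entropy(attr):
    # AVG_ENTROPY(attr, R, T0): entropy of R in each subtable, weighted by its size.
    total = sum(n for *_, n in T0)
    result = 0.0
    for value in {row[COLS[attr]] for row in T0}:
        sub = [row for row in T0 if row[COLS[attr]] == value]
        size = sum(n for *_, n in sub)
        result += size / total * entropy_of_R(sub)
    return result

for attr in ("P", "Q", "C"):
    print(attr, round(avg_entropy(attr), 4))
# Expected output (approximately): P 0.9445, Q 0.9428, C 0.7522 -> C is chosen,
# matching the hand calculation above (which uses rounded intermediate values).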

Example 2:


Sample training data to determine whether an animal lays eggs. Warm-blooded, Feathers, Fur and Swims are the independent (condition) attributes; Lays Eggs is the dependent (decision) attribute.

Animal     Warm-blooded  Feathers  Fur  Swims  Lays Eggs
Ostrich    Yes           Yes       No   No     Yes
Crocodile  No            No        No   Yes    Yes
Raven      Yes           Yes       No   No     Yes
Albatross  Yes           Yes       No   No     Yes
Dolphin    Yes           No        No   Yes    No
Koala      Yes           No        Yes  No     No

S = [4Y, 2N]
Entropy(S) = -(4/6)log2(4/6) - (2/6)log2(2/6) = 0.91829

For attribute 'Warm-blooded':
Values(Warm-blooded): [Yes, No]
SYes = [3Y, 2N]   E(SYes) = 0.97095
SNo = [1Y, 0N]    E(SNo) = 0 (all members belong to the same class)
Gain(S, Warm-blooded) = 0.91829 - [(5/6)*0.97095 + (1/6)*0] = 0.10916

For attribute 'Feathers':
Values(Feathers): [Yes, No]
SYes = [3Y, 0N]   E(SYes) = 0
SNo = [1Y, 2N]    E(SNo) = 0.91829
Gain(S, Feathers) = 0.91829 - [(3/6)*0 + (3/6)*0.91829] = 0.45914


For attribute 'Fur':
Values(Fur): [Yes, No]
SYes = [0Y, 1N]   E(SYes) = 0
SNo = [4Y, 1N]    E(SNo) = 0.7219
Gain(S, Fur) = 0.91829 - [(1/6)*0 + (5/6)*0.7219] = 0.3167

For attribute 'Swims':
Values(Swims): [Yes, No]
SYes = [1Y, 1N]   E(SYes) = 1 (equal members in both classes)
SNo = [3Y, 1N]    E(SNo) = 0.81127
Gain(S, Swims) = 0.91829 - [(2/6)*1 + (4/6)*0.81127] = 0.04411

Gain(S, Warm-blooded) = 0.10916
Gain(S, Feathers) = 0.45914
Gain(S, Fur) = 0.31670
Gain(S, Swims) = 0.04411

Gain(S, Feathers) is maximum, so Feathers is chosen as the root node.

We now repeat the procedure for the Feathers = No branch.
S: [Crocodile, Dolphin, Koala], i.e. S = [1Y, 2N]
Entropy(S) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.91829

For attribute 'Warm-blooded':
Values(Warm-blooded): [Yes, No]
S = [1Y, 2N]
SYes = [0Y, 2N]   E(SYes) = 0
SNo = [1Y, 0N]    E(SNo) = 0
Gain(S, Warm-blooded) = 0.91829 - [(2/3)*0 + (1/3)*0] = 0.91829


For attribute 'Fur':
Values(Fur): [Yes, No]
S = [1Y, 2N]
SYes = [0Y, 1N]   E(SYes) = 0
SNo = [1Y, 1N]    E(SNo) = 1
Gain(S, Fur) = 0.91829 - [(1/3)*0 + (2/3)*1] = 0.25162

For attribute 'Swims':
Values(Swims): [Yes, No]
S = [1Y, 2N]
SYes = [1Y, 1N]   E(SYes) = 1
SNo = [0Y, 1N]    E(SNo) = 0
Gain(S, Swims) = 0.91829 - [(2/3)*1 + (1/3)*0] = 0.25162

Gain(S, Warm-blooded) is maximum, so Warm-blooded becomes the next splitting attribute. Both of its branches are now pure (Warm-blooded = Yes gives Lays Eggs = No; Warm-blooded = No gives Lays Eggs = Yes), as is the Feathers = Yes branch (Lays Eggs = Yes), so the tree is complete.
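The gains above can be cross-checked with a short script (not part of the original report) that recomputes Gain(S, A) for each attribute of the animal table.

import math

# Training data from Example 2: (Animal, Warm-blooded, Feathers, Fur, Swims, Lays Eggs).
animals = [
    ("Ostrich",   "Yes", "Yes", "No",  "No",  "Yes"),
    ("Crocodile", "No",  "No",  "No",  "Yes", "Yes"),
    ("Raven",     "Yes", "Yes", "No",  "No",  "Yes"),
    ("Albatross", "Yes", "Yes", "No",  "No",  "Yes"),
    ("Dolphin",   "Yes", "No",  "No",  "Yes", "No"),
    ("Koala",     "Yes", "No",  "Yes", "No",  "No"),
]
ATTRS = {"Warm-blooded": 1, "Feathers": 2, "Fur": 3, "Swims": 4}
TARGET = 5  # column index of 'Lays Eggs'

def entropy(rows):
    n = len(rows)
    counts = {}
    for row in rows:
        counts[row[TARGET]] = counts.get(row[TARGET], 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    col = ATTRS[attr]
    remainder = 0.0
    for value in {row[col] for row in rows}:
        subset = [row for row in rows if row[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for attr in ATTRS:
    print(attr, round(gain(animals, attr), 5))
# Expected (approximately): Warm-blooded 0.109, Feathers 0.459, Fur 0.317, Swims 0.044.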


Pruning Methods

3.1 Overview

Employing tight stopping criteria tends to create small and under-fitted decision trees. On the other hand, using loose stopping criteria tends to generate large decision trees that are over-fitted to the training set. Pruning methods, originally suggested in (Breiman et al., 1984), were developed to resolve this dilemma. According to this methodology, a loose stopping criterion is used, letting the decision tree overfit the training set. The over-fitted tree is then cut back into a smaller tree by removing sub-branches that do not contribute to the generalization accuracy. Various studies have shown that pruning methods can improve the generalization performance of a decision tree, especially in noisy domains.

Another key motivation of pruning is “trading accuracy for simplicity” as presented in (Bratko and Bohanec, 1994). When the goal is to produce a sufficiently accurate compact concept description, pruning is highly useful. Within this process, the initial decision tree is seen as a completely accurate one. Thus the accuracy of a pruned decision tree indicates how close it is to the initial tree.

There are various techniques for pruning decision trees. Most of them perform a top-down or bottom-up traversal of the nodes, and a node is pruned if this operation improves a certain criterion. The following subsections describe the most popular techniques.

3.2 Cost-Complexity Pruning

Cost-complexity pruning (also known as weakest-link pruning or error-complexity pruning) proceeds in two stages (Breiman et al., 1984). In the first stage, a sequence of trees T0, T1, ..., Tk is built on the training data, where T0 is the original tree before pruning and Tk is the tree pruned back to the root alone.

The tree Ti+1 is obtained by replacing one or more of the sub-trees in the predecessor tree Ti with suitable leaves; the sub-trees pruned are those that give the lowest increase in apparent error rate per pruned leaf.

In the second stage, the generalization error of each pruned tree T0, T1, ..., Tk is estimated, and the best pruned tree is selected. If the given dataset is large enough, the authors suggest breaking it into a training set and a pruning set: the trees are constructed using the training set and evaluated on the pruning set. If the dataset is not large enough, they propose using a cross-validation methodology, despite the computational complexity implications.
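The "increase in apparent error rate per pruned leaf" can be written as a small helper; the sketch below is illustrative and not taken from the report. It uses raw error counts, which rank candidate sub-trees the same way as error rates would.

def weakest_link_alpha(subtree_errors, subtree_leaves, node_errors):
    # Cost-complexity measure for collapsing an internal node into a single leaf:
    # alpha = (increase in apparent error) / (number of leaves removed).
    #   subtree_errors: training errors made by the sub-tree rooted at the node
    #   subtree_leaves: number of leaves in that sub-tree
    #   node_errors:    training errors if the node were replaced by one leaf
    # At each step, the sub-tree with the smallest alpha is pruned first.
    return (node_errors - subtree_errors) / (subtree_leaves - 1)

# Hypothetical numbers: collapsing this node adds 3 training errors while
# removing 4 leaves (5 leaves become 1), so alpha = 3 / 4 = 0.75.
print(weakest_link_alpha(subtree_errors=2, subtree_leaves=5, node_errors=5))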

3.3 Reduced Error Pruning

A simple procedure for pruning decision trees, known as reduced error pruning, was suggested by Quinlan (1987). Traversing the internal nodes from the bottom to the top, the procedure checks, for each internal node, whether replacing it with the most frequent class reduces the tree's accuracy. If it does not, the node is pruned. The procedure continues until any further pruning would decrease the accuracy.

In order to estimate the accuracy, Quinlan (1987) proposes to use a pruning set. It can be shown that this procedure ends with the smallest accurate sub–tree with respect to a given pruning set.
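The following Python sketch illustrates reduced error pruning under assumptions that are not part of the report: trees use the nested-dictionary representation from the earlier sketches, each internal node additionally stores its training-set majority class under the key "majority", and a separate pruning set is available. Because replacing a sub-tree with a leaf only changes the predictions for the pruning examples that reach it, the sketch compares error counts locally, which is equivalent to comparing whole-tree accuracy.

def classify(tree, instance):
    # Walk the (possibly pruned) tree down to a leaf label.
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

def reduced_error_prune(tree, pruning_rows, target):
    # Return a pruned copy of `tree`, using the pruning examples that reach this node.
    if not isinstance(tree, dict):
        return tree  # already a leaf
    # Bottom-up: prune the children first, routing the pruning rows down each branch.
    pruned_branches = {}
    for value, child in tree["branches"].items():
        rows_v = [r for r in pruning_rows if r[tree["attribute"]] == value]
        pruned_branches[value] = reduced_error_prune(child, rows_v, target)
    candidate = {"attribute": tree["attribute"],
                 "branches": pruned_branches,
                 "majority": tree["majority"]}
    if not pruning_rows:
        return candidate  # no evidence either way; keep the sub-tree
    subtree_errors = sum(classify(candidate, r) != r[target] for r in pruning_rows)
    leaf_errors = sum(tree["majority"] != r[target] for r in pruning_rows)
    # Replace the node by a leaf whenever this does not reduce accuracy on the pruning set.
    return tree["majority"] if leaf_errors <= subtree_errors else candidate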

3.4 Minimum Error Pruning (MEP)

Minimum error pruning has been proposed in (Olaru and Wehenkel, 2003). It performs a bottom-up traversal of the internal nodes. In each node it compares the l-probability error rate estimation with and without pruning. The l-probability error rate estimation is a correction to the simple probability estimation using frequencies.

The error rate of an internal node is the weighted average of the error rates of its branches, where the weights are determined by the proportion of instances along each branch; the calculation is performed recursively up to the leaves. If an internal node is pruned, it becomes a leaf and its error rate is calculated directly from the same estimation. Consequently, we can compare the error rate before and after pruning a certain internal node. If pruning the node does not increase the error rate, the pruning should be accepted.

3.5 Error–based Pruning (EBP)

Error–based pruning is an evolution of pessimistic pruning. It is implemented in the well–known C4.5 algorithm.

3.6 Optimal Pruning

The issue of finding an optimal pruning has been studied in (Bratko and Bohanec, 1994) and (Almuallim, 1996). The first work introduced an algorithm which guarantees optimality, known as OPT. This algorithm finds the optimal pruning using dynamic programming, with a complexity of Θ(|leaves(T)|^2), where T is the initial decision tree. The second work introduced an improvement of OPT called OPT-2, which also performs optimal pruning using dynamic programming. However, the time and space complexity of OPT-2 is Θ(|leaves(T*)| * |internal(T)|), where T* is the target (pruned) decision tree and T is the initial decision tree.

Since the pruned tree is usually much smaller than the initial tree and the number of internal nodes is smaller than the number of leaves, OPT-2 is generally more efficient than OPT in terms of computational complexity.

3.7 Minimum Description Length (MDL) Pruning

The minimum description length can be used for evaluating the generalized accuracy of a node (Rissanen, 1989; Quinlan and Rivest, 1989; Mehta et al., 1995). This method measures the size of a decision tree by means of the number of bits required to encode the tree. The MDL method prefers decision trees that can be encoded with fewer bits.
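As a rough illustration of the MDL idea, a tree can be scored by the bits needed to describe its structure plus the bits needed to describe its errors on the training data. The encoding below is invented for illustration and is not the one used in the cited papers.

import math

def description_length(num_internal, num_leaves, num_attributes, num_classes,
                       num_errors, num_examples):
    # Two-part code: bits to encode the tree (which attribute at each internal node,
    # which class at each leaf) plus bits to point out the misclassified examples.
    tree_bits = (num_internal * math.log2(max(num_attributes, 2))
                 + num_leaves * math.log2(max(num_classes, 2)))
    exception_bits = num_errors * math.log2(max(num_examples, 2))
    return tree_bits + exception_bits

# MDL prefers the tree with the smaller total description length, trading
# tree size against training errors (hypothetical numbers):
print(description_length(3, 4, 4, 2, 5, 100))   # small tree, more errors
print(description_length(9, 10, 4, 2, 1, 100))  # large tree, fewer errors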

3.8 Comparison of Pruning Methods

Several studies have compared the performance of different pruning techniques (Quinlan, 1987; Mingers, 1989; Esposito et al., 1997). The results indicate that some methods (such as cost-complexity pruning and reduced error pruning) tend to over-prune, i.e. create smaller but less accurate decision trees, while others (such as error-based pruning, pessimistic error pruning and minimum error pruning) are biased toward under-pruning. Most of the comparisons conclude that the "no free lunch" theorem applies here as well: no pruning method outperforms the others in every case.

ADVANTAGES

Several advantages of the decision tree as a classification tool have been pointed out in the literature:

1. Decision trees are self-explanatory and, when compacted, easy to follow. In other words, if the decision tree has a reasonable number of leaves, it can be grasped by non-professional users. Furthermore, decision trees can be converted to a set of rules. Thus, this representation is considered comprehensible.

2. Decision trees can handle both nominal and numeric input attributes.

3. Decision tree representation is rich enough to represent any discrete-value classifier.

4. Decision trees are capable of handling datasets that may have errors.

5. Decision trees are capable of handling datasets that may have missing values.

6. Decision trees are considered a nonparametric method, i.e. they make no assumptions about the space distribution or the classifier structure.

DISADVANTAGES

1. Most of the algorithms (like ID3 and C4.5) require that the target attribute have only discrete values.

2. Because decision trees use the "divide and conquer" method, they tend to perform well if a few highly relevant attributes exist, but less so if many complex interactions are present. One reason is that other classifiers can compactly describe concepts that are very challenging to represent with a decision tree; in such cases the induced tree may contain several copies of the same sub-tree.

3. The greedy characteristic of decision trees leads to another disadvantage that should be pointed out: over-sensitivity to the training set, to irrelevant attributes and to noise (Quinlan, 1993).

REFERENCES

1. http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm

2. http://decisiontrees.net/node/27

3. Andrew Colin, "Building Decision Trees with the ID3 Algorithm", Dr. Dobb's Journal, June 1996.

4. http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/dt_prob1.html

5. http://dms.irb.hr/tutorial/tut_dtrees.php

6. http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html
