Post on 22-Feb-2016
C4.5 and CHAID Algorithm
Pavan J Joshi
2010MCS2095
Special Topics in Database Systems
Outline
• Disadvantages of ID3 algorithm
• C4.5 algorithm
• Gain ratio
• Noisy data and overfitting
• Tree pruning
• Handling of missing values
• Error estimation
• Continuous data
• CHAID
ID3 Algorithm
• Top-down construction of a decision tree by recursively selecting the “best attribute” to use at the current node, based on the training data
• It can only deal with nominal data
• It is not robust in dealing with noisy data sets
• It overfits the tree to the training data
• It creates unnecessarily complex trees without pruning
• It does not handle missing data values well
C4.5 Algorithm
• An improvement over the ID3 algorithm
• Designed to handle
  • Noisy data better
  • Missing data
  • Pre- and post-pruning of decision trees
  • Attributes with continuous values
  • Rule derivation
Using Gain Ratios
• The notion of Gain introduced earlier favors attributes that have a large number of values.
• If an attribute D has a distinct value for each record, then every subset in the partition holds a single record, so Info(D,T) is 0 and Gain(D,T) is maximal.
• To compensate for this, Quinlan suggests using the following ratio instead of Gain:
  GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
• SplitInfo(D,T) is the information due to the split of T on the basis of the value of the categorical attribute D:
  SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, .., |Tm|/|T|)
  where {T1, T2, .., Tm} is the partition of T induced by the value of D.
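The gain ratio above can be sketched in a few lines of Python. This is a minimal illustration, not Quinlan's implementation: the function names and the toy data are assumptions made for the example, and I(...) is taken to be the usual entropy in bits.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """I(...): entropy in bits of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(values, labels):
    """GainRatio(D,T) for one categorical attribute D.

    values[i] is the value of D for record i, labels[i] is its class.
    """
    total = len(labels)
    # Partition T into T1..Tm according to the value of D.
    partition = {}
    for value, label in zip(values, labels):
        partition.setdefault(value, []).append(label)
    # Info(D,T): weighted entropy of the subsets after the split.
    info_d = sum(len(s) / total * entropy(s) for s in partition.values())
    gain = entropy(labels) - info_d
    # SplitInfo(D,T): entropy of the subset sizes |Ti|/|T|.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: rec_id is distinct for every record, outlook is not.
labels = ["yes", "yes", "no", "no"]
rec_id = ["r1", "r2", "r3", "r4"]
outlook = ["sunny", "sunny", "rain", "rain"]
print(gain_ratio(rec_id, labels))   # 0.5
print(gain_ratio(outlook, labels))  # 1.0
```

Both attributes reach the maximal Gain of 1 bit on this data, but the record identifier is penalised by its larger SplitInfo, so the ratio prefers outlook.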
Noisy data
• Many kinds of "noise" can occur in the examples:
  • Two examples have the same attribute/value pairs but different classifications
  • Some values of attributes are incorrect because of:
    • Errors in the data acquisition process
    • Errors in the preprocessing phase
  • The classification is wrong (e.g., + instead of -) because of some error
  • Some attributes are irrelevant to the decision-making process, e.g., the color of a die is irrelevant to its outcome
• Irrelevant attributes can result in overfitting the training data.
What’s Overfitting?
• Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h’ ∈ H such that
  1. h has a smaller error than h’ over the training examples, but
  2. h’ has a smaller error than h over the entire distribution of instances.
Why Does my Method Overfit?
• In domains with noise or uncertainty, the system may try to decrease the training error by completely fitting all the training examples, including the noisy ones
Fix overfitting/overlearning problem
Ok, my system may overfit… Can I avoid it?
• Yes! Do not include branches that fit the data too specifically
How?
1. Pre-prune: Stop growing a branch when information becomes unreliable
2. Post-prune: Take a fully-grown decision tree and discard unreliable parts
Pre-Pruning
• Based on a statistical significance test
• Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
• Use all available data for training and apply the statistical test to estimate whether expanding a node is likely to produce an improvement beyond the training set
• Most popular test: chi-squared test
  $\chi^2 = \sum \frac{(O - E)^2}{E}$
  where O = observed data and E = expected values based on the hypothesis.
Example
• 5 schools take the same test. The total score is 375; the individual results are 50, 93, 67, 78 and 87. Is this distribution significant, or was it just luck? The average is 75.
  $\chi^2 = \frac{(50-75)^2}{75} + \frac{(93-75)^2}{75} + \frac{(67-75)^2}{75} + \frac{(78-75)^2}{75} + \frac{(87-75)^2}{75} = 15.55$
• With 4 degrees of freedom the critical value at the 0.05 level is 9.49, so this distribution is significant!
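The arithmetic of the school example can be checked with a short Python sketch. The function name is hypothetical, and the 0.05 significance level is an assumption; 9.488 is the tabulated chi-squared critical value for 4 degrees of freedom at that level.

```python
def chi_squared(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

scores = [50, 93, 67, 78, 87]
mean = sum(scores) / len(scores)  # 75.0, the expected score if schools do not differ
stat = chi_squared(scores, [mean] * len(scores))

# Assumption: test at the 0.05 level with 5 - 1 = 4 degrees of freedom.
CRITICAL_0_05_DF4 = 9.488
print(round(stat, 2), stat > CRITICAL_0_05_DF4)  # 15.55 True
```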
Post-pruning
• Two pruning operations:
  1. Subtree replacement
  2. Subtree raising
Subtree Replacement
• Pruning of the decision tree is done by replacing a whole subtree by a leaf node.
• The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf.
• E.g.,
  • Training: one red success and one blue failure
  • Test: three red failures and one blue success
  • Consider replacing this subtree by a single Failure node.
• After replacement, the leaf misclassifies only the two successes among the six instances, whereas the subtree misclassifies four (the three red failures and the one blue success).
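[Figure: the Color subtree before and after replacement]
• Training counts: red = 1 success, 0 failure; blue = 0 success, 1 failure
• Training + test counts: red = 1 success, 3 failure; blue = 1 success, 1 failure
• Replacement leaf: FAILURE (2 success, 4 failure)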
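A small Python sketch can count the misclassifications of the Color subtree and of the candidate FAILURE leaf over the six instances of the example. The function names are hypothetical, and plain error counts stand in for the "decision rule" mentioned on the slide.

```python
# Instances from the example as (color, outcome) pairs.
training = [("red", "success"), ("blue", "failure")]
test = [("red", "failure")] * 3 + [("blue", "success")]

def subtree(color):
    """The Color subtree learned from the training data."""
    return "success" if color == "red" else "failure"

def leaf(color):
    """The candidate replacement: a single FAILURE leaf."""
    return "failure"

def errors(classifier, instances):
    """Number of instances whose predicted outcome differs from the true one."""
    return sum(classifier(color) != outcome for color, outcome in instances)

everything = training + test
print(errors(subtree, everything), errors(leaf, everything))  # 4 2
```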
Subtree Raising
Error Estimation
• The error estimate of a subtree is a weighted sum of the error estimates of all its leaves
• Error estimation at every node uses:
  • Z, a constant equal to 0.69 (the value that corresponds to C4.5's default 25% confidence level)
  • F, the error on the training data
  • N, the number of instances covered by the leaf
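The symbols Z, F and N match the upper-confidence-bound formula usually quoted for C4.5's pessimistic pruning, e = (f + z²/2N + z·sqrt(f/N − f²/N + z²/4N²)) / (1 + z²/N). The sketch below assumes that formula; the function names and the leaf counts are hypothetical.

```python
from math import sqrt

def pessimistic_error(f, n, z=0.69):
    """Upper confidence bound on the error rate of a node.

    f: observed error rate on the training data
    n: number of instances covered by the node
    z: the constant from the slide (0.69)
    """
    z2 = z * z
    numerator = f + z2 / (2 * n) + z * sqrt(f / n - f * f / n + z2 / (4 * n * n))
    return numerator / (1 + z2 / n)

def subtree_error(leaves):
    """Weighted sum of the leaf estimates; leaves is a list of (f, n) pairs."""
    total = sum(n for _, n in leaves)
    return sum(n / total * pessimistic_error(f, n) for f, n in leaves)

# Hypothetical subtree with three leaves, compared with its parent as one leaf.
leaves = [(2 / 6, 6), (1 / 2, 2), (2 / 6, 6)]
parent = (5 / 14, 14)
print(subtree_error(leaves), pessimistic_error(*parent))
```

If the estimate for the parent treated as a single leaf is lower than the combined estimate of its leaves, the subtree is a candidate for replacement.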
Deal with continuous data
• When dealing with nominal data, we evaluated the gain for each possible value
• In continuous data, we have infinitely many values. What should we do?
• Continuous-valued attributes may take infinitely many values, but we have a limited number of values in our instances (at most N if we have N instances)
• Therefore, simulate that you have N nominal values
  • Evaluate the information gain for every possible split point of the attribute
  • Choose the best split point
  • The information gain of the attribute is the information gain of the best split
Example
Split in continuous data
• Split on the temperature attribute
• For example, in the above array of values the split occurs between 71 and 72 (N distinct values means at most N-1 splits)
• The threshold value is the largest value in the whole training set that does not exceed the midpoint of 71 and 72
• Of all such splits, the one with the best Information Gain is chosen for the node
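The search over split points can be sketched as follows. This is a simplified illustration using plain information gain; the function names are hypothetical, and the temperature values and class labels are assumed example data, not taken from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, gain) for the best binary split 'value <= threshold'."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    base = entropy(labels)
    best = (None, 0.0)
    # N distinct values give at most N-1 candidate split points.
    for low, high in zip(distinct, distinct[1:]):
        midpoint = (low + high) / 2
        left = [c for v, c in pairs if v <= midpoint]
        right = [c for v, c in pairs if v > midpoint]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            # Report the largest training value that does not exceed the midpoint.
            best = (low, gain)
    return best

# Assumed example data: temperatures with a yes/no class.
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["yes", "no", "yes", "yes", "yes", "no", "no",
        "yes", "yes", "yes", "no", "yes", "yes", "no"]
print(best_split(temps, play))
```

Each candidate is evaluated with a full pass over the data, which keeps the sketch short; a production implementation would sort once and update the class counts incrementally.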
Deal with missing values
• Many possible approaches:
  • Treat a missing value as a separate attribute value
  • Propagate the cases containing such values down the tree without considering them in the “Information Gain” calculation
From Trees to Rules
• Now that we've built a tree, it might be desirable to re-express it as a list of rules.
• Simple method: generate a rule by the conjunction of tests in each path through the tree.
• E.g.:
  if temp > 71.5 and ... and windy = false then play=yes
  if temp > 71.5 and ... and windy = true then play=no
• But these rules are more complicated than necessary.
• Instead we could use the pruning method of C4.5 to prune rules as well as trees.
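The simple method, one rule per root-to-leaf path, can be sketched with a recursive walk. The nested-dict tree representation and the small tree below are assumptions made for illustration; they are not the tree from the slides.

```python
def tree_to_rules(node, path=()):
    """Yield (conditions, class) for every root-to-leaf path.

    A node is either a class label (leaf) or a dict of the form
    {"test": <attribute name>, "branches": {<outcome>: <child node>}}.
    """
    if not isinstance(node, dict):
        yield list(path), node
        return
    for outcome, child in node["branches"].items():
        condition = f"{node['test']} {outcome}"
        yield from tree_to_rules(child, path + (condition,))

# Hypothetical tree, loosely modelled on the play example.
tree = {
    "test": "temp",
    "branches": {
        "<= 71.5": "yes",
        "> 71.5": {"test": "windy", "branches": {"= false": "yes", "= true": "no"}},
    },
}

for conditions, label in tree_to_rules(tree):
    print("if", " and ".join(conditions), "then play =", label)
```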
Rule Derivation
for each rule,
  e = error rate of rule
  e' = error rate of rule - finalCondition
  if e' < e,
    rule = rule - finalCondition
    recurse
remove duplicate rules
• Expensive: need to re-evaluate the entire training set for every condition!
• Might create duplicate rules if all of the final conditions from a path are removed.
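The pseudocode above translates fairly directly into Python. In this sketch a rule is a list of (attribute, value) equality conditions plus a class label, and the error rate is measured on the records the rule covers; both choices, along with the function names and the data, are assumptions for illustration.

```python
def error_rate(conditions, label, data):
    """Fraction of the records matching all conditions whose class differs from label."""
    covered = [row for row in data if all(row[a] == v for a, v in conditions)]
    if not covered:
        return 1.0
    return sum(row["class"] != label for row in covered) / len(covered)

def prune_rule(conditions, label, data):
    """Drop the final condition for as long as doing so lowers the error rate."""
    conditions = list(conditions)
    while conditions:
        shorter = conditions[:-1]
        if error_rate(shorter, label, data) < error_rate(conditions, label, data):
            conditions = shorter
        else:
            break
    return conditions

def prune_rules(rules, data):
    """Prune every rule, then remove duplicate rules."""
    pruned = []
    for conditions, label in rules:
        rule = (tuple(prune_rule(conditions, label, data)), label)
        if rule not in pruned:
            pruned.append(rule)
    return pruned

# Hypothetical training records.
data = [
    {"outlook": "sunny", "windy": "true", "class": "no"},
    {"outlook": "sunny", "windy": "true", "class": "yes"},
    {"outlook": "sunny", "windy": "false", "class": "no"},
    {"outlook": "sunny", "windy": "false", "class": "no"},
    {"outlook": "rain", "windy": "false", "class": "yes"},
    {"outlook": "rain", "windy": "true", "class": "yes"},
]
rule = ([("outlook", "sunny"), ("windy", "true")], "no")
print(prune_rules([rule, rule], data))
```

Every call to error_rate scans the whole data set, which is the cost the slide warns about.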
Chi-Squared Automatic Interaction Detection (CHAID)
• It is one of the oldest tree classification methods, originally proposed by Kass in 1980
• The first step is to create categorical predictors out of any continuous predictors by dividing the respective continuous distributions into a number of categories with an approximately equal number of observations
• The next step is to cycle through the predictors to determine for each predictor the pair of (predictor) categories that is least significantly different with respect to the dependent variable
• The next step is to choose the predictor variable with the smallest adjusted p-value, i.e., the predictor variable that will yield the most significant split, and split on it
• Continue this process until no further splits can be performed
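The first step, turning a continuous predictor into categories with roughly equal counts, can be sketched as equal-frequency binning. The function name and the sample values are hypothetical, and this simple version may place tied values in different categories.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a category index 0..n_bins-1 with roughly equal counts."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    categories = [0] * len(values)
    for rank, i in enumerate(order):
        # The rank within the sorted order decides the category.
        categories[i] = rank * n_bins // len(values)
    return categories

# Hypothetical continuous predictor split into 3 categories.
print(equal_frequency_bins([30, 10, 50, 20, 60, 40], 3))  # [1, 0, 2, 0, 2, 1]
```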
Algorithm
Dividing the cases that reach a certain node in the tree
1. Cross-tabulate the response variable (target) with each of the explanatory variables.
        A <= 10   A > 10
Good
Bad
Algorithm – step 2
2. When there are more than two columns, find the "best" subtable formed by combining column categories
  2.1 This is applied to each table with more than 2 columns.
  2.2 Compute Pearson $X^2$ tests for independence for each allowable subtable
  2.3 Look for the smallest $X^2$ value. If it is not significant, combine the column categories.
  2.4 Repeat step 2 if the new table has more than two columns
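Step 2 can be sketched by computing the Pearson statistic for every two-column subtable and picking the smallest. The function names and counts are hypothetical. The sketch treats every pair of categories as allowable (a nominal predictor) and assumes no row or column total is zero; 3.841 is the tabulated 0.05 critical value for 1 degree of freedom.

```python
from itertools import combinations

def pearson_chi2(table):
    """Pearson X^2 for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

def least_different_pair(columns):
    """columns maps category -> [good count, bad count].

    Returns the pair of categories with the smallest X^2 and that X^2.
    """
    def pair_stat(pair):
        a, b = pair
        # Good/Bad rows by two columns: the subtable for this pair.
        return pearson_chi2(list(zip(columns[a], columns[b])))
    best = min(combinations(columns, 2), key=pair_stat)
    return best, pair_stat(best)

# Hypothetical cross tabulation with three column categories.
columns = {"low": [30, 10], "mid": [28, 12], "high": [10, 30]}
pair, stat = least_different_pair(columns)
CRITICAL_0_05_DF1 = 3.841
print(pair, round(stat, 3), "merge" if stat < CRITICAL_0_05_DF1 else "keep separate")
```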
Algorithm – step 3
3. Allows categories combined at step 2 to be broken apart.
  3.1 For each compound category consisting of at least 3 of the original categories, find the “most significant" binary split
  3.2 If $X^2$ is significant, implement the split and return to step 2.
  3.3 Otherwise retain the compound categories for this variable, and move on to the next variable
Algorithm – step 4
4. You have now completed the “optimal” combining of categories for each explanatory variable.
  4.1 Find the most significant of these “optimally” merged explanatory variables
  4.2 Compute a “Bonferroni” adjusted chi-squared test of independence for the reduced table for each explanatory variable.
Algorithm – step 5
5. Use the “most significant" variable in step 4 to split the node with respect to the merged categories for that variable.
  5.1 Repeat steps 1-5 for each of the offspring nodes.
  5.2 Stop if
    • no variable is significant in step 4.
    • the number of cases reaching a node is below a specified limit.
References
• C4.5 Algorithm and Multivariate Decision Trees, by Thales Sehn Körting
• http://www.statsoft.com/textbook/chaid-analysis/
• http://www.public.iastate.edu/~kkoehler/stat557/tree14p.pdf
Thank you!