COMP3740 CR32: Knowledge Management and Adaptive Systems
Supervised ML to learn Classifiers: Decision Trees and Classification Rules
Eric Atwell, School of Computing, University of Leeds
(including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts,
School of Computing, University of Leeds)
Reminder: Objectives of data mining
• Data mining aims to find useful patterns in data.
• For this we need:
  – Data mining techniques, algorithms, tools, e.g. WEKA
  – A methodological framework to guide us in collecting data and applying the best algorithms, e.g. CRISP-DM
• TODAY’S objective: learn how to learn classifiers
• Decision Trees and Classification Rules
• Supervised Machine Learning: training set has the “answer” (class) for each example (instance)
Reminder: Concepts that can be "learnt"
The types of concepts we try to 'learn' include:
• Clusters or 'natural' partitions
  – E.g. we might cluster customers according to their shopping habits.
• Rules for classifying examples into pre-defined classes
  – E.g. "Mature students studying information systems with a high grade for General Studies A-level are likely to get a 1st class degree"
• General associations
  – E.g. "People who buy nappies are in general likely also to buy beer"
• Numerical prediction
  – E.g. Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree (but are Gender and Programme really numbers?)
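The last bullet hints at a real problem: categorical attributes like Gender or Programme cannot simply be multiplied by a coefficient. A common fix is one-hot encoding, sketched below; the programme codes are hypothetical, chosen only for illustration.

```python
def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicator features,
    one per possible category."""
    return [1 if value == c else 0 for c in categories]

programmes = ["CS", "IS", "AI"]   # assumed programme codes, for illustration
print(one_hot("IS", programmes))  # one indicator per programme: [0, 1, 0]
```

Each encoded category then gets its own coefficient in the linear formula, instead of pretending the category labels are numbers.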
Output: decision tree

Outlook
  sunny → Humidity
    high → Play = 'no'
    normal → Play = 'yes'
  rainy → Windy
    true → Play = 'no'
    false → Play = 'yes'
Decision Tree Analysis
• Example instance set:

  Shares files | Uses scanner | Infected before | Risk
  Yes          | Yes          | No              | High
  Yes          | No           | No              | High
  No           | No           | Yes             | Medium
  Yes          | Yes          | Yes             | Low
  Yes          | Yes          | No              | High
  No           | Yes          | No              | Low
  Yes          | No           | Yes             | High

Can we predict, from the first 3 columns, the risk of getting a virus?
For convenience later: F = 'shares Files', S = 'uses Scanner', I = 'Infected before'
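For reference in later slides, the same seven instances can be written down as plain Python tuples; the class distribution shows why no single leaf would do.

```python
from collections import Counter

# The seven training instances from the table above, as (F, S, I, Risk) tuples.
training = [
    ("Yes", "Yes", "No",  "High"),
    ("Yes", "No",  "No",  "High"),
    ("No",  "No",  "Yes", "Medium"),
    ("Yes", "Yes", "Yes", "Low"),
    ("Yes", "Yes", "No",  "High"),
    ("No",  "Yes", "No",  "Low"),
    ("Yes", "No",  "Yes", "High"),
]

# Class distribution: 4 High, 2 Low, 1 Medium - mixed, so a tree is needed.
print(Counter(risk for *_, risk in training))
```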
Decision tree building method
• Forms a decision tree
  – tries for a small tree covering all or most of the training set
  – internal nodes represent a test on an attribute value
  – branches represent outcomes of the test
• Decides which attribute to test at each node
  – this is based on a measure of 'entropy'
• Must avoid 'over-fitting'
  – if the tree is complex enough it might describe the training set exactly, but be no good for prediction
• May leave some 'exceptions'
Building a decision tree (DT)
The algorithm is recursive. At any step:
  T = set of (remaining) training instances
  {C1, …, Ck} = set of classes
• If all instances in T belong to a single class Ci, then DT is a leaf node identifying class Ci. (done!)
• If T contains instances belonging to mixed classes, then choose a test based on a single attribute that partitions T into subsets {T1, …, Tn} according to the n outcomes of the test. The DT for T comprises a root node identifying the test and one branch for each outcome of the test.
• The branches are formed by applying the rules above recursively to each of the subsets {T1, …, Tn}.
Tree Building example
T = the full instance set. Classes = {High, Medium, Low}.
Choose a test based on F; number of outcomes n = 2 (Yes or No).

T1 (F = Yes):
  F   S   I   Risk
  Yes Yes No  High
  Yes No  No  High
  Yes Yes Yes Low
  Yes Yes No  High
  Yes No  Yes High

T2 (F = No):
  F   S   I   Risk
  No  No  Yes Medium
  No  Yes No  Low
Tree Building example
Classes = {High, Medium, Low}
Within T1, choose a test based on I; number of outcomes n = 2 (Yes or No).

T3 (F = Yes, I = Yes):
  F   S   I   Risk
  Yes Yes Yes Low
  Yes No  Yes High

T4 (F = Yes, I = No):
  F   S   I   Risk
  Yes Yes No  High
  Yes No  No  High
  Yes Yes No  High

All instances in T4 belong to one class, so it becomes a leaf: Risk = 'High'.
Tree Building example
Classes = {High, Medium, Low}
Within T3, choose a test based on S; number of outcomes n = 2 (Yes or No).

S = Yes:
  F   S   I   Risk
  Yes Yes Yes Low
→ leaf: Risk = 'Low'

S = No:
  F   S   I   Risk
  Yes No  Yes High
→ leaf: Risk = 'High'
Tree Building example
Classes = {High, Medium, Low}
Within T2, choose a test based on S; number of outcomes n = 2 (Yes or No).

S = Yes:
  F   S   I   Risk
  No  Yes No  Low
→ leaf: Risk = 'Low'

S = No:
  F   S   I   Risk
  No  No  Yes Medium
→ leaf: Risk = 'Medium'
Example Decision Tree

Shares files?
  no → Uses scanner?
    no → medium
    yes → low
  yes → Infected before?
    yes → Uses scanner?
      no → high
      yes → low
    no → high
Which attribute to test?
• The ROOT could be S or I instead of F – leading to a different decision tree.
• The best DT is the "smallest", most concise model.
• The search space in general is too large to find the smallest tree by exhaustive search (trying them all).
• Instead we look for the attribute which splits the training set into the most homogeneous subsets.
• The measure used for 'homogeneity' is based on entropy.
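As a minimal sketch of that measure, the Shannon entropy of a set of class labels is 0 bits when the set is pure and 1 bit for a 50/50 two-class split:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

print(entropy(["High", "High", "High"]))  # a pure set: 0 bits
print(entropy(["High", "Low"]))           # a 50/50 split: 1 bit
```

The attribute chosen at each node is the one whose split reduces this measure the most (the 'information gain').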
Tree Building example (modified)
Now the class is a boolean: High Risk? Classes = {Yes, No}.

T =
  F   S   I   High Risk?
  Yes Yes No  Yes
  Yes No  No  Yes
  No  No  Yes No
  Yes Yes Yes No
  Yes Yes No  Yes
  No  Yes No  No
  Yes No  Yes Yes

Choose a test based on F; number of outcomes n = 2 (Yes or No).

F = Yes → High Risk = 'yes': 5 instances, 1 error
F = No  → High Risk = 'no': 2 instances, 0 errors
Tree Building example (modified)
Classes = {Yes, No}
Alternatively, choose a test based on S; number of outcomes n = 2 (Yes or No).

S = Yes:
  F   S   I   High Risk?
  Yes Yes No  Yes
  Yes Yes Yes No
  Yes Yes No  Yes
  No  Yes No  No
→ High Risk = 'no': 4 instances, 2 errors

S = No:
  F   S   I   High Risk?
  Yes No  No  Yes
  No  No  Yes No
  Yes No  Yes Yes
→ High Risk = 'yes': 3 instances, 1 error
Tree Building example (modified)
Classes = {Yes, No}
Or choose a test based on I; number of outcomes n = 2 (Yes or No).

I = Yes:
  F   S   I   High Risk?
  No  No  Yes No
  Yes Yes Yes No
  Yes No  Yes Yes
→ High Risk = 'no': 3 instances, 1 error

I = No:
  F   S   I   High Risk?
  Yes Yes No  Yes
  Yes No  No  Yes
  Yes Yes No  Yes
  No  Yes No  No
→ High Risk = 'yes': 4 instances, 1 error
Decision tree building algorithm
• For each decision point:
  – If the remaining examples are all +ve or all -ve, stop.
  – Else if there are some +ve and some -ve examples left and some attributes left, pick the remaining attribute with the largest information gain.
  – Else if there are no examples left, no such example has been observed; return a default.
  – Else if there are no attributes left, examples with the same description have different classifications: noise, insufficient attributes, or a nondeterministic domain.
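The four cases above can be sketched as a recursive builder. This is a minimal illustration, not WEKA's implementation: it assumes examples are (feature-dict, class) pairs and that the caller supplies a scoring function `gain` (e.g. information gain).

```python
from collections import Counter

def build_tree(examples, attributes, gain):
    """Recursive decision-tree builder following the four cases above.
    `examples`: list of (features_dict, cls) pairs.
    `gain(examples, attr)`: caller-supplied score for splitting on attr."""
    if not examples:
        return None                      # no examples: caller supplies a default
    classes = [cls for _, cls in examples]
    if len(set(classes)) == 1:
        return classes[0]                # all one class: leaf node
    if not attributes:
        # same descriptions, mixed classes: fall back to majority vote
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    tree = {"test": best, "branches": {}}
    for value in {feats[best] for feats, _ in examples}:
        subset = [(f, c) for f, c in examples if f[best] == value]
        remaining = [a for a in attributes if a != best]
        tree["branches"][value] = build_tree(subset, remaining, gain)
    return tree
```

With an entropy-based `gain`, this reproduces the hand-built virus-risk tree from the earlier slides.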
Evaluation of decision trees
• At the leaf nodes two numbers are given:
  – N: the coverage of that node: how many instances it covers
  – E: the error rate: how many of those instances are wrongly classified
• The whole tree can be evaluated in terms of its size (number of nodes) and overall error rate, expressed as the number and percentage of cases wrongly classified.
• We seek small trees that have low error rates.
Evaluation of decision trees
• The error rate for the whole tree can also be displayed as a confusion matrix:

  (A)  (B)  (C)   ← classified as
   35    2    1   Class (A) = high
    4   41    5   Class (B) = medium
    2    5   68   Class (C) = low
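The overall error rate falls out of the matrix directly: correct classifications sit on the diagonal, everything else is an error.

```python
# Confusion matrix from the slide above: rows = actual class,
# columns = predicted class, order (A) high, (B) medium, (C) low.
matrix = [
    [35,  2,  1],   # actual high
    [ 4, 41,  5],   # actual medium
    [ 2,  5, 68],   # actual low
]
total = sum(sum(row) for row in matrix)              # 163 cases
correct = sum(matrix[i][i] for i in range(3))        # 144 on the diagonal
error_rate = (total - correct) / total               # 19 errors, about 11.7%
print(total, correct, round(error_rate, 3))
```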
Evaluation of decision trees
• The error rates mentioned on previous slides are normally computed using:
  a. the training set of instances, or
  b. a test set of instances – some different examples!
• If the decision tree algorithm has 'over-fitted' the data, then the error rate based on the training set will be far lower than that based on the test set.
Evaluation of decision trees
• 10-fold cross-validation can be used when the training set is limited in size:
  – Divide the data set randomly into 10 subsets.
  – Build a tree from 9 of the subsets and test using the 10th.
  – Repeat the experiment 9 more times, using a different test subset each time.
  – The overall error rate is the average of the 10 experiments.
• 10-fold cross-validation will lead to up to 10 different decision trees being built. The method for selecting or constructing the best tree is not clear.
From decision trees to rules
• Decision trees may not be easy to interpret:
  – tests associated with lower nodes have to be read in the context of tests further up the tree
  – 'sub-concepts' may sometimes be split up and distributed to different parts of the tree (see next slide)
  – Computer Scientists may prefer "if … then …" rules!
DT for "F = G = 1 or J = K = 1"

F = 0 → J = 0: no
        J = 1 → K = 0: no
                K = 1: yes
F = 1 → G = 1: yes
        G = 0 → J = 0: no
                J = 1 → K = 0: no
                        K = 1: yes

The sub-concept J = K = 1 is split across two subtrees.
Converting DT to rules
• Step 1: every path from root to leaf represents a rule:

  If F = 0 and J = 0 then class no
  If F = 0 and J = 1 and K = 0 then class no
  If F = 0 and J = 1 and K = 1 then class yes
  ….
  If F = 1 and G = 0 and J = 1 and K = 1 then class yes
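Step 1 is mechanical: walk every root-to-leaf path, collecting the tests along the way. A minimal sketch, assuming the nested-dict tree shape `{"test": attr, "branches": {...}}` used earlier (not the deck's own notation):

```python
def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one rule: (list of (attr, value) tests, class)."""
    if not isinstance(tree, dict):       # reached a leaf: emit one rule
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["test"], value),))
    return rules

# A fragment of the F/G/J/K tree from the slide above.
tree = {"test": "F", "branches": {
    0: {"test": "J", "branches": {0: "no", 1: "yes"}},
    1: "yes",
}}
for conds, cls in tree_to_rules(tree):
    print("If", " and ".join(f"{a} = {v}" for a, v in conds), "then class", cls)
```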
Generalising rules
  If F = 0 and J = 1 and K = 1 then class yes
  If F = 1 and G = 0 and J = 1 and K = 1 then class yes
The generalised rule set becomes:
  If G = 1 then class yes
  If J = 1 and K = 1 then class yes
Tidying up rule sets
• Generalisation leads to 2 problems:
• The rules are no longer mutually exclusive:
  – Order the rules and use the first matching rule as the operative rule.
  – Ordering is based on how many false positive errors each rule makes.
• The rule set is no longer exhaustive:
  – Choose a default value for the class when no rule applies.
  – The default class is the one containing the most training cases not covered by any rule.
Decision Tree - Revision
The decision tree building algorithm discovers rules for classifying instances.
At each step it needs to decide which attribute to test at that point in the tree; a measure of 'information gain' can be used.
The output is a decision tree based on the 'training' instances, evaluated with separate 'test' instances.
Leaf nodes which have a small coverage may be pruned if the error rate of the pruned tree remains small.
Pruning example (from W & F)

Health plan contribution
  none → 4 bad, 2 good
  half → 1 bad, 1 good
  full → 4 bad, 2 good

We replace the subtree with a single leaf: Bad, covering 14 instances with 5 errors.
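The arithmetic behind that pruning decision is worth checking: the unpruned subtree makes the same number of errors as the single leaf, so the smaller tree wins.

```python
# Error counts for the three leaves above, as (bad, good) pairs;
# each leaf predicts its majority class (ties here broken as 'bad').
leaves = [(4, 2), (1, 1), (4, 2)]
subtree_errors = sum(min(bad, good) for bad, good in leaves)   # 2 + 1 + 2 = 5
total_bad = sum(bad for bad, _ in leaves)                      # 9
total_good = sum(good for _, good in leaves)                   # 5
pruned_errors = min(total_bad, total_good)  # one 'Bad' leaf misclassifies the 5 good cases
print(subtree_errors, pruned_errors)        # equal, so prune to the smaller tree
```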
Decision trees v classification rules
• Decision trees can be used for prediction or interpretation:
  – Prediction: compare an unclassified instance against the tree and predict what class it is in (with an error estimate).
  – Interpretation: examine the tree and try to understand why instances end up in the class they are in.
• Rule sets are often better for interpretation:
  – 'Small', accurate rules can be examined, even if the overall accuracy of the rule set is poor.
Self Check
• You should be able to:
  – Describe how decision trees are built from a set of instances.
  – Build a decision tree based on a given attribute.
  – Explain what the 'training' and 'test' sets are for.
  – Explain what "supervised" means, and why classification is an example of supervised ML.