Homework 4: Solutions
CS4445/B12
Provided by: Kenneth J. Loomis
CLASSIFICATION RULES: RIPPER ALGORITHM
RIPPER: First Rule
• The first thing that needs to be determined is the consequent of the rule. Recall that a rule is made up of an antecedent and a consequent: antecedent -> consequent.
• The table below contains the frequency counts of the possible consequents of the rules from the userprofile dataset, using budget as the classification attribute:

Rule                   Frequency
… -> budget=low        35
… -> budget=medium     91
… -> budget=high       5
… -> budget=?          7

• We can see that budget=high has the lowest frequency count in our training dataset, so we choose that as the first consequent that we will find rules for.
• Note: I have included missing values here, as one could classify the target as missing. Alternatively, these instances could be removed.
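The choice of the first consequent can be sketched in a few lines of Python. This is only an illustration using the frequency counts from the table above; the variable names are made up for the example.

```python
from collections import Counter

# Class frequencies taken from the table above (budget is the class attribute).
class_counts = Counter({"low": 35, "medium": 91, "high": 5, "?": 7})

# RIPPER learns rules for the classes from least frequent to most frequent.
ordered_classes = sorted(class_counts, key=class_counts.get)
first_consequent = ordered_classes[0]

print(ordered_classes)    # ['high', '?', 'low', 'medium']
print(first_consequent)   # high
```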
RIPPER: First Rule
• Next we attempt to find the first condition in the antecedent. We need only look at the possible conditions that exist in the 5 instances that have budget=high.
• The list of possible conditions is in the table below.

Rule: ___ -> budget=high
smoker=true
smoker=false
drink_level=abstemious
drink_level=casual drinker
drink_level=social drinker
dress_preference=no preference
dress_preference=informal
dress_preference=formal
ambience=family
ambience=friends
transport=car owner
transport=public
marital_status=single
interest=technology
interest=none
interest=variety
personality=hard-worker
personality=conformist
personality=hunter-ostentatious
personality=thrifty-protector
religion=none
religion=mormon
religion=christian
activity=student
RIPPER: First Rule
• Next we determine the information gain for each of the candidate rules in the table.
• Below is a detailed example of the calculation for the rule smoker=true -> budget=high.
Given:
• $p_0$ is the number of instances such that budget=high
• $n_0$ is the number of instances such that budget ≠ high
• $p_1$ is the number of instances such that smoker=true and budget=high
• $n_1$ is the number of instances such that smoker=true but budget ≠ high

$$\text{Gain} = p_1 \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right)$$
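The gain calculation can be sketched in Python, assuming the FOIL-style gain p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))). The counts p0=5 and n0=133 follow from the class frequency table; smoker=true covers 26 instances in the dataset, and the split p1=1, n1=25 is an assumption inferred from the gain reported for this rule rather than a value stated in the slides.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """Gain of adding a condition to a rule.

    p0, n0: positives/negatives covered before adding the condition.
    p1, n1: positives/negatives covered after adding the condition.
    """
    if p1 == 0:
        return 0.0  # convention for this sketch: no positives covered, no gain
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)

# smoker=true -> budget=high (p1 and n1 are inferred, see the note above)
gain_smoker_true = foil_gain(p0=5, n0=133, p1=1, n1=25)
print(round(gain_smoker_true, 4))  # 0.0862
```

A condition that covers a single positive instance and no negatives (p1=1, n1=0) gives a gain of about 4.7866 under the same formula.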
RIPPER: First Rule
• Here we see a list of the information gain for each of the possible first conditions in the antecedent.

Rule: ___ -> budget=high             Info Gain
smoker=true                          0.0862
smoker=false                         0.07365
drink_level=abstemious               2.0974
drink_level=casual drinker           -0.7680
drink_level=social drinker           -0.5353
dress_preference=no preference       0.1174
dress_preference=informal            -0.3426
dress_preference=formal              -0.5710
ambience=family                      -0.6854
ambience=friends                     2.5440
transport=car owner                  6.7865
transport=public                     -1.5710
marital_status=single                0.8889
interest=technology                  3.6049
interest=none                        -0.1203
interest=variety                     3.6049
personality=hard-worker              -1.1441
personality=conformist               1.9792
personality=hunter-ostentatious      1.2016
personality=thrifty-protector        -0.1428
religion=none                        -0.1203
religion=mormon                      4.7866
religion=christian                   1.9792
activity=student                     -0.1343
RIPPER: First Rule
• Since the following rule results in the highest information gain, we select its condition as the first condition of our rule:
transport=car owner -> budget=high
• Now we can use the numbers of positive and negative instances covered by this rule as the new baseline counts, and we calculate the gain of every possible second condition as in the next set of calculations.
RIPPER: First Rule
• Next we attempt to find the second condition in the antecedent. We need only look at the possible conditions that exist in the 4 instances that have transport=car owner and budget=high.
• The list of possible conditions is in the table below.

Rule: transport=car owner and ___ -> budget=high
smoker=false
drink_level=abstemious
drink_level=casual drinker
dress_preference=no preference
dress_preference=informal
dress_preference=elegant
ambience=family
ambience=friends
marital_status=single
interest=technology
interest=none
interest=variety
personality=hard-worker
personality=hunter-ostentatious
personality=thrifty-protector
religion=none
religion=mormon
religion=christian
activity=student
RIPPER: First Rule
• Here we see a list of the information gain for each of the possible second conditions in the antecedent.

Rule: transport=car owner and ___ -> budget=high     Info Gain
smoker=false                                         2.5121
drink_level=abstemious                               5.0173
drink_level=casual drinker                           -0.6130
dress_preference=no preference                       -.06097
dress_preference=informal                            0.7655
dress_preference=elegant                             3.0875
ambience=family                                      -0.6130
ambience=friends                                     1.5075
marital_status=single                                2.7570
interest=technology                                  2.5602
interest=none                                        0.0875
interest=variety                                     2.5602
personality=hard-worker                              -1.1605
personality=hunter-ostentatious                      0.7655
personality=thrifty-protector                        1.5311
religion=none                                        -0.0824
religion=mormon                                      3.0875
religion=christian                                   3.0875
activity=student                                     -0.0840
RIPPER: First Rule
• Since the following rule results in the highest information gain, we select its new condition as the second condition of our rule:
transport=car owner and drink_level=abstemious -> budget=high
• Now we can use the numbers of positive and negative instances covered by this rule as the new baseline counts, and we calculate the gain of every possible third condition as in the next set of calculations.
RIPPER: First Rule
• Next we attempt to find the third condition in the antecedent. We need only look at the possible conditions that exist in the 3 instances that have transport=car owner and drink_level=abstemious and budget=high.
• The list of possible conditions is in the table below.

Rule: transport=car owner and drink_level=abstemious and ___ -> budget=high
smoker=false
dress_preference=no preference
dress_preference=formal
ambience=family
ambience=friends
marital_status=single
interest=technology
interest=none
interest=variety
personality=hard-worker
personality=hunter-ostentatious
personality=thrifty-protector
religion=none
religion=catholic
religion=christian
activity=student
RIPPER: First Rule
• Here we see a list of the information gain for each of the possible third conditions in the antecedent.

Rule: transport=car owner and drink_level=abstemious and ___ -> budget=high     Info Gain
smoker=false                                                                    0
dress_preference=no preference                                                  -0.3399
dress_preference=formal                                                         1.4513
ambience=family                                                                 -0.5850
ambience=friends                                                                2.8300
marital_status=single                                                           1.2415
interest=technology                                                             0.4515
interest=none                                                                   -0.5850
interest=variety                                                                0.4515
personality=hard-worker                                                         -0.5850
personality=hunter-ostentatious                                                 1.4150
personality=thrifty-protector                                                   -0.1699
religion=none                                                                   -0.1699
religion=catholic                                                               -0.5850
religion=christian                                                              1.4150
activity=student                                                                .01826
RIPPER: First Rule
• Since the following rule results in the highest information gain, we select its new condition as the third condition of our rule:
transport=car owner and drink_level=abstemious and ambience=friends -> budget=high
• Note that this rule covers only positive examples (i.e., budget=high data instances). Since it doesn’t cover any negative examples, there is no need to add more conditions to the rule. RIPPER’s construction of the first rule is now complete.
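The growing phase walked through above (pick the condition with the highest gain, restrict to the covered instances, repeat until no negatives are covered) can be sketched as a greedy loop. This is a simplified illustration, not RIPPER's full procedure; the six rows below are hypothetical toy data, not the userprofile dataset, and the function names are made up for the example.

```python
import math

def foil_gain(p0, n0, p1, n1):
    # FOIL-style gain; candidates in this sketch always cover >= 1 positive.
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def grow_rule(rows, target_attr, target_val):
    """Greedily add attribute=value conditions until no negatives are covered."""
    covered, conditions = list(rows), []
    while True:
        pos = [r for r in covered if r[target_attr] == target_val]
        neg = [r for r in covered if r[target_attr] != target_val]
        if not pos or not neg:
            break
        # Only conditions that occur in the covered positive instances.
        candidates = {(a, v) for r in pos for a, v in r.items()
                      if a != target_attr and (a, v) not in conditions}
        best, best_gain = None, 0.0
        for a, v in sorted(candidates):
            sub = [r for r in covered if r[a] == v]
            p1 = sum(r[target_attr] == target_val for r in sub)
            gain = foil_gain(len(pos), len(neg), p1, len(sub) - p1)
            if gain > best_gain:
                best, best_gain = (a, v), gain
        if best is None:  # no condition with positive gain is left
            break
        conditions.append(best)
        covered = [r for r in covered if r[best[0]] == best[1]]
    return conditions

# Hypothetical toy data (NOT the userprofile dataset).
rows = [
    {"transport": "car owner", "drink_level": "abstemious", "budget": "high"},
    {"transport": "car owner", "drink_level": "abstemious", "budget": "high"},
    {"transport": "car owner", "drink_level": "casual drinker", "budget": "low"},
    {"transport": "public", "drink_level": "abstemious", "budget": "low"},
    {"transport": "public", "drink_level": "casual drinker", "budget": "low"},
    {"transport": "car owner", "drink_level": "casual drinker", "budget": "low"},
]
rule = grow_rule(rows, "budget", "high")
print(rule)
```

On this toy data the loop first picks drink_level=abstemious and then transport=car owner, at which point the rule covers only positive rows and growth stops.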
RIPPER: Pruning the First Rule
First rule: transport=car owner and drink_level=abstemious and ambience=friends -> budget=high
In order to decide if/how to prune this rule, RIPPER will:
• use a validation set (that is, a piece of the training set that was kept apart and not used to construct the rule)
• use a metric for pruning: v = (p-n)/(p+n) where
  • p: # of positive examples covered by the rule in the validation set
  • n: # of negative examples covered by the rule in the validation set
• pruning method: delete the final sequence of conditions whose removal maximizes v. That is, calculate v for each of the following pruned versions of the rule and keep the version of the rule with maximum v:
  • transport=car owner & drink_level=abstemious & ambience=friends -> budget=high
  • transport=car owner & drink_level=abstemious -> budget=high
  • transport=car owner -> budget=high
  • -> budget=high (empty antecedent)
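The pruning step can be sketched as follows. The (p, n) validation counts are hypothetical, since the slides do not give a validation set; they are chosen only to show a case where dropping the last condition gives the highest v.

```python
def prune_metric(p, n):
    """v = (p - n) / (p + n); undefined coverage is treated as the worst score."""
    return (p - n) / (p + n) if (p + n) > 0 else float("-inf")

conditions = ["transport=car owner", "drink_level=abstemious", "ambience=friends"]

# Hypothetical validation counts (p, n) for the rule truncated to its
# first k conditions, k = 0..3. These numbers are assumptions.
validation_counts = {0: (2, 40), 1: (2, 8), 2: (2, 1), 3: (1, 1)}

scores = {k: prune_metric(p, n) for k, (p, n) in validation_counts.items()}
best_k = max(scores, key=scores.get)
pruned_rule = conditions[:best_k]

print(pruned_rule)  # ['transport=car owner', 'drink_level=abstemious']
```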
ASSOCIATION RULES: APRIORI ALGORITHM
Apriori: Level 1
• We begin the Apriori algorithm by determining the order of the items:
  • Here I will use the order in which the attributes appear, with the values of each attribute in alphabetical order.
• Then all the possible single-item itemsets are generated and the support is calculated for each one.
  • The following slide shows the complete list of possible items.
• Support is calculated in the following manner:

$$\text{support}(X) = \frac{\text{number of instances that contain } X}{\text{total number of instances}}$$

• Since we know the minimum acceptable support count is 55, we need only look at the numerator of this ratio to determine whether or not to keep an item.
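Counting support for single items can be sketched like this. The four rows and the threshold of 3 are hypothetical stand-ins for the real dataset and its minimum support count of 55.

```python
from collections import Counter

# Hypothetical toy rows (NOT the userprofile dataset).
rows = [
    {"smoker": "false", "activity": "student"},
    {"smoker": "false", "activity": "student"},
    {"smoker": "true", "activity": "student"},
    {"smoker": "false", "activity": "professional"},
]
min_support_count = 3  # plays the role of 55 in the homework

# Support count = number of rows containing the item (the numerator).
counts = Counter(f"{attr}={val}" for row in rows for attr, val in row.items())

# Keep items that meet the threshold; dividing by len(rows) gives the ratio.
frequent = {item: c for item, c in counts.items() if c >= min_support_count}
support = {item: c / len(rows) for item, c in frequent.items()}

print(frequent)  # {'smoker=false': 3, 'activity=student': 3}
```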
Apriori: Level 1

Candidate Itemsets with Support Count
  smoker=false                       109 *
  smoker=true                        26
  drink_level=abstemious             51
  drink_level=casual drinker         47
  drink_level=social drinker         40
  dress_preference=elegant           4
  dress_preference=formal            41
  dress_preference=informal          53
  dress_preference=no preference     35
  ambience=family                    70 *
  ambience=friends                   46
  ambience=solitary                  16
  transport=car owner                34
  transport=on foot                  14
  transport=public                   82 *
  marital_status=single              122 *
  marital_status=married             10
  interest=eco-friendly              16
  interest=none                      30
  interest=technology                36
  interest=variety                   50
  personality=conformist             7
  personality=hard-worker            61 *
  personality=hunter-ostentatious    12
  personality=thrifty-protector      58 *
  religion=catholic                  99 *
  religion=christian                 7
  religion=jewish                    1
  religion=mormon                    1
  religion=none                      30
  activity=professional              15
  activity=student                   113 *
  activity=unemployed                2
  activity=working-class             1
  budget=high                        5
  budget=low                         35
  budget=medium                      91 *

• We keep the ones marked with * as they meet the minimum support threshold.
Apriori: Level 1
• We keep the following itemsets as they have enough support, and use them to generate candidate itemsets for the next level.

Itemsets with Support
smoker=false                     109
ambience=family                  70
transport=public                 82
marital_status=single            122
personality=hard-worker          61
personality=thrifty-protector    58
religion=catholic                99
activity=student                 113
budget=medium                    91
Apriori: Level 2
• We merge pairs from the level 1 set. Since single items share no prefix, we must consider all combinations. (Continued on next slide)

Candidate Itemsets with Support Count
smoker=false, ambience=family                      59
smoker=false, transport=public                     69
smoker=false, marital_status=single                98
smoker=false, personality=hard-worker              49
smoker=false, personality=thrifty-protector        48
smoker=false, religion=catholic                    79
smoker=false, activity=student                     90
smoker=false, budget=medium                        75
ambience=family, transport=public                  46
ambience=family, marital_status=single             63
ambience=family, personality=hard-worker           26
ambience=family, personality=thrifty-protector     33
ambience=family, religion=catholic                 57
ambience=family, activity=student                  61
ambience=family, budget=medium                     54
transport=public, marital_status=single            76
transport=public, personality=hard-worker          28
transport=public, personality=thrifty-protector    44
transport=public, religion=catholic                62
transport=public, activity=student                 71
transport=public, budget=medium                    54
Apriori: Level 2

Candidate Itemsets with Support Count
marital_status=single, personality=hard-worker            52
marital_status=single, personality=thrifty-protector      51
marital_status=single, religion=catholic                  91
marital_status=single, activity=student                   107
marital_status=single, budget=medium                      79
personality=hard-worker, personality=thrifty-protector    0
personality=hard-worker, religion=catholic                40
personality=hard-worker, activity=student                 46
personality=hard-worker, budget=medium                    40
personality=thrifty-protector, religion=catholic          45
personality=thrifty-protector, activity=student           50
personality=thrifty-protector, budget=medium              41
religion=catholic, activity=student                       84
religion=catholic, budget=medium                          67
activity=student, budget=medium                           71
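Generating the level 2 candidates amounts to taking every pair of frequent single items. A minimal sketch using the nine level 1 itemsets kept above (the variable names are made up for the example):

```python
from itertools import combinations

# The nine frequent single items, in the order used in the slides.
level1 = [
    "smoker=false", "ambience=family", "transport=public",
    "marital_status=single", "personality=hard-worker",
    "personality=thrifty-protector", "religion=catholic",
    "activity=student", "budget=medium",
]

# With no shared prefix to match, every pair is a candidate.
level2_candidates = list(combinations(level1, 2))

print(len(level2_candidates))  # 36
print(level2_candidates[0])    # ('smoker=false', 'ambience=family')
```

The 36 pairs correspond to the 21 + 15 candidates listed on the two slides above.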
Apriori: Level 2
• We keep the following itemsets as they have enough support, and use them to generate candidate itemsets for the next level.

Itemsets with Support Count
smoker=false, ambience=family              59
smoker=false, transport=public             69
smoker=false, marital_status=single        98
smoker=false, religion=catholic            79
smoker=false, activity=student             90
smoker=false, budget=medium                75
ambience=family, marital_status=single     63
ambience=family, religion=catholic         57
ambience=family, activity=student          61
transport=public, marital_status=single    76
transport=public, religion=catholic        62
transport=public, activity=student         71
marital_status=single, religion=catholic   91
marital_status=single, activity=student    107
marital_status=single, budget=medium       79
religion=catholic, activity=student        84
religion=catholic, budget=medium           67
activity=student, budget=medium            71
Apriori: Level 3
• We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine whether they are viable candidates.
Apriori: Level 3
• First we determine the candidates by “joining” itemsets with like prefixes (i.e., the first k-1 items in the itemsets are the same).
• Here we need only match the first item in the itemset.

Itemsets from Level 2
smoker=false, ambience=family
smoker=false, transport=public
smoker=false, marital_status=single
smoker=false, religion=catholic
smoker=false, activity=student
smoker=false, budget=medium
ambience=family, marital_status=single
ambience=family, religion=catholic
ambience=family, activity=student
transport=public, marital_status=single
transport=public, religion=catholic
transport=public, activity=student
marital_status=single, religion=catholic
marital_status=single, activity=student
marital_status=single, budget=medium
religion=catholic, activity=student
religion=catholic, budget=medium
activity=student, budget=medium
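The join step can be sketched as below. For brevity it runs on a small subset of the level 2 itemsets (five of the eighteen); the `ORDER` list and the function name `apriori_join` are illustrative choices, not part of the original solution.

```python
from itertools import combinations

# Fixed item order (attribute order from the slides) for the items used here.
ORDER = ["transport=public", "marital_status=single", "religion=catholic",
         "activity=student", "budget=medium"]
RANK = {item: i for i, item in enumerate(ORDER)}

def apriori_join(itemsets):
    """Join k-itemsets that share their first k-1 items into (k+1)-itemsets."""
    joined = []
    for a, b in combinations(itemsets, 2):
        if a[:-1] == b[:-1]:
            tail = sorted((a[-1], b[-1]), key=RANK.get)
            joined.append(a[:-1] + tuple(tail))
    return joined

# A subset of the frequent level 2 itemsets listed above.
level2_subset = [
    ("transport=public", "marital_status=single"),
    ("transport=public", "religion=catholic"),
    ("transport=public", "activity=student"),
    ("religion=catholic", "activity=student"),
    ("religion=catholic", "budget=medium"),
]

for candidate in apriori_join(level2_subset):
    print(candidate)
```

On this subset the join yields four 3-item candidates: three with the prefix transport=public and one with the prefix religion=catholic.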
Apriori: Level 3
• That results in this set of potential candidate itemsets.

Potential Candidate Itemsets
smoker=false, ambience=family, transport=public
smoker=false, ambience=family, marital_status=single
smoker=false, ambience=family, religion=catholic
smoker=false, ambience=family, activity=student
smoker=false, ambience=family, budget=medium
smoker=false, transport=public, marital_status=single
smoker=false, transport=public, religion=catholic
smoker=false, transport=public, activity=student
smoker=false, transport=public, budget=medium
smoker=false, marital_status=single, religion=catholic
smoker=false, marital_status=single, activity=student
smoker=false, marital_status=single, budget=medium
smoker=false, activity=student, budget=medium
ambience=family, marital_status=single, religion=catholic
ambience=family, marital_status=single, activity=student
ambience=family, religion=catholic, activity=student
transport=public, marital_status=single, religion=catholic
transport=public, marital_status=single, activity=student
transport=public, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, budget=medium
marital_status=single, activity=student, budget=medium
religion=catholic, activity=student, budget=medium
Apriori: Level 3
• We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 2 in each of these itemsets also exist in the level 2 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets.
• The following itemsets can be removed, as the subset shown in parentheses does not appear in the Level 2 itemsets. This leaves us the candidate itemsets on the next slide.

Candidate Itemsets That Can Be Removed
smoker=false, ambience=family, transport=public    (ambience=family, transport=public)
smoker=false, ambience=family, budget=medium       (ambience=family, budget=medium)
smoker=false, transport=public, budget=medium      (transport=public, budget=medium)
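The subset check can be sketched as follows, using two of the level 3 candidates and only the level 2 itemsets needed to test them. The helper name `has_infrequent_subset` is an illustrative choice.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if any (k-1)-subset of the candidate is missing from the previous level."""
    k = len(candidate) - 1
    return any(sub not in frequent_prev for sub in combinations(candidate, k))

# Relevant frequent level 2 itemsets (ambience=family, transport=public is
# absent because its support count of 46 is below the threshold of 55).
frequent2 = {
    ("smoker=false", "ambience=family"),
    ("smoker=false", "transport=public"),
    ("smoker=false", "marital_status=single"),
    ("ambience=family", "marital_status=single"),
}

candidates = [
    ("smoker=false", "ambience=family", "transport=public"),
    ("smoker=false", "ambience=family", "marital_status=single"),
]
kept = [c for c in candidates if not has_infrequent_subset(c, frequent2)]

print(kept)
```

Only the second candidate survives; the first is dropped without ever counting its support.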
Apriori: Level 3

Candidate Itemsets with Support Count
smoker=false, ambience=family, marital_status=single         53
smoker=false, ambience=family, religion=catholic             46
smoker=false, ambience=family, activity=student              52
smoker=false, transport=public, marital_status=single        63
smoker=false, transport=public, religion=catholic            52
smoker=false, transport=public, activity=student             58
smoker=false, marital_status=single, religion=catholic       72
smoker=false, marital_status=single, activity=student        85
smoker=false, marital_status=single, budget=medium           65
smoker=false, activity=student, budget=medium                58
ambience=family, marital_status=single, religion=catholic    50
ambience=family, marital_status=single, activity=student     57
ambience=family, religion=catholic, activity=student         51
transport=public, marital_status=single, religion=catholic   57
transport=public, marital_status=single, activity=student    67
transport=public, religion=catholic, activity=student        59
marital_status=single, religion=catholic, activity=student   80
marital_status=single, religion=catholic, budget=medium      80
marital_status=single, activity=student, budget=medium       59
religion=catholic, activity=student, budget=medium           53
Apriori: Level 3
• We keep the following itemsets as they have enough support, and use them to generate candidate itemsets for the next level.

Level 3 Itemsets with Support
smoker=false, transport=public, marital_status=single        63
smoker=false, transport=public, activity=student             58
smoker=false, marital_status=single, religion=catholic       72
smoker=false, marital_status=single, activity=student        85
smoker=false, marital_status=single, budget=medium           65
smoker=false, activity=student, budget=medium                58
ambience=family, marital_status=single, activity=student     57
transport=public, marital_status=single, religion=catholic   57
transport=public, marital_status=single, activity=student    67
transport=public, religion=catholic, activity=student        59
marital_status=single, religion=catholic, activity=student   80
marital_status=single, religion=catholic, budget=medium      80
marital_status=single, activity=student, budget=medium       59
Apriori: Level 4
• We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine whether they are viable candidates.
• First we determine the candidates by “joining” itemsets with like prefixes (i.e., the first k-1 items in the itemsets match).
• Here we need only match the first two items in the itemset.

Level 3 Itemsets
smoker=false, transport=public, marital_status=single
smoker=false, transport=public, activity=student
smoker=false, marital_status=single, religion=catholic
smoker=false, marital_status=single, activity=student
smoker=false, marital_status=single, budget=medium
smoker=false, activity=student, budget=medium
ambience=family, marital_status=single, activity=student
transport=public, marital_status=single, religion=catholic
transport=public, marital_status=single, activity=student
transport=public, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, budget=medium
marital_status=single, activity=student, budget=medium
Apriori: Level 4
• That results in this set of potential candidate itemsets.

Potential Candidate Itemsets
smoker=false, transport=public, marital_status=single, activity=student
smoker=false, marital_status=single, religion=catholic, activity=student
smoker=false, marital_status=single, religion=catholic, budget=medium
smoker=false, marital_status=single, activity=student, budget=medium
transport=public, marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student, budget=medium

• We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 3 in each of these itemsets also exist in the level 3 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets.
• Here we again eliminate from consideration the candidates that contain a size-3 subset missing from the level 3 set.
Apriori: Level 4

Candidate Itemsets with Support Count
smoker=false, marital_status=single, religion=catholic, activity=student    63
smoker=false, marital_status=single, activity=student, budget=medium        53

• In the end we keep only a single itemset that has enough support for this level.
• The following slide depicts the complete set of frequent itemsets.

Level 4 Itemsets with Support Count
smoker=false, marital_status=single, religion=catholic, activity=student    63
Apriori: Complete Itemset

Itemsets with Support Count
smoker=false                                                               109
ambience=family                                                            70
transport=public                                                           82
marital_status=single                                                      122
personality=hard-worker                                                    61
personality=thrifty-protector                                              58
religion=catholic                                                          99
activity=student                                                           113
budget=medium                                                              91
smoker=false, ambience=family                                              59
smoker=false, transport=public                                             69
smoker=false, marital_status=single                                        98
smoker=false, religion=catholic                                            79
smoker=false, activity=student                                             90
smoker=false, budget=medium                                                75
ambience=family, marital_status=single                                     63
ambience=family, religion=catholic                                         57
ambience=family, activity=student                                          61
transport=public, marital_status=single                                    76
transport=public, religion=catholic                                        62
transport=public, activity=student                                         71
marital_status=single, religion=catholic                                   91
marital_status=single, activity=student                                    107
marital_status=single, budget=medium                                       79
religion=catholic, activity=student                                        84
religion=catholic, budget=medium                                           67
activity=student, budget=medium                                            71
smoker=false, transport=public, marital_status=single                      63
smoker=false, transport=public, activity=student                           58
smoker=false, marital_status=single, religion=catholic                     72
smoker=false, marital_status=single, activity=student                      85
smoker=false, marital_status=single, budget=medium                         65
smoker=false, activity=student, budget=medium                              58
ambience=family, marital_status=single, activity=student                   57
transport=public, marital_status=single, religion=catholic                 57
transport=public, marital_status=single, activity=student                  67
transport=public, religion=catholic, activity=student                      59
marital_status=single, religion=catholic, activity=student                 80
marital_status=single, religion=catholic, budget=medium                    80
marital_status=single, activity=student, budget=medium                     59
smoker=false, marital_status=single, religion=catholic, activity=student   63
Rule Construction
Largest itemset: let’s call this itemset I4:
I4: smoker=false, marital_status=single, religion=catholic, activity=student

Rules constructed from I4 with 2 items in the antecedent:
R1: smoker=false, marital_status=single -> religion=catholic, activity=student
    conf(R1) = supp(I4)/supp(smoker=false, marital_status=single) = 63/98 = 64.29%
R2: smoker=false, religion=catholic -> marital_status=single, activity=student
    conf(R2) = supp(I4)/supp(smoker=false, religion=catholic) = 63/79 = 79.75%
R3: smoker=false, activity=student -> marital_status=single, religion=catholic
    conf(R3) = supp(I4)/supp(smoker=false, activity=student) = 63/90 = 70%
R4: marital_status=single, religion=catholic -> smoker=false, activity=student
    conf(R4) = supp(I4)/supp(marital_status=single, religion=catholic) = 63/91 = 69.23%
R5: marital_status=single, activity=student -> smoker=false, religion=catholic
    conf(R5) = supp(I4)/supp(marital_status=single, activity=student) = 63/107 = 58.88%
R6: religion=catholic, activity=student -> smoker=false, marital_status=single
    conf(R6) = supp(I4)/supp(religion=catholic, activity=student) = 63/84 = 75%
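The six confidence values can be reproduced with a short script, using the support counts from the complete itemset table. The dictionary layout and variable names are illustrative choices for this sketch.

```python
from itertools import combinations

I4 = ("smoker=false", "marital_status=single",
      "religion=catholic", "activity=student")
supp_I4 = 63

# Support counts of the 2-item subsets of I4, from the tables above.
supp2 = {
    ("smoker=false", "marital_status=single"): 98,
    ("smoker=false", "religion=catholic"): 79,
    ("smoker=false", "activity=student"): 90,
    ("marital_status=single", "religion=catholic"): 91,
    ("marital_status=single", "activity=student"): 107,
    ("religion=catholic", "activity=student"): 84,
}

confidence = {}
for antecedent in combinations(I4, 2):
    consequent = tuple(item for item in I4 if item not in antecedent)
    # conf(A -> B) = supp(A and B) / supp(A)
    confidence[(antecedent, consequent)] = supp_I4 / supp2[antecedent]

for (antecedent, consequent), conf in confidence.items():
    print(f"{', '.join(antecedent)} -> {', '.join(consequent)}: {conf:.2%}")
```

Rules would then be kept or discarded by comparing these values against a minimum confidence threshold, which this section does not state.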