Mining Association Rules in Large Databases
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} => {Beer}
{Milk, Bread} => {Eggs, Coke}
{Beer, Bread} => {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
Itemset
A collection of one or more items. Example: {Milk, Bread, Diaper}

k-itemset
An itemset that contains k items

Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2

Support
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

I assume that itemsets are ordered lexicographically.
Definition: Association Rule
Let D be a database of transactions, e.g.:

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Let I be the set of items that appear in the database, e.g., I = {A, B, C, D, E, F}

A rule is defined by X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅
e.g.: {B, C} => {E} is a rule
Definition: Association Rule
Association Rule
An implication expression of the form X => Y, where X and Y are itemsets
Example: {Milk, Diaper} => {Beer}

Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} => {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
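As an illustration (my own sketch, not from the slides), both metrics can be computed over the table above in a few lines of Python; the helper names are hypothetical:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, db):
    # Fraction of transactions that contain every item of `itemset`
    s = set(itemset)
    return sum(s <= t for t in db) / len(db)

def confidence(X, Y, db):
    # conf(X => Y) = sup(X union Y) / sup(X)
    return support(set(X) | set(Y), db) / support(X, db)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...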
Rule Measures: Support and Confidence

Find all the rules X => Y with minimum support and confidence:
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Let minimum support = 50% and minimum confidence = 50%; we have
A => C (50%, 66.6%)
C => A (50%, 100%)

[Venn diagram: customers who buy beer, customers who buy diapers, and the overlap of customers who buy both]
Example

TID  date      items_bought
100  10/10/99  {F, A, D, B}
200  15/10/99  {D, A, C, E, B}
300  19/10/99  {C, A, B, E}
400  20/10/99  {B, A, D}

What is the support and confidence of the rule {B, D} => {A}?

Support: percentage of tuples that contain {A, B, D} = 3/4 = 75%
Confidence: (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) = 3/3 = 100%

Remember: conf(X => Y) = sup(X ∪ Y) / sup(X)
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold

Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Mining Association Rules
Example of Rules:
{Milk, Diaper} => {Beer} (s=0.4, c=0.67)
{Milk, Beer} => {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} => {Milk} (s=0.4, c=0.67)
{Beer} => {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} => {Milk, Beer} (s=0.4, c=0.5)
{Milk} => {Diaper, Beer} (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Observations:
All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but can have different confidence
Thus, we may decouple the support and confidence requirements
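To make the observation concrete, here is a small Python sketch of my own (not from the slides) that enumerates all binary partitions of {Milk, Diaper, Beer} over the table above and scores each rule:

from itertools import combinations

transactions = [  # the same five baskets as above
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sup(S):
    # Fraction of transactions containing all items of S
    return sum(S <= t for t in transactions) / len(transactions)

itemset = frozenset({"Milk", "Diaper", "Beer"})
for r in range(1, len(itemset)):
    for X in map(frozenset, combinations(sorted(itemset), r)):
        Y = itemset - X
        print(sorted(X), "=>", sorted(Y),
              f"(s={sup(itemset):.1f}, c={sup(itemset) / sup(X):.2f})")

This prints the six rules listed above: every rule has s = 0.4, while the confidences differ.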
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup
2. Rule Generation
Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
[Itemset lattice for d = 5 items:
null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE]

Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation

Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
Match each transaction against every candidate
Complexity ~ O(NMw) => expensive, since M = 2^d!

[Diagram: N transactions (the TID table below) matched against a list of M candidates; w is the maximum transaction width]

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Computational Complexity

Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:

R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1

If d = 6, R = 3^6 - 2^7 + 1 = 602 rules
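A quick sanity check of the closed form (my own sketch), enumerating the double sum directly:

from math import comb

d = 6
# Choose k antecedent items, then j of the remaining d-k items as consequent
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3**d - 2**(d + 1) + 1)  # 602 602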
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
Complete search: M = 2^d
Use pruning techniques to reduce M

Reduce the number of transactions (N)
Reduce the size of N as the size of the itemset increases
Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction
Reducing Number of Candidates
Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:
The support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support:

∀X, Y: (X ⊆ Y) => s(X) ≥ s(Y)
Example
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
s(Bread) > s(Bread, Beer)
s(Milk) > s(Bread, Milk)
s(Diaper, Beer) > s(Diaper, Beer, Coke)
Illustrating Apriori Principle

Minimum support count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
The Apriori Algorithm (the general idea)
1. Find the frequent 1-itemsets and put them into L1 (k = 1)
2. Use Lk to generate a collection of candidate itemsets Ck+1 of size (k+1)
3. Scan the database to find which itemsets in Ck+1 are frequent and put them into Lk+1
4. If Lk+1 is not empty
   k = k + 1
   GOTO 2
R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules",
Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
The Apriori Algorithm
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;  // join and prune steps
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support (frequent)
end
return ∪k Lk;

Important steps in candidate generation:
Join step: Ck+1 is generated by joining Lk with itself
Prune step: Any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset
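The pseudo-code translates fairly directly into Python. The following is a compact sketch of my own (absolute support counts, transactions as sets), not the authors' implementation:

from itertools import combinations

def apriori(db, min_support):
    # L1: frequent 1-itemsets
    items = {i for t in db for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= min_support}
    frequent = set(Lk)
    while Lk:
        # Join step: merge two k-itemsets that share k-1 items
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == len(a) + 1}
        # Prune step: drop candidates having an infrequent k-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, len(c) - 1))}
        # Count supports with one scan over the database
        Lk = {c for c in Ck if sum(c <= t for t in db) >= min_support}
        frequent |= Lk
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]  # the example DB below
print(sorted(map(sorted, apriori(db, 2))))
# [[1], [1, 3], [2], [2, 3], [2, 3, 5], [2, 5], [3], [3, 5], [5]]

With min_support = 2 on the four-transaction database of the next slide, it returns exactly the L1, L2, and L3 = {{2, 3, 5}} derived below.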
The Apriori Algorithm: Example (min_sup = 2 = 50%)

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D -> C1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (generated from L1):
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D -> C2 with counts:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3:
itemset
{2 3 5}

Scan D -> L3:
itemset  sup
{2 3 5}  2
How to Generate Candidates?

Suppose the items in Lk are listed in an order.

Step 1: self-joining Lk (in SQL)

insert into Ck+1
select p.item1, p.item2, ..., p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1, p.itemk < q.itemk
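In Python, the same self-join can be sketched as follows (my own rendering; itemsets are assumed to be stored as sorted tuples):

def self_join(Lk):
    # Merge two k-itemsets that agree on their first k-1 items,
    # keeping the result sorted (mirrors the SQL where-clause)
    return {p[:-1] + (p[-1], q[-1])
            for p in Lk for q in Lk
            if p[:-1] == q[:-1] and p[-1] < q[-1]}

L2 = {(1, 3), (2, 3), (2, 5), (3, 5)}
print(sorted(self_join(L2)))  # [(2, 3, 5)]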
Rule Generation for Apriori

Lattice of rules for the frequent itemset ABCD:

[Figure: lattice of rules, from ABCD => { } at the top, through
BCD => A, ACD => B, ABD => C, ABC => D;
CD => AB, BD => AC, BC => AD, AD => BC, AC => BD, AB => CD;
down to D => ABC, C => ABD, B => ACD, A => BCD.
A low-confidence rule and all the rules below it in the lattice are pruned.]
Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:
join(CD => AB, BD => AC) produces the candidate rule D => ABC

Prune rule D => ABC if its subset AD => BC does not have high confidence.
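A small sketch of this consequent-merging step (my own illustration; a rule over a frequent itemset is represented by its consequent as a sorted tuple):

def merge_consequents(itemset, cons1, cons2):
    # Join two size-m consequents sharing an (m-1)-prefix into a
    # size-(m+1) consequent; the antecedent is the rest of the itemset
    if cons1[:-1] == cons2[:-1] and cons1[-1] < cons2[-1]:
        new_cons = cons1 + (cons2[-1],)
        new_ante = tuple(i for i in itemset if i not in new_cons)
        return new_ante, new_cons
    return None

# CD => AB joined with BD => AC over itemset ABCD:
print(merge_consequents(("A", "B", "C", "D"), ("A", "B"), ("A", "C")))
# (('D',), ('A', 'B', 'C'))  i.e. the candidate rule D => ABC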
Is Apriori Fast Enough?
Performance Bottlenecks
The core of the Apriori algorithm:
Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
Use database scans and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation
Huge candidate sets:
10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates.
Multiple scans of the database:
Needs (n + 1) scans, where n is the length of the longest pattern
FP-growth: Mining Frequent Patterns Without Candidate Generation

Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
highly condensed, but complete for frequent pattern mining
avoids costly database scans

Develop an efficient, FP-tree-based frequent pattern mining method:
a divide-and-conquer methodology: decompose mining tasks into smaller ones
avoid candidate generation: sub-database tests only!
FP-tree Construction from a Transactional DB

min_support = 3

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o}             {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3

Steps:
1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)
2. Order frequent items in descending order of their frequency
3. Scan the DB again, construct the FP-tree
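These three steps can be sketched in Python as follows (my own illustration, not the authors' code; ties in frequency are broken alphabetically here, whereas the slides order f before c):

from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}            # item -> Node

def build_fp_tree(db, min_support):
    # Pass 1: count items, keep the frequent ones
    freq = defaultdict(int)
    for t in db:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_support}
    # Pass 2: insert each transaction, items in descending frequency
    root = Node(None, None)
    header = defaultdict(list)        # item -> node-links
    for t in db:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header, freq

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
      set("bcksp"), set("afcelpmn")]
root, header, freq = build_fp_tree(db, 3)
print(sorted((i, sum(n.count for n in header[i])) for i in freq))
# [('a', 3), ('b', 3), ('c', 4), ('f', 4), ('m', 3), ('p', 3)]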
FP-tree Construction

TID  freq. items bought
100  {f, c, a, m, p}
200  {f, c, a, b, m}
300  {f, b}
400  {c, b, p}
500  {f, c, a, m, p}

min_support = 3

[Tree after inserting TID 100:
root - f:1 - c:1 - a:1 - m:1 - p:1]
FP-tree Construction

[Tree after inserting TID 200:
root - f:2 - c:2 - a:2, with two branches below a:2:
m:1 - p:1 and b:1 - m:1]
FP-tree Construction

[Tree after inserting TID 300 and 400:
root - f:3 - c:2 - a:2 (with branches m:1 - p:1 and b:1 - m:1),
a second branch f:3 - b:1,
and a new path root - c:1 - b:1 - p:1]
FP-tree Construction

[Final tree after inserting TID 500:
root - f:4 - c:3 - a:3 (with branches m:2 - p:2 and b:1 - m:1),
a branch f:4 - b:1,
and the path root - c:1 - b:1 - p:1]

Header Table (item, frequency, head of node-links):
f  4
c  4
a  3
b  3
m  3
p  3
Benefits of the FP-tree Structure
Completeness:
never breaks a long pattern of any transaction
preserves complete information for frequent pattern mining

Compactness:
reduces irrelevant information: infrequent items are gone
frequency-descending ordering: more frequent items are more likely to be shared
never larger than the original database (not counting node-links and counts)
Example: for the Connect-4 DB, the compression ratio can be over 100
Mining Frequent Patterns Using the FP-tree

General idea (divide-and-conquer):
Recursively grow frequent pattern paths using the FP-tree

Method:
For each item, construct its conditional pattern base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path
(a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
Mining Frequent Patterns Using the FP-tree (cont'd)

Start with the last item in the order (i.e., p). Follow its node-links and traverse only the paths containing p. Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

Conditional pattern base for p: fcam:2, cb:1

[Paths containing p in the FP-tree: f:4 - c:3 - a:3 - m:2 - p:2 and c:1 - b:1 - p:1]

Construct a new FP-tree from this pattern base by merging all paths and keeping only the nodes that appear at least min_support times. This leads to only one branch, c:3. Thus we derive only one further frequent pattern containing p: the pattern cp.
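The per-item bookkeeping is small. A self-contained sketch (my own) of deriving p's conditional FP-tree counts from its pattern base:

from collections import Counter

# p's conditional pattern base: (prefix path, count of the p node below it)
pattern_base = [(("f", "c", "a", "m"), 2), (("c", "b"), 1)]

min_support = 3
counts = Counter()
for path, cnt in pattern_base:
    for item in path:
        counts[item] += cnt

# Items that survive in p's conditional FP-tree
print({i: c for i, c in counts.items() if c >= min_support})  # {'c': 3}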
Mining Frequent Patterns Using the FP-tree (cont'd)

Move to the next least frequent item in the order, i.e., m. Follow its node-links and traverse only the paths containing m. Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

m-conditional pattern base: fca:2, fcab:1

[Paths containing m in the FP-tree: f:4 - c:3 - a:3 - m:2 and the m:1 node below b:1]

m-conditional FP-tree: contains only the path f:3 - c:3 - a:3

All frequent patterns that include m:
m,
fm, cm, am,
fcm, fam, cam,
fcam
Properties of the FP-tree for Conditional Pattern Base Construction

Node-link property:
For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header.

Prefix path property:
To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.
Conditional Pattern Bases for the Example

Item  Conditional pattern base     Conditional FP-tree
f     Empty                        Empty
c     {(f:3)}                      {(f:3)}|c
a     {(fc:3)}                     {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}      Empty
m     {(fca:2), (fcab:1)}          {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}           {(c:3)}|p
Why Is Frequent Pattern Growth Fast?
Performance studies show
FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
Reasoning
No candidate generation, no candidate test
Uses compact data structure
Eliminates repeated database scan
Basic operation is counting and FP-tree building
FP-growth vs. Apriori: Scalability with the Support Threshold

[Figure: run time (sec., 0 to 100) versus support threshold (%, 0 to 3) on data set T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime]