Lecture 5: Mining Association Rules
Introduction to Data Mining
Yunming Ye
Department of Computer Science
Shenzhen Graduate School
Harbin Institute of Technology
04/19/23 2
Agenda
1. Introduction to Association Rule Mining
2. Apriori Algorithm
3. FP-Tree Algorithm
4. Sequential Association Rule Mining
5. Advanced Association Rule Mining
04/19/23 3
Introduction to Association Rule Mining
04/19/23 4
What Is Association Rule Mining?
Association rule mining: finding associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, or other information repositories.
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Applications: basket analysis, cross-selling, catalog design, loss-leader analysis, clustering, classification, etc.
04/19/23 5
An Example
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
(Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both.)
04/19/23 6
Definition: Association Rule
Association Rule: an implication expression of the form X → Y, where X and Y are itemsets.
Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics:
  Support (s): fraction of transactions that contain both X and Y
  Confidence (c): measures how often items in Y appear in transactions that contain X

For the transactions below:
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
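To make the two metrics concrete, here is a small illustrative sketch (not part of the original slides) that computes support and confidence for {Milk, Diaper} → {Beer} over the five transactions above; the function and variable names are ours.

  # Illustrative sketch: support and confidence over the 5 example transactions
  transactions = [
      {"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"},
  ]

  def support(itemset):
      # fraction of transactions containing every item of the itemset
      return sum(itemset <= t for t in transactions) / len(transactions)

  def confidence(lhs, rhs):
      # among transactions containing the LHS, the fraction also containing the RHS
      return support(lhs | rhs) / support(lhs)

  print(support({"Milk", "Diaper", "Beer"}))       # 0.4
  print(confidence({"Milk", "Diaper"}, {"Beer"}))  # 0.666...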
04/19/23 7
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold
Brute-force approach:
  List all possible association rules
  Compute the support and confidence for each rule
  Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
04/19/23 8
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
04/19/23 9
Categorization of Association Rules
Based on the types of values handled in the rule: Boolean association rules vs. quantitative association rules
Based on the dimensions of data involved: single-dimensional vs. multi-dimensional
Based on the levels of abstraction involved (e.g., single-level vs. multiple-level rules)
Based on various extensions to association mining: frequent closed itemsets, max-patterns, etc.
04/19/23 10
Roadmap for Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation — generate all itemsets whose support ≥ minsup
2. Rule Generation — generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (see the sketch below)
Frequent itemset generation is the most computationally expensive step.
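As a sketch of the second step (our own illustrative code, not from the slides), the hypothetical helper below enumerates the binary partitions of one frequent itemset and keeps the high-confidence rules; it assumes a support function such as the one sketched earlier.

  from itertools import combinations

  def rules_from_itemset(itemset, support, minconf):
      # Enumerate every non-empty proper subset as a rule LHS; the rest is the RHS.
      items = sorted(itemset)
      rules = []
      for r in range(1, len(items)):
          for lhs in combinations(items, r):
              lhs, rhs = frozenset(lhs), frozenset(items) - frozenset(lhs)
              conf = support(lhs | rhs) / support(lhs)
              if conf >= minconf:
                  rules.append((lhs, rhs, conf))
      return rules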
04/19/23 11
Apriori Algorithm
04/19/23 12
Frequent Itemset Generation
(Itemset lattice for d = 5 items A-E: the empty set at the top; the 1-itemsets A ... E; the 2-itemsets AB ... DE; the 3-itemsets ABC ... CDE; the 4-itemsets ABCD ... BCDE; and ABCDE at the bottom.)
Given d items, there are 2^d possible candidate itemsets!
04/19/23 13
Frequent Itemset Generation
Brute-force approach:
  Each itemset in the lattice is a candidate frequent itemset
  Count the support of each candidate by scanning the database
  Match each transaction against every candidate
  Complexity ~ O(NMw), where N = number of transactions, M = number of candidates, w = maximum transaction width
  => Expensive, since M = 2^d !!!
04/19/23 14
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
  Complete search: M = 2^d
  Use pruning techniques to reduce M
Reduce the number of transactions (N)
  Reduce the size of N as the size of the itemset increases
  Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
  Use efficient data structures to store the candidates or transactions
  No need to match every candidate against every transaction
04/19/23 15
Scalable Methods for Mining Frequent Patterns
The downward closure property of frequent patterns:
  Any subset of a frequent itemset must be frequent
  If {beer, diaper, nuts} is frequent, so is {beer, diaper}; i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods:
  Apriori (Agrawal & Srikant @ VLDB'94)
  Frequent pattern growth (FP-growth — Han, Pei & Yin @ SIGMOD'00)
04/19/23 16
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested (anti-monotonicity). (Agrawal & Srikant @ VLDB'94)
Method (a sketch follows below):
  Initially, scan the DB once to get the frequent 1-itemsets
  Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  Test the candidates against the DB
  Terminate when no frequent or candidate set can be generated
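A compact illustrative sketch of the whole Apriori loop (our own code, following the method above; transactions are assumed to be sets of items and minsup is an absolute count):

  from itertools import combinations

  def apriori(transactions, minsup):
      items = {i for t in transactions for i in t}
      Lk = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
      frequent = set(Lk)
      k = 2
      while Lk:
          # candidate generation: join frequent (k-1)-itemsets, then prune
          Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
          Ck = {c for c in Ck
                if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
          # one scan of the database to count the candidates
          Lk = {c for c in Ck
                if sum(c <= t for t in transactions) >= minsup}
          frequent |= Lk
          k += 1
      return frequent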
The Apriori Algorithm—An Example
Database TDB (min_sup = 2):
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B, C, E}
3rd scan → L3: {B, C, E}:2
04/19/23 18
How to Generate Candidates?
Suppose the items in L(k-1) are listed in an order.
Step 1: self-joining L(k-1)
  insert into Ck
  select p.item_1, p.item_2, ..., p.item_(k-1), q.item_(k-1)
  from L(k-1) p, L(k-1) q
  where p.item_1 = q.item_1, ..., p.item_(k-2) = q.item_(k-2), p.item_(k-1) < q.item_(k-1)
Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in L(k-1)) then delete c from Ck
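The same join-and-prune step, expressed as an illustrative Python sketch (ours), with L(k-1) represented as a set of sorted tuples:

  from itertools import combinations

  def gen_candidates(L_prev, k):
      # L_prev: frequent (k-1)-itemsets as sorted tuples
      Ck = set()
      for p in L_prev:
          for q in L_prev:
              # self-join: first k-2 items equal, last item of p < last item of q
              if p[:k - 2] == q[:k - 2] and p[-1] < q[-1]:
                  Ck.add(p + (q[-1],))
      # prune: every (k-1)-subset of a candidate must be frequent
      return {c for c in Ck
              if all(s in L_prev for s in combinations(c, k - 1))}

On the example on the next slide, gen_candidates({('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}, 4) returns {('a','b','c','d')}.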
04/19/23 19
Example of Generating Candidates
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
04/19/23 20
Is Apriori Fast Enough?
The core of the Apriori algorithm:
  Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  Use database scans and pattern matching to collect counts for the candidate itemsets
The bottleneck of Apriori: candidate generation
  Huge candidate sets: 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
  Multiple scans of the database: needs (n + 1) scans, where n is the length of the longest pattern
04/19/23 21
Methods to Improve Apriori’s Efficiency
Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
04/19/23 22
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Scan 1: partition the database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In VLDB'95.
04/19/23 23
DHP: Reduce the Number of Candidates
J. Park, M. Chen, and P. Yu. An Effective Hash-Based Algorithm for Mining Association Rules. In SIGMOD'95.
Goal: improve the efficiency of Apriori-based mining. The algorithm is based on Apriori and reduces the number of candidates.
DHP differs from Apriori in how the candidate k-itemsets are generated:
  Step 1: generate all k-itemsets for each transaction, hash them into the buckets of a hash table, and increase the corresponding bucket counts.
  Step 2: a k-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and is removed from the candidate set.
04/19/23 24
DHP: Reduce the Number of Candidates
Example, Step 1 (database and frequent 1-itemsets):
  TID   Items
  100   A, C, D
  200   B, C, E
  300   A, B, C, E
  400   B, E
C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}, {B}, {C}, {E}
04/19/23 25
DHP: Reduce the Number of Candidates
Making a hash table: h({x, y}) = ({order of x} * 10 + {order of y}) mod 7
Step 2: generate L2 by hashing the 2-itemsets of every transaction into buckets (a small sketch follows below):
  100: {A C}, {A D}, {C D}
  200: {B C}, {B E}, {C E}
  300: {A B}, {A C}, {A E}, {B C}, {B E}, {C E}
  400: {B E}
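An illustrative sketch (ours, not from the slides) of DHP's hash-bucket filtering for 2-itemsets, using the hash function above; the item ordering A=1, ..., E=5 is our assumption.

  from itertools import combinations

  transactions = {100: {"A", "C", "D"}, 200: {"B", "C", "E"},
                  300: {"A", "B", "C", "E"}, 400: {"B", "E"}}
  order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}   # assumed item ordering
  minsup = 2

  buckets = [0] * 7
  for items in transactions.values():
      for x, y in combinations(sorted(items, key=order.get), 2):
          buckets[(order[x] * 10 + order[y]) % 7] += 1

  # A candidate 2-itemset can be pruned if its bucket count is below minsup.
  def may_be_frequent(x, y):
      return buckets[(order[x] * 10 + order[y]) % 7] >= minsup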
04/19/23 26
Sampling for Frequent Patterns
Select a sample of the original database and mine frequent patterns within the sample using Apriori
Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
  Example: check abcd instead of ab, ac, ..., etc.
Scan the database again to find missed frequent patterns
H. Toivonen. Sampling Large Databases for Association Rules. In VLDB'96.
04/19/23 27
DIC: Dynamic itemset counting
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In SIGMOD'97.
DIC: Database is partitioned into blocks marked by start point. New candidate itemsets can be added at any start point.
Apriori: New candidate itemsets only generated before each complete database scan.
DIC requires fewer database scans than Apriori.
04/19/23 28
DIC: Reduce Number of Scans
Example
Min support = 2;
TID Items
100 ABC
200 BCD
300 BCD
400 ABC
500 ABC
600 ABC
700 BCD
800 BCD
Block   TID   Items
B1      100   A, B, C
        200   B, C, D
        300   B, C, D
        400   A, B, C
B2      500   A, B, C
        600   A, B, C
        700   B, C, D
        800   B, C, D
Start points are placed at the block boundaries.
04/19/23 29
DIC: Reduce Number of Scans
Example (min support = 2; the database is partitioned into blocks B1 = transactions 100-400 and B2 = transactions 500-800, as above):

Pass 1 over B1 — counting starts with the 1-itemsets A, B, C, D:
  After 100 (A B C): A=1, B=1, C=1, D=0
  After 200 (B C D): A=1, B=2, C=2, D=1
  After 300 (B C D): A=1, B=3, C=3, D=2
  After 400 (A B C): A=2, B=4, C=4, D=2

Start point after B1: all 1-itemsets look frequent, so the new 2-itemset candidates AB, AC, AD, BC, BD, CD are added and start being counted.

Pass 1 over B2:
  After 500 (A B C): A=3, B=5, C=5; AB=1, AC=1, BC=1
  After 600 (A B C): A=4, B=6, C=6; AB=2, AC=2, BC=2
  After 700 (B C D): B=7, C=7, D=3; BC=3, BD=1, CD=1
  After 800 (B C D): A=4, B=8, C=8, D=4; AB=2, AC=2, BC=4, BD=2, CD=2

Start point after B2 (end of the database): the 1-itemsets have now been counted through all transactions, so they are made "solid" and stop being counted. The new 3-itemset candidates ABC and BCD are added.

Pass 2 over B1 (re-reading transactions 100-400):
  After 100 (A B C): AB=3, AC=3, BC=5; ABC=1
  After 200 (B C D): BC=6, BD=3, CD=3; BCD=1
  After 300 (B C D): BC=7, BD=4, CD=4; BCD=2
  After 400 (A B C): AB=4, AC=4, BC=8; ABC=2

Start point after B1: every dashed itemset that has been counted through all the transactions is made solid and stops being counted. Finish!

Result: Apriori needs 3 rounds over this database; DIC needs 1.5 rounds.
04/19/23 46
FP-Tree Algorithm
04/19/23 47
Mining Frequent Patterns Without Candidate Generation
Grow long patterns from short ones using local frequent items:
  "abc" is a frequent pattern
  Get all transactions having "abc": DB|abc
  If "d" is a local frequent item in DB|abc, then "abcd" is a frequent pattern
04/19/23 48
Mining Frequent Patterns With FP-trees
Idea: frequent pattern growth — recursively grow frequent patterns by pattern and database partition
Method:
  For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  Repeat the process on each newly created conditional FP-tree
  Until the resulting FP-tree is empty, or it contains only one path — a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
04/19/23 49
Construct FP-tree from a Transaction Database
min_support = 3

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order, giving the F-list
3. Scan the DB again and construct the FP-tree (a sketch follows below)

Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
F-list: f-c-a-b-m-p
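A minimal FP-tree construction sketch (ours, not the authors' code), following the three steps above; ties in frequency are broken arbitrarily.

  from collections import Counter, defaultdict

  class Node:
      def __init__(self, item, parent):
          self.item, self.parent, self.count = item, parent, 1
          self.children = {}

  def build_fptree(transactions, min_support):
      # Step 1: one scan to find frequent single items
      freq = Counter(i for t in transactions for i in t)
      freq = {i: c for i, c in freq.items() if c >= min_support}
      # Step 2: F-list, items in descending frequency order
      flist = sorted(freq, key=lambda i: -freq[i])
      root, header = Node(None, None), defaultdict(list)   # header: item -> node-links
      # Step 3: second scan, insert each transaction in F-list order
      for t in transactions:
          node = root
          for item in [i for i in flist if i in t]:
              child = node.children.get(item)
              if child is None:
                  child = Node(item, node)
                  node.children[item] = child
                  header[item].append(child)
              else:
                  child.count += 1
              node = child
      return root, header, flist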
04/19/23 50
Construct FP-tree from a Transaction Database (insertion, step by step)
Each transaction is inserted along a path from the root {} in F-list order; shared prefixes only increase the counts of existing nodes:
  After 100 (f, c, a, m, p): a single path {} - f:1 - c:1 - a:1 - m:1 - p:1
  After 200 (f, c, a, b, m): the prefix f, c, a is shared, giving {} - f:2 - c:2 - a:2 with two children, m:1 - p:1 and b:1 - m:1
  After 300 (f, b): f:3, with a new child b:1 under f
  After 400 (c, b, p): a new branch from the root, {} - c:1 - b:1 - p:1
  After 500 (f, c, a, m, p): the counts along the shared path increase, giving the final tree
Final tree: {} - f:4 - c:3 - a:3 - m:2 - p:2, plus the branches f - b:1, a - b:1 - m:1, and {} - c:1 - b:1 - p:1. Node-links from the header table (f:4, c:4, a:3, b:3, m:3, p:3) chain together all nodes carrying the same item.
04/19/23 55
Benefits of the FP-tree Structure
Completeness
  Preserves complete information for frequent pattern mining
  Never breaks a long pattern of any transaction
Compactness
  Reduces irrelevant information — infrequent items are gone
  Items are in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared
  Never larger than the original database (not counting node-links and the count fields)
  For the Connect-4 DB, the compression ratio can be over 100
04/19/23 56
Find Patterns Having P From P-conditional Database
Starting at the frequent-item header table in the FP-tree:
  Traverse the FP-tree by following the node-links of each frequent item p
  Accumulate all the transformed prefix paths of item p to form p's conditional pattern base (a sketch follows below)

Conditional pattern bases (read off the FP-tree built above, header table f:4, c:4, a:3, b:3, m:3, p:3):
  item   conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1
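A sketch (ours) of how the conditional pattern base of an item can be collected by following the node-links in the header table, continuing the hypothetical build_fptree code above:

  def conditional_pattern_base(item, header):
      # For every node carrying `item`, walk up to the root and record
      # the prefix path together with that node's count.
      base = []
      for node in header[item]:
          path, parent = [], node.parent
          while parent is not None and parent.item is not None:
              path.append(parent.item)
              parent = parent.parent
          if path:
              base.append((list(reversed(path)), node.count))
      return base

  # e.g. conditional_pattern_base('m', header) -> [(['f','c','a'], 2), (['f','c','a','b'], 1)]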
04/19/23 57
From Conditional Pattern-bases to Conditional FP-trees
For each pattern base:
  Accumulate the count for each item in the base
  Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: a single path {} - f:3 - c:3 - a:3 (b is dropped, since its count of 1 is below min_support)
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
04/19/23 58
Recursion: Mining Each Conditional FP-tree
m-conditional FP-tree: {} - f:3 - c:3 - a:3
  Conditional pattern base of "am": (fc:3)  →  am-conditional FP-tree: {} - f:3 - c:3
  Conditional pattern base of "cm": (f:3)   →  cm-conditional FP-tree: {} - f:3
  Conditional pattern base of "cam": (f:3)  →  cam-conditional FP-tree: {} - f:3
04/19/23 59
A Special Case: Single Prefix Path in FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix-path P.
Mining can be decomposed into two parts:
  Reduction of the single prefix path into one node
  Concatenation of the mining results of the two parts
(Figure: the tree with prefix path a1:n1 - a2:n2 - a3:n3 followed by a branching part rooted at r1 (b1:m1, c1:k1, c2:k2, c3:k3) is split into the single prefix path and the branching part, and the mining results are concatenated.)
04/19/23 60
Scaling FP-growth by DB Projection
Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Mining and Knowledge Discovery, Volume 8(1):pp.53-87, 2004.
FP-tree cannot fit in memory? — use DB projection:
  First partition the database into a set of projected DBs
  Then construct and mine an FP-tree for each projected DB
  Parallel projection vs. partition projection techniques: parallel projection is space-costly
04/19/23
Parallel Projection
Parallel projection needs a lot of disk space
Partition projection saves it
04/19/23 63
FP-Growth vs. Apriori: Scalability With the Support Threshold
(Figure: run time (sec.) vs. support threshold (%) for D1 FP-growth and D1 Apriori on data set T25I20D10K.)
04/19/23 64
Why Is FP-Growth the Winner?
Divide-and-conquer:
  Decompose both the mining task and the DB according to the frequent patterns obtained so far
  Leads to focused search of smaller databases
Other factors:
  No candidate generation, no candidate test
  Compressed database: the FP-tree structure
  No repeated scans of the entire database
  Basic operations are counting local frequent items and building sub FP-trees — no pattern search and matching
04/19/23 65
CHARM: Mining by Exploring Vertical Data Format
Vertical format: t(AB) = {T11, T25, ...} — the tid-list is the list of transaction ids containing an itemset
Deriving closed patterns based on vertical intersections:
  t(X) = t(Y): X and Y always happen together
  t(X) ⊂ t(Y): a transaction having X always has Y
Using diffsets to accelerate mining — only keep track of differences of tids:
  t(X) = {T1, T2, T3}, t(XY) = {T1, T3}  →  Diffset(XY, X) = {T2}
CHARM: An Efficient Algorithm for Closed Itemset Mining. (Mohammed J. Zaki & Ching-Jui Hsiao@SDM’02)
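A small illustrative sketch (ours) of the vertical representation and diffsets described above; transactions are assumed to be a dict mapping tid to itemset.

  def tidlists(transactions):
      # vertical format: item -> set of transaction ids containing it
      vertical = {}
      for tid, items in transactions.items():
          for i in items:
              vertical.setdefault(i, set()).add(tid)
      return vertical

  def tidlist(itemset, vertical):
      # t(X): intersect the tid-lists of the items in X
      return set.intersection(*[vertical[i] for i in itemset])

  def diffset(itemset_xy, itemset_x, vertical):
      # Diffset(XY, X) = t(X) - t(XY): tids that contain X but not Y
      return tidlist(itemset_x, vertical) - tidlist(itemset_xy, vertical)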
04/19/23 66
CHARM: Mining by Exploring Vertical Data Format
Step 1 — tid-lists of the single items:
  I1: {T100, T400, T500, T700, T800, T900}
  I2: {T100, T200, T300, T400, T600, T800, T900}
  I3: {T300, T500, T600, T700, T800, T900}
  I4: {T200, T400}
  I5: {T100, T800}
Step 2 — tid-lists of the 2-itemsets (by intersection):
  {I1, I2}: {T100, T400, T800, T900}
  {I1, I3}: {T500, T700, T800, T900}
  {I1, I4}: {T400}
  {I1, I5}: {T100, T800}
  {I2, I3}: {T300, T600, T800, T900}
  {I2, I4}: {T200, T400}
  {I2, I5}: {T100, T800}
  {I3, I5}: {T800}
Step 3 — tid-lists of the 3-itemsets:
  {I1, I2, I3}: {T800, T900}
  {I1, I2, I5}: {T100, T800}
04/19/23 67
Interestingness Measure: Correlations (Lift)
Buys games ⇒ buys videos [40%, 66.7%] is misleading: the overall percentage of customers purchasing videos is 75% > 66.7%
Buys games ⇒ does not buy videos [20%, 33.3%] is more accurate, although with lower support and confidence
Measure of dependent/correlated events: lift
  lift(A, B) = P(A ∪ B) / (P(A) P(B))

              game          not game      sum (row)
  video       4000 (4500)   3500 (3000)   7500
  not video   2000 (1500)    500 (1000)   2500
  sum (col)   6000          4000          10000
(Expected counts under independence are shown in parentheses.)

  lift(game, video)     = (4000/10000) / ((6000/10000) * (7500/10000)) = 0.89
  lift(game, not video) = (2000/10000) / ((6000/10000) * (2500/10000)) = 1.33
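The same lift numbers can be reproduced with a few lines (illustrative only):

  def lift(p_ab, p_a, p_b):
      # lift(A, B) = P(A ∪ B) / (P(A) P(B)); < 1 negative correlation, > 1 positive
      return p_ab / (p_a * p_b)

  n = 10000
  print(lift(4000 / n, 6000 / n, 7500 / n))   # game & video     -> 0.89
  print(lift(2000 / n, 6000 / n, 2500 / n))   # game & not-video -> 1.33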
04/19/23 68
Are lift and χ² Good Measures of Correlation?
  lift(A, B) = P(A ∪ B) / (P(A) P(B))
  χ² = Σ (Observed − Expected)² / Expected
Both are influenced by the number of null-transactions (transactions that contain neither A nor B)!
04/19/23 69
Null-invariant Measures of Correlation
Null-invariant measure: its value is free from the influence of null-transactions.
Four null-invariant measures of the correlation between A and B (a small computational sketch follows below):
  All confidence:  all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)}
  Max confidence:  max_conf(A, B) = max{ sup(A ∪ B)/sup(A), sup(A ∪ B)/sup(B) }
  Kulczynski:      Kulc(A, B) = (1/2) ( sup(A ∪ B)/sup(A) + sup(A ∪ B)/sup(B) )
  Cosine:          cosine(A, B) = sup(A ∪ B) / sqrt( sup(A) · sup(B) )
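For concreteness, a sketch (ours) computing the four measures from the supports of A, B, and A ∪ B:

  from math import sqrt

  def null_invariant_measures(sup_a, sup_b, sup_ab):
      return {
          "all_confidence": sup_ab / max(sup_a, sup_b),
          "max_confidence": max(sup_ab / sup_a, sup_ab / sup_b),
          "kulczynski": 0.5 * (sup_ab / sup_a + sup_ab / sup_b),
          "cosine": sup_ab / sqrt(sup_a * sup_b),
      }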
04/19/23 70
Null-invariant Measures of Correlation: examples
04/19/23 71
Which Null-invariant Measure is better?
Imbalance ratio: IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B))
  IR = 0: A and B are balanced
  Otherwise, the larger the difference between the two, the larger the IR
04/19/23 72
Summary of Measures of Correlation
Lift and χ² are not good measures of correlation in large transactional DBs, because they do not have the null-invariance property
Among the four null-invariant measures studied here — all_confidence, max_confidence, Kulc, and cosine — we recommend using Kulc in conjunction with the imbalance ratio
all_confidence has the downward closure property, and efficient algorithms can be derived for mining (Lee et al. @ ICDM'03sub)
04/19/23 73
Sequential Association Rule Mining
04/19/23 74
Sequence Data
Sequence Database:
  Object   Timestamp   Events
  A        10          2, 3, 5
  A        20          6, 1
  A        23          1
  B        11          4, 5, 6
  B        17          2
  B        21          7, 8, 1, 2
  B        28          1, 6
  C        14          1, 8, 7
(Figure: the same data drawn on a timeline for objects A, B, and C.)
04/19/23 75
Examples of Sequence Data
For each kind of sequence database — its sequences, elements (transactions), and events (items):
  Customer data — Sequence: purchase history of a given customer; Element: a set of items bought by a customer at time t; Event: books, dairy products, CDs, etc.
  Web data — Sequence: browsing activity of a particular Web visitor; Element: a collection of files viewed by a visitor after a single mouse click; Event: home page, index page, contact info, etc.
  Sensor data — Sequence: history of events generated by a given sensor; Element: events triggered by a sensor at time t; Event: types of alarms generated by sensors
  Genome sequences — Sequence: the DNA sequence of a particular species; Element: an element of the DNA sequence; Event: bases A, T, G, C
(Figure: a sequence illustrated as a series of elements, each element being a set of events such as E1 E2 or E1 E3.)
04/19/23 76
Formal Definition of a Sequence
A sequence is an ordered list of elements (transactions)
s = < e1 e2 e3 … >
Each element contains a collection of events (items)
ei = {i1, i2, …, ik}
Each element is attributed to a specific time or location
Length of a sequence, |s|, is given by the number of elements of the sequence
A k-sequence is a sequence that contains k elements
04/19/23 77
Examples of Sequence
Web sequence:
< {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >
Sequence of books checked out at a library:
< {Fellowship of the Ring} {The Two Towers} {Return of the King} >
04/19/23 78
Formal Definition of a Subsequence
A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> (m ≥ n) if there exist integers i1 < i2 < ... < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, ..., an ⊆ b_in
The support of a subsequence w is defined as the fraction of data sequences that contain w
A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)
Examples (a small containment-check sketch follows after the table):
Data sequence Subsequence Contain?
< {2,4} {3,5,6} {8} > < {2} {3,5} > Yes
< {1,2} {3,4} > < {1} {2} > No
< {2,4} {2,4} {2,5} > < {2} {4} > Yes
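A direct, illustrative translation of this containment test (our code); sequences are represented as lists of sets:

  def contains(data_seq, sub_seq):
      # sub_seq is contained if its elements map, in order, to supersets in data_seq
      i = 0
      for element in data_seq:
          if i < len(sub_seq) and sub_seq[i] <= element:
              i += 1
      return i == len(sub_seq)

  print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))   # True
  print(contains([{1, 2}, {3, 4}], [{1}, {2}]))              # False
  print(contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}]))      # True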
04/19/23 79
Sequential Pattern Mining: Definition
Given:
  a database of sequences
  a user-specified minimum support threshold, minsup
Task: find all subsequences with support ≥ minsup
04/19/23 80
Example
Q. How to find the sequential patterns?
04/19/23 81
Example (cont.)
(Table: the example database of customer transactions — each row holds a customer id, a transaction time, and the itemset bought — sorted by Customer_Id and Transaction_Time.)
04/19/23 82
Example (cont.)
<(30) (90)> is supported by customers 1 and 4
<(30) (40 70)> is supported by customers 2 and 4
With a minimum support of 2 customers, the large itemsets (litemsets) are: (30), (40), (70), (40 70), (90)
04/19/23 83
Example (cont.)
Q. Find the maximal sequences with minimum support of 2 customers:
- The answer set is: <(30) (90)>, <(30) (40 70)>
Sequential Patterns
04/19/2384
The Algorithm
Five phases:
  Sort phase
  Litemset (large itemset) phase
  Transformation phase
  Sequence phase
  Maximal phase
Algorithms for the sequence phase: AprioriAll, AprioriSome, DynamicSome
Rakesh Agrawal and Ramakrishnan Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering, ICDE 1995.
04/19/23 85
Sort phase
Sort the database with customer-id as the major key and transaction-time as the minor key.
04/19/23 86
Litemset phase
Find the large itemsets and map each of them to an integer.
04/19/23 87
Transformation phase
Delete non-large itemsets
Map large itemsets to integers
04/19/23 88
Sequence phase
Use the set of litemsets to find the desired sequences.
Two families of algorithms:
  Count-all: count all large sequences, including non-maximal sequences — Algorithm AprioriAll
  Count-some: try to avoid counting non-maximal sequences by counting longer sequences first — Algorithms AprioriSome and DynamicSome
04/19/23 89
Maximal phase
Find the maximum sequences among the set of large sequences.
In some algorithms, this phase is combined with the sequence phase.
04/19/23 90
Maximal phase
Algorithm (S = the set of all large sequences, n = the length of the longest sequence):
for (k = n; k > 1; k--) do
  for each k-sequence s_k do
    delete from S all subsequences of s_k
04/19/23 91
AprioriAll
The basic method to mine sequential patterns
Based on the Apriori algorithm
Counts all the large sequences, including non-maximal sequences
Uses an Apriori-generate function to generate candidate sequences
04/19/23 92
Apriori Candidate Generation
Generate the candidates for a pass using only the large sequences found in the previous pass, and then make a pass over the data to find their support.
04/19/23 93
Apriori Candidate Generation
Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.

insert into C_k
select p.litemset_1, p.litemset_2, ..., p.litemset_(k-1), q.litemset_(k-1)
from L_(k-1) p, L_(k-1) q
where p.litemset_1 = q.litemset_1, ..., p.litemset_(k-2) = q.litemset_(k-2);

forall sequences c ∈ C_k do
  forall (k-1)-subsequences s of c do
    if (s ∉ L_(k-1)) then delete c from C_k;
04/19/23 94
AprioriAll (cont.)
L1 = {large 1-sequences};   // result of the litemset phase
for (k = 2; L_(k-1) ≠ Ø; k++) do begin
  C_k = new candidates generated from L_(k-1)
  for each customer-sequence c in the database do
    increment the count of all candidates in C_k that are contained in c
  L_k = candidates in C_k with minimum support
end
Answer = maximal sequences in ∪_k L_k;
04/19/23 95
Example: (Customer Sequences)
Customer sequences (minimum support set to 40%, i.e. 2 of the 5 customers):
  <{1 5} {2} {3} {4}>
  <{1} {3} {4} {3 5}>
  <{1} {2} {3} {4}>
  <{1} {3} {5}>
  <{4} {5}>
Next step: find the large 1-sequences.
04/19/23 96
next step: find the large 2-sequences
Sequence Support
<1>
<2>
<3>
<4>
<5>
<{1 5}{2}{3}{4}><{1}{3}{4}{3 5}><{1}{2}{3}{4}>
<{1}{3}{5}><{4}{5}>
Example
Large 1-Sequence
4
2
4
4
2
04/19/23 97
Large 2-sequences:  <1 2>:2, <1 3>:4, <1 4>:3, <1 5>:2, <2 3>:2, <2 4>:2, <3 4>:3, <3 5>:2, <4 5>:2
Next step: find the large 3-sequences.
04/19/23 98
Large 3-sequences:  <1 2 3>:2, <1 2 4>:2, <1 3 4>:3, <1 3 5>:2, <2 3 4>:2
Next step: find the large 4-sequences.
04/19/23 99
Large 4-sequences:  <1 2 3 4>:2
Next step: find the sequential patterns (the maximal large sequences).
04/19/23 100
All large sequences found (with their supports):
  1-sequences: <1>:4, <2>:2, <3>:4, <4>:4, <5>:2
  2-sequences: <1 2>:2, <1 3>:4, <1 4>:3, <1 5>:2, <2 3>:2, <2 4>:2, <3 4>:3, <3 5>:2, <4 5>:2
  3-sequences: <1 2 3>:2, <1 2 4>:2, <1 3 4>:3, <1 3 5>:2, <2 3 4>:2
  4-sequences: <1 2 3 4>:2
Final step: find the maximal large sequences among them.
04/19/23 101
Count-some Algorithms
Try to avoid counting non-maximal sequences by counting longer sequences first.
Two phases:
  Forward phase — find all large sequences of certain lengths
  Backward phase — find all remaining large sequences
04/19/23 102
AprioriSome (1)
Determines which lengths to count using a next() function.
next() takes as a parameter the length of the sequences counted in the last pass.
next(k) = k + 1 is the same as AprioriAll.
The choice of next() balances the trade-off between counting non-maximal sequences and counting extensions of small candidate sequences.
04/19/23 103
AprioriSome (2)
hit_k = |L_k| / |C_k|
Intuition: as hit_k increases, the time wasted by counting extensions of small candidates decreases.
04/19/23 104
AprioriSome (3)
04/19/23 105
AprioriSome (4)
Backward Phase: For all lengths which we skipped:
Delete sequences in candidate set which are contained in some large sequence.
Count remaining candidates and find all sequences with min. support.
Also delete large sequences found in forward phase which are non-maximal.
04/19/23 106
AprioriSome (5)
04/19/23 107
AprioriSome (6)
Example (next(k) = 2k, minsup = 2), forward phase: only selected lengths are counted, so the candidate 3-sequences C3 are generated but not counted in this pass. (Figure not reproduced.)
04/19/23 108
AprioriSome (7)
Example, backward phase: the skipped candidate 3-sequences C3 are handled here. (Figure not reproduced.)
04/19/23 109
Performance of two algorithms
AprioriSome does a little better than AprioriAll, because it avoids counting many non-maximal sequences.
04/19/23 110
Advanced Association Rule Mining
04/19/23 111
Mining Various Kinds of Association Rules
Mining multilevel association
Mining multidimensional association
(Optional) Mining max and closed association patterns
(Optional) Constraint-based association mining
04/19/23 112
Mining Multiple-Level Association Rules
Items often form hierarchies
Flexible support settings: items at the lower levels are expected to have lower support
Exploration of shared multi-level mining (Agrawal & Srikant @ VLDB'95, Han & Fu @ VLDB'95)

Example: Milk [support = 10%], 2% Milk [support = 6%], Skim Milk [support = 4%]
  Uniform support:  Level 1 min_sup = 5%, Level 2 min_sup = 5%
  Reduced support:  Level 1 min_sup = 5%, Level 2 min_sup = 3%
04/19/23 113
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to "ancestor" relationships between items.
Example:
  milk ⇒ wheat bread [support = 8%, confidence = 70%]
  2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
04/19/23 114
Mining Multi-Dimensional Association
Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: two or more dimensions or predicates
  Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values — data cube approach
Quantitative attributes: numeric, implicit ordering among values — discretization, clustering, and gradient approaches
04/19/23 115
Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary are treated
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant@SIGMOD96)
3. Clustering: Distance-based association (e.g., Yang & Miller@SIGMOD97)
one dimensional clustering then association
4. Deviation analysis (e.g., Aumann and Lindell @ KDD'99): Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9/hr)
04/19/23 116
Static Discretization of Quantitative Attributes
Quantitative attributes are discretized prior to mining using concept hierarchies; numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, and mining from data cubes can be much faster.
(Figure: the lattice of cuboids over the dimensions age, income, and buys: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys).)
04/19/23 117
Quantitative Association Rules
age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high-resolution TV")
Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid.
Example:
  age(X, 34) ∧ income(X, "31-40K") ⇒ buys(X, "HDTV")
  age(X, 35) ∧ income(X, "31-40K") ⇒ buys(X, "HDTV")
  age(X, 34) ∧ income(X, "31-50K") ⇒ buys(X, "HDTV")
  age(X, 35) ∧ income(X, "31-50K") ⇒ buys(X, "HDTV")
Classification by Association Rule Analysis
2023年4月19日 119119
Associative Classification
Associative classification: major steps
  Mine the data to find strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
  Association rules are generated in the form of p1 ∧ p2 ∧ ... ∧ pl ⇒ "Aclass = C" (conf, sup)
  Organize the rules to form a rule-based classifier
Why effective?
  It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
  Associative classification has been found to be often more accurate than some traditional classification methods, such as C4.5
2023年4月19日 120120
Typical Associative Classification Methods
CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD'98)
  Mine possible association rules of the form: cond-set (a set of attribute-value pairs) ⇒ class label
  Build classifier: organize rules according to decreasing precedence based on confidence and then support
CMAR (Classification based on Multiple Association Rules: Li, Han & Pei, ICDM'01)
  Classification: statistical analysis on multiple rules
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
  Generation of predictive rules (FOIL-like analysis), but covered rules are retained with reduced weight
  Prediction using the best k rules
  High efficiency; accuracy similar to CMAR
2023年4月19日 121
CBA [Liu, Hsu and Ma, KDD’98]
• Basic idea
  • Mine high-confidence, high-support class association rules with Apriori
  • Rule LHS: a conjunction of conditions
  • Rule RHS: a class label
  • Example:
    R1: age < 25 & credit = 'good' ⇒ buy iPhone (sup = 30%, conf = 80%)
    R2: age > 40 & income < 50k ⇒ not buy iPhone (sup = 40%, conf = 90%)
2023年4月19日 122
CBA
• Rule mining
  • Mine the set of association rules w.r.t. min_sup and min_conf
  • Rank rules in descending order of confidence and support
  • Select rules to ensure training-instance coverage
• Prediction (a small sketch follows below)
  • Apply the first rule that matches a test case
  • Otherwise, apply the default rule
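A minimal sketch (ours, not the authors' implementation) of CBA's prediction step: rules are kept sorted by confidence and then support, and the first matching rule fires, otherwise the default class is returned. The rule and attribute encodings below are our own illustration of the example that follows.

  def cba_predict(instance, rules, default_class):
      # rules: list of (conditions, class_label, confidence, support),
      # already sorted by confidence desc, then support desc.
      # conditions: dict attribute -> required value; instance: dict attribute -> value.
      for conditions, label, conf, sup in rules:
          if all(instance.get(a) == v for a, v in conditions.items()):
              return label
      return default_class

  rules = [({"age": "31...40"}, "yes", 1.0, 0.286),
           ({"student": "yes", "credit_rating": "fair"}, "yes", 1.0, 0.286),
           ({"student": "yes"}, "yes", 0.857, 0.50)]
  print(cba_predict({"age": "<=30", "income": "high", "student": "yes",
                     "credit_rating": "fair"}, rules, "no"))   # rule 2 fires -> "yes"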
2023年4月19日 123
CBA – An example (min_sup = 25%, min_conf = 80%)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

Rule mining produces the rules:
1. age = 31...40 ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
2. student = yes & credit_rating = fair ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
3. student = yes ⇒ buys_computer = yes (conf = 85.7%, sup = 50%)
Default: buys_computer = no
CBA - An example
2023年4月19日 124
Rules:
1. age = 31...40 ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
2. student = yes & credit_rating = fair ⇒ buys_computer = yes (conf = 100%, sup = 28.6%)
3. student = yes ⇒ buys_computer = yes (conf = 85.7%, sup = 50%)
Default: buys_computer = no

Prediction:
  (age <=30, income high, student yes, credit_rating fair)          → Rule 2 applies: buys_computer = yes
  (age 31...40, income high, student yes, credit_rating excellent)  → Rule 1 applies: buys_computer = yes
  (age <=30, income high, student no, credit_rating excellent)      → Default rule applies: buys_computer = no
2023年4月19日 125
CMAR [Li, Han and Pei, ICDM’01]
Basic idea
  Mining: build a class-distribution-associated FP-tree
  Prediction: combine the strength of multiple rules
Rule mining
  Mine association rules from a class-distribution-associated FP-tree
  Store and retrieve association rules in a CR-tree
  Prune rules based on confidence, correlation, and database coverage
2023年4月19日 126
CMAR (Classification based on Multiple Association Rules) (1)
Adapted from FP-growth. Phases:
  Rule generation or training: find rules R: P ⇒ c such that sup(R) and conf(R) pass the given thresholds
  Classification or testing: predict the class label of a new sample
2023年4月19日 127
CMAR (Classification based on Multiple Association Rules) (2)
Training database T for the CMAR algorithm (support threshold 2, confidence threshold 70%):

ID   A    B    C    D    Class
01   a1   b1   c1   d1   A
02   a1   b2   c1   d2   B
03   a2   b3   c2   d3   A
04   a1   b2   c3   d3   C
05   a1   b2   c1   d3   C

The FP-tree is a prefix tree with respect to the F-list.
F-list: (a1, b2, c1, d3)
2023年4月19日 128
2023年4月19日 129
CMAR (Classification based on Multiple Association Rules) (3)
Rule subsets:
  Rules containing d3
  Rules containing c1 but not d3
  Rules containing b2 but neither d3 nor c1
  Rules containing only a1
d3-projected samples: (a1, b2, c1, d3): C, (a1, b2, d3): C, and (d3): A
  ⇒ Rule: (a1, b2, d3) ⇒ C (sup = 2, conf = 100%)
(a1, c1) is a frequent pattern with support 3, but all of its rules have confidence below the threshold. Similar conclusions hold for the pattern (a1, b2), and finally for (a1).
2023年4月19日 130
CMAR (Classification based on Multiple Association Rules) (4)
Classification (testing) phase:
  If all the matching rules have the same class, CMAR simply assigns that label to the new sample
  If the rules are not consistent, the sample is assigned the class label of the "strongest" group of rules
  To compare the strength of groups, it is necessary to measure the "combined effect" of each group
  If the rules in a group are highly positively correlated and have good support, the group should have a strong effect
2023年4月19日 131
CMAR (Classification based on Multiple Association Rules) (5)
Possible ways to measure the combined effect of a group of rules:
  Highest χ² value
  Compound of correlation
  Weighted χ² — integrates information on both correlation and population
2023年4月19日 132
CMAR (Classification based on Multiple Association Rules) (6)
Weighted χ²
max χ² computes the upper bound of the χ² value of a rule when the other settings are fixed.
For each group of rules, the weighted χ² measure of the group is defined as the sum, over the rules in the group, of χ² · χ² / max χ².
2023年4月19日 133
CPAR [Yin and Han, SDM’03]
Basic idea
  Combine associative classification and FOIL-based rule generation
  FOIL gain: the criterion for selecting a literal
  Improve accuracy over traditional rule-based classifiers
  Improve efficiency and reduce the number of rules compared with association-rule-based methods
2023年4月19日 134
CPAR (1)
Rule generation
  Build a rule by adding literals one by one, in a greedy way according to the FOIL gain measure
  Keep all close-to-the-best literals and build several rules simultaneously
Prediction
  Collect all rules matching a test case
  Select the best k rules for each class
  Choose the class with the highest expected accuracy for prediction
2023年4月19日 135
CPAR (2)
Build rules by adding literals one by one.
CPAR keeps all "close-to-the-best" literals during the rule-building process, so it can select more than one literal at the same time and build several rules simultaneously.
2023年4月19日 136
CPAR (3)
After finding the best literal p, another literal q may have a gain similar to p's (e.g., differing by at most 1%).
Appending p and appending q to the current rule r each create a new rule r'.
2023年4月19日 137
How CPAR generates rules
Example:
1. Literal (A1=2) has the most FOIL gain, so the current rule starts as (A1=2).
2. After the first literal is selected, two literals, (A2=1) and (A3=1), are found to have similar gain, higher than the others; both branches are kept.
3. Choose literal (A2=1) first. A rule is generated along this direction: (A1=2, A2=1, A4=1).
4. Then the rule (A1=2, A3=1) is taken as the current rule. Again two literals with similar gain, (A4=2) and (A2=1), are selected.
5. Choose (A1=2, A3=1, A4=2) first. A rule is generated: (A1=2, A3=1, A4=2, A2=3).
6. Finally, (A1=2, A3=1, A2=1) is generated.
(Figure: the tree of literals explored, rooted at A1=2, with branches A2=1 → A4=1 and A3=1 → {A4=2 → A2=3, A2=1}.)
More reading on Associative Classification
Fadi Thabtah. A review of associative classification mining. The Knowledge Engineering Review, Vol. 22:1, 37-65, 2007.
04/19/23 144
Q&A