2
Overview
Associations
Market Basket Analysis
Basic Concepts
Frequent Itemsets
Generating Frequent Itemsets
  Apriori
  FP-Growth
Applications
3
Association Rule Learners
to discover elements that co-occur frequently within a data set consisting of multiple independent selections of elements (such as purchasing transactions), and
to discover rules, such as implication or correlation, which relate co-occurring elements.
to answer questions such as "if a customer purchases product A, how likely is he to purchase product B?" and "what products will a customer buy if he buys products C and D?"
to reduce a potentially huge amount of information to a small, understandable set of statistically supported statements.
also known as “market basket analysis”.
4
Associations
Rules expressing relationships between items
Example:
cereal, milk => fruit
“People who bought cereal and milk also bought fruit.”
Stores might want to offer specials on milk and cereal to get people to buy more fruit.
5
Market Basket Analysis
Analyze tables of transactions
Can we hypothesize?
Chips => Salsa
Lettuce => Spinach
Person Basket
A Chips, Salsa, Cookies, Crackers, Coke, Beer
B Lettuce, Spinach, Oranges, Celery, Apples, Grapes
C Chips, Salsa, Frozen Pizza, Frozen Cake
D Lettuce, Spinach, Milk, Butter
7
Basic Concepts
Set of items: I = {i1, i2, ..., im}
Transaction: T ⊆ I
D - the set of transactions (i.e., our data)
Association rule: A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
8
Measuring Interesting Rules
Support: ratio of the number of transactions containing A and B to the total number of transactions

s(A => B) = |{T ∈ D | A ∪ B ⊆ T}| / |D|

Confidence: ratio of the number of transactions containing A and B to the number of transactions containing A

c(A => B) = |{T ∈ D | A ∪ B ⊆ T}| / |{T ∈ D | A ⊆ T}|
9
Measuring Interesting Rules
Rules are included/excluded based on two metrics:
minimum support level - how frequently all of the items in a rule appear in transactions
minimum confidence level - how frequently the left-hand side of a rule implies the right-hand side
10
Market Basket Analysis
What is I?
What is T for person B?
What is s(Chips => Salsa)?
What is c(Chips => Salsa)?
Person Basket
A Chips, Salsa, Cookies, Crackers, Coke, Beer
B Lettuce, Spinach, Oranges, Celery, Apples, Grapes
C Chips, Salsa, Frozen Pizza, Frozen Cake
D Lettuce, Spinach, Milk, Butter, Chips
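As a sketch, the two measures can be computed directly over the table above (the baskets are assumed as Python sets; `support` and `confidence` are illustrative helper names):

```python
# The four baskets from the table above.
baskets = [
    {"Chips", "Salsa", "Cookies", "Crackers", "Coke", "Beer"},        # A
    {"Lettuce", "Spinach", "Oranges", "Celery", "Apples", "Grapes"},  # B
    {"Chips", "Salsa", "Frozen Pizza", "Frozen Cake"},                # C
    {"Lettuce", "Spinach", "Milk", "Butter", "Chips"},                # D
]

def support(A, B, D):
    """s(A => B): fraction of transactions containing A union B."""
    return sum(A | B <= T for T in D) / len(D)

def confidence(A, B, D):
    """c(A => B): fraction of transactions with A that also contain B."""
    return sum(A | B <= T for T in D) / sum(A <= T for T in D)

print(support({"Chips"}, {"Salsa"}, baskets))     # 2/4 = 0.5
print(confidence({"Chips"}, {"Salsa"}, baskets))  # 2/3
```

So s(Chips => Salsa) = 2/4 and c(Chips => Salsa) = 2/3: Chips appears in baskets A, C, and D, but Salsa accompanies it only in A and C.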
11
Frequent Itemsets
itemset - any set of items
k-itemset - an itemset containing k items
frequent itemset - an itemset that satisfies a minimum support level
If I contains m items, how many itemsets are there?
12
Strong Association Rules
Given an itemset, it's easy to generate association rules
Given itemset {Chips, Salsa}:
  => Chips, Salsa
  Chips => Salsa
  Salsa => Chips
  Chips, Salsa =>
Strong rules are interesting
Generally defined as those rules satisfying minimum support and minimum confidence
13
Association Rule Mining
Two basic steps:
1. Find all frequent itemsets (satisfying minimum support)
2. Find all strong association rules
   Generate association rules from frequent itemsets
   Keep rules satisfying minimum confidence
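The second step can be sketched in Python (a minimal illustration; `generate_rules` and the support values below are hypothetical, assuming step 1 has already produced the frequent itemsets and their supports):

```python
from itertools import combinations

def generate_rules(freq_supports, min_conf):
    """freq_supports: dict mapping frozenset itemset -> support fraction.
    Returns (antecedent, consequent, confidence) for all strong rules."""
    rules = []
    for itemset, supp in freq_supports.items():
        if len(itemset) < 2:
            continue  # singletons yield only trivial rules
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = supp / freq_supports[ante]  # c(A => B) = s(A u B) / s(A)
                if conf >= min_conf:
                    rules.append((set(ante), set(itemset - ante), conf))
    return rules

# Supports taken from the 4-basket example earlier.
freq = {
    frozenset({"Chips"}): 0.75,
    frozenset({"Salsa"}): 0.5,
    frozenset({"Chips", "Salsa"}): 0.5,
}
for a, c, conf in generate_rules(freq, min_conf=0.6):
    print(a, "=>", c, round(conf, 2))
```

Both Chips => Salsa (confidence 2/3) and Salsa => Chips (confidence 1.0) survive a 0.6 confidence threshold here.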
14
Generating Frequent Itemsets
Naïve algorithm
n <- |D|
for each subset s of I do
    l <- 0
    for each transaction T in D do
        if s is a subset of T then
            l <- l + 1
    if minimum support <= l/n then
        add s to frequent subsets
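A runnable version of the naïve algorithm might look like the following sketch (the dataset at the end is illustrative):

```python
from itertools import chain, combinations

def naive_frequent_itemsets(D, I, min_sup):
    """Check every non-empty subset of I against every transaction.
    This is exactly the O(2^m * n) loop from the pseudocode."""
    n = len(D)
    frequent = []
    all_subsets = chain.from_iterable(
        combinations(I, k) for k in range(1, len(I) + 1))
    for s in map(frozenset, all_subsets):
        count = sum(s <= T for T in D)   # transactions containing s
        if count / n >= min_sup:
            frequent.append((s, count / n))
    return frequent

D = [{"Chips", "Salsa"}, {"Chips"}, {"Salsa", "Milk"},
     {"Chips", "Salsa", "Milk"}]
I = {"Chips", "Salsa", "Milk"}
print(naive_frequent_itemsets(D, I, min_sup=0.5))
```

With m = 3 items this enumerates 7 subsets; with m = 100 it would enumerate about 10^30, which motivates the analysis on the next slide.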
15
Generating Frequent Itemsets
Analysis of naïve algorithm:
2^m subsets of I
Scan n transactions for each subset
O(2^m n) tests of whether s is a subset of T
Growth is exponential in the number of items!
Can we do better?
16
Generating Frequent Itemsets
Frequent itemsets satisfy the apriori property:
If A is not a frequent itemset, then any superset of A is not a frequent itemset.
Proof: Let n be the number of transactions. Suppose A is a subset of l transactions. If A ⊆ A′, then A′ is a subset of l′ ≤ l transactions. Thus, if l/n < minimum support, so is l′/n.
17
Generating Frequent Itemsets
Central idea: build candidate k-itemsets from frequent (k-1)-itemsets
Approach:
Find all frequent 1-itemsets
Extend frequent (k-1)-itemsets to candidate k-itemsets
Prune candidate itemsets that do not meet the minimum support
18
Generating Frequent Itemsets (Basic Apriori)
L1 = {frequent 1-itemsets}
for (k = 2; L(k-1) is not empty; k++) {
    Ck = generate k-itemset candidates from L(k-1)
    for each transaction t in D {
        Ct = subset(Ck, t)   // the candidates that are subsets of t
        for each candidate c in Ct { c.count++ }
    }
    Lk = {c in Ck | c.count >= min_sup}
}
The frequent itemsets are the union of the Lk
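A compact Python sketch of basic Apriori, assuming transactions are plain sets and min_sup is a fraction (the join and prune steps shown are one common way to realize candidate generation):

```python
from itertools import combinations

def apriori(D, min_sup):
    """Return all frequent itemsets of D (list of sets) as frozensets."""
    n = len(D)
    items = {i for T in D for i in T}
    # L1: frequent 1-itemsets
    Lk = {frozenset({i}) for i in items
          if sum(i in T for T in D) / n >= min_sup}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join: union pairs of frequent (k-1)-itemsets into k-itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: every (k-1)-subset must be frequent (apriori property)
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count support of surviving candidates in one pass over D
        Lk = {c for c in Ck if sum(c <= T for T in D) / n >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

D = [{"Chips", "Salsa"}, {"Chips"}, {"Salsa", "Milk"},
     {"Chips", "Salsa", "Milk"}]
print(apriori(D, min_sup=0.5))
```

On this toy dataset the prune step already removes the 3-itemset candidate, since {Chips, Milk} is not frequent.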
19
FP Growth (Han, Pei, Yin 2000)
One problematic aspect of Apriori is candidate generation, the source of exponential growth
Another approach is to use a divide-and-conquer strategy
Idea: compress the database into a frequent pattern tree (FP-tree) representing frequent items
20
FP Growth (Tree construction)
Initially, scan the database for frequent 1-itemsets
Place the resulting set in a list L in descending order by frequency (support)
Construct an FP-tree:
Create a root node labeled null
Scan the database
Process the items in each transaction in L order
From the root, add nodes in the order in which items appear in the transactions
Link nodes representing the same item across different branches
21
Frequent 1-itemsets
Minimum support of 20% (frequency of 2)
Frequent 1-itemsets: I1, I2, I3, I4, I5
Construct list L = {(I2,7), (I1,6), (I3,6), (I4,2), (I5,2)}
TID  Items
1    I1, I2, I5
2    I2, I4
3    I2, I3, I6
4    I1, I2, I4
5    I1, I3
6    I2, I3
7    I1, I3
8    I1, I2, I3, I5
9    I1, I2, I3
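A minimal sketch of the two-pass construction, assuming a simple Node class (the names here are hypothetical); run on the nine transactions above, it reproduces the list L and the header-table counts:

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(D, min_count):
    # Pass 1: count items, keep frequent ones in descending support order
    counts = Counter(i for T in D for i in T)
    L = sorted((i for i in counts if counts[i] >= min_count),
               key=lambda i: (-counts[i], i))
    rank = {i: r for r, i in enumerate(L)}
    # Pass 2: insert each transaction's frequent items in L order
    root = Node(None, None)
    header = defaultdict(list)  # item -> node-links across branches
    for T in D:
        node = root
        for i in sorted((i for i in T if i in rank), key=rank.get):
            if i in node.children:
                node = node.children[i]
                node.count += 1       # shared prefix: bump the count
            else:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)  # link the new node for item i
                node = child
    return root, header, L

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3","I6"}, {"I1","I2","I4"},
     {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
     {"I1","I2","I3"}]
root, header, L = build_fp_tree(D, min_count=2)
print(L)  # ['I2', 'I1', 'I3', 'I4', 'I5']
print({i: sum(n.count for n in header[i]) for i in L})
# {'I2': 7, 'I1': 6, 'I3': 6, 'I4': 2, 'I5': 2}
```

Summing the counts along each item's node-links recovers exactly the supports in L, which is the invariant the header table maintains.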
22
Build FP-Tree
Create root node labeled null
Scan database
Transaction 1: I1, I2, I5
Order (per L): I2, I1, I5
Process transaction: add nodes in item order, labeled with item and count
Tree: null -> (I2,1) -> (I1,1) -> (I5,1)
Maintain header table: I2:1, I1:1, I5:1, I3:0, I4:0
23
Build FP-Tree
Transaction 2: I2, I4 - the (I2) prefix is shared, so its count becomes 2 and a new child (I4,1) is added
null
├─ (I2,2)
│  ├─ (I1,1)
│  │  └─ (I5,1)
│  └─ (I4,1)
Header table: I2:2, I1:1, I4:1, I5:1, I3:0
24
Mining the FP-tree
Start at the last item in the table
Find all paths containing the item (follow the node-links)
Identify conditional patterns (patterns in paths with the required frequency)
Build conditional FP-tree C
Append the item to all paths in C, generating frequent patterns
Mine C recursively (appending the item)
Remove the item from the table and tree
25
Mining the FP-Tree
null
├─ (I2,7)
│  ├─ (I1,4)
│  │  ├─ (I5,1)
│  │  ├─ (I4,1)
│  │  └─ (I3,2)
│  │     └─ (I5,1)
│  ├─ (I4,1)
│  └─ (I3,2)
└─ (I1,2)
   └─ (I3,2)
Header table: I2:7, I1:6, I3:6, I4:2, I5:2

Mining I5:
Prefix paths: (I2 I1, 1), (I2 I1 I3, 1)
Conditional path: (I2 I1, 2) - I3 is dropped (count 1 is below the minimum)
Conditional FP-tree: null -> (I2,2) -> (I1,2)
Generated frequent pattern: (I2 I1 I5, 2)
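The conditional-pattern step for I5 can be sketched directly from its prefix paths (a minimal illustration; the literals below restate the prefix paths above):

```python
from collections import Counter

# Prefix paths for I5 from the tree above, as (path, count) pairs
prefix_paths = [(("I2", "I1"), 1), (("I2", "I1", "I3"), 1)]
min_count = 2

# Count each item's support within the conditional pattern base
cond_counts = Counter()
for path, cnt in prefix_paths:
    for item in path:
        cond_counts[item] += cnt

# Items meeting min support form I5's conditional FP-tree
survivors = {i: c for i, c in cond_counts.items() if c >= min_count}
print(survivors)  # {'I2': 2, 'I1': 2} -- I3 is dropped (count 1)
```

Appending I5 to combinations of the survivors yields the frequent patterns ending in I5, including (I2 I1 I5, 2) from the slide.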
27
Web Personalization
“Effective Personalization Based on Association Rule Discovery from Web Usage Data,” Mobasher et al., ACM Workshop on Web Information and Data Management, 2001
Personalization and recommendation systems, e.g., Amazon.com’s recommended books
28
Data Preprocessing
Identify set of pageviews P
Which files result in a single browser display (complicated by frames, images, etc.)
P = {p1, ..., pn}
Transactions T
From session IDs or cookies
T = {t1, ..., tm}
29
Data Preprocessing
A transaction t consists of t = {(p1^t, w(p1^t)), ..., (pl^t, w(pl^t))}
w is a weight associated with the pageview
Could be binary (purchase or non-purchase)
Could be related to the amount of time spent on the page
30
Data Preprocessing
In the paper, only pageviews with w(p) = 1 were considered in a transaction
Ordering of pageviews didn’t matter
31
Recommendation Engine
Has to run online, i.e., must be fast
Generate frequent itemsets first and store them in a graph data structure for efficient searching
Maintains a history of the user’s current session
Sets a window size w (e.g., 3)
Consider pageviews A, B, C: window = {A, B, C}
If the user then visits D: window = {B, C, D}
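The sliding window itself is simple to sketch; a deque with a maximum length handles the eviction automatically (illustrative only, not the paper's implementation):

```python
from collections import deque

# Window of the w most recent pageviews in the active session (w = 3).
# Its contents are the itemset matched against the stored frequent itemsets.
window = deque(maxlen=3)
for p in ["A", "B", "C"]:
    window.append(p)
print(set(window))   # {'A', 'B', 'C'}

window.append("D")   # user visits D; oldest pageview A is evicted
print(set(window))   # {'B', 'C', 'D'}
```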
32
Genomic Data
“Finding Association Rules on Heterogenous Genome Data,” Satou et al.
Combined data from PDB, SWISS-PROT, and PROSITE

Protein name | sequence feature1 | sequence feature2 | structure feature1 | function1 | function2
name1        | 1                 | 0                 | 1                  | 0         | 1
name2        | 0                 | 0                 | 1                  | 1         | 0
33
Genomic Data
After mining, 182,388 association rules were generated (minimum support = 5, minimum confidence = 65%)
Post-process results with a maximum support of 30
Itemsets appearing too frequently aren’t interesting
Reduced to 381 rules
34
Genomic Data
Rules generated were corroborated by biological background data
Found common substructures in serine endopeptidases
Rules were not distributed well over protein families
Still some work to be done on the data preprocessing stage
35
Association Rule Summary
Association rule mining is a fundamental tool in data mining
Several algorithms:
Apriori: use a provable mathematical property to improve performance
FP-Growth: stop candidate generation, use an effective data structure
Correlation Rules: evaluate interestingness based on statistics
Query Flocks: generalize the approach with the purpose of query optimization (incorporation into database systems)