CS 345: Topics in Data Warehousing
Thursday, November 18, 2004
Review of Tuesday's Class
• Data Mining
  – What is data mining?
  – Types of data mining
  – Data mining pitfalls
• Decision Tree Classifiers
  – What is a decision tree?
  – Learning decision trees
  – Entropy
  – Information Gain
  – Cross-Validation
Overview of Today's Class
• Assignment #3 clarifications
• Association Rule Mining
  – Market basket analysis
  – What is an association rule?
  – Frequent itemsets
• Association rule mining algorithms
  – A-priori algorithm
  – Speeding up A-priori using hashing
  – One- and two-pass algorithms
* Adapted from slides by Vipin Kumar (Minnesota) and Rajeev Motwani (Stanford)
Aggregate Tables
• A dimension table with n dimension columns admits 2^n possible sets of grouping columns, and thus 2^n possible aggregates
• 2 of these are special:
  – All columns – the same as the original dimension table
  – No grouping columns – the dimension collapses to only 1 row, so there is no reason to join to it; eliminate this foreign key from the fact aggregate table
[Figure: a FACT table and its FACTAGG aggregate table joined to a Dimension table, with the collapsed dimension's foreign key removed from the fact aggregate]

Candidate Column Sets
• Including fact aggregates that use some base dimension tables is optional
Association Rule Mining
• Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction
• Also known as market basket analysis
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold
  – (see the sketch below)
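To make the definitions concrete, here is a minimal Python sketch (not from the slides) that computes support count and support over the five example transactions; the names transactions, support_count, and support are illustrative.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain every item of X
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # s(X): fraction of transactions that contain X
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4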
Definition: Association Rule
Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X
    • (see the sketch below)
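As a sketch of the confidence metric (reusing support_count from the earlier sketch), the confidence of X → Y divides the support count of X ∪ Y by that of X:

def confidence(X, Y, transactions):
    # c(X -> Y) = sigma(X ∪ Y) / sigma(X)
    return support_count(X | Y, transactions) / support_count(X, transactions)

print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 2/3 ≈ 0.67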
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold
• High confidence = strong pattern
• High support = occurs often
  – Less likely to be a random occurrence
  – Larger potential benefit from acting on the rule
Application 1 (Retail Stores)
• Real market baskets
  – chain stores keep terabytes of customer purchase info
  – Value?
    • how typical customers navigate stores
    • positioning tempting items
    • suggests cross-sell opportunities – e.g., put hamburger on sale while raising the ketchup price
    • …
• High support needed, or no $$'s
Application 2 (Information Retrieval)
• Scenario 1
  – baskets = documents
  – items = words in documents
  – frequent word-groups = linked concepts
• Scenario 2
  – items = sentences
  – baskets = documents containing sentences
  – frequent sentence-groups = possible plagiarism
Application 3 (Web Search)
• Scenario 1
  – baskets = web pages
  – items = outgoing links
  – pages with similar references → about the same topic
• Scenario 2
  – baskets = web pages
  – items = incoming links
  – pages with similar in-links → mirrors, or the same topic
Mining Association Rules

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
• Goal – find all association rules such that
  – support ≥ s
  – confidence ≥ c
• Reduction to the Frequent Itemsets Problem
  – Find all frequent itemsets X
  – Given X = {A1, …, Ak}, generate all rules X − Aj → Aj (see the sketch below)
  – Confidence = sup(X) / sup(X − Aj)
  – Support = sup(X)
  – Exclude rules whose confidence is too low
    • Observe: X − Aj is also frequent ⇒ its support is already known
  – Finding all frequent itemsets is the hard part!
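A minimal sketch of this rule-generation step, assuming a dict sup that maps each frequent itemset (as a frozenset) to its support count, including all subsets of each itemset; the function name generate_rules is illustrative:

def generate_rules(sup, minconf):
    # For each frequent itemset X, emit X - {a} -> {a} when confident enough.
    # By monotonicity, X - {a} is also frequent, so sup[X - {a}] is known.
    rules = []
    for X, count_x in sup.items():
        if len(X) < 2:
            continue
        for a in X:
            body = X - {a}
            conf = count_x / sup[body]
            if conf >= minconf:
                rules.append((body, frozenset({a}), conf))
    return rules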
Itemset Lattice
[Figure: the lattice of all itemsets over items A–E, from the null set at the top down to ABCDE]

Given m items, there are 2^m − 1 possible candidate itemsets
Scale of Problem
• WalMart
  – sells m = 100,000 items
  – tracks n = 1,000,000,000 baskets
• Web
  – several billion pages
  – approximately one new "word" per page
• Exponential number of itemsets
  – m items → 2^m − 1 possible itemsets
  – Cannot possibly examine all itemsets for large m
  – Even itemsets of size 2 may be too many
  – m = 100,000 → 5 billion item pairs
Frequent Itemsets in SQL
• DBMSs are poorly suited to association rule mining
• Star schema
  – Sales Fact – Transaction ID degenerate dimension
  – Item dimension
• Finding frequent 3-itemsets (Fact1, Fact2, Fact3 are aliases for the same fact table):

SELECT Fact1.ItemID, Fact2.ItemID, Fact3.ItemID, COUNT(*)
FROM Fact AS Fact1
JOIN Fact AS Fact2
  ON Fact1.TID = Fact2.TID AND Fact1.ItemID < Fact2.ItemID
JOIN Fact AS Fact3
  ON Fact2.TID = Fact3.TID AND Fact2.ItemID < Fact3.ItemID
GROUP BY Fact1.ItemID, Fact2.ItemID, Fact3.ItemID
HAVING COUNT(*) > 1000

• Finding frequent k-itemsets requires joining k copies of the fact table
  – Joins are non-equijoins
  – Impossibly expensive!
Association Rules and Data Warehouses
• Typical procedure:
  – Use the data warehouse to apply filters
    • Mine association rules for certain regions, dates
  – Export all fact rows matching the filters to a flat file
    • Sort by transaction ID
    • Items in the same transaction are grouped together
  – Perform association rule mining on the flat file
• An alternative:
  – Database vendors are beginning to add specialized data mining capabilities
  – Efficient algorithms for common data mining tasks are built in to the database system
    • Decision trees, association rules, clustering, etc.
  – Not standardized yet
Finding Frequent Pairs
• Frequent 2-sets
  – already the hard case
  – focus on pairs for now; later extend to k-sets
• Naïve Algorithm (see the sketch below)
  – Counters – one for each of the m(m−1)/2 item pairs (m = # of distinct items)
  – Single pass – scanning all baskets
  – A basket of size b increments b(b−1)/2 counters
• Failure?
  – if memory < m(m−1)/2 counters
  – m = 100,000 → 5 billion item pairs
  – the naïve algorithm is impractical for large m
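A sketch of the naïve single-pass approach (names are illustrative); it works only while all m(m−1)/2 counters fit in memory:

from collections import Counter
from itertools import combinations

def naive_pair_counts(baskets):
    counts = Counter()
    for basket in baskets:
        # a basket of size b increments b(b-1)/2 counters
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts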
Pruning Candidate Itemsets
• Monotonicity principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• Monotonicity holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Converse:
  – If an itemset is infrequent, then all of its supersets must also be infrequent
Illustrating the Monotonicity Principle
[Figure: the itemset lattice over A–E; once an itemset (e.g., AB) is found to be infrequent, all of its supersets are pruned from the lattice]
A-Priori Algorithm
• A-Priori – 2-pass approach in limited memory (see the sketch below)
• Pass 1
  – m counters (one per candidate item)
  – Linear scan of the baskets b
  – Increment counters for each item in b
  – Mark as frequent the f items with count at least s
• Pass 2
  – f(f−1)/2 counters (one per candidate pair of frequent items)
  – Linear scan of the baskets b
  – Increment counters for each pair of frequent items in b
• Failure – if memory < f(f−1)/2 counters
  – Suppose that 10% of items are frequent
  – Memory is m²/200 counters vs. m²/2 for the naïve algorithm
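A minimal two-pass sketch for frequent pairs, following the passes above (names are illustrative):

from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # Pass 1: count items; keep the f items with count >= s
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only the f(f-1)/2 pairs of frequent items
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}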
Finding Larger Itemsets
• Goal – extend A-Priori to frequent k-sets, k > 2
• Monotonicity
  – itemset X is frequent only if X − {Xj} is frequent for every Xj ∈ X
• Idea
  – Stage k finds all frequent k-sets
  – Stage 1 gets all frequent items
  – Stage k maintains counters for all candidate k-sets
  – Candidates – k-sets whose (k−1)-subsets are all frequent (see the sketch below)
  – Total cost: number of passes = max size of a frequent itemset
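A sketch of stage-k candidate generation under monotonicity (illustrative names): a k-set is a candidate only if every one of its (k−1)-subsets is frequent.

def candidate_ksets(frequent_km1):
    # frequent_km1: set of frozensets, the frequent (k-1)-sets
    items = set().union(*frequent_km1)
    candidates = set()
    for base in frequent_km1:
        for item in items - base:
            c = base | {item}
            # keep c only if all of its (k-1)-subsets are frequent
            if all(c - {x} in frequent_km1 for x in c):
                candidates.add(c)
    return candidates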
A-Priori Algorithm – Example (Minimum Support = 3)

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) – no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates
With support-based pruning: 6 + 6 + 3 = 15
Memory Usage – A-Priori
[Figure: memory layout across the two passes. Pass 1: counters for all candidate items. Pass 2: the frequent-item table plus counters for the candidate pairs.]
PCY Idea
• Improvement upon A-Priori
  – Uses less memory
  – Proposed by Park, Chen, and Yu
• Observe – during Pass 1, most memory is idle
• Idea
  – Use the idle memory for a hash table H
  – Pass 1 – hash each pair from basket b into H
  – Increment the counter at the hash location
  – At the end – keep a bitmap of high-frequency hash locations
  – Pass 2 – the bitmap is an extra condition on candidate pairs
• Similar to bit-vector filtering in "Bloom join"
Memory Usage – PCY
[Figure: Pass 1: counters for candidate items plus the hash table. Pass 2: the frequent-item table, the bitmap summarizing the hash table, and counters for the candidate pairs.]
PCY Algorithm
• Pass 1 (see the sketch below)
  – m item counters and hash table H
  – Linear scan of the baskets b
  – Increment counters for each item in b
  – Increment the hash-table counter for each item pair in b
  – Mark as frequent the f items with count at least s
  – Summarize H as a bitmap (count ≥ s ⇒ bit = 1)
• Pass 2
  – Counters only for the F qualified pairs (Xi, Xj):
    • both items are frequent
    • the pair hashes to a frequent bucket (bit = 1)
  – Linear scan of the baskets b
  – Increment counters for the qualified candidate pairs in b
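A sketch of PCY for pairs under these passes; the parameter n_buckets and the use of Python's built-in hash (consistent within a single run) are assumptions for illustration:

from collections import Counter
from itertools import combinations

def pcy_pairs(baskets, s, n_buckets):
    # Pass 1: count items, and hash every pair into a bucket counter
    item_counts = Counter()
    buckets = [0] * n_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            buckets[hash(pair) % n_buckets] += 1
    frequent = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in buckets]  # summarize the hash table H
    # Pass 2: count a pair only if both items are frequent
    # and the pair hashes to a frequent bucket
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent)
        for pair in combinations(kept, 2):
            if bitmap[hash(pair) % n_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}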
Multistage PCY Algorithm
• Problem – false positives from hashing
• New Idea (see the sketch below)
  – Multiple rounds of hashing
  – After Pass 1, get the list of qualified pairs
  – In Pass 2, hash only the qualified pairs (with a second hash function)
  – Fewer pairs hash to each bucket ⇒ fewer false positives (buckets with count ≥ s that contain no pair of count ≥ s)
  – In Pass 3, infrequent pairs are less likely to qualify
• Repetition – reduces memory, but costs more passes
• Failure – if memory < O(f + F)
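A sketch of the extra hashing round, reusing the Pass-1 outputs (frequent, bitmap, n_buckets) of the PCY sketch above; h2 stands for a second, independent hash function and is an assumption:

from itertools import combinations

def second_bitmap(baskets, frequent, bitmap, s, n_buckets, h2):
    # Pass 2: rehash only the pairs that qualified under the first bitmap
    buckets2 = [0] * n_buckets
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent)
        for pair in combinations(kept, 2):
            if bitmap[hash(pair) % n_buckets]:
                buckets2[h2(pair) % n_buckets] += 1
    # Pass 3 then counts only pairs that pass both bitmaps,
    # so fewer infrequent pairs survive as false positives
    return [c >= s for c in buckets2]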
Memory Usage – Multistage PCY
[Figure: Pass 1: candidate-item counters plus Hash Table 1. Pass 2: frequent items, Bitmap 1, and Hash Table 2. Pass 3: frequent items, Bitmap 1, Bitmap 2, and counters for the candidate pairs.]
Approximation Techniques
• Goal
  – find all frequent k-sets
  – reduce to 2 passes over the data
  – must lose something → accuracy
• Approaches
  – Sampling algorithm
  – SON (Savasere, Omiecinski, Navathe) algorithm
  – Toivonen's algorithm
Sampling Algorithm
• Pass 1 – load a random sample of baskets into memory
• Run A-Priori (or an enhancement of it)
  – Scale down the support threshold (e.g., for a 1% sample, use s/100 as the support threshold)
  – Compute all frequent k-sets in memory from the sample
  – Need to leave enough space for the counters
• Pass 2 (see the sketch below)
  – Keep counters only for the frequent k-sets of the random sample
  – Get exact counts for the candidates to validate them
• Error?
  – No false positives (eliminated in Pass 2)
  – False negatives possible (X frequent overall, but not in the sample)
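A sketch of the two passes, reusing apriori_pairs from the A-Priori sketch and restricted to pairs for brevity; the sampling-fraction handling is an assumption:

import random
from collections import Counter
from itertools import combinations

def sampling_pairs(baskets, fraction, s):
    # Pass 1: mine a random sample with a scaled-down threshold
    sample = [b for b in baskets if random.random() < fraction]
    candidates = set(apriori_pairs(sample, s * fraction))
    # Pass 2: exact counts over all baskets, but only for the candidates
    exact = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            if pair in candidates:
                exact[pair] += 1
    # no false positives; false negatives if a frequent pair missed the sample
    return {p: c for p, c in exact.items() if c >= s}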
SON Algorithm
• Pass 1 – Batch Processing (see the sketch below)
  – Scan the data on disk
  – Repeatedly fill memory with a new batch of data
  – Run the sampling algorithm on each batch
  – Generate candidate frequent itemsets
• Candidate itemsets – those frequent in some batch
• Pass 2 – validate the candidate itemsets
• Monotonicity property:
  – Itemset X is frequent overall ⇒ X is frequent in at least one batch
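A sketch of SON for pairs, again reusing apriori_pairs; batch_size is illustrative, and the proportionally lowered per-batch threshold is the key idea:

from collections import Counter
from itertools import combinations

def son_pairs(baskets, batch_size, s):
    # Pass 1: mine each in-memory batch; by the monotonicity property,
    # anything frequent overall is frequent in at least one batch
    candidates = set()
    for i in range(0, len(baskets), batch_size):
        batch = baskets[i:i + batch_size]
        batch_s = s * len(batch) / len(baskets)
        candidates |= set(apriori_pairs(batch, batch_s))
    # Pass 2: validate the candidates with exact counts
    exact = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            if pair in candidates:
                exact[pair] += 1
    return {p: c for p, c in exact.items() if c >= s}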
Toivonen’s Algorithm
• Lower the threshold in the sampling algorithm
  – Example – if the support threshold is 1%, use 0.8% as the support threshold when evaluating the sample
  – Goal – overkill to avoid any false negatives
• Negative Border
  – Itemset X is infrequent in the sample, but all of its subsets are frequent
  – Example: AB, BC, AC frequent, but ABC infrequent
• Pass 2 (see the sketch below)
  – Count the candidates and the negative border
  – All negative-border itemsets infrequent ⇒ the frequent candidates are exactly the frequent itemsets
  – Otherwise? – start over!
• Achievement? – reduced failure probability, while keeping the candidate count low enough for memory
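A sketch of the Pass-2 decision, assuming exact counts have already been gathered for both the sample's frequent itemsets (candidates) and its negative border; all names are illustrative:

def toivonen_outcome(exact_counts, candidates, negative_border, s):
    # If any negative-border itemset is frequent overall, the sample
    # missed something: signal a restart with a fresh sample
    if any(exact_counts.get(x, 0) >= s for x in negative_border):
        return None  # start over
    # otherwise the frequent candidates are exactly the frequent itemsets
    return {x for x in candidates if exact_counts.get(x, 0) >= s}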