Performance and Scalability: Apriori Implementation.

Date post: 22-Dec-2015

Page 2:

Apriori

R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487-499, 1994

Page 3:

Reducing Number of Comparisons

Candidate counting:

Scan the database of transactions to determine the support of each candidate itemset

To reduce the number of comparisons, store the candidates in a hash structure

Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

[Figure: N transactions (left) are matched against a hash structure with k buckets (right).]
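
As a rough illustration of the bucketed counting described on this page, the sketch below stores candidate 2-itemsets in a dictionary of buckets and compares each 2-subset of a transaction only with the candidates in its own bucket. The candidate list, the bucket count K and the toy hash bucket_of are assumptions made for this example; they do not come from the slides.

```python
from collections import defaultdict
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Hypothetical candidate 2-itemsets, chosen for illustration only.
candidates = [
    frozenset(p) for p in [
        ("Bread", "Milk"), ("Bread", "Diaper"), ("Bread", "Beer"),
        ("Milk", "Diaper"), ("Milk", "Beer"), ("Beer", "Diaper"),
    ]
]

K = 4  # assumed number of buckets


def bucket_of(itemset):
    # Toy deterministic hash: total length of the item names, modulo K.
    return sum(len(item) for item in itemset) % K


# Store every candidate in its bucket.
buckets = defaultdict(list)
for c in candidates:
    buckets[bucket_of(c)].append(c)

support = {c: 0 for c in candidates}
comparisons = 0
for t in transactions:
    for pair in combinations(sorted(t), 2):
        s = frozenset(pair)
        # Only the candidates that share the bucket of s are compared with it.
        for c in buckets.get(bucket_of(s), []):
            comparisons += 1
            if c == s:
                support[c] += 1

for c in candidates:
    print(sorted(c), support[c])
print("comparisons:", comparisons)
```

The supports are the same as those from comparing every transaction with every candidate; only the number of comparisons changes, and how much it drops depends on how evenly the hash spreads the candidates.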

Page 4:

Generate Hash Tree

[Figure: hash tree for the candidate itemsets listed below. Hash function: items 1, 4, 7 go to one branch; 2, 5, 8 to a second; 3, 6, 9 to a third.]

Suppose you have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:

• Hash function

• Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
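
A minimal sketch of building such a tree, under two assumptions that are not stated on the slide: the hash function is taken to be item mod 3 (which sends 1,4,7 / 2,5,8 / 3,6,9 to three different branches), and the max leaf size is taken to be 3. The names Node, h and MAX_LEAF_SIZE are illustrative.

```python
CANDIDATES = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
MAX_LEAF_SIZE = 3  # assumed value; the slide does not give it


def h(item):
    # Assumed hash: 1,4,7 -> 1   2,5,8 -> 2   3,6,9 -> 0
    return item % 3


class Node:
    def __init__(self, depth):
        self.depth = depth      # position of the item this node hashes on
        self.itemsets = []      # contents while the node is a leaf
        self.children = None    # bucket -> Node once the node has been split

    def insert(self, itemset):
        if self.children is not None:
            # Interior node: follow the bucket of the item at this depth.
            bucket = h(itemset[self.depth])
            if bucket not in self.children:
                self.children[bucket] = Node(self.depth + 1)
            self.children[bucket].insert(itemset)
            return
        self.itemsets.append(itemset)
        # Split an overfull leaf while there is still an item left to hash on.
        if len(self.itemsets) > MAX_LEAF_SIZE and self.depth < len(itemset):
            pending, self.itemsets, self.children = self.itemsets, [], {}
            for s in pending:
                self.insert(s)

    def leaves(self):
        if self.children is None:
            return [self.itemsets]
        return [leaf for b in sorted(self.children)
                for leaf in self.children[b].leaves()]


root = Node(0)
for c in CANDIDATES:
    root.insert(c)

for leaf in root.leaves():
    print(leaf)
```

With these assumptions the 15 candidates end up in nine leaves, the largest holding three itemsets.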

Page 5:

Association Rule Discovery: Hash tree

[Figure: Hash Function (1,4,7 / 2,5,8 / 3,6,9) and Candidate Hash Tree. Hash on 1, 4 or 7.]

Page 6:

Association Rule Discovery: Hash tree

[Figure: Hash Function (1,4,7 / 2,5,8 / 3,6,9) and Candidate Hash Tree. Hash on 2, 5 or 8.]

Page 7:

Association Rule Discovery: Hash tree

[Figure: Hash Function (1,4,7 / 2,5,8 / 3,6,9) and Candidate Hash Tree. Hash on 3, 6 or 9.]

Page 8:

Subset Operation

Transaction, t: 1 2 3 5 6

[Figure: enumeration of the subsets of t, level by level.]
Level 1: 1 + 2 3 5 6, 2 + 3 5 6, 3 + 5 6
Level 2: 1 2 + 3 5 6, 1 3 + 5 6, 1 5 + 6, 2 3 + 5 6, 2 5 + 6, 3 5 + 6
Level 3 (Subsets of 3 items): 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6

Given a transaction t, what are the possible subsets of size 3?
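
One quick way to answer the question for this transaction is to enumerate the subsets directly. The sketch below uses Python's itertools.combinations, which yields the subsets in the same prefix order as the level-by-level enumeration (first item fixed, then the second, then the third).

```python
from itertools import combinations

t = [1, 2, 3, 5, 6]

# All size-3 subsets of t, in lexicographic (prefix) order.
subsets = list(combinations(t, 3))

for s in subsets:
    print(s)
print(len(subsets), "subsets of size 3")
```

A transaction with 5 items has C(5,3) = 10 subsets of size 3.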

Page 9:

Subset Operation Using Hash Tree

[Figure: the transaction 1 2 3 5 6 is hashed at the root of the candidate hash tree (hash function 1,4,7 / 2,5,8 / 3,6,9) as 1 + 2 3 5 6, 2 + 3 5 6 and 3 + 5 6.]

Page 10:

Subset Operation Using Hash Tree

[Figure: same candidate hash tree and hash function. Below the root split (1 + 2 3 5 6, 2 + 3 5 6, 3 + 5 6), the branch for 1 is hashed again as 1 2 + 3 5 6, 1 3 + 5 6 and 1 5 + 6.]

Page 11:

Subset Operation Using Hash Tree

[Figure: same candidate hash tree, hash function and transaction splits as on Page 10.]

Match transaction against 11 out of 15 candidates
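
The count above can be reproduced with a small sketch. It assumes the hash function is item mod 3 and the max leaf size is 3 (neither is stated on the slides), and it assumes the traversal rule that, at every interior node, the transaction is hashed on each item that comes after the item used to reach that node. The names build, visit and h are illustrative.

```python
CANDIDATES = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
MAX_LEAF_SIZE = 3  # assumed value


def h(item):
    # Assumed hash: 1,4,7 -> 1   2,5,8 -> 2   3,6,9 -> 0
    return item % 3


def build(itemsets, depth=0):
    """Return a leaf (list of itemsets) or an interior node (dict bucket -> subtree)."""
    if len(itemsets) <= MAX_LEAF_SIZE or depth == len(itemsets[0]):
        return list(itemsets)
    groups = {}
    for s in itemsets:
        groups.setdefault(h(s[depth]), []).append(s)
    return {b: build(g, depth + 1) for b, g in groups.items()}


def visit(node, transaction, start, reached):
    """Collect every leaf reachable by hashing on transaction[start:]."""
    if isinstance(node, list):
        reached[id(node)] = node  # keyed by identity so a leaf is counted once
        return
    for i in range(start, len(transaction)):
        child = node.get(h(transaction[i]))
        if child is not None:
            visit(child, transaction, i + 1, reached)


tree = build(CANDIDATES)
t = (1, 2, 3, 5, 6)
reached = {}
visit(tree, t, 0, reached)

# Only the itemsets in the reached leaves are compared with the transaction.
compared = [s for leaf in reached.values() for s in leaf]
matched = sorted(s for s in compared if set(s) <= set(t))

print(len(compared), "of", len(CANDIDATES), "candidates compared")  # 11 of 15
print(matched)  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```

A stricter traversal that hashes only on items that can still be extended to a full 3-subset would reach fewer leaves; the rule used here is the one that reproduces the 11 on the slide.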

Page 12:

Prefix Tree Representation

Christian Borgelt. Efficient Implementations of Apriori and Eclat. FIMI’03

Page 13:

Prefix Tree

Page 14:

Prefix Tree Structure for Counting
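
The figure for this page is not part of the transcript. As a loose sketch of the general idea only (not Borgelt's implementation), candidates can be stored in a prefix tree whose paths spell sorted itemsets, and a transaction is counted by descending only along items it contains. TrieNode, insert, count_transaction and the candidate list are illustrative assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}         # item -> TrieNode
        self.count = 0             # support of the itemset spelled by the path
        self.is_candidate = False  # True if the path ends a stored candidate


def insert(root, itemset):
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, TrieNode())
    node.is_candidate = True


def count_transaction(node, items, start=0):
    """Increment every stored candidate that is a subset of the sorted list items."""
    for i in range(start, len(items)):
        child = node.children.get(items[i])
        if child is not None:
            if child.is_candidate:
                child.count += 1
            count_transaction(child, items, i + 1)


def support(root, itemset):
    node = root
    for item in sorted(itemset):
        node = node.children[item]
    return node.count


transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
# Hypothetical candidate 2-itemsets, for illustration only.
candidates = [("Bread", "Milk"), ("Beer", "Diaper"), ("Bread", "Diaper"),
              ("Diaper", "Milk"), ("Beer", "Bread")]

root = TrieNode()
for c in candidates:
    insert(root, c)
for t in transactions:
    count_transaction(root, sorted(t))

for c in candidates:
    print(c, support(root, c))
```

Candidates that share a prefix share the nodes for that prefix, so the common part is matched against a transaction once rather than once per candidate.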

Page 15:

Other key optimizations

• Recording the items
  Why is this relevant?
• Transaction Tree
  Organize transactions into trees
  Count through two trees

Page 16:

Important websites:

• FIMI workshop: http://fimi.cs.helsinki.fi/
  Not only Apriori and FIM
  FP-tree, ECLAT, Closed, Maximal
• Christian Borgelt’s website: http://www.borgelt.net/software.html
• Ferenc Bodon’s website: http://www.cs.bme.hu/~bodon/en/apriori/

Page 17:

References:

Christian Borgelt, Efficient Implementations of Apriori and Eclat, FIMI’03

Ferenc Bodon, A fast APRIORI implementation, FIMI’03

Ferenc Bodon, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economics, 2006

Page 18:

Scalability

How to handle very large datasets?
• The dataset cannot be stored in main memory
• Performance of out-of-core datasets / performance of in-core datasets

Page 19:

Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

• Scan 1: partition database and find local frequent patterns
• Scan 2: consolidate global frequent patterns

A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB’95
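
The two scans can be sketched as follows. This is a toy version under stated assumptions: the local miner is a brute-force counter standing in for Apriori, the partitions are simple consecutive slices, and the function names and the example database are illustrative rather than taken from the paper.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def count_itemsets(transactions):
    """Brute-force support counts of every itemset that occurs (stand-in for Apriori)."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for r in range(1, len(items) + 1):
            counts.update(combinations(items, r))
    return counts


def local_frequent(partition, min_support):
    counts = count_itemsets(partition)
    return {s for s, c in counts.items() if c >= min_support * len(partition)}


def partition_mine(db, min_support, num_partitions):
    size = -(-len(db) // num_partitions)  # ceiling division
    parts = [db[i:i + size] for i in range(0, len(db), size)]

    # Scan 1: every locally frequent itemset becomes a global candidate.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_support)

    # Scan 2: count the candidates over the whole database.
    counts = Counter()
    for t in db:
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_support * len(db)}


db = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
min_support = Fraction(3, 5)  # relative support, kept exact to avoid rounding

result = partition_mine(db, min_support, num_partitions=2)
for itemset in sorted(result):
    print(itemset, result[itemset])
```

The relative threshold is applied unchanged inside each partition; that is what guarantees that a globally frequent itemset is locally frequent somewhere and therefore reaches the second scan.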

Page 20:

DHP: Reduce the Number of Candidates

A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent

• Candidates: a, b, c, d, e
• Hash entries: {ab, ad, ae} {bd, be, de} …
• Frequent 1-itemset: a, b, d, e
• ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold

J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95
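
A minimal sketch of the pruning step for 2-itemsets, assuming a small illustrative hash on item positions and 7 buckets; both are assumptions for this example, as are the function name and the data. While single items are counted, every pair in each transaction is also hashed into a bucket counter, and a pair of frequent items is kept as a candidate only if its bucket total reaches the threshold.

```python
from collections import Counter
from itertools import combinations


def dhp_candidate_pairs(transactions, min_count, num_buckets=7):
    order = {item: i for i, item in
             enumerate(sorted({x for t in transactions for x in t}))}

    def bucket(a, b):
        # Illustrative hash on item positions; any deterministic hash would do.
        return (order[a] * 10 + order[b]) % num_buckets

    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    # Single pass: count items and hash every pair of every transaction.
    for t in transactions:
        items = sorted(t)
        item_counts.update(items)
        for a, b in combinations(items, 2):
            bucket_counts[bucket(a, b)] += 1

    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    all_pairs = list(combinations(frequent_items, 2))
    # A pair whose bucket total is below the threshold cannot be frequent.
    kept = [p for p in all_pairs if bucket_counts[bucket(*p)] >= min_count]
    return all_pairs, kept


transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
all_pairs, kept = dhp_candidate_pairs(transactions, min_count=3)
print("pairs of frequent items:", len(all_pairs))
print("kept after bucket pruning:", len(kept))
```

Because a bucket total is an upper bound on the count of every pair hashed into it, the pruning never discards a frequent pair; collisions only make it less selective.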

Page 21:

Sampling for Frequent Patterns

• Select a sample of the original database; mine frequent patterns within the sample using Apriori
• Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  Example: check abcd instead of ab, ac, …, etc.
• Scan the database again to find missed frequent patterns

H. Toivonen. Sampling large databases for association rules. In VLDB’96
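
The sample-then-verify step can be sketched roughly as below. Several things here are assumptions for illustration: a brute-force miner stands in for Apriori, the sample is mined at a lowered threshold, and the verification scan counts the sample-frequent itemsets together with their negative border (the minimal itemsets not frequent in the sample). The sketch only reports whether a further scan would be needed; it does not perform it.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import combinations


def frequent_in(transactions, min_support):
    """Brute-force stand-in for Apriori: itemsets with relative support >= min_support."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for r in range(1, len(items) + 1):
            counts.update(frozenset(c) for c in combinations(items, r))
    return {s for s, c in counts.items() if c >= min_support * len(transactions)}


def negative_border(frequent, items):
    """Itemsets not in `frequent` whose proper subsets are all in `frequent`."""
    border = set()
    for base in frequent | {frozenset()}:
        for item in items - base:
            cand = base | {item}
            if cand not in frequent and all(
                len(cand) == 1 or cand - {x} in frequent for x in cand
            ):
                border.add(cand)
    return border


def sample_and_verify(db, sample, min_support, lowered_support):
    items = frozenset(x for t in db for x in t)
    in_sample = frequent_in(sample, lowered_support)
    to_check = in_sample | negative_border(in_sample, items)
    # One scan of the full database counts everything in to_check.
    counts = {s: sum(1 for t in db if s <= t) for s in to_check}
    found = {s for s in to_check if counts[s] >= min_support * len(db)}
    # Frequent itemsets from the border were missed by the sample.
    misses = found - in_sample
    return found, misses


db = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
rng = random.Random(0)
sample = rng.sample(db, 3)

found, misses = sample_and_verify(db, sample, Fraction(3, 5), Fraction(1, 2))
print(len(found), "frequent itemsets confirmed")
print("another scan needed" if misses else "no misses detected")
```

If no border itemset turns out to be frequent, the confirmed set is complete; otherwise some frequent supersets may still be missing, which is what the extra scan is for.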

Page 22:

DIC: Reduce Number of Scans

ABCD

ABC ABD ACD BCD

AB AC BC AD BD CD

A B C D

{}

Itemset lattice

Once both A and D are determined frequent, the counting of AD begins

Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

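[Figure: timeline over the transactions. Apriori counts 1-itemsets, then 2-itemsets, … one after another; DIC starts counting 1-itemsets, 2-itemsets and 3-itemsets at staggered points.]

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97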
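
A simplified sketch of the idea, not the paper's full algorithm: transactions are read in blocks, and at the end of each block any itemset whose immediate subsets all look frequent so far starts being counted. Each itemset is counted over exactly one full cycle of the database from the point where it started. The block size, the function name and the data are assumptions for this example.

```python
def dic(transactions, min_count, block_size):
    n = len(transactions)
    items = sorted({x for t in transactions for x in t})
    count = {frozenset([x]): 0 for x in items}  # support counters
    seen = dict.fromkeys(count, 0)  # transactions read since counting began
    active = set(count)             # itemsets still being counted
    pos = 0
    while active:
        # Read one block, wrapping around at the end of the database.
        for _ in range(block_size):
            t = transactions[pos]
            pos = (pos + 1) % n
            for s in active:
                if seen[s] < n:     # stop once the itemset has seen every transaction
                    seen[s] += 1
                    if s <= t:
                        count[s] += 1
        # Stop point: retire finished itemsets, then start new ones.
        active = {s for s in active if seen[s] < n}
        boxes = {s for s, c in count.items() if c >= min_count}
        for s in boxes:
            for x in items:
                cand = s | {x}
                if (x not in s and cand not in count
                        and all(cand - {y} in boxes for y in cand)):
                    count[cand] = 0
                    seen[cand] = 0
                    active.add(cand)
    return {s: c for s, c in count.items() if c >= min_count}


transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = dic(transactions, min_count=3, block_size=2)
for s in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(s), frequent[s])
```

On this toy data the 2-itemsets start being counted after the fourth transaction, before the first pass over the database has finished.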

