Transcript
Page 1: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

Data Mining Association Analysis: Basic Concepts

and Algorithms

Based on

Introduction to Data Mining by

Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Page 2

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 3

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Page 4

Definition: Frequent Itemset

Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Page 5

Definition: Association Rule

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

s = σ(X ∪ Y) / |T|

c = σ(X ∪ Y) / σ(X)
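The two metrics can be computed directly from the transaction table above. A minimal Python sketch (the variable names are illustrative, not from the slides):

```python
# Market-basket transactions from the slide
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in T if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y) / len(T)       # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)  # 2/3, about 0.67
```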

Page 6

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold

– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules

– Compute the support and confidence for each rule

– Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!

Page 7

Computational Complexity

Given d unique items:
– Total number of itemsets = 2^d

– Total number of possible association rules:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d=6, R = 602 rules
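The closed form can be checked against the double sum directly. A small sketch, assuming nothing beyond the formula on this slide:

```python
from math import comb

def rule_count(d):
    # Choose a nonempty antecedent of size k, then a nonempty
    # consequent of size j from the remaining d - k items.
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

assert rule_count(6) == 3**6 - 2**7 + 1 == 602
```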

Page 8

Mining Association Rules

Example of Rules:

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements
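The six rules above can be enumerated mechanically: every nonempty proper subset of {Milk, Diaper, Beer} becomes an antecedent. A sketch (names are illustrative):

```python
from itertools import combinations

T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
itemset = frozenset({"Milk", "Diaper", "Beer"})
sigma = lambda s: sum(1 for t in T if s <= t)

rules = []
for k in range(1, len(itemset)):              # nonempty proper antecedents
    for ant in combinations(sorted(itemset), k):
        X = frozenset(ant)
        s = sigma(itemset) / len(T)           # identical for every rule
        c = sigma(itemset) / sigma(X)         # varies with the antecedent
        rules.append((set(X), set(itemset - X), round(s, 2), round(c, 2)))
```

All six rules share s = 0.4; only the confidence changes, which is why support and confidence can be checked in separate steps.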

Page 9

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive

Page 10

Frequent Itemset Generation

[Itemset lattice over items A, B, C, D, E: from the null set at the top, through all 1-, 2-, 3-, and 4-itemsets, down to ABCDE]

Given d items, there are 2^d possible candidate itemsets

Page 11

Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset

– Count the support of each candidate by scanning the database

– Match each transaction against every candidate

– Complexity ~ O(MNw) => expensive since M = 2^d !!!

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

[Figure: N transactions matched against a list of M candidates; w is the maximum transaction width]

Page 12

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 13

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d

– Use pruning techniques to reduce M

Reduce the number of comparisons (MN)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

Reduce the number of transactions (N)
– Transactions that do not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets

Page 14

Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
– Or equivalently, if an itemset is NOT frequent, then none of its supersets can be frequent

Apriori principle holds due to the following property of the support of an itemset:
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support: support never increases as more items are added to the set

∀X, Y: (X ⊆ Y) ⇒ s(Y) ≤ s(X)
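The anti-monotone property is easy to observe on the market-basket table. A minimal sketch:

```python
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def s(itemset):
    """Support: fraction of transactions containing the itemset."""
    return sum(1 for t in T if set(itemset) <= t) / len(T)

X = {"Milk", "Diaper"}
for extra in ["Beer", "Bread", "Coke"]:
    Y = X | {extra}
    assert s(Y) <= s(X)   # adding an item can never raise support
```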

Page 15

Illustrating Apriori Principle

[Lattice figure over items A–E: once an itemset (e.g., AB) is found to be infrequent, all of its supersets are pruned from the search space]

Page 16

Apriori Algorithm

Method:

1. Let k = 1
2. Generate frequent itemsets of length 1
3. Repeat until no new frequent itemsets are identified
   a) Generate candidate (k+1)-itemsets by joining pairs of frequent k-itemsets that differ in only one item
   b) Prune candidate itemsets containing subsets of length k that are infrequent
   c) Count the support of each candidate by scanning the DB
   d) Prune candidates that are infrequent, leaving only those that are frequent
   e) Increment k

Page 17

Generating Frequent 1-Itemsets

Consider a set of transactions T with 6 unique items
– suppose minimum support count = 3

Apriori makes a first pass through T to obtain support counts for the candidate 1-itemsets

Itemset
{Bread}
{Coke}
{Milk}
{Beer}
{Diaper}
{Eggs}

Candidates for frequent 1-itemsets

Obtain support count

Itemset Count
{Bread} 4
{Coke} 2
{Milk} 4
{Beer} 3
{Diaper} 4
{Eggs} 1

Coke & Eggs are infrequent

Itemset Count
{Bread} 4
{Milk} 4
{Beer} 3
{Diaper} 4

Prune infrequent itemsets

Frequent 1-itemsets

Page 18

Generating Frequent 2-Itemsets

Itemset Count
{Bread} 4
{Milk} 4
{Beer} 3
{Diaper} 4

Frequent 1-itemsets

Step 3.a: Generate candidates

Itemset
{Bread,Milk}
{Bread,Beer}
{Bread,Diaper}
{Milk,Beer}
{Milk,Diaper}
{Beer,Diaper}

Candidates for frequent 2-itemsets

Step 3.b: No pruning due to infrequent subsets

Step 3.c: Obtain support counts

Itemset Count
{Bread,Milk} 3
{Bread,Beer} 2
{Bread,Diaper} 3
{Milk,Beer} 2
{Milk,Diaper} 3
{Beer,Diaper} 3

{Bread,Beer} & {Milk,Beer} are infrequent

Step 3.d: Prune infrequent itemsets

Itemset Count
{Bread,Milk} 3
{Bread,Diaper} 3
{Milk,Diaper} 3
{Beer,Diaper} 3

Frequent 2-itemsets

Page 19

Generating Frequent 3-Itemsets

Itemset Count
{Bread,Milk} 3
{Bread,Diaper} 3
{Diaper,Milk} 3
{Beer,Diaper} 3

Frequent 2-itemsets

Step 3.a: Generate candidates

Itemset
{Bread,Diaper,Milk}
{Beer,Bread,Diaper}
{Beer,Diaper,Milk}

Candidates for frequent 3-itemsets

Step 3.b: Prune due to infrequent subsets
– E.g., {Beer,Bread,Diaper} is pruned since its subset {Bread,Beer} is infrequent (likewise {Beer,Diaper,Milk}, since {Milk,Beer} is infrequent)

Itemset
{Bread,Diaper,Milk}

Step 3.c: Obtain support counts

Itemset Count
{Bread,Milk,Diaper} 3

Step 3.d: nothing to prune — the remaining candidate has sufficient support

Frequent 3-itemsets

# of candidates considered:
• without pruning: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 (there are C(6,1) candidates for frequent 1-itemsets)
• with pruning: C(6,1) + C(4,2) + 3 = 6 + 6 + 3 = 15
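These counts follow directly from the binomial coefficients. A quick check:

```python
from math import comb

# without pruning: all size-1..3 subsets of the 6 items are candidates
assert comb(6, 1) + comb(6, 2) + comb(6, 3) == 6 + 15 + 20 == 41
# with pruning: 6 candidate 1-itemsets, pairs of the 4 frequent items,
# and the 3 candidate triples produced by step 3.a
assert comb(6, 1) + comb(4, 2) + 3 == 6 + 6 + 3 == 15
```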

Page 20

Reducing Number of Comparisons

Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure

Instead of matching each transaction against every candidate, match it against only the candidates contained in the hashed buckets

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

[Figure: N transactions matched against a hash structure of k buckets]

Page 21

Generate Hash Tree

[Hash tree figure: the 15 candidate 3-itemsets are distributed into leaf nodes by hashing items with h(x) = x mod 3, so items 1, 4, 7 share one branch, items 2, 5, 8 another, and items 3, 6, 9 the third]

Suppose you have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

• Items in itemsets & transactions are ordered (in the same way)

• Hash function

• Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)

hash on first item

on second item

on third

max leaf size = 3
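The construction rules above can be sketched as a small recursive structure. This is an illustrative sketch (the class and function names are not from the slides), and the exact leaf layout may differ from the figure depending on insertion order:

```python
MAX_LEAF = 3

def h(x):
    return x % 3   # groups items 1,4,7 / 2,5,8 / 3,6,9

class Node:
    def __init__(self, depth):
        self.depth = depth
        self.children = {}    # hash value -> child Node
        self.itemsets = []    # candidates stored at a leaf
        self.is_leaf = True

    def insert(self, itemset):
        if not self.is_leaf:
            self._route(itemset)
            return
        self.itemsets.append(itemset)
        # split an overfull leaf, if we can still hash on a deeper item
        if len(self.itemsets) > MAX_LEAF and self.depth < len(itemset):
            self.is_leaf = False
            for stored in self.itemsets:
                self._route(stored)
            self.itemsets = []

    def _route(self, itemset):
        key = h(itemset[self.depth])
        child = self.children.setdefault(key, Node(self.depth + 1))
        child.insert(itemset)

def collect(node):
    """Gather every candidate stored anywhere in the tree."""
    if node.is_leaf:
        return list(node.itemsets)
    return [s for c in node.children.values() for s in collect(c)]

CANDIDATES = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9),
              (1,3,6), (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7),
              (6,8,9), (3,6,7), (3,6,8)]

root = Node(0)
for cand in CANDIDATES:
    root.insert(cand)
```

Inserting the 15 candidates forces the root to split on the first item; `collect(root)` returns all 15 itemsets regardless of the final leaf layout.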

Page 22

Association Rule Discovery: Hash tree

[Candidate hash tree figure, hash function h(x) = x mod 3; highlighted: the branch taken when the first item is 1, 4, or 7]

Page 23

Association Rule Discovery: Hash tree

[Candidate hash tree figure, hash function h(x) = x mod 3; at the second level the branches for second item 1/4/7, 2/5/8, and 3/6/9 are highlighted]

Page 24

Association Rule Discovery: Hash tree

[Candidate hash tree figure; highlighted: the branch taken when the first item is 2, 5, or 8]

Page 25

Association Rule Discovery: Hash tree

[Candidate hash tree figure; highlighted: the branch taken when the first item is 3, 6, or 9]

Page 26

Association Rule Discovery: Hash tree

[Candidate hash tree figure; an overfull node is split by hashing on the second item]

Page 27

Association Rule Discovery: Hash tree

[Candidate hash tree figure: the leaf holding {3 5 6}, {3 5 7}, {6 8 9} is not split further since its size <= max leaf size; if it were split by the third item, {3 5 6} and {6 8 9} would share one child and {3 5 7} would sit in another]

Page 28

Subset Operation

Transaction t = {1, 2, 3, 5, 6}

[Figure: enumerating the size-3 subsets of t level by level — Level 1 fixes the first item (1, 2, or 3), Level 2 fixes the second, Level 3 completes the subset, giving: 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6]

Given a transaction t, what are the possible subsets of size 3?
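The enumeration in the figure is exactly what `itertools.combinations` produces, since both keep items in their original order:

```python
from itertools import combinations

t = [1, 2, 3, 5, 6]
subsets = list(combinations(t, 3))   # C(5,3) = 10 size-3 subsets
```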

Page 29

Subset Operation Using Hash Tree

[Hash tree figure, h(x) = x mod 3: transaction {1 2 3 5 6} is split at the root into 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}, and each prefix is hashed to a branch]

Page 30

Subset Operation Using Hash Tree

[Hash tree figure: at the second level the prefixes are expanded further — 1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6}, 2 + {3 5 6}, 3 + {5 6} — and hashed down the tree]

Page 31

Subset Operation Using Hash Tree

[Hash tree figure: following every prefix of transaction {1 2 3 5 6} down the tree reaches only the leaves that can contain one of its size-3 subsets]

5 leaves visited, 9 candidates compared against the transaction's 3-itemsets

Page 32

Factors Affecting Complexity of Apriori

Choice of minimum support threshold
– lowering the support threshold results in more frequent itemsets
– this may increase (1) the number of candidates and (2) the max length of frequent itemsets

Dimensionality (number of unique items) of the data set
– more candidates & more space needed to store their support counts
– if the number of frequent items also increases, both computation and I/O costs may increase

Size of database
– since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
– transaction width increases with denser data sets
– this may increase the max length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width)

Page 33

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 34

Compact Representation of Frequent Itemsets

Itemsets with positive support counts
– 3 groups: {A1, …, A10}, {B1, …, B10}, {C1, …, C10}
– no transactions contain items from different groups

All subsets of items within the same group have the same support count
– e.g., support counts of {A1}, {A1, A2}, {A1, A2, A3} are all 5

TID 1–5: items A1 … A10 (all 1s); B and C columns all 0
TID 6–10: items B1 … B10 (all 1s); A and C columns all 0
TID 11–15: items C1 … C10 (all 1s); A and B columns all 0

Page 35

Compact Representation of Frequent Itemsets

Suppose the minimum support count is 3

Number of frequent itemsets: 3 × Σ_{k=1}^{10} C(10, k) = 3 × (2^10 − 1) = 3069

⇒ need a compact representation
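The count follows because every nonempty subset of a 10-item group is frequent, and there are three groups. A quick check:

```python
from math import comb

# nonempty subsets of one 10-item group
per_group = sum(comb(10, k) for k in range(1, 11))
assert per_group == 2**10 - 1
total = 3 * per_group   # frequent itemsets across the three groups
```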

Page 36

Maximal Frequent Itemset

[Itemset lattice over items A–E with a border separating frequent itemsets (above) from infrequent itemsets (below); the maximal itemsets lie just above the border]

An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent

Page 37

Maximal Frequent Itemset

[Itemset lattice with the border between frequent and infrequent itemsets; maximal itemsets marked]

Every frequent itemset is either maximal or a subset of some maximal frequent itemset

Page 38

Properties of Maximal Frequent Itemsets

If level k is the highest level with maximal frequent itemsets, then level k only contains maximal frequent itemsets or infrequent itemsets

[Itemset lattice (levels 0–5, null through ABCDE) with the border between frequent and infrequent itemsets; maximal itemsets marked]

Page 39

Properties of Maximal Frequent Itemsets

Furthermore, level k + 1 (if it exists) only contains infrequent itemsets

[Itemset lattice (levels 0–5, null through ABCDE) with the border between frequent and infrequent itemsets; maximal itemsets marked]

Page 40

Maximal Frequent Itemset

[Itemset lattice with the border between frequent and infrequent itemsets; maximal itemsets marked]

Knowing the maximal frequent itemsets determines all frequent itemsets, but not their supports

Page 41

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset

TID Items

1 ABC

2 ABCD

3 BCE

4 ACDE

5 DE

[Itemset lattice annotated with support counts — e.g., σ(A)=3, σ(B)=3, σ(C)=4, σ(D)=3, σ(E)=3; itemsets with count 0 are not supported by any transaction; the closed itemsets are circled]

Page 42

Properties of Closed Itemsets

Every itemset is either a closed itemset or has the same support as one of its immediate supersets

[Itemset lattice with support counts illustrating the property above]

Page 43

Properties of Closed Itemsets

If X is not closed, X has the same support as its immediate superset with the largest support

Suppose Y1, …, Yk are X's immediate supersets & Yi has the largest support

σ(X) ≥ σ(Y1), σ(Y2), …, σ(Yk)
     ≥ max{σ(Y1), …, σ(Yk)}
     = σ(Yi)

I.e., σ(X) ≥ σ(Yi); since X is not closed, σ(X) = σ(Yi)

Page 44

Properties of Closed Itemsets

Suppose the highest level of the lattice is level k
– the itemset at level k (there is only one k-itemset) must be closed

[Itemset lattice (levels 0–5) with support counts]

Page 45

Properties of Closed Itemsets

Can compute support counts of all itemsets from support counts of closed itemsets

[Itemset lattice (levels 0–5) with support counts]

Page 46

Properties of Closed Itemsets

Compute support of itemsets on level k from those in level k + 1: if an itemset on level k is not closed, we can compute its support from those at level k + 1

[Itemset lattice (levels 0–5) with support counts]

Page 47

Closed Frequent Itemsets

An itemset is a closed frequent itemset (shown in shaded ovals) if it is both frequent & closed

Minimum support count = 2

[Itemset lattice with support counts; the closed frequent itemsets are shown in shaded ovals]

# frequent = 14

# closed frequent = 9
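The counts on this slide can be reproduced by brute force over the 5-transaction table (ABC, ABCD, BCE, ACDE, DE). A sketch — the helper names are illustrative only:

```python
from itertools import combinations

T = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
items = sorted(set().union(*T))
minsup = 2   # minimum support count

def sigma(s):
    return sum(1 for t in T if set(s) <= t)

frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sigma(c) >= minsup]

def immediate_supersets(s):
    return [s | {i} for i in items if i not in s]

# closed: no immediate superset has the same support
closed = [s for s in frequent
          if all(sigma(sup) < sigma(s) for sup in immediate_supersets(s))]
# maximal: no immediate superset is frequent
maximal = [s for s in frequent
           if all(sigma(sup) < minsup for sup in immediate_supersets(s))]

print(len(frequent), len(closed), len(maximal))   # 14 9 4
```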

Page 48

Frequent Itemsets: Maximal vs Closed

Every maximal frequent itemset is a closed frequent itemset

[Itemset lattice with support counts (minimum support count = 2); itemsets are marked either "closed and maximal" or "closed but not maximal"]

# closed frequent = 9

# maximal frequent = 4

Page 49: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 49

Frequent Itemsets: Maximal vs Closed

Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets

Page 50: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 50

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 51: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 51

Alternative Methods for Frequent Itemset

Generation

Traversal of itemset lattice– Breadth-first: e.g., Apriori algorithm

– Depth-first: often used to search maximal frequent itemsets

[Figure: (a) breadth-first and (b) depth-first traversal of the itemset lattice]

Page 52: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 52

Alternative Methods for Frequent Itemset

Generation

Depth-first search finds maximal frequent itemsets more quickly, and it also enables substantial pruning

– if abcd is maximal, none of its subsets (e.g., bc) can be maximal; but supersets of bc that are not subsets of abcd (e.g., bce) may still be maximal


Page 53: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 53

Alternative Methods for Frequent Itemset

Generation

Traversal of itemset lattice– General-to-specific, specific-to-general, & bidirectional

[Figure: frequent itemset border in the lattice from null to {a1,a2,...,an} under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional search]

Page 54: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 54

Alternative Methods for Frequent Itemset

Generation

General-to-specific: going from k-itemsets to (k+1)-itemsets
– adopted by Apriori; effective when the border (between frequent and infrequent itemsets) is near the top of the lattice

Page 55: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 55

Alternative Methods for Frequent Itemset

Generation

Specific-to-general: going from k+1 to k-itemsets– often used to find maximal frequent itemsets

– effective when the border is near the bottom of the lattice

Page 56: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 56

Alternative Methods for Frequent Itemset

Generation

Bidirectional: going in both directions at the same time
– need more space for storing candidates
– may quickly discover the border in situations like (c)

Page 57: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 57

Alternative Methods for Frequent Itemset

Generation

Traverse lattice by equivalence classes, formed by
– having the same number of items in the itemset (e.g., Apriori)
– or having the same prefix or suffix

[Figure: (a) prefix tree, itemsets grouped by common prefix, e.g., A; (b) suffix tree, itemsets grouped by common suffix, e.g., D]

Page 58: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 58

Storage Formats for Transactions

Horizontal: store a set/list of items associated with each transaction (e.g., Apriori)

Vertical: store a set/list of transactions associated with each item

Horizontal data layout:

TID  Items
 1   A,B,E
 2   B,C,D
 3   C,E
 4   A,C,D
 5   A,B,C,D
 6   A,E
 7   A,B
 8   A,B,C
 9   A,C,D
10   B

Vertical data layout (one TID-list per item):

A: 1,4,5,6,7,8,9
B: 1,2,5,7,8,10
C: 2,3,4,5,8,9
D: 2,4,5,9
E: 1,3,6

Page 59: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 59

Vertical Data Layout

Searching support = intersecting TID-lists– length of TID-list shrinks as size of itemsets grows

– problem: initial lists may be too large to fit in memory

Need a more compact representation ⇒ FP-growth
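Support counting by TID-list intersection can be sketched as follows, using the TID-lists derived from the 10-transaction table above:

```python
# Vertical layout: each item maps to the set of TIDs containing it
# (taken from the transaction table on the slide).
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 5, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

def support(itemset):
    """Support count = size of the intersection of the items' TID-lists."""
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids)

assert support(["A", "B"]) == 4        # TIDs 1, 5, 7, 8
assert support(["C", "D"]) == 4        # TIDs 2, 4, 5, 9
assert support(["A", "C", "D"]) == 3   # TIDs 4, 5, 9
```

Note how the intersected lists shrink as itemsets grow, which is the property the slide points out.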


Page 60: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 60

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 61: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 61

FP-growth Algorithm: Key Ideas

Limitations of Apriori
– need to generate a large number of candidates, e.g., 1K frequent 1-itemsets → about 1M candidate 2-itemsets

– need to repeatedly scan the database

FP-growth: frequent pattern growth– discover frequent itemsets w/o generating candidates

– often just need to scan databases twice

Page 62: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 62

FP-growth Algorithm

Step 1:– scan database to discover frequent 1-itemsets

– scan database the second time to build a compact representation of transactions in form of FP-tree

Step 2: – use the constructed FP-tree to recursively find

frequent itemsets via divide-and-conquer: turn problem of finding k-itemsets into a set of subproblems, each finding k-itemsets ending in a different suffix

Page 63: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 63

FP-tree & Construction

Transactions represented by paths from root – items in transactions are ordered

– node = item:# of transactions having path to item

Transaction database:

TID  Items
 1   {A,B}
 2   {B,C,D}
 3   {A,C,D,E}
 4   {A,D,E}
 5   {A,B,C}
 6   {A,B,C,D}
 7   {A}
 8   {A,B,C}
 9   {A,B,D}
10   {B,C,E}

After reading TID=1: null → A:1 → B:1
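The insertion of ordered transactions along shared paths can be sketched in Python. This is a simplified version: the header table and node links are omitted, and `Node` is a hypothetical helper class, not code from the textbook:

```python
class Node:
    """FP-tree node (minimal): item label, transaction count, children."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, order):
    """Insert each transaction along a path from the root, with items
    sorted by the given (support-based) order; transactions sharing a
    prefix share nodes, whose counts are incremented."""
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted(t, key=order.index):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root

# The 10 transactions from the slide; order = decreasing support.
transactions = [
    {"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"}, {"A","B","C"},
    {"A","B","C","D"}, {"A"}, {"A","B","C"}, {"A","B","D"}, {"B","C","E"},
]
root = build_fp_tree(transactions, order=["A", "B", "C", "D", "E"])

assert root.children["A"].count == 8                  # 8 transactions contain A
assert root.children["A"].children["B"].count == 5    # shared A,B prefix
assert root.children["B"].count == 2                  # TIDs 2 and 10
```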

Page 64: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 64

FP-tree & Construction

Transactions represented by paths from root – transactions with common prefixes (e.g., T1 and T3)

share paths for their prefixes

[After reading TID=1: null → A:1 → B:1]

Page 65: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 65

FP-tree & Construction

Items often ordered in decreasing support counts– s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3

the order is: A, B, C, D, E


Page 66: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 66

FP-tree & Construction

Items often ordered in decreasing support counts– s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3

– a pass over database needed for getting these counts


Page 67: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 67

FP-tree & Construction

Items often ordered in decreasing support counts– with this order, transactions tend to share paths

smaller tree: low branching factor, less bushy


Page 68: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 68

FP-tree Construction

Nodes for the same item on different paths are connected via node links

[After reading TID=2: null → A:1 → B:1 and null → B:1 → C:1 → D:1; the two B nodes are connected by a node link]

Page 69: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 69

FP-tree Construction

Shared paths between TID=1 and TID=3

[After reading TID=3: null → A:2 → B:1; A:2 → C:1 → D:1 → E:1; null → B:1 → C:1 → D:1. TID 1 and TID 3 share the A node]

Page 70: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 70

FP-tree Construction: Completed Tree

Completed FP-tree (header table with a pointer per item A, B, C, D, E into the node links):

null
  A:8
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:2
    C:2
      D:1
      E:1

Pointers are used to assist frequent itemset generation

Page 71: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 71

FP-growth: Finding Frequent Itemsets

Take FP-tree generated in step 1 as input

Find frequent itemsets with common suffixes
– e.g., suppose the order of items is: A, B, C, D, E
– first find itemsets in reverse order: those ending with E, then D, C, …
– start with the items of lowest support, so pruning is more likely

For each particular suffix, find frequent itemsets via divide-and-conquer, e.g., consider itemsets ending w/ E

1. obtain sub-tree of FP-tree where all paths end with E

2. compute support counts of E; if E is frequent,

3. check & prepare to solve subproblems: DE, CE, BE, AE

4. recursively compute each subproblem in a similar fashion
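The suffix-by-suffix divide-and-conquer just described can be sketched without an explicit tree, by projecting (itemset, count) pattern bases. This is a simplified illustration of the same recursion, not the pointer-based FP-tree algorithm itself; a fixed lexicographic order stands in for the support-based order:

```python
from collections import defaultdict

def mine(patterns, minsup, suffix=()):
    """Count items in the current (conditional) pattern base; for each
    frequent item, report it together with the current suffix, then recurse
    on that item's conditional pattern base (items preceding it in the
    fixed order), mirroring the suffix-based subproblems of FP-growth."""
    counts = defaultdict(int)
    for itemset, c in patterns:
        for item in itemset:
            counts[item] += c
    for item in sorted(counts):
        if counts[item] < minsup:
            continue
        yield tuple(sorted((item,) + suffix)), counts[item]
        conditional = [([i for i in itemset if i < item], c)
                       for itemset, c in patterns if item in itemset]
        yield from mine(conditional, minsup, (item,) + suffix)

transactions = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
                {"A","B","C"}, {"A","B","C","D"}, {"A"}, {"A","B","C"},
                {"A","B","D"}, {"B","C","E"}]
frequent = dict(mine([(t, 1) for t in transactions], minsup=2))

assert frequent[("A", "D", "E")] == 2       # ADE is frequent
assert frequent[("C", "E")] == 2
assert ("B", "E") not in frequent           # BE has support 1 only
assert frequent[("A",)] == 8
```

On the slide's database with minimum support count 2 this recovers exactly the suffix-E itemsets found later in the deck: E, DE, ADE, CE, AE.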

Page 72: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 72

FP-growth: Finding Frequent Itemsets

For each particular suffix, find frequent itemsets via divide-and-conquer, e.g., consider itemsets ending with E:
1. obtain sub-tree of FP-tree where all paths end with E
2. compute support count of E; if E is frequent,
3. transform the sub-tree for suffix E into the conditional FP-tree for suffix E
– update counts
– remove leaves (E's)
– prune infrequent itemsets
4. use the conditional FP-tree for E to recursively solve DE, CE, BE, AE in a similar fashion (as using the FP-tree to solve E, D, C, B, A)

Page 73: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 73

Traversal of Lattice by FP-growth

[Figure: itemset lattice partitioned into suffix-based subproblems (numbered 1-4), with the common-suffix-E branch highlighted]

Page 74: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 74

Discover Frequent Itemsets via FP-tree

[Figure: same lattice; the FP-tree is projected into the sub-tree for suffix E, and that in turn into the sub-tree for suffix DE]

Page 75: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 75

Obtaining Sub-Tree for Suffix E

Find first path containing E (by looking up in the header table), find rest by following node links

[Figure: completed FP-tree with header table (items A, B, C, D, E); the three E nodes are reached via the header-table pointer for E and its node links]

Page 76: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 76

Sub-Tree for Suffix E

Remove all of E's descendants and all nodes that are not ancestors of an E node

[Sub-tree for suffix E: null → A:8 → C:1 → D:1 → E:1; A:8 → D:1 → E:1; null → B:2 → C:2 → E:1]

Paths in the sub-tree are the prefix paths ending in E

Page 77: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 77

Current Position of Processing

[Figure: lattice position; sub-tree for suffix E just obtained]

Page 78: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 78

Determine if E is Frequent

Support count of E = sum of counts of all E nodes– suppose minimum support count is 2

– support count of E = 3, so E is frequent

[Sub-tree for suffix E with three E:1 leaves: null → A:8 → C:1 → D:1 → E:1; A:8 → D:1 → E:1; B:2 → C:2 → E:1]

Page 79: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 79

Prepare for Solving DE, CE, BE, and AE

Turn the original sub-tree into a conditional FP-tree by:
1. update counts on prefix paths
2. remove E nodes
3. prune infrequent itemsets

[Figure: the original sub-tree for suffix E]

Page 80: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 80

Update Counts on Prefix Paths

Counts for E (leaf) nodes are correct

Counts for internal nodes may not be correct
– due to removal of paths which do not contain E

[Sub-tree for suffix E, counts not yet updated: the path B,C,D (TID 2) was removed, so the count C:2 is no longer correct]

Page 81: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 81

Update Counts on Prefix Paths

Start from leaves, going upward– if node X has only child Y, count of X = count of Y

– otherwise, count of X = sum of counts of X’s children
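The bottom-up update rule can be sketched with a hypothetical `Node` class (a node with children takes the sum of its children's counts; leaf counts are already correct, and a single child is just the one-term sum):

```python
class Node:
    """Hypothetical minimal node: a count and a dict of children."""
    def __init__(self, count=0, children=None):
        self.count, self.children = count, dict(children or {})

def update_counts(node):
    """Bottom-up pass: an internal node's count becomes the sum of its
    children's counts; leaf (E) counts are left as-is."""
    if node.children:
        node.count = sum(update_counts(c) for c in node.children.values())
    return node.count

# Prefix paths ending in E from the slides, with stale internal counts.
a = Node(8, {"C": Node(1, {"D": Node(1, {"E": Node(1)})}),
             "D": Node(1, {"E": Node(1)})})
b = Node(2, {"C": Node(2, {"E": Node(1)})})
root = Node(0, {"A": a, "B": b})
update_counts(root)

assert a.count == 2    # A:8 becomes A:2 (only the two E-paths remain)
assert b.count == 1    # B:2 becomes B:1, since path B,C,D was removed
```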

[After updating counts: null → A:2 → C:1 → D:1 → E:1; A:2 → D:1 → E:1; B:1 → C:1 → E:1 (previously A:8, B:2, C:2)]

Page 82: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 82

Remove E Nodes

E nodes can be removed– counts of internal nodes have been updated

– contain no more information for solving: DE, CE, …

[After removing E nodes: null → A:2 → C:1 → D:1; A:2 → D:1; B:1 → C:1]

Page 83: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 83

Prune Infrequent Itemsets

If the sum of counts of all X nodes < minimum support count, remove X
– since XE cannot be frequent

[B has total count 1 < 2, so B is removed and its child C:1 is attached to the root. Conditional FP-tree for E: null → A:2 → C:1 → D:1; A:2 → D:1; null → C:1]

Page 84: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 84

Current Position of Processing

[Figure: lattice position; sub-tree for suffix E turned into the conditional FP-tree for suffix E: counts updated, leaves (E's) removed, infrequent itemsets pruned]

Page 85: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 85

Current Position of Processing

[Figure: the conditional FP-tree for suffix E is then used to solve DE, CE, BE, AE, just as the FP-tree was used to solve E, D, C, B, A]

Page 86: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 86

Solving for DE

From conditional FP-tree for E, obtain sub-tree for suffix DE

[Conditional FP-tree for E: null → A:2 → C:1 → D:1; A:2 → D:1; null → C:1. Sub-tree for suffix DE: null → A:2 → C:1 → D:1; A:2 → D:1]

Page 87: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 87

Current Position of Processing

[Figure: lattice position; sub-tree for suffix DE just obtained]

Page 88: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 88

Solving for DE

Support count of DE = 2 (sum of counts of all D’s)– DE is frequent, need to solve: CDE, BDE, ADE

[Sub-tree for suffix DE: null → A:2 → C:1 → D:1; A:2 → D:1]

Page 89: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 89

Preparing for Solving CDE, BDE, ADE

[Sub-tree for suffix DE (null → A:2 → C:1 → D:1; A:2 → D:1) is turned into the conditional FP-tree for suffix DE: update counts, remove leaves (D's), prune the infrequent item C (count 1 < 2), leaving null → A:2]

Page 90: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 90

Current Position of Processing

[Figure: lattice position; sub-tree for suffix DE turned into the conditional FP-tree for suffix DE, ready to solve CDE, BDE, ADE]

Page 91: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 91

Solving CDE, BDE, ADE

Sub-trees for both CDE and BDE are empty– no paths containing item C or B

Work on ADE

ADE (support count = 2) is frequent– but no more subproblem for ADE, backtrack– & no more subproblem for DE, backtrack

[Conditional FP-tree for suffix DE: null → A:2. Sub-tree for suffix ADE: null → A:2. Next subproblem: CE]

Page 92: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 92

Current Position of Processing

[Figure: lattice position; subproblems 1-7 done, about to solve suffix CE]

Page 93: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 93

Solving for Suffix CE

CE is frequent (support count = 2)

No more subproblems for CE (the conditional FP-tree for CE is empty; why?), so done with CE

Checking next subproblem: BE

[Conditional FP-tree for E: null → A:2 → C:1 → D:1; A:2 → D:1; null → C:1. Sub-tree for suffix CE: null → A:2 → C:1; null → C:1. After updating counts (A becomes 1), removing the C leaves, and pruning the infrequent A, the conditional FP-tree for suffix CE is empty]

Page 94: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 94

Solving for Suffixes BE and AE

Sub-tree for BE is empty (no path in the conditional FP-tree for E contains B)

Checking AE: there are paths containing A


Page 95: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 95

Current Position of Processing

[Figure: lattice position; preparing to solve suffix AE]

Page 96: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 96

Solving for Suffix AE

AE is frequent (support count = 2). Done with AE, backtrack; done with E, backtrack

[Sub-tree for suffix AE: null → A:2. Next subproblem: suffix D]

Page 97: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 97

Current Position of Processing

[Figure: lattice position; ready to solve suffix D]

Page 98: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 98

Found Frequent Itemsets with Suffix E

E, DE, ADE, CE, AE discovered in this order

[Figure: lattice with the suffix-E branch complete: E, DE, ADE, CE, AE found. Ready to solve suffix D]

Page 99: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 99

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 100: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 100

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, candidate rules are:

ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
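Enumerating the 2^k − 2 candidates is a straightforward subset walk (an illustrative sketch, not the textbook's pseudocode):

```python
from itertools import combinations

def candidate_rules(L):
    """Every non-empty proper subset f of L gives a candidate rule
    f -> L - f, so a frequent k-itemset yields 2^k - 2 candidates."""
    L = tuple(L)
    for r in range(1, len(L)):
        for f in combinations(L, r):
            yield (f, tuple(i for i in L if i not in f))

rules = list(candidate_rules(["A", "B", "C", "D"]))
assert len(rules) == 2 ** 4 - 2                 # 14 candidate rules
assert (("A", "B"), ("C", "D")) in rules        # AB -> CD
```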

Page 101: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 101

Rule Generation

How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property
– e.g., c(ABC → D) can be larger or smaller than c(AB → D):

c(AB → D) = σ(ABD) / σ(AB),  c(ABC → D) = σ(ABCD) / σ(ABC)

Page 102: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 102

Rule Generation

But confidence of rules generated from the same itemset has an anti-monotone property
– e.g., L = {A,B,C,D}:  c(BCD → A) ≥ c(CD → AB) ≥ c(D → ABC)

σ(ABCD)/σ(BCD) ≥ σ(ABCD)/σ(CD) ≥ σ(ABCD)/σ(D)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
– more items on the right ⇒ lower or equal confidence

Page 103: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 103

Rule Generation for Apriori Algorithm

Lattice of rules for the frequent itemset {A,B,C,D}:

ABCD ⇒ {}
BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D
CD ⇒ AB, BD ⇒ AC, BC ⇒ AD, AD ⇒ BC, AC ⇒ BD, AB ⇒ CD
D ⇒ ABC, C ⇒ ABD, B ⇒ ACD, A ⇒ BCD

[Figure: once a low-confidence rule is found in the lattice, all rules below it are pruned]

Page 104: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 104

Rule Generation for Apriori Algorithm

Candidate rule is generated by merging two rules that share the same prefix in the rule consequent

join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC

Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence

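The merge step can be sketched as a small function; rules are represented as (antecedent, consequent) tuples, and the helper is hypothetical rather than the textbook's pseudocode:

```python
def join_consequents(rule1, rule2):
    """Merge two rules from the same itemset whose consequents share a
    (k-1)-prefix, e.g. join(CD -> AB, BD -> AC) gives D -> ABC; returns
    None when the consequent prefixes differ and no join is possible."""
    (ante1, cons1), (ante2, cons2) = rule1, rule2
    if cons1[:-1] != cons2[:-1]:
        return None
    consequent = cons1 + cons2[-1:]
    antecedent = tuple(sorted(set(ante1) & set(ante2)))
    return antecedent, consequent

rule = join_consequents((("C", "D"), ("A", "B")), (("B", "D"), ("A", "C")))
assert rule == (("D",), ("A", "B", "C"))   # D -> ABC
```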

Page 105: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 105

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 106: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 106

Pattern Evaluation

Association rule algorithms tend to produce too many rules – many of them are uninteresting

In the original formulation of association rules, support & confidence are used

Additional interestingness measures can be used to prune/rank the derived patterns

Page 107: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 107

Computing Interestingness Measure

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:

        Y     ¬Y
X      f11   f10   f1+
¬X     f01   f00   f0+
       f+1   f+0   |T|

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

Used to define various measures: support, confidence, lift, cosine, Jaccard coefficient, etc.
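As an illustrative sketch, a few of these measures computed directly from the four cell counts (the counts below are made up):

```python
def measures(f11, f10, f01, f00):
    """Support, confidence, and lift of X -> Y from contingency counts."""
    n = f11 + f10 + f01 + f00
    support = f11 / n
    confidence = f11 / (f11 + f10)            # P(Y | X)
    lift = confidence / ((f11 + f01) / n)     # P(Y | X) / P(Y)
    return support, confidence, lift

# Hypothetical counts, |T| = 100 transactions.
s, c, l = measures(f11=20, f10=5, f01=30, f00=45)
assert s == 0.2 and c == 0.8
assert l == 1.6                               # 0.8 / 0.5
```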

Page 108: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 108

Computing Interestingness Measure

Given a rule X → Y:

confidence(X → Y) = P(X,Y) / P(X) = P(Y | X) = f11 / (f11 + f10) = f11 / f1+


Drawback of Confidence

Association Rule: Tea → Coffee

             Coffee   ¬Coffee
  Tea          15        5        20
  ¬Tea         75        5        80
               90       10       100

Confidence = P(Coffee|Tea) = 15/20 = 0.75

but P(Coffee) = 0.9 and P(Coffee|¬Tea) = 75/80 = 0.9375 > 0.75

Although confidence is high, the rule is misleading: tea drinkers are actually less likely than average to drink coffee.
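The numbers above can be checked directly; only the counts come from the slide, the variable names are mine:

```python
# Tea/Coffee contingency counts from the slide.
tea_coffee, tea_only = 15, 5          # Tea row
coffee_only, neither = 75, 5          # ¬Tea row
total = tea_coffee + tea_only + coffee_only + neither

conf_tea = tea_coffee / (tea_coffee + tea_only)        # P(Coffee|Tea)
p_coffee = (tea_coffee + coffee_only) / total          # P(Coffee)
conf_no_tea = coffee_only / (coffee_only + neither)    # P(Coffee|¬Tea)

# High confidence, yet tea drinkers are LESS likely than average to drink coffee.
misleading = conf_tea < p_coffee < conf_no_tea
```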


Computing Interestingness Measure

Given a rule X → Y, the information needed to compute its interestingness can be obtained from a contingency table.

Contingency table for X → Y:

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0    |T|

confidence(X → Y) = P(Y|X) = P(X,Y) / P(X)

Confidence is based on P(X,Y) and P(X), but we also need to consider P(Y):

we need to measure the correlation of X & Y.


Measuring Correlation

Population of 1000 students:
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)

– P(S,B) = 420/1000 = 0.42
– P(S) × P(B) = 0.6 × 0.7 = 0.42

– P(S,B) = P(S) × P(B) => independent
– P(S,B) > P(S) × P(B) => positively correlated
– P(S,B) < P(S) × P(B) => negatively correlated
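A quick numerical check of the independence condition, using the student counts from the slide:

```python
p_s = 600 / 1000        # P(S): swimmers
p_b = 700 / 1000        # P(B): bikers
p_sb = 420 / 1000       # P(S,B): students who do both

# Compare P(S,B) with P(S)P(B); use a tolerance to sidestep float rounding.
independent = abs(p_sb - p_s * p_b) < 1e-9
```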


Measuring Correlation

Population of 1000 students:
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)

Lift(S,B) = P(B|S) / P(B) = P(S,B) / (P(S) × P(B))

– P(S,B) = P(S) × P(B)  =>  lift = 1  =>  S & B independent
– P(S,B) > P(S) × P(B)  =>  lift > 1  =>  positively correlated
– P(S,B) < P(S) × P(B)  =>  lift < 1  =>  negatively correlated


Measures on Correlation

The larger the value, usually the more likely the two variables/events are correlated.

Lift(X,Y)    = P(X,Y) / (P(X) × P(Y))

Cosine(X,Y)  = P(X,Y) / sqrt(P(X) × P(Y))

Jaccard(X,Y) = P(X,Y) / (P(X) + P(Y) − P(X,Y))

... many others

[Venn diagram: events X and Y with their overlap X,Y]


Computing Lift from Contingency Table

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0     N

Lift(X,Y) = P(X,Y) / (P(X) × P(Y))
          = (f11/N) / ((f1+/N) × (f+1/N))
          = (N × f11) / (f1+ × f+1)
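The count form of the formula is easy to wrap in a helper; this is a sketch, and the function name is mine:

```python
def lift(f11, f10, f01, f00):
    """Lift of X -> Y computed from the four contingency counts."""
    n = f11 + f10 + f01 + f00
    f1_plus = f11 + f10          # row margin: support count of X
    f_plus1 = f11 + f01          # column margin: support count of Y
    return n * f11 / (f1_plus * f_plus1)
```

For the Tea → Coffee table this gives 100·15 / (20·90) ≈ 0.83, matching the value computed from probabilities.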


Computing Lift from Contingency Table

Association Rule: Tea → Coffee

             Coffee   ¬Coffee
  Tea          15        5        20
  ¬Tea         75        5        80
               90       10       100

Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9

Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)


Drawback of Lift

Statistical independence: if P(X,Y) = P(X) × P(Y), then Lift = 1.

          Y     ¬Y                         Y     ¬Y
  X       10     0     10         X       90     0     90
  ¬X       0    90     90         ¬X       0    10     10
          10    90    100                 90    10    100

Lift = 0.1 / (0.1 × 0.1) = 10     Lift = 0.9 / (0.9 × 0.9) = 1.11

Lift is 10 although X & Y seldom occur together, yet only 1.11 although X & Y occur together in almost every transaction.
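Plugging both extreme tables into the count formula reproduces the problem (the helper name is mine):

```python
def lift(f11, f10, f01, f00):
    # Lift from contingency counts: N * f11 / (f1+ * f+1).
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

rare_together = lift(10, 0, 0, 90)      # X and Y co-occur in only 10% of transactions
always_together = lift(90, 0, 0, 10)    # X and Y co-occur in 90% of transactions

# Lift rewards the rare pattern far more than the near-perfect one.
```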


Compute Cosine from Contingency Table

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0     N

Cosine(X,Y) = P(X,Y) / sqrt(P(X) × P(Y))
            = (f11/N) / sqrt((f1+/N) × (f+1/N))
            = f11 / sqrt(f1+ × f+1)
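Note that N cancels out of the count form. A sketch of the measure (the function name is mine):

```python
import math

def cosine(f11, f10, f01, f00):
    """Cosine (IS) measure of X and Y from contingency counts; N and f00 cancel out."""
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))
```

Applied to the two tables from the lift-drawback slide, cosine gives 1.0 in both cases, treating the rare and the near-universal co-occurrence patterns alike.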


Cosine vs Lift

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0     N

Cosine(X,Y) = P(X,Y) / sqrt(P(X) × P(Y)) = f11 / sqrt(f1+ × f+1)

Lift(X,Y)   = P(X,Y) / (P(X) × P(Y))     = (N × f11) / (f1+ × f+1)

• If X and Y are independent, Lift(X,Y) = 1 but Cosine(X,Y) = sqrt(P(X) × P(Y))
• Cosine does not depend on N (or f00), unlike Lift


Compute Jaccard from Contingency Table

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0     N

Jaccard(X,Y) = P(X,Y) / (P(X) + P(Y) − P(X,Y))
             = f11 / (f1+ + f+1 − f11)
             = f11 / (f11 + f10 + f01)
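In count form this is simply the intersection over the union of the two item columns. A sketch (the function name is mine):

```python
def jaccard(f11, f10, f01, f00):
    """Jaccard coefficient of X and Y: transactions with both / transactions with either."""
    return f11 / (f11 + f10 + f01)
```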


Property under Null Addition

Null addition: add s transactions that contain neither X nor Y, so only the f00 cell (and its margins) grows.

          Y      ¬Y                              Y      ¬Y
  X      f11    f10     f1+            X        f11    f10       f1+
  ¬X     f01    f00     f0+            ¬X       f01    f00 + s   f0+ + s
         f+1    f+0      N                      f+1    f+0 + s    N + s

Invariant measures:

confidence, Cosine, Jaccard, etc.

Non-invariant measures:

support (f11/N vs f11/(N+s)), lift, etc.
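The property is easy to demonstrate: pad the table with s null transactions and watch which measures move (the helper names are mine):

```python
def lift(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def cosine(f11, f10, f01, f00):
    return f11 / ((f11 + f10) * (f11 + f01)) ** 0.5

base = (15, 5, 75, 5)
s = 1000                             # add 1000 transactions with neither X nor Y
padded = (15, 5, 75, 5 + s)

cosine_invariant = cosine(*base) == cosine(*padded)   # f00 never enters the formula
lift_changed = lift(*base) != lift(*padded)           # N does, so lift shifts
```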


Property under Variable Permutation

  M(A,B):                M(B,A):
         B     ¬B               A     ¬A
  A      p      q        B      p      r
  ¬A     r      s        ¬B     q      s

Does M(A,B) = M(B,A)?

If yes, M is symmetric; otherwise asymmetric.

e.g., is c(A→B) = c(B→A)?

c(A→B) = σ(A∪B) / σ(A)        c(B→A) = σ(A∪B) / σ(B)


Property under Variable Permutation

         B     ¬B               A     ¬A
  A      p      q        B      p      r
  ¬A     r      s        ¬B     q      s

Does M(A,B) = M(B,A)?

If yes, M is symmetric; otherwise asymmetric.

Is lift(A→B) = lift(B→A)?

lift(A→B) = P(A,B) / (P(A) × P(B))        lift(B→A) = P(B,A) / (P(B) × P(A))

The two expressions are identical, so lift is symmetric.


Property under Variable Permutation

         B     ¬B               A     ¬A
  A      p      q        B      p      r
  ¬A     r      s        ¬B     q      s

Does M(A,B) = M(B,A)?

If yes, M is symmetric; otherwise asymmetric.

Symmetric measures:

support, lift, Cosine, Jaccard, etc.

Asymmetric measures:

confidence, etc.


Applying Interestingness Measures

Association rules (A → B) are directional:

– confidence (asymmetric) matches this directionality and is intuitive
– but confidence does not capture correlation
– use additional measures such as lift to rank discovered rules & further prune those with low ranks
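A sketch of that post-processing step; the rules and their statistics below are made-up illustrations, not mined from real data:

```python
# Hypothetical mined rules: (antecedent, consequent, confidence, lift).
rules = [
    ("Tea", "Coffee", 0.75, 0.83),      # high confidence but negative correlation
    ("Diaper", "Beer", 0.60, 1.25),
    ("Bread", "Milk", 0.67, 1.05),
]

# Keep only positively correlated rules (lift > 1), then rank by lift.
ranked = sorted((r for r in rules if r[3] > 1.0),
                key=lambda r: r[3], reverse=True)
```

The Tea → Coffee rule survives a confidence threshold but is pruned here, exactly the failure mode the earlier slides illustrate.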
