Transcript
Page 1: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

Data Mining Association Analysis: Basic Concepts

and Algorithms

Based on

Introduction to Data Mining by

Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Page 2

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 3

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Page 4

Definition: Frequent Itemset

Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Page 5

Definition: Association Rule

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

s = σ(X ∪ Y) / |T|

c = σ(X ∪ Y) / σ(X)
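The two metrics can be computed directly from the transaction table above. A minimal Python sketch (the variable names are illustrative, not from the slides):

```python
# Market-basket transactions from the slide
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in T if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y) / len(T)       # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)  # 2/3, about 0.67
```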

Page 6

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold

– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules

– Compute the support and confidence for each rule

– Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!

Page 7

Computational Complexity

Given d unique items:
– Total number of itemsets = 2^d

– Total number of possible association rules:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d=6, R = 602 rules
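The closed form can be checked against the double sum directly. A small sketch, assuming nothing beyond the formula on this slide:

```python
from math import comb

def rule_count(d):
    # Choose a nonempty antecedent of size k, then a nonempty
    # consequent of size j from the remaining d - k items.
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

assert rule_count(6) == 3**6 - 2**7 + 1 == 602
```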

Page 8

Mining Association Rules

Example of Rules:

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements
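The six rules above can be enumerated mechanically: every nonempty proper subset of {Milk, Diaper, Beer} becomes an antecedent. A sketch (names are illustrative):

```python
from itertools import combinations

T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
itemset = frozenset({"Milk", "Diaper", "Beer"})
sigma = lambda s: sum(1 for t in T if s <= t)

rules = []
for k in range(1, len(itemset)):              # nonempty proper antecedents
    for ant in combinations(sorted(itemset), k):
        X = frozenset(ant)
        s = sigma(itemset) / len(T)           # identical for every rule
        c = sigma(itemset) / sigma(X)         # varies with the antecedent
        rules.append((set(X), set(itemset - X), round(s, 2), round(c, 2)))
```

All six rules share s = 0.4; only the confidence changes, which is why support and confidence can be checked in separate steps.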

Page 9

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive

Page 10

Frequent Itemset Generation

[Itemset lattice over items A, B, C, D, E: from the null set at the top, through all 1-, 2-, 3-, and 4-itemsets, down to ABCDE]

Given d items, there are 2^d possible candidate itemsets

Page 11

Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset

– Count the support of each candidate by scanning the database

– Match each transaction against every candidate

– Complexity ~ O(MNw) => expensive since M = 2^d !!!

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

[Figure: N transactions matched against a list of M candidates; w is the maximum transaction width]

Page 12

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 13

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d

– Use pruning techniques to reduce M

Reduce the number of comparisons (MN)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

Reduce the number of transactions (N)
– Transactions that do not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets

Page 14

Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
– Or equivalently, if an itemset is NOT frequent, then none of its supersets can be frequent

Apriori principle holds due to the following property of the support of an itemset:
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support: support never increases as more items are added to the set

∀X, Y: (X ⊆ Y) ⇒ s(Y) ≤ s(X)
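The anti-monotone property is easy to observe on the market-basket table. A minimal sketch:

```python
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def s(itemset):
    """Support: fraction of transactions containing the itemset."""
    return sum(1 for t in T if set(itemset) <= t) / len(T)

X = {"Milk", "Diaper"}
for extra in ["Beer", "Bread", "Coke"]:
    Y = X | {extra}
    assert s(Y) <= s(X)   # adding an item can never raise support
```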

Page 15

Illustrating Apriori Principle

[Lattice figure over items A–E: once an itemset (e.g., AB) is found to be infrequent, all of its supersets are pruned from the search space]

Page 16

Apriori Algorithm

Method:

1. Let k = 1
2. Generate frequent itemsets of length 1
3. Repeat until no new frequent itemsets are identified
   a) Generate candidate (k+1)-itemsets by joining pairs of frequent k-itemsets that differ in only one item
   b) Prune candidate itemsets containing subsets of length k that are infrequent
   c) Count the support of each candidate by scanning the DB
   d) Prune candidates that are infrequent, leaving only those that are frequent
   e) Increment k

Page 17

Generating Frequent 1-Itemsets

Consider a set of transactions T with 6 unique items
– suppose minimum support count = 3

Apriori makes a first pass through T to obtain support counts for the candidate 1-itemsets

Itemset
{Bread}
{Coke}
{Milk}
{Beer}
{Diaper}
{Eggs}

Candidates for frequent 1-itemsets

Obtain support count

Itemset Count
{Bread} 4
{Coke} 2
{Milk} 4
{Beer} 3
{Diaper} 4
{Eggs} 1

Coke & Eggs are infrequent

Itemset Count
{Bread} 4
{Milk} 4
{Beer} 3
{Diaper} 4

Prune infrequent itemsets

Frequent 1-itemsets

Page 18

Generating Frequent 2-Itemsets

Itemset Count
{Bread} 4
{Milk} 4
{Beer} 3
{Diaper} 4

Frequent 1-itemsets

Step 3.a: Generate candidates

Itemset
{Bread,Milk}
{Bread,Beer}
{Bread,Diaper}
{Milk,Beer}
{Milk,Diaper}
{Beer,Diaper}

Candidates for frequent 2-itemsets

Step 3.b: No pruning due to infrequent subsets

Step 3.c: Obtain support counts

Itemset Count
{Bread,Milk} 3
{Bread,Beer} 2
{Bread,Diaper} 3
{Milk,Beer} 2
{Milk,Diaper} 3
{Beer,Diaper} 3

{Bread,Beer} & {Milk,Beer} are infrequent

Step 3.d: Prune infrequent itemsets

Itemset Count
{Bread,Milk} 3
{Bread,Diaper} 3
{Milk,Diaper} 3
{Beer,Diaper} 3

Frequent 2-itemsets

Page 19

Generating Frequent 3-Itemsets

Itemset Count
{Bread,Milk} 3
{Bread,Diaper} 3
{Diaper,Milk} 3
{Beer,Diaper} 3

Frequent 2-itemsets

Step 3.a: Generate candidates

Itemset
{Bread,Diaper,Milk}
{Beer,Bread,Diaper}
{Beer,Diaper,Milk}

Candidates for frequent 3-itemsets

Step 3.b: Prune due to infrequent subsets
– E.g., {Beer,Bread,Diaper} is pruned since its subset {Bread,Beer} is infrequent (likewise {Beer,Diaper,Milk}, since {Milk,Beer} is infrequent)

Itemset
{Bread,Diaper,Milk}

Step 3.c: Obtain support counts

Itemset Count
{Bread,Milk,Diaper} 3

Step 3.d: nothing to prune — the remaining candidate has sufficient support

Frequent 3-itemsets

# of candidates considered:
• without pruning: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 (there are C(6,1) candidates for frequent 1-itemsets)
• with pruning: C(6,1) + C(4,2) + 3 = 6 + 6 + 3 = 15
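These counts follow directly from the binomial coefficients. A quick check:

```python
from math import comb

# without pruning: all size-1..3 subsets of the 6 items are candidates
assert comb(6, 1) + comb(6, 2) + comb(6, 3) == 6 + 15 + 20 == 41
# with pruning: 6 candidate 1-itemsets, pairs of the 4 frequent items,
# and the 3 candidate triples produced by step 3.a
assert comb(6, 1) + comb(4, 2) + 3 == 6 + 6 + 3 == 15
```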

Page 20

Reducing Number of Comparisons

Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure

Instead of matching each transaction against every candidate, match it against only the candidates contained in the hashed buckets

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

[Figure: N transactions matched against a hash structure of k buckets]

Page 21

Generate Hash Tree

[Hash tree figure: the 15 candidate 3-itemsets are distributed into leaf nodes by hashing items with h(x) = x mod 3, so items 1, 4, 7 share one branch, items 2, 5, 8 another, and items 3, 6, 9 the third]

Suppose you have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

• Items in itemsets & transactions are ordered (in the same way)

• Hash function

• Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)

hash on first item

on second item

on third

max leaf size = 3
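The construction rules above can be sketched as a small recursive structure. This is an illustrative sketch (the class and function names are not from the slides), and the exact leaf layout may differ from the figure depending on insertion order:

```python
MAX_LEAF = 3

def h(x):
    return x % 3   # groups items 1,4,7 / 2,5,8 / 3,6,9

class Node:
    def __init__(self, depth):
        self.depth = depth
        self.children = {}    # hash value -> child Node
        self.itemsets = []    # candidates stored at a leaf
        self.is_leaf = True

    def insert(self, itemset):
        if not self.is_leaf:
            self._route(itemset)
            return
        self.itemsets.append(itemset)
        # split an overfull leaf, if we can still hash on a deeper item
        if len(self.itemsets) > MAX_LEAF and self.depth < len(itemset):
            self.is_leaf = False
            for stored in self.itemsets:
                self._route(stored)
            self.itemsets = []

    def _route(self, itemset):
        key = h(itemset[self.depth])
        child = self.children.setdefault(key, Node(self.depth + 1))
        child.insert(itemset)

def collect(node):
    """Gather every candidate stored anywhere in the tree."""
    if node.is_leaf:
        return list(node.itemsets)
    return [s for c in node.children.values() for s in collect(c)]

CANDIDATES = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9),
              (1,3,6), (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7),
              (6,8,9), (3,6,7), (3,6,8)]

root = Node(0)
for cand in CANDIDATES:
    root.insert(cand)
```

Inserting the 15 candidates forces the root to split on the first item; `collect(root)` returns all 15 itemsets regardless of the final leaf layout.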

Page 22

Association Rule Discovery: Hash tree

[Candidate hash tree figure, hash function h(x) = x mod 3; highlighted: the branch taken when the first item is 1, 4, or 7]

Page 23

Association Rule Discovery: Hash tree

[Candidate hash tree figure, hash function h(x) = x mod 3; at the second level the branches for second item 1/4/7, 2/5/8, and 3/6/9 are highlighted]

Page 24

Association Rule Discovery: Hash tree

[Candidate hash tree figure; highlighted: the branch taken when the first item is 2, 5, or 8]

Page 25

Association Rule Discovery: Hash tree

[Candidate hash tree figure; highlighted: the branch taken when the first item is 3, 6, or 9]

Page 26

Association Rule Discovery: Hash tree

[Candidate hash tree figure; an overfull node is split by hashing on the second item]

Page 27

Association Rule Discovery: Hash tree

[Candidate hash tree figure: the leaf holding {3 5 6}, {3 5 7}, {6 8 9} is not split further since its size <= max leaf size; if it were split by the third item, {3 5 6} and {6 8 9} would share one child and {3 5 7} would sit in another]

Page 28

Subset Operation

Transaction t = {1, 2, 3, 5, 6}

[Figure: enumerating the size-3 subsets of t level by level — Level 1 fixes the first item (1, 2, or 3), Level 2 fixes the second, Level 3 completes the subset, giving: 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6]

Given a transaction t, what are the possible subsets of size 3?
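The enumeration in the figure is exactly what `itertools.combinations` produces, since both keep items in their original order:

```python
from itertools import combinations

t = [1, 2, 3, 5, 6]
subsets = list(combinations(t, 3))   # C(5,3) = 10 size-3 subsets
```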

Page 29

Subset Operation Using Hash Tree

[Hash tree figure, h(x) = x mod 3: transaction {1 2 3 5 6} is split at the root into 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}, and each prefix is hashed to a branch]

Page 30

Subset Operation Using Hash Tree

[Hash tree figure: at the second level the prefixes are expanded further — 1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6}, 2 + {3 5 6}, 3 + {5 6} — and hashed down the tree]

Page 31

Subset Operation Using Hash Tree

[Hash tree figure: following every prefix of transaction {1 2 3 5 6} down the tree reaches only the leaves that can contain one of its size-3 subsets]

5 leaves visited, 9 candidates compared against the transaction's 3-itemsets

Page 32

Factors Affecting Complexity of Apriori

Choice of minimum support threshold
– lowering the support threshold results in more frequent itemsets
– this may increase (1) the number of candidates and (2) the max length of frequent itemsets

Dimensionality (number of unique items) of the data set
– more candidates & more space needed to store their support counts
– if the number of frequent items also increases, both computation and I/O costs may increase

Size of database
– since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
– transaction width increases with denser data sets
– this may increase the max length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width)

Page 33

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 34

Compact Representation of Frequent Itemsets

Itemsets with positive support counts
– 3 groups: {A1, …, A10}, {B1, …, B10}, {C1, …, C10}
– no transactions contain items from different groups

All subsets of items within the same group have the same support count
– e.g., support counts of {A1}, {A1, A2}, {A1, A2, A3} are all 5

TID 1–5: items A1 … A10 (all 1s); B and C columns all 0
TID 6–10: items B1 … B10 (all 1s); A and C columns all 0
TID 11–15: items C1 … C10 (all 1s); A and B columns all 0

Page 35

Compact Representation of Frequent Itemsets

Suppose the minimum support count is 3

Number of frequent itemsets: 3 × Σ_{k=1}^{10} C(10, k) = 3 × (2^10 − 1) = 3069

⇒ need a compact representation
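The count follows because every nonempty subset of a 10-item group is frequent, and there are three groups. A quick check:

```python
from math import comb

# nonempty subsets of one 10-item group
per_group = sum(comb(10, k) for k in range(1, 11))
assert per_group == 2**10 - 1
total = 3 * per_group   # frequent itemsets across the three groups
```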

Page 36

Maximal Frequent Itemset

[Itemset lattice over items A–E with a border separating frequent itemsets (above) from infrequent itemsets (below); the maximal itemsets lie just above the border]

An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent

Page 37

Maximal Frequent Itemset

[Itemset lattice with the border between frequent and infrequent itemsets; maximal itemsets marked]

Every frequent itemset is either maximal or a subset of some maximal frequent itemset

Page 38

Properties of Maximal Frequent Itemsets

If level k is the highest level with maximal frequent itemsets, then level k only contains maximal frequent itemsets or infrequent itemsets

[Itemset lattice (levels 0–5, null through ABCDE) with the border between frequent and infrequent itemsets; maximal itemsets marked]

Page 39

Properties of Maximal Frequent Itemsets

Furthermore, level k + 1 (if it exists) only contains infrequent itemsets

[Itemset lattice (levels 0–5, null through ABCDE) with the border between frequent and infrequent itemsets; maximal itemsets marked]

Page 40

Maximal Frequent Itemset

[Itemset lattice with the border between frequent and infrequent itemsets; maximal itemsets marked]

Knowing the maximal frequent itemsets determines all frequent itemsets, but not their supports

Page 41

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset

TID Items

1 ABC

2 ABCD

3 BCE

4 ACDE

5 DE

[Itemset lattice annotated with support counts — e.g., σ(A)=3, σ(B)=3, σ(C)=4, σ(D)=3, σ(E)=3; itemsets with count 0 are not supported by any transaction; the closed itemsets are circled]

Page 42

Properties of Closed Itemsets

Every itemset is either a closed itemset or has the same support as one of its immediate supersets

[Itemset lattice with support counts illustrating the property above]

Page 43

Properties of Closed Itemsets

If X is not closed, X has the same support as its immediate superset with the largest support

Suppose Y1, …, Yk are X's immediate supersets & Yi has the largest support

σ(X) ≥ σ(Y1), σ(Y2), …, σ(Yk)
     ≥ max{σ(Y1), …, σ(Yk)}
     = σ(Yi)

I.e., σ(X) ≥ σ(Yi); since X is not closed, σ(X) = σ(Yi)

Page 44

Properties of Closed Itemsets

Suppose the highest level of the lattice is level k
– the itemset at level k (there is only one k-itemset) must be closed

[Itemset lattice (levels 0–5) with support counts]

Page 45

Properties of Closed Itemsets

Can compute support counts of all itemsets from support counts of closed itemsets

[Itemset lattice (levels 0–5) with support counts]

Page 46

Properties of Closed Itemsets

Compute support of itemsets on level k from those in level k + 1: if an itemset on level k is not closed, we can compute its support from those at level k + 1

[Itemset lattice (levels 0–5) with support counts]

Page 47

Closed Frequent Itemsets

An itemset is a closed frequent itemset (shown in shaded ovals) if it is both frequent & closed

Minimum support count = 2

[Itemset lattice with support counts; the closed frequent itemsets are shown in shaded ovals]

# frequent = 14

# closed frequent = 9
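The counts on this slide can be reproduced by brute force over the 5-transaction table (ABC, ABCD, BCE, ACDE, DE). A sketch — the helper names are illustrative only:

```python
from itertools import combinations

T = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
items = sorted(set().union(*T))
minsup = 2   # minimum support count

def sigma(s):
    return sum(1 for t in T if set(s) <= t)

frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sigma(c) >= minsup]

def immediate_supersets(s):
    return [s | {i} for i in items if i not in s]

# closed: no immediate superset has the same support
closed = [s for s in frequent
          if all(sigma(sup) < sigma(s) for sup in immediate_supersets(s))]
# maximal: no immediate superset is frequent
maximal = [s for s in frequent
           if all(sigma(sup) < minsup for sup in immediate_supersets(s))]

print(len(frequent), len(closed), len(maximal))   # 14 9 4
```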

Page 48

Frequent Itemsets: Maximal vs Closed

Every maximal frequent itemset is a closed frequent itemset

[Itemset lattice with support counts (minimum support count = 2); itemsets are marked either "closed and maximal" or "closed but not maximal"]

# closed frequent = 9

# maximal frequent = 4

Page 49: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 49

Frequent Itemsets: Maximal vs Closed

Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets

Page 50: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 50

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 51: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 51

Alternative Methods for Frequent Itemset

Generation

Traversal of itemset lattice– Breadth-first: e.g., Apriori algorithm

– Depth-first: often used to search maximal frequent itemsets

[Figure: (a) breadth-first and (b) depth-first traversal of the itemset lattice]

Page 52: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 52

Alternative Methods for Frequent Itemset

Generation

Depth-first search finds maximal frequent itemsets more quickly, and it also enables substantial pruning

– if abcd is maximal, none of its subsets (e.g., bc) can be maximal; but supersets of bc that are not subsets of abcd (e.g., bce) may still be maximal


Page 53: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 53

Alternative Methods for Frequent Itemset

Generation

Traversal of itemset lattice– General-to-specific, specific-to-general, & bidirectional

[Figure: frequent itemset border in the lattice from null to {a1,a2,...,an} under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional search]

Page 54: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 54

Alternative Methods for Frequent Itemset

Generation

General-to-specific: going from k-itemsets to (k+1)-itemsets
– adopted by Apriori; effective when the border (between frequent and infrequent itemsets) is near the top of the lattice

Page 55: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 55

Alternative Methods for Frequent Itemset

Generation

Specific-to-general: going from k+1 to k-itemsets– often used to find maximal frequent itemsets

– effective when the border is near the bottom of the lattice

Page 56: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 56

Alternative Methods for Frequent Itemset

Generation

Bidirectional: going in both directions at the same time
– need more space for storing candidates
– may quickly discover the border in situations like (c)

Page 57: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 57

Alternative Methods for Frequent Itemset

Generation

Traverse lattice by equivalence classes, formed by
– having the same number of items in the itemset (e.g., Apriori)
– or having the same prefix or suffix

[Figure: (a) prefix tree, itemsets grouped by common prefix, e.g., A; (b) suffix tree, itemsets grouped by common suffix, e.g., D]

Page 58: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 58

Storage Formats for Transactions

Horizontal: store a set/list of items associated with each transaction (e.g., Apriori)

Vertical: store a set/list of transactions associated with each item

Horizontal data layout:

TID  Items
 1   A,B,E
 2   B,C,D
 3   C,E
 4   A,C,D
 5   A,B,C,D
 6   A,E
 7   A,B
 8   A,B,C
 9   A,C,D
10   B

Vertical data layout (one TID-list per item):

A: 1,4,5,6,7,8,9
B: 1,2,5,7,8,10
C: 2,3,4,5,8,9
D: 2,4,5,9
E: 1,3,6

Page 59: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 59

Vertical Data Layout

Searching support = intersecting TID-lists– length of TID-list shrinks as size of itemsets grows

– problem: initial lists may be too large to fit in memory

Need a more compact representation ⇒ FP-growth
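Support counting by TID-list intersection can be sketched as follows, using the TID-lists derived from the 10-transaction table above:

```python
# Vertical layout: each item maps to the set of TIDs containing it
# (taken from the transaction table on the slide).
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 5, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

def support(itemset):
    """Support count = size of the intersection of the items' TID-lists."""
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids)

assert support(["A", "B"]) == 4        # TIDs 1, 5, 7, 8
assert support(["C", "D"]) == 4        # TIDs 2, 4, 5, 9
assert support(["A", "C", "D"]) == 3   # TIDs 4, 5, 9
```

Note how the intersected lists shrink as itemsets grow, which is the property the slide points out.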


Page 60: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 60

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 61: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 61

FP-growth Algorithm: Key Ideas

Limitations of Apriori
– need to generate a large number of candidates, e.g., 1K frequent 1-itemsets → about 1M candidate 2-itemsets

– need to repeatedly scan the database

FP-growth: frequent pattern growth– discover frequent itemsets w/o generating candidates

– often just need to scan databases twice

Page 62: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 62

FP-growth Algorithm

Step 1:– scan database to discover frequent 1-itemsets

– scan database the second time to build a compact representation of transactions in form of FP-tree

Step 2: – use the constructed FP-tree to recursively find

frequent itemsets via divide-and-conquer: turn problem of finding k-itemsets into a set of subproblems, each finding k-itemsets ending in a different suffix

Page 63: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 63

FP-tree & Construction

Transactions represented by paths from root – items in transactions are ordered

– node = item:# of transactions having path to item

Transaction database:

TID  Items
 1   {A,B}
 2   {B,C,D}
 3   {A,C,D,E}
 4   {A,D,E}
 5   {A,B,C}
 6   {A,B,C,D}
 7   {A}
 8   {A,B,C}
 9   {A,B,D}
10   {B,C,E}

After reading TID=1: null → A:1 → B:1
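The insertion of ordered transactions along shared paths can be sketched in Python. This is a simplified version: the header table and node links are omitted, and `Node` is a hypothetical helper class, not code from the textbook:

```python
class Node:
    """FP-tree node (minimal): item label, transaction count, children."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, order):
    """Insert each transaction along a path from the root, with items
    sorted by the given (support-based) order; transactions sharing a
    prefix share nodes, whose counts are incremented."""
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted(t, key=order.index):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root

# The 10 transactions from the slide; order = decreasing support.
transactions = [
    {"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"}, {"A","B","C"},
    {"A","B","C","D"}, {"A"}, {"A","B","C"}, {"A","B","D"}, {"B","C","E"},
]
root = build_fp_tree(transactions, order=["A", "B", "C", "D", "E"])

assert root.children["A"].count == 8                  # 8 transactions contain A
assert root.children["A"].children["B"].count == 5    # shared A,B prefix
assert root.children["B"].count == 2                  # TIDs 2 and 10
```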

Page 64: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 64

FP-tree & Construction

Transactions represented by paths from root – transactions with common prefixes (e.g., T1 and T3)

share paths for their prefixes

[After reading TID=1: null → A:1 → B:1]

Page 65: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 65

FP-tree & Construction

Items often ordered in decreasing support counts– s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3

the order is: A, B, C, D, E


Page 66: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 66

FP-tree & Construction

Items often ordered in decreasing support counts– s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3

– a pass over database needed for getting these counts


Page 67: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 67

FP-tree & Construction

Items often ordered in decreasing support counts– with this order, transactions tend to share paths

smaller tree: low branching factor, less bushy


Page 68: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 68

FP-tree Construction

Nodes for the same item on different paths are connected via node links

[After reading TID=2: null → A:1 → B:1 and null → B:1 → C:1 → D:1; the two B nodes are connected by a node link]

Page 69: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 69

FP-tree Construction

Shared paths between TID=1 and TID=3

[After reading TID=3: null → A:2 → B:1; A:2 → C:1 → D:1 → E:1; null → B:1 → C:1 → D:1. TID 1 and TID 3 share the A node]

Page 70: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 70

FP-tree Construction: Completed Tree

Completed FP-tree (header table with a pointer per item A, B, C, D, E into the node links):

null
  A:8
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:2
    C:2
      D:1
      E:1

Pointers are used to assist frequent itemset generation

Page 71: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 71

FP-growth: Finding Frequent Itemsets

Take FP-tree generated in step 1 as input

Find frequent itemsets with common suffixes
– e.g., suppose the order of items is: A, B, C, D, E
– first find itemsets in reverse order: those ending with E, then D, C, …
– start with the items of lowest support, so pruning is more likely

For each particular suffix, find frequent itemsets via divide-and-conquer, e.g., consider itemsets ending w/ E

1. obtain sub-tree of FP-tree where all paths end with E

2. compute support counts of E; if E is frequent,

3. check & prepare to solve subproblems: DE, CE, BE, AE

4. recursively compute each subproblem in a similar fashion
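The suffix-by-suffix divide-and-conquer just described can be sketched without an explicit tree, by projecting (itemset, count) pattern bases. This is a simplified illustration of the same recursion, not the pointer-based FP-tree algorithm itself; a fixed lexicographic order stands in for the support-based order:

```python
from collections import defaultdict

def mine(patterns, minsup, suffix=()):
    """Count items in the current (conditional) pattern base; for each
    frequent item, report it together with the current suffix, then recurse
    on that item's conditional pattern base (items preceding it in the
    fixed order), mirroring the suffix-based subproblems of FP-growth."""
    counts = defaultdict(int)
    for itemset, c in patterns:
        for item in itemset:
            counts[item] += c
    for item in sorted(counts):
        if counts[item] < minsup:
            continue
        yield tuple(sorted((item,) + suffix)), counts[item]
        conditional = [([i for i in itemset if i < item], c)
                       for itemset, c in patterns if item in itemset]
        yield from mine(conditional, minsup, (item,) + suffix)

transactions = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
                {"A","B","C"}, {"A","B","C","D"}, {"A"}, {"A","B","C"},
                {"A","B","D"}, {"B","C","E"}]
frequent = dict(mine([(t, 1) for t in transactions], minsup=2))

assert frequent[("A", "D", "E")] == 2       # ADE is frequent
assert frequent[("C", "E")] == 2
assert ("B", "E") not in frequent           # BE has support 1 only
assert frequent[("A",)] == 8
```

On the slide's database with minimum support count 2 this recovers exactly the suffix-E itemsets found later in the deck: E, DE, ADE, CE, AE.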

Page 72: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 72

FP-growth: Finding Frequent Itemsets

For each particular suffix, find frequent itemsets via divide-and-conquer, e.g., consider itemsets ending with E:
1. obtain sub-tree of FP-tree where all paths end with E
2. compute support count of E; if E is frequent,
3. transform the sub-tree for suffix E into the conditional FP-tree for suffix E
– update counts
– remove leaves (E's)
– prune infrequent itemsets
4. use the conditional FP-tree for E to recursively solve DE, CE, BE, AE in a similar fashion (as using the FP-tree to solve E, D, C, B, A)

Page 73: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 73

Traversal of Lattice by FP-growth

[Figure: itemset lattice partitioned into suffix-based subproblems (numbered 1-4), with the common-suffix-E branch highlighted]

Page 74: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 74

Discover Frequent Itemsets via FP-tree

[Figure: same lattice; the FP-tree is projected into the sub-tree for suffix E, and that in turn into the sub-tree for suffix DE]

Page 75: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 75

Obtaining Sub-Tree for Suffix E

Find first path containing E (by looking up in the header table), find rest by following node links

[Figure: completed FP-tree with header table (items A, B, C, D, E); the three E nodes are reached via the header-table pointer for E and its node links]

Page 76: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 76

Sub-Tree for Suffix E

Remove all of E's descendants and all nodes that are not ancestors of an E node

[Sub-tree for suffix E: null → A:8 → C:1 → D:1 → E:1; A:8 → D:1 → E:1; null → B:2 → C:2 → E:1]

Paths in the sub-tree are the prefix paths ending in E

Page 77: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 77

Current Position of Processing

[Figure: lattice position; sub-tree for suffix E just obtained]

Page 78: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 78

Determine if E is Frequent

Support count of E = sum of counts of all E nodes– suppose minimum support count is 2

– support count of E = 3, so E is frequent

[Sub-tree for suffix E with three E:1 leaves: null → A:8 → C:1 → D:1 → E:1; A:8 → D:1 → E:1; B:2 → C:2 → E:1]

Page 79: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 79

Prepare for Solving DE, CE, BE, and AE

Turn the original sub-tree into a conditional FP-tree by:
1. update counts on prefix paths
2. remove E nodes
3. prune infrequent itemsets

[Figure: the original sub-tree for suffix E]

Page 80: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 80

Update Counts on Prefix Paths

Counts for E (leaf) nodes are correct

Counts for internal nodes may not be correct
– due to removal of paths which do not contain E

[Sub-tree for suffix E, counts not yet updated: the path B,C,D (TID 2) was removed, so the count C:2 is no longer correct]

Page 81: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 81

Update Counts on Prefix Paths

Start from leaves, going upward– if node X has only child Y, count of X = count of Y

– otherwise, count of X = sum of counts of X’s children
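The bottom-up update rule can be sketched with a hypothetical `Node` class (a node with children takes the sum of its children's counts; leaf counts are already correct, and a single child is just the one-term sum):

```python
class Node:
    """Hypothetical minimal node: a count and a dict of children."""
    def __init__(self, count=0, children=None):
        self.count, self.children = count, dict(children or {})

def update_counts(node):
    """Bottom-up pass: an internal node's count becomes the sum of its
    children's counts; leaf (E) counts are left as-is."""
    if node.children:
        node.count = sum(update_counts(c) for c in node.children.values())
    return node.count

# Prefix paths ending in E from the slides, with stale internal counts.
a = Node(8, {"C": Node(1, {"D": Node(1, {"E": Node(1)})}),
             "D": Node(1, {"E": Node(1)})})
b = Node(2, {"C": Node(2, {"E": Node(1)})})
root = Node(0, {"A": a, "B": b})
update_counts(root)

assert a.count == 2    # A:8 becomes A:2 (only the two E-paths remain)
assert b.count == 1    # B:2 becomes B:1, since path B,C,D was removed
```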

[After updating counts: null → A:2 → C:1 → D:1 → E:1; A:2 → D:1 → E:1; B:1 → C:1 → E:1 (previously A:8, B:2, C:2)]

Page 82: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 82

Remove E Nodes

E nodes can be removed– counts of internal nodes have been updated

– contain no more information for solving: DE, CE, …

[After removing E nodes: null → A:2 → C:1 → D:1; A:2 → D:1; B:1 → C:1]

Page 83: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 83

Prune Infrequent Itemsets

If the sum of counts of all X nodes < minimum support count, remove X
– since XE cannot be frequent

[B has total count 1 < 2, so B is removed and its child C:1 is attached to the root. Conditional FP-tree for E: null → A:2 → C:1 → D:1; A:2 → D:1; null → C:1]

Page 84: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 84

Current Position of Processing

[Figure: lattice position; sub-tree for suffix E turned into the conditional FP-tree for suffix E: counts updated, leaves (E's) removed, infrequent itemsets pruned]

Page 85: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 85

Current Position of Processing

[Figure: the conditional FP-tree for suffix E is then used to solve DE, CE, BE, AE, just as the FP-tree was used to solve E, D, C, B, A]

Page 86: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 86

Solving for DE

From conditional FP-tree for E, obtain sub-tree for suffix DE

[Conditional FP-tree for E: null → A:2 → C:1 → D:1; A:2 → D:1; null → C:1. Sub-tree for suffix DE: null → A:2 → C:1 → D:1; A:2 → D:1]

Page 87: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 87

Current Position of Processing

[Figure: lattice position; sub-tree for suffix DE just obtained]

Page 88: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 88

Solving for DE

Support count of DE = 2 (sum of counts of all D’s)– DE is frequent, need to solve: CDE, BDE, ADE

[Sub-tree for suffix DE: null → A:2 → C:1 → D:1; A:2 → D:1]

Page 89: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 89

Preparing for Solving CDE, BDE, ADE

[Sub-tree for suffix DE (null → A:2 → C:1 → D:1; A:2 → D:1) is turned into the conditional FP-tree for suffix DE: update counts, remove leaves (D's), prune the infrequent item C (count 1 < 2), leaving null → A:2]

Page 90: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 90

Current Position of Processing

[Figure: lattice position; sub-tree for suffix DE turned into the conditional FP-tree for suffix DE, ready to solve CDE, BDE, ADE]

Page 91: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 91

Solving CDE, BDE, ADE

Sub-trees for both CDE and BDE are empty– no paths containing item C or B

Work on ADE

ADE (support count = 2) is frequent– but no more subproblem for ADE, backtrack– & no more subproblem for DE, backtrack

[Conditional FP-tree for suffix DE: null → A:2. Sub-tree for suffix ADE: null → A:2. Next subproblem: CE]

Page 92: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 92

Current Position of Processing

[Figure: lattice position; subproblems 1-7 done, about to solve suffix CE]

Page 93: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 93

Solving for Suffix CE

CE is frequent (support count = 2)

No more subproblems for CE (the conditional FP-tree for CE is empty; why?), so done with CE

Checking next subproblem: BE

[Conditional FP-tree for E: null → A:2 → C:1 → D:1; A:2 → D:1; null → C:1. Sub-tree for suffix CE: null → A:2 → C:1; null → C:1. After updating counts (A becomes 1), removing the C leaves, and pruning the infrequent A, the conditional FP-tree for suffix CE is empty]

Page 94: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 94

Solving for Suffixes BE and AE

Sub-tree for BE is empty (no path in the conditional FP-tree for E contains B)

Checking AE: there are paths containing A


Page 95: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 95

Current Position of Processing

[Figure: lattice position; preparing to solve suffix AE]

Page 96: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 96

Solving for Suffix AE

AE is frequent (support count = 2). Done with AE, backtrack; done with E, backtrack

[Sub-tree for suffix AE: null → A:2. Next subproblem: suffix D]

Page 97: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 97

Current Position of Processing

[Figure: lattice position; ready to solve suffix D]

Page 98: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 98

Found Frequent Itemsets with Suffix E

E, DE, ADE, CE, AE discovered in this order

[Figure: lattice with the suffix-E branch complete: E, DE, ADE, CE, AE found. Ready to solve suffix D]

Page 99: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 99

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 100: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 100

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, candidate rules are:

ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
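Enumerating the 2^k − 2 candidates is a straightforward subset walk (an illustrative sketch, not the textbook's pseudocode):

```python
from itertools import combinations

def candidate_rules(L):
    """Every non-empty proper subset f of L gives a candidate rule
    f -> L - f, so a frequent k-itemset yields 2^k - 2 candidates."""
    L = tuple(L)
    for r in range(1, len(L)):
        for f in combinations(L, r):
            yield (f, tuple(i for i in L if i not in f))

rules = list(candidate_rules(["A", "B", "C", "D"]))
assert len(rules) == 2 ** 4 - 2                 # 14 candidate rules
assert (("A", "B"), ("C", "D")) in rules        # AB -> CD
```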

Page 101: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 101

Rule Generation

How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property
– e.g., c(ABC → D) can be larger or smaller than c(AB → D):

c(AB → D) = σ(ABD) / σ(AB),  c(ABC → D) = σ(ABCD) / σ(ABC)

Page 102: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 102

Rule Generation

But confidence of rules generated from the same itemset has an anti-monotone property
– e.g., L = {A,B,C,D}:  c(BCD → A) ≥ c(CD → AB) ≥ c(D → ABC)

σ(ABCD)/σ(BCD) ≥ σ(ABCD)/σ(CD) ≥ σ(ABCD)/σ(D)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
– more items on the right ⇒ lower or equal confidence

Page 103: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 103

Rule Generation for Apriori Algorithm

Lattice of rules for the frequent itemset {A,B,C,D}:

ABCD ⇒ {}
BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D
CD ⇒ AB, BD ⇒ AC, BC ⇒ AD, AD ⇒ BC, AC ⇒ BD, AB ⇒ CD
D ⇒ ABC, C ⇒ ABD, B ⇒ ACD, A ⇒ BCD

[Figure: once a low-confidence rule is found in the lattice, all rules below it are pruned]

Page 104: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 104

Rule Generation for Apriori Algorithm

Candidate rule is generated by merging two rules that share the same prefix in the rule consequent

join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC

Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence

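The merge step can be sketched as a small function; rules are represented as (antecedent, consequent) tuples, and the helper is hypothetical rather than the textbook's pseudocode:

```python
def join_consequents(rule1, rule2):
    """Merge two rules from the same itemset whose consequents share a
    (k-1)-prefix, e.g. join(CD -> AB, BD -> AC) gives D -> ABC; returns
    None when the consequent prefixes differ and no join is possible."""
    (ante1, cons1), (ante2, cons2) = rule1, rule2
    if cons1[:-1] != cons2[:-1]:
        return None
    consequent = cons1 + cons2[-1:]
    antecedent = tuple(sorted(set(ante1) & set(ante2)))
    return antecedent, consequent

rule = join_consequents((("C", "D"), ("A", "B")), (("B", "D"), ("A", "C")))
assert rule == (("D",), ("A", "B", "C"))   # D -> ABC
```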

Page 105: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 105

Outline

Association rules & mining

Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm

Generating association rules

Evaluating discovered rules

Page 106: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 106

Pattern Evaluation

Association rule algorithms tend to produce too many rules – many of them are uninteresting

In the original formulation of association rules, support & confidence are used

Additional interestingness measures can be used to prune/rank the derived patterns

Page 107: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 107

Computing Interestingness Measure

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:

        Y     ¬Y
X      f11   f10   f1+
¬X     f01   f00   f0+
       f+1   f+0   |T|

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

Used to define various measures: support, confidence, lift, cosine, Jaccard coefficient, etc.
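As an illustrative sketch, a few of these measures computed directly from the four cell counts (the counts below are made up):

```python
def measures(f11, f10, f01, f00):
    """Support, confidence, and lift of X -> Y from contingency counts."""
    n = f11 + f10 + f01 + f00
    support = f11 / n
    confidence = f11 / (f11 + f10)            # P(Y | X)
    lift = confidence / ((f11 + f01) / n)     # P(Y | X) / P(Y)
    return support, confidence, lift

# Hypothetical counts, |T| = 100 transactions.
s, c, l = measures(f11=20, f10=5, f01=30, f00=45)
assert s == 0.2 and c == 0.8
assert l == 1.6                               # 0.8 / 0.5
```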

Page 108: Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 108

Computing Interestingness Measure

Given a rule X → Y:

confidence(X → Y) = P(X,Y) / P(X) = P(Y | X) = f11 / (f11 + f10) = f11 / f1+


Drawback of Confidence

Association Rule: Tea → Coffee

             Coffee   ¬Coffee
  Tea          15        5        20
  ¬Tea         75        5        80
               90       10       100

Confidence = P(Coffee|Tea) = 15/20 = 0.75

but P(Coffee) = 0.9 and P(Coffee|¬Tea) = 75/80 = 0.9375 > 0.75

Although confidence is high, the rule is misleading: tea drinkers are actually less likely than average to drink coffee.
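The numbers above can be checked directly; only the counts come from the slide, the variable names are mine:

```python
# Tea/Coffee contingency counts from the slide.
tea_coffee, tea_only = 15, 5          # Tea row
coffee_only, neither = 75, 5          # ¬Tea row
total = tea_coffee + tea_only + coffee_only + neither

conf_tea = tea_coffee / (tea_coffee + tea_only)        # P(Coffee|Tea)
p_coffee = (tea_coffee + coffee_only) / total          # P(Coffee)
conf_no_tea = coffee_only / (coffee_only + neither)    # P(Coffee|¬Tea)

# High confidence, yet tea drinkers are LESS likely than average to drink coffee.
misleading = conf_tea < p_coffee < conf_no_tea
```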


Computing Interestingness Measure

Given a rule X → Y, the information needed to compute its interestingness can be obtained from a contingency table.

Contingency table for X → Y:

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0    |T|

confidence(X → Y) = P(Y|X) = P(X,Y) / P(X)

Confidence is based on P(X,Y) and P(X), but we also need to consider P(Y):

we need to measure the correlation of X & Y.


Measuring Correlation

Population of 1000 students:
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)

– P(S,B) = 420/1000 = 0.42
– P(S) × P(B) = 0.6 × 0.7 = 0.42

– P(S,B) = P(S) × P(B) => independent
– P(S,B) > P(S) × P(B) => positively correlated
– P(S,B) < P(S) × P(B) => negatively correlated
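A quick numerical check of the independence condition, using the student counts from the slide:

```python
p_s = 600 / 1000        # P(S): swimmers
p_b = 700 / 1000        # P(B): bikers
p_sb = 420 / 1000       # P(S,B): students who do both

# Compare P(S,B) with P(S)P(B); use a tolerance to sidestep float rounding.
independent = abs(p_sb - p_s * p_b) < 1e-9
```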


Measuring Correlation

Population of 1000 students:
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)

Lift(S,B) = P(B|S) / P(B) = P(S,B) / (P(S) × P(B))

– P(S,B) = P(S) × P(B)  =>  lift = 1  =>  S & B independent
– P(S,B) > P(S) × P(B)  =>  lift > 1  =>  positively correlated
– P(S,B) < P(S) × P(B)  =>  lift < 1  =>  negatively correlated


Measures on Correlation

The larger the value, usually the more likely the two variables/events are correlated.

Lift(X,Y)    = P(X,Y) / (P(X) × P(Y))

Cosine(X,Y)  = P(X,Y) / sqrt(P(X) × P(Y))

Jaccard(X,Y) = P(X,Y) / (P(X) + P(Y) − P(X,Y))

... many others

[Venn diagram: events X and Y with their overlap X,Y]


Computing Lift from Contingency Table

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0     N

Lift(X,Y) = P(X,Y) / (P(X) × P(Y))
          = (f11/N) / ((f1+/N) × (f+1/N))
          = (N × f11) / (f1+ × f+1)
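The count form of the formula is easy to wrap in a helper; this is a sketch, and the function name is mine:

```python
def lift(f11, f10, f01, f00):
    """Lift of X -> Y computed from the four contingency counts."""
    n = f11 + f10 + f01 + f00
    f1_plus = f11 + f10          # row margin: support count of X
    f_plus1 = f11 + f01          # column margin: support count of Y
    return n * f11 / (f1_plus * f_plus1)
```

For the Tea → Coffee table this gives 100·15 / (20·90) ≈ 0.83, matching the value computed from probabilities.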


Computing Lift from Contingency Table

Association Rule: Tea → Coffee

             Coffee   ¬Coffee
  Tea          15        5        20
  ¬Tea         75        5        80
               90       10       100

Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9

Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)


Drawback of Lift

Statistical independence: if P(X,Y) = P(X) × P(Y), then Lift = 1.

          Y     ¬Y                         Y     ¬Y
  X       10     0     10         X       90     0     90
  ¬X       0    90     90         ¬X       0    10     10
          10    90    100                 90    10    100

Lift = 0.1 / (0.1 × 0.1) = 10     Lift = 0.9 / (0.9 × 0.9) = 1.11

Lift is 10 although X & Y seldom occur together, yet only 1.11 although X & Y occur together in almost every transaction.
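Plugging both extreme tables into the count formula reproduces the problem (the helper name is mine):

```python
def lift(f11, f10, f01, f00):
    # Lift from contingency counts: N * f11 / (f1+ * f+1).
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

rare_together = lift(10, 0, 0, 90)      # X and Y co-occur in only 10% of transactions
always_together = lift(90, 0, 0, 10)    # X and Y co-occur in 90% of transactions

# Lift rewards the rare pattern far more than the near-perfect one.
```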


Compute Cosine from Contingency Table

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0     N

Cosine(X,Y) = P(X,Y) / sqrt(P(X) × P(Y))
            = (f11/N) / sqrt((f1+/N) × (f+1/N))
            = f11 / sqrt(f1+ × f+1)
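Note that N cancels out of the count form. A sketch of the measure (the function name is mine):

```python
import math

def cosine(f11, f10, f01, f00):
    """Cosine (IS) measure of X and Y from contingency counts; N and f00 cancel out."""
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))
```

Applied to the two tables from the lift-drawback slide, cosine gives 1.0 in both cases, treating the rare and the near-universal co-occurrence patterns alike.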


Cosine vs Lift

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0     N

Cosine(X,Y) = P(X,Y) / sqrt(P(X) × P(Y)) = f11 / sqrt(f1+ × f+1)

Lift(X,Y)   = P(X,Y) / (P(X) × P(Y))     = (N × f11) / (f1+ × f+1)

• If X and Y are independent, Lift(X,Y) = 1 but Cosine(X,Y) = sqrt(P(X) × P(Y))
• Cosine does not depend on N (or f00), unlike Lift


Compute Jaccard from Contingency Table

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0     N

Jaccard(X,Y) = P(X,Y) / (P(X) + P(Y) − P(X,Y))
             = f11 / (f1+ + f+1 − f11)
             = f11 / (f11 + f10 + f01)
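In count form this is simply the intersection over the union of the two item columns. A sketch (the function name is mine):

```python
def jaccard(f11, f10, f01, f00):
    """Jaccard coefficient of X and Y: transactions with both / transactions with either."""
    return f11 / (f11 + f10 + f01)
```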


Property under Null Addition

Null addition: add s transactions that contain neither X nor Y, so only the f00 cell (and its margins) grows.

          Y      ¬Y                              Y      ¬Y
  X      f11    f10     f1+            X        f11    f10       f1+
  ¬X     f01    f00     f0+            ¬X       f01    f00 + s   f0+ + s
         f+1    f+0      N                      f+1    f+0 + s    N + s

Invariant measures:

confidence, Cosine, Jaccard, etc.

Non-invariant measures:

support (f11/N vs f11/(N+s)), lift, etc.
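The property is easy to demonstrate: pad the table with s null transactions and watch which measures move (the helper names are mine):

```python
def lift(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def cosine(f11, f10, f01, f00):
    return f11 / ((f11 + f10) * (f11 + f01)) ** 0.5

base = (15, 5, 75, 5)
s = 1000                             # add 1000 transactions with neither X nor Y
padded = (15, 5, 75, 5 + s)

cosine_invariant = cosine(*base) == cosine(*padded)   # f00 never enters the formula
lift_changed = lift(*base) != lift(*padded)           # N does, so lift shifts
```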


Property under Variable Permutation

  M(A,B):                M(B,A):
         B     ¬B               A     ¬A
  A      p      q        B      p      r
  ¬A     r      s        ¬B     q      s

Does M(A,B) = M(B,A)?

If yes, M is symmetric; otherwise asymmetric.

e.g., is c(A→B) = c(B→A)?

c(A→B) = σ(A∪B) / σ(A)        c(B→A) = σ(A∪B) / σ(B)


Property under Variable Permutation

         B     ¬B               A     ¬A
  A      p      q        B      p      r
  ¬A     r      s        ¬B     q      s

Does M(A,B) = M(B,A)?

If yes, M is symmetric; otherwise asymmetric.

Is lift(A→B) = lift(B→A)?

lift(A→B) = P(A,B) / (P(A) × P(B))        lift(B→A) = P(B,A) / (P(B) × P(A))

The two expressions are identical, so lift is symmetric.


Property under Variable Permutation

         B     ¬B               A     ¬A
  A      p      q        B      p      r
  ¬A     r      s        ¬B     q      s

Does M(A,B) = M(B,A)?

If yes, M is symmetric; otherwise asymmetric.

Symmetric measures:

support, lift, Cosine, Jaccard, etc.

Asymmetric measures:

confidence, etc.


Applying Interestingness Measures

Association rules (A → B) are directional:

– confidence (asymmetric) matches this directionality and is intuitive
– but confidence does not capture correlation
– use additional measures such as lift to rank discovered rules & further prune those with low ranks
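A sketch of that post-processing step; the rules and their statistics below are made-up illustrations, not mined from real data:

```python
# Hypothetical mined rules: (antecedent, consequent, confidence, lift).
rules = [
    ("Tea", "Coffee", 0.75, 0.83),      # high confidence but negative correlation
    ("Diaper", "Beer", 0.60, 1.25),
    ("Bread", "Milk", 0.67, 1.05),
]

# Keep only positively correlated rules (lift > 1), then rank by lift.
ranked = sorted((r for r in rules if r[3] > 1.0),
                key=lambda r: r[3], reverse=True)
```

The Tea → Coffee rule survives a confidence threshold but is pruned here, exactly the failure mode the earlier slides illustrate.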
