
Pattern-Growth Methods for Sequential Pattern Mining

Iris Zhang, 2003-05-14

Outline
• Sequential pattern mining
• Apriori-like methods
  – GSP
• Pattern-growth methods
  – FreeSpan
  – PrefixSpan
• Performance analysis
• Conclusions

Motivation

• Sequential pattern mining: finding time-related frequent patterns

• Most data and applications are time-related
  – Customer shopping patterns, telephone calling patterns

– Natural disasters (e.g., earthquake, hurricane)

– Disease and treatment

– Stock market fluctuation

– Weblog click stream analysis

– DNA sequence analysis

Concepts
• Let I = {i1, i2, …, in} be the set of all items
• An itemset is a subset of items
• A sequence is an ordered list of itemsets; the itemsets are called elements. The number of item instances in the sequence is its length
  – e.g. <(ef)(ab)(df)cb> is a sequence with 4 elements and length 8
• A sequence α = <a1 a2 … an> is called a subsequence of β = <b1 b2 … bm>, denoted α ⊑ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn
  – e.g. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
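To make the subsequence relation concrete, here is a minimal Python sketch (the list-of-frozensets representation and the function name are my own, not from the slides): α ⊑ β holds when the elements of α can be matched, in order, to supersets among the elements of β.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]

def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """True if alpha is a subsequence of beta: there are indices
    j1 < j2 < ... < jn with alpha[k] a subset of beta[jk]."""
    j = 0
    for element in alpha:
        # greedily advance to the next element of beta containing `element`
        while j < len(beta) and not element <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

# <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
alpha = [frozenset("a"), frozenset("bc"), frozenset("d"), frozenset("c")]
beta = [frozenset("a"), frozenset("abc"), frozenset("ac"),
        frozenset("d"), frozenset("cf")]
print(is_subsequence(alpha, beta))  # True
```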

Concepts (con’t)
• A sequence database S is a set of tuples <sid, s>, where sid is a sequence_id and s is a sequence. A tuple <sid, s> is said to contain a sequence α if α is a subsequence of s
• The support of α is the number of tuples in the database containing α
• If the support of α is no less than a threshold min_sup, α is called a sequential pattern
  – e.g. <(ab)c> is a sequential pattern in the database below given support threshold min_sup = 2 (see the sketch after the table)

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>
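Continuing the sketch above, support can be counted directly over the four example sequences; the dictionary below is just my transcription of the table (keys are the SIDs).

```python
# The example database, written as lists of itemsets (frozensets).
db = {
    10: [frozenset("a"), frozenset("abc"), frozenset("ac"),
         frozenset("d"), frozenset("cf")],
    20: [frozenset("ad"), frozenset("c"), frozenset("bc"), frozenset("ae")],
    30: [frozenset("ef"), frozenset("ab"), frozenset("df"),
         frozenset("c"), frozenset("b")],
    40: [frozenset("e"), frozenset("g"), frozenset("af"),
         frozenset("c"), frozenset("b"), frozenset("c")],
}

def support(pattern, database):
    """Number of tuples <sid, s> whose sequence s contains `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database.values())

print(support([frozenset("ab"), frozenset("c")], db))  # <(ab)c> -> 2
```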

Problem definition

• Given a sequence database and min_sup threshold, the problem of sequential pattern mining is to find the complete set of sequential patterns in the database

Apriori-like methods

• Apriori property: if a sequence S is not frequent, then every super-sequence of S is not frequent
  – e.g. <bh> is infrequent, so are <abh> and <b(dh)>

• GSP (Generalized Sequential Pattern) algorithm: level by level,
  • Generate candidate sequences
  • Use the Apriori property to prune candidates
  • Scan the database to collect support counts
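To give a feel for the candidate explosion (this is only the length-2 candidate count, not GSP's full join-and-prune step), a short sketch: k frequent items yield k·k ordered candidates <xy> plus k(k-1)/2 two-item element candidates <(xy)>, which is exactly the 51 candidates built from 6 length-1 patterns on the next slide, and 1,499,500 candidates when k = 1,000.

```python
from itertools import combinations, product

def length2_candidates(items):
    """Length-2 candidates from frequent length-1 items: ordered pairs <xy>
    (including <xx>) plus two-item elements <(xy)>."""
    seq_cands = [(x, y) for x, y in product(items, repeat=2)]    # <xy>
    set_cands = [frozenset(p) for p in combinations(items, 2)]   # <(xy)>
    return seq_cands, set_cands

seqs, elems = length2_candidates("abcdef")
print(len(seqs) + len(elems))         # 36 + 15 = 51
print(1000 * 1000 + 1000 * 999 // 2)  # 1,499,500 for 1,000 frequent items
```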

GSP Mining Process

1st scan: 8 candidates, 6 length-1 sequential patterns (<a> <b> <c> <d> <e> <f> <g> <h>)
2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates not in the DB at all (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>)
3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in the DB at all (<abb> <aab> <aba> <baa> <bab> …)
4th scan: 8 candidates, 6 length-4 sequential patterns (<abba> <(bd)bc> …)
5th scan: 1 candidate, 1 length-5 sequential pattern (<(bd)cba>)

(In the original figure, pruned candidates are marked as either failing the support threshold or not appearing in the DB at all.)

Bottlenecks of Apriori-Like Methods
• Potentially huge set of candidate sequences
  – 1,000 frequent length-1 sequences generate 1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates
• Multiple scans of the database
• Difficulties in mining long sequential patterns
  – Exponential number of short candidates
  – A length-100 sequential pattern needs Σ_{i=1..100} C(100, i) = 2^100 - 1 ≈ 10^30 candidate sequences

Pattern-growth methods
• A divide-and-conquer approach
  – Recursively project a sequence database into a set of smaller projected databases
  – Mine each projected database to find its subset of sequential patterns
• Algorithms
  – FreeSpan: Frequent pattern-projected Sequential pattern mining
  – PrefixSpan: Prefix-projected Sequential pattern mining

FreeSpan
• Example: given the sequence database S below and min_support = 2
• Step 1: find length-1 sequential patterns and list them in support-descending order
  – f_list = a:4, b:4, c:4, d:3, e:3, f:3

SID Sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>

FreeSpan (con’t)
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 disjoint subsets:
  – ones containing only item a
  – ones containing item b but no items after b in f_list
  – ones containing item c but no items after c in f_list
  – ones containing item d but no items after d in f_list
  – ones containing item e but no items after e in f_list
  – ones containing item f
• Step 3: find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively

FreeSpan (con’t)
• Finding sequential patterns containing item b but no items after b in f_list
  – <b>-projected database: <a(ab)a>, <aba>, <(ab)b>, <ab> (see the sketch after the table below)
  – Find all length-2 sequential patterns containing item b but no items after b in f_list: <ab>:4, <ba>:2, <(ab)>:2
  – Further partition and mine recursively

SID Sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>
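A minimal sketch of how the <b>-projected database above can be built (my own simplification of FreeSpan's projection, reusing `db` from the earlier sketch): keep the sequences that contain b, and drop from every element the items that come after b in f_list or are not in f_list at all.

```python
def freespan_project(database, item, f_list):
    """Item-projected database: sequences containing `item`, with every item
    ranked after `item` in f_list (or absent from f_list) removed."""
    keep = set(f_list[: f_list.index(item) + 1])
    projected = []
    for seq in database.values():
        if not any(item in element for element in seq):
            continue
        filtered = [element & keep for element in seq]
        projected.append([e for e in filtered if e])  # drop emptied elements
    return projected

f_list = ["a", "b", "c", "d", "e", "f"]
# prints <a(ab)a>, <aba>, <(ab)b>, <ab> for the example database
for s in freespan_project(db, "b", f_list):
    print(["".join(sorted(e)) for e in s])
```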

From FreeSpan to PrefixSpan
• FreeSpan:
  – Projection-based: no candidate sequence needs to be generated
  – But projection can be performed at any point in the sequence, so the projected sequences may not shrink much. For example, the <f>-projected database is the same size as the original sequence database
• PrefixSpan:
  – Also projection-based
  – But only prefix-based projection: fewer projections and quickly shrinking sequences

PrefixSpan-concepts
Suppose all items within an element are listed alphabetically. Given sequences α = <e1 e2 … en> and β = <e'1 e'2 … e'm> (m ≤ n):
• Prefix: β is a prefix of α iff (1) e'i = ei for i ≤ m-1; (2) e'm ⊆ em; and (3) all items in (em - e'm) are alphabetically after those in e'm
  – e.g. for α = <a(abc)(ac)d(cf)>, β = <a(ab)> is a prefix of α, but β' = <a(bc)> is not
• Postfix: the sequence γ = <e''m em+1 … en>, where e''m = (em - e'm), is called the postfix of α w.r.t. prefix β = <e1 e2 … em-1 e'm>, denoted γ = α/β
  – e.g. γ = <(_c)(ac)d(cf)> is the postfix of α = <a(abc)(ac)d(cf)> w.r.t. prefix <a(ab)>

PrefixSpan-concepts (con’t)

• Projected database: let α be a sequential pattern in sequence database S. The α-projected database, denoted S|α, is the collection of postfixes of sequences in S w.r.t. prefix α (a small sketch follows below)

• Support count in a projected database: let α be a sequential pattern in S and β a sequence having prefix α. The support count of β in the α-projected database S|α is the number of sequences γ in S|α such that β ⊑ α·γ
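A minimal sketch of projection under these definitions, restricted to a single-item prefix such as <a> (which is what the following slides use); the general case where the prefix ends with a partial element is not handled here. The representation continues the earlier sketches.

```python
def postfix(seq, item):
    """Postfix of `seq` w.r.t. the single-item prefix <item>, or None.
    The first returned element is the '(_x...)' part: the items of the
    matched element that come alphabetically after `item`."""
    for i, element in enumerate(seq):
        if item in element:
            cont = frozenset(x for x in element if x > item)
            rest = ([cont] if cont else []) + list(seq[i + 1:])
            return rest or None
    return None

# <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
for p in (postfix(s, "a") for s in db.values()):
    if p:
        print(["".join(sorted(e)) for e in p])
```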

PrefixSpan-process
• Step 1: find length-1 sequential patterns
  – <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
  – ones having prefix <a>
  – ones having prefix <b>
  – …
  – ones having prefix <f>
• Step 3: find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively

SID Sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>

PrefixSpan-Process (con’t)
• Finding sequential patterns with prefix <a> (a recursive sketch follows the table below)
  – <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  – Find all length-2 sequential patterns having prefix <a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
  – Further partition into 6 subsets
    • Having prefix <aa>
    • …
    • Having prefix <af>

SID Sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>
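The recursion sketched below (my own condensed rendering, reusing `db` from earlier) mines patterns whose elements are single items: it counts frequent items in the current (projected) database, reports prefix·item for each, and recurses on the corresponding projected database. Itemset extensions such as <(ab)> are omitted for brevity; the full algorithm also grows them via the '_' continuation elements.

```python
from collections import Counter

def prefixspan(database, min_sup, prefix=()):
    """Simplified PrefixSpan: finds sequential patterns whose elements are
    single items (patterns like <(ab)> are not reported)."""
    patterns = []
    counts = Counter()
    for seq in database:                      # does the item appear at all?
        counts.update({item for element in seq for item in element})
    for item in sorted(counts):
        if counts[item] < min_sup:
            continue
        new_prefix = prefix + (item,)
        patterns.append((new_prefix, counts[item]))
        projected = []                        # build the projected database
        for seq in database:
            for i, element in enumerate(seq):
                if item in element:
                    if seq[i + 1:]:           # continuation items dropped
                        projected.append(seq[i + 1:])
                    break
        patterns += prefixspan(projected, min_sup, new_prefix)
    return patterns

for pat, sup in prefixspan(list(db.values()), min_sup=2):
    print("<" + "".join(pat) + ">:", sup)
```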

Completeness of PrefixSpan

SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>

[Figure: the mining process as a tree. The length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f> partition the complete set of patterns by prefix. The <a>-projected database {<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>} yields the length-2 patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, which in turn define the <aa>- through <af>-projected databases; the <b>- through <f>-projected databases are mined in the same way.]

Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking
• Major cost of PrefixSpan: constructing projected databases
  – Can be improved by bi-level projection and pseudo-projection

Optimization Techniques in PrefixSpan
• Single-level vs. bi-level projection
  – Bi-level projection with 3-way checking may reduce the number and size of projected databases
• Physical projection vs. pseudo-projection
  – Pseudo-projection may reduce the cost of projection when the projected database fits in main memory

S-matrix for the sequence database
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
All length-2 sequential patterns can be read from the S-matrix. The entry in row j, column i (with i alphabetically before j) is a triple (x, y, z): x = support of <ij>, y = support of <ji>, z = support of <(ij)>; the diagonal entry for item i is the support of <ii>.

S-matrix
a 2
b (4, 2, 2) 1
c (4, 2, 1) (3, 3, 2) 3
d (2, 1, 1) (2, 2, 0) (1, 3, 0) 0
e (1, 2, 1) (1, 2, 0) (1, 2, 0) (1, 1, 0) 0
f (2, 1, 1) (2, 2, 0) (1, 2, 1) (1, 1, 1) (2, 0, 1) 1
  a         b         c         d         e         f

e.g. the diagonal entry for a is 2, so <aa> happens twice; the entry in row c, column a is (4, 2, 1), so <ac> happens 4 times, <ca> happens twice, and <(ac)> happens once.
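The S-matrix can be filled by one pass over the database per pair of items; the sketch below (reusing `is_subsequence` and `db` from the earlier sketches, with my own function name) reproduces the annotated entries.

```python
from itertools import combinations

def s_matrix(database, items):
    """pair (i, j), i < j  ->  [sup(<ij>), sup(<ji>), sup(<(ij)>)];
    single item i          ->  sup(<ii>)  (the diagonal)."""
    pairs = {p: [0, 0, 0] for p in combinations(items, 2)}
    diag = {i: 0 for i in items}
    for seq in database:
        for (i, j), cell in pairs.items():
            cell[0] += is_subsequence([frozenset(i), frozenset(j)], seq)
            cell[1] += is_subsequence([frozenset(j), frozenset(i)], seq)
            cell[2] += is_subsequence([frozenset(i + j)], seq)
        for i in items:
            diag[i] += is_subsequence([frozenset(i), frozenset(i)], seq)
    return pairs, diag

pairs, diag = s_matrix(db.values(), "abcdef")
print(pairs[("a", "c")], diag["a"])  # [4, 2, 1] 2 -> <ac>:4, <ca>:2, <(ac)>:1, <aa>:2
```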

S-matrix for the <ab>-projected database
• <ab>-projected database: <(_c)(ac)d(cf)>, <(_c)(ae)>, <c>
• Frequent items: <a>, <c>, <(_c)>
• S-matrix:

a    0
c    (1, 0, 1)  1
(_c) (∅, 2, ∅)  (∅, 1, ∅)  ∅
     a          c          (_c)

Here ∅ marks combinations that cannot occur, e.g. there is no <a(_c)>, so no count is needed. The entry in row (_c), column a is (∅, 2, ∅): <(_c)a> has support 2 in this projected database, which leads to the pattern <a(bc)a>.

SID Sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>

Scaling-up by Bi-level Projection
• Partition the search space based on length-2 sequential patterns
• Only form projected databases and pursue recursive mining over bi-level projected databases

Benefits of Bi-level Projection
• More patterns are found in each projection step
• Far fewer projections
  – In the example, there are 53 patterns
  – 53 level-by-level projections
  – 22 bi-level projections

3-way Apriori Checking

• Using Apriori heuristic to prune items in projected databases

a 2

b (4, 2, 2) 1

c (4, 2, 1) (3, 3, 2) 3

d (2, 1, 1) (2, 2, 0) (1, 3, 0) 0

e (1, 2, 1) (1, 2, 0) (1, 2, 0) (1, 1, 0) 0

f (2, 1, 1) (2, 2, 0) (1, 2, 1) (1, 1, 1) (2, 0, 1) 1

a b c d e f

<acd> cannot be a pattern w.r.t. min_support = 2, because <cd> has support 1 in the S-matrix; therefore d can be excluded from the <ac>-projected database (see the sketch below)
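A sketch of the pruning rule as stated on this slide (using the `pairs` and `diag` dictionaries from the S-matrix sketch; the helper names are mine, and only sequence extensions are checked): an item x survives into the <ij>-projected database only if the length-2 subsequences <ix> and <jx> are both frequent.

```python
def sup_seq(p, q, pairs, diag):
    """Support of the length-2 pattern <pq>, read from the S-matrix counts."""
    if p == q:
        return diag[p]
    key = tuple(sorted((p, q)))
    cell = pairs[key]
    return cell[0] if (p, q) == key else cell[1]

def may_extend(i, j, x, pairs, diag, min_sup):
    """Apriori check: <i j x> can be frequent only if <ix> and <jx> are."""
    return (sup_seq(i, x, pairs, diag) >= min_sup
            and sup_seq(j, x, pairs, diag) >= min_sup)

print(may_extend("a", "c", "d", pairs, diag, min_sup=2))  # False: sup(<cd>) = 1
```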

Pseudo-projection
• Major cost of PrefixSpan: projection
  – Postfixes of sequences often appear repeatedly in the recursively projected databases
• When the projected database fits in memory, use pointers to form projections (a sketch follows below)
  – A pointer to the sequence
  – The offset of the postfix

e.g. s = <a(abc)(ac)d(cf)>
  s|<a>: (pointer to s, offset 2), i.e. <(abc)(ac)d(cf)>
  s|<ab>: (pointer to s, offset 4), i.e. <(_c)(ac)d(cf)>
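A minimal sketch of the pointer-and-offset idea (the tuple layout is my assumption, and offsets below count whole elements rather than individual items, so they differ from the 2 and 4 shown above): a pseudo-projected entry is just (sid, offset) into the original database, and no postfix is ever copied.

```python
def pseudo_project(database, entries, item):
    """Extend each (sid, offset) entry by `item`: move the offset just past
    the first element at or after `offset` that contains `item`.
    Continuation ('_') items are ignored, as in the earlier sketches."""
    result = []
    for sid, off in entries:
        seq = database[sid]
        for i in range(off, len(seq)):
            if item in seq[i]:
                result.append((sid, i + 1))
                break
    return result

start = [(sid, 0) for sid in db]           # the whole database
a_proj = pseudo_project(db, start, "a")    # <a>-projected, as pointers
ab_proj = pseudo_project(db, a_proj, "b")  # <ab>-projected
print(ab_proj)  # [(10, 2), (20, 3), (30, 5), (40, 5)]; (30, 5) is an empty postfix
```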

Pseudo-Projection vs. Physical Projection
• Pseudo-projection avoids physically copying postfixes
  – Efficient when the database fits in main memory
  – Not efficient when the database cannot fit in main memory
    • Disk-based random access is very costly
• Suggested approach:
  – Integrate physical and pseudo-projection
  – Swap to pseudo-projection when the data set fits in memory

Experiments

• Synthetic datasets were generated using the procedure described in R. Agrawal and R. Srikant, "Mining Sequential Patterns", ICDE'95
  – number of items: 1,000
  – number of sequences in the data set: 10,000
  – average number of items within elements: 8
  – average number of elements in a sequence: 8

Experiments (con’t)

• Comparing PrefixSpan with GSP and FreeSpan on large databases
  – GSP (IBM Almaden; Srikant & Agrawal, EDBT'96)
  – FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, KDD'00)
  – PrefixSpan-1 (single-level projection)
  – PrefixSpan-2 (bi-level projection)
• Comparing the effect of pseudo-projection
• Comparing I/O cost and scalability

PrefixSpan Is Faster Than GSP and FreeSpan

[Figure: runtime (seconds) vs. support threshold (%), comparing PrefixSpan-1, PrefixSpan-2, FreeSpan, and GSP.]

Effect of Pseudo-Projection (when the projected databases fit in memory)

[Figure: runtime (seconds) vs. support threshold (%), comparing PrefixSpan-1, PrefixSpan-2, PrefixSpan-1 (pseudo), and PrefixSpan-2 (pseudo).]

I/O Cost: When It Cannot Fit in Memory

[Figure: I/O cost vs. support threshold (%), comparing PrefixSpan-1, PrefixSpan-1 (pseudo), PrefixSpan-2, and PrefixSpan-2 (pseudo).]

Scalability (When DB Is Large)

[Figure: runtime (thousand seconds) vs. number of sequences (thousands), comparing PrefixSpan-1 and PrefixSpan-2 at min_sup = 0.2%.]

Conclusions
• Both PrefixSpan and FreeSpan are pattern-growth methods, and both outperform Apriori-like methods on the sequential pattern mining problem
• PrefixSpan is more elegant than FreeSpan
  – The Apriori heuristic is integrated into bi-level projection in PrefixSpan
  – Pseudo-projection substantially enhances the performance of memory-based processing

References

• J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.

• J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.

• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.

Q&A

Thanks