Spring 2014 Presentation by: Thomas Little

Date post: 31-Dec-2015
Mining Sequential Patterns. Rakesh Agrawal & Ramakrishnan Srikant. Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995. Spring 2014 Presentation by: Thomas Little (with slides adapted from Dan Brown's 2011 presentation).
Mining Sequential Patterns

Rakesh Agrawal & Ramakrishnan Srikant

Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995

Spring 2014 Presentation by: Thomas Little

*with slides adapted from Dan Brown's 2011 presentation

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

Introduction

Bar-code technology allows the collection of massive amounts of sales data (basket data).

A typical data record consists of the transaction date, the items bought, and the customer-id.

Introduction (cont.)

The problem of mining sequential patterns over this data is introduced.

So far we have seen frequent pattern mining in the context of association rules, where we were interested in what items were purchased in the same transaction. These are intra-transactional patterns.

Introduction (cont.)

The problem of sequential pattern mining is concerned with inter-transactional patterns.

A pattern in the first case consists of a set of unordered items: {a c d g}

A pattern in the second case is an ordered list of sets of items: <a c d g>

Introduction (cont.)

An example of a sequential pattern:

Customers typically rent "Star Wars", then "The Empire Strikes Back", followed by "Return of the Jedi".

Note that these rentals do not need to be consecutive. Customers who rent other videos in between also support this sequential pattern.

Introduction (cont.)

Elements of a sequential pattern can be sets of items as well. For example:

"Fitted sheet, flat sheet, and pillow cases", followed by "comforter", followed by "drapes and ruffles".

Problem Description

We are given a database D of customer transactions.

Each transaction consists of the fields: customer-id, transaction-time, and the items purchased in the transaction.

Problem Description (cont.)

No customer has more than one transaction with the same transaction-time.

Quantities of items bought are not considered: each item is a binary variable representing whether an item was bought or not.

Problem Description (Terminology and definitions)

Itemset: A non-empty set of items. Each itemset is mapped to an integer.

Sequence: An ordered list of itemsets.

Customer Sequence: The list of a customer's transactions, ordered by increasing transaction time.

A customer supports a sequence if the sequence is contained in the customer-sequence.

Support for a Sequence: The fraction of total customers that support the sequence.
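The containment and support definitions above can be sketched in Python. This is a minimal illustration, not the authors' implementation. It assumes the usual reading of "contained": each itemset of the pattern must be a subset of a distinct transaction, with the matched transactions in the same order. The names contains and support and the rental data are hypothetical.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]  # an ordered list of itemsets


def contains(customer_seq: Sequence, pattern: Sequence) -> bool:
    """Return True if `pattern` is contained in `customer_seq`.

    Each pattern element must be a subset of a distinct transaction,
    and the matched transactions must appear in the same order.
    """
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest transaction covering this element.
        while pos < len(customer_seq) and not element <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next element must match a strictly later transaction
    return True


def support(database: List[Sequence], pattern: Sequence) -> float:
    """Fraction of customers whose sequence contains `pattern`."""
    return sum(contains(c, pattern) for c in database) / len(database)


# Hypothetical rental histories, one list of transactions per customer.
fs = frozenset
database = [
    [fs({"Star Wars"}), fs({"Empire"}), fs({"Jedi"})],
    [fs({"Star Wars"}), fs({"Alien"}), fs({"Empire"}), fs({"Jedi"})],
    [fs({"Jedi"}), fs({"Star Wars"})],
]
pattern = [fs({"Star Wars"}), fs({"Empire"}), fs({"Jedi"})]
print(support(database, pattern))  # two of the three customers support it
```

The second customer still supports the pattern even though another rental sits in between, which is the non-consecutive case described in the Introduction.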

Problem Description (Terminology and definitions) (cont.)

Maximal Sequence: A sequence that is not contained in any other sequence of the set under consideration.

Large Sequence: A sequence that meets minisup (the minimum support).

Length of a sequence: The number of itemsets in the sequence. A sequence of length k is called a k-sequence.

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.

An itemset with minimum support is called a large itemset, or Litemset.

Problem Description (Terminology and definitions) (cont.)

Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.

Problem Description (cont.)

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.

Each such sequence represents a sequential pattern.

Problem Description: Example

Note: With a minisup of 25%, no fewer than two customers must support a sequence.

< (10 20) (30) > does not have enough support (it is supported only by Customer 2).

< (30) >, < (70) >, < (30) (40) >, ... have minimum support but are not maximal.

Figure: sequences with minimum support

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases:

1. Sort Phase

2. Large itemset (Litemset) Phase

3. Transformation Phase

4. Sequence Phase

5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction DB into a DB of customer sequences.

Recall: a Customer Sequence is a list of customer transactions ordered by increasing transaction time.
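The sort phase can be sketched in a few lines of Python. This is only an illustration: the row layout (customer_id, transaction_time, items), the function name to_customer_sequences, and the sample rows are assumptions, not taken from the slides.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (customer_id, transaction_time, items purchased).
transactions = [
    (2, 20, {30}),
    (1, 30, {90}),
    (2, 10, {10, 20}),
    (1, 25, {30}),
    (2, 25, {40, 60, 70}),
]


def to_customer_sequences(rows):
    """Sort by customer-id (major key) and transaction-time (minor key),
    then group consecutive rows of the same customer into one sequence."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    return {
        customer_id: [frozenset(items) for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


customer_sequences = to_customer_sequences(transactions)
for customer_id, sequence in customer_sequences.items():
    print(customer_id, [sorted(t) for t in sequence])
```

Because the sort key is the (customer-id, time) pair, grouping by customer-id afterwards yields each customer's transactions already in increasing time order.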

Finding Sequential Patterns: 2. Litemset Phase

In this phase we find the set of all large itemsets (Litemsets), L.

We are also simultaneously finding the set of large 1-sequences, since this set is just

{ <l> | l ∈ L }

Finding Sequential Patterns: 2. Litemset Phase (cont.)

In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.

In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.

Finding Sequential Patterns: 2. Litemset Phase (cont.)

The set of Litemsets is mapped to a set of contiguous integers.

By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check if a sequence is contained in a customer sequence.
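A brute-force Python sketch of the Litemset phase follows. It is not the Apriori-style counting the authors use; it simply enumerates every sub-itemset of every transaction, which is only reasonable for tiny data. The five-customer database is assumed here to be the one behind the Problem Description example (the slide's table did not survive), and min_count=2 corresponds to a 25% minisup over five customers. The integer ids are assigned in sorted order, which is an arbitrary choice.

```python
from itertools import combinations

# Assumed example database: one list of transactions per customer.
database = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]


def find_litemsets(db, min_count):
    """Return the itemsets bought, in a single transaction, by at least
    `min_count` distinct customers."""
    counts = {}
    for customer in db:
        seen = set()  # itemsets this customer bought in some one transaction
        for transaction in customer:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:  # a customer counts at most once per itemset
            counts[itemset] = counts.get(itemset, 0) + 1
    return sorted(s for s, n in counts.items() if n >= min_count)


litemsets = find_litemsets(database, min_count=2)
# Map each Litemset to a contiguous integer id.
litemset_ids = {itemset: i for i, itemset in enumerate(litemsets, start=1)}
print(litemset_ids)
```

Note that (30 70) is not large: only one customer bought 30 and 70 together in one transaction, even though several bought both items at different times.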

Finding Sequential Patterns: 2. Litemset Phase (cont.)

• Example with a minimum support of 40%

Finding Sequential Patterns: 3. Transformation Phase

• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

Finding Sequential Patterns: 3. Transformation Phase (cont.)

• Each transaction is replaced by the set of all Litemsets contained in the transaction.

• Transactions with no Litemsets are dropped. (But empty customer sequences still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of Litemsets.

Finding Sequential Patterns: 3. Transformation Phase (cont.)

Note: (10 20) is dropped because of lack of support.

(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
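The transformation just described can be sketched as follows. The function name transform and the id assignment in litemset_ids are assumptions made for illustration; any one-to-one mapping of Litemsets to integers would do.

```python
# Assumed mapping from Litemset to integer id (the numbering is arbitrary).
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({40, 70}): 3,
    frozenset({70}): 4,
    frozenset({90}): 5,
}


def transform(customer_seq, ids):
    """Replace each transaction by the set of ids of the Litemsets it
    contains, dropping transactions that contain no Litemset."""
    transformed = []
    for transaction in customer_seq:
        present = {i for litemset, i in ids.items() if litemset <= transaction}
        if present:
            transformed.append(present)
    return transformed


customer_2 = [{10, 20}, {30}, {40, 60, 70}]
print(transform(customer_2, litemset_ids))
```

For this customer the first transaction disappears and the last one becomes the three ids standing for (40), (40 70), and (70). A customer whose transactions all disappear ends up with an empty list but should still be counted in the denominator when computing support.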

Finding Sequential Patterns: 4. Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of Litemsets to find the desired sequences.

• Two families of algorithms are presented:

Count-all

Count-some

Finding Sequential Patterns: 4. Sequence Phase (cont.)

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

Finding Sequential Patterns: 4. Sequence Phase (cont.)

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase: AprioriAll

    L_1 = {large 1-sequences};  // result of Litemset phase
    for (k = 2; L_{k-1} ≠ ∅; k++) do
    begin
        C_k = new candidates generated from L_{k-1};
        foreach customer-sequence c in the database do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
    end
    Answer = maximal sequences in ∪_k L_k;

Notation: L_k: set of all large k-sequences. C_k: set of candidate k-sequences.
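The AprioriAll loop above can be sketched in Python on a transformed database. This is a compact illustration under stated assumptions, not the authors' hash-tree implementation: sequences are tuples of Litemset ids, each transaction is a set of ids, support is given as an absolute customer count, and the five customer sequences below are assumed to be those of the running example (the slide's figure did not survive). The helper names are hypothetical.

```python
from itertools import product


def contains(customer, seq):
    """True if the ids in `seq` occur in order, in distinct transactions."""
    pos = 0
    for litemset_id in seq:
        while pos < len(customer) and litemset_id not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def count_large(candidates, db, min_count):
    """Keep the candidates supported by at least `min_count` customers."""
    return {s for s in candidates
            if sum(contains(c, s) for c in db) >= min_count}


def generate(prev):
    """Join L_{k-1} with itself on the first k-2 ids, then prune candidates
    that have a (k-1)-subsequence outside L_{k-1}."""
    joined = {p + (q[-1],) for p, q in product(prev, repeat=2)
              if p[:-1] == q[:-1]}
    return {s for s in joined
            if all(s[:i] + s[i + 1:] in prev for i in range(len(s)))}


def apriori_all(db, min_count):
    """Return the union of all L_k (the maximal phase is not applied here)."""
    ids = {i for customer in db for transaction in customer for i in transaction}
    large = count_large({(i,) for i in ids}, db, min_count)  # L_1
    all_large = set(large)
    while large:
        large = count_large(generate(large), db, min_count)  # L_k from L_{k-1}
        all_large |= large
    return all_large


# Assumed transformed database D_T: one list of id-sets per customer.
db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large_sequences = apriori_all(db, min_count=2)
print(sorted(large_sequences, key=lambda s: (len(s), s)))
```

The result still contains non-maximal sequences such as <1 2>; removing them is the job of the maximal phase.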

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation (cont.)

Example:

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
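The join and prune steps can be tried out with a short Python sketch. The function name apriori_generate mirrors the slide, but the code is an illustration only. The set L_3 below is assumed to be the one behind the example (the slide's table did not survive). The join follows the SQL literally, so a sequence may be joined with itself; the prune step discards those candidates here.

```python
from itertools import product


def apriori_generate(prev_large):
    """Return (joined, pruned) candidate k-sequences built from L_{k-1}.

    Sequences are tuples of Litemset ids.
    """
    # Join: p and q agree on their first k-2 ids; append q's last id to p.
    joined = {p + (q[-1],) for p, q in product(prev_large, repeat=2)
              if p[:-1] == q[:-1]}
    # Prune: every (k-1)-subsequence (drop one position) must be in L_{k-1}.
    pruned = {c for c in joined
              if all(c[:i] + c[i + 1:] in prev_large for i in range(len(c)))}
    return joined, pruned


# Assumed L_3 for the example.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
joined, C4 = apriori_generate(L3)
print(sorted(joined))
print(sorted(C4))
```

After the join, <1 2 4 3> is among the candidates; after pruning, only <1 2 3 4> remains, because it is the only candidate whose four 3-subsequences are all in L_3.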

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            delete from S all subsequences of s_k;

• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
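A direct, quadratic Python sketch of the maximal phase is shown below. It deliberately swaps the hash-tree approach mentioned on the slide for a plain pairwise scan, so it only illustrates the logic. Sequences are tuples of frozensets so that containment can take subsets into account; the list of large sequences is assumed to be the one from the Problem Description example.

```python
def contained_in(short, long):
    """True if every itemset of `short` is a subset of a distinct itemset
    of `long`, in the same order."""
    pos = 0
    for element in short:
        while pos < len(long) and not element <= long[pos]:
            pos += 1
        if pos == len(long):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Visit sequences from longest to shortest and delete everything
    contained in the sequence being visited."""
    remaining = set(large)
    for s in sorted(large, key=len, reverse=True):
        if s in remaining:
            remaining -= {t for t in remaining
                          if t != s and contained_in(t, s)}
    return remaining


fs = frozenset
# Assumed large sequences for the example (minisup of two customers).
large = {
    (fs({30}),), (fs({40}),), (fs({70}),), (fs({40, 70}),), (fs({90}),),
    (fs({30}), fs({40})), (fs({30}), fs({70})),
    (fs({30}), fs({40, 70})), (fs({30}), fs({90})),
}
answer = maximal_sequences(large)
print([[sorted(itemset) for itemset in s] for s in answer])
```

Only < (30) (40 70) > and < (30) (90) > survive. Note that < (30) (40) > is removed by a sequence of the same length, which is why the sketch compares against all remaining sequences rather than only shorter ones.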

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip.

Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)  // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elsif (hit_k < 0.75) return k + 2;
        elsif (hit_k < 0.80) return k + 3;
        elsif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
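The skipping heuristic translates almost line for line into Python. The function name next_length and the explicit hit_k parameter are assumptions made so the sketch is self-contained (and to avoid shadowing Python's built-in next); the thresholds are the ones on the slide.

```python
def hit_ratio(num_large, num_candidates):
    """hit_k = |L_k| / |C_k| for the pass that was counted last."""
    return num_large / num_candidates


def next_length(k, hit_k):
    """Length of the sequences to count in the next pass.

    The higher the share of candidates that turned out to be large,
    the more lengths are skipped.
    """
    if hit_k < 0.666:
        return k + 1
    if hit_k < 0.75:
        return k + 2
    if hit_k < 0.80:
        return k + 3
    if hit_k < 0.85:
        return k + 4
    return k + 5


print(next_length(2, hit_ratio(9, 25)))   # few candidates were large
print(next_length(2, hit_ratio(18, 20)))  # most candidates were large
```

The intuition is that when most candidates are large, counting the intermediate lengths is likely to be wasted work, since many of those sequences will turn out to be non-maximal.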

Finding Sequential Patterns: 4. AprioriSome Forward Phase

    L_1 = {large 1-sequences};  // result of Litemset phase
    C_1 = L_1;
    last = 1;  // we last counted C_last
    for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
    begin
        if (L_{k-1} known) then
            C_k = new candidates generated from L_{k-1};
        else
            C_k = new candidates generated from C_{k-1};
        if (k == next(last)) then begin  // next k to count
            foreach customer-sequence c in the database do
                increment the count of all candidates in C_k that are contained in c;
            L_k = candidates in C_k with minimum support;
            last = k;
        end
    end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

    for (k--; k >= 1; k--) do
        if (L_k not found in forward phase) then begin
            delete all sequences in C_k contained in some L_i, i > k;
            foreach customer-sequence c in D_T do
                increment the count of all candidates in C_k that are contained in c;
            L_k = candidates in C_k with minimum support;
        end
        else  // L_k already known
            delete all sequences in L_k contained in some L_i, i > k;
    Answer = ∪_k L_k;  // maximal phase not needed

Notation: D_T: transformed database.

Finding Sequential Patterns: 4. AprioriSome Example

Minimum support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in the previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome (cont.)

• In the initialization phase, count only sequences of length up to and including the value of the step variable. If step is 3, count sequences of length 1, 2, and 3.

• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc., on the fly, based on previous passes and the customer sequences in the database.

o For example, when generating sequences of length 9 with a step size of 3: while passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6 . s_3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome (cont.)

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

    // step is an integer ≥ 1
    L_1 = {large 1-sequences};  // result of Litemset phase
    for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
    begin
        C_k = new candidates generated from L_{k-1};
        foreach customer-sequence c in D_T do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
    end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

    for (k = step; L_k ≠ ∅; k += step) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in D_T do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary);
        end
        L_{k+step} = candidates in C_{k+step} with minimum support;
    end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate (cont.)

Let otf-generate be called with parameters L_2 and L_2, and let c = <{1} {2} {3 7} {4}>. Thus c_1 = {1}, c_2 = {2}, etc.

    Seq      End  Start
    <1 2>    2    1
    <1 3>    3    1
    <1 4>    4    1
    <2 3>    3    2
    <2 4>    4    2
    <3 4>    4    3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
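The end/start bookkeeping in this example can be reproduced with a small Python sketch of otf-generate. It is an illustration under stated assumptions: sequences are tuples of Litemset ids, the customer sequence is a list of id-sets, indices are 1-based as on the slide, and L_2 is assumed to be the nine large 2-sequences of the running example. The helper names earliest_end and latest_start are hypothetical.

```python
def earliest_end(customer, seq):
    """Smallest j such that seq is contained in <c_1 ... c_j>, else None."""
    pos = 0
    for litemset_id in seq:
        while pos < len(customer) and litemset_id not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return None
        pos += 1
    return pos  # 1-based index of the transaction matching the last id


def latest_start(customer, seq):
    """Largest j such that seq is contained in <c_j ... c_n>, else None."""
    pos = len(customer) - 1
    for litemset_id in reversed(seq):
        while pos >= 0 and litemset_id not in customer[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # 1-based index of the transaction matching the first id


def otf_generate(large_k, large_j, customer):
    """Concatenate x from L_k and y from L_j whenever x ends before y starts."""
    ends = {x: earliest_end(customer, x) for x in large_k}
    starts = {y: latest_start(customer, y) for y in large_j}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


# Assumed L_2 and the customer sequence c = <{1} {2} {3 7} {4}>.
L2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
c = [{1}, {2}, {3, 7}, {4}]
print(otf_generate(L2, L2, c))
```

Matching greedily from the left gives the earliest possible end, and matching greedily from the right gives the latest possible start, so the condition end < start is exactly the test that the two subsequences can be placed in c without overlapping.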

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

    for (k--; k > 1; k--) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = new candidates generated from L_{k-1};
            else
                C_k = new candidates generated from C_{k-1};

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

• This yields 2 candidate sequences:

    C_4          Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L_4          Support
    <1 2 3 4>    2

• Pass L_2 and L_4 as arguments to otf-generate to get C_6.

• C_6 is found to be empty.

• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.

• C_5 is found to be empty, so only C_3 is counted during the backward phase to get L_3.

Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up

Conclusion

• The problem of mining sequential patterns from a customer DB was introduced.

• Two types of algorithms were introduced to find sequential patterns:

o CountAll: AprioriAll

o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.

Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori-style algorithms studied here: the Litemset phase finds large itemsets as in association-rule mining, and the sequence phase then looks for sequential patterns made up of those itemsets.

Final Exam Question 2

What is the major difference between the two algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for the candidates of every length, but maximality must be checked afterwards, in the maximal phase.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but minimum support is not tested at all values of k in the forward phase.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction by the set of Litemsets it contains, with each Litemset represented by an integer. Two Litemsets can then be compared for equality in constant time, which speeds up the repeated test of whether a sequence is contained in a customer sequence and so reduces the run time.

Page 2: Spring 2014 Presentation by:  Thomas Little

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

1

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

2

Introduction Bar-code technology allows the collection of

massive amounts of sales data (basket data)

A typical data record consists of transaction date items bought customer-id

3

Introduction - Cont The problem of mining sequential patterns

over this data is introduced

So far we have seen frequent pattern mining

in the context of association rules where we were interested in what items were purchased in the same transaction These are intra-transactional patterns

4

Introduction - Cont

The problem of sequential pattern mining is concerned with inter-transactional patterns

A pattern in the first case consists of a set of unordered items

acdg A pattern in the second case is an ordered list of

sets of items

ltacdggt

5

Introduction - Cont

An example of a sequential pattern

Customers typically rent ldquoStar Warsrdquo then ldquoThe Empire Strikes Backrdquo followed by ldquoReturn of the Jedirdquo

Note that these rentals do not need to be consecutive Customers who rent other videos in between also support

this sequential pattern

6

Introduction - Cont

Elements of a sequential pattern can be sets of items as well For example

ldquoFitted sheet flat sheet and pillow casesrdquo followed by ldquocomforterrdquo followed by ldquodrapes and rufflesrdquo

7

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

8

Problem Description

We are given a database D of customer transactions

Each transaction consists of the fields customer-id transaction-time items purchased in the transaction

9

Problem Description No customer has more than one transaction

with the same transaction-time

Quantities of items bought are not

considered each item is a binary variable representing whether an item was bought or not

10

Problem Description(Terminology and definitions)

Itemset non-empty set of items Each itemset is mapped to an integer

Sequence Ordered list of itemsets

Customer Sequence List of customer transactions ordered by increasing transaction time

A customer supports a sequence if the sequence is contained in the customer-sequence

Support for a Sequence Fraction of total customers that support a sequence

11

Problem Description(Terminology and definitions) - Cont

Maximal Sequence A sequence that is not contained in any other sequence

Large Sequence Sequence that meets minisup

Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction

an itemset with minimum support is called a large itemset or Litemset

12

Problem Description(Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets

13

Problem Description - Cont

Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support

Each such sequence represents a sequential pattern

14

Problem DescriptionExample

Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal

Seq with minimum support

15

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

16

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases

1 Sort Phase

2 Large itemset (Litemset) Phase

3 Transformation Phase

4 Sequence Phase

5 Maximal Phase

17

Finding Sequential Patterns1 Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor-key

This step implicitly converts the original transaction DB into a DB of customer sequences

Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time

18

Finding Sequential Patterns2 Litemset Phase

In this phase we find the set of all Large itemsets (Litemsets) L

We are also simultaneously finding the set of large 1-sequences since this set is just

lt l gt | l isin L

19

Finding Sequential Patterns2 Litemset Phase - Cont

In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present

In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions

20

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns4 AprioriAll Candidate Generation

bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1

insert into Ck

select plitemset1 plitemsetk-1 qlitemsetk-1

from Lk-1 p Lk-1 q

where plitemset1 = qlitemset1

plitemsetk-2 = qlitemsetk-2

bull Next delete all sequences c isin Ck such that some

(k-1)-subsequence of c is not in Lk-131

Finding Sequential Patterns4 AprioriAll Candidate Generation

Example

lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure

32

Finding Sequential Patterns4 AprioriAll Maximal Phase

bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence

for ( k = n k gt 1 k --)

foreach k-sequence sk do

Delete from S all subsequences of sk

bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers

33

Finding Sequential Patterns4 AprioriAll Example

34

Finding Sequential Patterns4 AprioriSome

bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip

Let hitk = |Lk| |Ck|

(ie ratio of large k-sequences to candidate k-sequences)

function next(k integer) k is the length of seq counted last pass

beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2

elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5

end

bull next returns the length of sequences to count in the next pass 35

Finding Sequential Patterns4 AprioriSome Forward Phase

L1 = large 1-sequences Result of Litemset phase

C1 = L1

last = 1 We last counted Clast

for (k = 2 Ck-1 ne and Llast ne k++) do

begin

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

if (k== next(last) ) then begin (next k to count)

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

last = k

end

end36

Finding Sequential Patterns4 AprioriSome Backward Phase

for (k-- kgt=1 k--) do

if (Lk not found in forward phase) then begin

Delete all sequences in Ck contained in some L i igtk

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

else Lk already known

Delete all sequences in Lk contained in some Li igtk

Answer = Uk Lk (Maximal Phase not Needed)

Notation DT Transformed database 37

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.

• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

// step is an integer ≥ 1
L_1 = {large 1-sequences};   // Result of the Litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1};
    foreach customer-sequence c in D_T do
        Increment the count of all candidates in C_k that are contained in c.
    L_k = Candidates in C_k with minimum support;
end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

for (k = step; L_k ≠ ∅; k += step) do
begin
    // find L_{k+step} from L_k and L_step
    C_{k+step} = ∅;
    foreach customer-sequence c in D_T do
    begin
        X = otf-generate(L_k, L_step, c);
        foreach sequence x ∈ X do
            increment its count in C_{k+step} (adding it to C_{k+step} if necessary);
    end
    L_{k+step} = Candidates in C_{k+step} with minimum support;
end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c);
forall sequences x ∈ X_k do
    x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> };
X_j = subseq(L_j, c);
forall sequences x ∈ X_j do
    x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> };
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;
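The otf-generate procedure can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the helper names end_of and start_of are hypothetical, sequences are tuples of Litemset ids, the customer sequence is a list of sets of ids, and positions are 1-based as on the slide. The sample set L_2 is an assumed input.

```python
def otf_generate(large_k, large_j, customer):
    """Return the candidate (k+j)-sequences that are contained in `customer`."""
    n = len(customer)

    def end_of(seq):
        # Smallest j such that seq is contained in <c_1 .. c_j>, else None.
        pos = 0
        for x in seq:
            while pos < n and x not in customer[pos]:
                pos += 1
            if pos == n:
                return None
            pos += 1
        return pos

    def start_of(seq):
        # Largest j such that seq is contained in <c_j .. c_n>, else None.
        pos = n - 1
        for x in reversed(seq):
            while pos >= 0 and x not in customer[pos]:
                pos -= 1
            if pos < 0:
                return None
            pos -= 1
        return pos + 2

    ends = {x: end_of(x) for x in large_k}
    starts = {y: start_of(y) for y in large_j}
    # Join: x must end strictly before y starts, so the two do not overlap in c.
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


# Assumed L_2 and a customer sequence with four transactions.
l2 = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]
c = [{1}, {2}, {3, 7}, {4}]
print(otf_generate(l2, l2, c))  # {(1, 2, 3, 4)}
```

Sequences of L_2 that mention id 5 are not contained in c, so they get no end or start value and drop out of the join.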

Finding Sequential Patterns: 4. DynamicSome OTF-Generate cont.

Let otf-generate be called with L_2, and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc. The end and start values of the sequences of L_2 that are contained in c are:

Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

for (k--; k > 1; k--) do
    if (L_k not yet determined) then
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1};
        else
            C_k = New candidates generated from C_{k-1};

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

• This yields 2 candidate sequences:

C_4          Support
<1 2 3 4>    2
<1 3 4 5>    1

• Only <1 2 3 4> is large:

L_4          Support
<1 2 3 4>    2

• Pass L_2 and L_4 as arguments to otf-generate to get C_6.

• C_6 is found to be empty (C_6 = ∅).

• In the intermediate phase, C_3 is generated from L_2, and C_5 from L_4, using apriori-generate.

• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were presented for finding sequential patterns:
o CountAll: AprioriAll
o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower values of minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules describe intra-transaction patterns, while sequential patterns describe inter-transaction patterns. The two are related in the Apriori algorithms studied here because a sequential pattern is an ordered list of large itemsets (Litemsets), the same kind of frequent itemsets from which association rules are derived.

Final Exam Question 2: What is the major difference between the two families of algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every candidate on every pass, but non-maximal sequences must be removed later, in the maximal phase.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned during the run, but minimum support is not tested at every value of k during the forward phase.)

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of Litemsets it contains, each mapped to an integer, so two Litemsets can be compared for equality in constant time. This speeds up the test that is repeated most often: checking whether a candidate sequence is contained in a customer sequence.



Introduction - Cont

An example of a sequential pattern:

Customers typically rent "Star Wars", then "The Empire Strikes Back", followed by "Return of the Jedi".

Note that these rentals do not need to be consecutive. Customers who rent other videos in between also support this sequential pattern.

Introduction - Cont

Elements of a sequential pattern can be sets of items as well. For example:

"Fitted sheet, flat sheet, and pillow cases", followed by "comforter", followed by "drapes and ruffles".

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

Problem Description

We are given a database D of customer transactions.

Each transaction consists of the fields customer-id, transaction-time, and the items purchased in the transaction.

Problem Description

No customer has more than one transaction with the same transaction-time.

Quantities of items bought are not considered: each item is a binary variable representing whether the item was bought or not.

Problem Description (Terminology and definitions)

Itemset: A non-empty set of items. Each itemset is mapped to an integer.

Sequence: An ordered list of itemsets.

Customer Sequence: The list of a customer's transactions, ordered by increasing transaction time.

A customer supports a sequence if the sequence is contained in the customer sequence.

Support for a Sequence: The fraction of total customers that support the sequence.
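The definitions of containment and support can be made concrete with a short Python sketch. The function names and the sample sequences are assumptions for illustration; a sequence is represented as a list of frozensets, and a sequence <a_1 ... a_n> is taken to be contained in <b_1 ... b_m> when there are positions i_1 < ... < i_n with each a_j a subset of b_{i_j}.

```python
def is_contained(seq, customer_seq):
    """True if `seq` is contained in `customer_seq` (both lists of itemsets)."""
    pos = 0
    for itemset in seq:
        # Greedily take the earliest remaining transaction that includes this itemset.
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def support(seq, customer_seqs):
    """Fraction of customers whose customer sequence contains `seq`."""
    return sum(is_contained(seq, c) for c in customer_seqs) / len(customer_seqs)


def fs(*items):
    """Shorthand for building an itemset."""
    return frozenset(items)


customer = [fs(7), fs(3, 8), fs(9), fs(4, 5, 6), fs(8)]
print(is_contained([fs(3), fs(4, 5), fs(8)], customer))  # True
# Items bought together in one transaction do not support a two-step sequence.
print(is_contained([fs(3), fs(5)], [fs(3, 5)]))  # False
```

Each customer is counted at most once, no matter how many times the sequence occurs in that customer's history.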

Problem Description (Terminology and definitions) - Cont

Maximal Sequence: A sequence that, within a set of sequences, is not contained in any other sequence.

Large Sequence: A sequence that meets minisup (the minimum support).

Length of a sequence: The number of itemsets in the sequence. A sequence of length k is called a k-sequence.

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.

An itemset with minimum support is called a large itemset, or Litemset.

Problem Description (Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.

Problem Description - Cont

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.

Each such sequence represents a sequential pattern.

Problem Description: Example

Note: With a minisup of 25%, no fewer than two customers must support a sequence.

< (10 20) (30) > does not have enough support (it is supported only by Customer 2).

< (30) >, < (70) >, < (30) (40) >, ... are not maximal.

(Table: sequences with minimum support)

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases:

1. Sort Phase

2. Large itemset (Litemset) Phase

3. Transformation Phase

4. Sequence Phase

5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction DB into a DB of customer sequences.

Recall: a Customer Sequence is a list of customer transactions ordered by increasing transaction time.
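The sort phase amounts to a two-key sort followed by grouping. The sketch below assumes each transaction is a (customer_id, transaction_time, items) tuple; that record layout, the function name, and the sample rows are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Sort by (customer_id, transaction_time) and group into customer sequences."""
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {cid: [frozenset(items) for _, _, items in group]
            for cid, group in groupby(ordered, key=itemgetter(0))}


transactions = [
    (2, 20, {30}),
    (1, 30, {90}),
    (1, 25, {30}),
    (2, 10, {10, 20}),
]
sequences = to_customer_sequences(transactions)
# Customer 1: <(30) (90)>, customer 2: <(10 20) (30)>
print(sequences[1])
print(sequences[2])
```

Because no customer has two transactions with the same transaction-time, the sort key never ties within a customer and the resulting order is unambiguous.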

Finding Sequential Patterns: 2. Litemset Phase

In this phase we find the set of all large itemsets (Litemsets), L.

We are also simultaneously finding the set of large 1-sequences, since this set is just { <l> | l ∈ L }.

Finding Sequential Patterns: 2. Litemset Phase - Cont

In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.

In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
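The difference between the two notions of support is easy to see in code. In this sketch, both function names and the three-customer database are assumptions for illustration; each customer sequence is a list of item sets.

```python
def customer_support(itemset, customer_seqs):
    """Fraction of customers with at least one transaction containing `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(any(itemset <= t for t in seq) for seq in customer_seqs)
    return hits / len(customer_seqs)


def transaction_support(itemset, customer_seqs):
    """Fraction of all transactions containing `itemset` (association-rule style)."""
    itemset = frozenset(itemset)
    transactions = [t for seq in customer_seqs for t in seq]
    return sum(itemset <= t for t in transactions) / len(transactions)


db = [
    [{30}, {30, 40}],   # customer 1 bought item 30 twice
    [{30}],             # customer 2
    [{40}],             # customer 3
]
print(transaction_support({30}, db))  # 0.75
print(customer_support({30}, db) == 2 / 3)  # True
```

Customer 1 buys item 30 in two transactions but is counted only once under the customer-based definition.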

Finding Sequential Patterns: 2. Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers.

By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.

Finding Sequential Patterns: 2. Litemset Phase - Cont

• Example with a minimum support of 40%

Finding Sequential Patterns: 3. Transformation Phase

• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

Finding Sequential Patterns: 3. Transformation Phase - Cont

• Each transaction is replaced by the set of all Litemsets contained in the transaction.

• Transactions with no Litemsets are dropped. (Customer sequences left empty still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of Litemsets.

Finding Sequential Patterns: 3. Transformation Phase - Cont

Note: (10 20) is dropped because of lack of support.

(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
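The transformation can be sketched as below. The function name and the integer ids assigned to the Litemsets are assumptions for illustration; the customer sequence is the one discussed in the note above.

```python
def transform(customer_seqs, litemset_ids):
    """Replace each transaction by the set of ids of the Litemsets it contains.

    litemset_ids: dict mapping each Litemset (a frozenset of items) to an integer id.
    """
    transformed = []
    for seq in customer_seqs:
        new_seq = []
        for transaction in seq:
            ids = {i for litemset, i in litemset_ids.items()
                   if litemset <= transaction}
            if ids:                      # transactions with no Litemset are dropped
                new_seq.append(ids)
        transformed.append(new_seq)      # kept even if empty: the customer still counts
    return transformed


# Hypothetical id assignment for the Litemsets.
ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
customer = [{10, 20}, {30}, {40, 60, 70}]
print(transform([customer], ids))  # [[{1}, {2, 3, 4}]]
```

After this step, checking whether a candidate contains a given Litemset reduces to an integer membership test on a set.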

Finding Sequential Patterns: 4. Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of Litemsets to find the desired sequences.

• Two families of algorithms are presented:
o Count-all
o Count-some

Finding Sequential Patterns: 4. Sequence Phase

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

Finding Sequential Patterns: 4. Sequence Phase

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase: AprioriAll

L_1 = {large 1-sequences};   // Result of the Litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1};
    foreach customer-sequence c in the database do
        Increment the count of all candidates in C_k that are contained in c.
    L_k = Candidates in C_k with minimum support;
end
Answer = Maximal sequences in ∪_k L_k;

Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as its argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

Example: <1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
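The join and prune steps, and the AprioriAll loop that uses them, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: sequences are tuples of Litemset ids, the transformed database and the set L_3 are hypothetical inputs, and the join does not exclude p = q because the join condition as written does not.

```python
def apriori_generate(large_prev):
    """Build C_k from L_{k-1} (a collection of (k-1)-tuples of Litemset ids)."""
    large_prev = set(large_prev)
    # Join: p and q agree on their first k-2 ids; append q's last id to p.
    joined = {p + (q[-1],)
              for p in large_prev for q in large_prev if p[:-1] == q[:-1]}
    # Prune: dropping any single id must leave a sequence that is in L_{k-1}.
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))}


def contained(candidate, customer):
    """True if `candidate` is contained in `customer` (a list of sets of ids)."""
    it = iter(customer)
    return all(any(x in element for element in it) for x in candidate)


def apriori_all(transformed_db, large_1, min_count):
    """Return {k: set of large k-sequences}; the maximal phase is not included."""
    large = {1: {(i,) for i in large_1}}
    k = 2
    while large[k - 1]:
        candidates = apriori_generate(large[k - 1])
        large[k] = {c for c in candidates
                    if sum(contained(c, cust) for cust in transformed_db) >= min_count}
        k += 1
    del large[k - 1]   # the last level computed is empty
    return large


l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_generate(l3))  # {(1, 2, 3, 4)}

db = [[{1, 5}, {2}, {3}, {4}],
      [{1}, {3}, {4}, {3, 5}],
      [{1}, {2}, {3}, {4}],
      [{1}, {3}, {5}],
      [{4}, {5}]]
large = apriori_all(db, large_1=[1, 2, 3, 4, 5], min_count=2)
print(large[4])  # {(1, 2, 3, 4)}
```

With this L_3, the join also produces <1 2 4 3>, which the prune step removes because <2 4 3> is not in L_3, matching the example above.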

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.

for (k = n; k > 1; k--) do
    foreach k-sequence s_k do
        Delete from S all subsequences of s_k;

• The authors claim that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
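A direct, unoptimized version of the maximal phase is sketched below. It uses plain pairwise comparison rather than the hash-tree structures the authors refer to, and it assumes sequences are tuples of Litemset ids so that containment reduces to an ordinary subsequence test; the function names and sample input are illustrative.

```python
def is_subsequence(short, long):
    """True if tuple `short` is a (not necessarily contiguous) subsequence of `long`."""
    it = iter(long)
    return all(x in it for x in short)


def maximal_sequences(large):
    """Keep only the sequences not contained in another sequence of `large`."""
    maximal = []
    # Longest first, so anything a sequence could be contained in is seen before it.
    for s in sorted(set(large), key=len, reverse=True):
        if not any(is_subsequence(s, m) for m in maximal):
            maximal.append(s)
    return maximal


large = [(1, 2, 3, 4), (1, 2, 3), (1, 3, 5), (1, 3), (3, 5), (4, 5), (4,)]
print(sorted(maximal_sequences(large)))  # [(1, 2, 3, 4), (1, 3, 5), (4, 5)]
```

Comparing only against sequences already known to be maximal is sufficient, since containment is transitive.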

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip.

Let hit_k = |L_k| / |C_k|

(i.e., the ratio of large k-sequences to candidate k-sequences)

function next(k: integer)   // k is the length of the sequences counted in the last pass
begin
    if (hit_k < 0.666) return k + 1;
    elseif (hit_k < 0.75) return k + 2;
    elseif (hit_k < 0.80) return k + 3;
    elseif (hit_k < 0.85) return k + 4;
    else return k + 5;
end

• next returns the length of the sequences to count in the next pass.

Finding Sequential Patterns4 AprioriSome Forward Phase

L1 = large 1-sequences Result of Litemset phase

C1 = L1

last = 1 We last counted Clast

for (k = 2 Ck-1 ne and Llast ne k++) do

begin

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

if (k== next(last) ) then begin (next k to count)

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

last = k

end

end36

Finding Sequential Patterns4 AprioriSome Backward Phase

for (k-- kgt=1 k--) do

if (Lk not found in forward phase) then begin

Delete all sequences in Ck contained in some L i igtk

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

else Lk already known

Delete all sequences in Lk contained in some Li igtk

Answer = Uk Lk (Maximal Phase not Needed)

Notation DT Transformed database 37

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns4 DynamicSome Forward Phase

for ( k = step Lk1048576 ne k+= step ) do

begin

find Lk+step from Lk and Lstep

Ck+step =

foreach customer-sequence c in DT do

begin

X = otf-generate(Lk Lstep c)

foreach sequence x isin Xrsquo increment its count in

Ck+step (adding it to Ck+step if necessary)

end

Lk+step = Candidates in Ck+step with minimum support

end45

Finding Sequential Patterns4 DynamicSome OTF-Generate

c is the customer sequence lt c1c2cn gt

Xk = subseq(Lk c)

forall sequences x isin Xk do

xend = min j | x sube lt c1c2cj gt

Xj = subseq(Lj c)

forall sequences x isin Xj do

xstart = max j | x sube lt cjcj+1cn gt

Answer = join of Xkwith Xj with the join condition

Xkend lt Xjstart46

Finding Sequential Patterns4 DynamicSome OTF-Generate cont

Let otf-generate be called with

L2 and let c = lt1 2 3 7

4gt Thus c1 = 1 c2= 2

etc

Thus the result of the join with the join condition

X2end lt X2start

(where X2 denotes the seq of

length 2) is the single sequence lt1 2 3 4gt

Seq

lt1 2gt

End

2

Start

1

lt1 3gt 3 1

lt1 4gt 4 1

lt2 3gt 3 2

lt2 4gt 4 2

lt3 4gt 4 3

47

Finding Sequential Patterns4 DynamicSome intermediate Phase

for ( k-- k gt 1 k-- ) do

if (Lk not yet determined) then

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

48

Finding Sequential Patterns4 DynamicSome Example

bull Let step = 2

use L2 and L2 as argument in otf-generate to get C4

49

Finding Sequential Patterns4 DynamicSome Example

bull Get 2 candidate sequences

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

50

Finding Sequential Patterns4 DynamicSome Example

bull Only lt1 2 3 4gt is large

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

L4 Minisup

lt1 2 3 4gt 2

51

Finding Sequential Patterns4 DynamicSome Example

bull pass as arg to otf-gen L2 and L4 to get C6

L4 sup

lt1 2 3 4gt 2

52

Finding Sequential Patterns4 DynamicSome Example

bull C6 is found to be empty

L4 sup

lt1 2 3 4gt 2C6 =

53

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

bull The problem of mining sequential patterns from a customer DB was introduced

bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome

bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup

bull AprioriAll and AprioriSome have excellent scale-up properties 62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions63

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

64

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules

65

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

66

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality (The minimum support is checked for each sequence on each run but maximal sequences must be checked for later)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support (Non-maximal sequences are pruned out during runtime but the minimum support is not tested at all values of k) 67

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

68

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

The transformation allows each record to be looked up in constant time reducing the run time

69

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 4: Spring 2014 Presentation by:  Thomas Little

Introduction Bar-code technology allows the collection of

massive amounts of sales data (basket data)

A typical data record consists of transaction date items bought customer-id

3

Introduction - Cont The problem of mining sequential patterns

over this data is introduced

So far we have seen frequent pattern mining

in the context of association rules where we were interested in what items were purchased in the same transaction These are intra-transactional patterns

4

Introduction - Cont

The problem of sequential pattern mining is concerned with inter-transactional patterns

A pattern in the first case consists of a set of unordered items

acdg A pattern in the second case is an ordered list of

sets of items

ltacdggt

5

Introduction - Cont

An example of a sequential pattern

Customers typically rent ldquoStar Warsrdquo then ldquoThe Empire Strikes Backrdquo followed by ldquoReturn of the Jedirdquo

Note that these rentals do not need to be consecutive Customers who rent other videos in between also support

this sequential pattern

6

Introduction - Cont

Elements of a sequential pattern can be sets of items as well For example

ldquoFitted sheet flat sheet and pillow casesrdquo followed by ldquocomforterrdquo followed by ldquodrapes and rufflesrdquo

7

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

8

Problem Description

We are given a database D of customer transactions

Each transaction consists of the fields customer-id transaction-time items purchased in the transaction

9

Problem Description No customer has more than one transaction

with the same transaction-time

Quantities of items bought are not

considered each item is a binary variable representing whether an item was bought or not

10

Problem Description(Terminology and definitions)

Itemset non-empty set of items Each itemset is mapped to an integer

Sequence Ordered list of itemsets

Customer Sequence List of customer transactions ordered by increasing transaction time

A customer supports a sequence if the sequence is contained in the customer-sequence

Support for a Sequence Fraction of total customers that support a sequence

11

Problem Description (Terminology and definitions) - Cont

Maximal Sequence: A sequence that is not contained in any other sequence.

Large Sequence: A sequence that meets minisup (the minimum support).

Length of a sequence: The number of itemsets in the sequence. A sequence of length k is called a k-sequence.

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.

An itemset with minimum support is called a large itemset, or Litemset.

Problem Description (Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.

Problem Description - Cont

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.

Each such sequence represents a sequential pattern.

Problem Description - Example

Note: With a minisup of 25%, no fewer than two customers must support the sequence.

< (10 20) (30) > does not have enough support (it is supported only by Customer 2).

< (30) >, < (70) >, < (30) (40) >, ... are not maximal.

Sequences with minimum support


Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases:

1. Sort Phase

2. Large itemset (Litemset) Phase

3. Transformation Phase

4. Sequence Phase

5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction DB into a DB of customer sequences.

Recall: a Customer Sequence is a list of customer transactions ordered by increasing transaction time.
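The sort phase can be sketched as a sort followed by a group-by. The row layout `(customer_id, transaction_time, items)`, the sample rows, and the helper name `to_customer_sequences` are assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical raw rows: (customer_id, transaction_time, items).
rows = [
    (2, 3, {40, 60, 70}),
    (1, 1, {30}),
    (2, 1, {10, 20}),
    (1, 2, {90}),
    (2, 2, {30}),
]


def to_customer_sequences(rows):
    """Sort by customer id (major key) and time (minor key), then group."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    return {
        customer: [items for _, _, items in group]
        for customer, group in groupby(ordered, key=itemgetter(0))
    }


sequences = to_customer_sequences(rows)
print(sequences[1])  # [{30}, {90}]
```

Because no customer has two transactions with the same transaction-time, the sort key never has to compare the item sets themselves.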

Finding Sequential Patterns: 2. Litemset Phase

In this phase we find the set of all large itemsets (Litemsets), L.

We are also simultaneously finding the set of large 1-sequences, since this set is just

{ <l> | l ∈ L }

Finding Sequential Patterns: 2. Litemset Phase - Cont

In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.

In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
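The customer-based support count can be illustrated with a brute-force sketch. This is a stand-in for an Apriori-style itemset search and is only practical for tiny inputs, since it enumerates every subset of every transaction. The name `large_itemsets`, the `min_count` parameter, and the toy database are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def large_itemsets(database, min_count):
    """Return {itemset: customer_count} for itemsets bought by at least
    min_count customers in some single transaction."""
    counts = Counter()
    for customer_seq in database:
        seen = set()  # itemsets this customer bought in one transaction
        for transaction in customer_seq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(frozenset(c) for c in combinations(items, size))
        counts.update(seen)  # a customer counts at most once per itemset
    return {s: n for s, n in counts.items() if n >= min_count}


# Hypothetical toy database: one list of transactions per customer.
database = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

litemsets = large_itemsets(database, min_count=2)
print(sorted(sorted(s) for s in litemsets))
# [[30], [40], [40, 70], [70], [90]]
```

Collecting itemsets into `seen` before updating the counter is what turns the per-transaction count of Apriori into a per-customer count.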

Finding Sequential Patterns: 2. Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers.

By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check if a sequence is contained in a customer sequence.

Finding Sequential Patterns: 2. Litemset Phase - Cont

• Example with a minimum support of 40%

Finding Sequential Patterns: 3. Transformation Phase

• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

Finding Sequential Patterns: 3. Transformation Phase - Cont

• Each transaction is replaced by the set of all Litemsets contained in the transaction.

• Transactions with no Litemsets are dropped. (But empty customer sequences still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of Litemsets.

Finding Sequential Patterns: 3. Transformation Phase - Cont

Note: (10 20) is dropped because of lack of support.

(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
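The transformation of a single customer sequence can be sketched as follows. The helper name `transform` and the particular integer ids assigned to the Litemsets are assumptions for illustration; only the rule itself (replace each transaction by the Litemsets it contains, drop transactions that contain none) comes from the slides.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction by the ids of the Litemsets it contains."""
    transformed = []
    for transaction in customer_seq:
        ids = {i for s, i in litemset_ids.items() if s <= transaction}
        if ids:  # transactions without any Litemset are dropped
            transformed.append(ids)
    return transformed


# Assumed mapping of Litemsets to contiguous integers.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

result = transform([{10, 20}, {30}, {40, 60, 70}], litemset_ids)
print(result)  # [{1}, {2, 3, 4}]
```

The first transaction (10 20) disappears, and (40 60 70) becomes the ids of (40), (70) and (40 70), matching the note above.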

Finding Sequential Patterns: 4. Sequence Phase Overview

1. Seed set of large sequences

2. Create candidate sequences

3. Scan data to find support of candidate sequences

4. Determine large sequences

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of Litemsets to find the desired sequences.

• Two families of algorithms are presented:

  o Count-all

  o Count-some

Finding Sequential Patterns: 4. Sequence Phase

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

Finding Sequential Patterns: 4. Sequence Phase

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase - AprioriAll

L_1 = {large 1-sequences};  // Result of Litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1}
    foreach customer-sequence c in the database do
        Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
end
Answer = Maximal Sequences in ∪_k L_k

Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as its argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2}

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
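The join and prune steps of apriori-generate can be sketched in Python, with each sequence represented as a tuple of Litemset ids. The function name `apriori_generate` and the contents of the sample L_3 are assumptions made for illustration. The join follows the select statement as written, so a sequence may be joined with itself.

```python
from itertools import combinations


def apriori_generate(prev_large):
    """Build candidate k-sequences from the large (k-1)-sequences."""
    prev_large = set(prev_large)
    if not prev_large:
        return set()
    k = len(next(iter(prev_large))) + 1
    # Join: p and q agree on their first k-2 elements; append q's last one.
    candidates = {
        p + (q[-1],)
        for p in prev_large
        for q in prev_large
        if p[:-1] == q[:-1]
    }
    # Prune: drop c if some (k-1)-subsequence of c is not large.
    return {
        c for c in candidates
        if all(sub in prev_large for sub in combinations(c, k - 1))
    }


# Assumed example input for L_3.
large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_generate(large_3))  # {(1, 2, 3, 4)}
```

With this input, <1 2 4 3> appears after the join but is removed in the prune step because <2 4 3> is not in L_3.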

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            Delete from S all subsequences of s_k

• The authors claim that data structures and an algorithm exist to do this efficiently (hash-trees), citing two of their earlier papers.
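The effect of the maximal phase can be sketched with a simple quadratic filter. This is not the hash-tree approach the authors refer to, just a small illustration that produces the same result on small inputs; the function names and the sample list of large sequences are assumptions.

```python
def contains(big, small):
    """True if sequence small (itemsets) is contained in sequence big."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Keep the sequences that are not contained in any other sequence."""
    return [
        s for s in large
        if not any(s != t and contains(t, s) for t in large)
    ]


def seq(*itemsets):
    """Helper: build a sequence as a tuple of frozensets."""
    return tuple(frozenset(i) for i in itemsets)


# Assumed set of large sequences for a toy database.
large = [
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}),
    seq({30}, {40, 70}),
]

for s in maximal_sequences(large):
    print([sorted(i) for i in s])
# [[30], [90]]
# [[30], [40, 70]]
```

Containment is checked on the itemsets themselves, so <(30) (40)> is correctly recognized as contained in <(30) (40 70)>.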

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip.

Let hit_k = |L_k| / |C_k|

(i.e., the ratio of large k-sequences to candidate k-sequences)

function next(k: integer)  // k is the length of sequences counted in the last pass
begin
    if (hit_k < 0.666) return k + 1
    elseif (hit_k < 0.75) return k + 2
    elseif (hit_k < 0.80) return k + 3
    elseif (hit_k < 0.85) return k + 4
    else return k + 5
end

• next returns the length of sequences to count in the next pass.
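The next function translates directly into Python. In this sketch the hit ratio is passed in as a parameter rather than looked up from L_k and C_k; the name `next_length` and that signature are assumptions made to keep the example self-contained.

```python
def next_length(k, hit_ratio):
    """Length of sequences to count after a pass over length k.

    hit_ratio is |L_k| / |C_k| for the pass just completed.
    """
    if hit_ratio < 0.666:
        return k + 1
    elif hit_ratio < 0.75:
        return k + 2
    elif hit_ratio < 0.80:
        return k + 3
    elif hit_ratio < 0.85:
        return k + 4
    else:
        return k + 5


print(next_length(2, 0.5))  # 3
print(next_length(2, 0.7))  # 4
print(next_length(3, 0.9))  # 8
```

The higher the fraction of candidates that turned out to be large, the more lengths are skipped before the next counting pass.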

Finding Sequential Patterns: 4. AprioriSome Forward Phase

L_1 = {large 1-sequences};  // Result of Litemset phase
C_1 = L_1;
last = 1;  // We last counted C_last
for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
begin
    if (L_{k-1} known) then
        C_k = New candidates generated from L_{k-1}
    else
        C_k = New candidates generated from C_{k-1}
    if (k == next(last)) then begin  // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in C_k that are contained in c
        L_k = Candidates in C_k with minimum support
        last = k
    end
end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

for (k--; k >= 1; k--) do
    if (L_k not found in forward phase) then begin
        Delete all sequences in C_k contained in some L_i, i > k
        foreach customer-sequence c in D_T do
            Increment the count of all candidates in C_k that are contained in c
        L_k = Candidates in C_k with minimum support
    end
    else  // L_k already known
        Delete all sequences in L_k contained in some L_i, i > k
Answer = ∪_k L_k  // Maximal phase not needed

Notation: D_T is the transformed database.

Finding Sequential Patterns: 4. AprioriSome Example

Finding Sequential Patterns: 4. AprioriSome Example - Cont

Minimum Support = 40% (2 customer sequences)

Finding Sequential Patterns: 4. AprioriSome Example

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on large sequences found in the previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.

• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6 . s_3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

// step is an integer ≥ 1
L_1 = {large 1-sequences};  // Result of Litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1}
    foreach customer-sequence c in D_T do
        Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

for (k = step; L_k ≠ ∅; k += step) do
begin
    // find L_{k+step} from L_k and L_step
    C_{k+step} = ∅
    foreach customer-sequence c in D_T do
    begin
        X = otf-generate(L_k, L_step, c)
        foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary)
    end
    L_{k+step} = Candidates in C_{k+step} with minimum support
end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c)
forall sequences x ∈ X_k do
    x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> }
X_j = subseq(L_j, c)
forall sequences x ∈ X_j do
    x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> }
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start

Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont

Let otf-generate be called with L_2, and let c = <{1} {2} {3 7} {4}>. Thus c_1 = {1}, c_2 = {2}, etc.

Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the sequences of length 2) is the single sequence <1 2 3 4>.
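The otf-generate procedure can be sketched in Python for a transformed customer sequence, represented as a list of sets of Litemset ids, with each large sequence as a tuple of ids. The helper names `earliest_end`, `latest_start` and `otf_generate` are assumptions made for this illustration; positions are 1-based to match the End and Start columns of the table.

```python
def earliest_end(seq, c):
    """Smallest 1-based j such that seq is contained in <c_1 ... c_j>."""
    pos = 0
    for x in seq:
        while pos < len(c) and x not in c[pos]:
            pos += 1
        if pos == len(c):
            return None  # seq is not contained in c
        pos += 1
    return pos


def latest_start(seq, c):
    """Largest 1-based j such that seq is contained in <c_j ... c_n>."""
    pos = len(c) - 1
    for x in reversed(seq):
        while pos >= 0 and x not in c[pos]:
            pos -= 1
        if pos < 0:
            return None  # seq is not contained in c
        pos -= 1
    return pos + 2


def otf_generate(large_k, large_j, c):
    """Join sequences of large_k and large_j that do not overlap in c."""
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_j}
    return {
        x + y
        for x, e in ends.items() if e is not None
        for y, s in starts.items() if s is not None and e < s
    }


c = [{1}, {2}, {3, 7}, {4}]
large_2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
print(otf_generate(large_2, large_2, c))  # {(1, 2, 3, 4)}
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, which gives the single candidate <1 2 3 4>.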

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

for (k--; k > 1; k--) do
    if (L_k not yet determined) then
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1}
        else
            C_k = New candidates generated from C_{k-1}

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

Finding Sequential Patterns: 4. DynamicSome Example

• We get 2 candidate sequences:

C_4          Support
<1 2 3 4>    2
<1 3 4 5>    1

Finding Sequential Patterns: 4. DynamicSome Example

• Only <1 2 3 4> is large:

L_4          Support
<1 2 3 4>    2

Finding Sequential Patterns: 4. DynamicSome Example

• Pass L_2 and L_4 as arguments to otf-generate to get C_6.

L_4          Support
<1 2 3 4>    2

Finding Sequential Patterns: 4. DynamicSome Example

• C_6 is found to be empty: C_6 = ∅

Finding Sequential Patterns: 4. DynamicSome Example

• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.

Finding Sequential Patterns: 4. DynamicSome Example

• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.


Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer DB was introduced.

• Two types of algorithms were introduced to find sequential patterns:

  o CountAll: AprioriAll

  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both of these are used in the Apriori algorithms studied here, because the algorithms are looking for different sequential patterns made up of association rules.

Final Exam Question 2

What is the major difference between the two algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each run, but maximal sequences must be checked for later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during runtime, but the minimum support is not tested at all values of k.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation allows each record to be looked up in constant time, reducing the run time.

Page 5: Spring 2014 Presentation by:  Thomas Little

Introduction - Cont The problem of mining sequential patterns

over this data is introduced

So far we have seen frequent pattern mining

in the context of association rules where we were interested in what items were purchased in the same transaction These are intra-transactional patterns

4

Introduction - Cont

The problem of sequential pattern mining is concerned with inter-transactional patterns

A pattern in the first case consists of a set of unordered items

acdg A pattern in the second case is an ordered list of

sets of items

ltacdggt

5

Introduction - Cont

An example of a sequential pattern

Customers typically rent ldquoStar Warsrdquo then ldquoThe Empire Strikes Backrdquo followed by ldquoReturn of the Jedirdquo

Note that these rentals do not need to be consecutive Customers who rent other videos in between also support

this sequential pattern

6

Introduction - Cont

Elements of a sequential pattern can be sets of items as well For example

ldquoFitted sheet flat sheet and pillow casesrdquo followed by ldquocomforterrdquo followed by ldquodrapes and rufflesrdquo

7

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

8

Problem Description

We are given a database D of customer transactions

Each transaction consists of the fields customer-id transaction-time items purchased in the transaction

9

Problem Description No customer has more than one transaction

with the same transaction-time

Quantities of items bought are not

considered each item is a binary variable representing whether an item was bought or not

10

Problem Description(Terminology and definitions)

Itemset non-empty set of items Each itemset is mapped to an integer

Sequence Ordered list of itemsets

Customer Sequence List of customer transactions ordered by increasing transaction time

A customer supports a sequence if the sequence is contained in the customer-sequence

Support for a Sequence Fraction of total customers that support a sequence

11

Problem Description(Terminology and definitions) - Cont

Maximal Sequence A sequence that is not contained in any other sequence

Large Sequence Sequence that meets minisup

Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction

an itemset with minimum support is called a large itemset or Litemset

12

Problem Description(Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets

13

Problem Description - Cont

Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support

Each such sequence represents a sequential pattern

14

Problem DescriptionExample

Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal

Seq with minimum support

15

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

16

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases

1 Sort Phase

2 Large itemset (Litemset) Phase

3 Transformation Phase

4 Sequence Phase

5 Maximal Phase

17

Finding Sequential Patterns1 Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor-key

This step implicitly converts the original transaction DB into a DB of customer sequences

Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time

18

Finding Sequential Patterns2 Litemset Phase

In this phase we find the set of all Large itemsets (Litemsets) L

We are also simultaneously finding the set of large 1-sequences since this set is just

lt l gt | l isin L

19

Finding Sequential Patterns2 Litemset Phase - Cont

In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present

In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions

20

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns4 AprioriAll Candidate Generation

bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1

insert into Ck

select plitemset1 plitemsetk-1 qlitemsetk-1

from Lk-1 p Lk-1 q

where plitemset1 = qlitemset1

plitemsetk-2 = qlitemsetk-2

bull Next delete all sequences c isin Ck such that some

(k-1)-subsequence of c is not in Lk-131

Finding Sequential Patterns4 AprioriAll Candidate Generation

Example

lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure

32

Finding Sequential Patterns4 AprioriAll Maximal Phase

bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence

for ( k = n k gt 1 k --)

foreach k-sequence sk do

Delete from S all subsequences of sk

bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers

33

Finding Sequential Patterns4 AprioriAll Example

34

Finding Sequential Patterns4 AprioriSome

bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip

Let hitk = |Lk| |Ck|

(ie ratio of large k-sequences to candidate k-sequences)

function next(k integer) k is the length of seq counted last pass

beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2

elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5

end

bull next returns the length of sequences to count in the next pass 35

Finding Sequential Patterns4 AprioriSome Forward Phase

L1 = large 1-sequences Result of Litemset phase

C1 = L1

last = 1 We last counted Clast

for (k = 2 Ck-1 ne and Llast ne k++) do

begin

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

if (k== next(last) ) then begin (next k to count)

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

last = k

end

end36

Finding Sequential Patterns4 AprioriSome Backward Phase

for (k-- kgt=1 k--) do

if (Lk not found in forward phase) then begin

Delete all sequences in Ck contained in some L i igtk

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

else Lk already known

Delete all sequences in Lk contained in some Li igtk

Answer = Uk Lk (Maximal Phase not Needed)

Notation DT Transformed database 37

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns4 DynamicSome Forward Phase

for ( k = step Lk1048576 ne k+= step ) do

begin

find Lk+step from Lk and Lstep

Ck+step =

foreach customer-sequence c in DT do

begin

X = otf-generate(Lk Lstep c)

foreach sequence x isin Xrsquo increment its count in

Ck+step (adding it to Ck+step if necessary)

end

Lk+step = Candidates in Ck+step with minimum support

end45

Finding Sequential Patterns4 DynamicSome OTF-Generate

c is the customer sequence lt c1c2cn gt

Xk = subseq(Lk c)

forall sequences x isin Xk do

xend = min j | x sube lt c1c2cj gt

Xj = subseq(Lj c)

forall sequences x isin Xj do

xstart = max j | x sube lt cjcj+1cn gt

Answer = join of Xkwith Xj with the join condition

Xkend lt Xjstart46

Finding Sequential Patterns4 DynamicSome OTF-Generate cont

Let otf-generate be called with

L2 and let c = lt1 2 3 7

4gt Thus c1 = 1 c2= 2

etc

Thus the result of the join with the join condition

X2end lt X2start

(where X2 denotes the seq of

length 2) is the single sequence lt1 2 3 4gt

Seq

lt1 2gt

End

2

Start

1

lt1 3gt 3 1

lt1 4gt 4 1

lt2 3gt 3 2

lt2 4gt 4 2

lt3 4gt 4 3

47

Finding Sequential Patterns4 DynamicSome intermediate Phase

for ( k-- k gt 1 k-- ) do

if (Lk not yet determined) then

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

48

Finding Sequential Patterns4 DynamicSome Example

bull Let step = 2

use L2 and L2 as argument in otf-generate to get C4

49

Finding Sequential Patterns4 DynamicSome Example

bull Get 2 candidate sequences

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

50

Finding Sequential Patterns4 DynamicSome Example

bull Only lt1 2 3 4gt is large

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

L4 Minisup

lt1 2 3 4gt 2

51

Finding Sequential Patterns4 DynamicSome Example

bull pass as arg to otf-gen L2 and L4 to get C6

L4 sup

lt1 2 3 4gt 2

52

Finding Sequential Patterns4 DynamicSome Example

bull C6 is found to be empty

L4 sup

lt1 2 3 4gt 2C6 =

53

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were introduced to find sequential patterns:
    o CountAll: AprioriAll
    o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.

62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

63


Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here: the litemset phase finds large itemsets, the intra-transaction patterns on which association rules are built, and the sequence phase then looks for sequential patterns made up of those litemsets.

65


Final Exam Question 2

What is the major difference between the two algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximal sequences must be checked for later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during runtime, but the minimum support is not tested at all values of k.)

67


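Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of Litemsets it contains, each mapped to an integer, so two Litemsets can be compared for equality in constant time. This reduces the time needed for the repeated test of whether a sequence is contained in a customer sequence.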

69


Introduction - Cont.

The problem of sequential pattern mining is concerned with inter-transactional patterns.

A pattern in the first case consists of a set of unordered items:

{a, c, d, g}

A pattern in the second case is an ordered list of sets of items:

<acdg>

5

Introduction - Cont.

An example of a sequential pattern:

Customers typically rent “Star Wars”, then “The Empire Strikes Back”, followed by “Return of the Jedi”.

Note that these rentals do not need to be consecutive. Customers who rent other videos in between also support this sequential pattern.

6

Introduction - Cont.

Elements of a sequential pattern can be sets of items as well. For example:

“Fitted sheet, flat sheet and pillow cases”, followed by “comforter”, followed by “drapes and ruffles”.

7

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

8

Problem Description

We are given a database D of customer transactions.

Each transaction consists of the fields: customer-id, transaction-time, and the items purchased in the transaction.

9

Problem Description

No customer has more than one transaction with the same transaction-time.

Quantities of items bought are not considered: each item is a binary variable representing whether an item was bought or not.

10

Problem Description (Terminology and Definitions)

Itemset: A non-empty set of items. Each itemset is mapped to an integer.

Sequence: An ordered list of itemsets.

Customer Sequence: The list of a customer's transactions, ordered by increasing transaction time.

A sequence <a1 a2 ... an> is contained in a sequence <b1 b2 ... bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.

A customer supports a sequence if the sequence is contained in the customer sequence.

Support for a Sequence: The fraction of total customers that support the sequence.

11
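The definitions above can be made concrete with a small sketch. The Python below is an illustration only: the helper names (`seq`, `contains`, `support`) and the five customer sequences in `DB` are assumptions made for this example, chosen to agree with the notes on the Problem Description example slide.

```python
def seq(*itemsets):
    """Build a sequence (ordered list of itemsets) from plain Python sets."""
    return [frozenset(s) for s in itemsets]


def contains(customer_seq, pattern):
    """True if each itemset of pattern is a subset of a distinct
    transaction of customer_seq, with the order preserved."""
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest remaining transaction covering this itemset.
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Fraction of customers whose customer sequence contains pattern."""
    return sum(contains(c, pattern) for c in db) / len(db)


# Assumed illustrative data: one customer sequence per customer.
DB = [
    seq({30}, {90}),
    seq({10, 20}, {30}, {40, 60, 70}),
    seq({30, 50, 70}),
    seq({30}, {40, 70}, {90}),
    seq({90}),
]

if __name__ == "__main__":
    print(support(DB, seq({30}, {90})))
    print(support(DB, seq({10, 20}, {30})))
```

The rentals need not be consecutive, which is why the greedy scan simply skips transactions that do not cover the next itemset of the pattern.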

Problem Description (Terminology and Definitions) - Cont.

Maximal Sequence: A sequence that is not contained in any other sequence.

Large Sequence: A sequence that meets minisup.

Length of a Sequence: The number of itemsets in the sequence. A sequence of length k is called a k-sequence.

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.

An itemset with minimum support is called a large itemset or Litemset.

12

Problem Description (Terminology and Definitions) - Cont.

Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.

13

Problem Description - Cont.

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.

Each such sequence represents a sequential pattern.

14

Problem Description: Example

Note: Use a minisup of 25%, i.e., no fewer than two customers must support the sequence.

< (10 20) (30) > does not have enough support (it is supported only by Customer 2).

< (30) >, < (70) >, < (30) (40) >, … are not maximal.

Seq with minimum support

15

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

16

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases:

1. Sort Phase

2. Large itemset (Litemset) Phase

3. Transformation Phase

4. Sequence Phase

5. Maximal Phase

17

Finding Sequential Patterns: 1. Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction DB into a DB of customer sequences.

Recall: a Customer Sequence is a list of customer transactions ordered by increasing transaction time.

18
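As a rough illustration of the sort phase, the sketch below sorts transaction records and groups them into customer sequences. The record layout `(customer_id, transaction_time, items)` and the function name `to_customer_sequences` are assumptions made for this example, not something the slides prescribe.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Sort by (customer_id, transaction_time), then group by customer.

    transactions: iterable of (customer_id, transaction_time, items) tuples.
    Returns a dict mapping customer_id to its ordered list of itemsets.
    """
    # customer-id is the major key, transaction-time the minor key.
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [frozenset(items) for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


if __name__ == "__main__":
    # Hypothetical unsorted transaction records.
    rows = [
        (2, 3, {40, 60, 70}),
        (1, 1, {30}),
        (2, 1, {10, 20}),
        (1, 2, {90}),
        (2, 2, {30}),
    ]
    for customer_id, sequence in to_customer_sequences(rows).items():
        print(customer_id, [sorted(t) for t in sequence])
```

Sorting on the two keys only, rather than on whole records, avoids comparing the item sets themselves.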

Finding Sequential Patterns: 2. Litemset Phase

In this phase we find the set of all large itemsets (Litemsets), L.

We are also simultaneously finding the set of large 1-sequences, since this set is just

{ <l> | l ∈ L }

19

Finding Sequential Patterns: 2. Litemset Phase - Cont.

In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.

In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.

20
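The difference between the two support definitions shows up in how counts are accumulated: an itemset is counted at most once per customer. The sketch below is a brute-force illustration under assumed data (`DB` is hypothetical, and the threshold is given as a customer count `min_customers`); it enumerates every subset of every transaction, which is only sensible for tiny examples and is not the Apriori-style search a real implementation would use.

```python
from collections import Counter
from itertools import combinations


def large_itemsets(db, min_customers):
    """Return {itemset: number of supporting customers} for itemsets
    bought, within a single transaction, by at least min_customers."""
    counts = Counter()
    for customer_seq in db:
        seen = set()  # itemsets this customer bought in some single transaction
        for transaction in customer_seq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        counts.update(seen)  # each customer contributes at most 1 per itemset
    return {itemset: n for itemset, n in counts.items() if n >= min_customers}


# Assumed illustrative data: one list of transactions per customer.
DB = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

LARGE = large_itemsets(DB, min_customers=2)

if __name__ == "__main__":
    for itemset, n in sorted(LARGE.items()):
        print(itemset, n)
```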

Finding Sequential Patterns: 2. Litemset Phase - Cont.

The set of Litemsets is mapped to a set of contiguous integers.

By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check if a sequence is contained in a customer sequence.

21

Finding Sequential Patterns: 2. Litemset Phase - Cont.

• Example with a minimum support of 40%

22

Finding Sequential Patterns: 3. Transformation Phase

• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

23

Finding Sequential Patterns: 3. Transformation Phase - Cont.

• Each transaction is replaced by the set of all Litemsets contained in the transaction.

• Transactions with no Litemsets are dropped. (But empty customer sequences still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of Litemsets.

24

Finding Sequential Patterns: 3. Transformation Phase - Cont.

Note: (10 20) is dropped because of lack of support.

(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).

25
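The transformation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `transform` and the particular integer ids in `LITEMSET_IDS` are assumptions made for the example.

```python
def transform(db, litemset_ids):
    """Replace each transaction by the set of ids of the Litemsets it
    contains; drop transactions that contain no Litemset.

    db: list of customer sequences (lists of sets of items).
    litemset_ids: dict mapping each Litemset (a frozenset) to an integer id.
    """
    transformed = []
    for customer_seq in db:
        new_seq = []
        for transaction in customer_seq:
            ids = {i for litemset, i in litemset_ids.items()
                   if litemset <= transaction}
            if ids:  # transactions with no Litemsets are dropped
                new_seq.append(ids)
        # Kept even when empty, so the total customer count is unchanged.
        transformed.append(new_seq)
    return transformed


# Assumed mapping of Litemsets to contiguous integers.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

if __name__ == "__main__":
    customer = [{10, 20}, {30}, {40, 60, 70}]
    print(transform([customer], LITEMSET_IDS))
```

After this step, checking whether a candidate element occurs in a transaction is a set-membership test on an integer rather than a subset test on items.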

Finding Sequential Patterns: 4. Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences

26

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of Litemsets to find the desired sequences.

• Two families of algorithms are presented:
    o Count-all
    o Count-some

27

Finding Sequential Patterns: 4. Sequence Phase

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

28

Finding Sequential Patterns: 4. Sequence Phase

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

29

Finding Sequential Patterns: 4. Sequence Phase, AprioriAll

L1 = {large 1-sequences};  // Result of Litemset phase
for ( k = 2; Lk-1 ≠ ∅; k++ ) do
begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c
    Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in ∪k Lk

Notation: Lk: Set of all large k-sequences. Ck: Set of candidate k-sequences.

30

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:

    insert into Ck
    select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
    from Lk-1 p, Lk-1 q
    where p.litemset1 = q.litemset1, ...,
          p.litemsetk-2 = q.litemsetk-2;

• Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.

31

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.

32
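The join and prune steps, and the AprioriAll loop that uses them, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: sequences are tuples of Litemset ids, the database is already transformed, the threshold is a customer count, and the names `join`, `prune`, `apriori_all`, `L3_EXAMPLE` and `DB` are all hypothetical. `L3_EXAMPLE` is an assumed input chosen to be consistent with the slide's remark about <1 2 4 3>.

```python
from itertools import combinations


def contains(customer_seq, candidate):
    """customer_seq: list of sets of Litemset ids; candidate: tuple of ids."""
    pos = 0
    for lid in candidate:
        while pos < len(customer_seq) and lid not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def join(prev):
    """Join step: pair (k-1)-sequences that agree on their first k-2 ids."""
    return {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}


def prune(candidates, prev):
    """Prune step: drop candidates with a (k-1)-subsequence not in prev."""
    return {c for c in candidates
            if all(s in prev for s in combinations(c, len(c) - 1))}


def apriori_all(db, min_customers):
    """Return {k: set of large k-sequences} for a transformed database."""
    def count(cand):
        return sum(contains(c, cand) for c in db)

    ids = {lid for c in db for transaction in c for lid in transaction}
    large = {1: {(i,) for i in ids if count((i,)) >= min_customers}}
    k = 2
    while large[k - 1]:
        candidates = prune(join(large[k - 1]), large[k - 1])
        large[k] = {c for c in candidates if count(c) >= min_customers}
        k += 1
    del large[k - 1]  # the last level is empty
    return large


# Assumed inputs for illustration.
L3_EXAMPLE = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

if __name__ == "__main__":
    print(prune(join(L3_EXAMPLE), L3_EXAMPLE))
    print(apriori_all(DB, min_customers=2))
```

The join here has no inequality condition, following the select statement as written, so a sequence may be joined with itself; such candidates survive only if all of their subsequences are large.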

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.

    for ( k = n; k > 1; k-- ) do
        foreach k-sequence sk do
            Delete from S all subsequences of sk

• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.

33
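A direct, unoptimized rendering of this deletion loop is sketched below. It does not use hash-trees; it simply compares sequences pairwise. The names `is_subsequence` and `maximal_sequences`, and the set `LARGE` of large sequences, are assumptions for the example. Sequences are written over itemsets rather than integer ids so that <(30) (40)> is recognized as contained in <(30) (40 70)>.

```python
def is_subsequence(small, big):
    """True if small is contained in big (itemsets matched by subset, in order)."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Process sequences from longest to shortest, deleting every other
    sequence that is contained in the one in hand."""
    remaining = set(large)
    for s in sorted(large, key=len, reverse=True):
        if s in remaining:
            remaining -= {t for t in remaining
                          if t != s and is_subsequence(t, s)}
    return remaining


def seq(*itemsets):
    """Hashable sequence: a tuple of frozensets."""
    return tuple(frozenset(s) for s in itemsets)


# Assumed set of large sequences for illustration.
LARGE = [
    seq({30}), seq({40}), seq({70}), seq({90}), seq({40, 70}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}),
    seq({30}, {40, 70}),
]

if __name__ == "__main__":
    for s in maximal_sequences(LARGE):
        print([sorted(itemset) for itemset in s])
```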

Finding Sequential Patterns: 4. AprioriAll Example

34

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function “next” to determine which sequences to skip.

Let hitk = |Lk| / |Ck| (i.e., the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)  // k is the length of the sequences counted in the last pass
    begin
        if (hitk < 0.666) return k + 1;
        elseif (hitk < 0.75) return k + 2;
        elseif (hitk < 0.80) return k + 3;
        elseif (hitk < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.

35
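For reference, the same heuristic written as a small Python function. The name `next_length` and the choice to pass the hit ratio as an explicit argument are assumptions of this sketch; the thresholds are the ones on the slide.

```python
def next_length(k, hit_k):
    """Length of sequences to count in the next pass.

    k: length of the sequences counted in the last pass.
    hit_k: |Lk| / |Ck|, the fraction of candidates that turned out large.
    """
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


if __name__ == "__main__":
    print(next_length(2, 9 / 25))
    print(next_length(3, 0.9))
```

The higher the hit ratio, the more lengths are skipped, on the expectation that most candidates at the skipped lengths would be large but non-maximal.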

Finding Sequential Patterns: 4. AprioriSome Forward Phase

L1 = {large 1-sequences};  // Result of Litemset phase
C1 = L1;
last = 1;  // We last counted Clast
for ( k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++ ) do
begin
    if (Lk-1 known) then
        Ck = New candidates generated from Lk-1
    else
        Ck = New candidates generated from Ck-1
    if ( k == next(last) ) then begin  // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
        last = k
    end
end

36

Finding Sequential Patterns: 4. AprioriSome Backward Phase

for ( k--; k >= 1; k-- ) do
    if (Lk not found in forward phase) then begin
        Delete all sequences in Ck contained in some Li, i > k
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
    end
    else  // Lk already known
        Delete all sequences in Lk contained in some Li, i > k
Answer = ∪k Lk  (Maximal Phase not needed)

Notation: DT: Transformed database

37

Finding Sequential Patterns: 4. AprioriSome Example

38

Finding Sequential Patterns: 4. AprioriSome Example - Cont.

Minimum Support = 40% (2 customer sequences)

39

Finding Sequential Patterns: 4. AprioriSome Example

Answer: <1 2 3 4>, <1 3 5>, <4 5>

40

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates Ck from Lk-1 or Ck-1.

• DynamicSome generates Ck “on the fly”, based on the large sequences found in the previous passes and the customer sequences read from the database.

41

Finding Sequential Patterns: 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.

• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

    o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.

42

Finding Sequential Patterns: 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

    o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.

• The backward phase is identical to that of AprioriSome.

43

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

// step is an integer ≥ 1
L1 = {large 1-sequences};  // Result of Litemset phase
for ( k = 2; k <= step and Lk-1 ≠ ∅; k++ ) do
begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c
    Lk = Candidates in Ck with minimum support
end

44

Finding Sequential Patterns: 4. DynamicSome Forward Phase

for ( k = step; Lk ≠ ∅; k += step ) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c)
        foreach sequence x ∈ X do
            Increment its count in Ck+step (adding it to Ck+step if necessary)
    end
    Lk+step = Candidates in Ck+step with minimum support
end

45

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c)
forall sequences x ∈ Xk do
    x.end = min{ j | x ⊆ <c1 c2 ... cj> }
Xj = subseq(Lj, c)
forall sequences x ∈ Xj do
    x.start = max{ j | x ⊆ <cj cj+1 ... cn> }
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start

46

Finding Sequential Patterns: 4. DynamicSome OTF-Generate (cont.)

Let otf-generate be called with L2, and let c = <(1) (2) (3 7) (4)>. Thus c1 = (1), c2 = (2), etc.

Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.

Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3

47
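The otf-generate procedure and the worked example above can be reproduced with the following sketch. The function names (`earliest_end`, `latest_start`, `otf_generate`) and the representation (sequences as tuples of Litemset ids, the customer sequence as a list of sets of ids) are assumptions of this illustration. The set `L2` is taken from the table on the slide.

```python
def earliest_end(customer_seq, x):
    """x.end: smallest 1-based j with x contained in <c1 ... cj>, or None."""
    pos = 0
    for lid in x:
        while pos < len(customer_seq) and lid not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return None
        pos += 1
    return pos


def latest_start(customer_seq, x):
    """x.start: largest 1-based j with x contained in <cj ... cn>, or None."""
    pos = len(customer_seq) - 1
    for lid in reversed(x):
        while pos >= 0 and lid not in customer_seq[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2


def otf_generate(l_k, l_j, customer_seq):
    """Candidates x.y with x in l_k, y in l_j, both contained in the
    customer sequence, where x ends strictly before y starts."""
    ends = {x: earliest_end(customer_seq, x) for x in l_k}
    starts = {y: latest_start(customer_seq, y) for y in l_j}
    return {
        x + y
        for x, end in ends.items() if end is not None
        for y, start in starts.items() if start is not None and end < start
    }


C = [{1}, {2}, {3, 7}, {4}]
L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}

if __name__ == "__main__":
    for x in sorted(L2):
        print(x, earliest_end(C, x), latest_start(C, x))
    print(otf_generate(L2, L2, C))
```

Matching greedily from the left gives the earliest possible end, and matching greedily from the right gives the latest possible start, which is what makes the non-overlap test a single integer comparison.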

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 7: Spring 2014 Presentation by:  Thomas Little

Introduction - Cont

An example of a sequential pattern

Customers typically rent ldquoStar Warsrdquo then ldquoThe Empire Strikes Backrdquo followed by ldquoReturn of the Jedirdquo

Note that these rentals do not need to be consecutive Customers who rent other videos in between also support

this sequential pattern

6

Introduction - Cont

Elements of a sequential pattern can be sets of items as well For example

ldquoFitted sheet flat sheet and pillow casesrdquo followed by ldquocomforterrdquo followed by ldquodrapes and rufflesrdquo

7

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

8

Problem Description

We are given a database D of customer transactions

Each transaction consists of the fields customer-id transaction-time items purchased in the transaction

9

Problem Description No customer has more than one transaction

with the same transaction-time

Quantities of items bought are not

considered each item is a binary variable representing whether an item was bought or not

10

Problem Description(Terminology and definitions)

Itemset non-empty set of items Each itemset is mapped to an integer

Sequence Ordered list of itemsets

Customer Sequence List of customer transactions ordered by increasing transaction time

A customer supports a sequence if the sequence is contained in the customer-sequence

Support for a Sequence Fraction of total customers that support a sequence

11

Problem Description(Terminology and definitions) - Cont

Maximal Sequence A sequence that is not contained in any other sequence

Large Sequence Sequence that meets minisup

Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction

an itemset with minimum support is called a large itemset or Litemset

12

Problem Description(Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets

13

Problem Description - Cont

Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support

Each such sequence represents a sequential pattern

14

Problem DescriptionExample

Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal

Seq with minimum support

15

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

16

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases

1 Sort Phase

2 Large itemset (Litemset) Phase

3 Transformation Phase

4 Sequence Phase

5 Maximal Phase

17

Finding Sequential Patterns1 Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor-key

This step implicitly converts the original transaction DB into a DB of customer sequences

Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time

18

Finding Sequential Patterns2 Litemset Phase

In this phase we find the set of all Large itemsets (Litemsets) L

We are also simultaneously finding the set of large 1-sequences since this set is just

lt l gt | l isin L

19

Finding Sequential Patterns2 Litemset Phase - Cont

In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present

In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions

20

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns4 AprioriAll Candidate Generation

bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1

insert into Ck

select plitemset1 plitemsetk-1 qlitemsetk-1

from Lk-1 p Lk-1 q

where plitemset1 = qlitemset1

plitemsetk-2 = qlitemsetk-2

bull Next delete all sequences c isin Ck such that some

(k-1)-subsequence of c is not in Lk-131

Finding Sequential Patterns4 AprioriAll Candidate Generation

Example

lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure

32

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence:

    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            Delete from S all subsequences of s_k

• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
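A direct, unoptimized Python rendering of this loop might look like the sketch below. It does not use hash-trees, and it assumes sequences are tuples of Litemset ids that are treated as atomic (subset relations between the underlying itemsets are ignored). The names and the sample set `LARGE` are hypothetical.

```python
def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting elements."""
    remaining = iter(long)
    return all(item in remaining for item in short)


def maximal_sequences(large_sequences):
    """Keep only the sequences not contained in a longer large sequence."""
    s = set(large_sequences)
    if not s:
        return s
    n = max(len(seq) for seq in s)
    for k in range(n, 1, -1):
        # Snapshot the k-sequences, then delete their shorter subsequences.
        for s_k in [seq for seq in s if len(seq) == k]:
            s -= {t for t in s if len(t) < k and is_subsequence(t, s_k)}
    return s


# Assumed set of large sequences (illustrative input).
LARGE = {
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
}
maximal = maximal_sequences(LARGE)
```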

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function “next” to determine which sequence lengths to skip.

Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)  // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elsif (hit_k < 0.75) return k + 2;
        elsif (hit_k < 0.80) return k + 3;
        elsif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
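The skip heuristic translates almost line for line into Python. In this small sketch the hit ratio is passed in explicitly rather than looked up by k; the names `hit_ratio` and `next_length` are chosen here for illustration.

```python
def hit_ratio(num_large, num_candidates):
    """hit_k = |L_k| / |C_k| for the pass that was just counted."""
    return num_large / num_candidates


def next_length(k, hit_k):
    """Length of the sequences to count in the next pass."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5
```

The higher the fraction of candidates that turned out to be large, the more lengths are skipped, on the expectation that most shorter large sequences will be non-maximal.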

Finding Sequential Patterns: 4. AprioriSome Forward Phase

    L_1 = {large 1-sequences};  // result of Litemset phase
    C_1 = L_1;
    last = 1;  // we last counted C_last
    for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
    begin
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1}
        else
            C_k = New candidates generated from C_{k-1}
        if (k == next(last)) then begin  // next k to count
            foreach customer-sequence c in the database do
                Increment the count of all candidates in C_k that are contained in c
            L_k = Candidates in C_k with minimum support
            last = k;
        end
    end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

    for (k--; k >= 1; k--) do
        if (L_k not found in forward phase) then begin
            Delete all sequences in C_k contained in some L_i, i > k
            foreach customer-sequence c in D_T do
                Increment the count of all candidates in C_k that are contained in c
            L_k = Candidates in C_k with minimum support
        end
        else  // L_k already known
            Delete all sequences in L_k contained in some L_i, i > k
    Answer = ∪_k L_k  // maximal phase not needed

Notation: D_T = transformed database

Finding Sequential Patterns: 4. AprioriSome Example

Minimum support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, DynamicSome skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k “on the fly”, based on the large sequences found in previous passes and the customer sequences read from the database.

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.

• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

    // step is an integer ≥ 1
    L_1 = {large 1-sequences};  // result of Litemset phase
    for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
    begin
        C_k = New candidates generated from L_{k-1}
        foreach customer-sequence c in D_T do
            Increment the count of all candidates in C_k that are contained in c
        L_k = Candidates in C_k with minimum support
    end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

    for (k = step; L_k ≠ ∅; k += step) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in D_T do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X do increment its count in C_{k+step} (adding it to C_{k+step} if necessary)
        end
        L_{k+step} = Candidates in C_{k+step} with minimum support
    end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x is contained in <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x is contained in <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate (cont.)

Let otf-generate be called with L_2 for both L_k and L_j, and let c = <{1} {2} {3 7} {4}>. Thus c_1 = {1}, c_2 = {2}, etc. The end and start values of the sequences in L_2 that are contained in c are:

    Seq      End  Start
    <1 2>    2    1
    <1 3>    3    1
    <1 4>    4    1
    <2 3>    3    2
    <2 4>    4    2
    <3 4>    4    3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
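The worked example can be reproduced with a short Python sketch of otf-generate. It assumes a customer sequence in the transformed representation (a list of sets of Litemset ids) and non-empty sequences given as tuples of ids. The helper names `earliest_end` and `latest_start` are hypothetical, and `L2` is simply the six sequences from the table above.

```python
def earliest_end(seq, c):
    """Smallest 1-based j such that seq is contained in <c_1 ... c_j>, else None."""
    pos = 0
    for j, transaction in enumerate(c, start=1):
        if seq[pos] in transaction:
            pos += 1
            if pos == len(seq):
                return j
    return None


def latest_start(seq, c):
    """Largest 1-based j such that seq is contained in <c_j ... c_n>, else None."""
    pos = len(seq) - 1
    for j in range(len(c), 0, -1):
        if seq[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


def otf_generate(l_k, l_j, c):
    """Join sequences of l_k and l_j that occur in c without overlapping."""
    ends = {x: earliest_end(x, c) for x in l_k}
    starts = {y: latest_start(y, c) for y in l_j}
    return {
        x + y
        for x, end in ends.items() if end is not None
        for y, start in starts.items() if start is not None and end < start
    }


L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
c = [{1}, {2}, {3, 7}, {4}]
candidates = otf_generate(L2, L2, c)
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, which gives the single candidate <1 2 3 4>.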

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

    for (k--; k > 1; k--) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = New candidates generated from L_{k-1}
            else
                C_k = New candidates generated from C_{k-1}

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

• This yields 2 candidate sequences:

    C_4          Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L_4          Support
    <1 2 3 4>    2

• Pass L_2 and L_4 as arguments to otf-generate to get C_6.

• C_6 is found to be empty (C_6 = ∅).

• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.

• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.


Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum support.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets, the same intra-transaction patterns from which association rules are built.

Final Exam Question 2

What is the major difference between the two families of algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every candidate sequence on each pass, but maximal sequences must be identified afterwards.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out at run time, but minimum support is not tested at every value of k.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of Litemsets it contains, and each Litemset is mapped to an integer, so two Litemsets can be compared for equality in constant time. This reduces the time needed for the repeated test of whether a sequence is contained in a customer sequence.


Introduction - Cont.

Elements of a sequential pattern can be sets of items as well. For example:

“Fitted sheet, flat sheet, and pillow cases”, followed by “comforter”, followed by “drapes and ruffles”.


Problem Description

We are given a database D of customer transactions.

Each transaction consists of the fields: customer-id, transaction-time, and the items purchased in the transaction.

No customer has more than one transaction with the same transaction-time.

Quantities of items bought are not considered: each item is a binary variable representing whether the item was bought or not.

Problem Description (Terminology and Definitions)

Itemset: a non-empty set of items. Each itemset is mapped to an integer.

Sequence: an ordered list of itemsets.

Customer sequence: the list of a customer's transactions, ordered by increasing transaction-time.

A customer supports a sequence if the sequence is contained in the customer sequence.

Support for a sequence: the fraction of total customers that support the sequence.

Maximal sequence: a sequence that is not contained in any other sequence.

Large sequence: a sequence that meets the minimum support (minisup).

Length of a sequence: the number of itemsets in the sequence. A sequence of length k is called a k-sequence.

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.

An itemset with minimum support is called a large itemset, or Litemset.

Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.

Problem Description - Cont.

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.

Each such maximal sequence represents a sequential pattern.

Problem Description: Example

Table: sequences with minimum support

Note: with a minisup of 25%, no fewer than two customers must support a sequence.

<(10 20) (30)> does not have enough support (it is supported only by Customer 2).

<(30)>, <(70)>, <(30) (40)>, … are not maximal.
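The definitions of containment and support can be made concrete with a small Python sketch. The five customer sequences below are an assumed toy database, chosen to be consistent with the notes on the example slide (the original table is not part of this transcript); the function names are hypothetical.

```python
def is_contained(seq, customer_seq):
    """True if each itemset of `seq` is a subset of a distinct transaction, in order."""
    pos = 0
    for transaction in customer_seq:
        if pos < len(seq) and seq[pos] <= transaction:
            pos += 1
    return pos == len(seq)


def support(seq, database):
    """Fraction of customers whose customer sequence contains `seq`."""
    supporters = sum(is_contained(seq, cs) for cs in database.values())
    return supporters / len(database)


# Assumed database: customer id -> transactions ordered by time.
DB = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

s1 = support([{30}, {90}], DB)
s2 = support([{30}, {40, 70}], DB)
s3 = support([{10, 20}, {30}], DB)
```

Under these assumptions the first two sequences are each supported by two of the five customers, while <(10 20) (30)> is supported only by Customer 2 and therefore falls below a 25% minimum support.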


Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases:

1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase

The database is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction database into a database of customer sequences.

Recall: a customer sequence is the list of a customer's transactions, ordered by increasing transaction-time.
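A minimal Python sketch of the sort phase, assuming each transaction is a tuple of (customer id, transaction time, set of items). The record layout, the function name `sort_phase`, and the sample records are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter


def sort_phase(transactions):
    """Sort by (customer id, time) and group into customer sequences."""
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [items for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical unsorted transaction records.
TRANSACTIONS = [
    (2, 3, {40, 60, 70}),
    (1, 1, {30}),
    (2, 1, {10, 20}),
    (1, 2, {90}),
    (2, 2, {30}),
]
customer_sequences = sort_phase(TRANSACTIONS)
```

Sorting on the two keys only (rather than on whole tuples) avoids comparing the item sets themselves, which have no total order.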

Finding Sequential Patterns: 2. Litemset Phase

In this phase we find the set L of all large itemsets (Litemsets).

We simultaneously find the set of large 1-sequences, since this set is just { <l> | l ∈ L }.

In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.

In the sequential pattern problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.

The set of Litemsets is mapped to a set of contiguous integers.

By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.

• Example with a minimum support of 40%
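The per-customer support count and the mapping to contiguous integers could be sketched as follows. This is a brute-force illustration that enumerates every subset of every transaction, so it is only suitable for toy data and is not the Apriori-style search the phase would use in practice. The database, the function name, and the numbering order are assumptions.

```python
from itertools import combinations


def find_litemsets(database, min_count):
    """Map each large itemset to an integer id, counting each customer at most once."""
    counts = {}
    for customer_seq in database.values():
        seen = set()  # itemsets bought by this customer in some single transaction
        for transaction in customer_seq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                for combo in combinations(items, size):
                    seen.add(frozenset(combo))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    large = sorted((s for s, n in counts.items() if n >= min_count), key=sorted)
    return {itemset: i for i, itemset in enumerate(large, start=1)}


# Assumed toy database: customer id -> transactions ordered by time.
DB = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
litemset_ids = find_litemsets(DB, min_count=2)
```

The ids themselves are arbitrary. What matters is that each Litemset becomes a single integer that can be compared in constant time.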

Finding Sequential Patterns: 3. Transformation Phase

• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

• Each transaction is replaced by the set of all Litemsets contained in the transaction.

• Transactions with no Litemsets are dropped. (Customer sequences that become empty still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of Litemsets.

Note: (10 20) is dropped because of lack of support.

(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)}; 60 does not have minimum support.
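The transformation of a single customer sequence can be sketched in a few lines of Python. The set `LITEMSETS` and the sample customer sequence are assumptions that mirror the note above, and the function name `transform` is hypothetical.

```python
def transform(customer_seq, litemsets):
    """Replace each transaction by the set of Litemsets it contains; drop empty ones."""
    transformed = []
    for transaction in customer_seq:
        contained = {l for l in litemsets if l <= transaction}
        if contained:
            transformed.append(contained)
    return transformed


# Assumed Litemsets and customer sequence (illustrative).
LITEMSETS = {
    frozenset({30}), frozenset({40}), frozenset({70}),
    frozenset({40, 70}), frozenset({90}),
}
result = transform([{10, 20}, {30}, {40, 60, 70}], LITEMSETS)
```

In a fuller implementation each Litemset would then be replaced by its integer id, so the customer sequence becomes a list of sets of integers.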

Finding Sequential Patterns: 4. Sequence Phase Overview

1. Seed set of large sequences
2. Create candidate sequences
3. Scan data to find support of candidate sequences
4. Determine large sequences

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of Litemsets to find the desired sequences.

• Two families of algorithms are presented:
  o Count-all
  o Count-some



Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

8

Problem Description

We are given a database D of customer transactions

Each transaction consists of the fields customer-id transaction-time items purchased in the transaction

9

Problem Description No customer has more than one transaction

with the same transaction-time

Quantities of items bought are not

considered each item is a binary variable representing whether an item was bought or not

10

Problem Description(Terminology and definitions)

Itemset non-empty set of items Each itemset is mapped to an integer

Sequence Ordered list of itemsets

Customer Sequence List of customer transactions ordered by increasing transaction time

A customer supports a sequence if the sequence is contained in the customer-sequence

Support for a Sequence Fraction of total customers that support a sequence

11

Problem Description(Terminology and definitions) - Cont

Maximal Sequence A sequence that is not contained in any other sequence

Large Sequence Sequence that meets minisup

Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction

an itemset with minimum support is called a large itemset or Litemset

12

Problem Description (Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.

Problem Description - Cont

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.

Each such sequence represents a sequential pattern.

Problem Description: Example

Note: Use a minisup of 25%, i.e. no fewer than two customers must support the sequence.

< (10 20) (30) > does not have enough support (it is supported only by Customer 2).

< (30) >, < (70) >, < (30) (40) >, ... are not maximal.

[Table: sequences with minimum support]

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions


Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases:

1. Sort Phase

2. Large itemset (Litemset) Phase

3. Transformation Phase

4. Sequence Phase

5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction DB into a DB of customer sequences.

Recall: a Customer Sequence is a list of customer transactions ordered by increasing transaction time.
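A minimal sketch of the sort phase in Python, assuming each transaction arrives as a (customer_id, transaction_time, items) tuple. The function name `sort_phase` and the tuple layout are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def sort_phase(transactions):
    """Group transactions into customer sequences.

    Sorting uses customer_id as the major key and transaction_time as the
    minor key, so each group is already ordered by increasing time.
    """
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [frozenset(items) for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical input rows: (customer_id, transaction_time, items)
rows = [
    (2, 5, {30}),
    (1, 3, {90}),
    (2, 1, {10, 20}),
    (1, 1, {30}),
]
customer_sequences = sort_phase(rows)
print(customer_sequences)
```

Sorting on the pair of keys relies on the stated condition that no customer has two transactions with the same transaction-time.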

Finding Sequential Patterns: 2. Litemset Phase

In this phase we find the set of all large itemsets (Litemsets), L.

We are also simultaneously finding the set of large 1-sequences, since this set is just

{ <l> | l ∈ L }

Finding Sequential Patterns: 2. Litemset Phase - Cont

In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.

In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
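The per-customer counting rule can be illustrated with a short Python sketch. It enumerates every subset of every transaction by brute force, which is an assumption made for brevity and not the Apriori-style search used in the paper. The names `find_litemsets`, `fs` and `toy_db` are hypothetical, and the data is a toy example.

```python
from collections import Counter
from itertools import combinations


def find_litemsets(customer_seqs, min_customers):
    """Return the itemsets supported by at least min_customers customers.

    A customer counts at most once per itemset, no matter how many of
    their transactions contain it.
    """
    counts = Counter()
    for customer_seq in customer_seqs:
        seen = set()
        for transaction in customer_seq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                for combo in combinations(items, size):
                    seen.add(frozenset(combo))
        counts.update(seen)  # one increment per customer, not per transaction
    return {itemset for itemset, n in counts.items() if n >= min_customers}


def fs(*items):
    return frozenset(items)


# Hypothetical toy database of customer sequences.
toy_db = [
    [fs(30), fs(90)],
    [fs(10, 20), fs(30), fs(40, 60, 70)],
    [fs(30, 50, 70)],
    [fs(30), fs(40, 70), fs(90)],
    [fs(90)],
]

litemsets = find_litemsets(toy_db, min_customers=2)
print(sorted(sorted(s) for s in litemsets))
```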

Finding Sequential Patterns: 2. Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers.

By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.

Finding Sequential Patterns: 2. Litemset Phase - Cont

• Example with a minimum support of 40%

Finding Sequential Patterns: 3. Transformation Phase

• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

Finding Sequential Patterns: 3. Transformation Phase - Cont

• Each transaction is replaced by the set of all Litemsets contained in the transaction.

• Transactions with no Litemsets are dropped. (Empty customer sequences still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of Litemsets.

Finding Sequential Patterns: 3. Transformation Phase - Cont

Note: (10 20) is dropped because of lack of support.

(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
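The transformation described on these slides can be sketched as follows. The function name `transform` and the particular integer ids assigned to the Litemsets are assumptions chosen for illustration.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction by the ids of the Litemsets it contains.

    Transactions that contain no Litemset are dropped. The customer still
    counts toward the total number of customers even if the result is empty.
    """
    transformed = []
    for transaction in customer_seq:
        ids = frozenset(
            lid for litemset, lid in litemset_ids.items()
            if litemset <= transaction
        )
        if ids:
            transformed.append(ids)
    return transformed


def fs(*items):
    return frozenset(items)


# Hypothetical mapping of Litemsets to contiguous integers.
litemset_ids = {fs(30): 1, fs(40): 2, fs(70): 3, fs(40, 70): 4, fs(90): 5}

customer = [fs(10, 20), fs(30), fs(40, 60, 70)]
result = transform(customer, litemset_ids)
print(result)
```

After this step, checking whether a Litemset occurs in a transaction becomes an integer membership test rather than a set-inclusion test.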

Finding Sequential Patterns: 4. Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of Litemsets to find the desired sequences.

• Two families of algorithms are presented:

  o Count-all

  o Count-some

Finding Sequential Patterns: 4. Sequence Phase

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

Finding Sequential Patterns: 4. Sequence Phase

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase: AprioriAll

L_1 = {large 1-sequences}; // result of the Litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
    C_k = new candidates generated from L_{k-1};
    foreach customer-sequence c in the database do
        increment the count of all candidates in C_k that are contained in c;
    L_k = candidates in C_k with minimum support;
end
Answer = maximal sequences in ∪_k L_k;

Notation: L_k is the set of all large k-sequences. C_k is the set of candidate k-sequences.
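The AprioriAll loop can be sketched in Python over a transformed database, where each customer sequence is a list of sets of Litemset ids and each candidate is a tuple of ids. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the join skips pairs with p == q, and the toy database is an assumption chosen so that its maximal large sequences match the answer quoted on the AprioriSome example slide.

```python
def contains(customer_seq, seq):
    """True if the id sequence seq is contained in the transformed customer_seq."""
    pos = 0
    for lid in seq:
        while pos < len(customer_seq) and lid not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def apriori_generate(prev):
    """Join L_{k-1} with itself, then prune candidates with a non-large subsequence."""
    joined = {
        p + (q[-1],)
        for p in prev for q in prev
        if p != q and p[:-1] == q[:-1]  # assumption: p == q is skipped
    }
    return {
        c for c in joined
        if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))
    }


def apriori_all(db, large_1, min_count):
    """Return all large sequences (maximal and non-maximal)."""
    large = {1: set(large_1)}
    k = 2
    while large[k - 1]:
        candidates = apriori_generate(large[k - 1])
        counts = {c: 0 for c in candidates}
        for customer_seq in db:
            for c in candidates:
                if contains(customer_seq, c):
                    counts[c] += 1
        large[k] = {c for c, n in counts.items() if n >= min_count}
        k += 1
    return set().union(*large.values())


# Hypothetical transformed database (sets of Litemset ids per transaction).
db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large_1 = {(1,), (2,), (3,), (4,), (5,)}
all_large = apriori_all(db, large_1, min_count=2)
print(sorted(all_large, key=lambda s: (len(s), s)))
```

The result still contains non-maximal sequences such as (1, 2); removing them is the job of the maximal phase.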

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
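The join and prune steps can be traced with a small Python sketch. The set `l3` below is an assumed input, since the slide's table did not survive in this transcript; it was chosen so that <1 2 4 3> is generated by the join and then pruned, as the slide describes. Skipping pairs with p == q in the join is also an assumption of this sketch.

```python
def join_step(prev):
    """Join L_{k-1} with itself on the first k-2 Litemsets."""
    return {
        p + (q[-1],)
        for p in prev for q in prev
        if p != q and p[:-1] == q[:-1]  # assumption: p == q is skipped
    }


def prune_step(candidates, prev):
    """Drop candidates that have some (k-1)-subsequence outside L_{k-1}."""
    return {
        c for c in candidates
        if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))
    }


# Assumed large 3-sequences, written as tuples of Litemset ids.
l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}

joined = join_step(l3)
c4 = prune_step(joined, l3)
print(sorted(joined))
print(sorted(c4))
```

Here (1, 2, 4, 3) appears after the join but is removed by the prune because (2, 4, 3) is not in `l3`.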

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            delete from S all subsequences of s_k;

• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
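A quadratic-time sketch of the maximal phase in Python, working on sequences of itemsets so that itemset inclusion is taken into account. It does not use the hash-tree structures the authors refer to. One deliberate difference, which is an assumption of this sketch: the loop runs down to k = 1 so that a 1-sequence contained in another 1-sequence is also removed. The names and the sample set `large` are hypothetical.

```python
def is_contained(seq, other):
    """True if seq is contained in other (both are tuples of frozensets)."""
    pos = 0
    for itemset in seq:
        while pos < len(other) and not itemset <= other[pos]:
            pos += 1
        if pos == len(other):
            return False
        pos += 1
    return True


def maximal_phase(large):
    """Remove every sequence that is contained in another large sequence."""
    remaining = set(large)
    longest = max((len(s) for s in remaining), default=0)
    for k in range(longest, 0, -1):
        for s_k in [s for s in remaining if len(s) == k]:
            remaining -= {
                s for s in remaining
                if s != s_k and is_contained(s, s_k)
            }
    return remaining


def fs(*items):
    return frozenset(items)


# Hypothetical set of large sequences.
large = {
    (fs(30),), (fs(40),), (fs(70),), (fs(40, 70),), (fs(90),),
    (fs(30), fs(40)), (fs(30), fs(70)), (fs(30), fs(90)),
    (fs(30), fs(40, 70)),
}
maximal = maximal_phase(large)
print(maximal)
```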

Finding Sequential Patterns: 4. AprioriAll Example

[Figure: AprioriAll worked example]

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip.

Let hit_k = |L_k| / |C_k|

(i.e. the ratio of large k-sequences to candidate k-sequences)

function next(k: integer) // k is the length of the sequences counted in the last pass
begin
    if (hit_k < 0.666) return k + 1;
    elsif (hit_k < 0.75) return k + 2;
    elsif (hit_k < 0.80) return k + 3;
    elsif (hit_k < 0.85) return k + 4;
    else return k + 5;
end

• next returns the length of the sequences to count in the next pass.
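The skipping heuristic translates directly into Python. The thresholds are the ones shown on the slide; the function names `hit_ratio` and `next_length`, and passing the hit ratio as an explicit argument, are assumptions of this sketch.

```python
def hit_ratio(num_large, num_candidates):
    """hit_k = |L_k| / |C_k|, the share of counted candidates that were large."""
    return num_large / num_candidates


def next_length(k, hit_k):
    """Length of the sequences to count in the next pass.

    The higher the hit ratio in the last counted pass, the more lengths
    are skipped before counting again.
    """
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


print(next_length(2, hit_ratio(2, 4)))
print(next_length(3, 0.9))
```

The intuition is that when most candidates turn out to be large, many of them are likely to be non-maximal, so counting longer candidates first may save work.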

Finding Sequential Patterns: 4. AprioriSome Forward Phase

L_1 = {large 1-sequences}; // result of the Litemset phase
C_1 = L_1;
last = 1; // we last counted C_last
for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
begin
    if (L_{k-1} known) then
        C_k = new candidates generated from L_{k-1};
    else
        C_k = new candidates generated from C_{k-1};
    if (k == next(last)) then begin // next k to count
        foreach customer-sequence c in the database do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
        last = k;
    end
end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

for (k--; k >= 1; k--) do
    if (L_k not found in forward phase) then begin
        delete all sequences in C_k contained in some L_i, i > k;
        foreach customer-sequence c in D_T do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
    end
    else // L_k already known
        delete all sequences in L_k contained in some L_i, i > k;

Answer = ∪_k L_k; // maximal phase not needed

Notation: D_T is the transformed database.
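The pruning step at the start of each backward pass can be sketched in Python on sequences of Litemset ids. The function names and the sample sets are hypothetical. Treating "contained in" as an ordinary subsequence test on id tuples is a simplifying assumption of this sketch.

```python
def is_subsequence(small, big):
    """True if small can be obtained from big by deleting elements."""
    remaining = iter(big)
    return all(x in remaining for x in small)


def prune_non_maximal(candidates, longer_large):
    """Delete candidates contained in some already-known longer large sequence.

    Such candidates cannot be maximal, so their support need not be counted.
    """
    return {
        c for c in candidates
        if not any(is_subsequence(c, s) for s in longer_large)
    }


# Hypothetical skipped candidate set C_3 and known L_4.
c3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (1, 4, 5), (2, 3, 4), (3, 4, 5)}
l4 = {(1, 2, 3, 4)}

to_count = prune_non_maximal(c3, l4)
print(sorted(to_count))
```

Only the surviving candidates are counted against the database, which is where the count-some approach can save work compared with counting every C_k.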

Finding Sequential Patterns: 4. AprioriSome Example

Minimum Support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in the previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.

• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

  o For example, when generating sequences of length 9 with a step size of 3: while passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

// step is an integer ≥ 1
L_1 = {large 1-sequences}; // result of the Litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
    C_k = new candidates generated from L_{k-1};
    foreach customer-sequence c in D_T do
        increment the count of all candidates in C_k that are contained in c;
    L_k = candidates in C_k with minimum support;
end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

for (k = step; L_k ≠ ∅; k += step) do
begin
    // find L_{k+step} from L_k and L_step
    C_{k+step} = ∅;
    foreach customer-sequence c in D_T do
    begin
        X = otf-generate(L_k, L_step, c);
        foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary);
    end
    L_{k+step} = candidates in C_{k+step} with minimum support;
end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c);
forall sequences x ∈ X_k do
    x.end = min{ j | x is contained in <c_1 c_2 ... c_j> };
X_j = subseq(L_j, c);
forall sequences x ∈ X_j do
    x.start = max{ j | x is contained in <c_j c_{j+1} ... c_n> };
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont

Let otf-generate be called with L_2 (as both large-sequence arguments), and let c = <{1} {2} {3 7} {4}>. Thus c_1 = {1}, c_2 = {2}, etc.

Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
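The otf-generate example above can be reproduced with a short Python sketch. It assumes a transformed customer sequence (a list of sets of Litemset ids) and large sequences written as tuples of ids; the helper names `earliest_end` and `latest_start` are hypothetical, and positions are 1-based to match the table.

```python
def earliest_end(seq, customer_seq):
    """Smallest j such that seq is contained in <c_1 ... c_j>, or None."""
    pos = 0
    for lid in seq:
        while pos < len(customer_seq) and lid not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return None
        pos += 1
    return pos  # 1-based position of the last matched element


def latest_start(seq, customer_seq):
    """Largest j such that seq is contained in <c_j ... c_n>, or None."""
    pos = len(customer_seq) - 1
    for lid in reversed(seq):
        while pos >= 0 and lid not in customer_seq[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # 1-based position of the first matched element


def otf_generate(l_k, l_j, customer_seq):
    """Candidate (k+j)-sequences contained in customer_seq."""
    ends = {x: earliest_end(x, customer_seq) for x in l_k}
    starts = {y: latest_start(y, customer_seq) for y in l_j}
    return {
        x + y
        for x, end in ends.items() if end is not None
        for y, start in starts.items() if start is not None and end < start
    }


c = [{1}, {2}, {3, 7}, {4}]
l2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
candidates = otf_generate(l2, l2, c)
print(candidates)
```

The join condition end < start guarantees that the two halves occupy disjoint, ordered parts of the customer sequence, which is what "do not overlap" means on the earlier slide.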

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

for (k--; k > 1; k--) do
    if (L_k not yet determined) then
        if (L_{k-1} known) then
            C_k = new candidates generated from L_{k-1};
        else
            C_k = new candidates generated from C_{k-1};

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

• This yields 2 candidate sequences:

C_4          Support
<1 2 3 4>    2
<1 3 4 5>    1

• Only <1 2 3 4> is large:

L_4          Support
<1 2 3 4>    2

• Pass L_2 and L_4 as arguments to otf-generate to get C_6.

• C_6 is found to be empty: C_6 = ∅

• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.

• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Conclusion

• The problem of mining sequential patterns from a customer DB was introduced.

• Two types of algorithms were introduced to find sequential patterns:

  o CountAll: AprioriAll

  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules describe intra-transaction patterns (items bought together in one transaction), while sequential patterns describe inter-transaction patterns (itemsets bought in order across a customer's transactions). The two are related in the Apriori-style algorithms studied here: the elements of a sequential pattern are large itemsets, found with the same kind of frequent-itemset search that underlies association rules.

Final Exam Question 2

What is the major difference between the two families of algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximal sequences must be identified later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but the minimum support is not tested at all values of k.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of Litemsets it contains, each represented by a single integer. Two Litemsets can then be compared in constant time, which reduces the time needed to test whether a sequence is contained in a customer sequence.


Problem Description

No customer has more than one transaction with the same transaction-time.

Quantities of items bought are not considered: each item is a binary variable representing whether an item was bought or not.

Problem Description (Terminology and definitions)

Itemset: a non-empty set of items. Each itemset is mapped to an integer.

Sequence: an ordered list of itemsets.

Customer Sequence: the list of a customer's transactions, ordered by increasing transaction time.

A customer supports a sequence if the sequence is contained in the customer-sequence.

Support for a Sequence: the fraction of total customers that support the sequence.
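The containment and support definitions above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `is_contained` and `support`, the use of `frozenset` for itemsets, and the five-customer toy database are all assumptions made for this example.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[int]


def is_contained(seq: Sequence[Itemset], customer_seq: Sequence[Itemset]) -> bool:
    """True if every itemset of `seq` is a subset of a distinct transaction
    of `customer_seq`, with the transactions appearing in the same order."""
    pos = 0
    for itemset in seq:
        # Greedily take the earliest remaining transaction that covers the itemset.
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def support(seq: Sequence[Itemset], customer_seqs: List[Sequence[Itemset]]) -> float:
    """Fraction of customers whose customer sequence contains `seq`."""
    supporters = sum(1 for c in customer_seqs if is_contained(seq, c))
    return supporters / len(customer_seqs)


def fs(*items: int) -> Itemset:
    return frozenset(items)


# Hypothetical toy database: one customer sequence per customer.
TOY_DB = [
    [fs(30), fs(90)],
    [fs(10, 20), fs(30), fs(40, 60, 70)],
    [fs(30, 50, 70)],
    [fs(30), fs(40, 70), fs(90)],
    [fs(90)],
]

print(support([fs(30), fs(90)], TOY_DB))      # 0.4 (customers 1 and 4)
print(support([fs(10, 20), fs(30)], TOY_DB))  # 0.2 (customer 2 only)
```

Each customer counts at most once, however many ways the sequence can be matched inside that customer's transactions.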

Problem Description (Terminology and definitions) - Cont

Maximal Sequence: in a set of sequences, a sequence that is not contained in any other sequence of the set.

Large Sequence: a sequence that meets minisup (the minimum support).

Length of a sequence: the number of itemsets in the sequence. A sequence of length k is called a k-sequence.

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.

An itemset with minimum support is called a large itemset, or Litemset.

Problem Description (Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.

Problem Description - Cont

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.

Each such sequence represents a sequential pattern.

Problem Description: Example

Note: with a minisup of 25%, no fewer than two customers must support a sequence.

< (10 20) (30) > does not have enough support (it is supported only by Customer 2).

< (30) >, < (70) >, < (30) (40) >, ... are not maximal.

[Table: sequences with minimum support]


Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases:

1. Sort Phase

2. Large itemset (Litemset) Phase

3. Transformation Phase

4. Sequence Phase

5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction DB into a DB of customer sequences.

Recall: a Customer Sequence is the list of a customer's transactions, ordered by increasing transaction time.
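As a rough sketch of the sort phase, the snippet below sorts transaction records and groups them into customer sequences. The record layout `(customer_id, transaction_time, items)`, the integer timestamps, and the function name `to_customer_sequences` are assumptions for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Sort by customer-id (major key) and transaction-time (minor key),
    then group each customer's transactions into one ordered list."""
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [frozenset(items) for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical unsorted transaction records.
records = [
    (2, 3, {40, 60, 70}),
    (1, 1, {30}),
    (2, 1, {10, 20}),
    (1, 2, {90}),
    (2, 2, {30}),
]

sequences = to_customer_sequences(records)
print(sequences[2])  # customer 2: (10 20), then (30), then (40 60 70)
```

Because no customer has two transactions with the same transaction-time, the sort key never ties within a customer.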

Finding Sequential Patterns: 2. Litemset Phase

In this phase we find the set L of all large itemsets (Litemsets).

We are also simultaneously finding the set of large 1-sequences, since this set is just

{ <l> | l ∈ L }

Finding Sequential Patterns: 2. Litemset Phase - Cont

In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.

In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
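The difference between the two support definitions can be made concrete with a short sketch. The function names and the toy database below are assumptions for illustration; the point is only that the customer-based count adds at most one per customer.

```python
def customer_support(itemset, customer_seqs):
    """Sequential-pattern support: fraction of customers who bought the
    itemset within at least one single transaction."""
    itemset = frozenset(itemset)
    supporters = sum(
        1 for seq in customer_seqs if any(itemset <= t for t in seq)
    )
    return supporters / len(customer_seqs)


def transaction_support(itemset, customer_seqs):
    """Association-rule (Apriori) support: fraction of all transactions
    that contain the itemset."""
    itemset = frozenset(itemset)
    transactions = [t for seq in customer_seqs for t in seq]
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Hypothetical toy database: one list of transactions per customer.
TOY_DB = [
    [frozenset({30}), frozenset({90})],
    [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})],
    [frozenset({30, 50, 70})],
    [frozenset({30}), frozenset({40, 70}), frozenset({90})],
    [frozenset({90})],
]

print(customer_support({30}, TOY_DB))     # 0.8 (4 of 5 customers)
print(transaction_support({30}, TOY_DB))  # 0.4 (4 of 10 transactions)
print(customer_support({40, 70}, TOY_DB)) # 0.4 (customers 2 and 4)
```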

Finding Sequential Patterns: 2. Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers.

By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.

Finding Sequential Patterns: 2. Litemset Phase - Cont

• Example with a minimum support of 40%

Finding Sequential Patterns: 3. Transformation Phase

• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

Finding Sequential Patterns: 3. Transformation Phase - Cont

• Each transaction is replaced by the set of all Litemsets contained in the transaction.

• Transactions with no Litemsets are dropped. (Customer sequences that become empty still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of Litemsets.

Finding Sequential Patterns: 3. Transformation Phase - Cont

Note: (10 20) is dropped because of lack of support.

(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
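The transformation can be sketched as below. The function name `transform` and the particular integer ids assigned to the Litemsets are assumptions for illustration; the slides only state that Litemsets are mapped to contiguous integers.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction by the set of ids of the Litemsets it
    contains; drop transactions that contain no Litemset."""
    transformed = []
    for transaction in customer_seq:
        ids = {
            lid for litemset, lid in litemset_ids.items()
            if litemset <= transaction
        }
        if ids:
            transformed.append(ids)
    return transformed


# Hypothetical mapping of Litemsets to contiguous integers.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

customer = [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})]
print(transform(customer, LITEMSET_IDS))  # [{1}, {2, 3, 4}]
```

In this run (10 20) disappears because it contains no Litemset, and (40 60 70) becomes the ids of (40), (70) and (40 70), matching the note above. A customer whose sequence becomes empty would still have to be counted in the denominator when computing support.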

Finding Sequential Patterns: 4. Sequence Phase Overview

1. Start from a seed set of large sequences

2. Create candidate sequences

3. Scan the data to find the support of the candidate sequences

4. Determine the large sequences

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of Litemsets to find the desired sequences.

• Two families of algorithms are presented: Count-all and Count-some.

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase: AprioriAll

    L1 = {large 1-sequences};  // result of Litemset phase
    for (k = 2; Lk-1 ≠ ∅; k++) do
    begin
        Ck = New candidates generated from Lk-1
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
    end
    Answer = Maximal Sequences in ∪k Lk

Notation: Lk is the set of all large k-sequences; Ck is the set of candidate k-sequences.
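The loop above can be sketched in Python over a transformed database, where each customer sequence is a list of sets of Litemset ids and each candidate is a tuple of ids. This is an illustrative sketch only: the helper names, the absolute `min_count` threshold, and the toy database are assumptions, and the counting is a plain nested loop rather than the hash-tree approach.

```python
def contains(customer_seq, candidate):
    """True if the ids of `candidate` occur, in order, in distinct
    elements of the transformed customer sequence."""
    pos = 0
    for lid in candidate:
        while pos < len(customer_seq) and lid not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def generate(prev):
    """Join on the first k-2 ids, then prune candidates that have a
    (k-1)-subsequence outside `prev`."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {
        c for c in joined
        if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))
    }


def large_sequences(db, large_1, min_count):
    """Return every large sequence (maximal or not) found level by level."""
    levels = {1: set(large_1)}
    k = 2
    while levels[k - 1]:
        counts = {cand: 0 for cand in generate(levels[k - 1])}
        for customer_seq in db:
            for cand in counts:
                if contains(customer_seq, cand):
                    counts[cand] += 1
        levels[k] = {c for c, n in counts.items() if n >= min_count}
        k += 1
    return set().union(*levels.values())


# Hypothetical transformed database of five customers.
DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
result = large_sequences(DB, [(1,), (2,), (3,), (4,), (5,)], min_count=2)
print((1, 2, 3, 4) in result, (1, 3, 5) in result, (4, 5) in result)
```

The result still contains non-maximal sequences such as (1, 2); removing them is the job of the maximal phase.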

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:

    insert into Ck
    select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
    from Lk-1 p, Lk-1 q
    where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;

• Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation - Cont

Example:

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
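The join and prune steps can be sketched as follows, with sequences represented as tuples of Litemset ids. The function name `apriori_generate` mirrors the slide, but the Python code and the sample L3 are assumptions made for illustration (the original example figure is not in the transcript).

```python
def apriori_generate(large_prev):
    """Build candidate k-sequences from the large (k-1)-sequences."""
    large_prev = set(large_prev)

    # Join step: p and q agree on their first k-2 ids; append q's last id to p.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune step: every (k-1)-subsequence, obtained by dropping one id,
    # must itself be large.
    def all_subsequences_large(candidate):
        return all(
            candidate[:i] + candidate[i + 1:] in large_prev
            for i in range(len(candidate))
        )

    return {c for c in joined if all_subsequences_large(c)}


# Hypothetical set of large 3-sequences.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_generate(L3))  # {(1, 2, 3, 4)}
```

With this L3 the join also produces (1, 2, 4, 3), which is pruned because (2, 4, 3) is not in L3, the same situation as in the example above.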

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence:

    for (k = n; k > 1; k--) do
        foreach k-sequence sk do
            Delete from S all subsequences of sk

• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
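A direct, quadratic sketch of the maximal phase is shown below. It applies the definition of a maximal sequence rather than the hash-tree technique the authors refer to, and the function names and the sample set of large sequences are assumptions for illustration.

```python
def is_contained(seq, other):
    """True if each itemset of `seq` is a subset of a distinct itemset of
    `other`, in the same order."""
    pos = 0
    for itemset in seq:
        while pos < len(other) and not itemset <= other[pos]:
            pos += 1
        if pos == len(other):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Keep only the sequences not contained in any other large sequence."""
    large = list(large)
    return [
        s for s in large
        if not any(s != t and is_contained(s, t) for t in large)
    ]


def seq(*itemsets):
    """Helper: build a sequence as a tuple of frozensets."""
    return tuple(frozenset(i) for i in itemsets)


# Hypothetical set S of large sequences.
S = [
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {40, 70}), seq({30}, {90}),
]
for s in maximal_sequences(S):
    print([sorted(itemset) for itemset in s])
```

Only `<(30) (40 70)>` and `<(30) (90)>` survive here, since every other sequence in S is contained in one of them.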

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip.

Let hitk = |Lk| / |Ck| (i.e., the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)  // k is the length of the sequences counted in the last pass
    begin
        if (hitk < 0.666) return k + 1;
        elseif (hitk < 0.75) return k + 2;
        elseif (hitk < 0.80) return k + 3;
        elseif (hitk < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
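The skipping heuristic is small enough to transcribe directly. In this sketch the function is named `next_length` (to avoid shadowing Python's built-in `next`) and takes the hit ratio as an explicit argument; both choices are assumptions for illustration.

```python
def next_length(k: int, hit_k: float) -> int:
    """Length of the sequences to count in the next pass, given the
    length k counted last and hit_k = |Lk| / |Ck|."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


print(next_length(2, 0.50))  # 3: few candidates were large, so skip nothing
print(next_length(2, 0.70))  # 4: skip length 3
print(next_length(3, 0.90))  # 8: skip lengths 4 to 7
```

The higher the fraction of candidates that turned out to be large, the more lengths are skipped, on the expectation that longer sequences will also be large and will make many shorter ones non-maximal.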

Finding Sequential Patterns: 4. AprioriSome Forward Phase

    L1 = {large 1-sequences};  // Result of Litemset phase
    C1 = L1;
    last = 1;  // We last counted Clast
    for (k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++) do
    begin
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1
        else
            Ck = New candidates generated from Ck-1
        if (k == next(last)) then begin  // next k to count
            foreach customer-sequence c in the database do
                Increment the count of all candidates in Ck that are contained in c
            Lk = Candidates in Ck with minimum support
            last = k
        end
    end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

    for (k--; k >= 1; k--) do
        if (Lk not found in forward phase) then begin
            Delete all sequences in Ck contained in some Li, i > k
            foreach customer-sequence c in DT do
                Increment the count of all candidates in Ck that are contained in c
            Lk = Candidates in Ck with minimum support
        end
        else  // Lk already known
            Delete all sequences in Lk contained in some Li, i > k
    Answer = ∪k Lk  // Maximal Phase not needed

Notation: DT is the transformed database.

Finding Sequential Patterns: 4. AprioriSome Example

Minimum Support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates Ck from Lk-1 or Ck-1.

• DynamicSome generates Ck "on the fly", based on the large sequences found in the previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.

• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

    o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

    o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

    // step is an integer ≥ 1
    L1 = {large 1-sequences};  // Result of Litemset phase
    for (k = 2; k <= step and Lk-1 ≠ ∅; k++) do
    begin
        Ck = New candidates generated from Lk-1
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
    end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

    for (k = step; Lk ≠ ∅; k += step) do
    begin
        // find Lk+step from Lk and Lstep
        Ck+step = ∅
        foreach customer-sequence c in DT do
        begin
            X = otf-generate(Lk, Lstep, c)
            foreach sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary)
        end
        Lk+step = Candidates in Ck+step with minimum support
    end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

    // c is the customer sequence <c1 c2 ... cn>
    Xk = subseq(Lk, c);
    forall sequences x ∈ Xk do
        x.end = min{ j | x ⊆ <c1 c2 ... cj> };
    Xj = subseq(Lj, c);
    forall sequences x ∈ Xj do
        x.start = max{ j | x ⊆ <cj cj+1 ... cn> };
    Answer = join of Xk with Xj with the join condition Xk.end < Xj.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont

Let otf-generate be called with L2 (as both Lk and Lj) and let c = < (1) (2) (3 7) (4) >. Thus c1 = (1), c2 = (2), etc. The end and start values for each sequence of L2 contained in c are:

    Seq      End  Start
    <1 2>    2    1
    <1 3>    3    1
    <1 4>    4    1
    <2 3>    3    2
    <2 4>    4    2
    <3 4>    4    3

Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
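The on-the-fly generation can be sketched as below, with the customer sequence given as a list of sets of Litemset ids and sequences as tuples of ids. The helper names `earliest_end` and `latest_start`, and the assumption that L2 consists of exactly the six sequences in the table, are illustrative choices and not taken from the paper.

```python
def earliest_end(x, c):
    """Smallest 1-based j such that x is contained in <c1 ... cj>, or None."""
    pos = 0
    for lid in x:
        while pos < len(c) and lid not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def latest_start(x, c):
    """Largest 1-based j such that x is contained in <cj ... cn>, or None."""
    pos = len(c) - 1
    for lid in reversed(x):
        while pos >= 0 and lid not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2


def otf_generate(L_k, L_j, c):
    """Concatenate x in L_k with y in L_j whenever both are contained in c
    and x ends strictly before y starts (so they do not overlap)."""
    ends = {}
    for x in L_k:
        e = earliest_end(x, c)
        if e is not None:
            ends[x] = e
    starts = {}
    for y in L_j:
        s = latest_start(y, c)
        if s is not None:
            starts[y] = s
    return {x + y for x, e in ends.items() for y, s in starts.items() if e < s}


L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
c = [{1}, {2}, {3, 7}, {4}]
print(otf_generate(L2, L2, c))  # {(1, 2, 3, 4)}
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the strict inequality, which reproduces the single candidate from the table.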

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

    for (k--; k > 1; k--) do
        if (Lk not yet determined) then
            if (Lk-1 known) then
                Ck = New candidates generated from Lk-1
            else
                Ck = New candidates generated from Ck-1

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L2 and L2 as arguments to otf-generate to get C4.

• This gives 2 candidate sequences:

    C4           Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L4           Support
    <1 2 3 4>    2

• Pass L2 and L4 as arguments to otf-generate to get C6.

• C6 is found to be empty: C6 = ∅.

• In the intermediate phase, C3 is generated from L2 and C5 from L4, using apriori-generate.

• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3.


Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer DB was introduced.

• Two types of algorithms were introduced to find sequential patterns:

    o CountAll: AprioriAll

    o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. The two are related in the Apriori-style algorithms studied here because a sequential pattern is an ordered list of large itemsets: the Litemset phase finds the same kind of frequent itemsets that association rule mining relies on (with support counted per customer rather than per transaction), and the sequence phase then looks for large sequences built from them.

Final Exam Question 2

What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for the candidates of every length, one pass per length, but non-maximal sequences are only removed afterwards, in the maximal phase.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Some lengths k are skipped in the forward phase without being counted; in the backward phase, candidates contained in a longer large sequence are deleted before the skipped lengths are counted, so non-maximal sequences are pruned during the run.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of Litemsets it contains, each Litemset being mapped to an integer. Two Litemsets can then be compared for equality in constant time, which speeds up the repeated test of whether a candidate sequence is contained in a customer sequence and so reduces the run time.


Problem Description(Terminology and definitions)

Itemset non-empty set of items Each itemset is mapped to an integer

Sequence Ordered list of itemsets

Customer Sequence List of customer transactions ordered by increasing transaction time

A customer supports a sequence if the sequence is contained in the customer-sequence

Support for a Sequence Fraction of total customers that support a sequence

11

Problem Description(Terminology and definitions) - Cont

Maximal Sequence A sequence that is not contained in any other sequence

Large Sequence Sequence that meets minisup

Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction

an itemset with minimum support is called a large itemset or Litemset

12

Problem Description(Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets

13

Problem Description - Cont

Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support

Each such sequence represents a sequential pattern

14

Problem DescriptionExample

Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal

Seq with minimum support

15

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

16

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases

1 Sort Phase

2 Large itemset (Litemset) Phase

3 Transformation Phase

4 Sequence Phase

5 Maximal Phase

17

Finding Sequential Patterns1 Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor-key

This step implicitly converts the original transaction DB into a DB of customer sequences

Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time

18

Finding Sequential Patterns2 Litemset Phase

In this phase we find the set of all Large itemsets (Litemsets) L

We are also simultaneously finding the set of large 1-sequences since this set is just

lt l gt | l isin L

19

Finding Sequential Patterns2 Litemset Phase - Cont

In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present

In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions

20

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns4 AprioriAll Candidate Generation

bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1

insert into Ck

select plitemset1 plitemsetk-1 qlitemsetk-1

from Lk-1 p Lk-1 q

where plitemset1 = qlitemset1

plitemsetk-2 = qlitemsetk-2

bull Next delete all sequences c isin Ck such that some

(k-1)-subsequence of c is not in Lk-131

Finding Sequential Patterns4 AprioriAll Candidate Generation

Example

lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure

32

Finding Sequential Patterns4 AprioriAll Maximal Phase

bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence

for ( k = n k gt 1 k --)

foreach k-sequence sk do

Delete from S all subsequences of sk

bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers

33

Finding Sequential Patterns4 AprioriAll Example

34

Finding Sequential Patterns4 AprioriSome

bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip

Let hitk = |Lk| |Ck|

(ie ratio of large k-sequences to candidate k-sequences)

function next(k integer) k is the length of seq counted last pass

beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2

elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5

end

bull next returns the length of sequences to count in the next pass 35

Finding Sequential Patterns4 AprioriSome Forward Phase

L1 = large 1-sequences Result of Litemset phase

C1 = L1

last = 1 We last counted Clast

for (k = 2 Ck-1 ne and Llast ne k++) do

begin

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

if (k== next(last) ) then begin (next k to count)

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

last = k

end

end36

Finding Sequential Patterns4 AprioriSome Backward Phase

for (k-- kgt=1 k--) do

if (Lk not found in forward phase) then begin

Delete all sequences in Ck contained in some L i igtk

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

else Lk already known

Delete all sequences in Lk contained in some Li igtk

Answer = Uk Lk (Maximal Phase not Needed)

Notation DT Transformed database 37

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns: 4. DynamicSome Forward Phase

    for ( k = step; L_k ≠ ∅; k += step ) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in D_T do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary).
        end
        L_{k+step} = Candidates in C_{k+step} with minimum support.
    end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont

Let otf-generate be called with L_2, and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc.

    Seq      End   Start
    <1 2>    2     1
    <1 3>    3     1
    <1 4>    4     1
    <2 3>    3     2
    <2 4>    4     2
    <3 4>    4     3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
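The join in this example can be reproduced with a short Python sketch. It assumes a representation the slides do not prescribe: each large sequence is a tuple of litemset ids, and the customer sequence is a list of sets of ids. The names `earliest_end`, `latest_start` and `otf_generate` are illustrative, not taken from the paper.

```python
def earliest_end(seq, c):
    """1-based j of the shortest prefix <c_1..c_j> containing seq, else None."""
    pos = 0
    for j, element in enumerate(c, start=1):
        if seq[pos] in element:
            pos += 1
            if pos == len(seq):
                return j
    return None


def latest_start(seq, c):
    """1-based j of the shortest suffix <c_j..c_n> containing seq, else None."""
    pos = len(seq) - 1
    for j in range(len(c), 0, -1):
        if seq[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


def otf_generate(L_k, L_j, c):
    """Join sequences of L_k and L_j that occur in c without overlapping."""
    ends = {x: earliest_end(x, c) for x in L_k}
    starts = {y: latest_start(y, c) for y in L_j}
    return {
        x + y
        for x, end in ends.items() if end is not None
        for y, start in starts.items() if start is not None and end < start
    }


# The six 2-sequences from the table and the customer sequence of the example.
L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
c = [{1}, {2}, {3, 7}, {4}]
candidates = otf_generate(L2, L2, c)
```

Only <1 2> (end 2) finishes before <3 4> (start 3) begins, so `candidates` holds the single 4-sequence (1, 2, 3, 4), matching the slide.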

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

    for ( k--; k > 1; k-- ) do
        if ( L_k not yet determined ) then
            if ( L_{k-1} known ) then
                C_k = New candidates generated from L_{k-1};
            else
                C_k = New candidates generated from C_{k-1};

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

• This gives 2 candidate sequences:

    C_4          Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L_4          Support
    <1 2 3 4>    2

• Pass L_2 and L_4 as arguments to otf-generate to get C_6.

• C_6 is found to be empty: C_6 = ∅.

• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.

• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up

Conclusion

• The problem of mining sequential patterns from a customer DB was introduced.

• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.

Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori-based algorithms studied here, because the algorithms look for sequential patterns made up of large itemsets, the same intra-transaction patterns that association rules are built from.

Final Exam Question 2

What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every candidate sequence on every pass, but maximality must be checked afterwards.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned during the run, but minimum support is not tested at every value of k.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of Litemsets it contains, each mapped to an integer. Two Litemsets can then be compared for equality in constant time, which speeds up the repeated test of whether a candidate sequence is contained in a customer sequence and so reduces the run time.

Problem Description (Terminology and definitions) - Cont

Maximal sequence: a sequence that is not contained in any other sequence.

Large sequence: a sequence that meets minisup.

Length of a sequence: the number of itemsets in the sequence. A sequence of length k is called a k-sequence.

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.

An itemset with minimum support is called a large itemset, or Litemset.

Problem Description (Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.

Problem Description - Cont

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.

Each such sequence represents a sequential pattern.

Problem Description: Example

Note: with a minisup of 25%, no fewer than two customers must support the sequence.

<(10 20) (30)> does not have enough support (it is supported only by Customer 2).

<(30)>, <(70)>, <(30) (40)>, ... are not maximal.

Sequences with minimum support

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases:

1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction DB into a DB of customer sequences.

Recall: a customer sequence is a list of customer transactions ordered by increasing transaction time.
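As a rough illustration of the sort phase, the Python sketch below sorts a handful of transaction records and groups them into customer sequences. The record layout `(customer_id, transaction_time, items)`, the sample data and the function name `to_customer_sequences` are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (customer_id, transaction_time, items bought).
transactions = [
    (2, 3, {40, 60, 70}),
    (1, 2, {90}),
    (2, 1, {10, 20}),
    (1, 1, {30}),
    (2, 2, {30}),
]


def to_customer_sequences(records):
    """Sort by (customer id, time), then group each customer's itemsets in order."""
    ordered = sorted(records, key=itemgetter(0, 1))
    return {
        customer: [items for _, _, items in group]
        for customer, group in groupby(ordered, key=itemgetter(0))
    }


customer_sequences = to_customer_sequences(transactions)
```

Here customer 2 ends up with the sequence (10 20), (30), (40 60 70), in increasing transaction time.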

Finding Sequential Patterns: 2. Litemset Phase

In this phase we find the set of all large itemsets (Litemsets), L.

We are also simultaneously finding the set of large 1-sequences, since this set is just { <l> | l ∈ L }.

Finding Sequential Patterns: 2. Litemset Phase - Cont

In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.

In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
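The customer-based definition of support can be sketched in Python as follows. The five customer sequences and the function name `itemset_support` are illustrative assumptions; the point is that each customer counts at most once, however many of their transactions contain the itemset.

```python
def itemset_support(itemset, customer_sequences):
    """Fraction of customers with at least one transaction containing itemset."""
    itemset = frozenset(itemset)
    supporters = sum(
        1
        for sequence in customer_sequences
        if any(itemset <= transaction for transaction in sequence)
    )
    return supporters / len(customer_sequences)


# Hypothetical database of five customer sequences (lists of transactions).
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

support_30 = itemset_support({30}, db)        # customers 1, 2, 3, 4
support_40_70 = itemset_support({40, 70}, db)  # customers 2, 4
support_10_20 = itemset_support({10, 20}, db)  # customer 2 only
```

With a minimum support of 40%, (30) and (40 70) would be Litemsets in this toy database, while (10 20) would not.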

Finding Sequential Patterns: 2. Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers.

By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.

Finding Sequential Patterns: 2. Litemset Phase - Cont

• Example with a minimum support of 40%

Finding Sequential Patterns: 3. Transformation Phase

• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

Finding Sequential Patterns: 3. Transformation Phase - Cont

• Each transaction is replaced by the set of all Litemsets contained in the transaction.

• Transactions with no Litemsets are dropped. (But empty customer sequences still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of Litemsets.

Finding Sequential Patterns: 3. Transformation Phase - Cont

Note: (10 20) is dropped because of lack of support.

(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
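A minimal Python sketch of the transformation is given below. The Litemset-to-integer table `litemset_ids` is an assumed mapping chosen to be consistent with the note above (the slide's own table did not survive extraction), and `transform` is an illustrative name.

```python
# Assumed mapping of Litemsets to contiguous integers.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(customer_sequence, litemset_ids):
    """Replace each transaction by the ids of the Litemsets it contains."""
    transformed = []
    for transaction in customer_sequence:
        ids = {
            lid
            for litemset, lid in litemset_ids.items()
            if litemset <= transaction
        }
        if ids:  # transactions containing no Litemset are dropped
            transformed.append(ids)
    return transformed


customer = [{10, 20}, {30}, {40, 60, 70}]
transformed_customer = transform(customer, litemset_ids)
```

For this customer, (10 20) disappears and (40 60 70) becomes {2, 3, 4}, so the transformed sequence is [{1}, {2, 3, 4}]. A customer whose transactions are all dropped yields an empty list but would still be counted in the total number of customers.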

Finding Sequential Patterns: 4. Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of Litemsets to find the desired sequences.

• Two families of algorithms are presented:
  o Count-all
  o Count-some

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase, AprioriAll

    L_1 = {large 1-sequences};  // result of Litemset phase
    for ( k = 2; L_{k-1} ≠ ∅; k++ ) do
    begin
        C_k = New candidates generated from L_{k-1};
        foreach customer-sequence c in the database do
            Increment the count of all candidates in C_k that are contained in c.
        L_k = Candidates in C_k with minimum support.
    end
    Answer = Maximal sequences in ∪_k L_k;

Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.
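The counting step inside the loop can be sketched in Python as below. This is a naive version under stated assumptions: candidates are tuples of litemset ids, the transformed database is a list of customer sequences (each a list of sets of ids), and every candidate is tested against every customer, whereas the paper relies on a more efficient data structure. The sample database and the names `is_contained` and `large_sequences` are illustrative.

```python
def is_contained(seq, c):
    """True if the ids of seq occur, in order, in distinct elements of c."""
    pos = 0
    for element in c:
        if pos < len(seq) and seq[pos] in element:
            pos += 1
    return pos == len(seq)


def large_sequences(candidates, transformed_db, min_count):
    """One pass over the data: keep candidates supported by >= min_count customers."""
    counts = {candidate: 0 for candidate in candidates}
    for c in transformed_db:
        for candidate in candidates:
            if is_contained(candidate, c):
                counts[candidate] += 1
    return {candidate for candidate, n in counts.items() if n >= min_count}


# Hypothetical transformed database of five customers.
D_T = [
    [{1}, {5}],
    [{1}, {2, 3, 4}],
    [{1, 3}],
    [{1}, {2, 3, 4}, {5}],
    [{5}],
]
C2 = {(1, 2), (1, 3), (1, 5), (5, 1)}
L2 = large_sequences(C2, D_T, min_count=2)
```

Note that the third customer does not support (1, 3): both ids sit in the same transaction, so they do not form a sequence of two elements.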

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
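The join and prune steps can be sketched in Python as follows, with sequences represented as tuples of litemset ids. The function name `apriori_generate` mirrors the slide, but the code and the sample L_3 are an illustrative reconstruction, chosen so that <1 2 4 3> appears after the join and is then pruned. As in the SQL above, there is no p ≠ q condition, so a sequence may be joined with itself.

```python
def apriori_generate(large_prev):
    """Build candidate k-sequences from the set of large (k-1)-sequences."""
    large_prev = set(large_prev)

    # Join: pair sequences that agree on their first k-2 litemsets and
    # extend p with the last litemset of q.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune: every (k-1)-subsequence (drop one position) must itself be large.
    def all_subsequences_large(candidate):
        return all(
            candidate[:i] + candidate[i + 1:] in large_prev
            for i in range(len(candidate))
        )

    return {candidate for candidate in joined if all_subsequences_large(candidate)}


# Assumed L_3 for illustration.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_generate(L3)
```

The join produces, among others, (1, 2, 3, 4) and (1, 2, 4, 3); the latter is dropped because (2, 4, 3) is not in L_3, and only (1, 2, 3, 4) survives the prune.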

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.

    for ( k = n; k > 1; k-- ) do
        foreach k-sequence s_k do
            Delete from S all subsequences of s_k

• The authors claim that data structures and an algorithm exist to do this efficiently (hash trees), citing two of their earlier papers.
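A direct, quadratic version of this loop can be sketched in Python. It does not use the hash-tree structure mentioned above; it simply tests containment pairwise. Sequences are assumed to be tuples of frozensets (itemsets), and the sample set S and the names `is_contained` and `maximal_sequences` are illustrative.

```python
def is_contained(a, b):
    """True if each itemset of a is a subset of a distinct itemset of b, in order."""
    pos = 0
    for element in b:
        if pos < len(a) and a[pos] <= element:
            pos += 1
    return pos == len(a)


def maximal_sequences(S):
    """Walk from the longest sequences down, deleting whatever they contain."""
    remaining = set(S)
    longest = max((len(s) for s in remaining), default=0)
    for k in range(longest, 1, -1):
        for s_k in [s for s in remaining if len(s) == k]:
            remaining -= {
                t for t in remaining if t != s_k and is_contained(t, s_k)
            }
    return remaining


def seq(*itemsets):
    """Helper: build a sequence from iterables of items."""
    return tuple(frozenset(i) for i in itemsets)


# Assumed set of large sequences for illustration.
S = {
    seq([30]), seq([40]), seq([70]), seq([40, 70]), seq([90]),
    seq([30], [40]), seq([30], [70]), seq([30], [40, 70]), seq([30], [90]),
}
maximal = maximal_sequences(S)
```

On this sample, only <(30) (40 70)> and <(30) (90)> remain, since every other sequence is contained in one of the two.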

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip.

Let hit_k = |L_k| / |C_k| (i.e. the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)  // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elseif (hit_k < 0.75) return k + 2;
        elseif (hit_k < 0.80) return k + 3;
        elseif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
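The skipping heuristic translates almost line for line into Python. In this sketch the hit ratio is passed in explicitly rather than looked up from the last pass, and the name `next_length` is chosen only to avoid shadowing Python's built-in `next`.

```python
def next_length(k, hit_k):
    """Length of sequences to count next, given hit_k = |L_k| / |C_k|."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    return k + 5


# Few candidates turned out large: count the very next length.
after_low_hit = next_length(2, 0.5)
# Most candidates turned out large: skip ahead four lengths.
after_high_hit = next_length(2, 0.9)
```

The higher the fraction of candidates that turn out to be large, the more lengths are skipped, on the expectation that longer sequences will also be large and will make the shorter ones non-maximal.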

Finding Sequential Patterns: 4. AprioriSome Forward Phase

    L_1 = {large 1-sequences};  // result of Litemset phase
    C_1 = L_1;
    last = 1;  // we last counted C_last
    for ( k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++ ) do
    begin
        if ( L_{k-1} known ) then
            C_k = New candidates generated from L_{k-1};
        else
            C_k = New candidates generated from C_{k-1};
        if ( k == next(last) ) then begin  // next k to count
            foreach customer-sequence c in the database do
                Increment the count of all candidates in C_k that are contained in c.
            L_k = Candidates in C_k with minimum support.
            last = k;
        end
    end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

    for ( k--; k >= 1; k-- ) do
        if ( L_k not found in forward phase ) then begin
            Delete all sequences in C_k contained in some L_i, i > k;
            foreach customer-sequence c in D_T do
                Increment the count of all candidates in C_k that are contained in c.
            L_k = Candidates in C_k with minimum support.
        end
        else  // L_k already known
            Delete all sequences in L_k contained in some L_i, i > k.
    Answer = ∪_k L_k;  // maximal phase not needed

Notation: D_T is the transformed database.

Finding Sequential Patterns: 4. AprioriSome Example

Problem Description(Terminology and definitions) - Cont

Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets

13

Problem Description - Cont

Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support

Each such sequence represents a sequential pattern

14

Problem DescriptionExample

Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal

Seq with minimum support

15

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

16

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases

1 Sort Phase

2 Large itemset (Litemset) Phase

3 Transformation Phase

4 Sequence Phase

5 Maximal Phase

17

Finding Sequential Patterns1 Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor-key

This step implicitly converts the original transaction DB into a DB of customer sequences

Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time

18

Finding Sequential Patterns2 Litemset Phase

In this phase we find the set of all Large itemsets (Litemsets) L

We are also simultaneously finding the set of large 1-sequences since this set is just

lt l gt | l isin L

19

Finding Sequential Patterns2 Litemset Phase - Cont

In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present

In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions

20

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Problem Description (cont.)

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.

Each such maximal sequence represents a sequential pattern.

Problem Description: Example

Note: With a minisup of 25%, no fewer than two customers must support a sequence.

< (10 20) (30) > does not have enough support (it is supported only by Customer 2).

< (30) >, < (70) >, < (30) (40) >, … are not maximal.

[Table: sequences with minimum support]


Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases:

1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction DB into a DB of customer sequences.

Recall: a customer sequence is the list of a customer's transactions ordered by increasing transaction time.
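The sort phase can be sketched in a few lines of Python. This is only an illustration: the (customer_id, transaction_time, items) record layout and the name build_customer_sequences are assumptions, not something the slides or the paper prescribe.

```python
from itertools import groupby
from operator import itemgetter

def build_customer_sequences(records):
    """Sort by (customer_id, time) and group into one sequence per customer."""
    # Major key: customer id (index 0); minor key: transaction time (index 1).
    ordered = sorted(records, key=itemgetter(0, 1))
    sequences = {}
    for customer_id, group in groupby(ordered, key=itemgetter(0)):
        # Each transaction becomes a frozenset of items; order follows time.
        sequences[customer_id] = [frozenset(items) for _, _, items in group]
    return sequences

# Hypothetical records, deliberately out of order.
records = [
    (2, 20, [90]),
    (1, 30, [90]),
    (2, 10, [10, 20]),
    (1, 25, [30]),
    (2, 15, [30]),
]
customer_sequences = build_customer_sequences(records)
```

After the call, each customer id maps to its transactions in time order, which is the "DB of customer sequences" the later phases work on.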

Finding Sequential Patterns: 2. Litemset Phase

In this phase we find the set L of all large itemsets (Litemsets).

We are also simultaneously finding the set of large 1-sequences, since this set is just

{ <l> | l ∈ L }

Finding Sequential Patterns: 2. Litemset Phase (cont.)

In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.

In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
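The customer-level definition of support can be illustrated with a short Python sketch. The function name and the toy data are assumptions made for the example; the point is only that a customer is counted once, however many of their transactions contain the itemset.

```python
def customer_support(itemset, customer_sequences):
    """Fraction of customers with at least one transaction containing itemset."""
    itemset = frozenset(itemset)
    supporting = sum(
        1 for transactions in customer_sequences
        if any(itemset <= set(t) for t in transactions)
    )
    return supporting / len(customer_sequences)

# Hypothetical data: three customers, each a time-ordered list of transactions.
customers = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30}, {30, 50, 70}],
]
```

Here the third customer buys item 30 twice but still contributes a single vote to the support of (30).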

Finding Sequential Patterns: 2. Litemset Phase (cont.)

The set of Litemsets is mapped to a set of contiguous integers.

By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, which reduces the time required to check whether a sequence is contained in a customer sequence.

Finding Sequential Patterns: 2. Litemset Phase (cont.)

• Example with a minimum support of 40%

Finding Sequential Patterns: 3. Transformation Phase

• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

Finding Sequential Patterns: 3. Transformation Phase (cont.)

• Each transaction is replaced by the set of all Litemsets contained in the transaction.

• Transactions with no Litemsets are dropped. (Customer sequences that become empty still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of Litemsets.

Finding Sequential Patterns: 3. Transformation Phase (cont.)

Note: (10 20) is dropped because it lacks support.

(40 60 70) is replaced with the set of litemsets {(40), (70), (40 70)}; 60 does not have minisup.
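A minimal Python sketch of the transformation, assuming the litemsets have already been found and numbered. The particular litemsets and integer ids below are hypothetical, chosen only to mirror the note above; transform is an illustrative name.

```python
def transform(customer_sequence, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains.

    litemset_ids maps frozenset(litemset) -> integer id (assumed precomputed).
    Transactions that contain no litemset are dropped.
    """
    transformed = []
    for transaction in customer_sequence:
        items = set(transaction)
        ids = {i for litemset, i in litemset_ids.items() if litemset <= items}
        if ids:
            transformed.append(ids)
    return transformed

# Hypothetical litemsets and ids, in the spirit of the slide's example.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
sequence = [{10, 20}, {30}, {40, 60, 70}]
result = transform(sequence, litemset_ids)
```

With these inputs, (10 20) disappears, (30) becomes {1}, and (40 60 70) becomes {2, 3, 4}, so later containment tests only compare integers.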

Finding Sequential Patterns: 4. Sequence Phase Overview

1. Start from a seed set of large sequences.
2. Create candidate sequences.
3. Scan the data to find the support of the candidate sequences.
4. Determine the large sequences.

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of Litemsets to find the desired sequences.

• Two families of algorithms are presented:
  o Count-all
  o Count-some

Finding Sequential Patterns: 4. Sequence Phase (cont.)

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

Finding Sequential Patterns: 4. Sequence Phase (cont.)

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase (AprioriAll)

L1 = {large 1-sequences};   // result of Litemset phase
for (k = 2; Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1;
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c;
    Lk = Candidates in Ck with minimum support;
end
Answer = Maximal sequences in ∪k Lk

Notation: Lk = set of all large k-sequences; Ck = set of candidate k-sequences
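The inner counting step of this loop can be sketched in Python as follows. This is a naive illustration under stated assumptions: customer sequences are in the transformed form (lists of sets of litemset ids), candidates are tuples of ids, and every candidate is tested against every customer; the function names are made up for the example.

```python
def contains(customer_sequence, candidate):
    """True if candidate (tuple of litemset ids) occurs in order in the
    transformed customer sequence (list of sets of litemset ids)."""
    position = 0
    for litemset_id in candidate:
        # Greedily take the earliest remaining transaction holding this id.
        while (position < len(customer_sequence)
               and litemset_id not in customer_sequence[position]):
            position += 1
        if position == len(customer_sequence):
            return False
        position += 1  # the next id must come from a strictly later transaction
    return True

def large_sequences(candidates, database, min_count):
    """One counting pass: keep candidates supported by >= min_count customers."""
    counts = {c: 0 for c in candidates}
    for customer_sequence in database:
        for candidate in candidates:
            if contains(customer_sequence, candidate):
                counts[candidate] += 1  # each customer counts at most once
    return {c for c, n in counts.items() if n >= min_count}

# Hypothetical transformed database of three customers.
database = [
    [{1}, {2}, {3, 4}],
    [{1, 2}, {3}],
    [{2}, {1}, {4}],
]
large = large_sequences({(1, 3), (1, 2), (2, 4), (1, 4)}, database, 2)
```

Note that <1 2> is not contained in the second customer's sequence, because 1 and 2 occur in the same transaction rather than in successive ones.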

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:

    insert into Ck
    select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
    from Lk-1 p, Lk-1 q
    where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;

• Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation (cont.)

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
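The join and prune steps can be sketched in Python as below. Sequences are represented as tuples of litemset ids, and the L3 used in the demo is an illustrative input (an assumption, chosen so that <1 2 4 3> is pruned as in the example). The join follows the SQL on the slide literally, so a sequence may also be joined with itself; such candidates survive only if all their subsequences are large.

```python
from itertools import combinations

def apriori_generate(large_prev):
    """Candidate k-sequences from the set of large (k-1)-sequences."""
    large_prev = set(large_prev)
    # Join: p and q agree on their first k-2 litemsets; append q's last one.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    def all_subsequences_large(candidate):
        size = len(candidate) - 1
        # combinations() keeps the original order, so each result is a
        # (k-1)-subsequence of the candidate.
        return all(sub in large_prev for sub in combinations(candidate, size))

    # Prune: drop candidates with some (k-1)-subsequence that is not large.
    return {c for c in joined if all_subsequences_large(c)}

# Illustrative L3 (assumed input).
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_generate(L3)
```

For this input the join produces, among others, <1 2 3 4> and <1 2 4 3>; only <1 2 3 4> survives pruning.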

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence sk do
            Delete from S all subsequences of sk;

• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
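A direct, unoptimized Python rendering of this loop is sketched below. It makes two simplifying assumptions: sequences are tuples of litemset ids that are treated as atomic (a full implementation would also have to account for one litemset being a subset of another), and a plain quadratic scan replaces the hash-tree the authors refer to. The function names are illustrative.

```python
def is_subsequence(short, long):
    """True if the tuple `short` can be obtained from `long` by deleting elements."""
    remaining = iter(long)
    # `item in remaining` advances the iterator, which enforces the order.
    return all(item in remaining for item in short)

def maximal_sequences(large):
    """Drop every sequence that is a proper subsequence of a longer one."""
    result = set(large)
    longest = max((len(s) for s in result), default=0)
    for k in range(longest, 1, -1):
        for s_k in [s for s in result if len(s) == k]:
            result -= {
                s for s in result
                if len(s) < k and is_subsequence(s, s_k)
            }
    return result

# Hypothetical set S of large sequences.
S = {(1,), (3,), (4,), (5,), (1, 3), (1, 5), (3, 5), (4, 5), (1, 3, 5)}
maximal = maximal_sequences(S)
```

Processing the longest sequences first means each shorter sequence is removed as soon as some longer survivor covers it.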

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function “next” to determine which sequence lengths to skip.

Let hitk = |Lk| / |Ck| (i.e., the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)   // k is the length of the sequences counted in the last pass
    begin
        if (hitk < 0.666) return k + 1;
        elsif (hitk < 0.75) return k + 2;
        elsif (hitk < 0.80) return k + 3;
        elsif (hitk < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
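The same heuristic, written as a small Python function. The name next_length and the choice to pass the two set sizes as arguments are assumptions made for this sketch; the thresholds are the ones listed above.

```python
def next_length(k, num_large, num_candidates):
    """Length to count next, given the hit ratio of the last counted pass.

    k is the length counted in the last pass; num_large = |Lk| and
    num_candidates = |Ck| for that pass.
    """
    hit = num_large / num_candidates
    if hit < 0.666:
        return k + 1
    elif hit < 0.75:
        return k + 2
    elif hit < 0.80:
        return k + 3
    elif hit < 0.85:
        return k + 4
    return k + 5
```

The higher the fraction of candidates that turned out to be large, the more lengths are skipped before the next counting pass.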

Finding Sequential Patterns: 4. AprioriSome Forward Phase

L1 = {large 1-sequences};   // result of Litemset phase
C1 = L1;
last = 1;                   // we last counted Clast
for (k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++) do
begin
    if (Lk-1 known) then
        Ck = New candidates generated from Lk-1;
    else
        Ck = New candidates generated from Ck-1;
    if (k == next(last)) then begin   // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c;
        Lk = Candidates in Ck with minimum support;
        last = k;
    end
end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

for (k--; k >= 1; k--) do
    if (Lk not found in forward phase) then begin
        Delete all sequences in Ck contained in some Li, i > k;
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c;
        Lk = Candidates in Ck with minimum support;
    end
    else   // Lk already known
        Delete all sequences in Lk contained in some Li, i > k;
Answer = ∪k Lk   (maximal phase not needed)

Notation: DT = transformed database

Finding Sequential Patterns: 4. AprioriSome Example

Minimum support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>
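The forward and backward phases of AprioriSome can be put together in a compact Python sketch. Everything here is illustrative rather than the authors' implementation: candidates are counted by brute force, litemset ids are treated as atomic when checking whether one sequence is contained in another, the skipping policy is passed in as a function, and the five-customer database is a made-up input chosen so that the result matches the answer quoted above. The demo uses the simple hypothetical policy next(k) = 2k in place of the hit-ratio heuristic.

```python
from itertools import combinations

def contains(customer, seq):
    """seq (tuple of ids) occurs in order in customer (list of sets of ids)."""
    pos = 0
    for item in seq:
        while pos < len(customer) and item not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True

def is_subsequence(short, long):
    remaining = iter(long)
    return all(item in remaining for item in short)

def generate(prev):
    """Join sequences sharing their first k-2 ids, then prune by subsequences."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(s in prev for s in combinations(c, len(c) - 1))}

def count_large(candidates, database, min_count):
    return {c for c in candidates
            if sum(contains(cust, c) for cust in database) >= min_count}

def apriori_some(database, large_1, min_count, next_length):
    C = {1: set(large_1)}          # candidates per length
    L = {1: set(large_1)}          # counted large sequences per length
    last, k = 1, 2
    # Forward phase: count only the lengths chosen by next_length.
    while C[k - 1] and L[last]:
        C[k] = generate(L[k - 1] if k - 1 in L else C[k - 1])
        if k == next_length(last):
            L[k] = count_large(C[k], database, min_count)
            last = k
        k += 1
    # Backward phase: longest first, so longer large sequences are known.
    for k in range(k - 1, 0, -1):
        longer = [s for i in L if i > k for s in L[i]]
        if k not in L:
            pending = {c for c in C[k]
                       if not any(is_subsequence(c, s) for s in longer)}
            L[k] = count_large(pending, database, min_count)
        else:
            L[k] = {s for s in L[k]
                    if not any(is_subsequence(s, t) for t in longer)}
    return set().union(*L.values())

# Made-up transformed database of five customers (minimum support: 2 customers).
database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large_1 = {(1,), (2,), (3,), (4,), (5,)}
answer = apriori_some(database, large_1, 2, lambda k: 2 * k)
```

In this run only C2 and C4 are counted in the forward phase; C3 is generated but skipped, and the backward phase counts just the three candidates of C3 that are not already covered by <1 2 3 4>.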

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, DynamicSome skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates Ck from Lk-1 or Ck-1.

• DynamicSome generates Ck “on the fly” based on the large sequences found in the previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome (cont.)

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.

• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c at hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome (cont.)

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

  o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

// step is an integer ≥ 1
L1 = {large 1-sequences};   // result of litemset phase
for (k = 2; k <= step and Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1;
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c;
    Lk = Candidates in Ck with minimum support;
end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

for (k = step; Lk ≠ ∅; k += step) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅;
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c);
        foreach sequence x ∈ X do
            increment its count in Ck+step (adding it to Ck+step if necessary);
    end
    Lk+step = Candidates in Ck+step with minimum support;
end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min { j | x ⊆ <c1 c2 ... cj> };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max { j | x ⊆ <cj cj+1 ... cn> };
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start

Finding Sequential Patterns: 4. DynamicSome OTF-Generate (cont.)

Let otf-generate be called with L2 as both arguments, and let c = <{1} {2} {3 7} {4}>. Thus c1 = {1}, c2 = {2}, etc.

Seq      End   Start
<1 2>    2     1
<1 3>    3     1
<1 4>    4     1
<2 3>    3     2
<2 4>    4     2
<3 4>    4     3

Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
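The end/start bookkeeping and the join can be sketched in Python as follows. The representation (tuples of litemset ids, customer sequence as a list of sets, 1-based positions) and the function names are assumptions made for the illustration; the L2 in the demo is likewise an assumed input, consistent with the table above.

```python
def earliest_end(seq, customer):
    """1-based index j of the shortest prefix <c1..cj> containing seq, or None."""
    pos = 0
    for item in seq:
        while pos < len(customer) and item not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return None
        pos += 1
    return pos

def latest_start(seq, customer):
    """1-based index j of the shortest suffix <cj..cn> containing seq, or None."""
    pos = len(customer) - 1
    for item in reversed(seq):
        while pos >= 0 and item not in customer[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2

def otf_generate(large_k, large_j, customer):
    """Concatenate x from large_k and y from large_j when x ends before y starts."""
    ends = {x: earliest_end(x, customer) for x in large_k}
    starts = {y: latest_start(y, customer) for y in large_j}
    return {
        x + y
        for x, end in ends.items() if end is not None
        for y, start in starts.items() if start is not None and end < start
    }

# Assumed L2 and the customer sequence <{1} {2} {3 7} {4}>.
L2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
c = [{1}, {2}, {3, 7}, {4}]
candidates = otf_generate(L2, L2, c)
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy end < start, so the single candidate produced is <1 2 3 4>.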

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

for (k--; k > 1; k--) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
        else
            Ck = New candidates generated from Ck-1;

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L2 and L2 as arguments to otf-generate to get C4.

• This yields two candidate sequences:

    C4           Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L4           Support
    <1 2 3 4>    2

• Pass L4 and L2 as arguments to otf-generate to get C6.

• C6 is found to be empty: C6 = ∅.

• In the intermediate phase, C3 is generated from L2 and C5 from L4 using apriori-generate.

• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3.


Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer DB was introduced.

• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets, the same frequent itemsets that underlie association rules.

Final Exam Question 2: What is the major difference between the two families of algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for every sequence length, but non-maximal sequences must be pruned later, in the maximal phase.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but the minimum support is not tested at all values of k.)

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation lets Litemsets be compared in constant time, which speeds up the repeated test of whether a candidate sequence is contained in a customer sequence and so reduces the run time.

Page 16: Spring 2014 Presentation by:  Thomas Little

Problem DescriptionExample

Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal

Seq with minimum support

15

Outline

Introduction

Problem Description

Finding Sequential Patterns

Performance

Conclusion

Final Exam Questions

16

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases

1 Sort Phase

2 Large itemset (Litemset) Phase

3 Transformation Phase

4 Sequence Phase

5 Maximal Phase

17

Finding Sequential Patterns1 Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor-key

This step implicitly converts the original transaction DB into a DB of customer sequences

Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time

18

Finding Sequential Patterns2 Litemset Phase

In this phase we find the set of all Large itemsets (Litemsets) L

We are also simultaneously finding the set of large 1-sequences since this set is just

lt l gt | l isin L

19

Finding Sequential Patterns2 Litemset Phase - Cont

In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present

In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions

20

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns: 4. AprioriAll: Candidate Generation

• The apriori-generate function takes as its argument L_{k-1}, the set of all large (k-1)-sequences. It works as follows. First, join L_{k-1} with L_{k-1}:

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Finding Sequential Patterns: 4. AprioriAll: Candidate Generation - Cont.

Example: <1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
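The join and prune steps can be tried out separately with a short Python sketch. The set L_3 below is an assumption (the slide's figure is not part of this transcript); it is chosen so that <2 4 3> is not in L_3, as the example states. Skipping self-joins (p == q) is also an assumption of this sketch.

```python
from typing import Set, Tuple

Sequence = Tuple[int, ...]


def join(prev: Set[Sequence]) -> Set[Sequence]:
    """Join step: pair sequences that agree on everything except the last element."""
    return {p + (q[-1],) for p in prev for q in prev if p != q and p[:-1] == q[:-1]}


def prune(candidates: Set[Sequence], prev: Set[Sequence]) -> Set[Sequence]:
    """Prune step: keep a candidate only if every (k-1)-subsequence is in L_{k-1}."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


# Assumed L_3, consistent with the statement that <2 4 3> is not large.
l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
joined = join(l3)
c4 = prune(joined, l3)
```

Here the join yields four candidates, and pruning removes <1 2 4 3>, <1 3 4 5> and <1 3 5 4>, each of which has a 3-subsequence missing from L_3.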

Finding Sequential Patterns: 4. AprioriAll: Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            delete from S all subsequences of s_k;

• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
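A direct, unoptimized Python sketch of this loop is shown below. It is an illustration only: it uses pairwise containment tests rather than the hash-tree structure the authors refer to, it works on sequences of itemsets so that (40) counts as contained in (40 70), and the set of large sequences is an assumed example.

```python
from typing import FrozenSet, Set, Tuple

Seq = Tuple[FrozenSet[int], ...]   # a sequence of itemsets


def is_contained(a: Seq, b: Seq) -> bool:
    """True if each itemset of a is a subset of a distinct, later itemset of b, in order."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def maximal(large: Set[Seq]) -> Set[Seq]:
    """Delete from S every sequence contained in some other k-sequence, for k = n down to 2."""
    remaining = set(large)
    n = max((len(s) for s in remaining), default=0)
    for k in range(n, 1, -1):
        for s in [s for s in remaining if len(s) == k]:
            remaining -= {t for t in remaining if t != s and is_contained(t, s)}
    return remaining


def seq(*itemsets: Set[int]) -> Seq:
    """Helper (hypothetical) to build a sequence from plain sets."""
    return tuple(frozenset(i) for i in itemsets)


# Assumed set of large sequences (illustrative only).
large = {
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {40, 70}), seq({30}, {90}),
}
result = maximal(large)
```

Only <(30) (40 70)> and <(30) (90)> survive on this input; every other sequence is contained in one of the two.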

Finding Sequential Patterns: 4. AprioriAll: Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequence lengths to skip.

Let hit_k = |L_k| / |C_k| (i.e. the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)   // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elseif (hit_k < 0.75) return k + 2;
        elseif (hit_k < 0.80) return k + 3;
        elseif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
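The heuristic translates almost line for line into Python. In this sketch the hit ratio is passed in explicitly as a second parameter, which is an assumption made to keep the function self-contained; the name next_length is likewise chosen here only to avoid shadowing Python's built-in next.

```python
def next_length(k: int, hit_k: float) -> int:
    """Return the sequence length to count in the next pass.

    k is the length counted in the last pass and hit_k = |L_k| / |C_k|.
    The higher the hit ratio, the more lengths are skipped.
    """
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


low = next_length(2, 0.5)    # few candidates were large: count the very next length
high = next_length(2, 0.9)   # most candidates were large: skip ahead
```

The intuition is that when most candidates turn out to be large, counting every length is wasteful, because many of those sequences will be non-maximal anyway.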

Finding Sequential Patterns: 4. AprioriSome: Forward Phase

    L_1 = {large 1-sequences};   // result of the Litemset phase
    C_1 = L_1;
    last = 1;                    // we last counted C_last
    for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
    begin
        if (L_{k-1} known) then
            C_k = new candidates generated from L_{k-1};
        else
            C_k = new candidates generated from C_{k-1};
        if (k == next(last)) then begin   // next k to count
            foreach customer-sequence c in the database do
                increment the count of all candidates in C_k that are contained in c;
            L_k = candidates in C_k with minimum support;
            last = k;
        end
    end

Finding Sequential Patterns: 4. AprioriSome: Backward Phase

    for (k--; k >= 1; k--) do
        if (L_k not found in forward phase) then begin
            delete all sequences in C_k contained in some L_i, i > k;
            foreach customer-sequence c in D_T do
                increment the count of all candidates in C_k that are contained in c;
            L_k = candidates in C_k with minimum support;
        end
        else   // L_k already known
            delete all sequences in L_k contained in some L_i, i > k;
    Answer = ∪_k L_k;   // maximal phase not needed

Notation: D_T is the transformed database.
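The forward and backward phases can be combined into one compact Python sketch. Several things here are assumptions rather than the authors' code: support is counted by plain scans, self-joins are skipped during candidate generation, Litemset ids are treated as atomic when testing whether one sequence is contained in another, and the five-customer transformed database is an invented example.

```python
from typing import Dict, List, Set, Tuple

Sequence = Tuple[int, ...]
Customer = List[Set[int]]


def contains(customer: Customer, seq: Sequence) -> bool:
    """True if the ids of seq occur in order in distinct transactions of customer."""
    pos = 0
    for x in seq:
        while pos < len(customer) and x not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def generate(prev: Set[Sequence]) -> Set[Sequence]:
    """apriori-generate: join on the common prefix, then prune."""
    joined = {p + (q[-1],) for p in prev for q in prev if p != q and p[:-1] == q[:-1]}
    return {c for c in joined if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def count_large(db: List[Customer], cands: Set[Sequence], min_count: int) -> Set[Sequence]:
    return {c for c in cands if sum(contains(cust, c) for cust in db) >= min_count}


def next_length(k: int, hit: float) -> int:
    for bound, skip in ((0.666, 1), (0.75, 2), (0.80, 3), (0.85, 4)):
        if hit < bound:
            return k + skip
    return k + 5


def drop_contained(seqs: Set[Sequence], longer: List[Sequence]) -> Set[Sequence]:
    """Remove sequences that are contained in some longer large sequence."""
    return {s for s in seqs if not any(contains([{x} for x in t], s) for t in longer)}


def apriori_some(db: List[Customer], l1: Set[Sequence], min_count: int) -> Set[Sequence]:
    cands: Dict[int, Set[Sequence]] = {1: set(l1)}
    large: Dict[int, Set[Sequence]] = {1: set(l1)}
    last, k = 1, 2
    # Forward phase: generate candidates for every length, count only some of them.
    while cands[k - 1] and large[last]:
        cands[k] = generate(large[k - 1] if k - 1 in large else cands[k - 1])
        if k == next_length(last, len(large[last]) / len(cands[last])):
            large[k] = count_large(db, cands[k], min_count)
            last = k
        k += 1
    # Backward phase: count skipped lengths after removing non-maximal candidates.
    for j in range(k - 1, 0, -1):
        longer = [t for i in large if i > j for t in large[i]]
        if j not in large:
            large[j] = count_large(db, drop_contained(cands[j], longer), min_count)
        else:
            large[j] = drop_contained(large[j], longer)
    return set().union(*large.values())


# Assumed transformed database; minimum support of 2 customers.
db = [[{1, 5}, {2}, {3}, {4}], [{1}, {3}, {4}, {3, 5}],
      [{1}, {2}, {3}, {4}], [{1}, {3}, {5}], [{4}, {5}]]
answer = apriori_some(db, {(i,) for i in range(1, 6)}, min_count=2)
```

Because C_1 = L_1 gives a hit ratio of 1, the heuristic skips far ahead on this tiny input, and almost all of the counting happens in the backward phase. The result contains only maximal sequences, so no separate maximal phase is run.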

Finding Sequential Patterns: 4. AprioriSome: Example

Minimum support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, DynamicSome skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.

• In the initialization phase, only sequences of length up to and including the step variable are counted. If step is 3, sequences of length 1, 2 and 3 are counted.

• In the forward phase, sequences of length 2 × step, 3 × step, 4 × step, etc. are generated on the fly, based on previous passes and the customer sequences in the database.

    o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c at hand, and they do not overlap in c, then the concatenation <s_6 s_3> is a candidate (6+3)-sequence.

• In the intermediate phase, candidate sequences are generated for the skipped lengths.

    o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome: Initialization Phase

    // step is an integer ≥ 1
    L_1 = {large 1-sequences};   // result of the Litemset phase
    for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
    begin
        C_k = new candidates generated from L_{k-1};
        foreach customer-sequence c in D_T do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
    end

Finding Sequential Patterns: 4. DynamicSome: Forward Phase

    for (k = step; L_k ≠ ∅; k += step) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in D_T do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X do
                increment its count in C_{k+step} (adding it to C_{k+step} if necessary);
        end
        L_{k+step} = candidates in C_{k+step} with minimum support;
    end

Finding Sequential Patterns: 4. DynamicSome: OTF-Generate

    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome: OTF-Generate - Cont.

Let otf-generate be called with L_2, and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc. The end and start values of each sequence in L_2 that is contained in c are:

    Seq      End   Start
    <1 2>    2     1
    <1 3>    3     1
    <1 4>    4     1
    <2 3>    3     2
    <2 4>    4     2
    <3 4>    4     3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
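The worked example can be reproduced with a small Python sketch of otf-generate. The function names end_index and start_index are hypothetical, positions are 1-based to match the table, and Litemset ids are treated as atomic; this is an illustration under those assumptions rather than the authors' implementation.

```python
from typing import List, Optional, Set, Tuple

Sequence = Tuple[int, ...]
Customer = List[Set[int]]


def end_index(c: Customer, seq: Sequence) -> Optional[int]:
    """Smallest j (1-based) such that seq is contained in <c_1 ... c_j>, or None."""
    pos = 0
    for x in seq:
        while pos < len(c) and x not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def start_index(c: Customer, seq: Sequence) -> Optional[int]:
    """Largest j (1-based) such that seq is contained in <c_j ... c_n>, or None."""
    pos = len(c) - 1
    for x in reversed(seq):
        while pos >= 0 and x not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2


def otf_generate(lk: Set[Sequence], lj: Set[Sequence], c: Customer) -> Set[Sequence]:
    """Concatenate x from L_k and y from L_j whenever x ends before y starts in c."""
    ends = {x: end_index(c, x) for x in lk}
    starts = {y: start_index(c, y) for y in lj}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


l2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
c = [{1}, {2}, {3, 7}, {4}]
candidates = otf_generate(l2, l2, c)
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, so the single candidate produced is <1 2 3 4>.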

Finding Sequential Patterns: 4. DynamicSome: Intermediate Phase

    for (k--; k > 1; k--) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = new candidates generated from L_{k-1};
            else
                C_k = new candidates generated from C_{k-1};

Finding Sequential Patterns: 4. DynamicSome: Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

• This yields 2 candidate sequences:

    C_4          Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L_4          Support
    <1 2 3 4>    2

• Pass L_4 and L_2 as arguments to otf-generate to get C_6.

• C_6 is found to be empty: C_6 = ∅.

• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.

• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.


Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were presented to find sequential patterns:
    o CountAll: AprioriAll
    o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum supports.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules describe intra-transaction patterns, while sequential patterns describe inter-transaction patterns. Both ideas appear in the Apriori algorithms studied here: the algorithms look for sequential patterns whose elements are large itemsets, the same intra-transaction building blocks from which association rules are derived.

Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every sequence on every pass, but maximality must be checked afterwards.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out while the algorithm runs, but minimum support is not tested at every value of k.)

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of Litemsets it contains, with each Litemset mapped to an integer. Two Litemsets can then be compared for equality in constant time, which reduces the time needed to check whether a sequence is contained in a customer sequence.

Page 18: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns

The problem of finding sequential patterns is split into five phases

1 Sort Phase

2 Large itemset (Litemset) Phase

3 Transformation Phase

4 Sequence Phase

5 Maximal Phase

17

Finding Sequential Patterns1 Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor-key

This step implicitly converts the original transaction DB into a DB of customer sequences

Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time

18

Finding Sequential Patterns2 Litemset Phase

In this phase we find the set of all Large itemsets (Litemsets) L

We are also simultaneously finding the set of large 1-sequences since this set is just

lt l gt | l isin L

19

Finding Sequential Patterns2 Litemset Phase - Cont

In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present

In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions

20

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns 4. Sequence Phase

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

Finding Sequential Patterns 4. Sequence Phase

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns 4. Sequence Phase: AprioriAll

    L_1 = {large 1-sequences};   // result of the Litemset phase
    for (k = 2; L_{k-1} ≠ ∅; k++) do
    begin
        C_k = new candidates generated from L_{k-1}
        foreach customer-sequence c in the database do
            increment the count of all candidates in C_k that are contained in c
        L_k = candidates in C_k with minimum support
    end
    Answer = maximal sequences in ∪_k L_k

Notation: L_k = set of all large k-sequences; C_k = set of candidate k-sequences.
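
The counting pass inside the loop ("increment the count of all candidates in C_k that are contained in c") can be sketched as follows. This assumes the transformed representation, where each customer sequence is a list of sets of litemset ids and each candidate is a tuple of ids. The names `contains` and `count_pass` and the sample database are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Candidate = Tuple[int, ...]
CustomerSeq = List[Set[int]]


def contains(customer_seq: CustomerSeq, candidate: Candidate) -> bool:
    """True if the candidate's ids occur, in order, in distinct transactions."""
    pos = 0
    for element in customer_seq:
        # Greedy match: each transaction may satisfy at most one position.
        if pos < len(candidate) and candidate[pos] in element:
            pos += 1
    return pos == len(candidate)


def count_pass(candidates: Set[Candidate], database: List[CustomerSeq],
               min_count: int) -> Set[Candidate]:
    """One scan of the database: keep candidates supported by enough customers."""
    counts: Dict[Candidate, int] = {c: 0 for c in candidates}
    for seq in database:
        for c in candidates:
            if contains(seq, c):
                counts[c] += 1
    return {c for c, n in counts.items() if n >= min_count}


# Hypothetical transformed database of five customers.
database = [
    [{1}, {5}],
    [{1}, {2, 3, 4}],
    [{1, 3}],
    [{1}, {2, 3, 4}, {5}],
    [{5}],
]
large_2 = count_pass({(1, 5), (1, 2), (5, 1), (1, 3)}, database, min_count=2)
```

The sketch tests every candidate against every customer; the hash-tree structure mentioned later in the slides exists to avoid that exhaustive inner loop.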

Finding Sequential Patterns 4. AprioriAll: Candidate Generation

• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Finding Sequential Patterns 4. AprioriAll: Candidate Generation

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
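
The join and prune steps can be sketched in a few lines. This follows the join condition as written (a sequence may join with itself) and uses a hypothetical L3 chosen to be consistent with the remark about <1 2 4 3>; the slide's own table is not available in the transcript.

```python
from typing import Set, Tuple

Sequence = Tuple[int, ...]


def apriori_generate(large_prev: Set[Sequence]) -> Set[Sequence]:
    """Generate candidate k-sequences from the large (k-1)-sequences."""
    # Join step: p and q must agree on their first k-2 elements;
    # the candidate is p extended by the last element of q.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune step: every (k-1)-subsequence (drop one element) must be large.
    def all_subsequences_large(c: Sequence) -> bool:
        return all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))

    return {c for c in joined if all_subsequences_large(c)}


# Hypothetical L3, consistent with the pruning remark above.
l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
c4 = apriori_generate(l3)
```

With this L3, the join produces <1 2 4 3> among others, and the prune step removes it because <2 4 3> is not large.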

Finding Sequential Patterns 4. AprioriAll: Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            delete from S all subsequences of s_k

• The authors state that data structures (hash trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
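
A direct, unoptimised sketch of the maximal phase is shown below. It works on sequences of itemsets rather than integer ids so that itemset containment is respected, and it lets the loop run down to k = 1 so that a 1-sequence can also prune another 1-sequence; both choices are assumptions of this sketch, as is the sample set S. No hash tree is used.

```python
from typing import FrozenSet, Set, Tuple

Seq = Tuple[FrozenSet[int], ...]


def seq(*itemsets) -> Seq:
    """Convenience constructor: seq({30}, {40, 70}) -> <(30) (40 70)>."""
    return tuple(frozenset(i) for i in itemsets)


def is_contained(a: Seq, b: Seq) -> bool:
    """True if each itemset of a is a subset of a distinct itemset of b, in order."""
    pos = 0
    for element in b:
        if pos < len(a) and a[pos] <= element:
            pos += 1
    return pos == len(a)


def maximal_sequences(large: Set[Seq]) -> Set[Seq]:
    """Delete every sequence that is contained in some other large sequence."""
    remaining = set(large)
    longest = max((len(s) for s in remaining), default=0)
    for k in range(longest, 0, -1):
        for s_k in [s for s in remaining if len(s) == k]:
            if s_k in remaining:  # it may have been deleted earlier in this pass
                remaining -= {t for t in remaining
                              if t != s_k and is_contained(t, s_k)}
    return remaining


# Hypothetical set S of large sequences.
S = {
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}), seq({30}, {40, 70}),
}
answer = maximal_sequences(S)
```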

Finding Sequential Patterns 4. AprioriAll: Example

Finding Sequential Patterns 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip.

Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)   // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elsif (hit_k < 0.75) return k + 2;
        elsif (hit_k < 0.80) return k + 3;
        elsif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
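
The skip heuristic translates almost line for line into Python. In this sketch the hit ratio is passed in explicitly rather than looked up from L_k and C_k; that parameter and the name `next_length` are assumptions made to keep the example self-contained.

```python
def next_length(k: int, hit_ratio: float) -> int:
    """Return the sequence length to count in the next pass.

    k is the length counted in the last pass; hit_ratio is |L_k| / |C_k|.
    The higher the ratio, the more lengths are skipped.
    """
    if hit_ratio < 0.666:
        return k + 1
    elif hit_ratio < 0.75:
        return k + 2
    elif hit_ratio < 0.80:
        return k + 3
    elif hit_ratio < 0.85:
        return k + 4
    else:
        return k + 5
```

For example, if half of the candidate 2-sequences turned out to be large, the next pass counts length 3; if 90% were large, it jumps straight to length 7.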

Finding Sequential Patterns 4. AprioriSome: Forward Phase

    L_1 = {large 1-sequences};   // result of the Litemset phase
    C_1 = L_1;
    last = 1;                    // we last counted C_last
    for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
    begin
        if (L_{k-1} known) then
            C_k = new candidates generated from L_{k-1};
        else
            C_k = new candidates generated from C_{k-1};
        if (k == next(last)) then begin   // next k to count
            foreach customer-sequence c in the database do
                increment the count of all candidates in C_k that are contained in c
            L_k = candidates in C_k with minimum support
            last = k;
        end
    end

Finding Sequential Patterns 4. AprioriSome: Backward Phase

    for (k--; k >= 1; k--) do
        if (L_k not found in forward phase) then begin
            delete all sequences in C_k contained in some L_i, i > k;
            foreach customer-sequence c in D_T do
                increment the count of all candidates in C_k that are contained in c
            L_k = candidates in C_k with minimum support
        end
        else   // L_k already known
            delete all sequences in L_k contained in some L_i, i > k;
    Answer = ∪_k L_k   // maximal phase not needed

Notation: D_T = transformed database.

Finding Sequential Patterns 4. AprioriSome: Example

Minimum support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.

Finding Sequential Patterns 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.

• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

  o While generating sequences of length 9 with a step size of 3: while passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.

Finding Sequential Patterns 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

  o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.

• The backward phase is identical to that of AprioriSome.
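
Which lengths each phase handles can be tabulated with a small helper. This is only a bookkeeping sketch: the parameter `longest` (the last length the forward phase reaches) is a hypothetical stand-in, since in the algorithm the forward phase stops when some L_k comes back empty.

```python
from typing import List, Tuple


def dynamic_some_schedule(step: int, longest: int) -> Tuple[List[int], List[int], List[int]]:
    """Split lengths 1..longest into (initialization, forward, skipped)."""
    initialization = list(range(1, step + 1))            # counted one by one
    forward = list(range(2 * step, longest + 1, step))   # multiples of step
    counted = set(initialization) | set(forward)
    skipped = [k for k in range(1, longest + 1) if k not in counted]
    return initialization, forward, skipped


init, forward, skipped = dynamic_some_schedule(step=3, longest=9)
```

With step = 3 this gives lengths 1 to 3 for initialization, 6 and 9 for the forward phase, and 4, 5, 7, 8 left for the intermediate and backward phases, matching the example above.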

Finding Sequential Patterns 4. DynamicSome: Initialization Phase

    // step is an integer ≥ 1
    L_1 = {large 1-sequences};   // result of the Litemset phase
    for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
    begin
        C_k = new candidates generated from L_{k-1}
        foreach customer-sequence c in D_T do
            increment the count of all candidates in C_k that are contained in c
        L_k = candidates in C_k with minimum support
    end

Finding Sequential Patterns 4. DynamicSome: Forward Phase

    for (k = step; L_k ≠ ∅; k += step) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in D_T do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary)
        end
        L_{k+step} = candidates in C_{k+step} with minimum support
    end

Finding Sequential Patterns 4. DynamicSome: OTF-Generate

    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns 4. DynamicSome: OTF-Generate - Cont

Let otf-generate be called with L2, and let c = <{1} {2} {3 7} {4}>. Thus c1 = {1}, c2 = {2}, etc.

    Seq      End  Start
    <1 2>    2    1
    <1 3>    3    1
    <1 4>    4    1
    <2 3>    3    2
    <2 4>    4    2
    <3 4>    4    3

Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
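
The otf-generate procedure can be sketched as below, using the same customer sequence and the six 2-sequences from the table. Positions are 1-based to match the End and Start columns. The helper names `earliest_end` and `latest_start` are hypothetical, and taking L2 to be exactly the six tabulated sequences is an assumption.

```python
from typing import List, Optional, Set, Tuple

Sequence = Tuple[int, ...]
CustomerSeq = List[Set[int]]


def earliest_end(x: Sequence, c: CustomerSeq) -> Optional[int]:
    """min j such that x is contained in <c_1 ... c_j>, or None if never."""
    pos = 0
    for j, element in enumerate(c, start=1):
        if x[pos] in element:
            pos += 1
            if pos == len(x):
                return j
    return None


def latest_start(x: Sequence, c: CustomerSeq) -> Optional[int]:
    """max j such that x is contained in <c_j ... c_n>, or None if never."""
    pos = len(x) - 1
    for j in range(len(c), 0, -1):
        if x[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


def otf_generate(l_k: Set[Sequence], l_j: Set[Sequence], c: CustomerSeq) -> Set[Sequence]:
    """Concatenate x in L_k with y in L_j whenever x ends before y starts in c."""
    ends = {x: earliest_end(x, c) for x in l_k}
    starts = {y: latest_start(y, c) for y in l_j}
    return {
        x + y
        for x, e in ends.items() if e is not None
        for y, s in starts.items() if s is not None and e < s
    }


c = [{1}, {2}, {3, 7}, {4}]
l2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
candidates = otf_generate(l2, l2, c)
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, so the single candidate is <1 2 3 4>.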

Finding Sequential Patterns 4. DynamicSome: Intermediate Phase

    for (k--; k > 1; k--) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = new candidates generated from L_{k-1}
            else
                C_k = new candidates generated from C_{k-1}

Finding Sequential Patterns 4. DynamicSome: Example

• Let step = 2. Use L2 and L2 as the arguments to otf-generate to get C4.

• This yields 2 candidate sequences:

    C4           Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L4           Support
    <1 2 3 4>    2

• Pass L4 and L2 as the arguments to otf-generate to get C6.

• C6 is found to be empty (C6 = ∅).

• In the intermediate phase, C3 is generated from L2, and C5 from L4, using apriori-generate.

• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3.


Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer DB was introduced.

• Two types of algorithms were introduced to find sequential patterns:

  o CountAll: AprioriAll

  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here: the algorithms look for sequential patterns whose elements are large itemsets, the same intra-transaction building blocks that association rule mining finds.

Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximal sequences must be checked for later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but the minimum support is not tested at all values of k.)

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of Litemsets it contains, each represented by an integer. Two Litemsets can then be compared in constant time, which speeds up the repeated test of whether a candidate sequence is contained in a customer sequence and so reduces the run time.


Finding Sequential Patterns 1. Sort Phase

The DB is sorted with customer-id as the major key and transaction-time as the minor key.

This step implicitly converts the original transaction DB into a DB of customer sequences.

Recall: a customer sequence is a list of a customer's transactions ordered by increasing transaction time.
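
A minimal sketch of the sort phase, assuming each record is a (customer-id, transaction-time, items) tuple. The record layout, the function name and the sample records are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, List, Set, Tuple

Record = Tuple[int, int, Set[int]]  # (customer_id, transaction_time, items)


def to_customer_sequences(records: List[Record]) -> Dict[int, List[Set[int]]]:
    """Sort by customer-id (major key) and transaction-time (minor key),
    then group each customer's transactions into one ordered sequence."""
    ordered = sorted(records, key=itemgetter(0, 1))
    return {
        customer_id: [items for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical unsorted transaction records.
records = [
    (2, 5, {30}),
    (1, 9, {90}),
    (1, 3, {30}),
    (2, 1, {10, 20}),
]
sequences = to_customer_sequences(records)
```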

Finding Sequential Patterns 2. Litemset Phase

In this phase we find the set of all large itemsets (Litemsets), L.

We are also simultaneously finding the set of large 1-sequences, since this set is just { <l> | l ∈ L }.

Finding Sequential Patterns2 Litemset Phase - Cont

In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present

In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions

20

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns4 AprioriAll Candidate Generation

bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1

insert into Ck

select plitemset1 plitemsetk-1 qlitemsetk-1

from Lk-1 p Lk-1 q

where plitemset1 = qlitemset1

plitemsetk-2 = qlitemsetk-2

bull Next delete all sequences c isin Ck such that some

(k-1)-subsequence of c is not in Lk-131

Finding Sequential Patterns4 AprioriAll Candidate Generation

Example

lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure

32

Finding Sequential Patterns4 AprioriAll Maximal Phase

bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence

for ( k = n k gt 1 k --)

foreach k-sequence sk do

Delete from S all subsequences of sk

bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers

33

Finding Sequential Patterns4 AprioriAll Example

34

Finding Sequential Patterns4 AprioriSome

bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip

Let hitk = |Lk| |Ck|

(ie ratio of large k-sequences to candidate k-sequences)

function next(k integer) k is the length of seq counted last pass

beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2

elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5

end

bull next returns the length of sequences to count in the next pass 35

Finding Sequential Patterns4 AprioriSome Forward Phase

L1 = large 1-sequences Result of Litemset phase

C1 = L1

last = 1 We last counted Clast

for (k = 2 Ck-1 ne and Llast ne k++) do

begin

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

if (k== next(last) ) then begin (next k to count)

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

last = k

end

end36

Finding Sequential Patterns4 AprioriSome Backward Phase

for (k-- kgt=1 k--) do

if (Lk not found in forward phase) then begin

Delete all sequences in Ck contained in some L i igtk

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

else Lk already known

Delete all sequences in Lk contained in some Li igtk

Answer = Uk Lk (Maximal Phase not Needed)

Notation DT Transformed database 37

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns4 DynamicSome Forward Phase

for ( k = step Lk1048576 ne k+= step ) do

begin

find Lk+step from Lk and Lstep

Ck+step =

foreach customer-sequence c in DT do

begin

X = otf-generate(Lk Lstep c)

foreach sequence x isin Xrsquo increment its count in

Ck+step (adding it to Ck+step if necessary)

end

Lk+step = Candidates in Ck+step with minimum support

end45

Finding Sequential Patterns4 DynamicSome OTF-Generate

c is the customer sequence lt c1c2cn gt

Xk = subseq(Lk c)

forall sequences x isin Xk do

xend = min j | x sube lt c1c2cj gt

Xj = subseq(Lj c)

forall sequences x isin Xj do

xstart = max j | x sube lt cjcj+1cn gt

Answer = join of Xkwith Xj with the join condition

Xkend lt Xjstart46

Finding Sequential Patterns4 DynamicSome OTF-Generate cont

Let otf-generate be called with

L2 and let c = lt1 2 3 7

4gt Thus c1 = 1 c2= 2

etc

Thus the result of the join with the join condition

X2end lt X2start

(where X2 denotes the seq of

length 2) is the single sequence lt1 2 3 4gt

Seq

lt1 2gt

End

2

Start

1

lt1 3gt 3 1

lt1 4gt 4 1

lt2 3gt 3 2

lt2 4gt 4 2

lt3 4gt 4 3

47

Finding Sequential Patterns4 DynamicSome intermediate Phase

for ( k-- k gt 1 k-- ) do

if (Lk not yet determined) then

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

48

Finding Sequential Patterns4 DynamicSome Example

bull Let step = 2

use L2 and L2 as argument in otf-generate to get C4

49

Finding Sequential Patterns4 DynamicSome Example

bull Get 2 candidate sequences

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

50

Finding Sequential Patterns4 DynamicSome Example

bull Only lt1 2 3 4gt is large

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

L4 Minisup

lt1 2 3 4gt 2

51

Finding Sequential Patterns4 DynamicSome Example

bull pass as arg to otf-gen L2 and L4 to get C6

L4 sup

lt1 2 3 4gt 2

52

Finding Sequential Patterns4 DynamicSome Example

bull C6 is found to be empty

L4 sup

lt1 2 3 4gt 2C6 =

53

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

bull The problem of mining sequential patterns from a customer DB was introduced

bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome

bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup

bull AprioriAll and AprioriSome have excellent scale-up properties 62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions63

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

64

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules

65

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

66

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality (The minimum support is checked for each sequence on each run but maximal sequences must be checked for later)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support (Non-maximal sequences are pruned out during runtime but the minimum support is not tested at all values of k) 67

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

68

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

The transformation allows each record to be looked up in constant time reducing the run time

69

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 20: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns2 Litemset Phase

In this phase we find the set of all Large itemsets (Litemsets) L

We are also simultaneously finding the set of large 1-sequences since this set is just

lt l gt | l isin L

19

Finding Sequential Patterns2 Litemset Phase - Cont

In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present

In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions

20

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns: 4. Sequence Phase

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

Finding Sequential Patterns: 4. Sequence Phase

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase: AprioriAll

    L1 = {large 1-sequences};  // Result of Litemset phase
    for (k = 2; Lk-1 ≠ ∅; k++) do
    begin
        Ck = New candidates generated from Lk-1
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
    end
    Answer = Maximal Sequences in ∪k Lk

Notation: Lk = set of all large k-sequences; Ck = set of candidate k-sequences
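The counting step inside the loop ("increment the count of all candidates in Ck that are contained in c") can be sketched as below. This is a simplified illustration that scans every candidate against every customer sequence and does not use the hash-tree structure the authors refer to. The names `is_contained` and `large_sequences`, the integer `min_count` threshold, and the sample database are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Candidate = Tuple[int, ...]          # sequence of litemset ids
Transformed = List[Set[int]]         # transformed customer sequence


def is_contained(candidate: Candidate, customer: Transformed) -> bool:
    """True if the candidate's ids occur in order, each in a distinct element."""
    pos = 0
    for element in customer:
        if pos < len(candidate) and candidate[pos] in element:
            pos += 1  # each element can match at most one position
    return pos == len(candidate)


def large_sequences(
    candidates: Iterable[Candidate],
    database: List[Transformed],
    min_count: int,
) -> Dict[Candidate, int]:
    """Count each candidate once per supporting customer; keep those reaching min_count."""
    candidates = list(candidates)
    counts: Counter = Counter()
    for customer in database:
        for candidate in candidates:
            if is_contained(candidate, customer):
                counts[candidate] += 1
    return {cand: n for cand, n in counts.items() if n >= min_count}


# Hypothetical transformed database of five customers (illustrative data only).
database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

large = large_sequences([(1, 2, 3, 4), (1, 3, 5), (4, 5), (5, 1)], database, min_count=2)
```

Greedy left-to-right matching is sufficient here because matching an id at its earliest possible element never prevents a later id from matching.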

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:

    insert into Ck
    select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
    from Lk-1 p, Lk-1 q
    where p.litemset1 = q.litemset1, ...,
          p.litemsetk-2 = q.litemsetk-2;

• Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.
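The join and prune steps can be sketched as follows, assuming sequences are tuples of litemset ids. The function name `apriori_generate` mirrors the slide, but the code and the sample L3 are illustrative assumptions. The join below allows p = q because the join condition as stated does not exclude it.

```python
from typing import Iterable, Set, Tuple

Sequence_ = Tuple[int, ...]  # sequence of litemset ids


def apriori_generate(large_prev: Iterable[Sequence_]) -> Set[Sequence_]:
    """Generate candidate k-sequences from the large (k-1)-sequences."""
    large_prev = set(large_prev)

    # Join step: p and q agree on their first k-2 litemsets;
    # the candidate is p followed by the last litemset of q.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune step: every (k-1)-subsequence (drop one position) must be large.
    def all_subsequences_large(c: Sequence_) -> bool:
        return all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))

    return {c for c in joined if all_subsequences_large(c)}


# Hypothetical set of large 3-sequences (illustrative data only).
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_generate(L3)
```

With this input the join produces, among others, (1, 2, 3, 4) and (1, 2, 4, 3); the latter is pruned because (2, 4, 3) is not in L3.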

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence sk do
            Delete from S all subsequences of sk

• The authors claim that data structures and an algorithm exist to do this efficiently (hash-trees), citing two of their earlier papers.
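A straightforward sketch of the maximal phase, without the hash-tree optimization mentioned above. It assumes sequences are tuples of litemset ids treated as opaque values, so it does not account for one litemset being contained in another. The names and the sample input are assumptions.

```python
from typing import Iterable, Set, Tuple

Sequence_ = Tuple[int, ...]  # sequence of litemset ids


def is_subsequence(short: Sequence_, long: Sequence_) -> bool:
    """True if `short` can be obtained from `long` by deleting elements."""
    remaining = iter(long)
    return all(item in remaining for item in short)  # `in` consumes the iterator


def maximal_sequences(large: Iterable[Sequence_]) -> Set[Sequence_]:
    """Process sequences from longest to shortest, deleting their proper subsequences."""
    result = set(large)
    for s in sorted(result, key=len, reverse=True):
        if s not in result:
            continue  # already removed as a subsequence of a longer sequence
        result -= {t for t in result if t != s and is_subsequence(t, s)}
    return result


# Hypothetical set S of all large sequences (illustrative data only).
S = {
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
}
maximal = maximal_sequences(S)
```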

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip.

Let hitk = |Lk| / |Ck| (i.e., the ratio of large k-sequences to candidate k-sequences)

    function next(k: integer)  // k is the length of sequences counted in the last pass
    begin
        if (hitk < 0.666) return k + 1;
        elsif (hitk < 0.75) return k + 2;
        elsif (hitk < 0.80) return k + 3;
        elsif (hitk < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of sequences to count in the next pass.
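The skipping heuristic translates directly into code. In this sketch the hit ratio is passed in explicitly rather than looked up from stored Lk and Ck; the function names `hit_ratio` and `next_length` are assumptions.

```python
def hit_ratio(num_large: int, num_candidates: int) -> float:
    """hit_k = |L_k| / |C_k|: the fraction of counted candidates that turned out large."""
    return num_large / num_candidates


def next_length(k: int, hit_k: float) -> int:
    """Length of sequences to count in the next pass, given the last counted length k."""
    if hit_k < 0.666:
        return k + 1
    if hit_k < 0.75:
        return k + 2
    if hit_k < 0.80:
        return k + 3
    if hit_k < 0.85:
        return k + 4
    return k + 5


# With 9 large sequences out of 10 candidates at length 2, lengths 3 to 6 are skipped.
skip_to = next_length(2, hit_ratio(9, 10))
```

The higher the hit ratio, the more lengths are skipped, on the expectation that most candidates of the skipped lengths are large and will be non-maximal anyway.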

Finding Sequential Patterns: 4. AprioriSome Forward Phase

    L1 = {large 1-sequences};  // Result of Litemset phase
    C1 = L1;
    last = 1;  // We last counted Clast
    for (k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++) do
    begin
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
        else
            Ck = New candidates generated from Ck-1;
        if (k == next(last)) then begin  // next k to count
            foreach customer-sequence c in the database do
                Increment the count of all candidates in Ck that are contained in c.
            Lk = Candidates in Ck with minimum support.
            last = k;
        end
    end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

    for (k--; k >= 1; k--) do
        if (Lk not found in forward phase) then begin
            Delete all sequences in Ck contained in some Li, i > k;
            foreach customer-sequence c in DT do
                Increment the count of all candidates in Ck that are contained in c.
            Lk = Candidates in Ck with minimum support.
        end
        else  // Lk already known
            Delete all sequences in Lk contained in some Li, i > k.
    Answer = ∪k Lk;  // Maximal phase not needed

Notation: DT = transformed database

Finding Sequential Patterns: 4. AprioriSome Example

Finding Sequential Patterns: 4. AprioriSome Example - Cont.

Minimum Support = 40% (2 customer sequences)

Finding Sequential Patterns: 4. AprioriSome Example

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates Ck from Lk-1 or Ck-1.

• DynamicSome generates Ck "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.

• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

  o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

    // step is an integer ≥ 1
    L1 = {large 1-sequences};  // Result of litemset phase
    for (k = 2; k <= step and Lk-1 ≠ ∅; k++) do
    begin
        Ck = New candidates generated from Lk-1;
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c.
        Lk = Candidates in Ck with minimum support.
    end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

    for (k = step; Lk ≠ ∅; k += step) do
    begin
        // find Lk+step from Lk and Lstep
        Ck+step = ∅;
        foreach customer-sequence c in DT do
        begin
            X = otf-generate(Lk, Lstep, c);
            foreach sequence x ∈ X do
                increment its count in Ck+step (adding it to Ck+step if necessary).
        end
        Lk+step = Candidates in Ck+step with minimum support.
    end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

    // c is the customer sequence <c1 c2 ... cn>
    Xk = subseq(Lk, c);
    forall sequences x ∈ Xk do
        x.end = min{ j | x ⊆ <c1 c2 ... cj> };
    Xj = subseq(Lj, c);
    forall sequences x ∈ Xj do
        x.start = max{ j | x ⊆ <cj cj+1 ... cn> };
    Answer = join of Xk with Xj with the join condition Xk.end < Xj.start;
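The on-the-fly generation can be sketched as follows, assuming a transformed customer sequence is a list of sets of litemset ids and large sequences are non-empty tuples of ids. The helper names `end_index`, `start_index`, and `otf_generate`, and the sample L2, are assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Set, Tuple

Sequence_ = Tuple[int, ...]   # non-empty sequence of litemset ids
Transformed = List[Set[int]]  # transformed customer sequence <c1 c2 ... cn>


def end_index(seq: Sequence_, c: Transformed) -> Optional[int]:
    """Smallest j (1-based) such that seq is contained in <c1 ... cj>, or None."""
    pos = 0
    for j, element in enumerate(c, start=1):
        if seq[pos] in element:
            pos += 1
            if pos == len(seq):
                return j
    return None


def start_index(seq: Sequence_, c: Transformed) -> Optional[int]:
    """Largest j (1-based) such that seq is contained in <cj ... cn>, or None."""
    pos = len(seq) - 1
    for j in range(len(c), 0, -1):  # match from the right
        if seq[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


def otf_generate(
    large_k: Iterable[Sequence_], large_j: Iterable[Sequence_], c: Transformed
) -> List[Sequence_]:
    """Concatenate x from large_k and y from large_j when x ends before y starts in c."""
    ends = [(x, end_index(x, c)) for x in large_k]
    starts = [(y, start_index(y, c)) for y in large_j]
    return [
        x + y
        for x, e in ends if e is not None
        for y, s in starts if s is not None and e < s
    ]


# Hypothetical L2 and customer sequence (illustrative data only).
L2 = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]
c = [{1}, {2}, {3, 7}, {4}]
candidates = otf_generate(L2, L2, c)
```

Here only <1 2> (end = 2) and <3 4> (start = 3) satisfy the join condition, so the single candidate is <1 2 3 4>.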

Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont.

Let otf-generate be called with L2, and let c = <{1} {2} {3 7} {4}>. Thus c1 = {1}, c2 = {2}, etc.

    Seq      End   Start
    <1 2>    2     1
    <1 3>    3     1
    <1 4>    4     1
    <2 3>    3     2
    <2 4>    4     2
    <3 4>    4     3

Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

    for (k--; k > 1; k--) do
        if (Lk not yet determined) then
            if (Lk-1 known) then
                Ck = New candidates generated from Lk-1;
            else
                Ck = New candidates generated from Ck-1;

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L2 and L2 as arguments to otf-generate to get C4.

Finding Sequential Patterns: 4. DynamicSome Example

• Get 2 candidate sequences

    C4           Support
    <1 2 3 4>    2
    <1 3 4 5>    1

Finding Sequential Patterns: 4. DynamicSome Example

• Only <1 2 3 4> is large

    C4           Support
    <1 2 3 4>    2
    <1 3 4 5>    1

    L4           Support
    <1 2 3 4>    2

Finding Sequential Patterns: 4. DynamicSome Example

• Pass L2 and L4 as arguments to otf-generate to get C6

    L4           Support
    <1 2 3 4>    2

Finding Sequential Patterns: 4. DynamicSome Example

• C6 is found to be empty

    L4           Support
    <1 2 3 4>    2

    C6 = ∅

Finding Sequential Patterns: 4. DynamicSome Example

• In the intermediate phase, C3 is generated from L2 and C5 from L4 using apriori-generate

    L4           Support
    <1 2 3 4>    2

Finding Sequential Patterns: 4. DynamicSome Example

• C5 is found to be empty, so only C3 is counted during the backward phase to get L3

    C5 = ∅

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Conclusion

• The problem of mining sequential patterns from a customer DB was introduced.

• Two types of algorithms were introduced to find sequential patterns:

  o CountAll: AprioriAll

  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here: the elements of a sequential pattern are large itemsets, which are the same intra-transaction patterns from which association rules are built, and the algorithms look for ordered sequences of them across a customer's transactions.

Final Exam Question 2: What is the major difference between the two algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but the maximal sequences must be determined afterwards.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but the minimum support is not tested for every value of k in the forward phase.)

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The algorithms repeatedly test which large sequences are contained in a customer sequence. After the transformation, each transaction is a set of Litemsets mapped to integers, so two Litemsets can be compared in constant time, which reduces the run time of every containment test.

Page 22: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns2 Litemset Phase - Cont

The set of Litemsets is mapped to a set of contiguous integers

By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence

21

Finding Sequential Patterns2 Litemset Phase - Cont

bull Example with the minimum support 40

22

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:

insert into C_k
select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
from L_{k-1} p, L_{k-1} q
where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
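The join and prune steps of apriori-generate can be sketched in Python as below. Several things here are assumptions made for illustration: sequences are tuples of integer litemset ids, the function name `apriori_generate` is hypothetical, the sketch does not pair a sequence with itself, and the set `large_3` is an assumed input chosen so that <1 2 4 3> is pruned for the reason the slide gives.

```python
def apriori_generate(large_prev):
    """Generate candidate k-sequences from the large (k-1)-sequences.

    large_prev: set of tuples, all of the same length k-1.
    """
    large_prev = set(large_prev)

    # Join step: p and q agree on their first k-2 ids; append q's last id to p.
    joined = set()
    for p in large_prev:
        for q in large_prev:
            if p != q and p[:-1] == q[:-1]:  # assumption: no self-pairs
                joined.add(p + (q[-1],))

    # Prune step: every (k-1)-subsequence of a candidate must itself be large.
    def all_subsequences_large(cand):
        return all(cand[:i] + cand[i + 1:] in large_prev
                   for i in range(len(cand)))

    return {cand for cand in joined if all_subsequences_large(cand)}


# Assumed L_3 for illustration.
large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
candidates_4 = apriori_generate(large_3)
```

With this input the join produces <1 2 3 4>, <1 2 4 3>, <1 3 4 5> and <1 3 5 4>, and pruning removes all but <1 2 3 4>. Because order matters in a sequence, the join keeps both <1 2 3 4> and <1 2 4 3>, unlike the itemset version of Apriori.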

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.

for (k = n; k > 1; k--) do
    foreach k-sequence s_k do
        delete from S all subsequences of s_k;

• The authors state that data structures (hash trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
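A direct, unoptimized Python rendering of the maximal phase is shown below. It is a sketch under simplifying assumptions: sequences are tuples of integer litemset ids, "subsequence" is tested on ids only, and no hash tree is used. The names `is_subsequence` and `maximal_sequences` and the input set are hypothetical.

```python
def is_subsequence(short, long):
    """True if the ids of short appear in long in the same order."""
    remaining = iter(long)
    # "x in remaining" advances the iterator, so order is respected.
    return all(x in remaining for x in short)


def maximal_sequences(large):
    """Remove every sequence that is a subsequence of a longer large sequence."""
    s = set(large)
    if not s:
        return s
    n = max(len(seq) for seq in s)
    for k in range(n, 1, -1):
        for s_k in [seq for seq in s if len(seq) == k]:
            s -= {t for t in s if len(t) < k and is_subsequence(t, s_k)}
    return s


# Assumed set of large sequences for illustration.
large = {
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
}
answer = maximal_sequences(large)
```

Working from the longest sequences down means that a sequence still present in S at level k is not contained in any longer large sequence, so it is safe to use it for deleting shorter ones.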

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequence lengths to count and which to skip.

Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences).

function next(k: integer) // k is the length of the sequences counted in the last pass
begin
    if (hit_k < 0.666) return k + 1;
    elsif (hit_k < 0.75) return k + 2;
    elsif (hit_k < 0.80) return k + 3;
    elsif (hit_k < 0.85) return k + 4;
    else return k + 5;
end

• next returns the length of the sequences to count in the next pass.
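The same heuristic can be written in Python as a small sketch. The thresholds are the ones on the slide. The names `hit_ratio` and `next_length`, and passing the hit ratio as an explicit argument, are choices made here for illustration.

```python
def hit_ratio(large_k, candidates_k):
    """Fraction of the counted candidates that turned out to be large."""
    return len(large_k) / len(candidates_k)


def next_length(k, hit_k):
    """Length of the sequences to count in the next pass.

    k: length of the sequences counted in the last pass.
    hit_k: hit ratio observed in that pass.
    """
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5
```

The higher the hit ratio, the more lengths are skipped: when most candidates are large, longer sequences are likely to be large too, so counting the intermediate lengths would mostly count non-maximal sequences.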

Finding Sequential Patterns: 4. AprioriSome Forward Phase

L_1 = {large 1-sequences}; // result of litemset phase
C_1 = L_1;
last = 1; // we last counted C_last
for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
begin
    if (L_{k-1} known) then
        C_k = new candidates generated from L_{k-1};
    else
        C_k = new candidates generated from C_{k-1};
    if (k == next(last)) then begin // next k to count
        foreach customer-sequence c in the database do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
        last = k;
    end
end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

for (k--; k >= 1; k--) do
    if (L_k not found in forward phase) then begin
        delete all sequences in C_k contained in some L_i, i > k;
        foreach customer-sequence c in D_T do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
    end
    else // L_k already known
        delete all sequences in L_k contained in some L_i, i > k;
Answer = ∪_k L_k; // maximal phase not needed

Notation: D_T = transformed database.

Finding Sequential Patterns: 4. AprioriSome Example

Finding Sequential Patterns: 4. AprioriSome Example - Cont.

Minimum support = 40% (2 customer sequences)

Finding Sequential Patterns: 4. AprioriSome Example

Answer: <1 2 3 4>, <1 3 5>, <4 5>
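The backward phase of AprioriSome can be sketched in Python as follows. Everything concrete here is an assumption for illustration: the function and parameter names, the five-customer database, and the choice of which lengths were counted (1, 2 and 4) or skipped (3) in the forward phase. Sequences are tuples of integer litemset ids and customer sequences are lists of sets of ids.

```python
def is_subsequence(short, long):
    """True if the ids of short appear in long in the same order."""
    remaining = iter(long)
    return all(x in remaining for x in short)


def support(seq, database):
    """Number of customer sequences that contain seq."""
    def contained(customer):
        pos = 0
        for x in seq:
            while pos < len(customer) and x not in customer[pos]:
                pos += 1
            if pos == len(customer):
                return False
            pos += 1
        return True
    return sum(1 for customer in database if contained(customer))


def backward_phase(skipped_candidates, known_large, database, min_count):
    """skipped_candidates: {k: C_k} for lengths not counted going forward.
    known_large: {k: L_k} for lengths counted going forward."""
    kept_longer = set()  # maximal large sequences of length > k found so far
    lengths = sorted(set(skipped_candidates) | set(known_large), reverse=True)
    for k in lengths:
        pool = known_large[k] if k in known_large else skipped_candidates[k]
        # Delete sequences contained in some longer large sequence.
        survivors = {s for s in pool
                     if not any(is_subsequence(s, t) for t in kept_longer)}
        if k not in known_large:  # skipped length: count what is left
            survivors = {s for s in survivors
                         if support(s, database) >= min_count}
        kept_longer |= survivors
    return kept_longer


# Made-up transformed database; minimum support is 2 of 5 customers.
database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
known_large = {
    1: {(1,), (2,), (3,), (4,), (5,)},
    2: {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)},
    4: {(1, 2, 3, 4)},
}
skipped_candidates = {
    3: {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4), (3, 4, 5)},
}
answer = backward_phase(skipped_candidates, known_large, database, min_count=2)
```

In this run, four of the six candidate 3-sequences are discarded without being counted because they are contained in <1 2 3 4>. That saving is the point of the count-some approach.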

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, DynamicSome skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.

• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

// step is an integer ≥ 1
L_1 = {large 1-sequences}; // result of litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
    C_k = new candidates generated from L_{k-1};
    foreach customer-sequence c in D_T do
        increment the count of all candidates in C_k that are contained in c;
    L_k = candidates in C_k with minimum support;
end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

for (k = step; L_k ≠ ∅; k += step) do
begin
    // find L_{k+step} from L_k and L_step
    C_{k+step} = ∅;
    foreach customer-sequence c in D_T do
    begin
        X = otf-generate(L_k, L_step, c);
        foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary);
    end
    L_{k+step} = candidates in C_{k+step} with minimum support;
end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c);
forall sequences x ∈ X_k do
    x.end = min{ j | x is contained in <c_1 c_2 ... c_j> };
X_j = subseq(L_j, c);
forall sequences x ∈ X_j do
    x.start = max{ j | x is contained in <c_j c_{j+1} ... c_n> };
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont.

Let otf-generate be called with L_2, and let c = <{1} {2} {3 7} {4}>. Thus c_1 = {1}, c_2 = {2}, etc.

Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
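The otf-generate step and the table above can be reproduced with a short Python sketch. It is illustrative only: sequences are assumed to be tuples of integer litemset ids, the customer sequence is a list of sets of ids, positions are 1-based as on the slide, and the helper names `earliest_end`, `latest_start` and `otf_generate` are hypothetical.

```python
def earliest_end(seq, customer):
    """Smallest j such that seq is contained in <c_1 ... c_j>, or None."""
    pos = 0
    for x in seq:
        while pos < len(customer) and x not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return None
        pos += 1
    return pos  # 1-based index of the element matching the last id


def latest_start(seq, customer):
    """Largest j such that seq is contained in <c_j ... c_n>, or None."""
    pos = len(customer) - 1
    for x in reversed(seq):
        while pos >= 0 and x not in customer[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # 1-based index of the element matching the first id


def otf_generate(large_k, large_j, customer):
    """Candidate (k+j)-sequences contained in this customer sequence."""
    ends = {s: earliest_end(s, customer) for s in large_k}
    starts = {s: latest_start(s, customer) for s in large_j}
    return {a + b
            for a, end in ends.items() if end is not None
            for b, start in starts.items() if start is not None
            if end < start}  # the two parts must not overlap


large_2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
c = [{1}, {2}, {3, 7}, {4}]
candidates = otf_generate(large_2, large_2, c)
```

Using the earliest possible end for the first part and the latest possible start for the second part is what makes the single comparison `end < start` sufficient: if any non-overlapping placement exists, these two extremes also do not overlap.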

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

for (k--; k > 1; k--) do
    if (L_k not yet determined) then
        if (L_{k-1} known) then
            C_k = new candidates generated from L_{k-1};
        else
            C_k = new candidates generated from C_{k-1};

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

Finding Sequential Patterns: 4. DynamicSome Example

• This yields 2 candidate sequences:

C_4          Support
<1 2 3 4>    2
<1 3 4 5>    1

Finding Sequential Patterns: 4. DynamicSome Example

• Only <1 2 3 4> is large:

L_4          Support
<1 2 3 4>    2

Finding Sequential Patterns: 4. DynamicSome Example

• Pass L_4 and L_2 as arguments to otf-generate to get C_6.

Finding Sequential Patterns: 4. DynamicSome Example

• C_6 is found to be empty (C_6 = ∅).

Finding Sequential Patterns: 4. DynamicSome Example

• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.

Finding Sequential Patterns: 4. DynamicSome Example

• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.


Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were presented to find sequential patterns:

o CountAll: AprioriAll

o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. The two are connected in the Apriori algorithms studied here: the algorithms look for sequential patterns whose elements are large itemsets (litemsets), the same building blocks used in association rule mining.

Final Exam Question 2: What is the major difference between the two algorithm families, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for each sequence on each pass, but maximal sequences must be identified later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned during the run, but minimum support is not tested at every value of k.)

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The algorithms repeatedly test which candidate sequences are contained in a customer sequence. After the transformation, each litemset is treated as a single entity, so two litemsets can be compared in constant time, which reduces the run time of this test.

Page 24: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns3 Transformation Phase

bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation

23

Finding Sequential Patterns3 Transformation Phase - Cont

bull Each transaction is replaced by the set of all Litemsets contained in the transaction

bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)

bull A customer sequence is now represented by a list of sets of Litemsets

24

Finding Sequential Patterns3 Transformation Phase - Cont

Note (10 20) dropped because of lack of support

(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)

25

Finding Sequential Patterns4 Sequence Phase Overview

Seed set of large sequences

Create candidate sequences

Scan data to find support of candidate sequences

Determine large sequences 26

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns4 AprioriAll Candidate Generation

bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1

insert into Ck

select plitemset1 plitemsetk-1 qlitemsetk-1

from Lk-1 p Lk-1 q

where plitemset1 = qlitemset1

plitemsetk-2 = qlitemsetk-2

bull Next delete all sequences c isin Ck such that some

(k-1)-subsequence of c is not in Lk-131

Finding Sequential Patterns4 AprioriAll Candidate Generation

Example

lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure

32

Finding Sequential Patterns4 AprioriAll Maximal Phase

bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence

for ( k = n k gt 1 k --)

foreach k-sequence sk do

Delete from S all subsequences of sk

bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers

33

Finding Sequential Patterns4 AprioriAll Example

34

Finding Sequential Patterns4 AprioriSome

bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip

Let hitk = |Lk| |Ck|

(ie ratio of large k-sequences to candidate k-sequences)

function next(k integer) k is the length of seq counted last pass

beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2

elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5

end

bull next returns the length of sequences to count in the next pass 35

Finding Sequential Patterns4 AprioriSome Forward Phase

L1 = large 1-sequences Result of Litemset phase

C1 = L1

last = 1 We last counted Clast

for (k = 2 Ck-1 ne and Llast ne k++) do

begin

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

if (k== next(last) ) then begin (next k to count)

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

last = k

end

end36

Finding Sequential Patterns 4. AprioriSome Backward Phase

for (k--; k >= 1; k--) do
    if (L_k not found in forward phase) then begin
        Delete all sequences in C_k contained in some L_i, i > k;
        foreach customer-sequence c in D_T do
            Increment the count of all candidates in C_k that are contained in c.
        L_k = Candidates in C_k with minimum support.
    end
    else    // L_k already known
        Delete all sequences in L_k contained in some L_i, i > k.

Answer = ∪_k L_k    (Maximal phase not needed)

Notation: D_T is the transformed database.
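The backward phase can be sketched in Python under a few assumptions: sequences are tuples of litemset identifiers, each customer sequence in D_T is a list of sets of identifiers, and the forward phase has left behind a dictionary `large` (lengths that were counted) and a dictionary `candidates` (lengths that were skipped). All names, and the five-customer database used below, are hypothetical and only chosen to be consistent with the answer on the example slide.

```python
def is_subsequence(short, long_seq):
    """True if `short` can be obtained from `long_seq` by deleting elements."""
    remaining = iter(long_seq)
    return all(any(x == y for y in remaining) for x in short)


def contains(customer, seq):
    """True if `seq` is contained in the transformed customer sequence."""
    pos = 0
    for item in seq:
        while pos < len(customer) and item not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def backward_phase(candidates, large, database, min_count):
    """Walk lengths downward, counting skipped lengths and keeping maximal sequences."""
    answer = set()
    for k in sorted(set(candidates) | set(large), reverse=True):
        pool = large[k] if k in large else candidates[k]
        # Delete sequences contained in some longer large sequence.
        kept = {s for s in pool
                if not any(is_subsequence(s, t) for t in answer)}
        if k not in large:
            # This length was skipped in the forward phase: count it now.
            kept = {s for s in kept
                    if sum(contains(c, s) for c in database) >= min_count}
        answer |= kept
    return answer


# Hypothetical transformed database with five customers (an assumption).
database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
# Lengths 1, 2 and 4 were counted in the forward phase; length 3 was skipped.
large = {
    1: {(1,), (2,), (3,), (4,), (5,)},
    2: {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)},
    4: {(1, 2, 3, 4)},
}
candidates = {
    3: {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (1, 4, 5), (2, 3, 4), (3, 4, 5)},
}
print(sorted(backward_phase(candidates, large, database, min_count=2)))
```

Because only maximal sequences are ever added to `answer`, a separate maximal phase is not needed, matching the note on the slide.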

Finding Sequential Patterns 4. AprioriSome Example

Minimum support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in the previous passes and the customer sequences read from the database.

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.

• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns 4. DynamicSome Initialization Phase

// step is an integer ≥ 1
L_1 = {large 1-sequences};    // Result of litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1};
    foreach customer-sequence c in D_T do
        Increment the count of all candidates in C_k that are contained in c.
    L_k = Candidates in C_k with minimum support.
end

Finding Sequential Patterns 4. DynamicSome Forward Phase

for (k = step; L_k ≠ ∅; k += step) do
begin
    // find L_{k+step} from L_k and L_step
    C_{k+step} = ∅;
    foreach customer-sequence c in D_T do
    begin
        X = otf-generate(L_k, L_step, c);
        foreach sequence x ∈ X do
            Increment its count in C_{k+step} (adding it to C_{k+step} if necessary).
    end
    L_{k+step} = Candidates in C_{k+step} with minimum support.
end

Finding Sequential Patterns 4. DynamicSome OTF-Generate

// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c);
forall sequences x ∈ X_k do
    x.end = min{ j | x is contained in <c_1 c_2 ... c_j> };
X_j = subseq(L_j, c);
forall sequences x ∈ X_j do
    x.start = max{ j | x is contained in <c_j c_{j+1} ... c_n> };
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns 4. DynamicSome OTF-Generate - Cont

Let otf-generate be called with L_2 and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc.

Sequence    End    Start
<1 2>       2      1
<1 3>       3      1
<1 4>       4      1
<2 3>       3      2
<2 4>       4      2
<3 4>       4      3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
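The otf-generate example can be reproduced with a short Python sketch. It assumes sequences are tuples of litemset identifiers and that the customer sequence is a list of sets of identifiers, read here as c = <(1) (2) (3 7) (4)> so that the End and Start columns of the table come out as listed. The helper names `earliest_end` and `latest_start` are hypothetical.

```python
def earliest_end(seq, customer):
    """Smallest 1-based j such that seq is contained in <c_1 ... c_j>, or None."""
    pos = 0
    for item in seq:
        while pos < len(customer) and item not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return None
        pos += 1
    return pos


def latest_start(seq, customer):
    """Largest 1-based j such that seq is contained in <c_j ... c_n>, or None."""
    pos = len(customer) - 1
    for item in reversed(seq):
        while pos >= 0 and item not in customer[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2


def otf_generate(large_k, large_j, customer):
    """Join sequences from L_k and L_j that occur in c without overlapping."""
    ends = {x: earliest_end(x, customer) for x in large_k}
    starts = {y: latest_start(y, customer) for y in large_j}
    return {x + y
            for x, end in ends.items() if end is not None
            for y, start in starts.items()
            if start is not None and end < start}


c = [{1}, {2}, {3, 7}, {4}]
L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
print(otf_generate(L2, L2, c))
```

Only <1 2> (End = 2) and <3 4> (Start = 3) satisfy the join condition, which gives the single candidate <1 2 3 4>.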

Finding Sequential Patterns 4. DynamicSome Intermediate Phase

for (k--; k > 1; k--) do
    if (L_k not yet determined) then
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1};
        else
            C_k = New candidates generated from C_{k-1};

Finding Sequential Patterns 4. DynamicSome Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

• This yields 2 candidate sequences:

C_4          Support
<1 2 3 4>    2
<1 3 4 5>    1

• Only <1 2 3 4> is large:

L_4          Support
<1 2 3 4>    2

• Pass L_4 and L_2 as arguments to otf-generate to get C_6.

• C_6 is found to be empty (C_6 = ∅).

• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.

• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were presented to find sequential patterns:
o CountAll: AprioriAll
o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum support.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets (litemsets), the same frequent itemsets that underlie association rules.

Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for each sequence on each pass, but maximal sequences must be determined afterwards.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but minimum support is not tested at all values of k.)

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation lets each litemset be treated as a single entity, so two litemsets can be compared in constant time. This reduces the time needed to check whether a sequence is contained in a customer sequence, and so reduces the run time.


Finding Sequential Patterns 3. Transformation Phase - Cont

• Each transaction is replaced by the set of all litemsets contained in the transaction.

• Transactions with no litemsets are dropped. (But empty customer sequences still contribute to the total customer count.)

• A customer sequence is now represented by a list of sets of litemsets.

Note: (10 20) is dropped because of lack of support.

(40 60 70) is replaced with the set of litemsets {(40), (70), (40 70)} (60 does not have minimum support).
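A minimal Python sketch of the transformation step follows. It assumes the litemsets are already known from the litemset phase; the function name `transform`, the litemset list, and the three-transaction customer sequence are illustrative assumptions chosen to match the two notes above.

```python
def transform(customer_seq, litemsets):
    """Replace each transaction by the set of litemsets it contains.

    Transactions that contain no litemset are dropped.
    """
    transformed = []
    for transaction in customer_seq:
        found = {l for l in litemsets if l <= transaction}
        if found:
            transformed.append(found)
    return transformed


# Hypothetical litemsets and customer sequence (assumptions for illustration).
litemsets = [frozenset({30}), frozenset({40}), frozenset({70}),
             frozenset({40, 70}), frozenset({90})]
customer = [{10, 20}, {30}, {40, 60, 70}]

for element in transform(customer, litemsets):
    print(sorted(sorted(l) for l in element))
```

Here (10 20) disappears because it contains no litemset, and (40 60 70) becomes {(40), (70), (40 70)} because 60 is not part of any litemset.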

Finding Sequential Patterns 4. Sequence Phase Overview

1. Seed set of large sequences
2. Create candidate sequences
3. Scan data to find support of candidate sequences
4. Determine large sequences

Finding Sequential Patterns 4. Sequence Phase

• Use the set of litemsets to find the desired sequences.

• Two families of algorithms are presented:
o Count-all
o Count-some

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns 4. Sequence Phase AprioriAll

L_1 = {large 1-sequences};    // Result of litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1};
    foreach customer-sequence c in the database do
        Increment the count of all candidates in C_k that are contained in c.
    L_k = Candidates in C_k with minimum support.
end
Answer = Maximal sequences in ∪_k L_k;

Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.
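Putting the pieces together, the AprioriAll loop can be sketched end to end in Python. This is a naive illustration under stated assumptions, not the authors' implementation: the database is already transformed into lists of sets of litemset identifiers, L_1 is recomputed by counting identifiers rather than taken from the litemset phase, support is an absolute customer count, and every function name and the five-customer database are hypothetical.

```python
from itertools import combinations


def contains(customer, seq):
    """True if `seq` is contained in the transformed customer sequence."""
    pos = 0
    for item in seq:
        while pos < len(customer) and item not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def is_subsequence(short, long_seq):
    remaining = iter(long_seq)
    return all(any(x == y for y in remaining) for x in short)


def apriori_generate(large_prev):
    """Join L_{k-1} with itself, then prune candidates with a non-large subsequence."""
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(sub in large_prev for sub in combinations(c, k - 1))}


def support(seq, database):
    return sum(contains(c, seq) for c in database)


def apriori_all(database, min_count):
    items = {i for c in database for element in c for i in element}
    large = {(i,) for i in items if support((i,), database) >= min_count}
    all_large = set(large)
    while large:  # loop until L_{k-1} is empty
        candidates = apriori_generate(large)
        large = {s for s in candidates if support(s, database) >= min_count}
        all_large |= large
    # Maximal phase: keep sequences not contained in any other large sequence.
    return {s for s in all_large
            if not any(s != t and is_subsequence(s, t) for t in all_large)}


# Hypothetical transformed database (an assumption); minimum support of 2 customers.
database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
print(sorted(apriori_all(database, min_count=2)))
```

Unlike the count-some algorithms, every candidate length is counted here, and the non-maximal sequences are only discarded at the very end.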

Finding Sequential Patterns 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:

insert into C_k
select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
from L_{k-1} p, L_{k-1} q
where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.




bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome

bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup

bull AprioriAll and AprioriSome have excellent scale-up properties 62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions63

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

64

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules

65

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

66

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality (The minimum support is checked for each sequence on each run but maximal sequences must be checked for later)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support (Non-maximal sequences are pruned out during runtime but the minimum support is not tested at all values of k) 67

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

68

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

The transformation allows each record to be looked up in constant time reducing the run time

69

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)

Finding Sequential Patterns: 4. Sequence Phase Overview

1. Start with a seed set of large sequences
2. Create candidate sequences
3. Scan the data to find the support of the candidate sequences
4. Determine the large sequences, which seed the next pass

Finding Sequential Patterns: 4. Sequence Phase

• Use the set of litemsets to find the desired sequences
• Two families of algorithms are presented:
  o Count-all
  o Count-some
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase

Finding Sequential Patterns: 4. Sequence Phase (AprioriAll)

    L_1 = {large 1-sequences};  // result of litemset phase
    for (k = 2; L_{k-1} ≠ ∅; k++) do
    begin
        C_k = new candidates generated from L_{k-1};
        foreach customer-sequence c in the database do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
    end
    Answer = maximal sequences in ∪_k L_k;

Notation: L_k = set of all large k-sequences; C_k = set of candidate k-sequences
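
The counting pass inside this loop can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation. It assumes the customer sequences have already been transformed into lists of sets of integer litemset ids. The names `contains` and `large_sequences` and the five-customer database `DB` are hypothetical, chosen so that the supports agree with the examples on these slides. Every candidate is checked against every customer sequence, with no indexing.

```python
def contains(customer_seq, candidate):
    """Return True if candidate (a tuple of litemset ids) is contained in
    customer_seq (a list of sets of litemset ids, one set per transaction)."""
    pos = 0  # index of the next candidate element still to be matched
    for litemsets in customer_seq:
        if pos < len(candidate) and candidate[pos] in litemsets:
            pos += 1  # each transaction can match at most one element
    return pos == len(candidate)


def large_sequences(db, candidates, min_count):
    """Count the support of every candidate and keep those meeting min_count."""
    counts = {cand: sum(contains(c, cand) for c in db) for cand in candidates}
    return {cand: n for cand, n in counts.items() if n >= min_count}


# Hypothetical transformed database (an assumption, not taken from the slides).
DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

C4 = [(1, 2, 3, 4), (1, 3, 4, 5)]
L4 = large_sequences(DB, C4, min_count=2)  # 2 customers = 40% of 5
print(L4)
```

The greedy left-to-right scan is enough here because matching each candidate element at the earliest possible transaction never rules out a later match.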

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as its argument L_{k-1}, the set of all large (k-1)-sequences. It works as follows. First, join L_{k-1} with L_{k-1}:

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ...,
          p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
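
The join and prune steps can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' code. Sequences are tuples of integer litemset ids, the helper names `join`, `prune`, and `apriori_generate` are hypothetical, and the set `L3` is an assumed input chosen so that the pruning of <1 2 4 3> described above can be reproduced (the figure with the actual L_3 did not survive in this transcript).

```python
def join(large_prev):
    """Join L_{k-1} with itself: p and q must agree on their first k-2
    elements; the candidate is p followed by the last element of q."""
    large_prev = set(large_prev)
    return {p + (q[-1],) for p in large_prev for q in large_prev
            if p[:-1] == q[:-1]}


def prune(candidates, large_prev):
    """Drop candidates that have some (k-1)-subsequence not in L_{k-1}."""
    large_prev = set(large_prev)
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))}


def apriori_generate(large_prev):
    return prune(join(large_prev), large_prev)


# Assumed L_3 for illustration only.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_generate(L3)
print(C4)
```

With this input the join produces <1 2 4 3> (from <1 2 4> and <1 2 3>), and the prune step removes it because <2 4 3> is not in L3.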

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            delete from S all subsequences of s_k

• The authors claim that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers
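
A direct, unoptimized Python rendering of this loop might look like the following. It is a sketch only. It assumes sequences are tuples of integer litemset ids compared by equality, so the subset relation between litemsets is not modelled. It uses a plain scan instead of the hash-tree mentioned above, and the names `is_subsequence` and `maximal_sequences` and the sample set `S` are hypothetical.

```python
def is_subsequence(short, long):
    """True if short can be obtained from long by deleting elements."""
    it = iter(long)
    return all(x in it for x in short)  # `in` advances the iterator


def maximal_sequences(large):
    """Keep only the sequences not contained in any other sequence in the set."""
    s = set(large)
    n = max((len(seq) for seq in s), default=0)
    for k in range(n, 1, -1):
        # Snapshot the k-sequences, since s shrinks while we iterate.
        for s_k in [seq for seq in s if len(seq) == k]:
            s -= {t for t in s if t != s_k and is_subsequence(t, s_k)}
    return s


# Hypothetical set of large sequences.
S = {(1,), (3,), (4,), (5,), (1, 3), (1, 5), (3, 5), (4, 5), (1, 3, 5)}
print(maximal_sequences(S))
```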

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip

Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences)

    function next(k: integer)  // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elsif (hit_k < 0.75) return k + 2;
        elsif (hit_k < 0.80) return k + 3;
        elsif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass
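
As a small worked illustration, the heuristic can be written in Python as below. The function names `hit_ratio` and `next_length` are hypothetical, and the candidate counts in the example are made-up numbers used only to exercise the thresholds from the slide.

```python
def hit_ratio(num_large, num_candidates):
    """hit_k = |L_k| / |C_k|."""
    return num_large / num_candidates


def next_length(k, hit_k):
    """Length of sequences to count in the next pass, given the hit ratio
    of the pass that just counted k-sequences."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


# Made-up example: 9 of 25 candidate 2-sequences turned out to be large.
print(next_length(2, hit_ratio(9, 25)))
```

A low hit ratio means most candidates were not large, so the next pass advances by a single length. A high hit ratio makes it more likely that longer candidates are large too, so more lengths are skipped.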

Finding Sequential Patterns: 4. AprioriSome Forward Phase

    L_1 = {large 1-sequences};  // result of litemset phase
    C_1 = L_1;
    last = 1;  // we last counted C_last
    for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
    begin
        if (L_{k-1} known) then
            C_k = new candidates generated from L_{k-1};
        else
            C_k = new candidates generated from C_{k-1};
        if (k == next(last)) then begin  // next k to count
            foreach customer-sequence c in the database do
                increment the count of all candidates in C_k that are contained in c;
            L_k = candidates in C_k with minimum support;
            last = k;
        end
    end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

    for (k--; k >= 1; k--) do
        if (L_k not found in forward phase) then begin
            delete all sequences in C_k contained in some L_i, i > k;
            foreach customer-sequence c in D_T do
                increment the count of all candidates in C_k that are contained in c;
            L_k = candidates in C_k with minimum support;
        end
        else  // L_k already known
            delete all sequences in L_k contained in some L_i, i > k;
    Answer = ∪_k L_k;  // maximal phase not needed

Notation: D_T = transformed database
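
The backward phase can be sketched in Python as follows. This is a sketch under assumptions, not the authors' implementation. Sequences are tuples of integer litemset ids, and the database `DB`, the sets `L1`, `L2`, `L4`, and the candidate set `C3` are hypothetical inputs standing in for a forward phase that counted lengths 1, 2, and 4 and skipped length 3. The function name `backward_phase` is also an assumption.

```python
def contains(customer_seq, candidate):
    """True if candidate (tuple of ids) is contained in customer_seq (list of sets)."""
    pos = 0
    for litemsets in customer_seq:
        if pos < len(candidate) and candidate[pos] in litemsets:
            pos += 1
    return pos == len(candidate)


def is_subsequence(short, long):
    it = iter(long)
    return all(x in it for x in short)


def backward_phase(db, candidates, large, min_count):
    """candidates[k] is C_k for skipped lengths; large[k] is L_k for counted ones."""
    answer = set()  # holds only sequences longer than the current k
    for k in range(max(set(candidates) | set(large)), 0, -1):
        if k in large:  # L_k already known: just drop non-maximal sequences
            l_k = {s for s in large[k]
                   if not any(is_subsequence(s, t) for t in answer)}
        else:  # skipped length: prune C_k first, then count what is left
            c_k = {c for c in candidates.get(k, ())
                   if not any(is_subsequence(c, t) for t in answer)}
            l_k = {c for c in c_k
                   if sum(contains(cust, c) for cust in db) >= min_count}
        answer |= l_k
    return answer


# Hypothetical inputs (assumptions for illustration).
DB = [[{1, 5}, {2}, {3}, {4}], [{1}, {3}, {4}, {3, 5}], [{1}, {2}, {3}, {4}],
      [{1}, {3}, {5}], [{4}, {5}]]
L1 = {(1,), (2,), (3,), (4,), (5,)}
L2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
C3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (1, 4, 5), (2, 3, 4), (3, 4, 5)}
L4 = {(1, 2, 3, 4)}

answer = backward_phase(DB, {3: C3}, {1: L1, 2: L2, 4: L4}, min_count=2)
print(sorted(answer))
```

Pruning C_3 against <1 2 3 4> before counting leaves only three candidates to count, which is the saving the count-some approach is after.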

Finding Sequential Patterns: 4. AprioriSome Example

Minimum support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase
• AprioriSome generates C_k from L_{k-1} or C_{k-1}
• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database

Finding Sequential Patterns: 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.
• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on the previous passes and the customer sequences in the database
  o For example, when generating sequences of length 9 with a step size of 3: while passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c at hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence

Finding Sequential Patterns: 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths
  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5
• The backward phase is identical to that of AprioriSome

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

    // step is an integer ≥ 1
    L_1 = {large 1-sequences};  // result of litemset phase
    for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
    begin
        C_k = new candidates generated from L_{k-1};
        foreach customer-sequence c in D_T do
            increment the count of all candidates in C_k that are contained in c;
        L_k = candidates in C_k with minimum support;
    end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

    for (k = step; L_k ≠ ∅; k += step) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in D_T do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X do
                increment its count in C_{k+step} (adding it to C_{k+step} if necessary);
        end
        L_{k+step} = candidates in C_{k+step} with minimum support;
    end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate (cont.)

Let otf-generate be called with L_2, and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc.

    Seq      End   Start
    <1 2>    2     1
    <1 3>    3     1
    <1 4>    4     1
    <2 3>    3     2
    <2 4>    4     2
    <3 4>    4     3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
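
The end/start computation and the join can be sketched in Python as below, together with one forward-phase pass that counts the generated candidates. This is an illustrative sketch, not the authors' code. Sequences are tuples of integer ids, a customer sequence is a list of sets, and the function names, the database `DB`, and the nine-sequence set `L2_DB` are hypothetical, chosen so that the counts agree with the DynamicSome example on these slides.

```python
from collections import Counter


def end_pos(x, c):
    """Smallest j (1-based) such that x is contained in <c_1 ... c_j>, else None."""
    i = 0
    for j, itemset in enumerate(c, start=1):
        if x[i] in itemset:
            i += 1
            if i == len(x):
                return j
    return None


def start_pos(x, c):
    """Largest j (1-based) such that x is contained in <c_j ... c_n>, else None."""
    i = len(x) - 1
    for j in range(len(c), 0, -1):  # scan right to left
        if x[i] in c[j - 1]:
            i -= 1
            if i < 0:
                return j
    return None


def otf_generate(large_k, large_j, c):
    """Concatenate x from L_k and y from L_j when x ends before y starts in c."""
    ends = {x: end_pos(x, c) for x in large_k}
    starts = {y: start_pos(y, c) for y in large_j}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


def forward_pass(db, large_k, large_step, min_count):
    """Count on-the-fly candidates over the database; return counts and large ones."""
    counts = Counter()
    for c in db:
        counts.update(otf_generate(large_k, large_step, c))
    return counts, {s for s, n in counts.items() if n >= min_count}


# The example above: c = <(1) (2) (3 7) (4)> with the six 2-sequences in the table.
c = [{1}, {2}, {3, 7}, {4}]
L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
print(otf_generate(L2, L2, c))

# Hypothetical database and L_2 (assumptions) for one forward pass with step = 2.
DB = [[{1, 5}, {2}, {3}, {4}], [{1}, {3}, {4}, {3, 5}], [{1}, {2}, {3}, {4}],
      [{1}, {3}, {5}], [{4}, {5}]]
L2_DB = L2 | {(1, 5), (3, 5), (4, 5)}
counts, L4 = forward_pass(DB, L2_DB, L2_DB, min_count=2)
print(dict(counts), L4)
```

Requiring end < start is what enforces that the two pieces do not overlap in the customer sequence.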

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

    for (k--; k > 1; k--) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = new candidates generated from L_{k-1};
            else
                C_k = new candidates generated from C_{k-1};

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2; use L_2 and L_2 as arguments to otf-generate to get C_4
• This yields 2 candidate sequences:

    C_4          Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L_4          Support
    <1 2 3 4>    2

• Pass L_4 and L_2 as arguments to otf-generate to get C_6
• C_6 is found to be empty (C_6 = ∅)
• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate
• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3

Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up

Conclusion

• The problem of mining sequential patterns from a customer database was introduced
• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum support
• AprioriAll and AprioriSome have excellent scale-up properties

Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here: the litemset phase finds large itemsets within transactions, as in association rule mining, and the sequence phase then looks for sequential patterns made up of those itemsets.

Final Exam Question 2

What is the major difference between the two families of algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every sequence on every pass, but the maximal sequences must be identified afterwards.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned during the run, but minimum support is not tested at all values of k.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation maps each litemset to a single integer, so two litemsets can be compared for equality in constant time. This reduces the time needed to check whether a sequence is contained in a customer sequence.
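
To make the answer concrete, the transformation can be sketched in Python as below. This is a minimal sketch under assumptions. Each transaction is replaced by the set of ids of the litemsets it contains, and transactions containing no litemset are dropped, which is how the transformation is described in the original paper as far as this sketch assumes. The function name `transform`, the id mapping `LITEMSET_IDS`, and the sample customer sequence are hypothetical.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction (a set of items) by the set of ids of the
    litemsets it contains; transactions with no litemset are dropped."""
    transformed = []
    for transaction in customer_seq:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


# Hypothetical litemsets mapped to integer ids.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

customer = [{10, 20}, {30}, {40, 60, 70}]
print(transform(customer, LITEMSET_IDS))
```

After this step, testing whether a candidate element occurs in a transaction is a set-membership check on an integer rather than an itemset comparison.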

Page 28: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns4 Sequence Phase

bull Use the set of Litemsets to find the desired

sequences

bull Two families of algorithms are presented

Count-all

Count-some

27

Finding Sequential Patterns4 Sequence Phase

bull Count-all algorithms count all the large

sequences including non-maximal

sequences which are pruned out in the

maximal phase

28

Finding Sequential Patterns4 Sequence Phase

bull Count-some algorithms try to avoid

counting non-maximal sequences by first

counting longer sequences in a forward

phase then counting the sequences skipped

in a backward phase

29

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns4 AprioriAll Candidate Generation

bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1

insert into Ck

select plitemset1 plitemsetk-1 qlitemsetk-1

from Lk-1 p Lk-1 q

where plitemset1 = qlitemset1

plitemsetk-2 = qlitemsetk-2

bull Next delete all sequences c isin Ck such that some

(k-1)-subsequence of c is not in Lk-131

Finding Sequential Patterns4 AprioriAll Candidate Generation

Example

lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure

32

Finding Sequential Patterns4 AprioriAll Maximal Phase

bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence

for ( k = n k gt 1 k --)

foreach k-sequence sk do

Delete from S all subsequences of sk

bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers

33

Finding Sequential Patterns4 AprioriAll Example

34

Finding Sequential Patterns4 AprioriSome

bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip

Let hitk = |Lk| |Ck|

(ie ratio of large k-sequences to candidate k-sequences)

function next(k integer) k is the length of seq counted last pass

beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2

elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5

end

bull next returns the length of sequences to count in the next pass 35

Finding Sequential Patterns4 AprioriSome Forward Phase

L1 = large 1-sequences Result of Litemset phase

C1 = L1

last = 1 We last counted Clast

for (k = 2 Ck-1 ne and Llast ne k++) do

begin

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

if (k== next(last) ) then begin (next k to count)

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

last = k

end

end36

Finding Sequential Patterns4 AprioriSome Backward Phase

for (k-- kgt=1 k--) do

if (Lk not found in forward phase) then begin

Delete all sequences in Ck contained in some L i igtk

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

else Lk already known

Delete all sequences in Lk contained in some Li igtk

Answer = Uk Lk (Maximal Phase not Needed)

Notation DT Transformed database 37

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns4 DynamicSome Forward Phase

for ( k = step Lk1048576 ne k+= step ) do

begin

find Lk+step from Lk and Lstep

Ck+step =

foreach customer-sequence c in DT do

begin

X = otf-generate(Lk Lstep c)

foreach sequence x isin Xrsquo increment its count in

Ck+step (adding it to Ck+step if necessary)

end

Lk+step = Candidates in Ck+step with minimum support

end45

Finding Sequential Patterns4 DynamicSome OTF-Generate

c is the customer sequence lt c1c2cn gt

Xk = subseq(Lk c)

forall sequences x isin Xk do

xend = min j | x sube lt c1c2cj gt

Xj = subseq(Lj c)

forall sequences x isin Xj do

xstart = max j | x sube lt cjcj+1cn gt

Answer = join of Xkwith Xj with the join condition

Xkend lt Xjstart46

Finding Sequential Patterns4 DynamicSome OTF-Generate cont

Let otf-generate be called with

L2 and let c = lt1 2 3 7

4gt Thus c1 = 1 c2= 2

etc

Thus the result of the join with the join condition

X2end lt X2start

(where X2 denotes the seq of

length 2) is the single sequence lt1 2 3 4gt

Seq

lt1 2gt

End

2

Start

1

lt1 3gt 3 1

lt1 4gt 4 1

lt2 3gt 3 2

lt2 4gt 4 2

lt3 4gt 4 3

47

Finding Sequential Patterns4 DynamicSome intermediate Phase

for ( k-- k gt 1 k-- ) do

if (Lk not yet determined) then

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

48

Finding Sequential Patterns4 DynamicSome Example

bull Let step = 2

use L2 and L2 as argument in otf-generate to get C4

49

Finding Sequential Patterns4 DynamicSome Example

bull Get 2 candidate sequences

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

50

Finding Sequential Patterns4 DynamicSome Example

bull Only lt1 2 3 4gt is large

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

L4 Minisup

lt1 2 3 4gt 2

51

Finding Sequential Patterns4 DynamicSome Example

bull pass as arg to otf-gen L2 and L4 to get C6

L4 sup

lt1 2 3 4gt 2

52

Finding Sequential Patterns4 DynamicSome Example

bull C6 is found to be empty

L4 sup

lt1 2 3 4gt 2C6 =

53

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

bull The problem of mining sequential patterns from a customer DB was introduced

bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome

bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup

bull AprioriAll and AprioriSome have excellent scale-up properties 62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions63

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

64

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules

65

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

66

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality (The minimum support is checked for each sequence on each run but maximal sequences must be checked for later)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support (Non-maximal sequences are pruned out during runtime but the minimum support is not tested at all values of k) 67

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

68

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

The transformation allows each record to be looked up in constant time reducing the run time

69

Finding Sequential Patterns: 4. Sequence Phase

• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase: AprioriAll

    L_1 = {large 1-sequences};    // Result of litemset phase
    for (k = 2; L_{k-1} ≠ ∅; k++) do
    begin
        C_k = New candidates generated from L_{k-1};
        foreach customer-sequence c in the database do
            Increment the count of all candidates in C_k that are contained in c;
        L_k = Candidates in C_k with minimum support;
    end
    Answer = Maximal sequences in ∪_k L_k;

Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.
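The counting step inside the loop above can be sketched in Python. This is a minimal illustration under assumptions, not the authors' implementation: the transformed database is taken to be a list of customer sequences, each a list of sets of integer litemset ids, and the names `contains` and `count_pass` as well as the five-customer database are hypothetical.

```python
from collections import Counter


def contains(customer_seq, candidate):
    """True if the ids in `candidate` occur in order, each in a strictly later element."""
    elements = iter(customer_seq)
    # Each `any` consumes elements up to and including its match, so the
    # next id of the candidate has to be found in a later element.
    return all(any(x in element for element in elements) for x in candidate)


def count_pass(db, candidates, min_count):
    """One pass over the database: count C_k and return L_k."""
    counts = Counter()
    for c in db:
        for cand in candidates:
            if contains(c, cand):
                counts[cand] += 1  # a customer supports a candidate at most once
    return {cand for cand in candidates if counts[cand] >= min_count}


# Illustrative transformed database (hypothetical data).
db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
C2 = {(1, 2), (1, 5), (2, 5), (4, 5), (5, 4)}
print(sorted(count_pass(db, C2, min_count=2)))  # [(1, 2), (1, 5), (4, 5)]
```

Note that <1 5> is not supported by the first customer: ids 1 and 5 appear there in the same element, and containment requires them in different, ordered elements.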

Finding Sequential Patterns: 4. AprioriAll Candidate Generation

• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:

    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};

• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Example

<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
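The join and prune steps above can be sketched in Python as follows. This is only an illustration: sequences are modelled as tuples of litemset ids, the function names `join` and `prune` are hypothetical, and the set L3 is made-up data chosen so that <1 2 4 3> is pruned for the reason given above.

```python
def join(large_prev):
    """Join step: pair sequences of L_{k-1} that agree on their first k-2 elements."""
    # The join condition as written places no p != q restriction, so
    # candidates such as <1 2 3 3> are produced here and left to the prune step.
    return {p + (q[-1],) for p in large_prev for q in large_prev if p[:-1] == q[:-1]}


def prune(candidates, large_prev):
    """Prune step: drop candidates with some (k-1)-subsequence missing from L_{k-1}."""
    def subsequences(c):
        # Deleting one element at a time yields every (k-1)-subsequence of c.
        return (c[:i] + c[i + 1:] for i in range(len(c)))

    return {c for c in candidates if all(s in large_prev for s in subsequences(c))}


# Hypothetical L_3 for illustration.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
joined = join(L3)
C4 = prune(joined, L3)
print((1, 2, 4, 3) in joined, (1, 2, 4, 3) in C4)  # True False
print(C4)  # {(1, 2, 3, 4)}
```

The pruning relies on the fact that every subsequence of a large sequence must itself be large, so a candidate with a non-large (k-1)-subsequence cannot have minimum support.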

Finding Sequential Patterns: 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            Delete from S all subsequences of s_k

• The authors state that data structures and an algorithm (hash-trees) exist to do this efficiently, citing two of their earlier papers.
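A direct, quadratic version of this maximal phase can be sketched in Python. It does not use the hash-tree structures the authors refer to; it only illustrates the loop. Sequences are modelled as tuples of frozensets of items, and the helper names (`is_contained`, `maximal_sequences`, `seq`) and the input set S are assumptions made for the example.

```python
def is_contained(a, b):
    """True if each itemset of `a` is a subset of a distinct itemset of `b`, in order."""
    rest = iter(b)
    return all(any(x <= y for y in rest) for x in a)


def maximal_sequences(large):
    """Delete from S every sequence contained in some other sequence of S."""
    S = set(large)
    n = max((len(s) for s in S), default=0)
    for k in range(n, 1, -1):
        for s_k in [s for s in S if len(s) == k]:
            if s_k in S:  # s_k may already have been deleted by another k-sequence
                S -= {t for t in S if t != s_k and is_contained(t, s_k)}
    return S


def seq(*itemsets):
    """Build a hashable sequence from plain sets of items."""
    return tuple(frozenset(i) for i in itemsets)


# Hypothetical set of large sequences.
S = {seq({30}), seq({40}), seq({70}), seq({90}), seq({40, 70}),
     seq({30}, {40}), seq({30}, {70}), seq({30}, {90}), seq({30}, {40, 70})}
result = maximal_sequences(S)
print(len(result))  # 2
```

Here the two survivors are <(30) (90)> and <(30) (40 70)>; every other sequence in S is contained in one of them.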

Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequences to skip.

Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)    // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elsif (hit_k < 0.75) return k + 2;
        elsif (hit_k < 0.80) return k + 3;
        elsif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
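As a small sketch, the same heuristic in Python. The function name `next_length` and its explicit size parameters are assumptions made so the example is self-contained; the thresholds are the ones listed above.

```python
def next_length(k, num_large, num_candidates):
    """Length of the sequences to count in the next pass, given |L_k| and |C_k|."""
    hit = num_large / num_candidates  # assumes C_k is non-empty
    if hit < 0.666:
        return k + 1
    elif hit < 0.75:
        return k + 2
    elif hit < 0.80:
        return k + 3
    elif hit < 0.85:
        return k + 4
    return k + 5


print(next_length(2, 9, 25))  # 3  (hit = 0.36: few candidates were large, skip nothing)
print(next_length(2, 9, 10))  # 7  (hit = 0.90: most candidates were large, skip ahead)
```

The intuition is that a high hit ratio suggests longer large sequences exist, so counting the intermediate lengths would mostly count non-maximal sequences.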

Finding Sequential Patterns: 4. AprioriSome Forward Phase

    L_1 = {large 1-sequences};    // Result of litemset phase
    C_1 = L_1;
    last = 1;    // We last counted C_last
    for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
    begin
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1};
        else
            C_k = New candidates generated from C_{k-1};
        if (k == next(last)) then begin    // next k to count
            foreach customer-sequence c in the database do
                Increment the count of all candidates in C_k that are contained in c;
            L_k = Candidates in C_k with minimum support;
            last = k;
        end
    end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

    for (k--; k >= 1; k--) do
        if (L_k not found in forward phase) then begin
            Delete all sequences in C_k contained in some L_i, i > k;
            foreach customer-sequence c in D_T do
                Increment the count of all candidates in C_k that are contained in c;
            L_k = Candidates in C_k with minimum support;
        end
        else    // L_k already known
            Delete all sequences in L_k contained in some L_i, i > k;
    Answer = ∪_k L_k;    // Maximal phase not needed

Notation: D_T is the transformed database.

Finding Sequential Patterns: 4. AprioriSome Example

Minimum support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>
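The forward and backward phases can be put together in a short Python sketch. It is an illustration under several assumptions rather than the authors' implementation: the database below is hypothetical data picked to be consistent with the answer above, litemset ids are treated as atomic symbols, all function names are made up, and the skipping rule passed in is the simple stand-in next(k) = 2k instead of the hit-ratio heuristic.

```python
def contains(customer_seq, candidate):
    """True if the ids in `candidate` occur in order, each in a strictly later element."""
    elements = iter(customer_seq)
    return all(any(x in e for e in elements) for x in candidate)


def apriori_generate(prev):
    """Join `prev` with itself, then prune candidates with a missing (k-1)-subsequence."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def is_subsequence(a, b):
    rest = iter(b)
    return all(x in rest for x in a)


def apriori_some(db, min_count, next_length):
    def large(cands):
        return {c for c in cands if sum(contains(s, c) for s in db) >= min_count}

    # Forward phase: generate every C_k, but count only the lengths chosen by next_length.
    L = {1: large({(x,) for s in db for e in s for x in e})}
    C = {1: set(L[1])}
    last, k = 1, 2
    while C[k - 1] and L[last]:
        C[k] = apriori_generate(L[k - 1] if k - 1 in L else C[k - 1])
        if k == next_length(last):
            L[k] = large(C[k])
            last = k
        k += 1

    # Backward phase: count skipped lengths after dropping non-maximal candidates.
    for k in range(k - 1, 0, -1):
        longer = [t for i, seqs in L.items() if i > k for t in seqs]

        def not_covered(s):
            return not any(is_subsequence(s, t) for t in longer)

        if k not in L:
            L[k] = large({c for c in C[k] if not_covered(c)})
        else:
            L[k] = {s for s in L[k] if not_covered(s)}
    return set().union(*L.values())


db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
answer = apriori_some(db, min_count=2, next_length=lambda k: 2 * k)
print(sorted(answer))  # [(1, 2, 3, 4), (1, 3, 5), (4, 5)]
```

With this stand-in rule, lengths 1, 2 and 4 are counted in the forward phase. In the backward phase, four of the seven candidates in C_3 are deleted without counting because they are contained in <1 2 3 4>.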

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, DynamicSome skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.

• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

    // step is an integer ≥ 1
    L_1 = {large 1-sequences};    // Result of litemset phase
    for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
    begin
        C_k = New candidates generated from L_{k-1};
        foreach customer-sequence c in D_T do
            Increment the count of all candidates in C_k that are contained in c;
        L_k = Candidates in C_k with minimum support;
    end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

    for (k = step; L_k ≠ ∅; k += step) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in D_T do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary);
        end
        L_{k+step} = Candidates in C_{k+step} with minimum support;
    end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate (cont.)

Let otf-generate be called with L_2, and let c = <{1} {2} {3 7} {4}>. Thus c_1 = {1}, c_2 = {2}, etc.

    Seq      End  Start
    <1 2>    2    1
    <1 3>    3    1
    <1 4>    4    1
    <2 3>    3    2
    <2 4>    4    2
    <3 4>    4    3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
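The table and the join can be reproduced with a short Python sketch. This is an illustration only: the function names `earliest_end`, `latest_start` and `otf_generate` are assumptions, sequences are tuples of litemset ids, and the customer sequence is a list of sets with 1-based positions as in the table.

```python
def earliest_end(x, c):
    """Smallest j (1-based) such that x is contained in <c_1 ... c_j>, or None."""
    pos = 0
    for item in x:
        while pos < len(c) and item not in c[pos]:
            pos += 1
        if pos == len(c):
            return None  # x is not contained in c
        pos += 1  # move past the matched element
    return pos


def latest_start(x, c):
    """Largest j (1-based) such that x is contained in <c_j ... c_n>, or None."""
    pos = len(c) - 1
    for item in reversed(x):
        while pos >= 0 and item not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2


def otf_generate(l_k, l_j, c):
    """Concatenate x from L_k and y from L_j whenever x ends before y starts in c."""
    ends = {x: earliest_end(x, c) for x in l_k}
    starts = {y: latest_start(y, c) for y in l_j}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


c = [{1}, {2}, {3, 7}, {4}]
L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
print(otf_generate(L2, L2, c))  # {(1, 2, 3, 4)}
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, which is why a single candidate comes out.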

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

    for (k--; k > 1; k--) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = New candidates generated from L_{k-1};
            else
                C_k = New candidates generated from C_{k-1};

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

• This yields 2 candidate sequences:

    C_4          Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L_4          Support
    <1 2 3 4>    2

• Pass L_2 and L_4 as arguments to otf-generate to get C_6.

• C_6 is found to be empty.

• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.

• C_5 is found to be empty, so only C_3 is counted during the backward phase to get L_3.

Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up

Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum supports.

• AprioriAll and AprioriSome have excellent scale-up properties.

Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here, because the elements of a sequential pattern are large itemsets (litemsets), the same frequent itemsets from which association rules are built.

Final Exam Question 2

What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for each sequence on each pass, but maximal sequences must be identified later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but minimum support is not tested at every value of k.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation allows each record to be looked up in constant time, reducing the run time.
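One way to picture the transformation, as a sketch under assumptions that go beyond this slide: each litemset is mapped to an integer id, and each transaction is replaced by the set of ids of the litemsets it contains, so that later passes test whether an id is in a set instead of comparing itemsets. The mapping, the function name `transform`, and the data below are hypothetical.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains."""
    transformed = []
    for transaction in customer_seq:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:  # transactions containing no litemset are dropped
            transformed.append(ids)
    return transformed


# Hypothetical litemset-to-id mapping.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
customer_seq = [{10, 20}, {30}, {40, 60, 70}]
print(transform(customer_seq, litemset_ids))
```

In this example the first transaction contains no litemset and disappears, the second becomes {1}, and the third becomes the ids 2, 3 and 4. Afterwards, checking whether a candidate element occurs in a transaction is a set-membership test on an integer, which is constant time on average with a hash set.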

Page 31: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns4 Sequence Phase AprioriAll

L1 = large 1-sequences result of Litemset phase

for (k = 2 Lk-1 ne k++) do

begin

Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

Answer = Maximal Sequences in cupk Lk

Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences

30

Finding Sequential Patterns4 AprioriAll Candidate Generation

bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1

insert into Ck

select plitemset1 plitemsetk-1 qlitemsetk-1

from Lk-1 p Lk-1 q

where plitemset1 = qlitemset1

plitemsetk-2 = qlitemsetk-2

bull Next delete all sequences c isin Ck such that some

(k-1)-subsequence of c is not in Lk-131

Finding Sequential Patterns4 AprioriAll Candidate Generation

Example

lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure

32

Finding Sequential Patterns4 AprioriAll Maximal Phase

bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence

for ( k = n k gt 1 k --)

foreach k-sequence sk do

Delete from S all subsequences of sk

bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers

33

Finding Sequential Patterns: 4. AprioriAll: Example

Finding Sequential Patterns: 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequence lengths to skip.

Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)    // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elsif (hit_k < 0.75) return k + 2;
        elsif (hit_k < 0.80) return k + 3;
        elsif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
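The thresholds of the next function translate directly into Python. In this sketch the name `next_length` is hypothetical, and the hit ratio is passed in as an argument instead of being looked up from L_k and C_k.

```python
def next_length(k, hit):
    """Length of the sequences to count in the next pass.

    k   -- length of the sequences counted in the last pass
    hit -- |L_k| / |C_k|, the fraction of counted candidates that were large
    """
    if hit < 0.666:
        return k + 1
    elif hit < 0.75:
        return k + 2
    elif hit < 0.80:
        return k + 3
    elif hit < 0.85:
        return k + 4
    else:
        return k + 5

# The higher the hit ratio, the more lengths are skipped.
for hit in (0.5, 0.7, 0.78, 0.82, 0.9):
    print(hit, next_length(2, hit))
```

The intuition is that a high hit ratio means most candidates turned out to be large, so longer sequences are likely to be large as well and counting the intermediate, probably non-maximal lengths would be wasted work.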

Finding Sequential Patterns: 4. AprioriSome: Forward Phase

    L_1 = {large 1-sequences};    // Result of litemset phase
    C_1 = L_1;
    last = 1;    // We last counted C_last
    for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
    begin
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1}
        else
            C_k = New candidates generated from C_{k-1}
        if (k == next(last)) then begin    // next k to count
            foreach customer-sequence c in the database do
                Increment the count of all candidates in C_k that are contained in c
            L_k = Candidates in C_k with minimum support
            last = k
        end
    end

Finding Sequential Patterns: 4. AprioriSome: Backward Phase

    for (k--; k >= 1; k--) do
        if (L_k not found in forward phase) then begin
            Delete all sequences in C_k contained in some L_i, i > k
            foreach customer-sequence c in DT do
                Increment the count of all candidates in C_k that are contained in c
            L_k = Candidates in C_k with minimum support
        end
        else    // L_k already known
            Delete all sequences in L_k contained in some L_i, i > k
    Answer = ∪_k L_k    // Maximal phase not needed

Notation: DT is the transformed database.
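The forward and backward phases of AprioriSome can be combined into one Python sketch. It is illustrative only and rests on several assumptions: customer sequences are lists of sets of litemset ids, the helper names are hypothetical, candidate generation joins only distinct pairs, the large 1-sequences are recomputed as a stand-in for the litemset phase, and the rule for choosing the next length is passed in as `next_fn(k, hit)`. The sample database and the rule next(k) = 2k used at the end are illustrative choices.

```python
from itertools import combinations

def contains(customer_seq, pattern):
    """True if the ids in pattern occur in order, each in a later transaction."""
    pos = 0
    for item in pattern:
        while pos < len(customer_seq) and item not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True

def is_subsequence(small, big):
    """True if tuple small can be obtained from tuple big by dropping elements."""
    it = iter(big)
    return all(x in it for x in small)

def generate(prev):
    """Join on the first k-2 ids, then prune (assumes p != q in the join)."""
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(s in prev for s in combinations(c, len(c) - 1))}

def apriori_some(database, min_count, next_fn):
    def large_in(cands):
        return {c for c in cands
                if sum(contains(seq, c) for seq in database) >= min_count}

    ids = {i for seq in database for t in seq for i in t}
    L = {1: large_in({(i,) for i in ids})}  # stand-in for the litemset phase
    C = {1: set(L[1])}
    last, k = 1, 2
    # Forward phase: generate every C_k, count only the selected lengths.
    while C[k - 1] and L[last]:
        C[k] = generate(L[k - 1] if (k - 1) in L else C[k - 1])
        if k == next_fn(last, len(L[last]) / len(C[last])):
            L[k] = large_in(C[k])
            last = k
        k += 1
    # Backward phase: drop non-maximal sequences, count the skipped lengths.
    for j in range(k - 1, 0, -1):
        longer = [s for i in L if i > j for s in L[i]]
        pool = L[j] if j in L else C[j]
        kept = {c for c in pool
                if not any(is_subsequence(c, s) for s in longer)}
        L[j] = kept if j in L else large_in(kept)
    return set().union(*L.values())

sample_db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
print(apriori_some(sample_db, 2, lambda k, hit: 2 * k))
```

With a minimum support of 2 customers and next(k) = 2k, lengths 2 and 4 are counted in the forward phase and length 3 only in the backward phase, after the candidates contained in <1 2 3 4> have been deleted. The result is the three maximal sequences <1 2 3 4>, <1 3 5> and <4 5>, with no separate maximal phase.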

Finding Sequential Patterns: 4. AprioriSome: Example

Minimum support = 40% (2 customer sequences)

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns: 4. DynamicSome

• Like AprioriSome, DynamicSome skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.

• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

    o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

    o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome: Initialization Phase

    // step is an integer ≥ 1
    L_1 = {large 1-sequences};    // Result of litemset phase
    for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
    begin
        C_k = New candidates generated from L_{k-1}
        foreach customer-sequence c in DT do
            Increment the count of all candidates in C_k that are contained in c
        L_k = Candidates in C_k with minimum support
    end

Finding Sequential Patterns: 4. DynamicSome: Forward Phase

    for (k = step; L_k ≠ ∅; k += step) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in DT do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X do
                Increment its count in C_{k+step} (adding it to C_{k+step} if necessary)
        end
        L_{k+step} = Candidates in C_{k+step} with minimum support
    end

Finding Sequential Patterns: 4. DynamicSome: OTF-Generate

    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x is contained in <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x is contained in <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome: OTF-Generate - Cont

Let otf-generate be called with L_2, and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc. The end and start values of the sequences in L_2 that are contained in c are:

    Seq      End  Start
    <1 2>    2    1
    <1 3>    3    1
    <1 4>    4    1
    <2 3>    3    2
    <2 4>    4    2
    <3 4>    4    3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
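The end and start values and the resulting join can be reproduced with a small Python sketch of otf-generate. The names `end_index`, `start_index` and `otf_generate` are hypothetical, sequences are tuples of litemset ids, and the customer sequence is a list of sets of litemset ids, which is an assumed representation of the transformed database.

```python
def end_index(x, c):
    """Smallest j (1-based) such that x is contained in <c_1 ... c_j>, else None."""
    pos = 0
    for item in x:
        while pos < len(c) and item not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos  # 1-based position of the transaction matching the last id

def start_index(x, c):
    """Largest j (1-based) such that x is contained in <c_j ... c_n>, else None."""
    pos = len(c) - 1
    for item in reversed(x):
        while pos >= 0 and item not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # 1-based position of the transaction matching the first id

def otf_generate(large_k, large_j, c):
    """Candidates x.y with x in L_k, y in L_j, where x ends before y starts in c."""
    ends = {x: end_index(x, c) for x in large_k}
    starts = {y: start_index(y, c) for y in large_j}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}

c = [{1}, {2}, {3, 7}, {4}]
L2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
for x in L2:
    print(x, end_index(x, c), start_index(x, c))
print(otf_generate(L2, L2, c))
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, so the single candidate generated for this customer sequence is <1 2 3 4>. Taking the earliest possible end and the latest possible start gives each pair the best chance of not overlapping.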

Finding Sequential Patterns: 4. DynamicSome: Intermediate Phase

    for (k--; k > 1; k--) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = New candidates generated from L_{k-1}
            else
                C_k = New candidates generated from C_{k-1}

Finding Sequential Patterns: 4. DynamicSome: Example

• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.

• This yields 2 candidate sequences:

    C_4          Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L_4          Support
    <1 2 3 4>    2

• Pass L_2 and L_4 as arguments to otf-generate to get C_6.

• C_6 is found to be empty (C_6 = ∅).

• In the intermediate phase, C_3 is generated from L_2, and C_5 from L_4, using apriori-generate.

• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.


Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up: Customers

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were presented for finding sequential patterns:

    o CountAll: AprioriAll
    o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum supports.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets, found as in association rule mining.

Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every candidate sequence on every pass, but non-maximal sequences must be removed later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but minimum support is not tested at all values of k.)

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation allows each record to be looked up in constant time, which reduces the run time.

Page 33: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns4 AprioriAll Candidate Generation

Example

lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure

32

Finding Sequential Patterns4 AprioriAll Maximal Phase

bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence

for ( k = n k gt 1 k --)

foreach k-sequence sk do

Delete from S all subsequences of sk

bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers

33

Finding Sequential Patterns4 AprioriAll Example

34

Finding Sequential Patterns4 AprioriSome

bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip

Let hitk = |Lk| |Ck|

(ie ratio of large k-sequences to candidate k-sequences)

function next(k integer) k is the length of seq counted last pass

beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2

elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5

end

bull next returns the length of sequences to count in the next pass 35

Finding Sequential Patterns4 AprioriSome Forward Phase

L1 = large 1-sequences Result of Litemset phase

C1 = L1

last = 1 We last counted Clast

for (k = 2 Ck-1 ne and Llast ne k++) do

begin

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

if (k== next(last) ) then begin (next k to count)

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

last = k

end

end36

Finding Sequential Patterns4 AprioriSome Backward Phase

for (k-- kgt=1 k--) do

if (Lk not found in forward phase) then begin

Delete all sequences in Ck contained in some L i igtk

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

else Lk already known

Delete all sequences in Lk contained in some Li igtk

Answer = Uk Lk (Maximal Phase not Needed)

Notation DT Transformed database 37

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns4 DynamicSome Forward Phase

for ( k = step Lk1048576 ne k+= step ) do

begin

find Lk+step from Lk and Lstep

Ck+step =

foreach customer-sequence c in DT do

begin

X = otf-generate(Lk Lstep c)

foreach sequence x isin Xrsquo increment its count in

Ck+step (adding it to Ck+step if necessary)

end

Lk+step = Candidates in Ck+step with minimum support

end45

Finding Sequential Patterns4 DynamicSome OTF-Generate

c is the customer sequence lt c1c2cn gt

Xk = subseq(Lk c)

forall sequences x isin Xk do

xend = min j | x sube lt c1c2cj gt

Xj = subseq(Lj c)

forall sequences x isin Xj do

xstart = max j | x sube lt cjcj+1cn gt

Answer = join of Xkwith Xj with the join condition

Xkend lt Xjstart46

Finding Sequential Patterns4 DynamicSome OTF-Generate cont

Let otf-generate be called with

L2 and let c = lt1 2 3 7

4gt Thus c1 = 1 c2= 2

etc

Thus the result of the join with the join condition

X2end lt X2start

(where X2 denotes the seq of

length 2) is the single sequence lt1 2 3 4gt

Seq

lt1 2gt

End

2

Start

1

lt1 3gt 3 1

lt1 4gt 4 1

lt2 3gt 3 2

lt2 4gt 4 2

lt3 4gt 4 3

47

Finding Sequential Patterns4 DynamicSome intermediate Phase

for ( k-- k gt 1 k-- ) do

if (Lk not yet determined) then

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

48

Finding Sequential Patterns4 DynamicSome Example

bull Let step = 2

use L2 and L2 as argument in otf-generate to get C4

49

Finding Sequential Patterns4 DynamicSome Example

bull Get 2 candidate sequences

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

50

Finding Sequential Patterns4 DynamicSome Example

bull Only lt1 2 3 4gt is large

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

L4 Minisup

lt1 2 3 4gt 2

51

Finding Sequential Patterns4 DynamicSome Example

bull pass as arg to otf-gen L2 and L4 to get C6

L4 sup

lt1 2 3 4gt 2

52

Finding Sequential Patterns4 DynamicSome Example

bull C6 is found to be empty

L4 sup

lt1 2 3 4gt 2C6 =

53

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

bull The problem of mining sequential patterns from a customer DB was introduced

bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome

bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup

bull AprioriAll and AprioriSome have excellent scale-up properties 62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions63

Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both come together in the Apriori-based algorithms studied here: the litemset phase finds the large itemsets (the intra-transaction patterns that association rules are built from), and the sequence phase then looks for sequential patterns whose elements are those itemsets.

Final Exam Question 2

What is the major difference between the two algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximal sequences must be checked for later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out at run time, but the minimum support is not tested at all values of k.)

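Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of litemsets it contains, with each litemset treated as a single entity. Two litemsets can then be compared for equality in constant time, which reduces the time needed to check whether a sequence is contained in a customer sequence.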
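The effect of the transformation can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_litemset_ids` and `transform`, the integer item codes, and the toy data are all hypothetical. Each large itemset is mapped to an integer id, and every transaction is replaced by the set of ids of the litemsets it contains, so later passes compare integers rather than itemsets.

```python
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[int]


def build_litemset_ids(litemsets: List[Itemset]) -> Dict[Itemset, int]:
    """Assign consecutive integer ids (starting at 1) to the large itemsets."""
    return {litemset: i for i, litemset in enumerate(litemsets, start=1)}


def transform(customer_sequence: List[Set[int]],
              ids: Dict[Itemset, int]) -> List[Set[int]]:
    """Replace each transaction by the ids of the litemsets it contains.

    Transactions that contain no litemset are dropped from the sequence.
    """
    transformed = []
    for transaction in customer_sequence:
        contained = {i for litemset, i in ids.items() if litemset <= transaction}
        if contained:
            transformed.append(contained)
    return transformed


# Hypothetical toy data: five large itemsets and one customer sequence.
litemsets = [frozenset({30}), frozenset({40}), frozenset({70}),
             frozenset({40, 70}), frozenset({90})]
ids = build_litemset_ids(litemsets)
customer = [{10, 20}, {30}, {40, 60, 70}]
print([sorted(t) for t in transform(customer, ids)])  # [[1], [2, 3, 4]]
```

The first transaction disappears because it contains no large itemset; the last one becomes three ids because it contains the litemsets (40), (70) and (40 70).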

Finding Sequential Patterns 4. AprioriAll Maximal Phase

• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.

    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            Delete from S all subsequences of s_k

• The authors state that data structures (hash trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
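The pruning loop of the maximal phase can be sketched in Python as follows. This is a naive quadratic illustration, not the hash-tree approach the authors refer to; the names `seq`, `is_contained` and `maximal_sequences` and the toy set S are assumptions made for the example. A sequence is represented as a tuple of itemsets, and containment means an order-preserving match in which each itemset is a subset of the itemset it is matched to.

```python
from typing import FrozenSet, Iterable, Set, Tuple

ItemsetSequence = Tuple[FrozenSet[int], ...]


def seq(*itemsets: Iterable[int]) -> ItemsetSequence:
    """Helper to build a sequence, e.g. seq([1], [2, 3])."""
    return tuple(frozenset(s) for s in itemsets)


def is_contained(a: ItemsetSequence, b: ItemsetSequence) -> bool:
    """True if a is contained in b: itemsets of a map, in order, to supersets in b."""
    remaining_b = iter(b)
    # Greedy left-to-right matching; each itemset of b is used at most once.
    return all(any(x <= y for y in remaining_b) for x in a)


def maximal_sequences(large: Iterable[ItemsetSequence]) -> Set[ItemsetSequence]:
    """Delete from S every sequence contained in a longer-or-equal k-sequence."""
    remaining = set(large)
    n = max((len(s) for s in remaining), default=0)
    for k in range(n, 1, -1):                     # k = n down to 2
        for s_k in [s for s in remaining if len(s) == k]:
            remaining -= {t for t in remaining
                          if t != s_k and is_contained(t, s_k)}
    return remaining


# Hypothetical set S of large sequences.
S = [seq([1]), seq([2]), seq([3]), seq([4]), seq([5]),
     seq([1], [2]), seq([1], [3]), seq([2], [3]), seq([4], [5]),
     seq([1], [2], [3])]
result = maximal_sequences(S)
```

For this toy input only <1 2 3> and <4 5> survive, since every other sequence is contained in one of them.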

Finding Sequential Patterns 4. AprioriAll Example

Finding Sequential Patterns 4. AprioriSome

• AprioriSome uses the function "next" to determine which sequence lengths to skip.

Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences).

    function next(k: integer)   // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elsif (hit_k < 0.75) return k + 2;
        elsif (hit_k < 0.80) return k + 3;
        elsif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end

• next returns the length of the sequences to count in the next pass.
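The skip heuristic translates almost directly into Python. In this sketch the names `hit_ratio` and `next_length` are hypothetical, and returning a hit ratio of 0 for an empty candidate set is an assumption the slide does not address; the thresholds are the ones listed above.

```python
def hit_ratio(num_large: int, num_candidates: int) -> float:
    """hit_k = |L_k| / |C_k|, the fraction of candidates that turned out large."""
    if num_candidates == 0:
        return 0.0  # assumption: treat an empty candidate set as a zero hit ratio
    return num_large / num_candidates


def next_length(k: int, hit_k: float) -> int:
    """Length of the sequences to count in the next pass, given the last counted length k."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


# Half of the candidates were large: do not skip any length.
print(next_length(2, hit_ratio(5, 10)))  # 3
```

The higher the hit ratio, the more lengths are skipped, on the reasoning that most candidates of the skipped lengths would have been large but non-maximal anyway.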

Finding Sequential Patterns 4. AprioriSome Forward Phase

    L_1 = {large 1-sequences};   // Result of litemset phase
    C_1 = L_1;
    last = 1;   // We last counted C_last
    for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
    begin
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1};
        else
            C_k = New candidates generated from C_{k-1};
        if (k == next(last)) then begin   // next k to count
            foreach customer-sequence c in the database do
                Increment the count of all candidates in C_k that are contained in c;
            L_k = Candidates in C_k with minimum support;
            last = k;
        end
    end
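The counting step inside the loop (increment the count of every candidate contained in a customer sequence, then keep the candidates with minimum support) can be sketched as below. This is a straightforward nested-loop illustration with hypothetical names (`contains`, `count_support`, `large_sequences`) and made-up data; it assumes the transformed representation, where a customer sequence is a list of sets of litemset ids and a candidate is a tuple of litemset ids.

```python
from typing import Dict, Iterable, List, Set, Tuple

Candidate = Tuple[int, ...]          # sequence of litemset ids
CustomerSequence = List[Set[int]]    # transformed transactions, in time order


def contains(c: CustomerSequence, s: Candidate) -> bool:
    """True if s is contained in c: ids of s appear in order, in distinct transactions."""
    transactions = iter(c)
    return all(any(item in t for t in transactions) for item in s)


def count_support(candidates: Iterable[Candidate],
                  database: List[CustomerSequence]) -> Dict[Candidate, int]:
    """Count, for each candidate, the number of customer sequences containing it."""
    counts = {s: 0 for s in candidates}
    for c in database:
        for s in counts:
            if contains(c, s):
                counts[s] += 1   # at most one increment per customer
    return counts


def large_sequences(counts: Dict[Candidate, int], min_support: int) -> Set[Candidate]:
    """Keep the candidates supported by at least min_support customers."""
    return {s for s, n in counts.items() if n >= min_support}


# Hypothetical transformed database with three customers.
database = [[{1}, {2}, {3}], [{1}, {3}], [{2}, {1}]]
counts = count_support([(1, 2), (1, 3), (2, 3), (2, 1)], database)
large = large_sequences(counts, min_support=2)
```

Here only <1 3> is contained in two customer sequences, so it is the only large 2-sequence at a minimum support of 2 customers.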

Finding Sequential Patterns 4. AprioriSome Backward Phase

    for (k--; k >= 1; k--) do
        if (L_k not found in forward phase) then begin
            Delete all sequences in C_k contained in some L_i, i > k;
            foreach customer-sequence c in D_T do
                Increment the count of all candidates in C_k that are contained in c;
            L_k = Candidates in C_k with minimum support;
        end
        else   // L_k already known
            Delete all sequences in L_k contained in some L_i, i > k;
    Answer = ∪_k L_k;   // Maximal phase not needed

Notation: D_T is the transformed database.

Finding Sequential Patterns 4. AprioriSome Example

Finding Sequential Patterns 4. AprioriSome Example (cont.)

Minimum support = 40% (2 customer sequences)

Finding Sequential Patterns 4. AprioriSome Example

Answer: <1 2 3 4>, <1 3 5>, <4 5>

Finding Sequential Patterns 4. DynamicSome

• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.

• AprioriSome generates C_k from L_{k-1} or C_{k-1}.

• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.

Finding Sequential Patterns 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.

• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c at hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.

Finding Sequential Patterns 4. DynamicSome

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.

• The backward phase is identical to that of AprioriSome.
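Which lengths are counted in which phase follows from the step size alone, as the small sketch below illustrates. The function name `dynamic_some_schedule` and its parameter `first_empty` (the first multiple of step whose set of large sequences came up empty in the forward phase) are assumptions made for this example, and returning the skipped lengths in descending order is a reading of the slide's "count C_8 followed by C_7".

```python
from typing import List, Tuple


def dynamic_some_schedule(step: int, first_empty: int) -> Tuple[List[int], List[int], List[int]]:
    """Return (initialization lengths, forward-phase lengths, skipped lengths).

    The skipped lengths are the ones whose candidates are generated in the
    intermediate phase; they are listed in descending order.
    """
    initial = list(range(1, step + 1))                     # counted in the initialization phase
    forward = list(range(2 * step, first_empty + 1, step)) # counted on the fly, up to the empty one
    counted = set(initial) | set(forward)
    skipped = [k for k in range(first_empty - 1, 0, -1) if k not in counted]
    return initial, forward, skipped


# The case described above: step = 3 and L_9 turns out to be empty.
print(dynamic_some_schedule(3, 9))  # ([1, 2, 3], [6, 9], [8, 7, 5, 4])
```

With step = 2 and an empty L_6, the same function gives skipped lengths 5 and 3, which are the candidate sets C_5 and C_3 of the DynamicSome example.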

Finding Sequential Patterns 4. DynamicSome Initialization Phase

    // step is an integer ≥ 1
    L_1 = {large 1-sequences};   // Result of litemset phase
    for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
    begin
        C_k = New candidates generated from L_{k-1};
        foreach customer-sequence c in D_T do
            Increment the count of all candidates in C_k that are contained in c;
        L_k = Candidates in C_k with minimum support;
    end

Finding Sequential Patterns 4. DynamicSome Forward Phase

    for (k = step; L_k ≠ ∅; k += step) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in D_T do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X do
                increment its count in C_{k+step} (adding it to C_{k+step} if necessary);
        end
        L_{k+step} = Candidates in C_{k+step} with minimum support;
    end

Finding Sequential Patterns 4. DynamicSome OTF-Generate

    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns 4. DynamicSome OTF-Generate (cont.)

Let otf-generate be called with L2 as both sets of large sequences, and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc. The end and start values of the sequences in L2 that are contained in c are:

    Seq      End   Start
    <1 2>    2     1
    <1 3>    3     1
    <1 4>    4     1
    <2 3>    3     2
    <2 4>    4     2
    <3 4>    4     3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
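The end and start values in the table, and the resulting join, can be reproduced with a small Python sketch of otf-generate. It is an illustration under stated assumptions rather than the authors' code: the helper names `earliest_end` and `latest_start` are hypothetical, sequences are assumed to be non-empty tuples of litemset ids, and the contents of L2 beyond the six sequences in the table (<3 5> and <4 5>) are assumed for the example.

```python
from typing import Iterable, List, Optional, Set, Tuple

Seq = Tuple[int, ...]            # non-empty sequence of litemset ids
CustomerSequence = List[Set[int]]


def earliest_end(x: Seq, c: CustomerSequence) -> Optional[int]:
    """x.end: smallest 1-based j such that x is contained in <c_1 ... c_j>, else None."""
    i = 0
    for j, transaction in enumerate(c, start=1):
        if x[i] in transaction:
            i += 1
            if i == len(x):
                return j
    return None


def latest_start(x: Seq, c: CustomerSequence) -> Optional[int]:
    """x.start: largest 1-based j such that x is contained in <c_j ... c_n>, else None."""
    i = len(x) - 1
    for j in range(len(c), 0, -1):
        if x[i] in c[j - 1]:
            i -= 1
            if i < 0:
                return j
    return None


def otf_generate(l_k: Iterable[Seq], l_j: Iterable[Seq], c: CustomerSequence) -> Set[Seq]:
    """Join the sequences of L_k and L_j contained in c on x.end < y.start."""
    ends = [(x, earliest_end(x, c)) for x in l_k]
    starts = [(y, latest_start(y, c)) for y in l_j]
    return {x + y
            for x, end in ends if end is not None
            for y, start in starts if start is not None and end < start}


L2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]
c = [{1}, {2}, {3, 7}, {4}]
print(otf_generate(L2, L2, c))  # {(1, 2, 3, 4)}
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, so the single candidate 4-sequence is <1 2 3 4>, matching the slide.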

Finding Sequential Patterns 4. DynamicSome Intermediate Phase

    for (k--; k > 1; k--) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = New candidates generated from L_{k-1};
            else
                C_k = New candidates generated from C_{k-1};

Finding Sequential Patterns4 AprioriSome

bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip

Let hitk = |Lk| |Ck|

(ie ratio of large k-sequences to candidate k-sequences)

function next(k integer) k is the length of seq counted last pass

beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2

elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5

end

bull next returns the length of sequences to count in the next pass 35

Finding Sequential Patterns4 AprioriSome Forward Phase

L1 = large 1-sequences Result of Litemset phase

C1 = L1

last = 1 We last counted Clast

for (k = 2 Ck-1 ne and Llast ne k++) do

begin

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

if (k== next(last) ) then begin (next k to count)

foreach customer-sequence c in the database do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

last = k

end

end36

Finding Sequential Patterns4 AprioriSome Backward Phase

for (k-- kgt=1 k--) do

if (Lk not found in forward phase) then begin

Delete all sequences in Ck contained in some L i igtk

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

else Lk already known

Delete all sequences in Lk contained in some Li igtk

Answer = Uk Lk (Maximal Phase not Needed)

Notation DT Transformed database 37

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns: 4. DynamicSome Forward Phase

for ( k = step; Lk ≠ ∅; k += step ) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅;
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c);
        foreach sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary).
    end
    Lk+step = Candidates in Ck+step with minimum support.
end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min{ j | x ⊆ <c1 c2 ... cj> };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max{ j | x ⊆ <cj cj+1 ... cn> };
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate (cont.)

Let otf-generate be called with L2 as both Lk and Lj, and let c = <{1} {2} {3 7} {4}>. Thus c1 = {1}, c2 = {2}, etc.

Seq      End   Start
<1 2>    2     1
<1 3>    3     1
<1 4>    4     1
<2 3>    3     2
<2 4>    4     2
<3 4>    4     3

Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
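The End/Start table and the join can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the paper's code. The customer sequence is a list of sets of litemset ids, read here as <{1} {2} {3 7} {4}> (the grouping that matches the End and Start values in the table). Sequences are tuples of ids, and the helper names `earliest_end`, `latest_start` and `otf_generate` are hypothetical.

```python
def earliest_end(x, c):
    """Smallest 1-based j such that x is contained in <c1 ... cj>, else None."""
    pos = 0
    for j, element in enumerate(c, start=1):
        if x[pos] in element:
            pos += 1
            if pos == len(x):
                return j
    return None


def latest_start(x, c):
    """Largest 1-based j such that x is contained in <cj ... cn>, else None."""
    pos = len(x) - 1
    for j in range(len(c), 0, -1):
        if x[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


def otf_generate(large_k, large_j, c):
    """Join sequences contained in c whose occurrences do not overlap."""
    ends = [(x, earliest_end(x, c)) for x in large_k]
    starts = [(y, latest_start(y, c)) for y in large_j]
    return {
        x + y
        for x, end in ends if end is not None
        for y, start in starts if start is not None and end < start
    }


c = [{1}, {2}, {3, 7}, {4}]
L2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(otf_generate(L2, L2, c))
# → {(1, 2, 3, 4)}
```

Scanning left to right and taking the first match yields the minimum end, and scanning right to left yields the maximum start, so one pass over c per sequence is enough.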

Finding Sequential Patterns: 4. DynamicSome Intermediate Phase

for ( k--; k > 1; k-- ) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
        else
            Ck = New candidates generated from Ck-1;
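"New candidates generated from" refers to the apriori-generate step: join the (k-1)-sequences that share their first k-2 items, then prune any candidate that has a (k-1)-subsequence outside the input set. A minimal Python sketch follows. It assumes sequences are tuples of litemset ids, and the function name `apriori_generate` and the set L3 are hypothetical. As a simplifying assumption, the join also pairs a sequence with itself and leaves it to the prune step to discard unsupported results.

```python
def apriori_generate(previous):
    """Candidate k-sequences from a set of (k-1)-sequences (join, then prune)."""
    previous = set(previous)
    # Join: p and q agree on everything except possibly their last item.
    joined = {
        p + (q[-1],)
        for p in previous
        for q in previous
        if p[:-1] == q[:-1]
    }

    def all_subsequences_present(s):
        # Dropping any single item must give a sequence from the input set.
        return all(s[:i] + s[i + 1:] in previous for i in range(len(s)))

    return {s for s in joined if all_subsequences_present(s)}


# L4 from the DynamicSome example holds one sequence, so C5 is empty.
print(apriori_generate({(1, 2, 3, 4)}))
# → set()

# Hypothetical L3, chosen only to show a non-empty result.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_generate(L3))
# → {(1, 2, 3, 4)}
```

The same function serves both branches of the loop, since the input may be either Lk-1 or Ck-1.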

Finding Sequential Patterns: 4. DynamicSome Example

• Let step = 2.

• Use L2 and L2 as arguments to otf-generate to get C4.

Finding Sequential Patterns: 4. DynamicSome Example

• otf-generate yields 2 candidate sequences:

C4          Support
<1 2 3 4>   2
<1 3 4 5>   1

• Only <1 2 3 4> is large:

L4          Support
<1 2 3 4>   2

Finding Sequential Patterns: 4. DynamicSome Example

• Pass L2 and L4 as arguments to otf-generate to get C6.

L4          Support
<1 2 3 4>   2

• C6 is found to be empty: C6 = ∅

Finding Sequential Patterns: 4. DynamicSome Example

• In the intermediate phase, C3 is generated from L2 and C5 from L4, using apriori-generate.

• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3.


Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up


Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were presented for finding sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better at lower minimum support.

• AprioriAll and AprioriSome have excellent scale-up properties.


Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are relevant to the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets, the same building blocks used for association rules.

Final Exam Question 2

What is the major difference between the two algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for each sequence on each pass, but maximal sequences must be identified later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned during the run, but minimum support is not tested at all values of k.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of large itemsets (litemsets) it contains, with each litemset mapped to an integer. Two litemsets can then be compared for equality in constant time, which reduces the time needed to check whether a sequence is contained in a customer sequence.
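A small Python sketch can illustrate the transformation. It is a sketch under assumptions, not the authors' code: the function name `transform`, the item numbers, and the litemset-to-integer mapping are illustrative. A transaction that contains no litemset is dropped from the transformed sequence.

```python
def transform(customer_sequence, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains."""
    transformed = []
    for transaction in customer_sequence:
        ids = {
            ident
            for litemset, ident in litemset_ids.items()
            if litemset <= transaction  # litemset is a subset of the transaction
        }
        if ids:  # transactions containing no litemset are not retained
            transformed.append(ids)
    return transformed


# Illustrative mapping from litemsets to integer ids (an assumption).
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

print(transform([{10, 20}, {30}, {40, 60, 70}], litemset_ids))
# → [{1}, {2, 3, 4}]
```

After this step, checking whether a candidate item occurs in a transaction is a set-membership test on integers, with no itemset comparison.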


Finding Sequential Patterns: 4. AprioriSome Forward Phase

L1 = {large 1-sequences}; // Result of litemset phase
C1 = L1;
last = 1; // We last counted Clast
for ( k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++ ) do
begin
    if (Lk-1 known) then
        Ck = New candidates generated from Lk-1;
    else
        Ck = New candidates generated from Ck-1;
    if ( k == next(last) ) then begin // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c.
        Lk = Candidates in Ck with minimum support.
        last = k;
    end
end

Finding Sequential Patterns: 4. AprioriSome Backward Phase

for ( k--; k >= 1; k-- ) do
    if (Lk not found in forward phase) then begin
        Delete all sequences in Ck contained in some Li, i > k;
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c.
        Lk = Candidates in Ck with minimum support.
    end
    else // Lk already known
        Delete all sequences in Lk contained in some Li, i > k.
Answer = ∪k Lk; // Maximal phase not needed

Notation: DT = transformed database
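The "delete all sequences contained in some Li, i > k" step can be sketched in Python. This is a simplified illustration, not the paper's implementation. Sequences are tuples of litemset ids, and the ids are treated as atomic, so containment reduces to an ordered-subsequence test. The names `is_subsequence` and `drop_non_maximal` are hypothetical, and the sets below are chosen only to agree with the answer <1 2 3 4>, <1 3 5>, <4 5> from the AprioriSome example.

```python
def is_subsequence(x, y):
    """True if the ids of x appear in y in the same order (gaps allowed)."""
    remaining = iter(y)
    # "item in remaining" consumes the iterator up to the first match.
    return all(item in remaining for item in x)


def drop_non_maximal(large_by_length):
    """Remove every sequence contained in some longer large sequence."""
    result = {}
    for k, sequences in large_by_length.items():
        longer = [y for i, ys in large_by_length.items() if i > k for y in ys]
        result[k] = {
            x for x in sequences
            if not any(is_subsequence(x, y) for y in longer)
        }
    return result


# Hypothetical large sequences, keyed by length.
large_by_length = {
    2: {(1, 2), (4, 5)},
    3: {(1, 2, 3), (1, 3, 5)},
    4: {(1, 2, 3, 4)},
}
maximal = drop_non_maximal(large_by_length)
answer = set().union(*maximal.values())
print(sorted(answer))
# → [(1, 2, 3, 4), (1, 3, 5), (4, 5)]
```

Because this pruning happens as each length is handled in the backward phase, the union of the surviving Lk is already the set of maximal sequences.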

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns4 DynamicSome Forward Phase

for ( k = step Lk1048576 ne k+= step ) do

begin

find Lk+step from Lk and Lstep

Ck+step =

foreach customer-sequence c in DT do

begin

X = otf-generate(Lk Lstep c)

foreach sequence x isin Xrsquo increment its count in

Ck+step (adding it to Ck+step if necessary)

end

Lk+step = Candidates in Ck+step with minimum support

end45

Finding Sequential Patterns4 DynamicSome OTF-Generate

c is the customer sequence lt c1c2cn gt

Xk = subseq(Lk c)

forall sequences x isin Xk do

xend = min j | x sube lt c1c2cj gt

Xj = subseq(Lj c)

forall sequences x isin Xj do

xstart = max j | x sube lt cjcj+1cn gt

Answer = join of Xkwith Xj with the join condition

Xkend lt Xjstart46

Finding Sequential Patterns4 DynamicSome OTF-Generate cont

Let otf-generate be called with

L2 and let c = lt1 2 3 7

4gt Thus c1 = 1 c2= 2

etc

Thus the result of the join with the join condition

X2end lt X2start

(where X2 denotes the seq of

length 2) is the single sequence lt1 2 3 4gt

Seq

lt1 2gt

End

2

Start

1

lt1 3gt 3 1

lt1 4gt 4 1

lt2 3gt 3 2

lt2 4gt 4 2

lt3 4gt 4 3

47

Finding Sequential Patterns4 DynamicSome intermediate Phase

for ( k-- k gt 1 k-- ) do

if (Lk not yet determined) then

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

48

Finding Sequential Patterns4 DynamicSome Example

bull Let step = 2

use L2 and L2 as argument in otf-generate to get C4

49

Finding Sequential Patterns4 DynamicSome Example

bull Get 2 candidate sequences

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

50

Finding Sequential Patterns4 DynamicSome Example

bull Only lt1 2 3 4gt is large

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

L4 Minisup

lt1 2 3 4gt 2

51

Finding Sequential Patterns4 DynamicSome Example

bull pass as arg to otf-gen L2 and L4 to get C6

L4 sup

lt1 2 3 4gt 2

52

Finding Sequential Patterns4 DynamicSome Example

bull C6 is found to be empty

L4 sup

lt1 2 3 4gt 2C6 =

53

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

bull The problem of mining sequential patterns from a customer DB was introduced

bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome

bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup

bull AprioriAll and AprioriSome have excellent scale-up properties 62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions63

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

64

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules

65

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

66

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality (The minimum support is checked for each sequence on each run but maximal sequences must be checked for later)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support (Non-maximal sequences are pruned out during runtime but the minimum support is not tested at all values of k) 67

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

68

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

The transformation allows each record to be looked up in constant time reducing the run time

69

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 38: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns4 AprioriSome Backward Phase

for (k-- kgt=1 k--) do

if (Lk not found in forward phase) then begin

Delete all sequences in Ck contained in some L i igtk

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

else Lk already known

Delete all sequences in Lk contained in some Li igtk

Answer = Uk Lk (Maximal Phase not Needed)

Notation DT Transformed database 37

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns4 DynamicSome Forward Phase

for ( k = step Lk1048576 ne k+= step ) do

begin

find Lk+step from Lk and Lstep

Ck+step =

foreach customer-sequence c in DT do

begin

X = otf-generate(Lk Lstep c)

foreach sequence x isin Xrsquo increment its count in

Ck+step (adding it to Ck+step if necessary)

end

Lk+step = Candidates in Ck+step with minimum support

end45

Finding Sequential Patterns4 DynamicSome OTF-Generate

c is the customer sequence lt c1c2cn gt

Xk = subseq(Lk c)

forall sequences x isin Xk do

xend = min j | x sube lt c1c2cj gt

Xj = subseq(Lj c)

forall sequences x isin Xj do

xstart = max j | x sube lt cjcj+1cn gt

Answer = join of Xkwith Xj with the join condition

Xkend lt Xjstart46

Finding Sequential Patterns4 DynamicSome OTF-Generate cont

Let otf-generate be called with

L2 and let c = lt1 2 3 7

4gt Thus c1 = 1 c2= 2

etc

Thus the result of the join with the join condition

X2end lt X2start

(where X2 denotes the seq of

length 2) is the single sequence lt1 2 3 4gt

Seq

lt1 2gt

End

2

Start

1

lt1 3gt 3 1

lt1 4gt 4 1

lt2 3gt 3 2

lt2 4gt 4 2

lt3 4gt 4 3

47

Finding Sequential Patterns4 DynamicSome intermediate Phase

for ( k-- k gt 1 k-- ) do

if (Lk not yet determined) then

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

48

Finding Sequential Patterns4 DynamicSome Example

bull Let step = 2

use L2 and L2 as argument in otf-generate to get C4

49

Finding Sequential Patterns4 DynamicSome Example

bull Get 2 candidate sequences

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

50

Finding Sequential Patterns4 DynamicSome Example

bull Only lt1 2 3 4gt is large

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

L4 Minisup

lt1 2 3 4gt 2

51

Finding Sequential Patterns4 DynamicSome Example

bull pass as arg to otf-gen L2 and L4 to get C6

L4 sup

lt1 2 3 4gt 2

52

Finding Sequential Patterns4 DynamicSome Example

bull C6 is found to be empty

L4 sup

lt1 2 3 4gt 2C6 =

53

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

bull The problem of mining sequential patterns from a customer DB was introduced

bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome

bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup

bull AprioriAll and AprioriSome have excellent scale-up properties 62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions63

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

64

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules

65

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

66

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality (The minimum support is checked for each sequence on each run but maximal sequences must be checked for later)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support (Non-maximal sequences are pruned out during runtime but the minimum support is not tested at all values of k) 67

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

68

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

The transformation allows each record to be looked up in constant time reducing the run time

69

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 39: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns4 AprioriSome Example

38

Finding Sequential Patterns4 AprioriSome Example - Cont

Minimum Support = 40 (2 customer sequences)

39

Finding Sequential Patterns4 AprioriSome Example

Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns4 DynamicSome Forward Phase

for ( k = step Lk1048576 ne k+= step ) do

begin

find Lk+step from Lk and Lstep

Ck+step =

foreach customer-sequence c in DT do

begin

X = otf-generate(Lk Lstep c)

foreach sequence x isin Xrsquo increment its count in

Ck+step (adding it to Ck+step if necessary)

end

Lk+step = Candidates in Ck+step with minimum support

end45

Finding Sequential Patterns 4: DynamicSome OTF-Generate

// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min { j | x ⊆ <c1 c2 ... cj> };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max { j | x ⊆ <cj cj+1 ... cn> };
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start;

Finding Sequential Patterns 4: DynamicSome OTF-Generate (cont.)

Let otf-generate be called with L2 as both sets of large sequences, and let c = <(1) (2) (3 7) (4)>. Thus c1 = (1), c2 = (2), etc. The end and start values of the sequences in L2 that are contained in c are:

Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3

Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>
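The end and start values in this example can be checked with a few lines of Python. This is a sketch under assumptions: the customer sequence is written as a list of sets with 3 and 7 in the same element (which is what the End value of 4 for <1 4> implies), sequences are tuples, and the helper names are hypothetical.

```python
def earliest_end(x, c):
    """Smallest 1-based j such that x is contained in <c1..cj>."""
    pos = 0
    for j, element in enumerate(c, start=1):
        if x[pos] in element:
            pos += 1
            if pos == len(x):
                return j
    return None


def latest_start(x, c):
    """Largest 1-based j such that x is contained in <cj..cn>."""
    pos = len(x) - 1
    for j in range(len(c), 0, -1):
        if x[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


# Customer sequence from the example, one set per element.
c = [{1}, {2}, {3, 7}, {4}]
# The sequences of L2 that are contained in c.
x2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

rows = [(x, earliest_end(x, c), latest_start(x, c)) for x in x2]
for x, end, start in rows:
    print(x, end, start)

# Join condition: the first sequence must end strictly before the second starts.
answer = {xk + xj
          for xk, end, _ in rows
          for xj, _, start in rows
          if end < start}
print(answer)
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, so the join yields the single candidate <1 2 3 4>.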

Finding Sequential Patterns 4: DynamicSome Intermediate Phase

for ( k--; k > 1; k-- ) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
        else
            Ck = New candidates generated from Ck-1;
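A hedged Python sketch of the intermediate phase follows. It is not a literal transcription of the loop above: it fills the skipped lengths in ascending order so that C(k-1) already exists whenever C(k) has to be derived from it, which is this sketch's own reading. The generate_candidates helper is a simplified stand-in for apriori-generate, and the L2 and L4 values are illustrative assumptions that match the DynamicSome example with step = 2.

```python
def generate_candidates(prev):
    """Simplified candidate generation (an assumption of this sketch):
    join the (k-1)-sequences on their first k-2 items, then drop every
    candidate that has a (k-1)-subsequence outside `prev`."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {s for s in joined
            if all(s[:i] + s[i + 1:] in prev for i in range(len(s)))}


def intermediate_phase(large, last_len):
    """Build candidate sets for every length below `last_len` that was never counted."""
    candidates = {}
    for k in range(2, last_len):
        if k in large:
            continue  # L(k) was already determined in an earlier phase
        source = large[k - 1] if (k - 1) in large else candidates[k - 1]
        candidates[k] = generate_candidates(source)
    return candidates


# State after the forward phase with step = 2 (illustrative values).
L2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
large = {1: {(i,) for i in range(1, 6)}, 2: L2, 4: {(1, 2, 3, 4)}, 6: set()}

candidates = intermediate_phase(large, last_len=6)
print(sorted(candidates[3]))
print(candidates[5])
```

Here C3 is derived from L2 and C5 from L4; C5 comes out empty because no 5-sequence can have all of its 4-subsequences in a one-element L4.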

Finding Sequential Patterns 4: DynamicSome Example

• Let step = 2, and use L2 and L2 as arguments to otf-generate to get C4

• This gives 2 candidate sequences:

C4           Support
<1 2 3 4>    2
<1 3 4 5>    1

• Only <1 2 3 4> is large:

L4           Support
<1 2 3 4>    2

• Pass L4 and L2 as arguments to otf-generate to get C6

• C6 is found to be empty (C6 = ∅)

• In the intermediate phase, C3 is generated from L2 and C5 from L4, using apriori-generate

• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3

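Performance: Synthetic Data [slide content not recovered]

Performance: Execution Times [slide content not recovered]

Performance: Scale-Up, Customers [slide content not recovered]

Performance: Scale-Up [slide content not recovered]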

Conclusion

• The problem of mining sequential patterns from a customer database was introduced

• Two types of algorithms were introduced to find sequential patterns:
o CountAll: AprioriAll
o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum supports

• AprioriAll and AprioriSome have excellent scale-up properties

Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets, the same frequent itemsets from which association rules are built.

Final Exam Question 2

What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality: minimum support is checked for every sequence on every pass, but non-maximal sequences must be pruned out afterwards.

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support: non-maximal sequences are pruned out during the run, but minimum support is not tested at every value of k.
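The maximality check that CountAll defers to the end can be sketched as a simple filter. This is an illustrative sketch only: sequences are tuples of single litemset ids (real sequences have itemsets as elements), the function names are hypothetical, and the set of large sequences is an assumed example rather than data from the slides.

```python
def is_subsequence(a, b):
    """True if tuple a can be obtained from tuple b by deleting items."""
    it = iter(b)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in a)


def maximal(sequences):
    """Keep only the sequences that are not contained in another sequence."""
    return {s for s in sequences
            if not any(s != t and is_subsequence(s, t) for t in sequences)}


# Assumed set of large sequences of lengths 1 to 4 (illustrative).
large = {
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
}
result = maximal(large)
print(sorted(result))
```

Most of the 20 counted sequences are discarded by this filter, which is the work CountSome tries to avoid by counting longer sequences first.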

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of litemsets it contains, so testing whether an element of a candidate sequence occurs in a transaction becomes a constant-time lookup, which reduces the run time.
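A minimal Python sketch of the transformation is shown below. It assumes that litemsets are given as frozensets and are numbered from 1 in list order; the function name, the item numbers and the two customers are hypothetical. Unlike a full implementation, this sketch keeps a customer whose transformed sequence is empty as an empty list.

```python
def transform(customers, litemsets):
    """Replace every transaction by the set of ids of the litemsets it contains."""
    ids = {ls: i for i, ls in enumerate(litemsets, start=1)}
    transformed = []
    for transactions in customers:
        sequence = []
        for transaction in transactions:
            present = {ids[ls] for ls in litemsets if ls <= transaction}
            if present:  # transactions without any litemset are dropped
                sequence.append(present)
        transformed.append(sequence)
    return transformed


# Hypothetical litemsets, numbered 1..5 in this order.
litemsets = [frozenset({30}), frozenset({40}), frozenset({70}),
             frozenset({40, 70}), frozenset({90})]
# Hypothetical customer sequences: a list of transactions per customer.
customers = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
]
db = transform(customers, litemsets)
print(db)
```

After this step, checking whether litemset 4 occurs in a transaction is a single set-membership test such as `4 in db[1][1]`, instead of a comparison of item lists.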

Page 42: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns4 DynamicSome

bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase

bull AprioriSome generates Ck from Lk-1 or Ck-1

bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database

41

Finding Sequential Patterns4 DynamicSome

bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3

bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database

o While generating sequences of length 9 with a step

size 3 While passing the data if sequences s6 isin L6

and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence

42

Finding Sequential Patterns4 DynamicSome

bull In the intermediate phase generate the candidate sequences for the skipped lengths

o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5

bull The backward phase is identical to AprioriSome 43

Finding Sequential Patterns4 DynamicSome Initialization Phase

step is an integer ge 1

L1 = large 1-sequences Result of litemset phase

for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do

begin

Ck = New candidates generated from Lk1048576-1

foreach customer-sequence c in DT do

Increment the count of all candidates in Ck

that are contained in c

Lk = Candidates in Ck with minimum support

end

44

Finding Sequential Patterns4 DynamicSome Forward Phase

for ( k = step Lk1048576 ne k+= step ) do

begin

find Lk+step from Lk and Lstep

Ck+step =

foreach customer-sequence c in DT do

begin

X = otf-generate(Lk Lstep c)

foreach sequence x isin Xrsquo increment its count in

Ck+step (adding it to Ck+step if necessary)

end

Lk+step = Candidates in Ck+step with minimum support

end45

Finding Sequential Patterns4 DynamicSome OTF-Generate

c is the customer sequence lt c1c2cn gt

Xk = subseq(Lk c)

forall sequences x isin Xk do

xend = min j | x sube lt c1c2cj gt

Xj = subseq(Lj c)

forall sequences x isin Xj do

xstart = max j | x sube lt cjcj+1cn gt

Answer = join of Xkwith Xj with the join condition

Xkend lt Xjstart46

Finding Sequential Patterns4 DynamicSome OTF-Generate cont

Let otf-generate be called with

L2 and let c = lt1 2 3 7

4gt Thus c1 = 1 c2= 2

etc

Thus the result of the join with the join condition

X2end lt X2start

(where X2 denotes the seq of

length 2) is the single sequence lt1 2 3 4gt

Seq

lt1 2gt

End

2

Start

1

lt1 3gt 3 1

lt1 4gt 4 1

lt2 3gt 3 2

lt2 4gt 4 2

lt3 4gt 4 3

47

Finding Sequential Patterns4 DynamicSome intermediate Phase

for ( k-- k gt 1 k-- ) do

if (Lk not yet determined) then

if (Lk-1 known) then

Ck = New candidates generated from Lk-1

else

Ck = New candidates generated from Ck-1

48

Finding Sequential Patterns4 DynamicSome Example

bull Let step = 2

use L2 and L2 as argument in otf-generate to get C4

49

Finding Sequential Patterns4 DynamicSome Example

bull Get 2 candidate sequences

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

50

Finding Sequential Patterns4 DynamicSome Example

bull Only lt1 2 3 4gt is large

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

L4 Minisup

lt1 2 3 4gt 2

51

Finding Sequential Patterns4 DynamicSome Example

bull pass as arg to otf-gen L2 and L4 to get C6

L4 sup

lt1 2 3 4gt 2

52

Finding Sequential Patterns4 DynamicSome Example

bull C6 is found to be empty

L4 sup

lt1 2 3 4gt 2C6 =

53

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were introduced to find sequential patterns:

o CountAll: AprioriAll

o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum support.

• AprioriAll and AprioriSome have excellent scale-up properties.

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets, the same frequent itemsets from which association rules are built.

Final Exam Question 2

What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximality must be checked later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned during the run, but the minimum support is not tested at all values of k.)

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation allows each record to be looked up in constant time, reducing the run time.
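One way to read the constant-time claim: once each transaction is replaced by the set of ids of the litemsets it contains, testing whether a litemset occurs in a transaction becomes a set-membership check. The Python sketch below illustrates that idea only; the litemsets, their ids, the sample customer sequence and the name `transform` are all assumptions made for illustration.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction by the set of ids of the litemsets it contains.

    customer_seq: list of transactions, each a set of items.
    litemset_ids: dict mapping frozenset(litemset) -> integer id.
    Transactions that contain no litemset are dropped.
    """
    transformed = []
    for transaction in customer_seq:
        ids = {i for litemset, i in litemset_ids.items()
               if litemset <= transaction}  # subset test
        if ids:
            transformed.append(ids)
    return transformed


# Assumed litemsets and ids, for illustration only.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

customer = [{10, 20}, {30}, {40, 60, 70}]
transformed = transform(customer, LITEMSET_IDS)
print(transformed)

# After the transformation, "does element 2 contain litemset 4?" is a set lookup.
print(4 in transformed[1])
```

The subset tests are paid once per transaction during the transformation, so the later counting passes only perform lookups on small integer sets.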

  • Mining Sequential Patterns Rakesh Agrawal & Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 43: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns: 4. DynamicSome

• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.

• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.

o While generating sequences of length 9 with a step size of 3: while passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome - Cont

• In the intermediate phase, generate the candidate sequences for the skipped lengths.

o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.

• The backward phase is identical to AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase

// step is an integer ≥ 1
L1 = {large 1-sequences}   // Result of litemset phase
for ( k = 2; k <= step and Lk-1 ≠ ∅; k++ ) do
begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c
    Lk = Candidates in Ck with minimum support
end

Finding Sequential Patterns: 4. DynamicSome Forward Phase

for ( k = step; Lk ≠ ∅; k += step ) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c)
        foreach sequence x ∈ X' increment its count in Ck+step (adding it to Ck+step if necessary)
    end
    Lk+step = Candidates in Ck+step with minimum support
end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate

// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c)
forall sequences x ∈ Xk do
    x.end = min{ j | x ⊆ <c1 c2 ... cj> }
Xj = subseq(Lj, c)
forall sequences x ∈ Xj do
    x.start = max{ j | x ⊆ <cj cj+1 ... cn> }
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start
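The otf-generate procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: sequences are tuples of litemset ids, a customer sequence is a list of sets, positions are 1-based, and the names `end_pos`, `start_pos` and `otf_generate` as well as the sample inputs are chosen here.

```python
def end_pos(x, c):
    """Smallest j (1-based) such that x is contained in <c1 ... cj>, else None."""
    pos = 0
    for item in x:
        # Greedy left-to-right match gives the earliest possible end.
        while pos < len(c) and item not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def start_pos(x, c):
    """Largest j (1-based) such that x is contained in <cj ... cn>, else None."""
    pos = len(c) - 1
    for item in reversed(x):
        # Greedy right-to-left match gives the latest possible start.
        while pos >= 0 and item not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # convert "index before the first match" to a 1-based index


def otf_generate(large_k, large_j, c):
    """Concatenate x from large_k with y from large_j whenever x.end < y.start."""
    ends = [(x, end_pos(x, c)) for x in large_k]
    starts = [(y, start_pos(y, c)) for y in large_j]
    return {
        x + y
        for x, e in ends if e is not None
        for y, s in starts if s is not None and e < s
    }


# Assumed sample inputs: a four-element customer sequence and six large 2-sequences.
c = [{1}, {2}, {3, 7}, {4}]
L2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
for x in L2:
    print(x, end_pos(x, c), start_pos(x, c))
result = otf_generate(L2, L2, c)
print(result)
```

Only <1 2> ends (at position 2) before another 2-sequence starts (<3 4>, at position 3), so the join for this customer yields the single candidate <1 2 3 4>.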

Page 46: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns 4: DynamicSome Forward Phase

for ( k = step; Lk ≠ ∅; k += step ) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c)
        foreach sequence x ∈ X, increment its count in
            Ck+step (adding it to Ck+step if necessary)
    end
    Lk+step = candidates in Ck+step with minimum support
end
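
The forward-phase loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: sequences are taken to be tuples of integer litemset ids, each customer sequence is a list of sets of such ids, `min_count` is an absolute number of customers, and `otf_generate` is a caller-supplied function taking (Lk, Lstep, c). The names `forward_phase`, `large`, and `min_count` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set, Tuple

Sequence = Tuple[int, ...]   # assumed encoding: a sequence of litemset ids
Customer = List[Set[int]]    # assumed encoding: a transformed customer sequence


def forward_phase(
    db: List[Customer],
    large: Dict[int, Set[Sequence]],
    step: int,
    min_count: int,
    otf_generate: Callable[[Set[Sequence], Set[Sequence], Customer], Iterable[Sequence]],
) -> Dict[int, Set[Sequence]]:
    """Fill in large[k + step] for k = step, 2*step, ... until some Lk is empty.

    `large` must already contain L_step (the initialization phase provides it).
    """
    k = step
    while large.get(k):                      # loop condition: Lk is not empty
        counts: Counter = Counter()          # plays the role of C(k+step)
        for c in db:
            # A set ensures each customer adds at most 1 to a candidate's count.
            for x in set(otf_generate(large[k], large[step], c)):
                counts[x] += 1
        large[k + step] = {s for s, n in counts.items() if n >= min_count}
        k += step
    return large
```

Candidates are created only when some customer sequence actually contains them, which is why the count table starts empty and grows on demand.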

Finding Sequential Patterns 4: DynamicSome OTF-Generate

// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c)
forall sequences x ∈ Xk do
    x.end = min { j | x ⊆ <c1 c2 ... cj> }
Xj = subseq(Lj, c)
forall sequences x ∈ Xj do
    x.start = max { j | x ⊆ <cj cj+1 ... cn> }
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start

Finding Sequential Patterns 4: DynamicSome OTF-Generate (cont.)

Let otf-generate be called with L2 (as both Lk and Lj) and let c = <{1} {2} {3 7} {4}>. Thus c1 = {1}, c2 = {2}, etc.

    Seq       End   Start
    <1 2>     2     1
    <1 3>     3     1
    <1 4>     4     1
    <2 3>     3     2
    <2 4>     4     2
    <3 4>     4     3

Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
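
A minimal Python sketch of otf-generate, written to reproduce the End/Start values in the worked example. It assumes sequences are non-empty tuples of integer litemset ids and that a customer sequence is a list of sets of ids; the helper names `earliest_end` and `latest_start` are hypothetical, and positions are 1-based to match the slides.

```python
from typing import Iterable, List, Optional, Set, Tuple

Sequence = Tuple[int, ...]   # assumed encoding: a sequence of litemset ids
Customer = List[Set[int]]    # assumed encoding: a transformed customer sequence


def earliest_end(x: Sequence, c: Customer) -> Optional[int]:
    """Smallest j such that x is contained in <c1 ... cj>, or None."""
    i = 0
    for j, element in enumerate(c, start=1):
        if x[i] in element:          # greedy leftmost match gives the minimum end
            i += 1
            if i == len(x):
                return j
    return None


def latest_start(x: Sequence, c: Customer) -> Optional[int]:
    """Largest j such that x is contained in <cj ... cn>, or None."""
    i = len(x) - 1
    for j in range(len(c), 0, -1):
        if x[i] in c[j - 1]:         # greedy rightmost match gives the maximum start
            i -= 1
            if i < 0:
                return j
    return None


def otf_generate(lk: Iterable[Sequence], lj: Iterable[Sequence], c: Customer) -> Set[Sequence]:
    """Join sequences of Lk and Lj contained in c on the condition end < start."""
    ends = {}
    for x in lk:
        e = earliest_end(x, c)
        if e is not None:
            ends[x] = e
    starts = {}
    for y in lj:
        s = latest_start(y, c)
        if s is not None:
            starts[y] = s
    return {x + y for x, e in ends.items() for y, s in starts.items() if e < s}


L2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
c = [{1}, {2}, {3, 7}, {4}]
print(otf_generate(L2, L2, c))  # {(1, 2, 3, 4)}
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, so a single candidate of length 4 is produced.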

Finding Sequential Patterns 4: DynamicSome Intermediate Phase

for ( k--; k > 1; k-- ) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1
        else
            Ck = New candidates generated from Ck-1
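
The gap-filling logic of the intermediate phase can be sketched as below. Several things here are assumptions rather than anything the slides state: `generate` is a hypothetical caller-supplied stand-in for apriori-generate, `large` maps each already-determined length k to Lk (including L1), and `last_k` is the value of k at which the forward phase stopped. The slide's loop counts k downward; this sketch visits the missing lengths in increasing order so that Ck-1 already exists whenever it is needed, which is a choice of the sketch.

```python
from typing import Callable, Dict, Set, Tuple

Sequence = Tuple[int, ...]   # assumed encoding: a sequence of litemset ids


def intermediate_phase(
    large: Dict[int, Set[Sequence]],
    last_k: int,
    generate: Callable[[Set[Sequence]], Set[Sequence]],
) -> Dict[int, Set[Sequence]]:
    """Return candidate sets Ck for every length k < last_k that has no Lk yet."""
    candidates: Dict[int, Set[Sequence]] = {}
    for k in range(2, last_k):
        if k in large:
            continue                          # Lk already determined, nothing to do
        if (k - 1) in large:
            source = large[k - 1]             # generate from Lk-1 when it is known
        else:
            source = candidates[k - 1]        # otherwise fall back to Ck-1
        candidates[k] = generate(source)
    return candidates
```

With step = 2 and the forward phase stopping at k = 6, the missing lengths are 3 and 5, so C3 comes from L2 and C5 from L4.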

Finding Sequential Patterns 4: DynamicSome Example

• Let step = 2; use L2 and L2 as arguments to otf-generate to get C4.

• This yields 2 candidate sequences:

    C4            Support
    <1 2 3 4>     2
    <1 3 4 5>     1

• Only <1 2 3 4> is large:

    L4            Support
    <1 2 3 4>     2

• Pass L2 and L4 as arguments to otf-generate to get C6.

• C6 is found to be empty: C6 = ∅.

• In the intermediate phase, C3 is generated from L2 and C5 from L4 using apriori-generate.

• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3.

Performance: Synthetic Data

Performance: Execution Times

Performance: Scale-Up (Customers)

Performance: Scale-Up

Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were presented to find sequential patterns:
    o CountAll: AprioriAll
    o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum support.

• AprioriAll and AprioriSome have excellent scale-up properties.

Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules describe intra-transaction patterns, while sequential patterns describe inter-transaction patterns. The Apriori-based algorithms studied here use both ideas: the litemset phase finds large itemsets within transactions, as in association-rule mining, and the sequence phase then finds sequential patterns whose elements are those large itemsets.

Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality: minimum support is checked for every candidate sequence on every pass, but maximality must be checked later.

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support: non-maximal sequences are pruned during the run, but minimum support is not tested at every value of k.
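
To make the maximality check mentioned above concrete, here is a small Python sketch of filtering a set of large sequences down to the maximal ones. It is an illustration under assumptions, not the maximal phase as specified by the authors: sequences are represented as tuples of frozensets of items, and the names `is_contained` and `maximal_sequences` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

ItemSequence = Tuple[FrozenSet[int], ...]   # assumed encoding: ordered itemsets


def is_contained(s: ItemSequence, t: ItemSequence) -> bool:
    """True if each itemset of s is a subset of a distinct itemset of t, in order."""
    i = 0
    for element in t:
        if i < len(s) and s[i] <= element:   # greedy leftmost matching
            i += 1
    return i == len(s)


def maximal_sequences(seqs: List[ItemSequence]) -> List[ItemSequence]:
    """Keep only the sequences not contained in any other sequence of the list."""
    return [
        s for s in seqs
        if not any(s != t and is_contained(s, t) for t in seqs)
    ]
```

A CountAll-style algorithm would run a filter of this kind after all large sequences are known, whereas a CountSome-style algorithm tries to avoid counting the non-maximal ones in the first place.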

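Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of litemsets it contains, with each litemset mapped to an integer. Two litemsets can then be compared for equality in constant time, which reduces the time needed to check whether a sequence is contained in a customer sequence.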
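
The transformation described in this answer can be sketched in Python as follows. This is a minimal sketch with illustrative data: the function name `transform`, the `litemset_ids` mapping, and the item numbers are assumptions chosen for the example. Each transaction is replaced by the set of ids of the litemsets it contains, and transactions containing no litemset are dropped.

```python
from typing import Dict, FrozenSet, List, Set


def transform(
    customer_seq: List[Set[int]],
    litemset_ids: Dict[FrozenSet[int], int],
) -> List[Set[int]]:
    """Map each transaction to the ids of all litemsets it contains."""
    transformed = []
    for transaction in customer_seq:
        ids = {i for items, i in litemset_ids.items() if items <= transaction}
        if ids:                      # a transaction with no litemset is dropped
            transformed.append(ids)
    return transformed


# Illustrative mapping of litemsets to integer ids.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
print(transform([{10, 20}, {30}, {40, 60, 70}], litemset_ids))  # [{1}, {2, 3, 4}]
```

After this step, checking whether a sequence element occurs in a transaction is a single integer membership test on a set instead of an itemset comparison.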

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 50: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns4 DynamicSome Example

bull Let step = 2

use L2 and L2 as argument in otf-generate to get C4

49

Finding Sequential Patterns4 DynamicSome Example

bull Get 2 candidate sequences

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

50

Finding Sequential Patterns4 DynamicSome Example

bull Only lt1 2 3 4gt is large

C4 Minisup

lt1 2 3 4gt 2

lt1 3 4 5gt 1

L4 Minisup

lt1 2 3 4gt 2

51

Finding Sequential Patterns4 DynamicSome Example

bull pass as arg to otf-gen L2 and L4 to get C6

L4 sup

lt1 2 3 4gt 2

52

Finding Sequential Patterns4 DynamicSome Example

bull C6 is found to be empty

L4 sup

lt1 2 3 4gt 2C6 =

53

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

bull The problem of mining sequential patterns from a customer DB was introduced

bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome

bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup

bull AprioriAll and AprioriSome have excellent scale-up properties 62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions63

Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here: the large itemsets found within transactions, as in association rule mining, are the elements from which the sequential patterns are built.

Final Exam Question 2

What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximal sequences must be checked for later.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during runtime, but the minimum support is not tested at all values of k.)
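The maximality check that CountAll leaves for later can be sketched as a containment test over the large sequences. This is a minimal Python sketch under assumptions, not the presenter's code: sequences are modelled as tuples of frozensets, the names `contains`, `maximal_sequences` and `seq` are hypothetical, and the item numbers are illustrative data only.

```python
def contains(big, small):
    """True if `small` is contained in `big`.

    Each itemset of `small` must be a subset of a distinct itemset of `big`,
    and the matched itemsets must appear in the same order.
    """
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Drop every large sequence that is contained in a different large sequence."""
    return [s for s in large if not any(s != t and contains(t, s) for t in large)]


def seq(*itemsets):
    """Helper for building a sequence from plain sets."""
    return tuple(frozenset(i) for i in itemsets)


# Illustrative large sequences (assumed data).
large = [
    seq({30}), seq({40}), seq({70}), seq({90}), seq({40, 70}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}), seq({30}, {40, 70}),
]
maximal = maximal_sequences(large)
print(maximal)
```

Only the two sequences that are not contained in any other large sequence remain. A CountSome-style algorithm aims to avoid counting many of the shorter sequences that this step would discard anyway.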

Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of large itemsets it contains, and each large itemset is mapped to an integer. Two large itemsets can then be compared in constant time, which reduces the time needed to check whether a sequence is contained in a customer sequence.
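The effect of the transformation can be sketched in Python as follows. This is a minimal sketch under assumptions, not the presenter's code: the function name `transform`, the mapping `litemset_ids` and the sample customer sequence are hypothetical. Each transaction is replaced by the set of ids of the large itemsets it contains, and transactions that contain none are dropped.

```python
def transform(customer_sequence, litemset_ids):
    """Replace each transaction by the set of ids of the large itemsets it contains.

    Transactions that contain no large itemset are dropped.
    """
    transformed = []
    for transaction in customer_sequence:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


# Hypothetical mapping from large itemsets to integer ids.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

# Hypothetical customer sequence: three transactions in time order.
customer = [{10, 20}, {30}, {40, 60, 70}]
transformed = transform(customer, litemset_ids)
print(transformed)
```

After this step the sequence phase only compares integer ids, so testing whether one element of a candidate sequence occurs in a transaction is a set-membership lookup rather than an itemset comparison.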

Page 55: Spring 2014 Presentation by:  Thomas Little

Finding Sequential Patterns4 DynamicSome Example

bull In the intermediate phase C3 is generated

C3

from L2 and C5 from L4

using apriori-generate

L4 sup

lt1 2 3 4gt 2C5

54

Finding Sequential Patterns4 DynamicSome Example

bull C5 is found to be empty so only C3 is counted during the backward phase to get L3

L3 C3

C5 =

55

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions56

Performance Synthetic Data

57

Performance Execution Times

58

Performance Scale-Up Customers

59

Performance Scale-Up

60

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions61

Conclusion

bull The problem of mining sequential patterns from a customer DB was introduced

bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome

bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup

bull AprioriAll and AprioriSome have excellent scale-up properties 62

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions63

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

64

Final Exam Question 1 Compare and contrast association rules and sequential

patterns How do they relate to each other in the context of the Apriori algorithms

Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules

65

Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but non-maximal sequences must be removed afterwards.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but the minimum support is not tested for every value of k.)
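The maximality check that CountAll defers to the end can be sketched in Python as a filter over the set of large sequences. This is a minimal sketch, assuming sequences are tuples of frozensets of items; the names is_contained and maximal_sequences and the sample data are illustrative, and the quadratic scan is chosen for clarity rather than speed.

```python
def is_contained(a, b):
    """True if sequence `a` is contained in sequence `b`.

    Each element of `a` must be a subset of a distinct element of `b`,
    with the matched elements of `b` in the same order.
    """
    pos = 0
    for element in a:
        while pos < len(b) and not element <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Keep only the sequences not contained in any other large sequence."""
    large = list(large)
    return [
        s for s in large
        if not any(s != t and is_contained(s, t) for t in large)
    ]


def seq(*itemsets):
    """Helper: build a sequence (tuple of frozensets) from iterables of items."""
    return tuple(frozenset(i) for i in itemsets)


# Hypothetical set of large sequences of lengths 1 and 2.
large = [
    seq([30]), seq([40]), seq([70]), seq([40, 70]), seq([90]),
    seq([30], [40]), seq([30], [70]), seq([30], [90]), seq([30], [40, 70]),
]
for s in maximal_sequences(large):
    print([sorted(e) for e in s])
# [[30], [90]]
# [[30], [40, 70]]
```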

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation allows each record to be looked up in constant time, reducing the run time.
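A minimal Python sketch of the transformation idea follows, under stated assumptions: each large itemset is mapped to an integer id, and each transaction is replaced by the set of ids of the litemsets it contains, so later passes work with integers instead of re-testing itemsets. The mapping, the sample customer sequence, and the function name transform are illustrative, not taken from the slides.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains.

    Transactions that contain no litemset are dropped from the sequence.
    """
    transformed = []
    for transaction in customer_seq:
        ids = {
            ident for litemset, ident in litemset_ids.items()
            if litemset <= transaction
        }
        if ids:
            transformed.append(ids)
    return transformed


# Hypothetical mapping from large itemsets to integer ids.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

# Hypothetical customer sequence, sorted by transaction time.
customer_seq = [{10, 20}, {30}, {40, 60, 70}]
print(transform(customer_seq, litemset_ids))  # [{1}, {2, 3, 4}]
```

After this step, checking whether a transaction supports a given litemset is a set-membership test on an integer, which is an average constant-time operation for Python sets.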

Page 61: Spring 2014 Presentation by:  Thomas Little

Performance Scale-Up


Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions

Conclusion

• The problem of mining sequential patterns from a customer database was introduced.

• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome

• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minsup.

• AprioriAll and AprioriSome have excellent scale-up properties.

Outline

Introduction

Problem Description

Finding Sequential Patterns

Sequence Phase

Performance

Conclusion

Final Exam Questions


Final Exam Question 1

Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?

Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here: the algorithms look for sequential patterns whose elements are large itemsets (litemsets), the same frequent itemsets from which association rules are built.
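The intra- versus inter-transaction distinction can be made concrete with a small sketch. Everything below is illustrative rather than taken from the slides: the helper names (`fs`, `itemset_support`, `contains`, `sequence_support`) and the five-customer toy database are assumptions. Itemset support is counted here per transaction, as in the association-rule setting, while sequence support is counted per customer.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[int]
CustomerSequence = List[Itemset]  # one customer's transactions, in time order


def fs(*items: int) -> Itemset:
    """Hypothetical shorthand for building an itemset."""
    return frozenset(items)


def itemset_support(itemset: Itemset, db: List[CustomerSequence]) -> float:
    """Intra-transaction view: fraction of single transactions containing the itemset."""
    transactions = [t for customer in db for t in customer]
    return sum(itemset <= t for t in transactions) / len(transactions)


def contains(pattern: Sequence[Itemset], customer: Sequence[Itemset]) -> bool:
    """Inter-transaction view: pattern elements must occur, in order,
    in distinct transactions of the same customer."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest transaction that includes this element.
        while pos < len(customer) and not element <= customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1  # the next element must match a strictly later transaction
    return True


def sequence_support(pattern: Sequence[Itemset], db: List[CustomerSequence]) -> float:
    """Fraction of customers whose sequence contains the pattern."""
    return sum(contains(pattern, c) for c in db) / len(db)


# Toy database (assumed for illustration): one list of transactions per customer.
db = [
    [fs(30), fs(90)],
    [fs(10, 20), fs(30), fs(40, 60, 70)],
    [fs(30, 50, 70)],
    [fs(30), fs(40, 70), fs(90)],
    [fs(90)],
]

print(itemset_support(fs(40, 70), db))          # 0.2 (2 of 10 transactions)
print(sequence_support([fs(30), fs(90)], db))   # 0.4 (2 of 5 customers)
print(sequence_support([fs(90), fs(30)], db))   # 0.0 (order matters)
```

The last two calls use the same items in a different order and get different support, which is exactly what an unordered itemset cannot express.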


Final Exam Question 2

What is the major difference between the two types of algorithms, CountSome and CountAll?

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every candidate sequence on every pass, but non-maximal sequences must be pruned out afterwards.)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but minimum support is not tested at all values of k.)
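The "careless with respect to maximality" side of CountAll can be sketched as a separate pruning pass over everything that was counted. This is a simple quadratic version written for illustration, not the procedure from the slides; the names `contains` and `maximal_phase` and the sample list of large sequences are assumptions.

```python
from typing import FrozenSet, List, Tuple

Seq = Tuple[FrozenSet[int], ...]  # a sequence is an ordered tuple of itemsets


def contains(pattern: Seq, other: Seq) -> bool:
    """True if pattern is a subsequence of other (elements matched by subset)."""
    pos = 0
    for element in pattern:
        while pos < len(other) and not element <= other[pos]:
            pos += 1
        if pos == len(other):
            return False
        pos += 1
    return True


def maximal_phase(large: List[Seq]) -> List[Seq]:
    """Keep only the large sequences not contained in a different large sequence."""
    return [
        s for s in large
        if not any(s != t and contains(s, t) for t in large)
    ]


def seq(*elements: Tuple[int, ...]) -> Seq:
    """Hypothetical shorthand: seq((30,), (40, 70)) builds <(30) (40 70)>."""
    return tuple(frozenset(e) for e in elements)


# Assumed output of a count-all run: every large sequence, maximal or not.
large_sequences = [
    seq((30,)), seq((40,)), seq((70,)), seq((40, 70)), seq((90,)),
    seq((30,), (40,)), seq((30,), (70,)),
    seq((30,), (40, 70)), seq((30,), (90,)),
]

maximal = maximal_phase(large_sequences)
print(len(large_sequences), "counted,", len(maximal), "maximal")
```

Seven of the nine counted sequences are discarded here. A CountSome algorithm tries to avoid counting such sequences in the first place by counting longer sequences first and skipping candidates already contained in a longer large sequence.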


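Final Exam Question 3

Why is the Transformation stage of these pattern mining algorithms so important to their speed?

The transformation replaces each transaction with the set of litemsets it contains, with each litemset treated as a single entity (mapped to an integer). Two litemsets can then be compared for equality in constant time, which reduces the time needed to check whether a sequence is contained in a customer sequence and so reduces the run time.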
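A minimal sketch of the idea, under stated assumptions: the litemset-to-integer mapping, the toy database, and the function names `transform` and `contains_ids` are all made up for illustration. Customers whose transactions contain no litemset are kept here as empty sequences so that the customer count stays unchanged.

```python
from typing import Dict, FrozenSet, List, Sequence

Itemset = FrozenSet[int]


def transform(db: List[List[Itemset]],
              litemset_ids: Dict[Itemset, int]) -> List[List[FrozenSet[int]]]:
    """Replace every transaction by the set of ids of the litemsets it contains."""
    transformed = []
    for customer in db:
        new_sequence = []
        for transaction in customer:
            ids = frozenset(i for litemset, i in litemset_ids.items()
                            if litemset <= transaction)
            if ids:  # transactions containing no litemset are dropped
                new_sequence.append(ids)
        transformed.append(new_sequence)
    return transformed


def contains_ids(pattern: Sequence[int],
                 customer: Sequence[FrozenSet[int]]) -> bool:
    """Containment test on transformed data: each step is an integer membership
    test on a set instead of a subset comparison between itemsets."""
    pos = 0
    for litemset_id in pattern:
        while pos < len(customer) and litemset_id not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


# Assumed litemsets and their integer ids.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

# Assumed toy database: one list of transactions per customer.
db = [
    [frozenset({30}), frozenset({90})],
    [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})],
    [frozenset({30, 50, 70})],
    [frozenset({30}), frozenset({40, 70}), frozenset({90})],
    [frozenset({90})],
]

transformed = transform(db, litemset_ids)
print(transformed[1])                        # second customer after transformation
print(contains_ids([1, 4], transformed[1]))  # pattern <(30) (40 70)> as ids <1 4>
```

The subset tests are paid once, during the transformation. Every later pass over the data works only with integers, which matters because the sequence phase repeats the containment test for many candidates on every pass.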

  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 68: Spring 2014 Presentation by:  Thomas Little

Final Exam Question 2 What is the major difference between the two algorithms

CountSome and CountAll

CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality (The minimum support is checked for each sequence on each run but maximal sequences must be checked for later)

CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support (Non-maximal sequences are pruned out during runtime but the minimum support is not tested at all values of k) 67

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

68

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

The transformation allows each record to be looked up in constant time reducing the run time

69

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 69: Spring 2014 Presentation by:  Thomas Little

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

68

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

The transformation allows each record to be looked up in constant time reducing the run time

69

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)
Page 70: Spring 2014 Presentation by:  Thomas Little

Final Exam Question 3 Why is the Transformation stage of these

pattern mining algorithms so important to their speed

The transformation allows each record to be looked up in constant time reducing the run time

69

  • Mining Sequential Patterns Rakesh Agrawal amp Ramakrishnan Sr
  • Outline
  • Outline (2)
  • Introduction
  • Introduction - Cont
  • Introduction - Cont
  • Introduction - Cont (2)
  • Introduction - Cont (2)
  • Outline (3)
  • Problem Description
  • Problem Description
  • Problem Description (Terminology and definitions)
  • Problem Description (Terminology and definitions) - Cont
  • Problem Description (Terminology and definitions) - Cont (2)
  • Problem Description - Cont
  • Problem Description Example
  • Outline (4)
  • Finding Sequential Patterns
  • Finding Sequential Patterns 1 Sort Phase
  • Finding Sequential Patterns 2 Litemset Phase
  • Finding Sequential Patterns 2 Litemset Phase - Cont
  • Finding Sequential Patterns 2 Litemset Phase - Cont (2)
  • Finding Sequential Patterns 2 Litemset Phase - Cont (3)
  • Finding Sequential Patterns 3 Transformation Phase
  • Finding Sequential Patterns 3 Transformation Phase - Cont
  • Finding Sequential Patterns 3 Transformation Phase - Cont (2)
  • Finding Sequential Patterns 4 Sequence Phase Overview
  • Finding Sequential Patterns 4 Sequence Phase
  • Finding Sequential Patterns 4 Sequence Phase (2)
  • Finding Sequential Patterns 4 Sequence Phase (3)
  • Finding Sequential Patterns 4 Sequence Phase AprioriAll
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation
  • Finding Sequential Patterns 4 AprioriAll Candidate Generation (2)
  • Finding Sequential Patterns 4 AprioriAll Maximal Phase
  • Finding Sequential Patterns 4 AprioriAll Example
  • Finding Sequential Patterns 4 AprioriSome
  • Finding Sequential Patterns 4 AprioriSome Forward Phase
  • Finding Sequential Patterns 4 AprioriSome Backward Phase
  • Finding Sequential Patterns 4 AprioriSome Example
  • Finding Sequential Patterns 4 AprioriSome Example - Cont
  • Finding Sequential Patterns 4 AprioriSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome
  • Finding Sequential Patterns 4 DynamicSome (2)
  • Finding Sequential Patterns 4 DynamicSome (3)
  • Finding Sequential Patterns 4 DynamicSome Initialization Phase
  • Finding Sequential Patterns 4 DynamicSome Forward Phase
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate
  • Finding Sequential Patterns 4 DynamicSome OTF-Generate cont
  • Finding Sequential Patterns 4 DynamicSome intermediate Phase
  • Finding Sequential Patterns 4 DynamicSome Example
  • Finding Sequential Patterns 4 DynamicSome Example (2)
  • Finding Sequential Patterns 4 DynamicSome Example (3)
  • Finding Sequential Patterns 4 DynamicSome Example (4)
  • Finding Sequential Patterns 4 DynamicSome Example (5)
  • Finding Sequential Patterns 4 DynamicSome Example (6)
  • Finding Sequential Patterns 4 DynamicSome Example (7)
  • Outline (5)
  • Performance Synthetic Data
  • Performance Execution Times
  • Performance Scale-Up Customers
  • Performance Scale-Up
  • Outline (6)
  • Conclusion
  • Outline (7)
  • Final Exam Question 1
  • Final Exam Question 1 (2)
  • Final Exam Question 2
  • Final Exam Question 2 (2)
  • Final Exam Question 3
  • Final Exam Question 3 (2)

Recommended