Mining Sequential Patterns
Rakesh Agrawal & Ramakrishnan Srikant
Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995
Spring 2014 Presentation by Thomas Little
with slides adapted from Dan Brown's 2011 presentation

Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions

Introduction
Bar-code technology allows the collection of massive amounts of sales data (basket data).
A typical data record consists of the transaction date, the items bought, and the customer-id.

Introduction - Cont
The problem of mining sequential patterns over this data is introduced.
So far we have seen frequent pattern mining in the context of association rules, where we were interested in which items were purchased in the same transaction. These are intra-transactional patterns.

Introduction - Cont
The problem of sequential pattern mining is concerned with inter-transactional patterns.
A pattern in the first case consists of a set of unordered items:
{a, c, d, g}
A pattern in the second case is an ordered list of sets of items:
<a c d g>

Introduction - Cont
An example of a sequential pattern:
Customers typically rent "Star Wars", then "The Empire Strikes Back", followed by "Return of the Jedi".
Note that these rentals do not need to be consecutive. Customers who rent other videos in between also support this sequential pattern.

Introduction - Cont
Elements of a sequential pattern can be sets of items as well. For example:
"Fitted sheet, flat sheet, and pillow cases", followed by "comforter", followed by "drapes and ruffles".

Problem Description
We are given a database D of customer transactions.
Each transaction consists of the fields customer-id, transaction-time, and the items purchased in the transaction.

Problem Description
No customer has more than one transaction with the same transaction-time.
Quantities of items bought are not considered: each item is a binary variable representing whether an item was bought or not.

Problem Description (Terminology and definitions)
Itemset: A non-empty set of items. Each itemset is mapped to an integer.
Sequence: An ordered list of itemsets.
Customer Sequence: The list of a customer's transactions, ordered by increasing transaction time.
A customer supports a sequence if the sequence is contained in the customer-sequence.
Support for a Sequence: The fraction of total customers that support the sequence.

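The definitions above can be made concrete with a small sketch. It assumes the usual reading of "contained": each itemset of the pattern is a subset of a distinct transaction of the customer, and the matched transactions appear in the same order. The function names and the five-customer toy database are hypothetical, chosen only for illustration.

```python
def is_contained(pattern, customer_seq):
    """Return True if `pattern` (a list of itemsets) is contained in `customer_seq`.

    Each itemset of the pattern must be a subset of a distinct transaction,
    and the matched transactions must appear in the same order.
    """
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest remaining transaction that covers this itemset.
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # later itemsets must match strictly later transactions
    return True


def support(pattern, customer_seqs):
    """Fraction of customers whose customer sequence contains `pattern`."""
    hits = sum(is_contained(pattern, seq) for seq in customer_seqs)
    return hits / len(customer_seqs)


# Hypothetical toy database: one customer sequence per customer.
CUSTOMERS = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

print(support([{30}, {90}], CUSTOMERS))      # supported by customers 1 and 4
print(support([{10, 20}, {30}], CUSTOMERS))  # supported by customer 2 only
```

The greedy scan is enough here because matching an itemset as early as possible never rules out a later match.
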
Problem Description (Terminology and definitions) - Cont
Maximal Sequence: A sequence that is not contained in any other sequence.
Large Sequence: A sequence that meets minisup (the minimum support).
Length of a sequence: The number of itemsets in the sequence. A sequence of length k is called a k-sequence.
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.
An itemset with minimum support is called a large itemset, or Litemset.

Problem Description (Terminology and definitions) - Cont
Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.

Problem Description - Cont
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.
Each such sequence represents a sequential pattern.

Problem Description - Example
Note: With a minisup of 25%, at least two customers must support a sequence.
< (10 20) (30) > does not have enough support (it is supported only by Customer 2).
< (30) >, < (70) >, < (30) (40) >, ... are not maximal.
[Table: sequences with minimum support]

Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases:
1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor key.
This step implicitly converts the original transaction DB into a DB of customer sequences.
Recall: a Customer Sequence is a list of customer transactions ordered by increasing transaction time.

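A minimal sketch of the sort phase, assuming each transaction arrives as a (customer_id, transaction_time, items) tuple. That tuple layout, the function name, and the sample rows are assumptions for illustration, not the paper's storage format.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Sort by (customer_id, transaction_time) and group into customer sequences."""
    # Major key: customer id (field 0). Minor key: transaction time (field 1).
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [set(items) for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical unsorted transaction rows.
ROWS = [
    (2, 3, [40, 60, 70]),
    (1, 1, [30]),
    (2, 1, [10, 20]),
    (1, 2, [90]),
    (2, 2, [30]),
]

SEQUENCES = to_customer_sequences(ROWS)
for customer_id, seq in SEQUENCES.items():
    print(customer_id, seq)
```

Because `groupby` only merges adjacent rows, the sort on customer id has to happen first, which is exactly what the major/minor key ordering provides.
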
Finding Sequential Patterns: 2. Litemset Phase
In this phase we find the set of all large itemsets (Litemsets), L.
We are also simultaneously finding the set of large 1-sequences, since this set is just
{ <l> | l ∈ L }

Finding Sequential Patterns: 2. Litemset Phase - Cont
In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.
In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.

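The difference between the two support definitions can be sketched as follows. Both function names and the toy database are hypothetical; the point is only that the denominator changes from transactions to customers, and that a customer counts at most once.

```python
def transaction_support(itemset, customer_seqs):
    """Association-rule style: fraction of all transactions containing the itemset."""
    transactions = [t for seq in customer_seqs for t in seq]
    return sum(itemset <= t for t in transactions) / len(transactions)


def customer_support(itemset, customer_seqs):
    """Sequential-pattern style: fraction of customers with the itemset
    inside at least one single transaction."""
    hits = sum(any(itemset <= t for t in seq) for seq in customer_seqs)
    return hits / len(customer_seqs)


# Hypothetical toy database: 5 customers, 10 transactions in total.
CUSTOMERS = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

print(transaction_support({30}, CUSTOMERS))  # 4 of 10 transactions
print(customer_support({30}, CUSTOMERS))     # 4 of 5 customers
```
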
Finding Sequential Patterns: 2. Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers.
By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.

Finding Sequential Patterns: 2. Litemset Phase - Cont
• Example with a minimum support of 40%

Finding Sequential Patterns: 3. Transformation Phase
• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

Finding Sequential Patterns: 3. Transformation Phase - Cont
• Each transaction is replaced by the set of all Litemsets contained in the transaction.
• Transactions with no Litemsets are dropped. (But empty customer sequences still contribute to the total customer count.)
• A customer sequence is now represented by a list of sets of Litemsets.

Finding Sequential Patterns: 3. Transformation Phase - Cont
Note: (10 20) is dropped because of lack of support.
(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).

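The transformation can be sketched as below. The Litemset-to-integer mapping and the toy database are assumptions picked to agree with the notes above ((10 20) and 60 lack support); the function name is hypothetical.

```python
def transform(customer_seqs, litemset_ids):
    """Replace each transaction by the set of ids of the Litemsets it contains."""
    transformed = []
    for seq in customer_seqs:
        new_seq = []
        for transaction in seq:
            ids = {lid for litemset, lid in litemset_ids.items()
                   if litemset <= transaction}
            if ids:  # transactions containing no Litemset are dropped
                new_seq.append(ids)
        # A sequence is kept even when empty, so the customer count is unchanged.
        transformed.append(new_seq)
    return transformed


# Assumed mapping of Litemsets to contiguous integers.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

CUSTOMERS = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

DT = transform(CUSTOMERS, LITEMSET_IDS)
for seq in DT:
    print(seq)
```

After this step a containment test only compares integers, which is the speed-up the Litemset mapping is meant to deliver.
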
Finding Sequential Patterns: 4. Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences

Finding Sequential Patterns: 4. Sequence Phase
• Use the set of Litemsets to find the desired sequences.
• Two families of algorithms are presented:
  o Count-all
  o Count-some

Finding Sequential Patterns: 4. Sequence Phase
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

Finding Sequential Patterns: 4. Sequence Phase
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase (AprioriAll)
L_1 = {large 1-sequences}; // Result of Litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1}
    foreach customer-sequence c in the database do
        Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
end
Answer = Maximal Sequences in ∪_k L_k
Notation: L_k = set of all large k-sequences; C_k = set of candidate k-sequences

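The loop above might look like this in Python. This is a sketch under several assumptions: sequences are tuples of Litemset ids, the database is already transformed (each customer sequence is a list of sets of ids), the support threshold is given as a customer count, and the join keeps self-joins so that a repeated Litemset such as <1 1> can become a candidate. The maximal phase is left out, and the five-customer database is hypothetical.

```python
def contains(customer_seq, candidate):
    """True if the ids in `candidate` occur, in order, in distinct transactions."""
    pos = 0
    for lid in candidate:
        while pos < len(customer_seq) and lid not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def generate(prev_large):
    """Join L_{k-1} with itself, then prune candidates with a non-large (k-1)-subsequence."""
    joined = {p + (q[-1],) for p in prev_large for q in prev_large
              if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev_large for i in range(len(c)))}


def apriori_all(db, large_1, min_count):
    """Return every large sequence (maximal or not) found level by level."""
    large = {1: set(large_1)}
    k = 2
    while large[k - 1]:
        candidates = generate(large[k - 1])
        large[k] = {c for c in candidates
                    if sum(contains(seq, c) for seq in db) >= min_count}
        k += 1
    return set().union(*large.values())


# Hypothetical transformed database; minimum support of 2 customers (40%).
DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
L1 = {(1,), (2,), (3,), (4,), (5,)}

ALL_LARGE = apriori_all(DB, L1, min_count=2)
print(sorted(ALL_LARGE, key=lambda s: (len(s), s)))
```
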
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:
    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ...,
          p.litemset_{k-2} = q.litemset_{k-2};
• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation
Example:
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.

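The join and prune steps can be tried on a small input. The set L_3 below is an assumption, chosen so that <1 2 4 3> appears after the join and disappears in the prune, as in the example; the function names are hypothetical. Self-joins (p = q) are kept in this sketch and are removed by the prune step here.

```python
def join(prev_large):
    """Join step: pair sequences that agree on everything except the last element."""
    return {p + (q[-1],) for p in prev_large for q in prev_large
            if p[:-1] == q[:-1]}


def prune(candidates, prev_large):
    """Prune step: drop candidates having some (k-1)-subsequence outside L_{k-1}."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in prev_large for i in range(len(c)))}


def apriori_generate(prev_large):
    return prune(join(prev_large), prev_large)


# Assumed set of large 3-sequences.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}

JOINED = join(L3)
C4 = apriori_generate(L3)
print(sorted(JOINED))
print(sorted(C4))
```

Dropping one position at a time (`c[:i] + c[i + 1:]`) enumerates exactly the (k-1)-subsequences of a k-sequence, so the prune is a direct transcription of the second bullet.
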
Finding Sequential Patterns: 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.
    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            Delete from S all subsequences of s_k
• The authors claim that data structures and an algorithm exist to do this efficiently (hash-trees), citing two of their earlier papers.

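A naive version of the maximal phase, without the hash-tree machinery the authors refer to, could be sketched as follows. It works on sequences of itemsets (tuples of frozensets) so that containment between itemsets, such as (40) inside (40 70), is detected. The helper names and the input set S are assumptions for illustration, and the loop runs down to k = 1, a small deviation from the pseudocode that also handles 1-sequences contained in other 1-sequences.

```python
def is_subsequence(small, big):
    """True if each itemset of `small` is a subset of a distinct, ordered itemset of `big`."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal_sequences(large_seqs):
    """Delete every sequence that is contained in another sequence of the set."""
    remaining = set(large_seqs)
    longest = max((len(s) for s in remaining), default=0)
    for k in range(longest, 0, -1):
        for s in [x for x in remaining if len(x) == k]:
            remaining -= {t for t in remaining
                          if t != s and is_subsequence(t, s)}
    return remaining


def seq(*itemsets):
    """Convenience constructor: seq({30}, {90}) -> hashable sequence of itemsets."""
    return tuple(frozenset(i) for i in itemsets)


# Assumed set S of large sequences.
S = {
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}), seq({30}, {40, 70}),
}

MAXIMAL = maximal_sequences(S)
for s in MAXIMAL:
    print([sorted(i) for i in s])
```

This is quadratic in the number of large sequences, which is fine for a slide-sized example but is exactly the cost the hash-tree approach is meant to avoid.
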
Finding Sequential Patterns: 4. AprioriAll Example

Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequences to skip.
Let hit_k = |L_k| / |C_k|
(i.e., the ratio of large k-sequences to candidate k-sequences)
function next(k: integer) // k is the length of the sequences counted in the last pass
begin
    if (hit_k < 0.666) return k + 1;
    elseif (hit_k < 0.75) return k + 2;
    elseif (hit_k < 0.80) return k + 3;
    elseif (hit_k < 0.85) return k + 4;
    else return k + 5;
end
• next returns the length of the sequences to count in the next pass.
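
The next function translates almost line for line into Python. In this sketch the hit ratio is computed from two counts passed in explicitly; the name next_length, the parameter names, and the guard for an empty candidate set are assumptions, not part of the slide.

```python
def next_length(k, num_large, num_candidates):
    """Length of the sequences to count in the next pass.

    k is the length counted in the last pass; the hit ratio is
    |L_k| / |C_k|, the fraction of candidates that turned out large.
    """
    if num_candidates == 0:
        return k + 1  # assumption: nothing was counted, so take the smallest step
    hit = num_large / num_candidates
    if hit < 0.666:
        return k + 1
    elif hit < 0.75:
        return k + 2
    elif hit < 0.80:
        return k + 3
    elif hit < 0.85:
        return k + 4
    else:
        return k + 5


print(next_length(2, 9, 25))   # low hit ratio: count the very next length
print(next_length(2, 9, 10))   # high hit ratio: skip several lengths
```

The higher the hit ratio, the more likely longer candidates are also large, so more lengths are skipped.
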

Finding Sequential Patterns: 4. AprioriSome Forward Phase
L_1 = {large 1-sequences}; // Result of Litemset phase
C_1 = L_1;
last = 1; // We last counted C_last
for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
begin
    if (L_{k-1} known) then
        C_k = New candidates generated from L_{k-1};
    else
        C_k = New candidates generated from C_{k-1};
    if (k == next(last)) then begin // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in C_k that are contained in c.
        L_k = Candidates in C_k with minimum support.
        last = k;
    end
end

Finding Sequential Patterns: 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
    if (L_k not found in forward phase) then begin
        Delete all sequences in C_k contained in some L_i, i > k;
        foreach customer-sequence c in D_T do
            Increment the count of all candidates in C_k that are contained in c.
        L_k = Candidates in C_k with minimum support.
    end
    else // L_k already known
        Delete all sequences in L_k contained in some L_i, i > k.
Answer = ∪_k L_k // Maximal Phase not needed
Notation: D_T = transformed database

Finding Sequential Patterns: 4. AprioriSome Example
Minimum Support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
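
The forward and backward phases can be combined into one compact sketch. Everything here is an illustration under assumptions: sequences are tuples of Litemset ids, the database is a hypothetical transformed database of five customers, and the skipping rule is passed in as a function of the last counted length (the demo plugs in the simple rule next(k) = 2k rather than the hit-ratio rule).

```python
def contains(customer_seq, candidate):
    """True if the ids in `candidate` occur, in order, in distinct transactions."""
    pos = 0
    for lid in candidate:
        while pos < len(customer_seq) and lid not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def generate(prev):
    """Join + prune candidate generation from a set of (k-1)-sequences."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def is_subseq(small, big):
    it = iter(big)
    return all(x in it for x in small)  # each `in` consumes the iterator in order


def count_large(db, candidates, min_count):
    return {c for c in candidates
            if sum(contains(seq, c) for seq in db) >= min_count}


def apriori_some(db, large_1, min_count, next_len):
    large, cands = {1: set(large_1)}, {1: set(large_1)}
    last, k = 1, 2
    # Forward phase: generate every C_k, but count only the lengths chosen by next_len.
    while cands[k - 1] and large[last]:
        cands[k] = generate(large[k - 1] if (k - 1) in large else cands[k - 1])
        if k == next_len(last):
            large[k] = count_large(db, cands[k], min_count)
            last = k
        k += 1
    # Backward phase: visit lengths in decreasing order, dropping anything
    # contained in an already accepted longer sequence.
    longer = set()
    for k in range(k - 1, 0, -1):
        source = large[k] if k in large else cands[k]
        kept = {c for c in source if not any(is_subseq(c, s) for s in longer)}
        large[k] = kept if k in large else count_large(db, kept, min_count)
        longer |= large[k]
    return longer


DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
L1 = {(1,), (2,), (3,), (4,), (5,)}

ANSWER = apriori_some(DB, L1, min_count=2, next_len=lambda k: 2 * k)
print(sorted(ANSWER))
```

On this input the sketch counts lengths 2 and 4 in the forward phase and only length 3 in the backward phase, and no separate maximal phase is needed.
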

Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates C_k from L_{k-1} or C_{k-1}.
• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.
• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6 s_3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.
• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L_1 = {large 1-sequences}; // Result of Litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1};
    foreach customer-sequence c in D_T do
        Increment the count of all candidates in C_k that are contained in c.
    L_k = Candidates in C_k with minimum support.
end

Finding Sequential Patterns: 4. DynamicSome Forward Phase
for (k = step; L_k ≠ ∅; k += step) do
begin
    // find L_{k+step} from L_k and L_step
    C_{k+step} = ∅;
    foreach customer-sequence c in D_T do
    begin
        X = otf-generate(L_k, L_step, c);
        foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary).
    end
    L_{k+step} = Candidates in C_{k+step} with minimum support.
end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate
// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c);
forall sequences x ∈ X_k do
    x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> };
X_j = subseq(L_j, c);
forall sequences x ∈ X_j do
    x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> };
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont
Let otf-generate be called with L_2, and let c = <{1} {2} {3 7} {4}>. Thus c_1 = {1}, c_2 = {2}, etc.

Seq     End  Start
<1 2>   2    1
<1 3>   3    1
<1 4>   4    1
<2 3>   3    2
<2 4>   4    2
<3 4>   4    3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.

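The end/start bookkeeping of otf-generate can be sketched as follows. Positions are 1-based to match the table; the helper names and the contents of L_2 are assumptions for illustration.

```python
def earliest_end(x, c):
    """Smallest j such that x is contained in <c_1 ... c_j>, or None if x is not in c."""
    pos = 0
    for lid in x:
        while pos < len(c) and lid not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos  # 1-based index of the transaction matching the last element


def latest_start(x, c):
    """Largest j such that x is contained in <c_j ... c_n>, or None if x is not in c."""
    pos = len(c) - 1
    for lid in reversed(x):
        while pos >= 0 and lid not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # 1-based index of the transaction matching the first element


def otf_generate(large_k, large_j, c):
    """Concatenate x from L_k and y from L_j whenever x ends before y starts in c."""
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_j}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


C = [{1}, {2}, {3, 7}, {4}]
# Assumed set of large 2-sequences.
L2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}

print(earliest_end((1, 2), C), latest_start((3, 4), C))
print(otf_generate(L2, L2, C))
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy end < start, which is why a single candidate comes out of the join.
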
Finding Sequential Patterns: 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
    if (L_k not yet determined) then
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1};
        else
            C_k = New candidates generated from C_{k-1};

Finding Sequential Patterns: 4. DynamicSome Example
• Let step = 2.
  Use L_2 and L_2 as arguments to otf-generate to get C_4.
• This yields 2 candidate sequences, of which only <1 2 3 4> is large:

C_4         Support
<1 2 3 4>   2
<1 3 4 5>   1

L_4         Support
<1 2 3 4>   2

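One pass of the DynamicSome forward phase with step = 2 can be sketched end to end. The transformed database and L_2 below are assumptions chosen to be consistent with the counts in the tables above; all function names are hypothetical.

```python
from collections import Counter


def earliest_end(x, c):
    """1-based index where the earliest occurrence of x in c ends, or None."""
    pos = 0
    for lid in x:
        while pos < len(c) and lid not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def latest_start(x, c):
    """1-based index where the latest occurrence of x in c starts, or None."""
    pos = len(c) - 1
    for lid in reversed(x):
        while pos >= 0 and lid not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2


def otf_generate(large_k, large_step, c):
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_step}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


def forward_pass(db, large_k, large_step, min_count):
    """Build C_{k+step} on the fly while scanning the database, then filter by support."""
    counts = Counter()
    for c in db:
        # Each customer contributes at most 1 to the count of a candidate.
        counts.update(otf_generate(large_k, large_step, c))
    large = {x for x, n in counts.items() if n >= min_count}
    return counts, large


DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
L2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}

C4_COUNTS, L4 = forward_pass(DB, L2, L2, min_count=2)
print(dict(C4_COUNTS))
print(L4)
```

Unlike apriori-generate, the candidates here are only ever created if some customer sequence actually contains both halves, so C_4 never holds a sequence with zero support.
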
Finding Sequential Patterns: 4. DynamicSome Example - Cont
• Pass L_2 and L_4 as arguments to otf-generate to get C_6.
• C_6 is found to be empty: C_6 = ∅.
• In the intermediate phase, C_3 is generated from L_2, and C_5 from L_4, using apriori-generate.
• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.

Performance: Synthetic Data
Performance: Execution Times
Performance: Scale-Up (Customers)
Performance: Scale-Up

Conclusion
• The problem of mining sequential patterns from a customer DB was introduced.
• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.
• AprioriAll and AprioriSome have excellent scale-up properties.

Final Exam Question 1
Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here, because the sequential patterns the algorithms look for are made up of large itemsets, which are the intra-transaction patterns behind association rules.

Final Exam Question 2
What is the major difference between the two algorithms CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every candidate sequence in every pass, but non-maximal sequences must be pruned later, in the maximal phase.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned during the run, but minimum support is not tested at every value of k.)

Final Exam Question 3
Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction with the set of Litemsets it contains, each represented by an integer. Two Litemsets can then be compared for equality in constant time, which reduces the time required to check whether a candidate sequence is contained in a customer sequence.
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
1
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
2
Introduction Bar-code technology allows the collection of
massive amounts of sales data (basket data)
A typical data record consists of transaction date items bought customer-id
3
Introduction - Cont The problem of mining sequential patterns
over this data is introduced
So far we have seen frequent pattern mining
in the context of association rules where we were interested in what items were purchased in the same transaction These are intra-transactional patterns
4
Introduction - Cont
The problem of sequential pattern mining is concerned with inter-transactional patterns
A pattern in the first case consists of a set of unordered items
acdg A pattern in the second case is an ordered list of
sets of items
ltacdggt
5
Introduction - Cont
An example of a sequential pattern
Customers typically rent ldquoStar Warsrdquo then ldquoThe Empire Strikes Backrdquo followed by ldquoReturn of the Jedirdquo
Note that these rentals do not need to be consecutive Customers who rent other videos in between also support
this sequential pattern
6
Introduction - Cont
Elements of a sequential pattern can be sets of items as well For example
ldquoFitted sheet flat sheet and pillow casesrdquo followed by ldquocomforterrdquo followed by ldquodrapes and rufflesrdquo
7
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
8
Problem Description
We are given a database D of customer transactions
Each transaction consists of the fields customer-id transaction-time items purchased in the transaction
9
Problem Description No customer has more than one transaction
with the same transaction-time
Quantities of items bought are not
considered each item is a binary variable representing whether an item was bought or not
10
Problem Description(Terminology and definitions)
Itemset non-empty set of items Each itemset is mapped to an integer
Sequence Ordered list of itemsets
Customer Sequence List of customer transactions ordered by increasing transaction time
A customer supports a sequence if the sequence is contained in the customer-sequence
Support for a Sequence Fraction of total customers that support a sequence
11
Problem Description(Terminology and definitions) - Cont
Maximal Sequence A sequence that is not contained in any other sequence
Large Sequence Sequence that meets minisup
Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
an itemset with minimum support is called a large itemset or Litemset
12
Problem Description(Terminology and definitions) - Cont
Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets
13
Problem Description - Cont
Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support
Each such sequence represents a sequential pattern
14
Problem DescriptionExample
Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal
Seq with minimum support
15
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
16
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases
1 Sort Phase
2 Large itemset (Litemset) Phase
3 Transformation Phase
4 Sequence Phase
5 Maximal Phase
17
Finding Sequential Patterns1 Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor-key
This step implicitly converts the original transaction DB into a DB of customer sequences
Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time
18
Finding Sequential Patterns2 Litemset Phase
In this phase we find the set of all Large itemsets (Litemsets) L
We are also simultaneously finding the set of large 1-sequences since this set is just
lt l gt | l isin L
19
Finding Sequential Patterns2 Litemset Phase - Cont
In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present
In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions
20
Finding Sequential Patterns2 Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers
By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence
21
Finding Sequential Patterns2 Litemset Phase - Cont
bull Example with the minimum support 40
22
Finding Sequential Patterns3 Transformation Phase
bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation
23
Finding Sequential Patterns3 Transformation Phase - Cont
bull Each transaction is replaced by the set of all Litemsets contained in the transaction
bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)
bull A customer sequence is now represented by a list of sets of Litemsets
24
Finding Sequential Patterns3 Transformation Phase - Cont
Note (10 20) dropped because of lack of support
(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)
25
Finding Sequential Patterns4 Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences 26
Finding Sequential Patterns4 Sequence Phase
bull Use the set of Litemsets to find the desired
sequences
bull Two families of algorithms are presented
Count-all
Count-some
27
Finding Sequential Patterns4 Sequence Phase
bull Count-all algorithms count all the large
sequences including non-maximal
sequences which are pruned out in the
maximal phase
28
Finding Sequential Patterns4 Sequence Phase
bull Count-some algorithms try to avoid
counting non-maximal sequences by first
counting longer sequences in a forward
phase then counting the sequences skipped
in a backward phase
29
Finding Sequential Patterns4 Sequence Phase AprioriAll
L1 = large 1-sequences result of Litemset phase
for (k = 2 Lk-1 ne k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in cupk Lk
Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences
30
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns4 AprioriSome
bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip
Let hitk = |Lk| |Ck|
(ie ratio of large k-sequences to candidate k-sequences)
function next(k integer) k is the length of seq counted last pass
beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2
elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5
end
bull next returns the length of sequences to count in the next pass 35
Finding Sequential Patterns4 AprioriSome Forward Phase
L1 = large 1-sequences Result of Litemset phase
C1 = L1
last = 1 We last counted Clast
for (k = 2 Ck-1 ne and Llast ne k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k== next(last) ) then begin (next k to count)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
last = k
end
end36
Finding Sequential Patterns4 AprioriSome Backward Phase
for (k-- kgt=1 k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in some L i igtk
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else Lk already known
Delete all sequences in Lk contained in some Li igtk
Answer = Uk Lk (Maximal Phase not Needed)
Notation DT Transformed database 37
Finding Sequential Patterns4 AprioriSome Example
38
Finding Sequential Patterns4 AprioriSome Example - Cont
Minimum Support = 40 (2 customer sequences)
39
Finding Sequential Patterns4 AprioriSome Example
Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40
Finding Sequential Patterns4 DynamicSome
bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase
bull AprioriSome generates Ck from Lk-1 or Ck-1
bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database
41
Finding Sequential Patterns4 DynamicSome
bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3
bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database
o While generating sequences of length 9 with a step
size 3 While passing the data if sequences s6 isin L6
and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence
42
Finding Sequential Patterns4 DynamicSome
bull In the intermediate phase generate the candidate sequences for the skipped lengths
o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5
bull The backward phase is identical to AprioriSome 43
Finding Sequential Patterns4 DynamicSome Initialization Phase
step is an integer ge 1
L1 = large 1-sequences Result of litemset phase
for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do
begin
Ck = New candidates generated from Lk1048576-1
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
44
Finding Sequential Patterns4 DynamicSome Forward Phase
for ( k = step Lk1048576 ne k+= step ) do
begin
find Lk+step from Lk and Lstep
Ck+step =
foreach customer-sequence c in DT do
begin
X = otf-generate(Lk Lstep c)
foreach sequence x isin Xrsquo increment its count in
Ck+step (adding it to Ck+step if necessary)
end
Lk+step = Candidates in Ck+step with minimum support
end45
Finding Sequential Patterns4 DynamicSome OTF-Generate
c is the customer sequence lt c1c2cn gt
Xk = subseq(Lk c)
forall sequences x isin Xk do
xend = min j | x sube lt c1c2cj gt
Xj = subseq(Lj c)
forall sequences x isin Xj do
xstart = max j | x sube lt cjcj+1cn gt
Answer = join of Xkwith Xj with the join condition
Xkend lt Xjstart46
Finding Sequential Patterns4 DynamicSome OTF-Generate cont
Let otf-generate be called with
L2 and let c = lt1 2 3 7
4gt Thus c1 = 1 c2= 2
etc
Thus the result of the join with the join condition
X2end lt X2start
(where X2 denotes the seq of
length 2) is the single sequence lt1 2 3 4gt
Seq
lt1 2gt
End
2
Start
1
lt1 3gt 3 1
lt1 4gt 4 1
lt2 3gt 3 2
lt2 4gt 4 2
lt3 4gt 4 3
47
Finding Sequential Patterns4 DynamicSome intermediate Phase
for ( k-- k gt 1 k-- ) do
if (Lk not yet determined) then
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
48
Finding Sequential Patterns4 DynamicSome Example
bull Let step = 2
use L2 and L2 as argument in otf-generate to get C4
49
Finding Sequential Patterns4 DynamicSome Example
bull Get 2 candidate sequences
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
50
Finding Sequential Patterns4 DynamicSome Example
bull Only lt1 2 3 4gt is large
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
L4 Minisup
lt1 2 3 4gt 2
51
Finding Sequential Patterns4 DynamicSome Example
bull pass as arg to otf-gen L2 and L4 to get C6
L4 sup
lt1 2 3 4gt 2
52
Finding Sequential Patterns4 DynamicSome Example
bull C6 is found to be empty
L4 sup
lt1 2 3 4gt 2C6 =
53
Finding Sequential Patterns4 DynamicSome Example
bull In the intermediate phase C3 is generated
C3
from L2 and C5 from L4
using apriori-generate
L4 sup
lt1 2 3 4gt 2C5
54
Finding Sequential Patterns4 DynamicSome Example
bull C5 is found to be empty so only C3 is counted during the backward phase to get L3
L3 C3
C5 =
55
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions56
Performance Synthetic Data
57
Performance Execution Times
58
Performance Scale-Up Customers
59
Performance Scale-Up
60
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions61
Conclusion
bull The problem of mining sequential patterns from a customer DB was introduced
bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome
bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup
bull AprioriAll and AprioriSome have excellent scale-up properties 62
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions63
Final Exam Question 1
Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Answer: Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both ideas appear in the Apriori-based algorithms studied here: the Litemset phase finds large itemsets, as in association rule mining, and the sequence phase then looks for sequential patterns whose elements are those large itemsets.
Final Exam Question 2
What is the major difference between the two families of algorithms, CountSome and CountAll?
Answer: CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximal sequences must be identified later.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out at run time, but the minimum support is not tested at all values of k.)
Final Exam Question 3
Why is the Transformation stage of these pattern mining algorithms so important to their speed?
Answer: The transformation replaces each transaction with the set of Litemsets it contains, each represented by an integer, so two Litemsets can be compared for equality in constant time. This speeds up the repeated test of whether a candidate sequence is contained in a customer sequence, which reduces the run time.
Problem Description
We are given a database D of customer transactions.
Each transaction consists of the fields customer-id, transaction-time, and the items purchased in the transaction.
No customer has more than one transaction with the same transaction-time.
Quantities of items bought are not considered: each item is a binary variable representing whether the item was bought or not.
Problem Description (Terminology and definitions)
Itemset: a non-empty set of items. Each itemset is mapped to an integer.
Sequence: an ordered list of itemsets.
Customer sequence: the list of a customer's transactions, ordered by increasing transaction time.
A customer supports a sequence if the sequence is contained in the customer sequence.
Support for a sequence: the fraction of total customers that support the sequence.
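The containment and support definitions above can be sketched in Python. This is a minimal sketch, not the authors' implementation. It assumes the usual reading of "contained": each itemset of the pattern is a subset of a distinct transaction, in order, but not necessarily in consecutive transactions. The names `is_contained` and `support` and the toy customer data are made up for illustration.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[int]


def is_contained(pattern: Sequence[Itemset], customer_seq: Sequence[Itemset]) -> bool:
    """Greedy left-to-right match: each pattern itemset must be a subset
    of some transaction that comes after the previously matched one."""
    pos = 0
    for itemset in pattern:
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next pattern itemset must match a later transaction
    return True


def support(pattern: Sequence[Itemset], customer_seqs: List[List[Itemset]]) -> float:
    """Fraction of customers whose sequence contains the pattern."""
    return sum(is_contained(pattern, c) for c in customer_seqs) / len(customer_seqs)


fs = frozenset
# Hypothetical customer sequences (one list of transactions per customer).
customers = [
    [fs({30}), fs({90})],
    [fs({10, 20}), fs({30}), fs({40, 60, 70})],
    [fs({30, 50, 70})],
    [fs({30}), fs({40, 70}), fs({90})],
    [fs({90})],
]

print(support([fs({30}), fs({90})], customers))
print(support([fs({10, 20}), fs({30})], customers))
```

The greedy scan is sufficient here because matching each itemset at the earliest possible transaction never rules out a later match.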
Problem Description (Terminology and definitions) (cont.)
Maximal sequence: within a set of sequences, a sequence that is not contained in any other sequence of the set.
Large sequence: a sequence that meets the minimum support (minisup).
Length of a sequence: the number of itemsets in the sequence. A sequence of length k is called a k-sequence.
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.
An itemset with minimum support is called a large itemset, or Litemset.
Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.
Problem Description (cont.)
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.
Each such sequence represents a sequential pattern.
Problem Description: Example
[Slide tables not reproduced: the customer sequences and the sequences with minimum support]
Note: with a minimum support of 25%, at least two customers must support a sequence.
< (10 20) (30) > does not have enough support (it is supported only by Customer 2).
< (30) >, < (70) >, < (30) (40) >, … are not maximal.
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases:
1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase
Finding Sequential Patterns: 1. Sort Phase
The database is sorted with customer-id as the major key and transaction-time as the minor key.
This step implicitly converts the original transaction database into a database of customer sequences.
Recall: a customer sequence is the list of a customer's transactions, ordered by increasing transaction time.
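The sort phase can be sketched as a sort followed by a group-by. This is an illustrative sketch only: the `(customer_id, transaction_time, items)` record layout, the function name `to_customer_sequences`, and the sample rows are assumptions, not something the slides specify.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Sort by (customer_id, transaction_time), then group by customer.

    `transactions` is an iterable of (customer_id, transaction_time, items).
    Returns {customer_id: [items of each transaction, in time order]}.
    """
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        cid: [items for _, _, items in group]
        for cid, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical, unsorted transaction records.
rows = [
    (2, 15, {30}),
    (1, 30, {90}),
    (2, 10, {10, 20}),
    (1, 25, {30}),
    (2, 20, {40, 60, 70}),
]
sequences = to_customer_sequences(rows)
print(sequences)
```

Sorting on the key pair only (and not on the item sets) relies on the stated assumption that no customer has two transactions with the same transaction-time.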
Finding Sequential Patterns: 2. Litemset Phase
In this phase we find the set L of all large itemsets (Litemsets).
We are simultaneously finding the set of large 1-sequences, since this set is just { <l> | l ∈ L }.
Finding Sequential Patterns: 2. Litemset Phase (cont.)
In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.
In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
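The difference between the two support definitions can be shown on a small made-up database. The function names and the data below are hypothetical; the sketch only illustrates that a customer is counted once, however many of their transactions contain the itemset.

```python
def customer_support(itemset, customer_seqs):
    """Sequential-pattern support: fraction of customers with at least
    one transaction containing the itemset."""
    hits = sum(any(itemset <= t for t in seq) for seq in customer_seqs)
    return hits / len(customer_seqs)


def transaction_support(itemset, customer_seqs):
    """Association-rule (Apriori) support: fraction of all transactions
    containing the itemset."""
    transactions = [t for seq in customer_seqs for t in seq]
    return sum(itemset <= t for t in transactions) / len(transactions)


# Two hypothetical customers: the first bought {1, 2} together twice.
db = [
    [{1, 2}, {1, 2, 3}],
    [{3}],
]
print(customer_support({1, 2}, db))
print(transaction_support({1, 2}, db))
```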
Finding Sequential Patterns: 2. Litemset Phase (cont.)
The set of Litemsets is mapped to a set of contiguous integers.
By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.
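One way to picture the mapping is a dictionary from each Litemset to a small integer. The sketch below is an assumption about how such a mapping could be built; the slides do not prescribe a particular numbering, and `map_litemsets` and the sample Litemsets are hypothetical.

```python
def map_litemsets(litemsets):
    """Assign contiguous integers 1..n to the Litemsets.

    The order (by size, then by sorted items) is arbitrary; any fixed
    order would do, as long as each Litemset gets a unique integer.
    """
    ordered = sorted(litemsets, key=lambda s: (len(s), sorted(s)))
    return {s: i for i, s in enumerate(ordered, start=1)}


litemsets = [
    frozenset({30}),
    frozenset({40}),
    frozenset({70}),
    frozenset({40, 70}),
    frozenset({90}),
]
ids = map_litemsets(litemsets)
print(ids)
```

After the mapping, comparing two Litemsets is an integer comparison rather than a set comparison.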
Finding Sequential Patterns: 2. Litemset Phase (cont.)
• Example with a minimum support of 40% [slide table not reproduced]
Finding Sequential Patterns: 3. Transformation Phase
• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.
• Each transaction is replaced by the set of all Litemsets contained in the transaction.
• Transactions with no Litemsets are dropped. (Customer sequences that become empty still contribute to the total customer count.)
• A customer sequence is now represented by a list of sets of Litemsets.
[Slide table not reproduced]
Note: (10 20) is dropped because of lack of support.
(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)}; 60 does not have minimum support.
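The transformation described above can be sketched as follows. This is a minimal illustration under assumptions: the Litemset-to-integer table `litemset_ids` is made up (it is consistent with the note that 60 and (10 20) lack support), and `transform` is a hypothetical name.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction by the set of ids of the Litemsets it
    contains, dropping transactions that contain no Litemset."""
    transformed = []
    for transaction in customer_seq:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


fs = frozenset
# Assumed mapping of Litemsets to integers.
litemset_ids = {fs({30}): 1, fs({40}): 2, fs({70}): 3, fs({40, 70}): 4, fs({90}): 5}

customer = [fs({10, 20}), fs({30}), fs({40, 60, 70})]
print(transform(customer, litemset_ids))
```

A customer whose transactions contain no Litemset at all yields an empty list, but should still be counted in the denominator when support is computed.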
Finding Sequential Patterns: 4. Sequence Phase Overview
1. Start from a seed set of large sequences.
2. Create candidate sequences.
3. Scan the data to find the support of the candidate sequences.
4. Determine the large sequences (these become the seed for the next pass).
Finding Sequential Patterns: 4. Sequence Phase
• Use the set of Litemsets to find the desired sequences.
• Two families of algorithms are presented:
    o Count-all
    o Count-some
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.
Finding Sequential Patterns: 4. Sequence Phase: AprioriAll
L_1 = {large 1-sequences}; // Result of Litemset phase
for ( k = 2; L_{k-1} ≠ ∅; k++ ) do
begin
    C_k = New candidates generated from L_{k-1}
    foreach customer-sequence c in the database do
        Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
end
Answer = Maximal sequences in ∪_k L_k
Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.
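The counting step inside the loop (one pass over the transformed database, then a filter on minimum support) can be sketched as below. This is not the authors' hash-tree based implementation; it is a naive version for illustration. The transformed database `db` is made-up data, candidates are tuples of Litemset ids, and `contains` and `large_sequences` are hypothetical names.

```python
def contains(candidate, transformed_seq):
    """True if the Litemset ids in `candidate` occur, in order, in
    distinct transactions of the transformed customer sequence."""
    pos = 0
    for lit in candidate:
        while pos < len(transformed_seq) and lit not in transformed_seq[pos]:
            pos += 1
        if pos == len(transformed_seq):
            return False
        pos += 1
    return True


def large_sequences(candidates, db, min_count):
    """Count each candidate once per supporting customer and keep those
    whose count reaches min_count."""
    counts = {cand: 0 for cand in candidates}
    for customer_seq in db:
        for cand in candidates:
            if contains(cand, customer_seq):
                counts[cand] += 1
    return {cand: n for cand, n in counts.items() if n >= min_count}


# Hypothetical transformed database: each customer is a list of sets of Litemset ids.
db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
C4 = [(1, 2, 3, 4), (1, 3, 4, 5)]
L4 = large_sequences(C4, db, min_count=2)
print(L4)
```

With five customers, a minimum count of 2 corresponds to a minimum support of 40%.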
Finding Sequential Patterns: 4. AprioriAll: Candidate Generation
• The apriori-generate function takes as its argument L_{k-1}, the set of all large (k-1)-sequences. It works as follows. First, join L_{k-1} with L_{k-1}:
insert into C_k
select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
from L_{k-1} p, L_{k-1} q
where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};
• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.
Finding Sequential Patterns: 4. AprioriAll: Candidate Generation (cont.)
Example [slide table not reproduced]
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3.
The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
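The join and prune steps of apriori-generate can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code. Sequences are tuples of Litemset ids, and the set `L3` is an assumed example chosen to be consistent with the pruning of <1 2 4 3> mentioned above. The join follows the condition as written on the slide, so a sequence may be joined with itself.

```python
from itertools import combinations


def apriori_generate(large_prev):
    """Generate candidate k-sequences from the large (k-1)-sequences."""
    large_prev = set(large_prev)
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1

    # Join: p and q agree on their first k-2 Litemsets; append q's last one to p.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune: every (k-1)-subsequence of a candidate must itself be large.
    return {
        c for c in joined
        if all(sub in large_prev for sub in combinations(c, k - 1))
    }


# Assumed set of large 3-sequences.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_generate(L3)
print(C4)
```

`itertools.combinations` preserves element order, so it enumerates exactly the order-preserving subsequences obtained by dropping one element.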
Finding Sequential Patterns: 4. AprioriAll: Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.
for ( k = n; k > 1; k-- ) do
    foreach k-sequence s_k do
        Delete from S all subsequences of s_k
• The authors note that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
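A naive version of the maximal phase can be sketched as follows. It is a quadratic illustration under assumptions, not the hash-tree approach the authors refer to. Sequences are tuples of frozensets, the set `large` is made-up example data, and `is_contained`, `maximal_sequences`, and `seq` are hypothetical names.

```python
def is_contained(small, big):
    """True if each itemset of `small` is a subset of a distinct itemset
    of `big`, in order."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Visit sequences from longest to shortest and delete every other
    sequence contained in the one being visited."""
    remaining = set(large)
    for s in sorted(large, key=len, reverse=True):
        if s in remaining:
            remaining -= {t for t in remaining if t != s and is_contained(t, s)}
    return remaining


def seq(*itemsets):
    """Helper to build a hashable sequence of itemsets."""
    return tuple(frozenset(i) for i in itemsets)


# Hypothetical set of large sequences.
large = {
    seq({30}), seq({40}), seq({70}), seq({90}), seq({40, 70}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}), seq({30}, {40, 70}),
}
answer = maximal_sequences(large)
print(answer)
```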
Finding Sequential Patterns: 4. AprioriAll Example [slide figure not reproduced]
Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequence lengths to skip.
Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences).
function next(k: integer) // k is the length of the sequences counted in the last pass
begin
    if (hit_k < 0.666) return k + 1;
    elsif (hit_k < 0.75) return k + 2;
    elsif (hit_k < 0.80) return k + 3;
    elsif (hit_k < 0.85) return k + 4;
    else return k + 5;
end
• next returns the length of the sequences to count in the next pass.
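The same heuristic can be written directly in Python. In this sketch the hit ratio is passed in as an argument rather than looked up from L_k and C_k; the name `next_length` and that calling convention are assumptions made for illustration.

```python
def next_length(k: int, hit_ratio: float) -> int:
    """Return the sequence length to count in the next pass.

    `k` is the length counted in the last pass and `hit_ratio` is
    |L_k| / |C_k|. The higher the ratio, the more lengths are skipped.
    """
    if hit_ratio < 0.666:
        return k + 1
    elif hit_ratio < 0.75:
        return k + 2
    elif hit_ratio < 0.80:
        return k + 3
    elif hit_ratio < 0.85:
        return k + 4
    else:
        return k + 5


print(next_length(2, 0.5))
print(next_length(2, 0.9))
```

The intuition is that when most candidates turn out to be large, longer sequences are likely to be large too, so counting the intermediate lengths first is probably wasted work on non-maximal sequences.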
Finding Sequential Patterns: 4. AprioriSome: Forward Phase
L_1 = {large 1-sequences}; // Result of Litemset phase
C_1 = L_1;
last = 1; // We last counted C_last
for ( k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++ ) do
begin
    if (L_{k-1} known) then
        C_k = New candidates generated from L_{k-1}
    else
        C_k = New candidates generated from C_{k-1}
    if ( k == next(last) ) then begin // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in C_k that are contained in c
        L_k = Candidates in C_k with minimum support
        last = k
    end
end
Finding Sequential Patterns: 4. AprioriSome: Backward Phase
for ( k--; k >= 1; k-- ) do
    if (L_k not found in forward phase) then begin
        Delete all sequences in C_k contained in some L_i, i > k
        foreach customer-sequence c in D_T do
            Increment the count of all candidates in C_k that are contained in c
        L_k = Candidates in C_k with minimum support
    end
    else // L_k already known
        Delete all sequences in L_k contained in some L_i, i > k
Answer = ∪_k L_k (maximal phase not needed)
Notation: D_T is the transformed database.
Finding Sequential Patterns: 4. AprioriSome Example
[Slide figures not reproduced]
Minimum support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates C_k from L_{k-1} or C_{k-1}.
• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.
Finding Sequential Patterns: 4. DynamicSome (cont.)
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.
• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
    o For example, when generating sequences of length 9 with a step size of 3: while passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.
Finding Sequential Patterns: 4. DynamicSome (cont.)
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
    o For example, if we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.
• The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns: 4. DynamicSome: Initialization Phase
// step is an integer ≥ 1
L_1 = {large 1-sequences}; // Result of Litemset phase
for ( k = 2; k <= step and L_{k-1} ≠ ∅; k++ ) do
begin
    C_k = New candidates generated from L_{k-1}
    foreach customer-sequence c in D_T do
        Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
end
Finding Sequential Patterns: 4. DynamicSome: Forward Phase
for ( k = step; L_k ≠ ∅; k += step ) do
begin
    // find L_{k+step} from L_k and L_step
    C_{k+step} = ∅
    foreach customer-sequence c in D_T do
    begin
        X = otf-generate(L_k, L_step, c)
        foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary)
    end
    L_{k+step} = Candidates in C_{k+step} with minimum support
end
Finding Sequential Patterns: 4. DynamicSome: OTF-Generate
// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c)
forall sequences x ∈ X_k do
    x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> }
X_j = subseq(L_j, c)
forall sequences x ∈ X_j do
    x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> }
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start
Finding Sequential Patterns: 4. DynamicSome: OTF-Generate (cont.)
Let otf-generate be called with L_2 as both sets of large sequences, and let c = < (1) (2) (3 7) (4) >. Thus c_1 = (1), c_2 = (2), etc.
The end and start values of each sequence in L_2 that is contained in c are:
    Seq      End   Start
    <1 2>    2     1
    <1 3>    3     1
    <1 4>    4     1
    <2 3>    3     2
    <2 4>    4     2
    <3 4>    4     3
Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
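The worked example above can be reproduced with a short Python sketch of otf-generate. This is an illustration under assumptions rather than the authors' implementation: sequences are tuples of Litemset ids, the customer sequence is a list of sets of ids, and the helper names `earliest_end` and `latest_start` are hypothetical. Positions are 1-based to match the table.

```python
def earliest_end(x, c):
    """Smallest j such that x is contained in <c_1 ... c_j>, or None."""
    pos = 0
    for lit in x:
        while pos < len(c) and lit not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos  # 1-based index of the transaction matching x's last element


def latest_start(x, c):
    """Largest j such that x is contained in <c_j ... c_n>, or None."""
    pos = len(c) - 1
    for lit in reversed(x):
        while pos >= 0 and lit not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # 1-based index of the transaction matching x's first element


def otf_generate(L_k, L_j, c):
    """Concatenate x from L_k and y from L_j whenever x ends strictly
    before y starts in the customer sequence c."""
    ends = {x: earliest_end(x, c) for x in L_k}
    starts = {y: latest_start(y, c) for y in L_j}
    return {
        x + y
        for x, e in ends.items() if e is not None
        for y, s in starts.items() if s is not None and e < s
    }


L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
c = [{1}, {2}, {3, 7}, {4}]
print(otf_generate(L2, L2, c))
```

Using the earliest end and the latest start gives each pair the best chance to satisfy the non-overlap condition, so no valid concatenation is missed.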
Finding Sequential Patterns4 DynamicSome intermediate Phase
for ( k-- k gt 1 k-- ) do
if (Lk not yet determined) then
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
48
Finding Sequential Patterns4 DynamicSome Example
bull Let step = 2
use L2 and L2 as argument in otf-generate to get C4
49
Finding Sequential Patterns4 DynamicSome Example
bull Get 2 candidate sequences
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
50
Finding Sequential Patterns4 DynamicSome Example
bull Only lt1 2 3 4gt is large
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
L4 Minisup
lt1 2 3 4gt 2
51
Finding Sequential Patterns4 DynamicSome Example
bull pass as arg to otf-gen L2 and L4 to get C6
L4 sup
lt1 2 3 4gt 2
52
Finding Sequential Patterns4 DynamicSome Example
bull C6 is found to be empty
L4 sup
lt1 2 3 4gt 2C6 =
53
Finding Sequential Patterns4 DynamicSome Example
bull In the intermediate phase C3 is generated
C3
from L2 and C5 from L4
using apriori-generate
L4 sup
lt1 2 3 4gt 2C5
54
Finding Sequential Patterns4 DynamicSome Example
bull C5 is found to be empty so only C3 is counted during the backward phase to get L3
L3 C3
C5 =
55
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions56
Performance Synthetic Data
57
Performance Execution Times
58
Performance Scale-Up Customers
59
Performance Scale-Up
60
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions61
Conclusion
bull The problem of mining sequential patterns from a customer DB was introduced
bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome
bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup
bull AprioriAll and AprioriSome have excellent scale-up properties 62
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions63
Final Exam Question 1 Compare and contrast association rules and sequential
patterns How do they relate to each other in the context of the Apriori algorithms
64
Final Exam Question 1 Compare and contrast association rules and sequential
patterns How do they relate to each other in the context of the Apriori algorithms
Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules
65
Final Exam Question 2 What is the major difference between the two algorithms
CountSome and CountAll
66
Final Exam Question 2 What is the major difference between the two algorithms
CountSome and CountAll
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality (The minimum support is checked for each sequence on each run but maximal sequences must be checked for later)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support (Non-maximal sequences are pruned out during runtime but the minimum support is not tested at all values of k) 67
Final Exam Question 3 Why is the Transformation stage of these
pattern mining algorithms so important to their speed
68
Final Exam Question 3 Why is the Transformation stage of these
pattern mining algorithms so important to their speed
The transformation allows each record to be looked up in constant time reducing the run time
69
Introduction Bar-code technology allows the collection of
massive amounts of sales data (basket data)
A typical data record consists of transaction date items bought customer-id
3
Introduction - Cont The problem of mining sequential patterns
over this data is introduced
So far we have seen frequent pattern mining
in the context of association rules where we were interested in what items were purchased in the same transaction These are intra-transactional patterns
4
Introduction - Cont
The problem of sequential pattern mining is concerned with inter-transactional patterns
A pattern in the first case consists of a set of unordered items
acdg A pattern in the second case is an ordered list of
sets of items
ltacdggt
5
Introduction - Cont
An example of a sequential pattern
Customers typically rent ldquoStar Warsrdquo then ldquoThe Empire Strikes Backrdquo followed by ldquoReturn of the Jedirdquo
Note that these rentals do not need to be consecutive Customers who rent other videos in between also support
this sequential pattern
6
Introduction - Cont
Elements of a sequential pattern can be sets of items as well For example
ldquoFitted sheet flat sheet and pillow casesrdquo followed by ldquocomforterrdquo followed by ldquodrapes and rufflesrdquo
7
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
8
Problem Description
We are given a database D of customer transactions
Each transaction consists of the fields customer-id transaction-time items purchased in the transaction
9
Problem Description No customer has more than one transaction
with the same transaction-time
Quantities of items bought are not
considered each item is a binary variable representing whether an item was bought or not
10
Problem Description(Terminology and definitions)
Itemset non-empty set of items Each itemset is mapped to an integer
Sequence Ordered list of itemsets
Customer Sequence List of customer transactions ordered by increasing transaction time
A customer supports a sequence if the sequence is contained in the customer-sequence
Support for a Sequence Fraction of total customers that support a sequence
11
Problem Description(Terminology and definitions) - Cont
Maximal Sequence A sequence that is not contained in any other sequence
Large Sequence Sequence that meets minisup
Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
an itemset with minimum support is called a large itemset or Litemset
12
Problem Description(Terminology and definitions) - Cont
Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets
13
Problem Description - Cont
Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support
Each such sequence represents a sequential pattern
14
Problem DescriptionExample
Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal
Seq with minimum support
15
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
16
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases
1 Sort Phase
2 Large itemset (Litemset) Phase
3 Transformation Phase
4 Sequence Phase
5 Maximal Phase
17
Finding Sequential Patterns1 Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor-key
This step implicitly converts the original transaction DB into a DB of customer sequences
Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time
18
Finding Sequential Patterns2 Litemset Phase
In this phase we find the set of all Large itemsets (Litemsets) L
We are also simultaneously finding the set of large 1-sequences since this set is just
lt l gt | l isin L
19
Finding Sequential Patterns2 Litemset Phase - Cont
In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present
In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions
20
Finding Sequential Patterns2 Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers
By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence
21
Finding Sequential Patterns2 Litemset Phase - Cont
bull Example with the minimum support 40
22
Finding Sequential Patterns3 Transformation Phase
bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation
23
Finding Sequential Patterns3 Transformation Phase - Cont
bull Each transaction is replaced by the set of all Litemsets contained in the transaction
bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)
bull A customer sequence is now represented by a list of sets of Litemsets
24
Finding Sequential Patterns3 Transformation Phase - Cont
Note (10 20) dropped because of lack of support
(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)
25
Finding Sequential Patterns4 Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences 26
Finding Sequential Patterns4 Sequence Phase
bull Use the set of Litemsets to find the desired
sequences
bull Two families of algorithms are presented
Count-all
Count-some
27
Finding Sequential Patterns4 Sequence Phase
bull Count-all algorithms count all the large
sequences including non-maximal
sequences which are pruned out in the
maximal phase
28
Finding Sequential Patterns4 Sequence Phase
bull Count-some algorithms try to avoid
counting non-maximal sequences by first
counting longer sequences in a forward
phase then counting the sequences skipped
in a backward phase
29
Finding Sequential Patterns4 Sequence Phase AprioriAll
L1 = large 1-sequences result of Litemset phase
for (k = 2 Lk-1 ne k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in cupk Lk
Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences
30
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns4 AprioriSome
bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip
Let hitk = |Lk| |Ck|
(ie ratio of large k-sequences to candidate k-sequences)
function next(k integer) k is the length of seq counted last pass
beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2
elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5
end
bull next returns the length of sequences to count in the next pass 35
Finding Sequential Patterns4 AprioriSome Forward Phase
L1 = large 1-sequences Result of Litemset phase
C1 = L1
last = 1 We last counted Clast
for (k = 2 Ck-1 ne and Llast ne k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k== next(last) ) then begin (next k to count)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
last = k
end
end36
Finding Sequential Patterns4 AprioriSome Backward Phase
for (k-- kgt=1 k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in some L i igtk
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else Lk already known
Delete all sequences in Lk contained in some Li igtk
Answer = Uk Lk (Maximal Phase not Needed)
Notation DT Transformed database 37
Finding Sequential Patterns4 AprioriSome Example
38
Finding Sequential Patterns4 AprioriSome Example - Cont
Minimum Support = 40 (2 customer sequences)
39
Finding Sequential Patterns4 AprioriSome Example
Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40
Finding Sequential Patterns4 DynamicSome
bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase
bull AprioriSome generates Ck from Lk-1 or Ck-1
bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database
41
Finding Sequential Patterns4 DynamicSome
bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3
bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database
o While generating sequences of length 9 with a step
size 3 While passing the data if sequences s6 isin L6
and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence
42
Finding Sequential Patterns4 DynamicSome
bull In the intermediate phase generate the candidate sequences for the skipped lengths
o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5
bull The backward phase is identical to AprioriSome 43
Finding Sequential Patterns4 DynamicSome Initialization Phase
step is an integer ge 1
L1 = large 1-sequences Result of litemset phase
for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do
begin
Ck = New candidates generated from Lk1048576-1
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
44
Finding Sequential Patterns4 DynamicSome Forward Phase
for ( k = step Lk1048576 ne k+= step ) do
begin
find Lk+step from Lk and Lstep
Ck+step =
foreach customer-sequence c in DT do
begin
X = otf-generate(Lk Lstep c)
foreach sequence x isin Xrsquo increment its count in
Ck+step (adding it to Ck+step if necessary)
end
Lk+step = Candidates in Ck+step with minimum support
end45
Finding Sequential Patterns4 DynamicSome OTF-Generate
c is the customer sequence lt c1c2cn gt
Xk = subseq(Lk c)
forall sequences x isin Xk do
xend = min j | x sube lt c1c2cj gt
Xj = subseq(Lj c)
forall sequences x isin Xj do
xstart = max j | x sube lt cjcj+1cn gt
Answer = join of Xkwith Xj with the join condition
Xkend lt Xjstart46
Finding Sequential Patterns4 DynamicSome OTF-Generate cont
Let otf-generate be called with
L2 and let c = lt1 2 3 7
4gt Thus c1 = 1 c2= 2
etc
Thus the result of the join with the join condition
X2end lt X2start
(where X2 denotes the seq of
length 2) is the single sequence lt1 2 3 4gt
Seq
lt1 2gt
End
2
Start
1
lt1 3gt 3 1
lt1 4gt 4 1
lt2 3gt 3 2
lt2 4gt 4 2
lt3 4gt 4 3
47
Finding Sequential Patterns4 DynamicSome intermediate Phase
for ( k-- k gt 1 k-- ) do
if (Lk not yet determined) then
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
48
Finding Sequential Patterns4 DynamicSome Example
bull Let step = 2
use L2 and L2 as argument in otf-generate to get C4
49
Finding Sequential Patterns4 DynamicSome Example
bull Get 2 candidate sequences
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
50
Finding Sequential Patterns4 DynamicSome Example
bull Only lt1 2 3 4gt is large
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
L4 Minisup
lt1 2 3 4gt 2
51
Finding Sequential Patterns4 DynamicSome Example
bull pass as arg to otf-gen L2 and L4 to get C6
L4 sup
lt1 2 3 4gt 2
52
Finding Sequential Patterns4 DynamicSome Example
bull C6 is found to be empty
L4 sup
lt1 2 3 4gt 2C6 =
53
Finding Sequential Patterns4 DynamicSome Example
bull In the intermediate phase C3 is generated
C3
from L2 and C5 from L4
using apriori-generate
L4 sup
lt1 2 3 4gt 2C5
54
Finding Sequential Patterns4 DynamicSome Example
bull C5 is found to be empty so only C3 is counted during the backward phase to get L3
L3 C3
C5 =
55
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions
[Figure: Performance, Synthetic Data]
[Figure: Performance, Execution Times]
[Figure: Performance, Scale-Up with Number of Customers]
[Figure: Performance, Scale-Up]
Conclusion
• The problem of mining sequential patterns from a customer DB was introduced
• Two types of algorithms were introduced to find sequential patterns
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup
• AprioriAll and AprioriSome have excellent scale-up properties
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. The Apriori-style algorithms studied here draw on both: the Litemset phase finds the large itemsets (the intra-transaction patterns from which association rules are built), and the sequence phase then looks for sequential patterns, each of which is an ordered list of those Litemsets.
Final Exam Question 2: What is the major difference between the two families of algorithms, CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality: minimum support is checked for every candidate sequence on every pass, but non-maximal sequences must be pruned out afterwards in the maximal phase.
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support: non-maximal sequences are pruned out while the algorithm runs, but candidates of some lengths k are not counted against minimum support in the forward phase.
Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction with the set of Litemsets it contains, each mapped to an integer. Two Litemsets can then be compared for equality in constant time, which speeds up the test, repeated on every pass, of whether a candidate sequence is contained in a customer sequence.
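The effect of the transformation on the containment test can be sketched in Python. This is an illustration under assumptions: the integer ids assigned to the Litemsets and the function names transform and contains are hypothetical, and the sample customer is built from the itemsets mentioned on the transformation slides.

```python
def transform(transactions, litemset_ids):
    """Replace each transaction by the ids of the Litemsets it contains.

    Transactions that contain no Litemset are dropped.
    """
    transformed = []
    for items in transactions:
        ids = {lid for litemset, lid in litemset_ids.items() if litemset <= items}
        if ids:
            transformed.append(ids)
    return transformed


def contains(candidate, customer_seq):
    """True if the candidate (tuple of ids) occurs, in order, in customer_seq.

    Each id must be matched in a different transaction, so at most one
    element of the candidate is consumed per transaction.
    """
    i = 0
    for element in customer_seq:
        if i < len(candidate) and candidate[i] in element:  # O(1) set lookup
            i += 1
    return i == len(candidate)


# Hypothetical id assignment for the Litemsets (30), (40), (70), (40 70)
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
}
customer = [{10, 20}, {30}, {40, 60, 70}]
seq = transform(customer, litemset_ids)
print(seq)                    # [{1}, {2, 3, 4}]
print(contains((1, 4), seq))  # True
print(contains((4, 1), seq))  # False
```

After the transformation, each step of the containment test is a single integer membership check instead of a subset comparison between itemsets.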
Introduction - Cont
The problem of sequential pattern mining is concerned with inter-transactional patterns
A pattern in the first case consists of a set of unordered items
acdg A pattern in the second case is an ordered list of
sets of items
ltacdggt
5
Introduction - Cont
An example of a sequential pattern
Customers typically rent ldquoStar Warsrdquo then ldquoThe Empire Strikes Backrdquo followed by ldquoReturn of the Jedirdquo
Note that these rentals do not need to be consecutive Customers who rent other videos in between also support
this sequential pattern
6
Introduction - Cont
Elements of a sequential pattern can be sets of items as well For example
ldquoFitted sheet flat sheet and pillow casesrdquo followed by ldquocomforterrdquo followed by ldquodrapes and rufflesrdquo
7
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
8
Problem Description
We are given a database D of customer transactions
Each transaction consists of the fields customer-id transaction-time items purchased in the transaction
9
Problem Description No customer has more than one transaction
with the same transaction-time
Quantities of items bought are not
considered each item is a binary variable representing whether an item was bought or not
10
Problem Description(Terminology and definitions)
Itemset non-empty set of items Each itemset is mapped to an integer
Sequence Ordered list of itemsets
Customer Sequence List of customer transactions ordered by increasing transaction time
A customer supports a sequence if the sequence is contained in the customer-sequence
Support for a Sequence Fraction of total customers that support a sequence
11
Problem Description(Terminology and definitions) - Cont
Maximal Sequence A sequence that is not contained in any other sequence
Large Sequence Sequence that meets minisup
Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
an itemset with minimum support is called a large itemset or Litemset
12
Problem Description(Terminology and definitions) - Cont
Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets
13
Problem Description - Cont
Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support
Each such sequence represents a sequential pattern
14
Problem DescriptionExample
Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal
Seq with minimum support
15
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
16
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases
1 Sort Phase
2 Large itemset (Litemset) Phase
3 Transformation Phase
4 Sequence Phase
5 Maximal Phase
17
Finding Sequential Patterns1 Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor-key
This step implicitly converts the original transaction DB into a DB of customer sequences
Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time
18
Finding Sequential Patterns2 Litemset Phase
In this phase we find the set of all Large itemsets (Litemsets) L
We are also simultaneously finding the set of large 1-sequences since this set is just
lt l gt | l isin L
19
Finding Sequential Patterns2 Litemset Phase - Cont
In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present
In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions
20
Finding Sequential Patterns2 Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers
By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence
21
Finding Sequential Patterns2 Litemset Phase - Cont
bull Example with the minimum support 40
22
Finding Sequential Patterns3 Transformation Phase
bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation
23
Finding Sequential Patterns3 Transformation Phase - Cont
bull Each transaction is replaced by the set of all Litemsets contained in the transaction
bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)
bull A customer sequence is now represented by a list of sets of Litemsets
24
Finding Sequential Patterns3 Transformation Phase - Cont
Note (10 20) dropped because of lack of support
(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)
25
Finding Sequential Patterns4 Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences 26
Finding Sequential Patterns4 Sequence Phase
bull Use the set of Litemsets to find the desired
sequences
bull Two families of algorithms are presented
Count-all
Count-some
27
Finding Sequential Patterns4 Sequence Phase
bull Count-all algorithms count all the large
sequences including non-maximal
sequences which are pruned out in the
maximal phase
28
Finding Sequential Patterns4 Sequence Phase
bull Count-some algorithms try to avoid
counting non-maximal sequences by first
counting longer sequences in a forward
phase then counting the sequences skipped
in a backward phase
29
Finding Sequential Patterns4 Sequence Phase AprioriAll
L1 = large 1-sequences result of Litemset phase
for (k = 2 Lk-1 ne k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in cupk Lk
Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences
30
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns4 AprioriSome
bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip
Let hitk = |Lk| |Ck|
(ie ratio of large k-sequences to candidate k-sequences)
function next(k integer) k is the length of seq counted last pass
beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2
elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5
end
bull next returns the length of sequences to count in the next pass 35
Finding Sequential Patterns4 AprioriSome Forward Phase
L1 = large 1-sequences Result of Litemset phase
C1 = L1
last = 1 We last counted Clast
for (k = 2 Ck-1 ne and Llast ne k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k== next(last) ) then begin (next k to count)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
last = k
end
end36
Finding Sequential Patterns4 AprioriSome Backward Phase
for (k-- kgt=1 k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in some L i igtk
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else Lk already known
Delete all sequences in Lk contained in some Li igtk
Answer = Uk Lk (Maximal Phase not Needed)
Notation DT Transformed database 37
Finding Sequential Patterns4 AprioriSome Example
38
Finding Sequential Patterns4 AprioriSome Example - Cont
Minimum Support = 40 (2 customer sequences)
39
Finding Sequential Patterns4 AprioriSome Example
Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40
Finding Sequential Patterns4 DynamicSome
bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase
bull AprioriSome generates Ck from Lk-1 or Ck-1
bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database
41
Finding Sequential Patterns4 DynamicSome
bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3
bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database
o While generating sequences of length 9 with a step
size 3 While passing the data if sequences s6 isin L6
and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence
42
Finding Sequential Patterns4 DynamicSome
bull In the intermediate phase generate the candidate sequences for the skipped lengths
o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5
bull The backward phase is identical to AprioriSome 43
Finding Sequential Patterns4 DynamicSome Initialization Phase
step is an integer ge 1
L1 = large 1-sequences Result of litemset phase
for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do
begin
Ck = New candidates generated from Lk1048576-1
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
44
Finding Sequential Patterns4 DynamicSome Forward Phase
for ( k = step Lk1048576 ne k+= step ) do
begin
find Lk+step from Lk and Lstep
Ck+step =
foreach customer-sequence c in DT do
begin
X = otf-generate(Lk Lstep c)
foreach sequence x isin Xrsquo increment its count in
Ck+step (adding it to Ck+step if necessary)
end
Lk+step = Candidates in Ck+step with minimum support
end45
Finding Sequential Patterns4 DynamicSome OTF-Generate
c is the customer sequence lt c1c2cn gt
Xk = subseq(Lk c)
forall sequences x isin Xk do
xend = min j | x sube lt c1c2cj gt
Xj = subseq(Lj c)
forall sequences x isin Xj do
xstart = max j | x sube lt cjcj+1cn gt
Answer = join of Xkwith Xj with the join condition
Xkend lt Xjstart46
Finding Sequential Patterns4 DynamicSome OTF-Generate cont
Let otf-generate be called with
L2 and let c = lt1 2 3 7
4gt Thus c1 = 1 c2= 2
etc
Thus the result of the join with the join condition
X2end lt X2start
(where X2 denotes the seq of
length 2) is the single sequence lt1 2 3 4gt
Seq
lt1 2gt
End
2
Start
1
lt1 3gt 3 1
lt1 4gt 4 1
lt2 3gt 3 2
lt2 4gt 4 2
lt3 4gt 4 3
47
Finding Sequential Patterns4 DynamicSome intermediate Phase
for ( k-- k gt 1 k-- ) do
if (Lk not yet determined) then
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
48
Finding Sequential Patterns: 4. DynamicSome Example
• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.
• This yields 2 candidate sequences:
C_4          Support
<1 2 3 4>    2
<1 3 4 5>    1
• Only <1 2 3 4> is large:
L_4          Support
<1 2 3 4>    2
• Pass L_2 and L_4 as arguments to otf-generate to get C_6.
• C_6 is found to be empty.
• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4 using apriori-generate.
• C_5 is found to be empty, so only C_3 is counted during the backward phase to get L_3.
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions
Performance: Synthetic Data
Performance: Execution Times
Performance: Scale-Up (Customers)
Performance: Scale-Up
Conclusion
• The problem of mining sequential patterns from a customer DB was introduced.
• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.
• AprioriAll and AprioriSome have excellent scale-up properties.
Final Exam Question 1
Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules describe intra-transaction patterns, while sequential patterns describe inter-transaction patterns. The two are related in the Apriori-based algorithms studied here because every element of a sequential pattern is a large itemset, found in the same way as the frequent itemsets behind association rules, except that support is counted per customer rather than per transaction.
Final Exam Question 2
What is the major difference between the two algorithms CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality: minimum support is checked for every candidate sequence on each pass, but non-maximal sequences must be pruned afterwards in the maximal phase.
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support: non-maximal sequences are pruned out during the run, but minimum support is not tested at all values of k.
Final Exam Question 3
Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction with the set of Litemsets it contains, each mapped to an integer, so two Litemsets can be compared for equality in constant time. This reduces the time needed to check whether a candidate sequence is contained in a customer sequence, a test that is repeated on every pass over the data.
Problem Description
We are given a database D of customer transactions.
Each transaction consists of the fields: customer-id, transaction-time, and the items purchased in the transaction.
Problem Description
No customer has more than one transaction with the same transaction-time.
Quantities of items bought are not considered: each item is a binary variable representing whether an item was bought or not.
Problem Description (Terminology and definitions)
Itemset: a non-empty set of items. Each itemset is mapped to an integer.
Sequence: an ordered list of itemsets.
Customer Sequence: the list of a customer's transactions, ordered by increasing transaction time.
A customer supports a sequence if the sequence is contained in the customer-sequence.
Support for a Sequence: the fraction of total customers that support the sequence.
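The containment and support definitions can be sketched in Python. The slides do not spell out the matching rule, so this sketch assumes that a sequence is contained in a customer sequence when its itemsets can be matched, in order, to distinct transactions that include them. The data and the function names is_contained and support are illustrative only.

```python
def is_contained(seq, customer_seq):
    """True if seq's itemsets match, in order, distinct transactions that include them."""
    i = 0
    for transaction in customer_seq:
        if i < len(seq) and seq[i] <= transaction:
            i += 1
    return i == len(seq)


def support(seq, customer_seqs):
    """Fraction of customers whose customer sequence contains seq."""
    supporters = sum(is_contained(seq, cs) for cs in customer_seqs)
    return supporters / len(customer_seqs)


# Illustrative data: one time-ordered list of transactions (sets of items) per customer.
customers = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
pattern = [{30}, {90}]
print(support(pattern, customers))  # 0.4
```

Customers 1 and 4 support < (30) (90) >. Customer 4 still counts even though another transaction falls between the two purchases, because the matched transactions need not be consecutive.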
Problem Description (Terminology and definitions) - Cont
Maximal Sequence: a sequence that is not contained in any other sequence of the set under consideration.
Large Sequence: a sequence that meets minisup (the minimum support).
Length of a sequence: the number of itemsets in the sequence. A sequence of length k is called a k-sequence.
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.
An itemset with minimum support is called a large itemset, or Litemset.
Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.
Problem Description - Cont
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.
Each such maximal sequence represents a sequential pattern.
Problem Description: Example
Note: with a minisup of 25%, no fewer than two customers must support a sequence.
< (10 20) (30) > does not have enough support (it is supported only by Customer 2).
< (30) >, < (70) >, < (30) (40) >, ... are not maximal.
Sequences with minimum support
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases:
1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase
Finding Sequential Patterns: 1. Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor key.
This step implicitly converts the original transaction DB into a DB of customer sequences.
Recall: a Customer Sequence is a list of customer transactions ordered by increasing transaction time.
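A small Python sketch of the sort phase follows. The record layout (customer_id, transaction_time, items) and the sample values are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical transaction records: (customer_id, transaction_time, items).
transactions = [
    (2, 15, {30}),
    (1, 30, {90}),
    (2, 10, {10, 20}),
    (1, 25, {30}),
    (2, 20, {40, 60, 70}),
]

# Sort with customer-id as the major key and transaction-time as the minor key.
ordered = sorted(transactions, key=itemgetter(0, 1))

# Each run of equal customer ids is now one customer sequence.
customer_sequences = {
    cid: [items for _, _, items in group]
    for cid, group in groupby(ordered, key=itemgetter(0))
}
print(customer_sequences)
```

Grouping after the sort is what makes the conversion "implicit": no extra pass is needed to order each customer's transactions.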
Finding Sequential Patterns: 2. Litemset Phase
In this phase we find the set of all large itemsets (Litemsets), L.
We are also simultaneously finding the set of large 1-sequences, since this set is just
{ <l> | l ∈ L }
Finding Sequential Patterns: 2. Litemset Phase - Cont
In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.
In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
Finding Sequential Patterns: 2. Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers.
By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.
Finding Sequential Patterns: 2. Litemset Phase - Cont
• Example with a minimum support of 40%
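The customer-based support count and the integer mapping can be sketched as below. This is a brute-force illustration that enumerates every subset of every transaction, not the Apriori-style search the authors use. The data, the function name large_itemsets and the particular numbering are assumptions.

```python
from itertools import combinations


def large_itemsets(customer_seqs, min_support):
    """Itemsets bought, within a single transaction, by at least min_support of customers."""
    counts = {}
    for cs in customer_seqs:
        seen = set()  # a customer counts at most once per itemset
        for transaction in cs:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    n = len(customer_seqs)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Illustrative customer sequences (lists of transactions, each a set of items).
customers = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
litemsets = large_itemsets(customers, 0.4)

# Map the Litemsets to contiguous integers (any fixed numbering would do).
litemset_ids = {
    s: i
    for i, s in enumerate(sorted(litemsets, key=lambda s: (len(s), s)), start=1)
}
print(litemset_ids)
# {(30,): 1, (40,): 2, (70,): 3, (90,): 4, (40, 70): 5}
```

Note that item 30 appears in four customer sequences but would be counted once per customer even if a customer bought it in several transactions.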
Finding Sequential Patterns: 3. Transformation Phase
• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.
Finding Sequential Patterns: 3. Transformation Phase - Cont
• Each transaction is replaced by the set of all Litemsets contained in the transaction.
• Transactions with no Litemsets are dropped. (But empty customer sequences still contribute to the total customer count.)
• A customer sequence is now represented by a list of sets of Litemsets.
Finding Sequential Patterns: 3. Transformation Phase - Cont
Note: (10 20) is dropped because of lack of support.
(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
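The transformation of one customer sequence can be sketched as follows. The Litemset-to-integer mapping shown is an assumed numbering chosen for illustration, and transform is a hypothetical helper name.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction by the ids of the Litemsets it contains; drop empty ones."""
    transformed = []
    for transaction in customer_seq:
        present = {i for s, i in litemset_ids.items() if s <= transaction}
        if present:
            transformed.append(present)
    return transformed


# Assumed Litemset-to-integer mapping (any contiguous numbering would do).
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
customer = [{10, 20}, {30}, {40, 60, 70}]
result = transform(customer, litemset_ids)
print([sorted(t) for t in result])  # [[1], [2, 3, 4]]
```

The transaction (10 20) disappears because it contains no Litemset, and (40 60 70) becomes the ids of (40), (70) and (40 70), matching the note on the slide.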
Finding Sequential Patterns: 4. Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences
Finding Sequential Patterns: 4. Sequence Phase
• Use the set of Litemsets to find the desired sequences.
• Two families of algorithms are presented:
  o Count-all
  o Count-some
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.
Finding Sequential Patterns: 4. Sequence Phase: AprioriAll
L_1 = {large 1-sequences}   // result of Litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
    C_k = new candidates generated from L_{k-1}
    foreach customer-sequence c in the database do
        increment the count of all candidates in C_k that are contained in c
    L_k = candidates in C_k with minimum support
end
Answer = maximal sequences in ∪_k L_k
Notation: L_k: set of all large k-sequences; C_k: set of candidate k-sequences
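The AprioriAll loop can be sketched in Python under some stated simplifications. The database is assumed to be already transformed (each transaction is a set of Litemset ids), support is given as an absolute customer count, and candidate generation is a deliberately naive stand-in for apriori-generate: every large (k-1)-sequence is extended by every large Litemset. That produces a superset of the proper candidates, so the large sequences found are the same, only with more counting work. The maximal phase is not included.

```python
def contains(seq, customer_seq):
    """True if the Litemset ids in seq occur, in order, in distinct transactions."""
    i = 0
    for transaction in customer_seq:
        if i < len(seq) and seq[i] in transaction:
            i += 1
    return i == len(seq)


def large_among(candidates, db, min_count):
    """Candidates supported by at least min_count customer sequences."""
    return {s for s in candidates if sum(contains(s, c) for c in db) >= min_count}


def apriori_all_large(db, min_count):
    """All large sequences (before the maximal phase) as tuples of Litemset ids."""
    ids = {i for c in db for t in c for i in t}
    L = [large_among({(i,) for i in ids}, db, min_count)]  # L[0] holds L_1
    while L[-1]:
        # Naive stand-in for apriori-generate (assumption, see lead-in).
        C_k = {p + q for p in L[-1] for q in L[0]}
        L.append(large_among(C_k, db, min_count))
    return set().union(*L)


# Hypothetical transformed database with four customers.
db = [
    [{1}, {2}, {3}],
    [{1}, {2}],
    [{1}, {3}],
    [{2}, {3}],
]
large = apriori_all_large(db, min_count=2)
print(sorted(large, key=lambda s: (len(s), s)))
# [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3)]
```

Here <1 2 3> is a candidate on the third pass but is supported by only one customer, so L_3 is empty and the loop stops.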
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:
    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2}
• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.
Finding Sequential Patterns: 4. AprioriAll Candidate Generation - Cont
Example
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
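The join and prune steps can be sketched in Python. Sequences are tuples of Litemset ids. The set L3 below is an assumption: an illustrative input chosen to be consistent with the pruning of <1 2 4 3> described above, since the example table itself is not reproduced in the text.

```python
from itertools import combinations


def apriori_generate(L_prev):
    """Join L_{k-1} with itself on the first k-2 Litemsets, then prune."""
    L_prev = set(L_prev)
    # Join: p and q agree on everything except possibly the last Litemset.
    joined = {p + (q[-1],) for p in L_prev for q in L_prev if p[:-1] == q[:-1]}
    # Prune: every (k-1)-subsequence of a candidate must itself be large.
    return {
        c for c in joined
        if all(sub in L_prev for sub in combinations(c, len(c) - 1))
    }


L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_generate(L3)
print(C4)  # {(1, 2, 3, 4)}
```

The join as written has no condition that p and q differ, so it also produces candidates such as <1 2 3 3>. With this input they are removed by the prune step, because <2 3 3> is not in L_3.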
Finding Sequential Patterns: 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.
    for ( k = n; k > 1; k-- ) do
        foreach k-sequence s_k do
            delete from S all subsequences of s_k
• The authors claim that data structures and an algorithm exist to do this efficiently (hash-trees), citing two of their earlier papers.
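As a rough illustration of what the maximal phase computes, the sketch below applies the definition of a maximal sequence directly with a quadratic filter. It is not the hash-tree approach the authors refer to, and the input set S is an assumed example. Sequences are tuples of frozensets so that itemset containment such as (40) within (40 70) is respected.

```python
def is_contained(a, b):
    """True if sequence a is contained in sequence b (itemsets matched in order by subset)."""
    i = 0
    for element in b:
        if i < len(a) and a[i] <= element:
            i += 1
    return i == len(a)


def maximal_sequences(S):
    """Keep the sequences of S that are not contained in any other sequence of S."""
    return {s for s in S if not any(s != t and is_contained(s, t) for t in S)}


def seq(*itemsets):
    """Hypothetical helper: build a hashable sequence from tuples of items."""
    return tuple(frozenset(i) for i in itemsets)


# Assumed set of large sequences, for illustration only.
S = {
    seq((30,)), seq((40,)), seq((70,)), seq((90,)), seq((40, 70)),
    seq((30,), (40,)), seq((30,), (70,)), seq((30,), (90,)),
    seq((30,), (40, 70)),
}
answer = maximal_sequences(S)
for s in sorted(answer, key=lambda s: [sorted(e) for e in s]):
    print([sorted(e) for e in s])
# [[30], [40, 70]]
# [[30], [90]]
```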
Finding Sequential Patterns: 4. AprioriAll Example
Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequences to skip.
Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences)
function next(k: integer)   // k is the length of the sequences counted in the last pass
begin
    if (hit_k < 0.666) return k + 1;
    elsif (hit_k < 0.75) return k + 2;
    elsif (hit_k < 0.80) return k + 3;
    elsif (hit_k < 0.85) return k + 4;
    else return k + 5;
end
• next returns the length of the sequences to count in the next pass.
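The same skipping rule, written as a small Python function. Passing hit_k as an explicit argument and the name next_length are choices made for this sketch. The pseudocode above reads hit_k from the state of the last counted pass.

```python
def next_length(k, hit_k):
    """Length to count next, given hit_k = |L_k| / |C_k| for the pass just counted."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    return k + 5


print(next_length(2, 0.5))  # 3
print(next_length(2, 0.7))  # 4
print(next_length(2, 0.9))  # 7
```

The higher the fraction of candidates that turned out to be large, the more lengths are skipped, on the expectation that longer sequences are also likely to be large.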
Finding Sequential Patterns: 4. AprioriSome Forward Phase
L_1 = {large 1-sequences}   // result of Litemset phase
C_1 = L_1
last = 1   // we last counted C_last
for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
begin
    if (L_{k-1} known) then
        C_k = new candidates generated from L_{k-1}
    else
        C_k = new candidates generated from C_{k-1}
    if (k == next(last)) then begin   // next k to count
        foreach customer-sequence c in the database do
            increment the count of all candidates in C_k that are contained in c
        L_k = candidates in C_k with minimum support
        last = k
    end
end
Finding Sequential Patterns: 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
    if (L_k not found in forward phase) then begin
        delete all sequences in C_k contained in some L_i, i > k
        foreach customer-sequence c in D_T do
            increment the count of all candidates in C_k that are contained in c
        L_k = candidates in C_k with minimum support
    end
    else   // L_k already known
        delete all sequences in L_k contained in some L_i, i > k
Answer = ∪_k L_k   // maximal phase not needed
Notation: D_T: transformed database
Finding Sequential Patterns: 4. AprioriSome Example
Minimum Support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates C_k from L_{k-1} or C_{k-1}.
• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.
Finding Sequential Patterns: 4. DynamicSome - Cont
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.
• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6 s_3> (s_6 followed by s_3) is a candidate (6+3)-sequence.
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.
• The backward phase is identical to AprioriSome.
Finding Sequential Patterns: 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L_1 = {large 1-sequences}   // result of Litemset phase
for ( k = 2; k <= step and L_{k-1} ≠ ∅; k++ ) do
begin
    C_k = new candidates generated from L_{k-1}
    foreach customer-sequence c in D_T do
        increment the count of all candidates in C_k that are contained in c
    L_k = candidates in C_k with minimum support
end
Introduction - Cont
Elements of a sequential pattern can be sets of items as well For example
ldquoFitted sheet flat sheet and pillow casesrdquo followed by ldquocomforterrdquo followed by ldquodrapes and rufflesrdquo
7
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
8
Problem Description
We are given a database D of customer transactions
Each transaction consists of the fields customer-id transaction-time items purchased in the transaction
9
Problem Description No customer has more than one transaction
with the same transaction-time
Quantities of items bought are not
considered each item is a binary variable representing whether an item was bought or not
10
Problem Description(Terminology and definitions)
Itemset non-empty set of items Each itemset is mapped to an integer
Sequence Ordered list of itemsets
Customer Sequence List of customer transactions ordered by increasing transaction time
A customer supports a sequence if the sequence is contained in the customer-sequence
Support for a Sequence Fraction of total customers that support a sequence
11
Problem Description(Terminology and definitions) - Cont
Maximal Sequence A sequence that is not contained in any other sequence
Large Sequence Sequence that meets minisup
Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
an itemset with minimum support is called a large itemset or Litemset
12
Problem Description(Terminology and definitions) - Cont
Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets
13
Problem Description - Cont
Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support
Each such sequence represents a sequential pattern
14
Problem DescriptionExample
Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal
Seq with minimum support
15
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
16
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases
1 Sort Phase
2 Large itemset (Litemset) Phase
3 Transformation Phase
4 Sequence Phase
5 Maximal Phase
17
Finding Sequential Patterns1 Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor-key
This step implicitly converts the original transaction DB into a DB of customer sequences
Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time
18
Finding Sequential Patterns2 Litemset Phase
In this phase we find the set of all Large itemsets (Litemsets) L
We are also simultaneously finding the set of large 1-sequences since this set is just
lt l gt | l isin L
19
Finding Sequential Patterns2 Litemset Phase - Cont
In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present
In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions
20
Finding Sequential Patterns2 Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers
By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence
21
Finding Sequential Patterns2 Litemset Phase - Cont
bull Example with the minimum support 40
22
Finding Sequential Patterns3 Transformation Phase
bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation
23
Finding Sequential Patterns3 Transformation Phase - Cont
bull Each transaction is replaced by the set of all Litemsets contained in the transaction
bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)
bull A customer sequence is now represented by a list of sets of Litemsets
24
Finding Sequential Patterns3 Transformation Phase - Cont
Note (10 20) dropped because of lack of support
(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)
25
Finding Sequential Patterns4 Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences 26
Finding Sequential Patterns4 Sequence Phase
bull Use the set of Litemsets to find the desired
sequences
bull Two families of algorithms are presented
Count-all
Count-some
27
Finding Sequential Patterns4 Sequence Phase
bull Count-all algorithms count all the large
sequences including non-maximal
sequences which are pruned out in the
maximal phase
28
Finding Sequential Patterns4 Sequence Phase
bull Count-some algorithms try to avoid
counting non-maximal sequences by first
counting longer sequences in a forward
phase then counting the sequences skipped
in a backward phase
29
Finding Sequential Patterns4 Sequence Phase AprioriAll
L1 = large 1-sequences result of Litemset phase
for (k = 2 Lk-1 ne k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in cupk Lk
Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences
30
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns4 AprioriSome
bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip
Let hitk = |Lk| |Ck|
(ie ratio of large k-sequences to candidate k-sequences)
function next(k integer) k is the length of seq counted last pass
beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2
elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5
end
bull next returns the length of sequences to count in the next pass 35
Finding Sequential Patterns4 AprioriSome Forward Phase
L1 = large 1-sequences Result of Litemset phase
C1 = L1
last = 1 We last counted Clast
for (k = 2 Ck-1 ne and Llast ne k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k== next(last) ) then begin (next k to count)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
last = k
end
end36
Finding Sequential Patterns4 AprioriSome Backward Phase
for (k-- kgt=1 k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in some L i igtk
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else Lk already known
Delete all sequences in Lk contained in some Li igtk
Answer = Uk Lk (Maximal Phase not Needed)
Notation DT Transformed database 37
Finding Sequential Patterns4 AprioriSome Example
38
Finding Sequential Patterns4 AprioriSome Example - Cont
Minimum Support = 40 (2 customer sequences)
39
Finding Sequential Patterns4 AprioriSome Example
Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40
Finding Sequential Patterns4 DynamicSome
bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase
bull AprioriSome generates Ck from Lk-1 or Ck-1
bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database
41
Finding Sequential Patterns4 DynamicSome
bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3
bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database
o While generating sequences of length 9 with a step
size 3 While passing the data if sequences s6 isin L6
and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence
42
Finding Sequential Patterns4 DynamicSome
bull In the intermediate phase generate the candidate sequences for the skipped lengths
o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5
bull The backward phase is identical to AprioriSome 43
Finding Sequential Patterns4 DynamicSome Initialization Phase
step is an integer ge 1
L1 = large 1-sequences Result of litemset phase
for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do
begin
Ck = New candidates generated from Lk1048576-1
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
44
Finding Sequential Patterns4 DynamicSome Forward Phase
for ( k = step Lk1048576 ne k+= step ) do
begin
find Lk+step from Lk and Lstep
Ck+step =
foreach customer-sequence c in DT do
begin
X = otf-generate(Lk Lstep c)
foreach sequence x isin Xrsquo increment its count in
Ck+step (adding it to Ck+step if necessary)
end
Lk+step = Candidates in Ck+step with minimum support
end45
Finding Sequential Patterns4 DynamicSome OTF-Generate
c is the customer sequence lt c1c2cn gt
Xk = subseq(Lk c)
forall sequences x isin Xk do
xend = min j | x sube lt c1c2cj gt
Xj = subseq(Lj c)
forall sequences x isin Xj do
xstart = max j | x sube lt cjcj+1cn gt
Answer = join of Xkwith Xj with the join condition
Xkend lt Xjstart46
Finding Sequential Patterns4 DynamicSome OTF-Generate cont
Let otf-generate be called with
L2 and let c = lt1 2 3 7
4gt Thus c1 = 1 c2= 2
etc
Thus the result of the join with the join condition
X2end lt X2start
(where X2 denotes the seq of
length 2) is the single sequence lt1 2 3 4gt
Seq
lt1 2gt
End
2
Start
1
lt1 3gt 3 1
lt1 4gt 4 1
lt2 3gt 3 2
lt2 4gt 4 2
lt3 4gt 4 3
47
Finding Sequential Patterns4 DynamicSome intermediate Phase
for ( k-- k gt 1 k-- ) do
if (Lk not yet determined) then
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
48
Finding Sequential Patterns4 DynamicSome Example
bull Let step = 2
use L2 and L2 as argument in otf-generate to get C4
49
Finding Sequential Patterns4 DynamicSome Example
bull Get 2 candidate sequences
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
50
Finding Sequential Patterns4 DynamicSome Example
bull Only lt1 2 3 4gt is large
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
L4 Minisup
lt1 2 3 4gt 2
51
Finding Sequential Patterns4 DynamicSome Example
bull pass as arg to otf-gen L2 and L4 to get C6
L4 sup
lt1 2 3 4gt 2
52
Finding Sequential Patterns4 DynamicSome Example
bull C6 is found to be empty
L4 sup
lt1 2 3 4gt 2C6 =
53
Finding Sequential Patterns4 DynamicSome Example
bull In the intermediate phase C3 is generated
C3
from L2 and C5 from L4
using apriori-generate
L4 sup
lt1 2 3 4gt 2C5
54
Finding Sequential Patterns4 DynamicSome Example
bull C5 is found to be empty so only C3 is counted during the backward phase to get L3
L3 C3
C5 =
55
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions56
Performance Synthetic Data
57
Performance Execution Times
58
Performance Scale-Up Customers
59
Performance Scale-Up
60
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions61
Conclusion
bull The problem of mining sequential patterns from a customer DB was introduced
bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome
bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup
bull AprioriAll and AprioriSome have excellent scale-up properties 62
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions63
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori-based algorithms studied here: the Litemset phase finds large itemsets, the same intra-transaction step that underlies association rule mining, and the sequence phase then looks for sequential patterns, each of which is an ordered list of those large itemsets.
Final Exam Question 2: What is the major difference between the two families of algorithms, CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every candidate on each pass, but non-maximal sequences must be pruned later, in the maximal phase.)
CountSome (AprioriSome, DynamicSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but minimum support is not tested at every value of k in the forward phase.)
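Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction with the set of Litemsets it contains, each mapped to an integer, so two Litemsets can be compared for equality in constant time. This reduces the time needed to check whether a candidate sequence is contained in a customer sequence, a test that is repeated on every pass over the data.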
Problem Description
We are given a database D of customer transactions.
Each transaction consists of the fields: customer-id, transaction-time, and the items purchased in the transaction.
Problem Description
No customer has more than one transaction with the same transaction-time.
Quantities of items bought are not considered: each item is a binary variable representing whether an item was bought or not.
Problem Description (Terminology and definitions)
Itemset: A non-empty set of items. Each itemset is mapped to an integer.
Sequence: An ordered list of itemsets.
Customer Sequence: The list of a customer's transactions, ordered by increasing transaction time.
A customer supports a sequence if the sequence is contained in the customer sequence.
Support for a Sequence: The fraction of total customers that support the sequence.
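The containment and support definitions above can be sketched in Python. This is a minimal illustration rather than code from the paper: it assumes the usual reading of "contained" (each itemset of the sequence is a subset of a distinct transaction, in order), and the names `is_contained` and `support` as well as the five toy customer sequences are hypothetical.

```python
from typing import FrozenSet, Sequence

Itemset = FrozenSet[int]


def is_contained(seq: Sequence[Itemset], customer_seq: Sequence[Itemset]) -> bool:
    """True if each itemset of seq is a subset of a distinct transaction, in order."""
    pos = 0  # index of the next itemset of seq still to be matched
    for transaction in customer_seq:
        if pos < len(seq) and seq[pos] <= transaction:
            pos += 1  # greedy earliest match keeps the order intact
    return pos == len(seq)


def support(seq, customer_seqs) -> float:
    """Fraction of customers whose customer sequence contains seq."""
    hits = sum(is_contained(seq, c) for c in customer_seqs)
    return hits / len(customer_seqs)


fs = frozenset
# Hypothetical toy database: one time-ordered list of transactions per customer.
customers = [
    [fs({30}), fs({90})],
    [fs({10, 20}), fs({30}), fs({40, 60, 70})],
    [fs({30, 50, 70})],
    [fs({30}), fs({40, 70}), fs({90})],
    [fs({90})],
]

print(support([fs({30}), fs({90})], customers))      # supported by customers 1 and 4
print(support([fs({10, 20}), fs({30})], customers))  # supported by customer 2 only
```

A greedy left-to-right scan is enough here because matching an itemset at the earliest possible transaction never rules out a later match.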
Problem Description (Terminology and definitions) - Cont.
Maximal Sequence: A sequence that is not contained in any other sequence in a given set of sequences.
Large Sequence: A sequence that meets minisup (the minimum support).
Length of a sequence: The number of itemsets in the sequence. A sequence of length k is called a k-sequence.
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.
An itemset with minimum support is called a large itemset, or Litemset.
Problem Description (Terminology and definitions) - Cont.
Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.
Problem Description - Cont.
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.
Each such sequence represents a sequential pattern.
Problem Description: Example
Note: With a minisup of 25%, no fewer than two customers must support the sequence.
< (10 20) (30) > does not have enough support (it is supported only by Customer 2).
< (30) >, < (70) >, < (30) (40) >, … are not maximal.
[Table: sequences with minimum support]
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases:
1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase
Finding Sequential Patterns: 1. Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor key.
This step implicitly converts the original transaction DB into a DB of customer sequences.
Recall: a Customer Sequence is a list of customer transactions ordered by increasing transaction time.
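As a rough sketch of the sort phase, the following Python groups raw transactions into customer sequences. The tuple layout `(customer_id, transaction_time, items)` and the function name `to_customer_sequences` are assumptions made for this illustration, not something the slides specify.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Sort by (customer-id, transaction-time) and group into customer sequences."""
    # Major key: customer id (field 0); minor key: transaction time (field 1).
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        cid: [frozenset(items) for _, _, items in group]
        for cid, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical rows, deliberately out of order.
rows = [
    (2, 3, {40, 60, 70}),
    (1, 1, {30}),
    (2, 1, {10, 20}),
    (1, 2, {90}),
    (2, 2, {30}),
]
sequences = to_customer_sequences(rows)
print(sequences[2])  # the three transactions of customer 2, in time order
```

Sorting on the first two fields only means the item sets themselves are never compared, which is safe because no customer has two transactions with the same transaction-time.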
Finding Sequential Patterns: 2. Litemset Phase
In this phase we find the set of all large itemsets (Litemsets), L.
We are also simultaneously finding the set of large 1-sequences, since this set is just { <l> | l ∈ L }.
Finding Sequential Patterns: 2. Litemset Phase - Cont.
In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.
In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
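The customer-based support count can be illustrated with a small brute-force sketch. It enumerates every subset of every transaction, which is exponential and only reasonable for toy data; a real implementation would use an Apriori-style pass instead. The function name `large_itemsets` and the toy database are hypothetical.

```python
from collections import Counter
from itertools import combinations


def large_itemsets(customer_seqs, minsup):
    """Return {itemset: customer count} for itemsets meeting minsup (a fraction)."""
    counts = Counter()
    for cseq in customer_seqs:
        seen = set()  # itemsets bought by this customer in some single transaction
        for transaction in cseq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        counts.update(seen)  # each customer counts at most once per itemset
    threshold = minsup * len(customer_seqs)
    return {iset: n for iset, n in counts.items() if n >= threshold}


# Hypothetical toy database of five customer sequences.
customers = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
print(large_itemsets(customers, 0.25))
```

Collecting itemsets in a per-customer set before updating the counter is what distinguishes this from transaction-based Apriori support: a customer who buys the same itemset in several transactions still contributes only once.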
Finding Sequential Patterns: 2. Litemset Phase - Cont.
The set of Litemsets is mapped to a set of contiguous integers.
By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.
Finding Sequential Patterns: 2. Litemset Phase - Cont.
• Example with a minimum support of 40%
Finding Sequential Patterns: 3. Transformation Phase
• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.
Finding Sequential Patterns: 3. Transformation Phase - Cont.
• Each transaction is replaced by the set of all Litemsets contained in the transaction.
• Transactions with no Litemsets are dropped. (Empty customer sequences still contribute to the total customer count.)
• A customer sequence is now represented by a list of sets of Litemsets.
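The three transformation rules above can be sketched as follows. The integer ids assigned to the Litemsets, the function name `transform`, and the toy data are assumptions for illustration only.

```python
fs = frozenset


def transform(customer_seqs, litemset_ids):
    """Replace each transaction by the set of ids of the Litemsets it contains."""
    transformed = []
    for cseq in customer_seqs:
        new_seq = []
        for transaction in cseq:
            ids = {i for iset, i in litemset_ids.items() if iset <= transaction}
            if ids:  # transactions containing no Litemset are dropped
                new_seq.append(ids)
        transformed.append(new_seq)  # kept even if empty, so customer counts stay right
    return transformed


# Hypothetical mapping of Litemsets to contiguous integers.
litemset_ids = {fs({30}): 1, fs({40}): 2, fs({70}): 3, fs({40, 70}): 4, fs({90}): 5}

# Hypothetical toy database of five customer sequences.
customers = [
    [fs({30}), fs({90})],
    [fs({10, 20}), fs({30}), fs({40, 60, 70})],
    [fs({30, 50, 70})],
    [fs({30}), fs({40, 70}), fs({90})],
    [fs({90})],
]
for row in transform(customers, litemset_ids):
    print(row)
```

After this step, checking whether a candidate contains a given Litemset is a single integer membership test against a set.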
Finding Sequential Patterns: 3. Transformation Phase - Cont.
Note: (10 20) is dropped because of lack of support.
(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
Finding Sequential Patterns: 4. Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences
Finding Sequential Patterns: 4. Sequence Phase
• Use the set of Litemsets to find the desired sequences.
• Two families of algorithms are presented:
o Count-all
o Count-some
Finding Sequential Patterns: 4. Sequence Phase
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.
Finding Sequential Patterns: 4. Sequence Phase
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.
Finding Sequential Patterns: 4. Sequence Phase: AprioriAll
L1 = {large 1-sequences}; // Result of Litemset phase
for (k = 2; Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1;
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c;
    Lk = Candidates in Ck with minimum support;
end
Answer = Maximal Sequences in ∪k Lk;
Notation: Lk: set of all large k-sequences; Ck: set of candidate k-sequences
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:
    insert into Ck
    select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
    from Lk-1 p, Lk-1 q
    where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;
• Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
Example:
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
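The AprioriAll loop and the join and prune steps of apriori-generate can be sketched together in Python. This is one possible reading, not the authors' implementation: sequences are tuples of Litemset ids, the database is assumed to be already transformed, self-joins (p = q) are skipped, and the set `l3` below is a hypothetical L3 chosen so that <1 2 4 3> is pruned for the reason given above.

```python
def contains(customer_seq, seq):
    """True if the ids in seq occur in order, each in a distinct transaction."""
    pos = 0
    for litemsets in customer_seq:
        if pos < len(seq) and seq[pos] in litemsets:
            pos += 1
    return pos == len(seq)


def join(prev):
    """Join step: pair sequences that agree on everything but the last id."""
    return {p + (q[-1],) for p in prev for q in prev
            if p != q and p[:-1] == q[:-1]}


def prune(candidates, prev):
    """Prune step: drop a candidate if deleting any one id leaves a non-large sequence."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_generate(prev):
    return prune(join(prev), prev)


def apriori_all(db, min_count):
    """Return {large sequence: support count}; the maximal phase is applied separately."""
    large = {}
    candidates = {(i,) for cseq in db for litemsets in cseq for i in litemsets}
    while candidates:
        counts = {c: sum(contains(cseq, c) for cseq in db) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        large.update(level)
        candidates = apriori_generate(set(level))
    return large


# Hypothetical L3 for the candidate generation example.
l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(sorted(join(l3)))              # candidates after the join
print(sorted(apriori_generate(l3)))  # candidates after pruning

# Hypothetical transformed database (lists of sets of Litemset ids).
db = [[{1}, {5}], [{1}, {2, 3, 4}], [{1, 3}], [{1}, {2, 3, 4}, {5}], [{5}]]
print(sorted(apriori_all(db, min_count=2)))
```

Keeping the join and prune steps as separate functions makes it easy to inspect the intermediate candidate set, which is where <1 2 4 3> appears before being removed.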
Finding Sequential Patterns: 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.
    for (k = n; k > 1; k--) do
        foreach k-sequence sk do
            Delete from S all subsequences of sk;
• The authors claim that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
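A direct, unoptimized sketch of the maximal phase is shown below. It uses a plain quadratic scan rather than the hash-tree structures the authors refer to, and it represents sequences as tuples of frozensets so that containment accounts for itemset inclusion. The names and the sample set of large sequences are hypothetical.

```python
fs = frozenset


def is_subsequence(a, b):
    """True if each itemset of a is a subset of a distinct itemset of b, in order."""
    pos = 0
    for element in b:
        if pos < len(a) and a[pos] <= element:
            pos += 1
    return pos == len(a)


def maximal_sequences(large):
    """Walk from the longest sequences down, deleting everything they contain."""
    remaining = set(large)
    for s in sorted(large, key=len, reverse=True):
        if s in remaining:
            remaining -= {t for t in remaining if t != s and is_subsequence(t, s)}
    return remaining


# Hypothetical set S of large sequences.
S = {
    (fs({30}),), (fs({40}),), (fs({70}),), (fs({40, 70}),), (fs({90}),),
    (fs({30}), fs({40})), (fs({30}), fs({70})),
    (fs({30}), fs({40, 70})), (fs({30}), fs({90})),
}
for seq in maximal_sequences(S):
    print([sorted(itemset) for itemset in seq])
```

Note that two sequences of the same length can still contain one another when an itemset of one is a subset of the matching itemset of the other, which is why the check is not restricted to shorter sequences.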
Finding Sequential Patterns: 4. AprioriAll Example
Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequences to skip.
Let hitk = |Lk| / |Ck| (i.e., the ratio of large k-sequences to candidate k-sequences).
function next(k: integer) // k is the length of the sequences counted in the last pass
begin
    if (hitk < 0.666) return k + 1;
    elsif (hitk < 0.75) return k + 2;
    elsif (hitk < 0.80) return k + 3;
    elsif (hitk < 0.85) return k + 4;
    else return k + 5;
end
• next returns the length of the sequences to count in the next pass.
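The skip heuristic translates almost line for line into Python. In this sketch the hit ratio is passed in explicitly as `hit_k` instead of being looked up from Lk and Ck; that parameter and the name `next_length` are choices made for this illustration.

```python
def next_length(k: int, hit_k: float) -> int:
    """Length of the sequences to count in the next pass, given the last hit ratio."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


for ratio in (0.5, 0.7, 0.78, 0.82, 0.9):
    print(ratio, next_length(2, ratio))
```

The higher the fraction of candidates that turned out to be large, the more lengths are skipped, on the expectation that longer sequences are also likely to be large.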
Finding Sequential Patterns: 4. AprioriSome Forward Phase
L1 = {large 1-sequences}; // Result of Litemset phase
C1 = L1;
last = 1; // We last counted Clast
for (k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++) do
begin
    if (Lk-1 known) then
        Ck = New candidates generated from Lk-1;
    else
        Ck = New candidates generated from Ck-1;
    if (k == next(last)) then begin // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c;
        Lk = Candidates in Ck with minimum support;
        last = k;
    end
end
Finding Sequential Patterns: 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
    if (Lk not found in forward phase) then begin
        Delete all sequences in Ck contained in some Li, i > k;
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c;
        Lk = Candidates in Ck with minimum support;
    end
    else // Lk already known
        Delete all sequences in Lk contained in some Li, i > k;
Answer = ∪k Lk; // Maximal phase not needed
Notation: DT: transformed database
Finding Sequential Patterns: 4. AprioriSome Example
Finding Sequential Patterns: 4. AprioriSome Example - Cont.
Minimum Support = 40% (2 customer sequences)
Finding Sequential Patterns: 4. AprioriSome Example
Answer: <1 2 3 4>, <1 3 5>, <4 5>
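The forward and backward phases can be put together in a compact sketch. Several things here are assumptions rather than part of the slides: the transformed database `db` is a hypothetical toy example chosen so that the result matches the answer above, the skip rule is simplified to next(k) = 2k instead of the hit-ratio heuristic, self-joins are skipped in candidate generation, and all function names are made up for this illustration.

```python
def contains(customer_seq, seq):
    """True if the ids in seq occur in order, each in a distinct transaction."""
    pos = 0
    for litemsets in customer_seq:
        if pos < len(seq) and seq[pos] in litemsets:
            pos += 1
    return pos == len(seq)


def is_subsequence(a, b):
    """True if id-sequence a can be obtained from b by deleting elements."""
    it = iter(b)
    return all(x in it for x in a)


def generate(prev):
    """apriori-generate: join on a common prefix, then prune."""
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def large_in(db, candidates, min_count):
    """Candidates supported by at least min_count customer sequences."""
    return {c for c in candidates
            if sum(contains(cseq, c) for cseq in db) >= min_count}


def apriori_some(db, l1, min_count, next_length):
    cand = {1: set(l1)}
    large = {1: set(l1)}
    last, k = 1, 2
    # Forward phase: generate every Ck, but count only the lengths next_length picks.
    while cand[k - 1] and large[last]:
        cand[k] = generate(large[k - 1] if k - 1 in large else cand[k - 1])
        if k == next_length(last):
            large[k] = large_in(db, cand[k], min_count)
            last = k
        k += 1
    # Backward phase: drop sequences contained in longer large ones, then count
    # the lengths that were skipped going forward.
    for j in range(k - 1, 0, -1):
        longer = [s for i in large if i > j for s in large[i]]
        pool = large[j] if j in large else cand[j]
        kept = {c for c in pool if not any(is_subsequence(c, s) for s in longer)}
        large[j] = kept if j in large else large_in(db, kept, min_count)
    return set().union(*large.values())


# Hypothetical transformed database: five customer sequences of Litemset-id sets.
db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
l1 = {(i,) for i in range(1, 6)}  # every id is supported by at least 2 customers
print(sorted(apriori_some(db, l1, min_count=2, next_length=lambda k: 2 * k)))
```

With this skip rule, C3 is generated but not counted in the forward phase; it is counted in the backward phase only after the candidates contained in <1 2 3 4> have been removed.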
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates Ck from Lk-1 or Ck-1.
• DynamicSome generates Ck "on the fly", based on large sequences found in the previous passes and the customer sequences read from the database.
Finding Sequential Patterns: 4. DynamicSome
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.
• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.
Finding Sequential Patterns: 4. DynamicSome
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.
• The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns: 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L1 = {large 1-sequences}; // Result of Litemset phase
for (k = 2; k <= step and Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1;
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c;
    Lk = Candidates in Ck with minimum support;
end
Finding Sequential Patterns: 4. DynamicSome Forward Phase
for (k = step; Lk ≠ ∅; k += step) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅;
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c);
        foreach sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary);
    end
    Lk+step = Candidates in Ck+step with minimum support;
end
Finding Sequential Patterns: 4. DynamicSome OTF-Generate
// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min{ j | x ⊆ <c1 c2 ... cj> };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max{ j | x ⊆ <cj cj+1 ... cn> };
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start;
Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont.
Let otf-generate be called with L2, and let c = <{1} {2} {3 7} {4}>. Thus c1 = {1}, c2 = {2}, etc.
Seq      End   Start
<1 2>    2     1
<1 3>    3     1
<1 4>    4     1
<2 3>    3     2
<2 4>    4     2
<3 4>    4     3
Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
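The otf-generate procedure and the example above can be sketched as follows. The helper names `earliest_end` and `latest_start` are hypothetical, positions are 1-based to match the table, and the set `l2` is an assumed L2 that is consistent with the six sequences listed in the table.

```python
def earliest_end(seq, c):
    """Smallest j such that seq is contained in <c1 ... cj>, or None if not contained."""
    pos = 0
    for j, litemsets in enumerate(c, start=1):
        if seq[pos] in litemsets:
            pos += 1
            if pos == len(seq):
                return j
    return None


def latest_start(seq, c):
    """Largest j such that seq is contained in <cj ... cn>, or None if not contained."""
    pos = len(seq) - 1
    for j in range(len(c), 0, -1):
        if seq[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


def otf_generate(lk, lj, c):
    """Concatenate x from lk and y from lj whenever x ends before y starts in c."""
    ends = {}
    for x in lk:
        end = earliest_end(x, c)
        if end is not None:
            ends[x] = end
    starts = {}
    for y in lj:
        start = latest_start(y, c)
        if start is not None:
            starts[y] = start
    return {x + y for x, e in ends.items() for y, s in starts.items() if e < s}


c = [{1}, {2}, {3, 7}, {4}]
# Assumed L2; sequences involving id 5 are not contained in c and drop out.
l2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
print(otf_generate(l2, l2, c))
```

Scanning left to right gives the earliest possible end and scanning right to left gives the latest possible start, so the condition end < start holds exactly when the two sequences can be placed in c without overlapping.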
Finding Sequential Patterns: 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
        else
            Ck = New candidates generated from Ck-1;
Problem Description No customer has more than one transaction
with the same transaction-time
Quantities of items bought are not
considered each item is a binary variable representing whether an item was bought or not
10
Problem Description (Terminology and definitions)
Itemset: a non-empty set of items. Each itemset is mapped to an integer.
Sequence: an ordered list of itemsets.
Customer Sequence: the list of a customer's transactions, ordered by increasing transaction time.
A customer supports a sequence if the sequence is contained in the customer-sequence.
Support for a Sequence: the fraction of total customers that support the sequence.
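The containment and support definitions above can be sketched in Python. This is a minimal illustration, not code from the paper: the names is_contained, support and fs, the choice to represent a sequence as a list of frozensets, and the five-customer toy database are all assumptions made for the example.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[int]]  # assumed representation: ordered list of itemsets


def is_contained(seq: Sequence, customer_seq: Sequence) -> bool:
    """Return True if seq is contained in customer_seq.

    Each itemset of seq must be a subset of a distinct transaction, and the
    matched transactions must appear in the same order as in seq.
    """
    pos = 0
    for itemset in seq:
        # Greedily advance to the earliest transaction that holds this itemset.
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True


def support(seq: Sequence, customer_seqs: List[Sequence]) -> float:
    """Fraction of customers whose customer sequence contains seq."""
    return sum(is_contained(seq, c) for c in customer_seqs) / len(customer_seqs)


def fs(*items: int) -> FrozenSet[int]:
    """Small helper (hypothetical) to build an itemset."""
    return frozenset(items)


# Hypothetical five-customer database, for illustration only.
toy_db = [
    [fs(30), fs(90)],
    [fs(10, 20), fs(30), fs(40, 60, 70)],
    [fs(30, 50, 70)],
    [fs(30), fs(40, 70), fs(90)],
    [fs(90)],
]
```

On this toy data, support([fs(30), fs(90)], toy_db) counts two of the five customers, while support([fs(10, 20), fs(30)], toy_db) counts only one. Note that two itemsets bought in the same transaction do not support a two-element sequence, because each element must match a different transaction.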
Problem Description (Terminology and definitions) - Cont
Maximal Sequence: a sequence that is not contained in any other sequence.
Large Sequence: a sequence that meets minisup (the minimum support).
Length of a sequence: the number of itemsets in the sequence. A sequence of length k is called a k-sequence.
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.
An itemset with minimum support is called a large itemset or Litemset.
Problem Description (Terminology and definitions) - Cont
Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.
Problem Description - Cont
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support.
Each such sequence represents a sequential pattern.
Problem Description: Example
Note: With a minisup of 25%, no fewer than two customers must support a sequence.
< (10 20) (30) > does not have enough support (it is supported only by Customer 2).
< (30) >, < (70) >, < (30) (40) >, ... are not maximal.
(Table: sequences with minimum support)
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases:
1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase
Finding Sequential Patterns: 1. Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor key.
This step implicitly converts the original transaction DB into a DB of customer sequences.
Recall: a Customer Sequence is a list of customer transactions ordered by increasing transaction time.
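The sort phase can be sketched as follows. This is only an illustration under assumed names: to_customer_sequences and the (customer_id, transaction_time, items) row layout are not taken from the paper.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Group transactions into customer sequences.

    transactions: iterable of (customer_id, transaction_time, items) rows
    (an assumed layout). Rows are sorted with customer_id as the major key
    and transaction_time as the minor key, then grouped per customer.
    """
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [frozenset(items) for _, _, items in rows]
        for customer_id, rows in groupby(ordered, key=itemgetter(0))
    }
```

Because the problem statement guarantees that no customer has two transactions with the same transaction-time, the sort key is unique per row and the resulting order within each customer sequence is well defined.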
Finding Sequential Patterns: 2. Litemset Phase
In this phase we find the set of all large itemsets (Litemsets), L.
We are also simultaneously finding the set of large 1-sequences, since this set is just
{ < l > | l ∈ L }
Finding Sequential Patterns: 2. Litemset Phase - Cont
In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.
In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
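The difference between the two support definitions can be made concrete with a short sketch. The function names and the four-customer toy data below are assumptions for illustration only.

```python
def customer_support(itemset, customer_seqs):
    """Sequential-pattern support: fraction of customers who bought the
    itemset within at least one single transaction."""
    target = frozenset(itemset)
    buyers = sum(1 for c in customer_seqs if any(target <= set(t) for t in c))
    return buyers / len(customer_seqs)


def transaction_support(itemset, customer_seqs):
    """Apriori-style support, shown for contrast: fraction of all
    transactions that contain the itemset."""
    target = frozenset(itemset)
    transactions = [t for c in customer_seqs for t in c]
    return sum(1 for t in transactions if target <= set(t)) / len(transactions)


# Hypothetical data: customer 1 buys item 30 twice, in separate transactions.
toy_db = [
    [{30}, {30, 90}],
    [{30}],
    [{90}],
    [{40}],
]
```

In this toy database item 30 appears in three of five transactions but is bought by only two of four customers. A customer who buys the itemset several times is counted once.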
Finding Sequential Patterns: 2. Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers.
By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check if a sequence is contained in a customer sequence.
Finding Sequential Patterns: 2. Litemset Phase - Cont
• Example with a minimum support of 40%
Finding Sequential Patterns: 3. Transformation Phase
• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.
Finding Sequential Patterns: 3. Transformation Phase - Cont
• Each transaction is replaced by the set of all Litemsets contained in the transaction.
• Transactions with no Litemsets are dropped. (Customers whose sequences become empty still contribute to the total customer count.)
• A customer sequence is now represented by a list of sets of Litemsets.
Finding Sequential Patterns: 3. Transformation Phase - Cont
Note: (10 20) is dropped because of lack of support.
(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
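The transformation can be sketched as below. The function name transform and the particular integer ids assigned to the Litemsets are assumptions chosen for the example; only the itemsets (10 20), (30) and (40 60 70) and the Litemsets (40), (70), (40 70) come from the note above.

```python
def transform(customer_seq, litemset_ids):
    """Replace each transaction by the ids of all Litemsets it contains.

    customer_seq: list of transactions (sets of items).
    litemset_ids: dict mapping each Litemset (a frozenset) to its integer id.
    Transactions that contain no Litemset are dropped.
    """
    transformed = []
    for transaction in customer_seq:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


# Hypothetical id assignment for the Litemsets.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
```

With these ids, the customer sequence [{10, 20}, {30}, {40, 60, 70}] becomes [{1}, {2, 3, 4}]: the first transaction disappears and item 60 leaves no trace. A customer whose transformed sequence is empty must still be counted in the denominator when support is computed.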
Finding Sequential Patterns: 4. Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences
Finding Sequential Patterns: 4. Sequence Phase
• Use the set of Litemsets to find the desired sequences.
• Two families of algorithms are presented:
  Count-all
  Count-some
Finding Sequential Patterns: 4. Sequence Phase
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.
Finding Sequential Patterns: 4. Sequence Phase
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.
Finding Sequential Patterns: 4. Sequence Phase: AprioriAll
L_1 = {large 1-sequences}  // result of Litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1}
    foreach customer-sequence c in the database do
        Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
end
Answer = Maximal Sequences in ∪_k L_k
Notation: L_k: set of all large k-sequences; C_k: set of candidate k-sequences
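The loop above can be sketched in Python over a transformed database. This is a hedged illustration, not the authors' implementation: the function names, the use of tuples of Litemset ids for sequences, the absolute min_count threshold, and the brute-force counting (the paper relies on more efficient data structures) are all assumptions.

```python
def contains(customer_seq, seq):
    """True if seq (tuple of Litemset ids) is contained in the transformed
    customer sequence (list of sets of Litemset ids)."""
    pos = 0
    for litemset_id in seq:
        while pos < len(customer_seq) and litemset_id not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def generate(prev):
    """Join sequences that share their first k-2 elements, then prune any
    candidate that has a (k-1)-subsequence outside prev."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_all_large(db, large_1, min_count):
    """Return all large sequences (maximal or not) found level by level."""
    levels = [set(large_1)]
    while levels[-1]:
        candidates = generate(levels[-1])
        counts = {cand: sum(contains(c, cand) for c in db) for cand in candidates}
        levels.append({cand for cand, n in counts.items() if n >= min_count})
    return set().union(*levels)
```

The result still includes non-maximal sequences; in the Count-all approach they are removed afterwards in the maximal phase.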
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:
    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};
• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
Example
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
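The join and prune steps can be sketched as follows. The function name apriori_generate, the tuple representation of sequences, and the sample L_3 used in the usage note are assumptions for illustration; the join is taken literally from the SQL, so a sequence may also be joined with itself.

```python
def apriori_generate(large_prev):
    """Generate candidate k-sequences from the large (k-1)-sequences.

    large_prev: set of tuples of Litemset ids, all of length k-1.
    """
    large_prev = set(large_prev)

    # Join step: p and q agree on their first k-2 elements; the candidate is
    # p extended by the last element of q.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune step: every (k-1)-subsequence, obtained by dropping one element,
    # must itself be large.
    def all_subsequences_large(candidate):
        return all(
            candidate[:i] + candidate[i + 1:] in large_prev
            for i in range(len(candidate))
        )

    return {c for c in joined if all_subsequences_large(c)}
```

For an assumed L_3 = {<1 2 3>, <1 2 4>, <1 3 4>, <1 3 5>, <2 3 4>}, the join produces <1 2 4 3> among others, and the prune step removes it because <2 4 3> is not in L_3, leaving only <1 2 3 4>.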
Finding Sequential Patterns: 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.
    for (k = n; k > 1; k--) do
        foreach k-sequence s_k do
            Delete from S all subsequences of s_k
• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
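A naive version of the maximal phase can be sketched as below. It is a quadratic filter rather than the hash-tree approach the authors refer to, and the function names and the sample set of large sequences are assumptions for illustration.

```python
def contains(big, small):
    """True if sequence small is contained in sequence big.

    Both are tuples of frozensets; each itemset of small must be a subset of
    a distinct itemset of big, in order.
    """
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Keep only the sequences not contained in any other sequence."""
    return [s for s in large
            if not any(t != s and contains(t, s) for t in large)]


def seq(*itemsets):
    """Hypothetical helper: build a sequence from tuples of items."""
    return tuple(frozenset(i) for i in itemsets)


# Assumed set of large sequences, for illustration only.
large = [
    seq((30,)), seq((40,)), seq((70,)), seq((40, 70)), seq((90,)),
    seq((30,), (40,)), seq((30,), (70,)), seq((30,), (40, 70)),
    seq((30,), (90,)),
]
```

On this sample, only <(30) (40 70)> and <(30) (90)> survive. Note that <(30) (40)> is removed even though it has the same length as <(30) (40 70)>, since containment is decided itemset by itemset.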
Finding Sequential Patterns: 4. AprioriAll Example
Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequence lengths to skip.
Let hit_k = |L_k| / |C_k|
(i.e., the ratio of large k-sequences to candidate k-sequences)
function next(k: integer)  // k is the length of the sequences counted in the last pass
begin
    if (hit_k < 0.666) return k + 1;
    elseif (hit_k < 0.75) return k + 2;
    elseif (hit_k < 0.80) return k + 3;
    elseif (hit_k < 0.85) return k + 4;
    else return k + 5;
end
• next returns the length of the sequences to count in the next pass.
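The next function translates directly into Python. In this sketch the hit ratio is passed in as a parameter rather than looked up, and the name next_length is an assumption (chosen to avoid shadowing Python's built-in next).

```python
def next_length(k, hit_k):
    """Length of the sequences to count in the next forward pass.

    k: length of the sequences counted in the last pass.
    hit_k: |L_k| / |C_k|, the fraction of candidate k-sequences that
    turned out to be large.
    """
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5
```

The intuition is that when most candidates of length k are large, longer sequences are likely to be large too, so more intermediate lengths can be skipped. For example, next_length(2, 0.5) gives 3, while next_length(2, 0.9) gives 7.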
Finding Sequential Patterns: 4. AprioriSome Forward Phase
L_1 = {large 1-sequences}  // result of Litemset phase
C_1 = L_1
last = 1  // we last counted C_last
for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
begin
    if (L_{k-1} known) then
        C_k = New candidates generated from L_{k-1}
    else
        C_k = New candidates generated from C_{k-1}
    if (k == next(last)) then begin  // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in C_k that are contained in c
        L_k = Candidates in C_k with minimum support
        last = k
    end
end
Finding Sequential Patterns: 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
    if (L_k not found in forward phase) then begin
        Delete all sequences in C_k contained in some L_i, i > k
        foreach customer-sequence c in D_T do
            Increment the count of all candidates in C_k that are contained in c
        L_k = Candidates in C_k with minimum support
    end
    else  // L_k already known
        Delete all sequences in L_k contained in some L_i, i > k
Answer = ∪_k L_k  // maximal phase not needed
Notation: D_T: transformed database
Finding Sequential Patterns: 4. AprioriSome Example
Finding Sequential Patterns: 4. AprioriSome Example - Cont
Minimum Support = 40% (2 customer sequences)
Finding Sequential Patterns: 4. AprioriSome Example
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates C_k from L_{k-1} or C_{k-1}.
• DynamicSome generates C_k "on the fly", based on large sequences found in the previous passes and the customer sequences read from the database.
Finding Sequential Patterns: 4. DynamicSome
• In the initialization phase, count only sequences of length up to and including the variable step. If step is 3, count sequences of length 1, 2 and 3.
• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and customer sequences in the database.
  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then < s_6 . s_3 > is a candidate (6+3)-sequence.
Finding Sequential Patterns: 4. DynamicSome
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.
• The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns: 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L_1 = {large 1-sequences}  // result of Litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
    C_k = New candidates generated from L_{k-1}
    foreach customer-sequence c in D_T do
        Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
end
Finding Sequential Patterns: 4. DynamicSome Forward Phase
for (k = step; L_k ≠ ∅; k += step) do
begin
    // find L_{k+step} from L_k and L_step
    C_{k+step} = ∅
    foreach customer-sequence c in D_T do
    begin
        X = otf-generate(L_k, L_step, c)
        foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary)
    end
    L_{k+step} = Candidates in C_{k+step} with minimum support
end
Finding Sequential Patterns: 4. DynamicSome OTF-Generate
// c is the customer sequence < c_1 c_2 ... c_n >
X_k = subseq(L_k, c)
forall sequences x ∈ X_k do
    x.end = min{ j | x is contained in < c_1 c_2 ... c_j > }
X_j = subseq(L_j, c)
forall sequences x ∈ X_j do
    x.start = max{ j | x is contained in < c_j c_{j+1} ... c_n > }
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start
Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont
Let otf-generate be called with L_2, and let c = < {1} {2} {3 7} {4} >. Thus c_1 = {1}, c_2 = {2}, etc.
Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3
Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
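The End and Start columns and the final join can be reproduced with a short sketch. The function names, the tuple representation of sequences and the extra sequence <3 5> in the sample L_2 are assumptions; the six sequences in the table and the customer sequence come from the example above.

```python
def earliest_end(x, c):
    """Smallest 1-based j such that x is contained in <c_1 ... c_j>, else None."""
    pos = 0
    for litemset_id in x:
        while pos < len(c) and litemset_id not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos  # 1-based index of the transaction matched last


def latest_start(x, c):
    """Largest 1-based j such that x is contained in <c_j ... c_n>, else None."""
    pos = len(c) - 1
    for litemset_id in reversed(x):
        while pos >= 0 and litemset_id not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # 1-based index of the transaction matched first


def otf_generate(large_k, large_j, c):
    """Candidate (k+j)-sequences supported by customer sequence c."""
    ends = [(x, earliest_end(x, c)) for x in large_k]
    starts = [(y, latest_start(y, c)) for y in large_j]
    return {
        x + y
        for x, end in ends if end is not None
        for y, start in starts if start is not None and end < start
    }


c = [{1}, {2}, {3, 7}, {4}]
# The six sequences from the table plus one (hypothetical) that c lacks.
l2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (3, 5)}
```

Calling otf_generate(l2, l2, c) joins <1 2>, which ends at position 2, with <3 4>, which starts at position 3, and no other pair satisfies end < start. Scanning forwards for the earliest end and backwards for the latest start gives each pair the best chance of not overlapping.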
Finding Sequential Patterns: 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
    if (L_k not yet determined) then
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1}
        else
            C_k = New candidates generated from C_{k-1}
Finding Sequential Patterns: 4. DynamicSome Example
• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.
• This gives 2 candidate sequences:
C_4          Support
<1 2 3 4>    2
<1 3 4 5>    1
• Only <1 2 3 4> is large:
L_4          Support
<1 2 3 4>    2
• Pass L_4 and L_2 as arguments to otf-generate to get C_6.
• C_6 is found to be empty (C_6 = ∅).
• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.
• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.
Performance: Synthetic Data
Performance: Execution Times
Performance: Scale-Up (Customers)
Performance: Scale-Up
Conclusion
• The problem of mining sequential patterns from a customer DB was introduced.
• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.
• AprioriAll and AprioriSome have excellent scale-up properties.
Problem Description(Terminology and definitions) - Cont
Maximal Sequence A sequence that is not contained in any other sequence
Large Sequence Sequence that meets minisup
Length of a sequence The of itemsets in the sequence A sequence of length k is called a k-sequence
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
an itemset with minimum support is called a large itemset or Litemset
12
Problem Description(Terminology and definitions) - Cont
Note that each itemset in a large sequence must have minimum support Therefore any large sequence must be a list of Litemsets
13
Problem Description - Cont
Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support
Each such sequence represents a sequential pattern
14
Problem DescriptionExample
Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal
Seq with minimum support
15
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
16
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases
1 Sort Phase
2 Large itemset (Litemset) Phase
3 Transformation Phase
4 Sequence Phase
5 Maximal Phase
17
Finding Sequential Patterns1 Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor-key
This step implicitly converts the original transaction DB into a DB of customer sequences
Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time
18
Finding Sequential Patterns2 Litemset Phase
In this phase we find the set of all Large itemsets (Litemsets) L
We are also simultaneously finding the set of large 1-sequences since this set is just
lt l gt | l isin L
19
Finding Sequential Patterns2 Litemset Phase - Cont
In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present
In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions
20
Finding Sequential Patterns2 Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers
By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence
21
Finding Sequential Patterns2 Litemset Phase - Cont
bull Example with the minimum support 40
22
Finding Sequential Patterns3 Transformation Phase
bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation
23
Finding Sequential Patterns: 3. Transformation Phase - Cont
• Each transaction is replaced by the set of all Litemsets contained in the transaction.
• Transactions with no Litemsets are dropped. (Empty customer sequences still contribute to the total customer count.)
• A customer sequence is now represented by a list of sets of Litemsets.
Finding Sequential Patterns: 3. Transformation Phase - Cont
Note: (10 20) is dropped because of lack of support.
(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minisup).
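The transformation can be sketched as follows. The integer ids in `litemset_ids`, the database `db`, and the function name `transform` are assumptions chosen for illustration so that the result mirrors the note above: the transaction (10 20) disappears and (40 60 70) becomes the ids of (40), (70) and (40 70).

```python
def transform(sequences, litemset_ids):
    """Replace each transaction by the ids of the Litemsets it contains.

    Transactions containing no Litemset are dropped, but a customer whose
    sequence becomes empty is kept so the customer count stays unchanged.
    """
    transformed = []
    for transactions in sequences:
        new_sequence = []
        for transaction in transactions:
            ids = {
                ident
                for items, ident in litemset_ids.items()
                if set(items) <= set(transaction)
            }
            if ids:
                new_sequence.append(ids)
        transformed.append(new_sequence)
    return transformed


# Assumed mapping of Litemsets to contiguous integers.
litemset_ids = {(30,): 1, (40,): 2, (70,): 3, (40, 70): 4, (90,): 5}

# Illustrative database: one list of transactions per customer.
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
dt = transform(db, litemset_ids)
print(dt[1])
```

After this step, checking whether a candidate element occurs in a transaction is a set-membership test on integers rather than a subset test on itemsets.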
Finding Sequential Patterns: 4. Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences
Finding Sequential Patterns: 4. Sequence Phase
• Use the set of Litemsets to find the desired sequences.
• Two families of algorithms are presented:
    o Count-all
    o Count-some
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.
Finding Sequential Patterns: 4. Sequence Phase, AprioriAll
L1 = {large 1-sequences}; // Result of Litemset phase
for (k = 2; Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c
    Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in ∪k Lk
Notation: Lk: set of all large k-sequences; Ck: set of candidate k-sequences
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:
    insert into Ck
    select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
    from Lk-1 p, Lk-1 q
    where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;
• Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.
Example
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
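The join-and-prune step and the AprioriAll loop can be sketched together. This is a minimal illustration under stated assumptions, not the authors' implementation: sequences are tuples of Litemset ids, the transformed database `db` is illustrative data, and the helper names `contains`, `apriori_generate` and `apriori_all` are hypothetical. The join follows the SQL above literally, so a sequence may be joined with itself.

```python
from itertools import combinations


def contains(customer_seq, candidate):
    """True if the ids in candidate occur in order, each in a later transaction."""
    pos = 0
    for element in customer_seq:
        if pos < len(candidate) and candidate[pos] in element:
            pos += 1
    return pos == len(candidate)


def apriori_generate(prev_large):
    """Join L(k-1) with itself, then prune by the (k-1)-subsequence test."""
    prev = set(prev_large)
    # Join: p and q agree on their first k-2 elements.
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    # Prune: every (k-1)-subsequence must itself be large.
    return {
        c for c in joined
        if all(sub in prev for sub in combinations(c, len(c) - 1))
    }


def apriori_all(db, min_customers):
    """Return {k: set of large k-sequences} for a transformed database."""
    def large(candidates):
        return {
            c for c in candidates
            if sum(contains(s, c) for s in db) >= min_customers
        }

    level = large({(i,) for s in db for element in s for i in element})
    result, k = {}, 1
    while level:
        result[k] = level
        level = large(apriori_generate(level))
        k += 1
    return result


# Candidate generation on an L3 consistent with the pruning note above.
l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
c4 = apriori_generate(l3)
print(c4)

# Illustrative transformed database: each transaction is a set of Litemset ids.
db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large_by_length = apriori_all(db, min_customers=2)
```

The join alone would also produce <1 2 4 3>, <1 3 4 5> and <1 3 5 4> from this L3, but each has a 3-subsequence outside L3, so only <1 2 3 4> survives the prune.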
Finding Sequential Patterns: 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.
    for (k = n; k > 1; k--) do
        foreach k-sequence sk do
            Delete from S all subsequences of sk
• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
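A direct, unoptimized version of this loop might look like the sketch below. It uses a quadratic scan rather than the hash-tree structure the authors refer to, and it tests containment at the itemset level, so <(30) (40)> counts as contained in <(30) (40 70)>. The helper names `seq`, `is_contained` and `maximal`, and the input set `large`, are assumptions for illustration.

```python
def seq(*elements):
    """Build a hashable sequence of itemsets."""
    return tuple(frozenset(e) for e in elements)


def is_contained(a, b):
    """True if each itemset of a is a subset of a distinct, later itemset of b."""
    pos = 0
    for element in b:
        if pos < len(a) and a[pos] <= element:
            pos += 1
    return pos == len(a)


def maximal(sequences):
    """Keep only sequences not contained in any other sequence of the set."""
    remaining = set(sequences)
    # Visit longer sequences first, as in the slide's loop over k = n .. 2.
    for s in sorted(sequences, key=len, reverse=True):
        if s in remaining:
            remaining -= {t for t in remaining if t != s and is_contained(t, s)}
    return remaining


# Illustrative set of large sequences.
large = {
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {40, 70}), seq({30}, {90}),
}
answer = maximal(large)
print(answer)
```

For this input only <(30) (40 70)> and <(30) (90)> remain, since every other sequence is contained in one of them.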
Finding Sequential Patterns: 4. AprioriAll Example
Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequences to skip.
Let hitk = |Lk| / |Ck|
(i.e., the ratio of large k-sequences to candidate k-sequences)
function next(k: integer) // k is the length of the sequences counted in the last pass
begin
    if (hitk < 0.666) return k + 1;
    elsif (hitk < 0.75) return k + 2;
    elsif (hitk < 0.80) return k + 3;
    elsif (hitk < 0.85) return k + 4;
    else return k + 5;
end
• next returns the length of the sequences to count in the next pass.
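The skipping heuristic translates almost line for line into Python. In this sketch the function name `next_length` and the decision to pass the two set sizes as arguments are assumptions; the thresholds are the ones on the slide. It assumes the candidate set counted in the last pass was not empty.

```python
def next_length(k, num_large, num_candidates):
    """Length of the sequences to count in the next pass.

    k is the length counted in the last pass; the hit ratio is the share of
    candidate k-sequences that turned out to be large.
    """
    hit = num_large / num_candidates
    if hit < 0.666:
        return k + 1
    if hit < 0.75:
        return k + 2
    if hit < 0.80:
        return k + 3
    if hit < 0.85:
        return k + 4
    return k + 5


print(next_length(2, 9, 25))   # hit ratio 0.36, so no length is skipped
print(next_length(3, 9, 10))   # hit ratio 0.90, so four lengths are skipped
```

The intuition is that a high hit ratio suggests many longer sequences are also large, so counting the intermediate lengths would mostly count non-maximal sequences.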
Finding Sequential Patterns: 4. AprioriSome Forward Phase
L1 = {large 1-sequences}; // Result of Litemset phase
C1 = L1;
last = 1; // We last counted Clast
for (k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++) do
begin
    if (Lk-1 known) then
        Ck = New candidates generated from Lk-1;
    else
        Ck = New candidates generated from Ck-1;
    if (k == next(last)) then begin // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
        last = k;
    end
end
Finding Sequential Patterns: 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
    if (Lk not found in forward phase) then begin
        Delete all sequences in Ck contained in some Li, i > k;
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
    end
    else // Lk already known
        Delete all sequences in Lk contained in some Li, i > k
Answer = ∪k Lk // Maximal phase not needed
Notation: DT: transformed database
Finding Sequential Patterns: 4. AprioriSome Example
Minimum Support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
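The forward and backward phases can be sketched end to end as below. Several things here are assumptions rather than the authors' code: the transformed database `db` is illustrative data chosen so that the result matches the answer above, the skipping rule is passed in as `next_fn` and set to the simple stand-in `2 * k` instead of the hit-ratio function, and containment between sequences is tested on Litemset ids treated as atomic.

```python
from itertools import combinations


def contains(customer_seq, candidate):
    """True if the ids in candidate occur in order, each in a later transaction."""
    pos = 0
    for element in customer_seq:
        if pos < len(candidate) and candidate[pos] in element:
            pos += 1
    return pos == len(candidate)


def generate(prev):
    """apriori-generate: join on the first k-2 ids, prune by (k-1)-subsequences."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(sub in prev for sub in combinations(c, len(c) - 1))}


def is_subseq(small, big):
    """True if small is an (order-preserving) subsequence of big."""
    it = iter(big)
    return all(x in it for x in small)


def apriori_some(db, min_customers, next_fn):
    def count(cands):
        return {c for c in cands
                if sum(contains(s, c) for s in db) >= min_customers}

    C = {1: {(i,) for s in db for element in s for i in element}}
    L = {1: count(C[1])}
    last, k = 1, 2
    # Forward phase: generate every Ck, but count only the selected lengths.
    while C[k - 1] and L[last]:
        C[k] = generate(L[k - 1] if (k - 1) in L else C[k - 1])
        if k == next_fn(last):
            L[k] = count(C[k])
            last = k
        k += 1
    # Backward phase: count skipped lengths after dropping non-maximal candidates.
    for j in range(k - 1, 0, -1):
        longer = [s for i in L if i > j for s in L[i]]
        if j not in L:
            L[j] = count({c for c in C[j]
                          if not any(is_subseq(c, s) for s in longer)})
        else:
            L[j] = {c for c in L[j]
                    if not any(is_subseq(c, s) for s in longer)}
    return set().union(*L.values())


db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
answer = apriori_some(db, min_customers=2, next_fn=lambda k: 2 * k)
print(sorted(answer))
```

In this run the forward phase counts lengths 1, 2 and 4 and skips length 3; the backward phase then counts only the three candidate 3-sequences that are not already contained in <1 2 3 4>. Because non-maximal sequences are deleted as the backward phase proceeds, the union of the remaining sets is the final answer.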
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates Ck from Lk-1 or Ck-1.
• DynamicSome generates Ck "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.
Finding Sequential Patterns: 4. DynamicSome
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.
• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
    o Example: generating sequences of length 9 with a step size of 3. During the pass over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.
Finding Sequential Patterns: 4. DynamicSome
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
    o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.
• The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns: 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L1 = {large 1-sequences}; // Result of Litemset phase
for (k = 2; k <= step and Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c
    Lk = Candidates in Ck with minimum support
end
Finding Sequential Patterns: 4. DynamicSome Forward Phase
for (k = step; Lk ≠ ∅; k += step) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅;
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c);
        foreach sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary)
    end
    Lk+step = Candidates in Ck+step with minimum support
end
Finding Sequential Patterns: 4. DynamicSome OTF-Generate
// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min{ j | x ⊆ <c1 c2 ... cj> };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max{ j | x ⊆ <cj cj+1 ... cn> };
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start
Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont
Let otf-generate be called with L2, and let c = <{1} {2} {3 7} {4}>. Thus c1 = {1}, c2 = {2}, etc.

Seq      End   Start
<1 2>    2     1
<1 3>    3     1
<1 4>    4     1
<2 3>    3     2
<2 4>    4     2
<3 4>    4     3

Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
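The end/start bookkeeping can be sketched as follows. Sequences are tuples of Litemset ids and the customer sequence is a list of id sets; the helper names `earliest_end`, `latest_start` and `otf_generate`, and the contents of `l2`, are assumptions for illustration. Transaction positions are 1-based to match the table above.

```python
def earliest_end(x, c):
    """Smallest j such that x is contained in <c1 ... cj>, or None."""
    pos = 0
    for j, element in enumerate(c, start=1):
        if x[pos] in element:
            pos += 1
            if pos == len(x):
                return j
    return None


def latest_start(x, c):
    """Largest j such that x is contained in <cj ... cn>, or None."""
    pos = len(x) - 1
    for j in range(len(c), 0, -1):
        if x[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


def otf_generate(large_k, large_j, c):
    """Concatenate x from Lk and y from Lj when x ends before y starts in c."""
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_j}
    return {
        x + y
        for x, end in ends.items() if end is not None
        for y, start in starts.items() if start is not None and end < start
    }


# Assumed L2; only six of these sequences are contained in c.
l2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
c = [{1}, {2}, {3, 7}, {4}]
candidates = otf_generate(l2, l2, c)
print(candidates)
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, so the call yields the single candidate <1 2 3 4>, in line with the table.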
Finding Sequential Patterns: 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1
        else
            Ck = New candidates generated from Ck-1
Finding Sequential Patterns: 4. DynamicSome Example
• Let step = 2. Use L2 and L2 as arguments to otf-generate to get C4.
• This yields 2 candidate sequences:

C4           Support
<1 2 3 4>    2
<1 3 4 5>    1

• Only <1 2 3 4> is large:

L4           Support
<1 2 3 4>    2

• Pass L2 and L4 as arguments to otf-generate to get C6.
• C6 is found to be empty: C6 = ∅
• In the intermediate phase, C3 is generated from L2 and C5 from L4, using apriori-generate.
• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3.
Performance: Synthetic Data
Performance: Execution Times
Performance: Scale-Up Customers
Performance: Scale-Up
Conclusion
• The problem of mining sequential patterns from a customer DB was introduced.
• Two types of algorithms were introduced to find sequential patterns:
    o CountAll: AprioriAll
    o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.
• AprioriAll and AprioriSome have excellent scale-up properties.
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here: the algorithms look for sequential patterns whose elements are large itemsets, the same building blocks used when mining association rules.
Final Exam Question 2: What is the major difference between the two algorithms, CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every candidate sequence on every pass, but non-maximal sequences are only pruned later, in the maximal phase.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but minimum support is not tested at every value of k.)
Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction with the set of Litemsets it contains, and each Litemset is mapped to an integer, so two Litemsets can be compared for equality in constant time. This reduces the time required to check whether a candidate sequence is contained in a customer sequence, a test that is repeated on every pass.
Problem Description - Cont
Given a database D of customer transactions the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support
Each such sequence represents a sequential pattern
14
Problem DescriptionExample
Note Use Minisup of 25 no less than two customers must support the sequencelt (10 20) (30) gt Does not have enough support (Only by Customer 2)lt (30) gt lt (70) gt lt (30) (40) gt hellip are not maximal
Seq with minimum support
15
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
16
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases
1 Sort Phase
2 Large itemset (Litemset) Phase
3 Transformation Phase
4 Sequence Phase
5 Maximal Phase
17
Finding Sequential Patterns1 Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor-key
This step implicitly converts the original transaction DB into a DB of customer sequences
Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time
18
Finding Sequential Patterns2 Litemset Phase
In this phase we find the set of all Large itemsets (Litemsets) L
We are also simultaneously finding the set of large 1-sequences since this set is just
lt l gt | l isin L
19
Finding Sequential Patterns2 Litemset Phase - Cont
In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present
In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions
20
Finding Sequential Patterns2 Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers
By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence
21
Finding Sequential Patterns2 Litemset Phase - Cont
bull Example with the minimum support 40
22
Finding Sequential Patterns3 Transformation Phase
bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation
23
Finding Sequential Patterns3 Transformation Phase - Cont
bull Each transaction is replaced by the set of all Litemsets contained in the transaction
bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)
bull A customer sequence is now represented by a list of sets of Litemsets
24
Finding Sequential Patterns3 Transformation Phase - Cont
Note (10 20) dropped because of lack of support
(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)
25
Finding Sequential Patterns4 Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences 26
Finding Sequential Patterns4 Sequence Phase
bull Use the set of Litemsets to find the desired
sequences
bull Two families of algorithms are presented
Count-all
Count-some
27
Finding Sequential Patterns4 Sequence Phase
bull Count-all algorithms count all the large
sequences including non-maximal
sequences which are pruned out in the
maximal phase
28
Finding Sequential Patterns4 Sequence Phase
bull Count-some algorithms try to avoid
counting non-maximal sequences by first
counting longer sequences in a forward
phase then counting the sequences skipped
in a backward phase
29
Finding Sequential Patterns4 Sequence Phase AprioriAll
L1 = large 1-sequences result of Litemset phase
for (k = 2 Lk-1 ne k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in cupk Lk
Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences
30
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns4 AprioriSome
bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip
Let hitk = |Lk| |Ck|
(ie ratio of large k-sequences to candidate k-sequences)
function next(k integer) k is the length of seq counted last pass
beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2
elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5
end
bull next returns the length of sequences to count in the next pass 35
Finding Sequential Patterns4 AprioriSome Forward Phase
L1 = large 1-sequences Result of Litemset phase
C1 = L1
last = 1 We last counted Clast
for (k = 2 Ck-1 ne and Llast ne k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k== next(last) ) then begin (next k to count)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
last = k
end
end36
Finding Sequential Patterns4 AprioriSome Backward Phase
for (k-- kgt=1 k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in some L i igtk
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else Lk already known
Delete all sequences in Lk contained in some Li igtk
Answer = Uk Lk (Maximal Phase not Needed)
Notation DT Transformed database 37
Finding Sequential Patterns4 AprioriSome Example
38
Finding Sequential Patterns4 AprioriSome Example - Cont
Minimum Support = 40 (2 customer sequences)
39
Finding Sequential Patterns4 AprioriSome Example
Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates Ck from Lk-1 or Ck-1.
• DynamicSome generates Ck "on the fly", based on the large sequences found in the previous passes and the customer sequences read from the database.
Finding Sequential Patterns: 4. DynamicSome
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.
• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
o Example: when generating sequences of length 9 with a step size of 3, during the pass over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c at hand and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.
• The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns: 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L1 = {large 1-sequences};   // Result of litemset phase
for (k = 2; k <= step and Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1;
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c.
    Lk = Candidates in Ck with minimum support.
end
Finding Sequential Patterns: 4. DynamicSome Forward Phase
for (k = step; Lk ≠ ∅; k += step) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅;
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c);
        foreach sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary).
    end
    Lk+step = Candidates in Ck+step with minimum support.
end
Finding Sequential Patterns: 4. DynamicSome OTF-Generate
// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min{ j | x ⊆ <c1 c2 ... cj> };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max{ j | x ⊆ <cj cj+1 ... cn> };
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start;
Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont
Let otf-generate be called with L2, and let c = <(1) (2) (3 7) (4)>. Thus c1 = (1), c2 = (2), etc.
Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3
Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
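A minimal Python sketch of otf-generate, under the assumption that the customer sequence has already been transformed into a list of sets of litemset ids and that large sequences are tuples of ids. The helper names earliest_end and latest_start are illustrative and not taken from the slides.

```python
from typing import Iterable, List, Optional, Set, Tuple

Sequence = Tuple[int, ...]


def earliest_end(s: Sequence, c: List[Set[int]]) -> Optional[int]:
    """Smallest 1-based j such that s is contained in <c1 ... cj>, else None."""
    pos = 0
    for j, element in enumerate(c, start=1):
        if s[pos] in element:
            pos += 1
            if pos == len(s):
                return j
    return None


def latest_start(s: Sequence, c: List[Set[int]]) -> Optional[int]:
    """Largest 1-based j such that s is contained in <cj ... cn>, else None."""
    pos = len(s) - 1
    for j in range(len(c), 0, -1):
        if s[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


def otf_generate(l_k: Iterable[Sequence], l_j: Iterable[Sequence],
                 c: List[Set[int]]) -> Set[Sequence]:
    """Join sequences from l_k and l_j that occur in c without overlapping."""
    ends = {x: earliest_end(x, c) for x in l_k}
    starts = {y: latest_start(y, c) for y in l_j}
    return {x + y
            for x, end in ends.items() if end is not None
            for y, start in starts.items() if start is not None and end < start}


l2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
c = [{1}, {2}, {3, 7}, {4}]
print(otf_generate(l2, l2, c))
```

Scanning forward greedily gives the earliest possible end, and scanning backward greedily gives the latest possible start, so the join condition end < start holds exactly when the two sequences can be placed one after the other in c.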
Finding Sequential Patterns: 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
        else
            Ck = New candidates generated from Ck-1;
Finding Sequential Patterns: 4. DynamicSome Example
• Let step = 2.
• Use L2 and L2 as the arguments to otf-generate to get C4.
• This yields 2 candidate sequences:
    C4           Support
    <1 2 3 4>    2
    <1 3 4 5>    1
• Only <1 2 3 4> is large:
    L4           Support
    <1 2 3 4>    2
• Pass L2 and L4 as the arguments to otf-generate to get C6.
• C6 is found to be empty (C6 = ∅).
• In the intermediate phase, C3 is generated from L2 and C5 from L4, using apriori-generate.
• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3.
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions
Performance: Synthetic Data
Performance: Execution Times
Performance: Scale-Up Customers
Performance: Scale-Up
Conclusion
• The problem of mining sequential patterns from a customer DB was introduced.
• Two types of algorithms were introduced to find sequential patterns:
o CountAll: AprioriAll
o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.
• AprioriAll and AprioriSome have excellent scale-up properties.
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets, the same building blocks from which association rules are formed.
Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for each sequence on each pass, but maximal sequences must be identified later.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but minimum support is not tested at every value of k.)
Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction with the set of Litemsets it contains, each treated as a single entity, so two Litemsets can be compared for equality in constant time. This speeds up the repeated test of whether a sequence is contained in a customer sequence, reducing the run time.
Problem Description: Example
Note: With a minisup of 25%, no fewer than two customers must support a sequence.
<(10 20) (30)> does not have enough support (it is supported only by Customer 2).
<(30)>, <(70)>, <(30) (40)>, … are not maximal.
Seq with minimum support
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases:
1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase
Finding Sequential Patterns: 1. Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor key.
This step implicitly converts the original transaction DB into a DB of customer sequences.
Recall: a customer sequence is a list of a customer's transactions ordered by increasing transaction time.
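As a rough sketch of the sort phase, assuming each record is a (customer_id, transaction_time, items) tuple (a layout chosen here for illustration), sorting on the two keys and grouping by customer yields the customer sequences:

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, FrozenSet, Iterable, List, Tuple

Record = Tuple[int, int, Iterable[int]]  # (customer_id, transaction_time, items)


def to_customer_sequences(records: Iterable[Record]) -> Dict[int, List[FrozenSet[int]]]:
    """Sort by customer-id (major key) and transaction-time (minor key),
    then collect each customer's transactions in time order."""
    ordered = sorted(records, key=itemgetter(0, 1))
    return {
        customer: [frozenset(items) for _, _, items in group]
        for customer, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical records, deliberately out of order.
records = [(2, 3, {40}), (1, 2, {90}), (1, 1, {30}), (2, 1, {10, 20}), (2, 2, {30})]
print(to_customer_sequences(records))
```

Grouping only works because the records are sorted first; groupby merges adjacent records with the same key.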
Finding Sequential Patterns: 2. Litemset Phase
In this phase we find the set of all large itemsets (Litemsets), L.
We are also simultaneously finding the set of large 1-sequences, since this set is just
{ <l> | l ∈ L }
Finding Sequential Patterns: 2. Litemset Phase - Cont
In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.
In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
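The customer-level definition of support can be illustrated with a short sketch. The function name and the small database below are illustrative assumptions; the point is that a customer counts once, no matter how many of their transactions contain the itemset.

```python
from typing import Iterable, List, Set


def customer_support(itemset: Iterable[int],
                     customer_sequences: List[List[Set[int]]]) -> float:
    """Fraction of customers with at least one transaction containing the itemset."""
    target = frozenset(itemset)
    supporters = sum(
        1 for sequence in customer_sequences
        if any(target <= transaction for transaction in sequence)
    )
    return supporters / len(customer_sequences)


# Illustrative database of five customer sequences.
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
print(customer_support({40, 70}, db))  # customers 2 and 4 support (40 70)
```

The itemset must appear within a single transaction: a customer who buys 40 in one transaction and 70 in another does not support (40 70).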
Finding Sequential Patterns: 2. Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers.
By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.
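One way to sketch the mapping, assuming litemsets are given as sets of item numbers. The ordering used to hand out ids is arbitrary and chosen here only to make the result deterministic; the slides do not prescribe one.

```python
from typing import Dict, FrozenSet, Iterable


def assign_litemset_ids(litemsets: Iterable[Iterable[int]]) -> Dict[FrozenSet[int], int]:
    """Map each litemset to an integer from 1..n (ordering is an arbitrary choice)."""
    ordered = sorted(tuple(sorted(litemset)) for litemset in litemsets)
    return {frozenset(litemset): i for i, litemset in enumerate(ordered, start=1)}


ids = assign_litemset_ids([{30}, {40}, {70}, {40, 70}, {90}])
print(ids)
```

After the mapping, comparing two litemsets is a single integer comparison instead of a set comparison.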
Finding Sequential Patterns: 2. Litemset Phase - Cont
• Example with a minimum support of 40%
Finding Sequential Patterns: 3. Transformation Phase
• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.
Finding Sequential Patterns: 3. Transformation Phase - Cont
• Each transaction is replaced by the set of all Litemsets contained in the transaction.
• Transactions with no Litemsets are dropped. (But empty customer sequences still contribute to the total customer count.)
• A customer sequence is now represented by a list of sets of Litemsets.
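The transformation described above can be sketched as follows, assuming a dictionary from litemsets to integer ids is already available. The id values in the example are illustrative.

```python
from typing import Dict, FrozenSet, List, Set


def transform(customer_sequence: List[Set[int]],
              litemset_ids: Dict[FrozenSet[int], int]) -> List[Set[int]]:
    """Replace each transaction by the ids of the litemsets it contains.

    Transactions containing no litemset are dropped, so the result may be
    an empty list; such a customer still counts toward the customer total.
    """
    transformed = []
    for transaction in customer_sequence:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


litemset_ids = {
    frozenset({30}): 1, frozenset({40}): 2, frozenset({70}): 3,
    frozenset({40, 70}): 4, frozenset({90}): 5,
}
print(transform([{10, 20}, {30}, {40, 60, 70}], litemset_ids))
```

Here the transaction (10 20) disappears because it contains no litemset, and (40 60 70) becomes the ids of (40), (70), and (40 70).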
Finding Sequential Patterns: 3. Transformation Phase - Cont
Note: (10 20) is dropped because of lack of support.
(40 60 70) is replaced with the set of litemsets {(40), (70), (40 70)} (60 does not have minisup).
Finding Sequential Patterns: 4. Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences
Finding Sequential Patterns: 4. Sequence Phase
• Use the set of Litemsets to find the desired sequences.
• Two families of algorithms are presented:
o Count-all
o Count-some
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.
Finding Sequential Patterns: 4. Sequence Phase AprioriAll
L1 = {large 1-sequences};   // Result of Litemset phase
for (k = 2; Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1;
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c.
    Lk = Candidates in Ck with minimum support.
end
Answer = Maximal Sequences in ∪k Lk;
Notation: Lk = set of all large k-sequences; Ck = set of candidate k-sequences
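The counting step inside the loop can be sketched as below. This is a plain linear scan under the assumption that the database is already transformed (each customer sequence is a list of sets of litemset ids); the authors' hash-tree structure is not reproduced, and the function names and sample data are illustrative.

```python
from typing import Iterable, List, Set, Tuple

Sequence = Tuple[int, ...]


def contains(customer_sequence: List[Set[int]], s: Sequence) -> bool:
    """True if the ids of s occur in order, each in a distinct element of the customer sequence."""
    pos = 0
    for element in customer_sequence:
        if pos < len(s) and s[pos] in element:
            pos += 1
    return pos == len(s)


def large_sequences(candidates: Iterable[Sequence],
                    database: List[List[Set[int]]],
                    min_count: int) -> Set[Sequence]:
    """Count each candidate once per supporting customer; keep those with enough support."""
    counts = {s: sum(contains(c, s) for c in database) for s in candidates}
    return {s for s, n in counts.items() if n >= min_count}


# Illustrative transformed database of five customers.
db = [[{1}, {5}], [{1}, {2, 3, 4}], [{1, 3}], [{1}, {2, 3, 4}, {5}], [{5}]]
print(large_sequences([(1, 5), (1, 4), (5, 1), (1, 3)], db, min_count=2))
```

Note that <1 3> is not contained in a customer sequence whose single transaction holds both litemsets: the elements of a sequence must come from different transactions.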
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as its argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:
insert into Ck
select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
from Lk-1 p, Lk-1 q
where p.litemset1 = q.litemset1, ...,
      p.litemsetk-2 = q.litemsetk-2;
• Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
Example
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
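The join and prune steps can be sketched in Python as follows. The join is taken literally from the SQL on the slide, so p and q may be the same row; that reading, the function name, and the sample L3 are assumptions made for illustration.

```python
from itertools import product
from typing import Iterable, Set, Tuple

Sequence = Tuple[int, ...]


def apriori_generate(large_prev: Iterable[Sequence]) -> Set[Sequence]:
    """Generate candidate k-sequences from the large (k-1)-sequences."""
    large = set(large_prev)
    # Join: p and q agree on their first k-2 litemsets; append q's last litemset to p.
    joined = {p + (q[-1],)
              for p, q in product(large, repeat=2)
              if p[:-1] == q[:-1]}

    # Prune: every (k-1)-subsequence, obtained by dropping one position, must be large.
    def all_subsequences_large(c: Sequence) -> bool:
        return all(c[:i] + c[i + 1:] in large for i in range(len(c)))

    return {c for c in joined if all_subsequences_large(c)}


l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_generate(l3))
```

In this run the join produces <1 2 4 3> among others, and the prune step removes it because <2 4 3> is not in L3, matching the example on the slide.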
Finding Sequential Patterns: 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.
for (k = n; k > 1; k--) do
    foreach k-sequence sk do
        Delete from S all subsequences of sk
• The authors claim that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
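A direct way to illustrate the maximal phase is to keep every large sequence that is not contained in another one. Note that this sketch is a simple quadratic check written from the definition of containment, not the loop above and not the hash-tree method the authors refer to; sequences are modeled as tuples of frozensets of items, and the sample data is illustrative.

```python
from typing import FrozenSet, Iterable, List, Tuple

ItemsetSequence = Tuple[FrozenSet[int], ...]


def is_contained(s: ItemsetSequence, t: ItemsetSequence) -> bool:
    """True if each itemset of s is a subset of a distinct itemset of t, in order."""
    pos = 0
    for element in t:
        if pos < len(s) and s[pos] <= element:
            pos += 1
    return pos == len(s)


def maximal_sequences(large: Iterable[ItemsetSequence]) -> List[ItemsetSequence]:
    """Keep only the sequences not contained in any other large sequence."""
    large = list(large)
    return [s for s in large
            if not any(s != t and is_contained(s, t) for t in large)]


def f(*items: int) -> FrozenSet[int]:
    """Shorthand for building an itemset."""
    return frozenset(items)


large = [
    (f(30),), (f(40),), (f(70),), (f(40, 70),), (f(90),),
    (f(30), f(40)), (f(30), f(70)), (f(30), f(90)), (f(30), f(40, 70)),
]
print(maximal_sequences(large))
```

Containment has to be checked on itemsets rather than on litemset ids, because <(30) (40)> is contained in <(30) (40 70)> even though (40) and (40 70) are different litemsets.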
Outline
Introduction
Problem Description
Finding Sequential Patterns
Performance
Conclusion
Final Exam Questions
16
Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases
1 Sort Phase
2 Large itemset (Litemset) Phase
3 Transformation Phase
4 Sequence Phase
5 Maximal Phase
17
Finding Sequential Patterns1 Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor-key
This step implicitly converts the original transaction DB into a DB of customer sequences
Recall a Customer Sequence is a list of customer transactions ordered by increasing transaction time
18
Finding Sequential Patterns2 Litemset Phase
In this phase we find the set of all Large itemsets (Litemsets) L
We are also simultaneously finding the set of large 1-sequences since this set is just
lt l gt | l isin L
19
Finding Sequential Patterns2 Litemset Phase - Cont
In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present
In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions
20
Finding Sequential Patterns2 Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers
By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence
21
Finding Sequential Patterns2 Litemset Phase - Cont
bull Example with the minimum support 40
22
Finding Sequential Patterns3 Transformation Phase
bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation
23
Finding Sequential Patterns3 Transformation Phase - Cont
bull Each transaction is replaced by the set of all Litemsets contained in the transaction
bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)
bull A customer sequence is now represented by a list of sets of Litemsets
24
Finding Sequential Patterns3 Transformation Phase - Cont
Note (10 20) dropped because of lack of support
(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)
25
Finding Sequential Patterns4 Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences 26
Finding Sequential Patterns4 Sequence Phase
bull Use the set of Litemsets to find the desired
sequences
bull Two families of algorithms are presented
Count-all
Count-some
27
Finding Sequential Patterns4 Sequence Phase
bull Count-all algorithms count all the large
sequences including non-maximal
sequences which are pruned out in the
maximal phase
28
Finding Sequential Patterns4 Sequence Phase
bull Count-some algorithms try to avoid
counting non-maximal sequences by first
counting longer sequences in a forward
phase then counting the sequences skipped
in a backward phase
29
Finding Sequential Patterns4 Sequence Phase AprioriAll
L1 = large 1-sequences result of Litemset phase
for (k = 2 Lk-1 ne k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in cupk Lk
Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences
30
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns4 AprioriSome
bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip
Let hitk = |Lk| |Ck|
(ie ratio of large k-sequences to candidate k-sequences)
function next(k integer) k is the length of seq counted last pass
beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2
elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5
end
bull next returns the length of sequences to count in the next pass 35
Finding Sequential Patterns4 AprioriSome Forward Phase
L1 = large 1-sequences Result of Litemset phase
C1 = L1
last = 1 We last counted Clast
for (k = 2 Ck-1 ne and Llast ne k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k== next(last) ) then begin (next k to count)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
last = k
end
end36
Finding Sequential Patterns4 AprioriSome Backward Phase
for (k-- kgt=1 k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in some L i igtk
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else Lk already known
Delete all sequences in Lk contained in some Li igtk
Answer = Uk Lk (Maximal Phase not Needed)
Notation DT Transformed database 37
Finding Sequential Patterns4 AprioriSome Example
38
Finding Sequential Patterns4 AprioriSome Example - Cont
Minimum Support = 40 (2 customer sequences)
39
Finding Sequential Patterns4 AprioriSome Example
Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40
Finding Sequential Patterns4 DynamicSome
bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase
bull AprioriSome generates Ck from Lk-1 or Ck-1
bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database
41
Finding Sequential Patterns4 DynamicSome
bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3
bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database
o While generating sequences of length 9 with a step
size 3 While passing the data if sequences s6 isin L6
and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence
42
Finding Sequential Patterns4 DynamicSome
bull In the intermediate phase generate the candidate sequences for the skipped lengths
o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5
bull The backward phase is identical to AprioriSome 43
Finding Sequential Patterns4 DynamicSome Initialization Phase
step is an integer ge 1
L1 = large 1-sequences Result of litemset phase
for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do
begin
Ck = New candidates generated from Lk1048576-1
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
44
Finding Sequential Patterns4 DynamicSome Forward Phase
for ( k = step Lk1048576 ne k+= step ) do
begin
find Lk+step from Lk and Lstep
Ck+step =
foreach customer-sequence c in DT do
begin
X = otf-generate(Lk Lstep c)
foreach sequence x isin Xrsquo increment its count in
Ck+step (adding it to Ck+step if necessary)
end
Lk+step = Candidates in Ck+step with minimum support
end45
Finding Sequential Patterns4 DynamicSome OTF-Generate
c is the customer sequence lt c1c2cn gt
Xk = subseq(Lk c)
forall sequences x isin Xk do
xend = min j | x sube lt c1c2cj gt
Xj = subseq(Lj c)
forall sequences x isin Xj do
xstart = max j | x sube lt cjcj+1cn gt
Answer = join of Xkwith Xj with the join condition
Xkend lt Xjstart46
Finding Sequential Patterns4 DynamicSome OTF-Generate cont
Let otf-generate be called with
L2 and let c = lt1 2 3 7
4gt Thus c1 = 1 c2= 2
etc
Thus the result of the join with the join condition
X2end lt X2start
(where X2 denotes the seq of
length 2) is the single sequence lt1 2 3 4gt
Seq
lt1 2gt
End
2
Start
1
lt1 3gt 3 1
lt1 4gt 4 1
lt2 3gt 3 2
lt2 4gt 4 2
lt3 4gt 4 3
47
Finding Sequential Patterns4 DynamicSome intermediate Phase
for ( k-- k gt 1 k-- ) do
if (Lk not yet determined) then
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
48
Finding Sequential Patterns4 DynamicSome Example
bull Let step = 2
use L2 and L2 as argument in otf-generate to get C4
49
Finding Sequential Patterns4 DynamicSome Example
bull Get 2 candidate sequences
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
50
Finding Sequential Patterns4 DynamicSome Example
bull Only lt1 2 3 4gt is large
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
L4 Minisup
lt1 2 3 4gt 2
51
Finding Sequential Patterns4 DynamicSome Example
bull pass as arg to otf-gen L2 and L4 to get C6
L4 sup
lt1 2 3 4gt 2
52
Finding Sequential Patterns4 DynamicSome Example
bull C6 is found to be empty
L4 sup
lt1 2 3 4gt 2C6 =
53
Finding Sequential Patterns4 DynamicSome Example
bull In the intermediate phase C3 is generated
C3
from L2 and C5 from L4
using apriori-generate
L4 sup
lt1 2 3 4gt 2C5
54
Finding Sequential Patterns4 DynamicSome Example
bull C5 is found to be empty so only C3 is counted during the backward phase to get L3
L3 C3
C5 =
55
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions56
Performance Synthetic Data
57
Performance Execution Times
58
Performance Scale-Up Customers
59
Performance Scale-Up
60
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions61
Conclusion
bull The problem of mining sequential patterns from a customer DB was introduced
bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome
bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup
bull AprioriAll and AprioriSome have excellent scale-up properties 62
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions63
Final Exam Question 1 Compare and contrast association rules and sequential
patterns How do they relate to each other in the context of the Apriori algorithms
64
Final Exam Question 1 Compare and contrast association rules and sequential
patterns How do they relate to each other in the context of the Apriori algorithms
Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules
65
Final Exam Question 2: What is the major difference between the two algorithms, CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximal sequences must be checked for later.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during runtime, but the minimum support is not tested at all values of k.)

Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction with the set of Litemsets it contains, each mapped to an integer. Two Litemsets can then be compared for equality in constant time, which reduces the time required to check whether a candidate sequence is contained in a customer sequence, a test that is repeated on every pass.

Finding Sequential Patterns
The problem of finding sequential patterns is split into five phases:
1. Sort Phase
2. Large itemset (Litemset) Phase
3. Transformation Phase
4. Sequence Phase
5. Maximal Phase

Finding Sequential Patterns: 1. Sort Phase
The DB is sorted with customer-id as the major key and transaction-time as the minor key.
This step implicitly converts the original transaction DB into a DB of customer sequences.
Recall: a customer sequence is a list of a customer's transactions ordered by increasing transaction time.

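The sort phase can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `to_customer_sequences`, the `(customer_id, time, items)` record layout, and the sample records are all assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Group (customer_id, time, items) records into customer sequences.

    Returns a dict mapping each customer_id to its list of itemsets,
    ordered by increasing transaction time.
    """
    # Major key: customer_id; minor key: transaction time.
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [frozenset(items) for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical records (customer_id, time, items), deliberately unsorted.
records = [
    (2, 3, {40, 60, 70}),
    (1, 1, {30}),
    (2, 1, {10, 20}),
    (1, 2, {90}),
    (2, 2, {30}),
]
sequences = to_customer_sequences(records)
```

Sorting once and grouping adjacent rows mirrors the slide's point that the conversion to customer sequences is implicit in the sort order.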
Finding Sequential Patterns: 2. Litemset Phase
In this phase we find the set of all large itemsets (Litemsets), L.
We are also simultaneously finding the set of large 1-sequences, since this set is just { <l> | l ∈ L }.

Finding Sequential Patterns: 2. Litemset Phase - Cont
In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.
In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.

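The customer-based definition of support can be made concrete with a short Python sketch. The function name `customer_support` and the five-customer database below are assumptions chosen for illustration; they are not taken from the slides.

```python
def customer_support(itemset, customer_sequences):
    """Fraction of customers with at least one transaction containing itemset."""
    itemset = frozenset(itemset)
    supporters = sum(
        1
        for sequence in customer_sequences
        # A customer counts once, however many transactions match.
        if any(itemset <= transaction for transaction in sequence)
    )
    return supporters / len(customer_sequences)


# Hypothetical database: one list of transactions (sets of items) per customer.
database = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
```

Here `customer_support({40, 70}, database)` counts two of the five customers, whereas a transaction-based count would divide by the number of transactions instead.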
Finding Sequential Patterns: 2. Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers.
By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check whether a sequence is contained in a customer sequence.

Finding Sequential Patterns: 2. Litemset Phase - Cont
• Example with a minimum support of 40% [slide table not preserved]

Finding Sequential Patterns: 3. Transformation Phase
• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.

Finding Sequential Patterns: 3. Transformation Phase - Cont
• Each transaction is replaced by the set of all Litemsets contained in the transaction.
• Transactions with no Litemsets are dropped. (Empty customer sequences still contribute to the total customer count.)
• A customer sequence is now represented by a list of sets of Litemsets.

Finding Sequential Patterns: 3. Transformation Phase - Cont
Note: (10 20) is dropped because of lack of support.
(40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)} (60 does not have minimum support).

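The mapping to integers and the transformation can be sketched together in Python. The function name `transform`, the particular list of Litemsets, and the integer IDs assigned to them are assumptions for illustration; only the treatment of (10 20) and (40 60 70) follows the note above.

```python
def transform(customer_sequence, litemset_ids):
    """Replace each transaction by the IDs of the Litemsets it contains.

    Transactions that contain no Litemset are dropped.
    """
    transformed = []
    for transaction in customer_sequence:
        ids = {
            litemset_id
            for litemset, litemset_id in litemset_ids.items()
            if litemset <= transaction
        }
        if ids:
            transformed.append(ids)
    return transformed


# Assumed Litemsets, mapped to contiguous integers 1..5.
litemsets = [
    frozenset({30}),
    frozenset({40}),
    frozenset({70}),
    frozenset({40, 70}),
    frozenset({90}),
]
litemset_ids = {litemset: i for i, litemset in enumerate(litemsets, start=1)}

customer = [{10, 20}, {30}, {40, 60, 70}]
transformed = transform(customer, litemset_ids)
```

After this step a containment test only compares integers, which is the constant-time equality check the Litemset phase sets up.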
Finding Sequential Patterns: 4. Sequence Phase Overview
• Seed set of large sequences
• Create candidate sequences
• Scan data to find support of candidate sequences
• Determine large sequences

Finding Sequential Patterns: 4. Sequence Phase
• Use the set of Litemsets to find the desired sequences.
• Two families of algorithms are presented:
o Count-all
o Count-some

Finding Sequential Patterns: 4. Sequence Phase
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.

Finding Sequential Patterns: 4. Sequence Phase
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.

Finding Sequential Patterns: 4. Sequence Phase: AprioriAll
L1 = {large 1-sequences}; // result of Litemset phase
for (k = 2; Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c
    Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in ∪k Lk
Notation: Lk = set of all large k-sequences; Ck = set of candidate k-sequences

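The body of the loop above, one counting pass over the transformed database, might look like the following Python sketch. The names `contains` and `count_pass` are hypothetical, candidates are assumed to be tuples of Litemset IDs, and the five-customer transformed database is toy data made up for the example.

```python
def contains(customer_sequence, candidate):
    """True if the candidate's Litemset IDs occur, in order, in distinct transactions."""
    pos = 0
    for litemset_id in candidate:
        # Greedily take the earliest transaction that holds this ID.
        while pos < len(customer_sequence) and litemset_id not in customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1  # the next ID must come from a later transaction
    return True


def count_pass(candidates, database, min_count):
    """Count each candidate once per supporting customer; keep the large ones."""
    counts = {candidate: 0 for candidate in candidates}
    for customer_sequence in database:
        for candidate in candidates:
            if contains(customer_sequence, candidate):
                counts[candidate] += 1
    return {c: n for c, n in counts.items() if n >= min_count}


# Assumed transformed database: each customer is a list of sets of Litemset IDs.
database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large = count_pass({(1, 2), (1, 5), (4, 5), (5, 1)}, database, min_count=2)
```

This sketch tests every candidate against every customer; the hash-tree structure mentioned later in the slides exists to avoid exactly that brute-force scan.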
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:
    insert into Ck
    select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
    from Lk-1 p, Lk-1 q
    where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;
• Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.

Finding Sequential Patterns: 4. AprioriAll Candidate Generation
Example:
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.

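The join and prune steps can be sketched in Python as follows. The function names are hypothetical, sequences are assumed to be tuples of Litemset IDs, and the set `large_3` is an assumed L3 chosen so that the pruning of <1 2 4 3> described above can be reproduced.

```python
from itertools import combinations


def join(prev_large):
    """Join L(k-1) with itself on the first k-2 Litemsets.

    As in the slide's where clause, nothing forbids p == q, so a
    sequence may also be extended with its own last Litemset.
    """
    return {
        p + (q[-1],)
        for p in prev_large
        for q in prev_large
        if p[:-1] == q[:-1]
    }


def prune(candidates, prev_large):
    """Drop candidates that have a (k-1)-subsequence outside L(k-1)."""
    return {
        c
        for c in candidates
        # combinations() preserves order, so it yields the subsequences.
        if all(sub in prev_large for sub in combinations(c, len(c) - 1))
    }


def apriori_generate(prev_large):
    return prune(join(prev_large), prev_large)


# Assumed large 3-sequences for the example.
large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
candidates_4 = apriori_generate(large_3)
```

With this input the join produces <1 2 4 3> among others, and the prune step removes it because <2 4 3> is missing from `large_3`.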
Finding Sequential Patterns: 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.
    for (k = n; k > 1; k--) do
        foreach k-sequence sk do
            Delete from S all subsequences of sk
• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.

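A direct, unoptimized version of the maximal phase is sketched below in Python. It is an illustration only: the names `seq`, `is_contained` and `maximal_sequences` are hypothetical, sequences are assumed to be tuples of itemsets (mapped back from Litemset IDs) so that containment can use itemset inclusion, and, unlike the slide's loop, 1-sequences are also allowed to eliminate sequences they contain. No hash-tree is used.

```python
def seq(*itemsets):
    """Helper: build a hashable sequence from plain sets of items."""
    return tuple(frozenset(itemset) for itemset in itemsets)


def is_contained(a, b):
    """True if each itemset of a is a subset of a distinct itemset of b, in order."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Remove every sequence that is contained in another large sequence."""
    remaining = set(large)
    # Visit longer sequences first, as in the slide's loop from n downwards.
    for s in sorted(large, key=len, reverse=True):
        if s in remaining:
            remaining = {t for t in remaining if t == s or not is_contained(t, s)}
    return remaining


# Assumed set of large sequences for illustration.
large = {
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}), seq({30}, {40, 70}),
}
maximal = maximal_sequences(large)
```

The quadratic scan is acceptable for a toy input; it is the cost that the authors' hash-tree approach is meant to reduce.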
Finding Sequential Patterns: 4. AprioriAll Example [slide graphic not preserved]

Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequences to skip.
Let hitk = |Lk| / |Ck| (i.e., the ratio of large k-sequences to candidate k-sequences).
    function next(k: integer) // k is the length of the sequences counted in the last pass
    begin
        if (hitk < 0.666) return k + 1;
        elsif (hitk < 0.75) return k + 2;
        elsif (hitk < 0.80) return k + 3;
        elsif (hitk < 0.85) return k + 4;
        else return k + 5;
    end
• next returns the length of the sequences to count in the next pass.

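The skipping heuristic translates almost line for line into Python. In this sketch the name `next_length` is an assumption (chosen to avoid shadowing Python's built-in `next`), and the hit ratio is passed in explicitly rather than looked up.

```python
def next_length(k, hit_k):
    """Length of sequences to count next, given the hit ratio |Lk| / |Ck|."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5
```

A high hit ratio means most candidates turned out to be large, so longer sequences are likely to be large as well and more lengths can be skipped; for instance `next_length(2, 0.5)` moves to length 3 while `next_length(3, 0.9)` jumps to length 8.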
Finding Sequential Patterns: 4. AprioriSome Forward Phase
L1 = {large 1-sequences}; // Result of Litemset phase
C1 = L1;
last = 1; // We last counted Clast
for (k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++) do
begin
    if (Lk-1 known) then
        Ck = New candidates generated from Lk-1
    else
        Ck = New candidates generated from Ck-1
    if (k == next(last)) then begin // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
        last = k
    end
end

Finding Sequential Patterns: 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
    if (Lk not found in forward phase) then begin
        Delete all sequences in Ck contained in some Li, i > k
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
    end
    else // Lk already known
        Delete all sequences in Lk contained in some Li, i > k
Answer = ∪k Lk // Maximal phase not needed
Notation: DT = transformed database

Finding Sequential Patterns: 4. AprioriSome Example
Minimum support = 40% (2 customer sequences) [slide tables not preserved]
Answer: <1 2 3 4>, <1 3 5>, <4 5>

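The forward and backward phases can be put together in a compact Python sketch. Several things here are assumptions rather than the authors' code: all function names, the toy transformed database, the skipping rule next(k) = 2k passed in as `next_length`, and the use of plain subsequence matching on Litemset IDs for "contained in".

```python
from itertools import combinations


def contains(customer_sequence, candidate):
    pos = 0
    for litemset_id in candidate:
        while pos < len(customer_sequence) and litemset_id not in customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1
    return True


def is_subsequence(short, long):
    remaining = iter(long)
    return all(x in remaining for x in short)


def generate(prev):
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined if all(s in prev for s in combinations(c, len(c) - 1))}


def count_large(candidates, database, min_count):
    return {c for c in candidates
            if sum(contains(seq, c) for seq in database) >= min_count}


def apriori_some(large_1, database, min_count, next_length):
    L = {1: set(large_1)}
    C = {1: set(large_1)}
    last, k = 1, 2
    # Forward phase: generate every Ck, but count only the selected lengths.
    while C[k - 1] and L[last]:
        C[k] = generate(L[k - 1] if k - 1 in L else C[k - 1])
        if k == next_length(last):
            L[k] = count_large(C[k], database, min_count)
            last = k
        k += 1
    # Backward phase: drop non-maximal sequences, then count skipped lengths.
    for k in range(k - 1, 0, -1):
        longer = [s for i in L if i > k for s in L[i]]
        source = L[k] if k in L else C[k]
        kept = {c for c in source if not any(is_subsequence(c, s) for s in longer)}
        L[k] = kept if k in L else count_large(kept, database, min_count)
    return set().union(*L.values())


# Assumed toy transformed database (sets of Litemset IDs per transaction).
database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large_1 = count_large({(i,) for i in range(1, 6)}, database, 2)
answer = apriori_some(large_1, database, 2, next_length=lambda k: 2 * k)
```

With this assumed database and a minimum support of 2 customers, lengths 2 and 4 are counted in the forward phase and length 3 only in the backward phase, after the subsequences of <1 2 3 4> have been removed from C3.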
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates Ck from Lk-1 or Ck-1.
• DynamicSome generates Ck "on the fly", based on the large sequences found in the previous passes and the customer sequences read from the database.

Finding Sequential Patterns: 4. DynamicSome
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.
• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.

Finding Sequential Patterns: 4. DynamicSome
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.
• The backward phase is identical to that of AprioriSome.

Finding Sequential Patterns: 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L1 = {large 1-sequences}; // Result of litemset phase
for (k = 2; k <= step and Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c
    Lk = Candidates in Ck with minimum support
end

Finding Sequential Patterns: 4. DynamicSome Forward Phase
for (k = step; Lk ≠ ∅; k += step) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c)
        foreach sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary)
    end
    Lk+step = Candidates in Ck+step with minimum support
end

Finding Sequential Patterns: 4. DynamicSome OTF-Generate
// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min{ j | x ⊆ <c1 c2 ... cj> };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max{ j | x ⊆ <cj cj+1 ... cn> };
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start;

Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont
Let otf-generate be called with L2, and let c = <{1} {2} {3 7} {4}>. Thus c1 = {1}, c2 = {2}, etc.
Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3
Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.

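The on-the-fly generation can be sketched in Python and checked against the table above. The function names are hypothetical, sequences are assumed to be tuples of Litemset IDs, and `large_2` is an assumed L2 that holds the six sequences of the table plus one sequence, <4 5>, that c does not contain.

```python
def end_index(c, x):
    """Smallest j (1-based) such that x is contained in <c1 ... cj>, else None."""
    pos = 0
    for litemset_id in x:
        while pos < len(c) and litemset_id not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def start_index(c, x):
    """Largest j (1-based) such that x is contained in <cj ... cn>, else None."""
    pos = len(c) - 1
    for litemset_id in reversed(x):
        while pos >= 0 and litemset_id not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2


def otf_generate(large_k, large_j, c):
    """Concatenate x from Lk and y from Lj when x ends before y starts in c."""
    ends = {x: end_index(c, x) for x in large_k}
    starts = {y: start_index(c, y) for y in large_j}
    return {
        x + y
        for x, end in ends.items() if end is not None
        for y, start in starts.items() if start is not None and end < start
    }


c = [{1}, {2}, {3, 7}, {4}]
large_2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)}
candidates = otf_generate(large_2, large_2, c)
```

Matching forwards for `end` and backwards for `start` gives the tightest positions, so the strict inequality in the join guarantees the two parts do not overlap in c.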
Finding Sequential Patterns: 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1
        else
            Ck = New candidates generated from Ck-1

Finding Sequential Patterns: 4. DynamicSome Example
• Let step = 2. Use L2 and L2 as arguments to otf-generate to get C4.
• This yields 2 candidate sequences:
C4           Support
<1 2 3 4>    2
<1 3 4 5>    1
• Only <1 2 3 4> is large:
L4           Support
<1 2 3 4>    2
• Pass L2 and L4 as arguments to otf-generate to get C6.
• C6 is found to be empty (C6 = ∅).
• In the intermediate phase, C3 is generated from L2 and C5 from L4 using apriori-generate.

Finding Sequential Patterns2 Litemset Phase
In this phase we find the set of all Large itemsets (Litemsets) L
We are also simultaneously finding the set of large 1-sequences since this set is just
lt l gt | l isin L
19
Finding Sequential Patterns2 Litemset Phase - Cont
In Apriori the support for an itemset was defined as the fraction of transactions in which an itemset is present
In the sequential pattern finding problem the support is the fraction of customers who bought the itemset in any one of their possibly many transactions
20
Finding Sequential Patterns2 Litemset Phase - Cont
The set of Litemsets is mapped to a set of contiguous integers
By treating Litemsets as single entities two Litemsets can be compared for equality in constant time reducing the time required to check if a sequence is contained in a customer sequence
21
Finding Sequential Patterns2 Litemset Phase - Cont
bull Example with the minimum support 40
22
Finding Sequential Patterns3 Transformation Phase
bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation
23
Finding Sequential Patterns3 Transformation Phase - Cont
bull Each transaction is replaced by the set of all Litemsets contained in the transaction
bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)
bull A customer sequence is now represented by a list of sets of Litemsets
24
Finding Sequential Patterns3 Transformation Phase - Cont
Note (10 20) dropped because of lack of support
(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)
25
Finding Sequential Patterns4 Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences 26
Finding Sequential Patterns4 Sequence Phase
bull Use the set of Litemsets to find the desired
sequences
bull Two families of algorithms are presented
Count-all
Count-some
27
Finding Sequential Patterns4 Sequence Phase
bull Count-all algorithms count all the large
sequences including non-maximal
sequences which are pruned out in the
maximal phase
28
Finding Sequential Patterns4 Sequence Phase
bull Count-some algorithms try to avoid
counting non-maximal sequences by first
counting longer sequences in a forward
phase then counting the sequences skipped
in a backward phase
29
Finding Sequential Patterns4 Sequence Phase AprioriAll
L1 = large 1-sequences result of Litemset phase
for (k = 2 Lk-1 ne k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in cupk Lk
Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences
30
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns4 AprioriSome
bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip
Let hitk = |Lk| |Ck|
(ie ratio of large k-sequences to candidate k-sequences)
function next(k integer) k is the length of seq counted last pass
beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2
elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5
end
bull next returns the length of sequences to count in the next pass 35
Finding Sequential Patterns4 AprioriSome Forward Phase
L1 = large 1-sequences Result of Litemset phase
C1 = L1
last = 1 We last counted Clast
for (k = 2 Ck-1 ne and Llast ne k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k== next(last) ) then begin (next k to count)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
last = k
end
end36
Finding Sequential Patterns4 AprioriSome Backward Phase
for (k-- kgt=1 k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in some L i igtk
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else Lk already known
Delete all sequences in Lk contained in some Li igtk
Answer = Uk Lk (Maximal Phase not Needed)
Notation DT Transformed database 37
Finding Sequential Patterns: 4. AprioriSome Example
Minimum support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, DynamicSome skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates C_k from L_{k-1} or C_{k-1}.
• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.
• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
    o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L_6 and s3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then the concatenation <s6.s3> is a candidate (6+3)-sequence.
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
    o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.
• The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns: 4. DynamicSome Initialization Phase
    // step is an integer ≥ 1
    L_1 = {large 1-sequences};   // Result of litemset phase
    for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
    begin
        C_k = New candidates generated from L_{k-1};
        foreach customer-sequence c in D_T do
            Increment the count of all candidates in C_k that are contained in c.
        L_k = Candidates in C_k with minimum support.
    end
Finding Sequential Patterns: 4. DynamicSome Forward Phase
    for (k = step; L_k ≠ ∅; k += step) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in D_T do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary).
        end
        L_{k+step} = Candidates in C_{k+step} with minimum support.
    end
Finding Sequential Patterns: 4. DynamicSome OTF-Generate
    // c is the customer sequence <c1 c2 ... cn>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x ⊆ <c1 c2 ... cj> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x ⊆ <cj cj+1 ... cn> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start
Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont.
Let otf-generate be called with L_2 and let c = <(1) (2) (3 7) (4)>. Thus c1 = (1), c2 = (2), etc.
    Sequence    End    Start
    <1 2>       2      1
    <1 3>       3      1
    <1 4>       4      1
    <2 3>       3      2
    <2 4>       4      2
    <3 4>       4      3
Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
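The end/start table and the join can be reproduced with a short Python sketch. It assumes sequences are tuples of integer litemset ids and the customer sequence is a list of sets of ids; the helper names earliest_end and latest_start are hypothetical. Greedy matching from the left gives the smallest end position, and greedy matching from the right gives the largest start position.

```python
from typing import List, Optional, Set, Tuple

Seq = Tuple[int, ...]


def earliest_end(x: Seq, c: List[Set[int]]) -> Optional[int]:
    """Smallest 1-based j such that x is contained in <c1 ... cj>, or None."""
    pos = 0
    for lit in x:
        while pos < len(c) and lit not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def latest_start(x: Seq, c: List[Set[int]]) -> Optional[int]:
    """Largest 1-based j such that x is contained in <cj ... cn>, or None."""
    pos = len(c) - 1
    for lit in reversed(x):
        while pos >= 0 and lit not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2


def otf_generate(l_k: List[Seq], l_j: List[Seq], c: List[Set[int]]) -> List[Seq]:
    """Concatenate x from l_k and y from l_j whenever x ends before y starts in c."""
    ends = []
    for x in l_k:
        e = earliest_end(x, c)
        if e is not None:
            ends.append((x, e))
    starts = []
    for y in l_j:
        s = latest_start(y, c)
        if s is not None:
            starts.append((y, s))
    return [x + y for x, e in ends for y, s in starts if e < s]


# The example from the slide: c = <(1) (2) (3 7) (4)> and the six 2-sequences in the table.
c = [{1}, {2}, {3, 7}, {4}]
l2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
result = otf_generate(l2, l2, c)
```

On this input only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, so result holds the single sequence (1, 2, 3, 4).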
Finding Sequential Patterns: 4. DynamicSome Intermediate Phase
    for (k--; k > 1; k--) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = New candidates generated from L_{k-1};
            else
                C_k = New candidates generated from C_{k-1};
Finding Sequential Patterns: 4. DynamicSome Example
• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.
• This gives 2 candidate sequences:
    C_4          Support
    <1 2 3 4>    2
    <1 3 4 5>    1
• Only <1 2 3 4> is large:
    L_4          Support
    <1 2 3 4>    2
• Pass L_4 and L_2 as arguments to otf-generate to get C_6.
• C_6 is found to be empty (C_6 = ∅).
• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.
• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.
Performance: Synthetic Data
Performance: Execution Times
Performance: Scale-Up (Customers)
Performance: Scale-Up
Conclusion
• The problem of mining sequential patterns from a customer database was introduced.
• Two types of algorithms were introduced to find sequential patterns:
    o CountAll: AprioriAll
    o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum support.
• AprioriAll and AprioriSome have excellent scale-up properties.
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori-based algorithms studied here: the litemset phase finds large itemsets much as in association-rule mining (with support counted per customer rather than per transaction), and the sequence phase then looks for sequential patterns whose elements are those large itemsets.
Final Exam Question 2: What is the major difference between the two algorithms, CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for every sequence on each pass, but the maximal sequences must be identified later.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but the minimum support is not tested at every value of k during the forward phase.)
Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The algorithms must repeatedly test which candidate sequences are contained in each customer sequence. The transformation replaces every transaction with the set of litemsets it contains, each mapped to an integer, so two litemsets can be compared for equality in constant time, which reduces the run time.
Finding Sequential Patterns: 2. Litemset Phase - Cont.
In Apriori, the support for an itemset was defined as the fraction of transactions in which the itemset is present.
In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions.
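The two definitions of support can be compared directly in a few lines of Python. The function names and the small database below are illustrative assumptions, not data recovered from the slides; each customer maps to a time-ordered list of transactions.

```python
from typing import Dict, List, Set


def transaction_support(itemset: Set[int], customers: Dict[int, List[Set[int]]]) -> float:
    """Apriori-style support: fraction of all transactions that contain the itemset."""
    transactions = [t for seq in customers.values() for t in seq]
    return sum(itemset <= t for t in transactions) / len(transactions)


def customer_support(itemset: Set[int], customers: Dict[int, List[Set[int]]]) -> float:
    """Sequential-pattern support: fraction of customers with at least one such transaction."""
    hits = sum(any(itemset <= t for t in seq) for seq in customers.values())
    return hits / len(customers)


# Hypothetical database: customer id -> list of transactions (sets of item ids).
customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
```

Here the itemset {40, 70} occurs in 2 of 10 transactions but is bought by 2 of 5 customers, so the two measures differ.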
Finding Sequential Patterns: 2. Litemset Phase - Cont.
The set of litemsets is mapped to a set of contiguous integers.
By treating litemsets as single entities, two litemsets can be compared for equality in constant time, which reduces the time required to check whether a sequence is contained in a customer sequence.
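A minimal sketch of such a mapping in Python. The ordering rule (by size, then by sorted items) is an arbitrary assumption made only to keep the result deterministic; any one-to-one assignment of contiguous integers would serve.

```python
from typing import Dict, FrozenSet, Iterable


def map_litemsets(litemsets: Iterable[FrozenSet[int]]) -> Dict[FrozenSet[int], int]:
    """Assign the ids 1..n to the litemsets so they can be compared as plain integers."""
    ordered = sorted(set(litemsets), key=lambda s: (len(s), sorted(s)))
    return {litemset: i for i, litemset in enumerate(ordered, start=1)}


# Hypothetical litemsets, written as frozensets of item ids.
ids = map_litemsets([
    frozenset({30}), frozenset({40}), frozenset({70}),
    frozenset({40, 70}), frozenset({90}),
])
```

After the mapping, comparing two litemsets is a single integer comparison instead of a set comparison.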
Finding Sequential Patterns: 2. Litemset Phase - Cont.
• Example with a minimum support of 40%
Finding Sequential Patterns: 3. Transformation Phase
• As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, the customer sequences are transformed into an alternative representation.
Finding Sequential Patterns: 3. Transformation Phase - Cont.
• Each transaction is replaced by the set of all litemsets contained in the transaction.
• Transactions with no litemsets are dropped. (Customer sequences left empty still contribute to the total customer count.)
• A customer sequence is now represented by a list of sets of litemsets.
Note: (10 20) is dropped because of lack of support.
(40 60 70) is replaced with the set of litemsets {(40), (70), (40 70)} (60 does not have minimum support).
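The transformation and the containment test it speeds up can be sketched in Python. The litemset-to-id table and the function names are assumptions for illustration; the customer sequence mirrors the (10 20) and (40 60 70) note above.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def transform(customer_sequence: List[Set[int]],
              litemset_ids: Dict[FrozenSet[int], int]) -> List[Set[int]]:
    """Replace each transaction by the ids of the litemsets it contains; drop empty ones."""
    transformed = []
    for transaction in customer_sequence:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


def contains(candidate: Tuple[int, ...], transformed: List[Set[int]]) -> bool:
    """True if the candidate's ids appear in order in distinct elements of the sequence."""
    pos = 0
    for lit in candidate:
        while pos < len(transformed) and lit not in transformed[pos]:
            pos += 1
        if pos == len(transformed):
            return False
        pos += 1
    return True


# Hypothetical id assignment for the litemsets (30), (40), (70), (40 70), (90).
litemset_ids = {
    frozenset({30}): 1, frozenset({40}): 2, frozenset({70}): 3,
    frozenset({40, 70}): 4, frozenset({90}): 5,
}
customer = [{10, 20}, {30}, {40, 60, 70}]
transformed = transform(customer, litemset_ids)
```

With this input, the first transaction disappears and the last becomes the id set for (40), (70) and (40 70), so each containment check reduces to integer membership tests.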
Finding Sequential Patterns: 4. Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences
Finding Sequential Patterns: 4. Sequence Phase
• Use the set of litemsets to find the desired sequences.
• Two families of algorithms are presented:
    o Count-all
    o Count-some
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.
Finding Sequential Patterns: 4. Sequence Phase: AprioriAll
    L_1 = {large 1-sequences};   // Result of litemset phase
    for (k = 2; L_{k-1} ≠ ∅; k++) do
    begin
        C_k = New candidates generated from L_{k-1};
        foreach customer-sequence c in the database do
            Increment the count of all candidates in C_k that are contained in c.
        L_k = Candidates in C_k with minimum support.
    end
    Answer = Maximal sequences in ∪_k L_k
Notation: L_k = set of all large k-sequences; C_k = set of candidate k-sequences
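A compact Python sketch of the counting loop, under stated assumptions: sequences are tuples of integer litemset ids, the database is a list of transformed customer sequences (lists of sets of ids), and L_1 is recounted from the data instead of being taken from the litemset phase. The function names and the sample database are hypothetical, and the maximal phase is left out, so the result is every L_k with its support count.

```python
from typing import Dict, List, Set, Tuple

Seq = Tuple[int, ...]
Customer = List[Set[int]]


def contains(s: Seq, c: Customer) -> bool:
    """True if the ids of s occur in order in distinct elements of c."""
    pos = 0
    for lit in s:
        while pos < len(c) and lit not in c[pos]:
            pos += 1
        if pos == len(c):
            return False
        pos += 1
    return True


def generate(prev: Set[Seq]) -> Set[Seq]:
    """Join L_{k-1} with itself on the first k-2 ids, then prune."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_all(db: List[Customer], min_count: int) -> Dict[int, Dict[Seq, int]]:
    """Return {k: {large k-sequence: support count}} for every non-empty L_k."""
    ids = {lit for customer in db for element in customer for lit in element}
    large: Dict[int, Dict[Seq, int]] = {1: {}}
    for lit in ids:
        n = sum(contains((lit,), customer) for customer in db)
        if n >= min_count:
            large[1][(lit,)] = n
    k = 2
    while large[k - 1]:
        counts = {c: 0 for c in generate(set(large[k - 1]))}
        for customer in db:          # one pass over the database per length k
            for c in counts:
                if contains(c, customer):
                    counts[c] += 1
        large[k] = {c: n for c, n in counts.items() if n >= min_count}
        k += 1
    del large[k - 1]                 # the last level is always empty
    return large


# Hypothetical transformed database of five customer sequences.
db = [[{1, 5}, {2}, {3}, {4}],
      [{1}, {3}, {4}, {3, 5}],
      [{1}, {2}, {3}, {4}],
      [{1}, {3}, {5}],
      [{4}, {5}]]
large = apriori_all(db, min_count=2)
```

Every length is counted here, which is what distinguishes the count-all family from count-some: non-maximal sequences such as <1 2> are found and would only be removed afterwards in the maximal phase.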
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:
    insert into C_k
    select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};
• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.
Example: <1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
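The join and prune steps can be written as two small Python functions. Sequences are assumed to be tuples of integer litemset ids, and the names join, prune and apriori_generate are illustrative. Because the where-clause has no inequality, a sequence is also joined with itself. The L_3 below is an assumed input, chosen to be consistent with the pruning remark above, since the slide's own table did not survive.

```python
from typing import Set, Tuple

Seq = Tuple[int, ...]


def join(prev: Set[Seq]) -> Set[Seq]:
    """Pair sequences that agree on their first k-2 ids; append q's last id to p."""
    return {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}


def prune(cands: Set[Seq], prev: Set[Seq]) -> Set[Seq]:
    """Keep a candidate only if dropping any single id leaves a sequence in L_{k-1}."""
    return {c for c in cands
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_generate(prev: Set[Seq]) -> Set[Seq]:
    return prune(join(prev), prev)


# Assumed L_3 for illustration.
l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
c4 = apriori_generate(l3)
```

The join produces <1 2 4 3> among others, and the prune step removes it because <2 4 3> is not in L_3, leaving <1 2 3 4> as the only candidate.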
Finding Sequential Patterns2 Litemset Phase - Cont
bull Example with the minimum support 40
22
Finding Sequential Patterns3 Transformation Phase
bull As we shall see later we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence In order to make this test fast the customer sequences are transformed into an alternative representation
23
Finding Sequential Patterns3 Transformation Phase - Cont
bull Each transaction is replaced by the set of all Litemsets contained in the transaction
bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)
bull A customer sequence is now represented by a list of sets of Litemsets
24
Finding Sequential Patterns3 Transformation Phase - Cont
Note (10 20) dropped because of lack of support
(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)
25
Finding Sequential Patterns: 4. Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences
Finding Sequential Patterns: 4. Sequence Phase
• Use the set of litemsets to find the desired sequences.
• Two families of algorithms are presented:
  o Count-all
  o Count-some
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the skipped sequences in a backward phase.
Finding Sequential Patterns: 4. Sequence Phase: AprioriAll
L_1 = {large 1-sequences}; // result of litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
  C_k = New candidates generated from L_{k-1}
  foreach customer-sequence c in the database do
    Increment the count of all candidates in C_k that are contained in c
  L_k = Candidates in C_k with minimum support
end
Answer = Maximal sequences in ∪_k L_k
Notation: L_k is the set of all large k-sequences; C_k is the set of candidate k-sequences.
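The counting step inside the loop (increment every candidate contained in a customer sequence, then keep the candidates with minimum support) might look like this in Python. This is a minimal sketch, not the authors' implementation: the names is_contained and large_sequences are hypothetical, each litemset is assumed to have been mapped to an integer id by the transformation phase, and DATABASE is toy data chosen for illustration.

```python
def is_contained(candidate, customer_sequence):
    """True if the candidate's litemset ids occur, in order, in distinct
    elements of the transformed customer sequence (greedy earliest match)."""
    pos = 0
    for element in customer_sequence:
        if pos < len(candidate) and candidate[pos] in element:
            pos += 1
    return pos == len(candidate)


def large_sequences(candidates, database, min_count):
    """Count each candidate once per supporting customer; keep the large ones."""
    counts = {cand: 0 for cand in candidates}
    for c in database:
        for cand in candidates:
            if is_contained(cand, c):
                counts[cand] += 1
    return {cand: n for cand, n in counts.items() if n >= min_count}


# Toy transformed database (assumption): five customers, each a list of
# sets of litemset ids.
DATABASE = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

if __name__ == "__main__":
    print(large_sequences([(1, 2, 3, 4), (1, 3, 4, 5)], DATABASE, min_count=2))
```

A plain double loop is used here for clarity; it tests every candidate against every customer sequence.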
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:
  insert into C_k
  select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};
• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.
Finding Sequential Patterns: 4. AprioriAll Candidate Generation - Example
• <1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3.
• The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
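The join and prune steps of candidate generation can be sketched in Python as below. The function name apriori_generate mirrors the slide, but the code is an illustration only. It assumes sequences are tuples of integer litemset ids and, following the join condition literally, it also joins a sequence with itself. The set L3 is a hypothetical input chosen so that <1 2 4 3> is pruned for the reason given above.

```python
def apriori_generate(large_prev):
    """Generate candidate k-sequences from the large (k-1)-sequences."""
    prev = set(large_prev)
    # Join: pair up sequences that agree on their first k-2 litemsets.
    by_prefix = {}
    for seq in prev:
        by_prefix.setdefault(seq[:-1], []).append(seq[-1])
    joined = [prefix + (a, b)
              for prefix, lasts in by_prefix.items()
              for a in lasts
              for b in lasts]
    # Prune: every (k-1)-subsequence (drop one position) must be large.
    return sorted(c for c in joined
                  if all(c[:i] + c[i + 1:] in prev for i in range(len(c))))


# Hypothetical set of large 3-sequences (assumption for illustration).
L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]

if __name__ == "__main__":
    print(apriori_generate(L3))
```

Grouping by prefix avoids comparing every pair of sequences, which keeps the join cheap when L_{k-1} is large.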
Finding Sequential Patterns: 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n be the length of the longest sequence.
  for (k = n; k > 1; k--) do
    foreach k-sequence s_k do
      Delete from S all subsequences of s_k
• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
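A direct, unoptimized reading of the maximal phase in Python is shown below. It is a sketch only: it does not use the hash-tree structure mentioned above, the names are hypothetical, and sequences are assumed to be tuples of atomic litemset ids, so that "subsequence" means ordinary subsequence containment. The set S is toy data.

```python
def is_subsequence(short, long):
    """True if short can be obtained from long by deleting elements."""
    it = iter(long)
    return all(x in it for x in short)


def maximal_sequences(large):
    """Delete from S every proper subsequence of a longer sequence in S."""
    remaining = set(large)
    n = max((len(s) for s in remaining), default=0)
    for k in range(n, 1, -1):
        for s_k in [s for s in remaining if len(s) == k]:
            remaining -= {t for t in remaining
                          if len(t) < k and is_subsequence(t, s_k)}
    return remaining


# Toy set S of large sequences (assumption for illustration).
S = [
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
]

if __name__ == "__main__":
    print(sorted(maximal_sequences(S)))
```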
Finding Sequential Patterns: 4. AprioriAll Example
Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequence lengths to skip.
Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences).
function next(k: integer) // k is the length of the sequences counted in the last pass
begin
  if (hit_k < 0.666) return k + 1;
  elseif (hit_k < 0.75) return k + 2;
  elseif (hit_k < 0.80) return k + 3;
  elseif (hit_k < 0.85) return k + 4;
  else return k + 5;
end
• next returns the length of the sequences to count in the next pass.
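The heuristic translates almost line for line into Python. In this sketch the hit ratio is passed in explicitly rather than looked up from stored sets; the names next_length and hit_ratio are hypothetical.

```python
def hit_ratio(num_large, num_candidates):
    """hit_k = |L_k| / |C_k|."""
    return num_large / num_candidates


def next_length(k, hit_k):
    """Length of the sequences to count in the next pass."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


if __name__ == "__main__":
    # Half of the candidate 2-sequences turned out to be large: count length 3 next.
    print(next_length(2, hit_ratio(10, 20)))
```

The higher the share of candidates that turned out to be large, the more lengths are skipped before the next counting pass.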
Finding Sequential Patterns: 4. AprioriSome Forward Phase
L_1 = {large 1-sequences}; // result of litemset phase
C_1 = L_1;
last = 1; // we last counted C_last
for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
begin
  if (L_{k-1} known) then
    C_k = New candidates generated from L_{k-1}
  else
    C_k = New candidates generated from C_{k-1}
  if (k == next(last)) then begin // next k to count
    foreach customer-sequence c in the database do
      Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
    last = k
  end
end
Finding Sequential Patterns: 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
  if (L_k not found in forward phase) then begin
    Delete all sequences in C_k contained in some L_i, i > k
    foreach customer-sequence c in D_T do
      Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
  end
  else // L_k already known
    Delete all sequences in L_k contained in some L_i, i > k
Answer = ∪_k L_k (maximal phase not needed)
Notation: D_T is the transformed database.
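Putting the forward and backward phases together, a compact Python sketch of AprioriSome could look like this. It is an illustration under assumptions, not the authors' implementation: all function names are hypothetical, litemsets are treated as atomic integer ids (so containment between patterns is ordinary subsequence containment), support counting is a plain double loop, and DATABASE is toy data.

```python
def contains(customer_sequence, pattern):
    """True if the pattern's ids occur in order in distinct elements."""
    pos = 0
    for element in customer_sequence:
        if pos < len(pattern) and pattern[pos] in element:
            pos += 1
    return pos == len(pattern)


def is_subsequence(short, long):
    it = iter(long)
    return all(x in it for x in short)


def apriori_generate(prev):
    """Join on the first k-2 ids, then prune by (k-1)-subsequences."""
    prev = set(prev)
    by_prefix = {}
    for seq in prev:
        by_prefix.setdefault(seq[:-1], []).append(seq[-1])
    joined = [p + (a, b) for p, lasts in by_prefix.items()
              for a in lasts for b in lasts]
    return sorted(c for c in joined
                  if all(c[:i] + c[i + 1:] in prev for i in range(len(c))))


def next_length(k, hit):
    for bound, jump in ((0.666, 1), (0.75, 2), (0.80, 3), (0.85, 4)):
        if hit < bound:
            return k + jump
    return k + 5


def large(candidates, database, min_count):
    return [s for s in candidates
            if sum(contains(c, s) for c in database) >= min_count]


def apriori_some(database, min_count):
    items = sorted({x for c in database for element in c for x in element})
    L = {1: large([(x,) for x in items], database, min_count)}
    C = {1: list(L[1])}
    last, k = 1, 2
    # Forward phase: generate every C_k, but count only the selected lengths.
    while C[k - 1] and L[last]:
        C[k] = apriori_generate(L[k - 1] if (k - 1) in L else C[k - 1])
        if k == next_length(last, len(L[last]) / len(C[last])):
            L[k] = large(C[k], database, min_count)
            last = k
        k += 1
    # Backward phase: handle skipped lengths, dropping non-maximal sequences.
    for j in range(k - 1, 0, -1):
        longer = [t for i in L if i > j for t in L[i]]
        pool = L[j] if j in L else C[j]
        kept = [s for s in pool
                if not any(is_subsequence(s, t) for t in longer)]
        L[j] = kept if j in L else large(kept, database, min_count)
    return sorted(s for seqs in L.values() for s in seqs)


# Toy transformed database (assumption): lists of sets of litemset ids.
DATABASE = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

if __name__ == "__main__":
    print(apriori_some(DATABASE, min_count=2))
```

On this small input every 1-sequence is large, so the hit ratio is 1 and the forward phase skips several lengths; almost all of the counting then happens in the backward phase.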
Finding Sequential Patterns: 4. AprioriSome Example
Minimum support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates C_k from L_{k-1} or C_{k-1}.
• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.
• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L_6 and s3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
  o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.
• The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns: 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L_1 = {large 1-sequences}; // result of litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
  C_k = New candidates generated from L_{k-1}
  foreach customer-sequence c in D_T do
    Increment the count of all candidates in C_k that are contained in c
  L_k = Candidates in C_k with minimum support
end
Finding Sequential Patterns: 4. DynamicSome Forward Phase
for (k = step; L_k ≠ ∅; k += step) do
begin
  // find L_{k+step} from L_k and L_step
  C_{k+step} = ∅
  foreach customer-sequence c in D_T do
  begin
    X = otf-generate(L_k, L_step, c)
    foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary)
  end
  L_{k+step} = Candidates in C_{k+step} with minimum support
end
Finding Sequential Patterns: 4. DynamicSome OTF-Generate
// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c)
forall sequences x ∈ X_k do
  x.end = min{ j | x ⊆ <c_1 c_2 ... c_j> }
X_j = subseq(L_j, c)
forall sequences x ∈ X_j do
  x.start = max{ j | x ⊆ <c_j c_{j+1} ... c_n> }
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start
Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont
Let otf-generate be called with L_2, and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc. The end and start values of the sequences in X_2 are:
Seq     End  Start
<1 2>   2    1
<1 3>   3    1
<1 4>   4    1
<2 3>   3    2
<2 4>   4    2
<3 4>   4    3
Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
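The end/start bookkeeping of otf-generate can be reproduced with a short Python sketch. The helper names first_end and last_start are hypothetical, litemsets are treated as atomic integer ids, and positions are 1-based to match the table. The inputs are the sequences from the table and the customer sequence c, read as four transactions.

```python
def first_end(seq, c):
    """Smallest j such that seq is contained in <c_1 ... c_j>, else None."""
    pos = 0
    for j, element in enumerate(c, start=1):
        if seq[pos] in element:
            pos += 1
            if pos == len(seq):
                return j
    return None


def last_start(seq, c):
    """Largest j such that seq is contained in <c_j ... c_n>, else None."""
    pos = len(seq) - 1
    for j in range(len(c), 0, -1):
        if seq[pos] in c[j - 1]:
            pos -= 1
            if pos < 0:
                return j
    return None


def otf_generate(L_k, L_j, c):
    """Concatenate x in L_k and y in L_j when x ends before y starts in c."""
    ends = [(x, first_end(x, c)) for x in L_k]
    starts = [(y, last_start(y, c)) for y in L_j]
    return [x + y
            for x, end in ends if end is not None
            for y, start in starts if start is not None and end < start]


L2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
C_SEQ = [{1}, {2}, {3, 7}, {4}]

if __name__ == "__main__":
    for s in L2:
        print(s, first_end(s, C_SEQ), last_start(s, C_SEQ))
    print(otf_generate(L2, L2, C_SEQ))
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, which gives the single candidate of length 4.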
Finding Sequential Patterns: 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
  if (L_k not yet determined) then
    if (L_{k-1} known) then
      C_k = New candidates generated from L_{k-1}
    else
      C_k = New candidates generated from C_{k-1}
Finding Sequential Patterns: 4. DynamicSome Example
• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.
• We get 2 candidate sequences:
  C_4        Support
  <1 2 3 4>  2
  <1 3 4 5>  1
• Only <1 2 3 4> is large:
  L_4        Support
  <1 2 3 4>  2
• Pass L_2 and L_4 as arguments to otf-generate to get C_6.
• C_6 is found to be empty (C_6 = ∅).
• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate.
• C_5 is found to be empty (C_5 = ∅), so only C_3 is counted during the backward phase to get L_3.
Performance: Synthetic Data
Performance: Execution Times
Performance: Scale-Up (Customers)
Performance: Scale-Up
Conclusion
• The problem of mining sequential patterns from a customer database was introduced.
• Two types of algorithms were introduced to find sequential patterns:
  o CountAll: AprioriAll
  o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum support.
• AprioriAll and AprioriSome have excellent scale-up properties.
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are the large itemsets (litemsets) on which association rules are built.
Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for every sequence on every pass, but maximality must be checked afterwards.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out at run time, but minimum support is not tested at every value of k.)
Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction by the set of litemsets it contains, so each record can be looked up in constant time, which reduces the run time.
Finding Sequential Patterns3 Transformation Phase - Cont
bull Each transaction is replaced by the set of all Litemsets contained in the transaction
bull Transactions with no Litemsets are dropped (But empty customer sequences still contribute to the total customer count)
bull A customer sequence is now represented by a list of sets of Litemsets
24
Finding Sequential Patterns3 Transformation Phase - Cont
Note (10 20) dropped because of lack of support
(40 60 70) replaced with set of litemsets (40)(70)(40 70) (60 does not have minisup)
25
Finding Sequential Patterns4 Sequence Phase Overview
Seed set of large sequences
Create candidate sequences
Scan data to find support of candidate sequences
Determine large sequences 26
Finding Sequential Patterns4 Sequence Phase
bull Use the set of Litemsets to find the desired
sequences
bull Two families of algorithms are presented
Count-all
Count-some
27
Finding Sequential Patterns4 Sequence Phase
bull Count-all algorithms count all the large
sequences including non-maximal
sequences which are pruned out in the
maximal phase
28
Finding Sequential Patterns4 Sequence Phase
bull Count-some algorithms try to avoid
counting non-maximal sequences by first
counting longer sequences in a forward
phase then counting the sequences skipped
in a backward phase
29
Finding Sequential Patterns4 Sequence Phase AprioriAll
L1 = large 1-sequences result of Litemset phase
for (k = 2 Lk-1 ne k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in cupk Lk
Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences
30
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns4 AprioriSome
bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip
Let hitk = |Lk| |Ck|
(ie ratio of large k-sequences to candidate k-sequences)
function next(k integer) k is the length of seq counted last pass
beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2
elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5
end
bull next returns the length of sequences to count in the next pass 35
Finding Sequential Patterns4 AprioriSome Forward Phase
L1 = large 1-sequences Result of Litemset phase
C1 = L1
last = 1 We last counted Clast
for (k = 2 Ck-1 ne and Llast ne k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k== next(last) ) then begin (next k to count)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
last = k
end
end36
Finding Sequential Patterns4 AprioriSome Backward Phase
for (k-- kgt=1 k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in some L i igtk
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else Lk already known
Delete all sequences in Lk contained in some Li igtk
Answer = Uk Lk (Maximal Phase not Needed)
Notation DT Transformed database 37
Finding Sequential Patterns4 AprioriSome Example
38
Finding Sequential Patterns4 AprioriSome Example - Cont
Minimum Support = 40 (2 customer sequences)
39
Finding Sequential Patterns4 AprioriSome Example
Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40
Finding Sequential Patterns4 DynamicSome
bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase
bull AprioriSome generates Ck from Lk-1 or Ck-1
bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database
41
Finding Sequential Patterns4 DynamicSome
bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3
bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database
o While generating sequences of length 9 with a step
size 3 While passing the data if sequences s6 isin L6
and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence
42
Finding Sequential Patterns4 DynamicSome
bull In the intermediate phase generate the candidate sequences for the skipped lengths
o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5
bull The backward phase is identical to AprioriSome 43
Finding Sequential Patterns4 DynamicSome Initialization Phase
step is an integer ge 1
L1 = large 1-sequences Result of litemset phase
for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do
begin
Ck = New candidates generated from Lk1048576-1
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
44
Finding Sequential Patterns4 DynamicSome Forward Phase
for ( k = step Lk1048576 ne k+= step ) do
begin
find Lk+step from Lk and Lstep
Ck+step =
foreach customer-sequence c in DT do
begin
X = otf-generate(Lk Lstep c)
foreach sequence x isin Xrsquo increment its count in
Ck+step (adding it to Ck+step if necessary)
end
Lk+step = Candidates in Ck+step with minimum support
end45
Finding Sequential Patterns4 DynamicSome OTF-Generate
c is the customer sequence lt c1c2cn gt
Xk = subseq(Lk c)
forall sequences x isin Xk do
xend = min j | x sube lt c1c2cj gt
Xj = subseq(Lj c)
forall sequences x isin Xj do
xstart = max j | x sube lt cjcj+1cn gt
Answer = join of Xkwith Xj with the join condition
Xkend lt Xjstart46
Finding Sequential Patterns4 DynamicSome OTF-Generate cont
Let otf-generate be called with
L2 and let c = lt1 2 3 7
4gt Thus c1 = 1 c2= 2
etc
Thus the result of the join with the join condition
X2end lt X2start
(where X2 denotes the seq of
length 2) is the single sequence lt1 2 3 4gt
Seq
lt1 2gt
End
2
Start
1
lt1 3gt 3 1
lt1 4gt 4 1
lt2 3gt 3 2
lt2 4gt 4 2
lt3 4gt 4 3
47
Finding Sequential Patterns4 DynamicSome intermediate Phase
for ( k-- k gt 1 k-- ) do
if (Lk not yet determined) then
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
48
Finding Sequential Patterns4 DynamicSome Example
bull Let step = 2
use L2 and L2 as argument in otf-generate to get C4
49
Finding Sequential Patterns4 DynamicSome Example
bull Get 2 candidate sequences
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
50
Finding Sequential Patterns4 DynamicSome Example
bull Only lt1 2 3 4gt is large
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
L4 Minisup
lt1 2 3 4gt 2
51
Finding Sequential Patterns4 DynamicSome Example
bull pass as arg to otf-gen L2 and L4 to get C6
L4 sup
lt1 2 3 4gt 2
52
Finding Sequential Patterns4 DynamicSome Example
bull C6 is found to be empty
L4 sup
lt1 2 3 4gt 2C6 =
53
Finding Sequential Patterns4 DynamicSome Example
bull In the intermediate phase C3 is generated
C3
from L2 and C5 from L4
using apriori-generate
L4 sup
lt1 2 3 4gt 2C5
54
Finding Sequential Patterns4 DynamicSome Example
bull C5 is found to be empty so only C3 is counted during the backward phase to get L3
L3 C3
C5 =
55
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions56
Performance Synthetic Data
57
Performance Execution Times
58
Performance Scale-Up Customers
59
Performance Scale-Up
60
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions61
Conclusion
bull The problem of mining sequential patterns from a customer DB was introduced
bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome
bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup
bull AprioriAll and AprioriSome have excellent scale-up properties 62
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions63
Final Exam Question 1 Compare and contrast association rules and sequential
patterns How do they relate to each other in the context of the Apriori algorithms
64
Final Exam Question 1 Compare and contrast association rules and sequential
patterns How do they relate to each other in the context of the Apriori algorithms
Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules
65
Final Exam Question 2
What is the major difference between the two families of algorithms, CountSome and CountAll?
Answer: CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality: minimum support is checked for every candidate length on every pass, but non-maximal sequences must be pruned afterwards in a separate maximal phase.
CountSome (AprioriSome, DynamicSome) is careful with respect to maximality but careless with respect to minimum support: sequences contained in longer large sequences are pruned during the run, but support is not counted at every length k in the forward phase.
Final Exam Question 3
Why is the Transformation stage of these pattern mining algorithms so important to their speed?
Answer: The transformation replaces each transaction with the set of litemsets it contains, and each litemset is treated as a single entity. Two litemsets can then be compared for equality in constant time, which reduces the time needed to check whether a candidate sequence is contained in a customer sequence.
Finding Sequential Patterns: 3. Transformation Phase - Cont
Note: (10 20) is dropped because of lack of support
(40 60 70) is replaced with the set of litemsets {(40), (70), (40 70)} (60 does not have minimum support)
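The replacement described in the note above can be sketched in a few lines of Python. This is only an illustration under assumptions: the litemsets (30) and (90) and the integer ids 1 to 5 are hypothetical values chosen for the example, and `transform_sequence` is a name introduced here, not one used by the authors.

```python
def transform_sequence(customer_sequence, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains.

    Transactions that contain no litemset are dropped, as in the note above.
    """
    transformed = []
    for transaction in customer_sequence:
        ids = {lid for litemset, lid in litemset_ids.items()
               if litemset <= transaction}  # subset test
        if ids:
            transformed.append(ids)
    return transformed


# Hypothetical id assignment for the large itemsets (assumption).
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

# One customer sequence: (10 20), then (30), then (40 60 70).
customer = [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})]
transformed = transform_sequence(customer, litemset_ids)
print(transformed)
```

Because every litemset becomes one integer, later phases can compare litemsets with a plain equality or membership test.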
Finding Sequential Patterns: 4. Sequence Phase Overview
• Start from a seed set of large sequences
• Create candidate sequences
• Scan the data to find the support of the candidate sequences
• Determine the large sequences
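The "scan the data" step can be sketched as follows. The sketch assumes the transformed representation (each transaction is a set of litemset ids) and uses a small database chosen to be consistent with the supports quoted in the examples on these slides; the names `is_contained`, `support` and `DB` are introduced here for illustration only.

```python
def is_contained(seq, customer_sequence):
    """True if seq (a tuple of litemset ids) occurs, in order, in the customer sequence."""
    pos = 0
    for litemset_id in seq:
        # Greedily take the earliest transaction that contains this litemset.
        while pos < len(customer_sequence) and litemset_id not in customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1  # the next element must come from a later transaction
    return True


def support(seq, database):
    """Number of customer sequences that contain seq."""
    return sum(1 for c in database if is_contained(seq, c))


# Assumed transformed database with five customers.
DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

print(support((1, 2, 3, 4), DB), support((1, 3, 4, 5), DB))
```

With a minimum support of 2 customers, the first sequence is large and the second is not.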
Finding Sequential Patterns: 4. Sequence Phase
• Use the set of litemsets to find the desired sequences
• Two families of algorithms are presented: count-all and count-some
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase
Finding Sequential Patterns: 4. Sequence Phase (AprioriAll)
L_1 = {large 1-sequences}; // result of litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
  C_k = New candidates generated from L_{k-1}
  foreach customer-sequence c in the database do
    Increment the count of all candidates in C_k that are contained in c
  L_k = Candidates in C_k with minimum support
end
Answer = Maximal sequences in ∪_k L_k
Notation: L_k = set of all large k-sequences; C_k = set of candidate k-sequences
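The loop above can be written as a short runnable sketch. It is not the authors' implementation: it scans lists instead of using hash-trees, it treats each litemset id as atomic when testing maximality, it lets a sequence join with itself, and the database `DB` is an assumed example consistent with the supports quoted on these slides.

```python
def is_contained(seq, customer_sequence):
    """True if seq (tuple of litemset ids) occurs, in order, in the customer sequence."""
    pos = 0
    for litemset_id in seq:
        while pos < len(customer_sequence) and litemset_id not in customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1
    return True


def support(seq, database):
    return sum(1 for c in database if is_contained(seq, c))


def apriori_generate(large_prev):
    """Join L_{k-1} with itself, then prune candidates with a non-large (k-1)-subsequence."""
    joined = {p + (q[-1],) for p in large_prev for q in large_prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))}


def is_subsequence(a, b):
    """True if tuple a can be obtained from tuple b by deleting elements."""
    it = iter(b)
    return all(x in it for x in a)


def apriori_all(database, min_count):
    items = {i for c in database for t in c for i in t}
    large = {(i,) for i in items if support((i,), database) >= min_count}  # L_1
    all_large = set(large)
    while large:  # loop while L_{k-1} is non-empty
        candidates = apriori_generate(large)  # C_k
        large = {c for c in candidates if support(c, database) >= min_count}  # L_k
        all_large |= large
    # Maximal phase: keep sequences not contained in another large sequence.
    return {s for s in all_large
            if not any(s != t and is_subsequence(s, t) for t in all_large)}


# Assumed transformed database; minimum support of 2 customers (40%).
DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
result = apriori_all(DB, min_count=2)
print(sorted(result))
```

On this data the sketch returns the three maximal sequences <1 2 3 4>, <1 3 5> and <4 5>.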
Finding Sequential Patterns: 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:
insert into C_k
select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
from L_{k-1} p, L_{k-1} q
where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};
• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}
Finding Sequential Patterns: 4. AprioriAll Candidate Generation - Example
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
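The join and prune steps can be tried on a small input. The set `L3` below is an assumed example chosen so that <1 2 4 3> appears after the join and is then pruned, as in the sentence above; `join` and `prune` are names introduced for this sketch. One assumption to note: the join here also pairs a sequence with itself, and the extra candidates that produces are removed again by the prune step on this input.

```python
def join(large_prev):
    """Join step: extend p with the last element of every q that shares p's first k-2 elements."""
    return {p + (q[-1],) for p in large_prev for q in large_prev if p[:-1] == q[:-1]}


def prune(candidates, large_prev):
    """Prune step: drop c if deleting any single element gives a sequence not in L_{k-1}."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))}


# Assumed set of large 3-sequences.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}

after_join = join(L3)
C4 = prune(after_join, L3)
print((1, 2, 4, 3) in after_join)  # produced by the join
print(sorted(C4))                  # what survives the prune
```

Only <1 2 3 4> survives, because every other joined sequence has some 3-subsequence outside L3.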
Finding Sequential Patterns: 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.
for (k = n; k > 1; k--) do
  foreach k-sequence s_k do
    Delete from S all subsequences of s_k
• The authors state that data structures (hash-trees) and an algorithm exist to do this efficiently, citing two of their earlier papers
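A direct, unoptimized rendering of this loop is sketched below. It enumerates subsequences with `itertools.combinations` rather than the hash-tree mentioned above, and it treats each litemset id as atomic; the input set `S` is an assumed example.

```python
from itertools import combinations


def maximal_sequences(large):
    """Delete from S every proper subsequence of each remaining sequence, longest first."""
    remaining = set(large)
    longest = max((len(s) for s in remaining), default=0)
    for k in range(longest, 1, -1):
        # Snapshot the k-sequences still present before deleting shorter ones.
        for s in [s for s in remaining if len(s) == k]:
            for m in range(1, k):
                for sub in combinations(s, m):  # order-preserving subsequences
                    remaining.discard(sub)
    return remaining


# Assumed set S of all large sequences (lengths 1 to 4).
S = {
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
}
print(sorted(maximal_sequences(S)))
```

Enumerating all subsequences is exponential in the sequence length, so this form is only reasonable for short sequences.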
Finding Sequential Patterns: 4. AprioriAll Example [figure not recovered]
Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequences to skip
Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences)
function next(k: integer) // k is the length of the sequences counted in the last pass
begin
  if (hit_k < 0.666) return k + 1;
  elsif (hit_k < 0.75) return k + 2;
  elsif (hit_k < 0.80) return k + 3;
  elsif (hit_k < 0.85) return k + 4;
  else return k + 5;
end
• next returns the length of sequences to count in the next pass
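The same function in Python, as a small sketch. The thresholds are the ones on the slide; the names `next_length` and `hit_ratio` are introduced here, and the hit ratio is passed in explicitly instead of being looked up from L_k and C_k.

```python
def hit_ratio(large_k, candidates_k):
    """hit_k = |L_k| / |C_k|."""
    return len(large_k) / len(candidates_k)


def next_length(k, hit):
    """Length of the sequences to count in the next pass, given hit_k for the last pass."""
    if hit < 0.666:
        return k + 1
    elif hit < 0.75:
        return k + 2
    elif hit < 0.80:
        return k + 3
    elif hit < 0.85:
        return k + 4
    else:
        return k + 5


print(next_length(2, 0.5), next_length(2, 0.9))
```

The intuition is that a high hit ratio means most candidates turned out to be large, so longer sequences are likely to exist and more lengths can be skipped.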
Finding Sequential Patterns: 4. AprioriSome Forward Phase
L_1 = {large 1-sequences}; // result of litemset phase
C_1 = L_1;
last = 1; // we last counted C_last
for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
begin
  if (L_{k-1} known) then
    C_k = New candidates generated from L_{k-1};
  else
    C_k = New candidates generated from C_{k-1};
  if (k == next(last)) then begin // next k to count
    foreach customer-sequence c in the database do
      Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
    last = k;
  end
end
Finding Sequential Patterns: 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
  if (L_k not found in forward phase) then begin
    Delete all sequences in C_k contained in some L_i, i > k;
    foreach customer-sequence c in D_T do
      Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
  end
  else // L_k already known
    Delete all sequences in L_k contained in some L_i, i > k;
Answer = ∪_k L_k // maximal phase not needed
Notation: D_T = transformed database
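The forward and backward phases can be combined into one runnable sketch. Several things here are assumptions made for illustration: the skipping schedule is passed in as a function and the demo uses a simple doubling rule (chosen only to keep the trace short) instead of the hit-ratio thresholds, litemset ids are treated as atomic, no hash-tree is used, and `DB` is an assumed database consistent with the supports quoted on these slides.

```python
def is_contained(seq, c):
    pos = 0
    for x in seq:
        while pos < len(c) and x not in c[pos]:
            pos += 1
        if pos == len(c):
            return False
        pos += 1
    return True


def support(seq, database):
    return sum(1 for c in database if is_contained(seq, c))


def is_subsequence(a, b):
    it = iter(b)
    return all(x in it for x in a)


def apriori_generate(prev):
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_some(database, min_count, next_length):
    items = {i for c in database for t in c for i in t}
    large = {1: {(i,) for i in items if support((i,), database) >= min_count}}
    cands = {1: set(large[1])}
    last, k = 1, 2
    # Forward phase: generate every C_k, but count only the lengths chosen by next_length.
    while cands[k - 1] and large[last]:
        source = large[k - 1] if (k - 1) in large else cands[k - 1]
        cands[k] = apriori_generate(source)
        hit = len(large[last]) / len(cands[last])
        if k == next_length(last, hit):
            large[k] = {c for c in cands[k] if support(c, database) >= min_count}
            last = k
        k += 1
    # Backward phase: count the skipped lengths, dropping non-maximal sequences first.
    for j in range(k - 1, 0, -1):
        longer = [s for i, seqs in large.items() if i > j for s in seqs]
        if j not in large:
            keep = {c for c in cands[j] if not any(is_subsequence(c, s) for s in longer)}
            large[j] = {c for c in keep if support(c, database) >= min_count}
        else:
            large[j] = {s for s in large[j] if not any(is_subsequence(s, t) for t in longer)}
    return set().union(*large.values())


DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
# Assumed doubling schedule: after counting length k, count length 2k next.
answer = apriori_some(DB, 2, lambda k, hit: 2 * k)
print(sorted(answer))
```

With the doubling schedule, lengths 2 and 4 are counted in the forward phase, and lengths 5 and 3 are handled in the backward phase after non-maximal candidates have been removed.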
Finding Sequential Patterns: 4. AprioriSome Example [figures not recovered]
Minimum support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase
• AprioriSome generates C_k from L_{k-1} or C_{k-1}
• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3
• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database
  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence
• In the intermediate phase, generate the candidate sequences for the skipped lengths
  o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5
• The backward phase is identical to that of AprioriSome
Finding Sequential Patterns: 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L_1 = {large 1-sequences}; // result of litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
  C_k = New candidates generated from L_{k-1}
  foreach customer-sequence c in D_T do
    Increment the count of all candidates in C_k that are contained in c
  L_k = Candidates in C_k with minimum support
end
Finding Sequential Patterns: 4. DynamicSome Forward Phase
for (k = step; L_k ≠ ∅; k += step) do
begin
  // find L_{k+step} from L_k and L_step
  C_{k+step} = ∅;
  foreach customer-sequence c in D_T do
  begin
    X = otf-generate(L_k, L_step, c);
    foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary)
  end
  L_{k+step} = Candidates in C_{k+step} with minimum support
end
Finding Sequential Patterns: 4. DynamicSome OTF-Generate
// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c);
forall sequences x ∈ X_k do
  x.end = min{ j | x is contained in <c_1 c_2 ... c_j> };
X_j = subseq(L_j, c);
forall sequences x ∈ X_j do
  x.start = max{ j | x is contained in <c_j c_{j+1} ... c_n> };
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start
Finding Sequential Patterns: 4. DynamicSome OTF-Generate - Cont
Let otf-generate be called with L2 (for both L_k and L_j), and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc.
Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3
Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>
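The end and start values in the table, and the resulting join, can be reproduced with a short sketch. The helper names `earliest_end`, `latest_start` and `otf_generate` are introduced here for illustration, the contents of `L2` are an assumed set of large 2-sequences that includes the six rows of the table, and the customer sequence is written as a list of sets of litemset ids.

```python
def earliest_end(seq, c):
    """Smallest j (1-based) such that seq is contained in <c_1 ... c_j>, or None."""
    pos = 0
    for x in seq:
        while pos < len(c) and x not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def latest_start(seq, c):
    """Largest j (1-based) such that seq is contained in <c_j ... c_n>, or None."""
    pos = len(c) - 1
    for x in reversed(seq):
        while pos >= 0 and x not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2


def otf_generate(large_k, large_j, c):
    """Concatenate x from L_k and y from L_j when x ends before y starts in c."""
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_j}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


# Assumed L2; the customer sequence is <(1) (2) (3 7) (4)>.
L2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
c = [{1}, {2}, {3, 7}, {4}]

print(earliest_end((1, 2), c), latest_start((3, 4), c))
print(otf_generate(L2, L2, c))
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, which gives the single candidate <1 2 3 4>.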
Finding Sequential Patterns: 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
  if (L_k not yet determined) then
    if (L_{k-1} known) then
      C_k = New candidates generated from L_{k-1}
    else
      C_k = New candidates generated from C_{k-1}
Finding Sequential Patterns: 4. DynamicSome Example
• Let step = 2; use L2 and L2 as arguments to otf-generate to get C4
• This yields 2 candidate sequences
C4           Support
<1 2 3 4>    2
<1 3 4 5>    1
• Only <1 2 3 4> is large
L4           Support
<1 2 3 4>    2
• Pass L4 and L2 as arguments to otf-generate to get C6
• C6 is found to be empty: C6 = ∅
• In the intermediate phase, C3 is generated from L2 and C5 from L4, using apriori-generate
Finding Sequential Patterns4 Sequence Phase
bull Use the set of Litemsets to find the desired
sequences
bull Two families of algorithms are presented
Count-all
Count-some
27
Finding Sequential Patterns4 Sequence Phase
bull Count-all algorithms count all the large
sequences including non-maximal
sequences which are pruned out in the
maximal phase
28
Finding Sequential Patterns4 Sequence Phase
bull Count-some algorithms try to avoid
counting non-maximal sequences by first
counting longer sequences in a forward
phase then counting the sequences skipped
in a backward phase
29
Finding Sequential Patterns4 Sequence Phase AprioriAll
L1 = large 1-sequences result of Litemset phase
for (k = 2 Lk-1 ne k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
Answer = Maximal Sequences in cupk Lk
Notation Lk Set of all large k-sequences Ck Set of candidate k-sequences
30
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns4 AprioriSome
bull AprioriSome uses the function ldquonextrdquo to determine which sequences to skip
Let hitk = |Lk| |Ck|
(ie ratio of large k-sequences to candidate k-sequences)
function next(k integer) k is the length of seq counted last pass
beginif (hitk lt 0666) return k + 1elseif (hitk lt 075) return k + 2
elseif (hitk lt 080) return k + 3elseif (hitk lt 085) return k + 4else return k + 5
end
bull next returns the length of sequences to count in the next pass 35
Finding Sequential Patterns4 AprioriSome Forward Phase
L1 = large 1-sequences Result of Litemset phase
C1 = L1
last = 1 We last counted Clast
for (k = 2 Ck-1 ne and Llast ne k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k== next(last) ) then begin (next k to count)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
last = k
end
end36
Finding Sequential Patterns4 AprioriSome Backward Phase
for (k-- kgt=1 k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in some L i igtk
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else Lk already known
Delete all sequences in Lk contained in some Li igtk
Answer = Uk Lk (Maximal Phase not Needed)
Notation DT Transformed database 37
Finding Sequential Patterns4 AprioriSome Example
38
Finding Sequential Patterns4 AprioriSome Example - Cont
Minimum Support = 40 (2 customer sequences)
39
Finding Sequential Patterns4 AprioriSome Example
Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40
Finding Sequential Patterns4 DynamicSome
bull Like AprioriSome it skips counting candidate sequences of certain lengths in the forward phase
bull AprioriSome generates Ck from Lk-1 or Ck-1
bull DynamicSome generates Ck ldquoon the flyrdquo based on large sequences found from the previous passes and the customer sequences read from the database
41
Finding Sequential Patterns4 DynamicSome
bull In the initialization phase count only sequences up to and including step variable length If step is 3 count sequences of length 1 2 and 3
bull In the forward phase we generate sequences of length 2 times step 3 times step 4 times step etc on-the-fly based on previous passes and customer sequences in the database
o While generating sequences of length 9 with a step
size 3 While passing the data if sequences s6 isin L6
and s3 isin L3 are both contained in the customer sequence c in hand and they do not overlap in c then lt s6s3 gt is a candidate (6+3)-sequence
42
Finding Sequential Patterns4 DynamicSome
bull In the intermediate phase generate the candidate sequences for the skipped lengths
o If we have counted L6 and L3 and L9 turns out to be empty we generate C7 and C8 count C8 followed by C7 after deleting non-maximal sequences and repeat the process for C4 and C5
bull The backward phase is identical to AprioriSome 43
Finding Sequential Patterns4 DynamicSome Initialization Phase
step is an integer ge 1
L1 = large 1-sequences Result of litemset phase
for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do
begin
Ck = New candidates generated from Lk1048576-1
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
44
Finding Sequential Patterns4 DynamicSome Forward Phase
for ( k = step Lk1048576 ne k+= step ) do
begin
find Lk+step from Lk and Lstep
Ck+step =
foreach customer-sequence c in DT do
begin
X = otf-generate(Lk Lstep c)
foreach sequence x isin Xrsquo increment its count in
Ck+step (adding it to Ck+step if necessary)
end
Lk+step = Candidates in Ck+step with minimum support
end45
Finding Sequential Patterns4 DynamicSome OTF-Generate
c is the customer sequence lt c1c2cn gt
Xk = subseq(Lk c)
forall sequences x isin Xk do
xend = min j | x sube lt c1c2cj gt
Xj = subseq(Lj c)
forall sequences x isin Xj do
xstart = max j | x sube lt cjcj+1cn gt
Answer = join of Xkwith Xj with the join condition
Xkend lt Xjstart46
Finding Sequential Patterns4 DynamicSome OTF-Generate cont
Let otf-generate be called with
L2 and let c = lt1 2 3 7
4gt Thus c1 = 1 c2= 2
etc
Thus the result of the join with the join condition
X2end lt X2start
(where X2 denotes the seq of
length 2) is the single sequence lt1 2 3 4gt
Seq
lt1 2gt
End
2
Start
1
lt1 3gt 3 1
lt1 4gt 4 1
lt2 3gt 3 2
lt2 4gt 4 2
lt3 4gt 4 3
47
Finding Sequential Patterns4 DynamicSome intermediate Phase
for ( k-- k gt 1 k-- ) do
if (Lk not yet determined) then
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
48
Finding Sequential Patterns4 DynamicSome Example
bull Let step = 2
use L2 and L2 as argument in otf-generate to get C4
49
Finding Sequential Patterns4 DynamicSome Example
bull Get 2 candidate sequences
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
50
Finding Sequential Patterns4 DynamicSome Example
bull Only lt1 2 3 4gt is large
C4 Minisup
lt1 2 3 4gt 2
lt1 3 4 5gt 1
L4 Minisup
lt1 2 3 4gt 2
51
Finding Sequential Patterns4 DynamicSome Example
bull pass as arg to otf-gen L2 and L4 to get C6
L4 sup
lt1 2 3 4gt 2
52
Finding Sequential Patterns4 DynamicSome Example
bull C6 is found to be empty
L4 sup
lt1 2 3 4gt 2C6 =
53
Finding Sequential Patterns4 DynamicSome Example
bull In the intermediate phase C3 is generated
C3
from L2 and C5 from L4
using apriori-generate
L4 sup
lt1 2 3 4gt 2C5
54
Finding Sequential Patterns4 DynamicSome Example
bull C5 is found to be empty so only C3 is counted during the backward phase to get L3
L3 C3
C5 =
55
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions56
Performance Synthetic Data
57
Performance Execution Times
58
Performance Scale-Up Customers
59
Performance Scale-Up
60
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions61
Conclusion
bull The problem of mining sequential patterns from a customer DB was introduced
bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome
bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup
bull AprioriAll and AprioriSome have excellent scale-up properties 62
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions63
Final Exam Question 1 Compare and contrast association rules and sequential
patterns How do they relate to each other in the context of the Apriori algorithms
64
Final Exam Question 1 Compare and contrast association rules and sequential
patterns How do they relate to each other in the context of the Apriori algorithms
Association rules refer to intra-transaction patterns while sequential patterns refer to inter-transaction patterns Both of these are used in the Apriori algorithms studied here because the algorithms are looking for different sequential patterns made up of association rules
65
Final Exam Question 2 What is the major difference between the two algorithms
CountSome and CountAll
66
Final Exam Question 2 What is the major difference between the two algorithms
CountSome and CountAll
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality (The minimum support is checked for each sequence on each run but maximal sequences must be checked for later)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support (Non-maximal sequences are pruned out during runtime but the minimum support is not tested at all values of k) 67
Final Exam Question 3
Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation maps each large itemset to a single integer, so two large itemsets can be compared for equality in constant time. This reduces the time needed to check whether a candidate sequence is contained in a customer sequence, and therefore the overall run time.
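As a rough illustration of the transformation idea, the sketch below replaces each transaction by the set of ids of the large itemsets it contains. The function name `transform`, the dictionary `litemset_ids`, and the example item numbers are assumptions made for this sketch; they are not taken from the slides.

```python
def transform(customer_seq, litemset_ids):
    """Rewrite a customer sequence (list of item sets) as a list of litemset-id sets.

    litemset_ids maps each large itemset (a frozenset of items) to an integer id.
    Transactions that contain no large itemset are dropped.
    """
    transformed = []
    for transaction in customer_seq:
        ids = {i for items, i in litemset_ids.items() if items <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


# Hypothetical mapping of large itemsets to integer ids.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
customer = [{10, 20}, {30}, {40, 60, 70}]
print(transform(customer, litemset_ids))
```

After this step, checking whether a candidate element occurs in a transaction is a single integer membership test rather than an itemset comparison.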
Finding Sequential Patterns - 4. Sequence Phase
• Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase.
• Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.
Finding Sequential Patterns - 4. Sequence Phase - AprioriAll
L_1 = {large 1-sequences}; // Result of litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
  C_k = New candidates generated from L_{k-1}
  foreach customer-sequence c in the database do
    Increment the count of all candidates in C_k that are contained in c
  L_k = Candidates in C_k with minimum support
end
Answer = Maximal Sequences in ∪_k L_k
Notation: L_k: set of all large k-sequences; C_k: set of candidate k-sequences
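The loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: sequences are tuples of litemset ids, a customer sequence is a list of id sets, candidate generation is reduced to the join step with self-joins excluded, and the names `is_contained`, `support` and `apriori_all_sequence_phase` are hypothetical.

```python
def is_contained(seq, customer_seq):
    """Return True if seq (a tuple of litemset ids) occurs, in order, in customer_seq."""
    pos = 0
    for litemset in seq:
        # Greedily advance to the earliest remaining transaction holding this litemset.
        while pos < len(customer_seq) and litemset not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next litemset must come from a later transaction
    return True


def support(seq, db):
    """Number of customer sequences in db that contain seq."""
    return sum(is_contained(seq, c) for c in db)


def apriori_all_sequence_phase(db, large_1, min_count):
    """Return the union of all L_k; finding the maximal sequences is a separate step."""
    large = set(large_1)
    all_large = set(large)
    while large:  # loop while L_{k-1} is non-empty
        # Join step only (assumption: self-joins excluded, prune step omitted).
        candidates = {p + (q[-1],) for p in large for q in large
                      if p != q and p[:-1] == q[:-1]}
        large = {s for s in candidates if support(s, db) >= min_count}
        all_large |= large
    return all_large
```

Omitting the prune step only means that a few extra candidates are counted; the sequences that reach minimum support are the same.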
Finding Sequential Patterns - 4. AprioriAll Candidate Generation
• The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows. First, join L_{k-1} with L_{k-1}:
insert into C_k
select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
from L_{k-1} p, L_{k-1} q
where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2};
• Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.
Finding Sequential Patterns - 4. AprioriAll Candidate Generation
Example:
<1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L_3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure.
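The join and prune steps of apriori-generate can be sketched as below. The function name `apriori_generate` and the set `L3` are assumptions: the table from the example slide did not survive, so `L3` was chosen only to be consistent with the statement that <1 2 4 3> is pruned because <2 4 3> is not large. Self-joins (p = q) are excluded here, which is also an assumption of this sketch.

```python
from itertools import combinations


def apriori_generate(large_prev):
    """Build candidate k-sequences from the set of large (k-1)-sequences (tuples)."""
    large_prev = set(large_prev)
    # Join: p and q agree on their first k-2 elements; append q's last element to p.
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p != q and p[:-1] == q[:-1]}
    # Prune: every (k-1)-subsequence of a candidate must itself be large.
    return {c for c in joined
            if all(sub in large_prev for sub in combinations(c, len(c) - 1))}


# Hypothetical L_3, chosen to be consistent with the example above.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_generate(L3))
```

With this `L3`, the join produces four candidates and the prune step keeps only (1, 2, 3, 4).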
Finding Sequential Patterns - 4. AprioriAll Maximal Phase
• Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = the length of the longest sequence.
for (k = n; k > 1; k--) do
  foreach k-sequence s_k do
    Delete from S all subsequences of s_k
• The authors state that data structures (hash trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
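A direct, unoptimized rendering of this loop is sketched below. It is not the hash-tree method the authors refer to, and it simplifies the problem by treating sequences as tuples of litemset ids, so it ignores containment between the underlying itemsets. The name `maximal_sequences` is hypothetical.

```python
from itertools import combinations


def maximal_sequences(large):
    """Keep only the sequences in large that are not subsequences of a longer one."""
    remaining = set(large)
    n = max((len(s) for s in remaining), default=0)
    for k in range(n, 1, -1):
        # Snapshot the k-sequences still present, then delete their proper subsequences.
        for s in [t for t in remaining if len(t) == k]:
            for m in range(1, k):
                for sub in combinations(s, m):
                    remaining.discard(sub)
    return remaining


S = {(1,), (2,), (3,), (4,), (1, 2), (1, 3), (2, 3), (1, 2, 3)}
print(maximal_sequences(S))
```

Enumerating all subsequences is exponential in the sequence length, which is acceptable for a sketch but is the cost that an indexed structure would avoid.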
Finding Sequential Patterns - 4. AprioriAll Example
Finding Sequential Patterns - 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequences to skip
Let hit_k = |L_k| / |C_k| (i.e., the ratio of large k-sequences to candidate k-sequences)
function next(k: integer) // k is the length of the sequences counted in the last pass
begin
  if (hit_k < 0.666) return k + 1;
  elseif (hit_k < 0.75) return k + 2;
  elseif (hit_k < 0.80) return k + 3;
  elseif (hit_k < 0.85) return k + 4;
  else return k + 5;
end
• next returns the length of the sequences to count in the next pass
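The skipping rule translates almost line for line into Python. In this sketch the hit ratio is passed in explicitly instead of being looked up from k; the names `hit_ratio` and `next_length` are hypothetical.

```python
def hit_ratio(num_large, num_candidates):
    """hit_k = |L_k| / |C_k|, the fraction of counted candidates that turned out large."""
    return num_large / num_candidates


def next_length(k, hit_k):
    """Length of the sequences to count in the next pass, using the thresholds above."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5
```

The intuition is that when most candidates of length k turn out to be large, longer candidates are likely to be large as well, so more intermediate lengths can be skipped.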
Finding Sequential Patterns - 4. AprioriSome Forward Phase
L_1 = {large 1-sequences}; // Result of litemset phase
C_1 = L_1;
last = 1; // We last counted C_last
for (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) do
begin
  if (L_{k-1} known) then
    C_k = New candidates generated from L_{k-1};
  else
    C_k = New candidates generated from C_{k-1};
  if (k == next(last)) then begin // next k to count
    foreach customer-sequence c in the database do
      Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
    last = k;
  end
end
Finding Sequential Patterns - 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
  if (L_k not found in forward phase) then begin
    Delete all sequences in C_k contained in some L_i, i > k;
    foreach customer-sequence c in D_T do
      Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
  end
  else // L_k already known
    Delete all sequences in L_k contained in some L_i, i > k;
Answer = ∪_k L_k // Maximal phase not needed
Notation: D_T: transformed database
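The forward and backward phases can be put together in one small sketch. Several things here are assumptions for illustration only: sequences are tuples of litemset ids, the skipping rule is passed in as `next_length` and defaults to doubling the length, and the database in the usage example is a small hypothetical one chosen so that the result agrees with the answer <1 2 3 4>, <1 3 5>, <4 5> given for the AprioriSome example. All function names are hypothetical.

```python
from itertools import combinations


def is_contained(seq, customer_seq):
    """True if seq (tuple of litemset ids) occurs, in order, in customer_seq."""
    pos = 0
    for litemset in seq:
        while pos < len(customer_seq) and litemset not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def is_subsequence(small, big):
    """True if tuple small is a (not necessarily contiguous) subsequence of big."""
    it = iter(big)
    return all(x in it for x in small)


def generate(prev):
    """Join and prune, as in apriori-generate (self-joins excluded by assumption)."""
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(s in prev for s in combinations(c, len(c) - 1))}


def apriori_some(db, large_1, min_count, next_length=lambda k: 2 * k):
    def large(cands):
        return {s for s in cands
                if sum(is_contained(s, c) for c in db) >= min_count}

    C, L = {1: set(large_1)}, {1: set(large_1)}
    last, k = 1, 2
    # Forward phase: generate every C_k, but count only the lengths chosen by next_length.
    while C[k - 1] and L[last]:
        C[k] = generate(L[k - 1] if k - 1 in L else C[k - 1])
        if k == next_length(last):
            L[k] = large(C[k])
            last = k
        k += 1
    # Backward phase: drop anything contained in a longer large sequence,
    # then count the candidates of the skipped lengths.
    for j in range(k - 1, 0, -1):
        longer = [s for i in L if i > j for s in L[i]]
        source = L[j] if j in L else C[j]
        kept = {s for s in source
                if not any(is_subsequence(s, t) for t in longer)}
        L[j] = kept if j in L else large(kept)
    return set().union(*L.values())


# Hypothetical transformed database of five customer sequences.
db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
print(apriori_some(db, {(1,), (2,), (3,), (4,), (5,)}, min_count=2))
```

Because non-maximal sequences are deleted as the backward phase descends, the union that is returned already consists of maximal sequences only.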
Finding Sequential Patterns - 4. AprioriSome Example
Minimum Support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns - 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates C_k from L_{k-1} or C_{k-1}.
• DynamicSome generates C_k "on the fly", based on the large sequences found in the previous passes and the customer sequences read from the database.
Finding Sequential Patterns - 4. DynamicSome
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.
• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.
Finding Sequential Patterns - 4. DynamicSome
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
o Example: if we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.
• The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns - 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L_1 = {large 1-sequences}; // Result of litemset phase
for (k = 2; k <= step and L_{k-1} ≠ ∅; k++) do
begin
  C_k = New candidates generated from L_{k-1}
  foreach customer-sequence c in D_T do
    Increment the count of all candidates in C_k that are contained in c
  L_k = Candidates in C_k with minimum support
end
Finding Sequential Patterns - 4. DynamicSome Forward Phase
for (k = step; L_k ≠ ∅; k += step) do
begin
  // find L_{k+step} from L_k and L_step
  C_{k+step} = ∅;
  foreach customer-sequence c in D_T do
  begin
    X = otf-generate(L_k, L_step, c);
    foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary)
  end
  L_{k+step} = Candidates in C_{k+step} with minimum support
end
Finding Sequential Patterns - 4. DynamicSome OTF-Generate
// c is the customer sequence <c_1 c_2 ... c_n>
X_k = subseq(L_k, c);
forall sequences x ∈ X_k do
  x.end = min{ j | x is contained in <c_1 c_2 ... c_j> };
X_j = subseq(L_j, c);
forall sequences x ∈ X_j do
  x.start = max{ j | x is contained in <c_j c_{j+1} ... c_n> };
Answer = join of X_k with X_j with the join condition X_k.end < X_j.start
Finding Sequential Patterns - 4. DynamicSome OTF-Generate - Cont
Let otf-generate be called with L_2 (as both L_k and L_j) and let c = <(1) (2) (3 7) (4)>. Thus c_1 = (1), c_2 = (2), etc.
Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3
Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
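The on-the-fly generation can be sketched as follows, using the customer sequence and the six 2-sequences from the table above. The helper names `earliest_end`, `latest_start` and `otf_generate` are hypothetical, and positions are 1-based to match the End and Start columns.

```python
def earliest_end(seq, c):
    """min j such that seq is contained in <c_1 ... c_j> (1-based), or None."""
    pos = 0
    for litemset in seq:
        while pos < len(c) and litemset not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def latest_start(seq, c):
    """max j such that seq is contained in <c_j ... c_n> (1-based), or None."""
    # Matching the reversed sequence against the reversed customer sequence
    # finds the latest possible placement of seq.
    end = earliest_end(seq[::-1], c[::-1])
    return None if end is None else len(c) - end + 1


def otf_generate(large_k, large_j, c):
    """Concatenate x in L_k and y in L_j that both occur in c without overlapping."""
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_j}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


c = [{1}, {2}, {3, 7}, {4}]
L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
print(otf_generate(L2, L2, c))
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition, so the single candidate produced is (1, 2, 3, 4), in agreement with the slide.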
Finding Sequential Patterns - 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
  if (L_k not yet determined) then
    if (L_{k-1} known) then
      C_k = New candidates generated from L_{k-1}
    else
      C_k = New candidates generated from C_{k-1}
Finding Sequential Patterns - 4. DynamicSome Example
• Let step = 2
• Use L_2 and L_2 as arguments to otf-generate to get C_4
Finding Sequential Patterns - 4. DynamicSome Example
• Get 2 candidate sequences
C_4          Support
<1 2 3 4>    2
<1 3 4 5>    1
• Only <1 2 3 4> is large
L_4          Support
<1 2 3 4>    2
Finding Sequential Patterns - 4. DynamicSome Example
• Pass L_2 and L_4 as arguments to otf-generate to get C_6
L_4          Support
<1 2 3 4>    2
• C_6 is found to be empty
C_6 = ∅
Finding Sequential Patterns - 4. DynamicSome Example
• In the intermediate phase, C_3 is generated from L_2 and C_5 from L_4, using apriori-generate
L_4          Support
<1 2 3 4>    2
• C_5 is found to be empty, so only C_3 is counted during the backward phase to get L_3
C_5 = ∅
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions
Finding Sequential Patterns4 AprioriAll Candidate Generation
bull The apriori-generate function takes as argument Lk-1 the set of all large (k-1)-sequences The function works as follows First join Lk-1 with Lk-1
insert into Ck
select plitemset1 plitemsetk-1 qlitemsetk-1
from Lk-1 p Lk-1 q
where plitemset1 = qlitemset1
plitemsetk-2 = qlitemsetk-2
bull Next delete all sequences c isin Ck such that some
(k-1)-subsequence of c is not in Lk-131
Finding Sequential Patterns4 AprioriAll Candidate Generation
Example
lt1 2 4 3gt is pruned out because the subsequence lt2 4 3gt is not in L3 The authors cite a previous paper for the proof of correctness of the candidate generation procedure
32
Finding Sequential Patterns4 AprioriAll Maximal Phase
bull Having found the set S of all large sequences in the sequence phase the following algorithm can be used to find the maximal sequences Let n = length of the longest sequence
for ( k = n k gt 1 k --)
foreach k-sequence sk do
Delete from S all subsequences of sk
bull Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees) citing two of their earlier papers
33
Finding Sequential Patterns4 AprioriAll Example
34
Finding Sequential Patterns: AprioriSome
• AprioriSome uses the function "next" to determine which sequences to skip.
Let hit_k = |L_k| / |C_k|
(i.e., the ratio of large k-sequences to candidate k-sequences)
    function next(k: integer)   // k is the length of the sequences counted in the last pass
    begin
        if (hit_k < 0.666) return k + 1;
        elsif (hit_k < 0.75) return k + 2;
        elsif (hit_k < 0.80) return k + 3;
        elsif (hit_k < 0.85) return k + 4;
        else return k + 5;
    end
• next returns the length of the sequences to count in the next pass.
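The "next" heuristic translates almost line for line into Python. In this sketch the hit ratio is passed in as an argument rather than read from shared state, and the name next_length is an assumption made to avoid shadowing Python's built-in next.

```python
def next_length(k, hit_k):
    """Length of the sequences to count in the next pass.

    k is the length counted in the last pass; hit_k = |L_k| / |C_k|.
    The higher the hit ratio, the more lengths are skipped.
    """
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


# If 9 of 10 candidate 2-sequences were large, skip ahead to length 7.
print(next_length(2, 9 / 10))
```

A low hit ratio means many candidates were not large, so skipping lengths would waste effort counting candidates built from non-large sequences; a high ratio makes skipping cheaper.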
Finding Sequential Patterns: AprioriSome Forward Phase
    L_1 = {large 1-sequences};   // Result of litemset phase
    C_1 = L_1;
    last = 1;   // We last counted C_last
    for ( k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++ ) do
    begin
        if (L_{k-1} known) then
            C_k = New candidates generated from L_{k-1};
        else
            C_k = New candidates generated from C_{k-1};
        if ( k == next(last) ) then begin   // next k to count
            foreach customer-sequence c in the database do
                Increment the count of all candidates in C_k that are contained in c.
            L_k = Candidates in C_k with minimum support.
            last = k;
        end
    end
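The counting step inside the forward phase ("increment the count of all candidates contained in c", then keep those with minimum support) can be sketched as below. This is a simple scan, assuming a transformed database in which each customer sequence is a list of sets of litemset ids and each candidate is a tuple of ids; the function names and the tiny database are illustrative, not taken from the slides.

```python
def contains(candidate, customer_sequence):
    """True if the candidate's ids occur in order, one per transaction."""
    remaining = iter(customer_sequence)
    return all(
        any(item in element for element in remaining) for item in candidate
    )


def large_sequences(candidates, database, min_count):
    """Count each candidate once per supporting customer; keep the large ones."""
    counts = {c: 0 for c in candidates}
    for customer_sequence in database:
        for c in candidates:
            if contains(c, customer_sequence):
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_count}


# Illustrative transformed database with three customers.
database = [
    [{1}, {2}, {3}],
    [{1}, {3}],
    [{2}, {3}],
]
C2 = {(1, 2), (1, 3), (2, 3)}
L2 = large_sequences(C2, database, min_count=2)
print(L2)
```

Support is counted per customer, not per occurrence, so a customer whose sequence contains a candidate several times still contributes 1.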
Finding Sequential Patterns: AprioriSome Backward Phase
    for ( k--; k >= 1; k-- ) do
        if (L_k not found in forward phase) then begin
            Delete all sequences in C_k contained in some L_i, i > k;
            foreach customer-sequence c in DT do
                Increment the count of all candidates in C_k that are contained in c.
            L_k = Candidates in C_k with minimum support.
        end
        else   // L_k already known
            Delete all sequences in L_k contained in some L_i, i > k.
    Answer = ∪_k L_k;   // Maximal phase not needed
Notation: DT = transformed database
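The deletion step of the backward phase, which is why no separate maximal phase is needed, can be sketched as follows. The sketch assumes containment is tested at the level of litemset ids (a plain ordered-subsequence test on tuples), and the skipped candidate set C_3 below is illustrative; only L_4 = {<1 2 3 4>} appears in the slides.

```python
def is_subsequence(short, long):
    """True if the ids of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(x in remaining for x in short)


def drop_non_maximal(candidates, longer_large):
    """Remove candidates contained in some already-known longer large sequence."""
    return {
        c
        for c in candidates
        if not any(is_subsequence(c, s) for s in longer_large)
    }


# Illustrative skipped candidate set C_3; L_4 was found in the forward phase.
C3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
L4 = {(1, 2, 3, 4)}
to_count = drop_non_maximal(C3, L4)
print(to_count)
```

Only the surviving candidates are counted against the database, which is where CountSome saves work compared with counting every length.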
Finding Sequential Patterns: AprioriSome Example
[Figure: AprioriSome worked example]
Minimum Support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns: DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates C_k from L_{k-1} or C_{k-1}.
• DynamicSome generates C_k "on the fly", based on the large sequences found in previous passes and the customer sequences read from the database.
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2, and 3.
• In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
    o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s_6.s_3> is a candidate (6+3)-sequence.
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
    o If we have counted L_6 and L_3, and L_9 turns out to be empty, we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.
• The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns: DynamicSome Initialization Phase
    // step is an integer ≥ 1
    L_1 = {large 1-sequences};   // Result of litemset phase
    for ( k = 2; k <= step and L_{k-1} ≠ ∅; k++ ) do
    begin
        C_k = New candidates generated from L_{k-1};
        foreach customer-sequence c in DT do
            Increment the count of all candidates in C_k that are contained in c.
        L_k = Candidates in C_k with minimum support.
    end
Finding Sequential Patterns: DynamicSome Forward Phase
    for ( k = step; L_k ≠ ∅; k += step ) do
    begin
        // find L_{k+step} from L_k and L_step
        C_{k+step} = ∅;
        foreach customer-sequence c in DT do
        begin
            X = otf-generate(L_k, L_step, c);
            foreach sequence x ∈ X, increment its count in C_{k+step} (adding it to C_{k+step} if necessary).
        end
        L_{k+step} = Candidates in C_{k+step} with minimum support.
    end
Finding Sequential Patterns: DynamicSome OTF-Generate
    // c is the customer sequence <c_1 c_2 ... c_n>
    X_k = subseq(L_k, c);
    forall sequences x ∈ X_k do
        x.end = min{ j | x is contained in <c_1 c_2 ... c_j> };
    X_j = subseq(L_j, c);
    forall sequences x ∈ X_j do
        x.start = max{ j | x is contained in <c_j c_{j+1} ... c_n> };
    Answer = join of X_k with X_j with the join condition X_k.end < X_j.start;
Finding Sequential Patterns: DynamicSome OTF-Generate (cont.)
Let otf-generate be called with L_2, and let c = <{1} {2} {3 7} {4}>. Thus c_1 = {1}, c_2 = {2}, etc.

    Seq      End   Start
    <1 2>    2     1
    <1 3>    3     1
    <1 4>    4     1
    <2 3>    3     2
    <2 4>    4     2
    <3 4>    4     3

Thus the result of the join with the join condition X_2.end < X_2.start (where X_2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
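The End/Start table and the join can be reproduced with a short Python sketch of otf-generate. It assumes sequences are tuples of litemset ids, that the customer sequence is a list of sets of ids, and that c = <{1} {2} {3 7} {4}> (four elements, which is what the End and Start columns imply). The helper names earliest_end and latest_start are hypothetical.

```python
def earliest_end(x, c):
    """Smallest 1-based j with x contained in <c_1 ... c_j>, else None."""
    i = 0
    for j, element in enumerate(c, start=1):
        if x[i] in element:
            i += 1
            if i == len(x):
                return j
    return None


def latest_start(x, c):
    """Largest 1-based j with x contained in <c_j ... c_n>, else None."""
    i = len(x) - 1
    for j in range(len(c), 0, -1):
        if x[i] in c[j - 1]:
            i -= 1
            if i < 0:
                return j
    return None


def otf_generate(large_k, large_j, c):
    """Join sequences of L_k and L_j that occur in c without overlapping."""
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_j}
    return {
        x + y
        for x, e in ends.items() if e is not None
        for y, s in starts.items() if s is not None and e < s
    }


L2 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
c = [{1}, {2}, {3, 7}, {4}]
candidates = otf_generate(L2, L2, c)
print(candidates)
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy end < start, so the join yields the single candidate <1 2 3 4>.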
Finding Sequential Patterns: DynamicSome Intermediate Phase
    for ( k--; k > 1; k-- ) do
        if (L_k not yet determined) then
            if (L_{k-1} known) then
                C_k = New candidates generated from L_{k-1};
            else
                C_k = New candidates generated from C_{k-1};
Finding Sequential Patterns: DynamicSome Example
• Let step = 2. Use L_2 and L_2 as arguments to otf-generate to get C_4.
• This yields 2 candidate sequences:

    C_4          Support
    <1 2 3 4>    2
    <1 3 4 5>    1

• Only <1 2 3 4> is large:

    L_4          Support
    <1 2 3 4>    2

• Pass L_2 and L_4 as arguments to otf-generate to get C_6.
• C_6 is found to be empty.
• In the intermediate phase, C_3 is generated from L_2, and C_5 from L_4, using apriori-generate.
• C_5 is found to be empty, so only C_3 is counted during the backward phase to get L_3.
Performance: Synthetic Data [figure]
Performance: Execution Times [figure]
Performance: Scale-Up, Customers [figure]
Performance: Scale-Up [figure]
Conclusion
• The problem of mining sequential patterns from a customer database was introduced.
• Two types of algorithms were presented to find sequential patterns:
    o CountAll: AprioriAll
    o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum supports.
• AprioriAll and AprioriSome have excellent scale-up properties.
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets, the same building blocks used for association rules.
Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximal sequences must be identified later.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during runtime, but the minimum support is not tested at all values of k.)
Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation maps each large itemset to an integer, so two litemsets can be compared for equality in constant time. This reduces the time needed to check whether a sequence is contained in a customer sequence, and with it the overall run time.
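A rough sketch of the transformation idea: each transaction is replaced by the set of ids of the large itemsets it contains, and transactions containing none are dropped. The id assignment and the item numbers below are illustrative assumptions, and the function name transform is hypothetical.

```python
def transform(customer_sequence, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains."""
    transformed = []
    for transaction in customer_sequence:
        ids = {
            litemset_id
            for litemset, litemset_id in litemset_ids.items()
            if litemset <= transaction  # litemset is a subset of the transaction
        }
        if ids:  # transactions containing no litemset are dropped
            transformed.append(ids)
    return transformed


# Illustrative mapping from large itemsets to integer ids.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
customer = [{10, 20}, {30}, {40, 60, 70}]
print(transform(customer, litemset_ids))
```

After this step, later passes compare small integers instead of itemsets, which is the constant-time comparison the answer refers to.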
Finding Sequential Patterns: 4. AprioriSome
• AprioriSome uses the function "next" to determine which sequences to skip
Let hitk = |Lk| / |Ck|
(i.e., the ratio of large k-sequences to candidate k-sequences)
function next(k: integer) // k is the length of the sequences counted in the last pass
begin
    if (hitk < 0.666) return k + 1;
    elsif (hitk < 0.75) return k + 2;
    elsif (hitk < 0.80) return k + 3;
    elsif (hitk < 0.85) return k + 4;
    else return k + 5;
end
• next returns the length of sequences to count in the next pass
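As a rough illustration, the skipping heuristic can be written as a small Python function. This is only a sketch: the name `next_length` and the choice to pass the hit ratio in as an argument are assumptions made for the example; the thresholds are the ones given on the slide.

```python
def next_length(k: int, hit_k: float) -> int:
    """Return the length of sequences to count in the next pass.

    k is the length counted in the last pass and hit_k = |Lk| / |Ck|,
    the fraction of candidate k-sequences that turned out to be large.
    """
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    return k + 5


# A low hit ratio skips nothing; a high one skips several lengths.
print(next_length(2, 0.5))   # 3
print(next_length(3, 0.9))   # 8
```

The higher the share of candidates that were large in the last counted pass, the more lengths are skipped, on the expectation that longer sequences are likely to be large as well.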
Finding Sequential Patterns: 4. AprioriSome - Forward Phase
L1 = {large 1-sequences}; // Result of litemset phase
C1 = L1;
last = 1; // We last counted Clast
for (k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++) do
begin
    if (Lk-1 known) then
        Ck = New candidates generated from Lk-1;
    else
        Ck = New candidates generated from Ck-1;
    if (k == next(last)) then begin // next k to count
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c;
        Lk = Candidates in Ck with minimum support;
        last = k;
    end
end
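The counting step inside the loop ("increment the count of all candidates in Ck that are contained in c") could look like the sketch below. It assumes that a transformed customer sequence is a list of sets of litemset ids and that a candidate is a tuple of ids; the names `is_contained` and `count_support` and the toy database are hypothetical and do not come from the slides.

```python
from typing import Dict, Iterable, List, Set, Tuple

Candidate = Tuple[int, ...]          # sequence of litemset ids
CustomerSequence = List[Set[int]]    # one set of litemset ids per transaction


def is_contained(candidate: Candidate, c: CustomerSequence) -> bool:
    """True if the candidate's ids appear, in order, in distinct elements of c."""
    i = 0
    for element in c:
        if i < len(candidate) and candidate[i] in element:
            i += 1
    return i == len(candidate)


def count_support(candidates: Iterable[Candidate],
                  database: List[CustomerSequence]) -> Dict[Candidate, int]:
    """Count, for each candidate, the customer sequences that contain it."""
    counts = {cand: 0 for cand in candidates}
    for c in database:
        for cand in counts:
            if is_contained(cand, c):
                counts[cand] += 1  # a customer contributes at most 1
    return counts


# Hypothetical toy database of three transformed customer sequences.
toy_db = [[{1}, {2}, {3}], [{1, 2}, {3}], [{2}, {1}]]
counts = count_support([(1, 2), (1, 3), (2, 3)], toy_db)
large = {cand for cand, n in counts.items() if n >= 2}  # minimum support of 2
print(counts)
print(large)
```

Matching each id against the earliest possible element is enough here, because taking the earliest match never rules out a later one.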
Finding Sequential Patterns: 4. AprioriSome - Backward Phase
for (k--; k >= 1; k--) do
    if (Lk not found in forward phase) then begin
        Delete all sequences in Ck contained in some Li, i > k;
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c;
        Lk = Candidates in Ck with minimum support;
    end
    else // Lk already known
        Delete all sequences in Lk contained in some Li, i > k;
Answer = ∪k Lk // Maximal phase not needed
Notation: DT = transformed database
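The deletion step of the backward phase ("delete all sequences in Lk contained in some Li, i > k") can be sketched as follows. As a simplifying assumption, each sequence is a tuple of litemset ids and "contained in" is treated as plain subsequence containment; `is_subsequence` and `drop_non_maximal` are hypothetical helper names, and the input is an illustrative partial set of large sequences rather than data from the slides.

```python
from typing import Dict, Set, Tuple

Sequence = Tuple[int, ...]


def is_subsequence(short: Sequence, long: Sequence) -> bool:
    """True if short can be obtained from long by deleting elements."""
    it = iter(long)
    return all(item in it for item in short)  # 'in' advances the iterator


def drop_non_maximal(large_by_length: Dict[int, Set[Sequence]]) -> Dict[int, Set[Sequence]]:
    """Remove every sequence that is contained in some longer large sequence."""
    result = {}
    for k, seqs in large_by_length.items():
        longer = [t for j, ts in large_by_length.items() if j > k for t in ts]
        result[k] = {s for s in seqs
                     if not any(is_subsequence(s, t) for t in longer)}
    return result


# Illustrative input: a few large sequences grouped by length.
large_sets = {
    2: {(1, 2), (3, 4), (4, 5)},
    3: {(1, 3, 5)},
    4: {(1, 2, 3, 4)},
}
maximal = drop_non_maximal(large_sets)
print(maximal)
```

In this example <1 2> and <3 4> are dropped because <1 2 3 4> contains them, while <4 5> and <1 3 5> survive.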
Finding Sequential Patterns: 4. AprioriSome - Example
Minimum Support = 40% (2 customer sequences)
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns: 4. DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase
• AprioriSome generates Ck from Lk-1 or Ck-1
• DynamicSome generates Ck "on the fly" based on large sequences found in the previous passes and the customer sequences read from the database
• In the initialization phase, count only sequences up to and including length step. If step is 3, count sequences of length 1, 2, and 3
• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database
o When generating sequences of length 9 with a step size of 3: while passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence
• In the intermediate phase, generate the candidate sequences for the skipped lengths
o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5
• The backward phase is identical to AprioriSome
Finding Sequential Patterns: 4. DynamicSome - Initialization Phase
// step is an integer ≥ 1
L1 = {large 1-sequences}; // Result of litemset phase
for (k = 2; k <= step and Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1;
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c;
    Lk = Candidates in Ck with minimum support;
end
Finding Sequential Patterns: 4. DynamicSome - Forward Phase
for (k = step; Lk ≠ ∅; k += step) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅;
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c);
        foreach sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary);
    end
    Lk+step = Candidates in Ck+step with minimum support;
end
Finding Sequential Patterns: 4. DynamicSome - OTF-Generate
// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min{ j | x ⊆ <c1 c2 ... cj> };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max{ j | x ⊆ <cj cj+1 ... cn> };
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start
Finding Sequential Patterns: 4. DynamicSome - OTF-Generate (cont.)
Let otf-generate be called with L2, and let c = <{1} {2} {3 7} {4}>. Thus c1 = {1}, c2 = {2}, etc.
Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3
Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>
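The otf-generate procedure and the example above can be sketched in Python. This is a minimal illustration under stated assumptions: a customer sequence is a list of sets of litemset ids, large sequences are tuples of ids, and the helper names `earliest_end`, `latest_start`, and `otf_generate` are hypothetical. The input L2 is the six sequences from the table plus <4 5>, which is not contained in c and is therefore filtered out.

```python
from typing import List, Optional, Set, Tuple

Sequence = Tuple[int, ...]


def earliest_end(x: Sequence, c: List[Set[int]]) -> Optional[int]:
    """x.end: smallest 1-based j with x contained in <c1 ... cj>, else None."""
    i = 0
    for j, element in enumerate(c, start=1):
        if x[i] in element:
            i += 1
            if i == len(x):
                return j
    return None


def latest_start(x: Sequence, c: List[Set[int]]) -> Optional[int]:
    """x.start: largest 1-based j with x contained in <cj ... cn>, else None."""
    i = len(x) - 1
    for j in range(len(c), 0, -1):
        if x[i] in c[j - 1]:
            i -= 1
            if i < 0:
                return j
    return None


def otf_generate(l_k: List[Sequence], l_j: List[Sequence],
                 c: List[Set[int]]) -> Set[Sequence]:
    """Join sequences of l_k and l_j that occur in c without overlapping."""
    ends = [(x, earliest_end(x, c)) for x in l_k]
    starts = [(x, latest_start(x, c)) for x in l_j]
    return {xk + xj
            for xk, end in ends if end is not None
            for xj, start in starts if start is not None
            if end < start}  # join condition Xk.end < Xj.start


c = [{1}, {2}, {3, 7}, {4}]
l2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
result = otf_generate(l2, l2, c)
print(result)  # {(1, 2, 3, 4)}
```

Only <1 2> (end = 2) and <3 4> (start = 3) satisfy the join condition, which reproduces the single candidate <1 2 3 4> from the slide.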
Finding Sequential Patterns: 4. DynamicSome - Intermediate Phase
for (k--; k > 1; k--) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
        else
            Ck = New candidates generated from Ck-1;
Finding Sequential Patterns: 4. DynamicSome - Example
• Let step = 2. Use L2 and L2 as arguments to otf-generate to get C4
• This yields 2 candidate sequences:
C4           Support
<1 2 3 4>    2
<1 3 4 5>    1
• Only <1 2 3 4> is large:
L4           Support
<1 2 3 4>    2
• Pass L2 and L4 as arguments to otf-generate to get C6
• C6 is found to be empty (C6 = ∅)
• In the intermediate phase, C3 is generated from L2 and C5 from L4 using apriori-generate
• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3
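The apriori-generate step used in the intermediate phase can be sketched as a join followed by a prune. This is a hedged reading of the candidate generation, not code from the slides: sequences are tuples of litemset ids, two (k-1)-sequences are joined when they agree on their first k-2 ids, and a candidate is kept only if every one of its (k-1)-subsequences is large. The name `apriori_generate` and the second input set are assumptions made for illustration.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple

Sequence = Tuple[int, ...]


def apriori_generate(large_prev: Iterable[Sequence]) -> Set[Sequence]:
    """Generate candidate k-sequences from the large (k-1)-sequences."""
    large = set(large_prev)
    if not large:
        return set()
    k_minus_1 = len(next(iter(large)))
    # Join: extend p with the last id of any q sharing p's first k-2 ids.
    # p may equal q, so repeated ids such as <1 1> can be generated
    # (an assumption of this sketch).
    joined = {p + (q[-1],) for p in large for q in large if p[:-1] == q[:-1]}
    # Prune: every (k-1)-subsequence of a candidate must itself be large.
    return {cand for cand in joined
            if all(sub in large for sub in combinations(cand, k_minus_1))}


# C5 from L4 = {<1 2 3 4>} is empty, as in the example above.
c5 = apriori_generate({(1, 2, 3, 4)})
print(c5)  # set()

# Hypothetical L2, used only to show a non-empty result.
c3 = apriori_generate({(1, 2), (1, 3), (2, 3)})
print(c3)  # {(1, 2, 3)}
```

`itertools.combinations` preserves the order of the ids, so it enumerates exactly the subsequences obtained by dropping one element from the candidate.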
Performance: Synthetic Data
Performance: Execution Times
Performance: Scale-Up (Customers)
Performance: Scale-Up
Conclusion
• The problem of mining sequential patterns from a customer DB was introduced
• Two types of algorithms were introduced to find sequential patterns
o CountAll - AprioriAll
o CountSome - AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup
• AprioriAll and AprioriSome have excellent scale-up properties
Final Exam Question 1
Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here: the litemset phase finds the large itemsets (the intra-transaction patterns that association rules are built from), and the sequence phase then looks for sequential patterns made up of those itemsets.
Final Exam Question 2
What is the major difference between the two types of algorithms, CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each pass, but maximal sequences must be checked for later.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during runtime, but the minimum support is not tested at all values of k.)
Final Exam Question 3
Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation allows each record to be looked up in constant time, reducing the run time.
Finding Sequential Patterns4 AprioriSome Example - Cont
Minimum Support = 40 (2 customer sequences)
39
Finding Sequential Patterns4 AprioriSome Example
Answer lt1 2 3 4gt lt1 3 5gt lt4 5gt40
Finding Sequential Patterns - DynamicSome
• Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase.
• AprioriSome generates Ck from Lk-1 or Ck-1.
• DynamicSome generates Ck "on the fly", based on the large sequences found in the previous passes and the customer sequences read from the database.
Finding Sequential Patterns - DynamicSome
• In the initialization phase, count only sequences of length up to and including the step variable. If step is 3, count sequences of length 1, 2 and 3.
• In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
  o Example: generating sequences of length 9 with a step size of 3. While passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate (6+3)-sequence.
Finding Sequential Patterns - DynamicSome
• In the intermediate phase, generate the candidate sequences for the skipped lengths.
  o If we have counted L6 and L3, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5.
• The backward phase is identical to AprioriSome.
Finding Sequential Patterns - DynamicSome Initialization Phase
// step is an integer ≥ 1
L1 = {large 1-sequences}; // Result of litemset phase
for ( k = 2; k <= step and Lk-1 ≠ ∅; k++ ) do
begin
    Ck = New candidates generated from Lk-1;
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c;
    Lk = Candidates in Ck with minimum support;
end
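The counting step of the initialization phase can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the toy database DT, the candidate list C2, and the helper names is_contained, count_support and large are illustrative and do not come from the slides. Candidate generation itself is not shown.

```python
# Toy transformed database (an assumption for illustration): each customer
# sequence is a list of transactions, and each transaction is a set of
# litemset ids.
DT = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]


def is_contained(seq, customer):
    """Return True if the ids in seq occur, in order, in distinct transactions."""
    pos = 0
    for item in seq:
        # Scan forward to the first transaction that holds this id.
        while pos < len(customer) and item not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1  # the next id must come from a later transaction
    return True


def count_support(candidates, database):
    """Count, for each candidate, the customer sequences that contain it."""
    return {cand: sum(is_contained(cand, c) for c in database)
            for cand in candidates}


def large(counts, min_count):
    """Keep the candidates that reach the minimum support count."""
    return {cand for cand, n in counts.items() if n >= min_count}


# Hypothetical candidate 2-sequences; a real run would generate them from L1.
C2 = [(1, 2), (1, 3), (4, 5), (5, 1), (2, 5)]
counts = count_support(C2, DT)
L2 = large(counts, min_count=2)
print(counts)
print(sorted(L2))
```

The greedy left-to-right scan is enough here because matching each id at its earliest possible transaction never rules out a later match.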
Finding Sequential Patterns - DynamicSome Forward Phase
for ( k = step; Lk ≠ ∅; k += step ) do
begin
    // find Lk+step from Lk and Lstep
    Ck+step = ∅;
    foreach customer-sequence c in DT do
    begin
        X = otf-generate(Lk, Lstep, c);
        foreach sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary);
    end
    Lk+step = Candidates in Ck+step with minimum support;
end
Finding Sequential Patterns - DynamicSome OTF-Generate
// c is the customer sequence <c1 c2 ... cn>
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min { j | x ⊆ <c1 c2 ... cj> };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max { j | x ⊆ <cj cj+1 ... cn> };
Answer = join of Xk with Xj with the join condition Xk.end < Xj.start;
Finding Sequential Patterns - DynamicSome OTF-Generate - Cont
Let otf-generate be called with L2, and let c = <1 2 (3 7) 4>. Thus c1 = 1, c2 = 2, etc.
The end and start values of each sequence in L2 that is contained in c are:
Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3
Thus the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence <1 2 3 4>.
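The end and start values in this example can be checked with a short Python sketch of otf-generate. It is a minimal sketch, not the authors' implementation: the function names earliest_end, latest_start and otf_generate are assumptions, customer sequences are modelled as lists of sets, and the six sequences of the table are passed in as L2.

```python
def earliest_end(seq, customer):
    """Smallest 1-based j such that seq is contained in <c1 ... cj>, else None."""
    pos = 0
    for item in seq:
        while pos < len(customer) and item not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return None
        pos += 1
    # pos is now one past the 0-based index of the last match,
    # which equals the 1-based index of that match.
    return pos


def latest_start(seq, customer):
    """Largest 1-based j such that seq is contained in <cj ... cn>, else None."""
    pos = len(customer) - 1
    for item in reversed(seq):
        while pos >= 0 and item not in customer[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    # pos + 1 is the 0-based index of the first match; add 1 more for 1-based.
    return pos + 2


def otf_generate(L_k, L_j, customer):
    """Join the sequences of L_k and L_j that occur in customer without overlap."""
    ends = {x: earliest_end(x, customer) for x in L_k}
    starts = {y: latest_start(y, customer) for y in L_j}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


c = [{1}, {2}, {3, 7}, {4}]  # the customer sequence <1 2 (3 7) 4>
L2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
for x in L2:
    print(x, earliest_end(x, c), latest_start(x, c))
print(otf_generate(L2, L2, c))
```

Scanning from the left gives the earliest possible end and scanning from the right gives the latest possible start, so the join condition end < start holds exactly when the two sequences can be placed in c without overlapping.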
Finding Sequential Patterns - DynamicSome Intermediate Phase
for ( k--; k > 1; k-- ) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
        else
            Ck = New candidates generated from Ck-1;
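The loop above only decides, for each skipped length, which set the candidates are generated from. A minimal Python sketch of that decision follows. The function name intermediate_plan and its arguments are assumptions made for illustration, and the sketch reports the source set only; it does not generate or count any candidates.

```python
def intermediate_plan(k_empty, determined):
    """List (candidate set, source set) pairs for the skipped lengths.

    k_empty    -- the length at which the forward phase found an empty set
    determined -- the lengths k whose Lk was counted in earlier phases
    """
    plan = []
    # Mirrors "for ( k--; k > 1; k-- )": start one below k_empty, stop at 2.
    for k in range(k_empty - 1, 1, -1):
        if k in determined:
            continue  # Lk is already known, nothing to generate
        source = f"L{k - 1}" if (k - 1) in determined else f"C{k - 1}"
        plan.append((f"C{k}", source))
    return plan


# step = 2: L1, L2 and L4 were counted and the set of length 6 was empty.
print(intermediate_plan(6, {1, 2, 4}))
# step = 3: L1..L3 and L6 were counted and the set of length 9 was empty.
print(intermediate_plan(9, {1, 2, 3, 6}))
```

With step = 3 the plan covers C8, C7, C5 and C4, which are the skipped lengths named in the description of the intermediate phase.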
Finding Sequential Patterns - DynamicSome Example
• Let step = 2. Use L2 and L2 as arguments in otf-generate to get C4.
• This gives 2 candidate sequences:
C4           Support
<1 2 3 4>    2
<1 3 4 5>    1
Finding Sequential Patterns - DynamicSome Example
• Only <1 2 3 4> is large:
C4           Support
<1 2 3 4>    2
<1 3 4 5>    1
L4           Support
<1 2 3 4>    2
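One forward pass with step = 2 can be sketched in Python as below. Everything here is a sketch under stated assumptions: the toy database DT and the set L2 are chosen for illustration and are not given in these slides, the names is_contained and forward_pass are hypothetical, and the on-the-fly generation is replaced by a brute-force stand-in that concatenates every pair of sequences and keeps the pairs whose concatenation is contained in the customer sequence. That test is equivalent to the non-overlap condition, but it is not how otf-generate computes it.

```python
from collections import Counter

# Toy transformed database (assumed): lists of transactions, each a set of ids.
DT = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

# Large 2-sequences assumed for this toy database.
L2 = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]


def is_contained(seq, customer):
    """True if the ids in seq occur in order in distinct transactions."""
    transactions = iter(customer)
    # Each any() consumes the iterator up to and including its match,
    # so the next id is searched for in later transactions only.
    return all(any(item in t for t in transactions) for item in seq)


def forward_pass(L_k, L_step, database, min_count):
    """Count candidates of length k + step and keep the large ones."""
    counts = Counter()
    for customer in database:
        # Brute-force stand-in for otf-generate(L_k, L_step, customer).
        found = {x + y for x in L_k for y in L_step
                 if is_contained(x + y, customer)}
        counts.update(found)  # each candidate counts once per customer
    large = {seq: n for seq, n in counts.items() if n >= min_count}
    return dict(counts), large


C4, L4 = forward_pass(L2, L2, DT, min_count=2)
print(C4)
print(L4)
```

On this toy data the pass yields the same two candidates and the same single large 4-sequence as the tables above.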
Finding Sequential Patterns - DynamicSome Example
• Pass L2 and L4 as arguments to otf-generate to get C6.
L4           Support
<1 2 3 4>    2
• C6 is found to be empty.
C6 = ∅
Finding Sequential Patterns - DynamicSome Example
• In the intermediate phase, C3 is generated from L2 and C5 from L4, using apriori-generate.
• C5 is found to be empty, so only C3 is counted during the backward phase to get L3.
C5 = ∅
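Performance - Synthetic Data [slide content not recovered]
Performance - Execution Times [slide content not recovered]
Performance - Scale-Up Customers [slide content not recovered]
Performance - Scale-Up [slide content not recovered]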
Conclusion
• The problem of mining sequential patterns from a customer DB was introduced.
• Two types of algorithms were introduced to find sequential patterns:
  o CountAll - AprioriAll
  o CountSome - AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup.
• AprioriAll and AprioriSome have excellent scale-up properties.
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both are used in the Apriori algorithms studied here, because the algorithms look for sequential patterns whose elements are large itemsets, the same intra-transaction patterns that association rule mining finds.
Final Exam Question 2: What is the major difference between the two algorithms, CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (The minimum support is checked for each sequence on each run, but maximal sequences must be checked for later.)
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support. (Non-maximal sequences are pruned out during runtime, but the minimum support is not tested at all values of k.)
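Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each large itemset with a single integer, so two litemsets can be compared for equality in constant time. This reduces the time needed to check whether a sequence is contained in a customer sequence, which reduces the run time.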
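A minimal Python sketch of such a transformation is given below. It assumes that each large itemset has already been assigned an integer id; the mapping LITEMSET_IDS, the item numbers, and the function name transform are illustrative assumptions, not values from the slides.

```python
# Hypothetical mapping from large itemsets to integer ids.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(customer_sequence, litemset_ids):
    """Replace each transaction by the set of ids of the litemsets it contains.

    Transactions that contain no large itemset are dropped.
    """
    transformed = []
    for transaction in customer_sequence:
        ids = {i for litemset, i in litemset_ids.items()
               if litemset <= transaction}  # subset test on the raw items
        if ids:
            transformed.append(ids)
    return transformed


# A customer who bought items {10, 20}, then {30}, then {40, 60, 70}.
print(transform([{10, 20}, {30}, {40, 60, 70}], LITEMSET_IDS))
print(transform([{30}, {90}], LITEMSET_IDS))
```

After this step, checking whether an element of a candidate sequence occurs in a transaction is a single set-membership test on an integer instead of a subset test on raw items.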
Finding Sequential Patterns4 DynamicSome Initialization Phase
step is an integer ge 1
L1 = large 1-sequences Result of litemset phase
for ( k = 2 k lt= step and Lk1048576-1 ne k++ ) do
begin
Ck = New candidates generated from Lk1048576-1
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
44
Finding Sequential Patterns4 DynamicSome Forward Phase
for ( k = step Lk1048576 ne k+= step ) do
begin
find Lk+step from Lk and Lstep
Ck+step =
foreach customer-sequence c in DT do
begin
X = otf-generate(Lk Lstep c)
foreach sequence x isin Xrsquo increment its count in
Ck+step (adding it to Ck+step if necessary)
end
Lk+step = Candidates in Ck+step with minimum support
end45
Finding Sequential Patterns4 DynamicSome OTF-Generate
c is the customer sequence lt c1c2cn gt
Xk = subseq(Lk c)
forall sequences x isin Xk do
xend = min j | x sube lt c1c2cj gt
Xj = subseq(Lj c)
forall sequences x isin Xj do
xstart = max j | x sube lt cjcj+1cn gt
Answer = join of Xkwith Xj with the join condition
Xkend lt Xjstart46
Finding Sequential Patterns 4: DynamicSome OTF-Generate (cont.)
Let otf-generate be called with L2 (as both L_k and L_j), and let c = <{1} {2} {3 7} {4}>. Thus c_1 = {1}, c_2 = {2}, etc.
Seq      End  Start
<1 2>    2    1
<1 3>    3    1
<1 4>    4    1
<2 3>    3    2
<2 4>    4    2
<3 4>    4    3
Only <1 2> (end 2) and <3 4> (start 3) satisfy the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2), so the result of the join is the single sequence <1 2 3 4>.
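The worked example above can be reproduced with a short Python sketch. It assumes a customer sequence is a list of sets of litemset ids and a sequence is a tuple of ids. The helper name subseq_positions is hypothetical, and L2 is taken to be just the six sequences listed in the table.

```python
from typing import FrozenSet, Iterable, Optional, Sequence, Set, Tuple

Seq = Tuple[int, ...]
CustomerSeq = Sequence[FrozenSet[int]]


def subseq_positions(seq: Seq, customer: CustomerSeq) -> Optional[Tuple[int, int]]:
    """Return (end, start) as 1-based positions, or None if seq is not contained.

    end   = min j such that seq is contained in <c_1 ... c_j>
    start = max j such that seq is contained in <c_j ... c_n>
    """
    pos = 0
    for item in seq:  # greedy match from the left gives the earliest end
        while pos < len(customer) and item not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return None
        pos += 1
    end = pos
    pos = len(customer) - 1
    for item in reversed(seq):  # greedy match from the right gives the latest start
        while item not in customer[pos]:
            pos -= 1
        pos -= 1
    start = pos + 2
    return end, start


def otf_generate(large_k: Iterable[Seq], large_j: Iterable[Seq],
                 customer: CustomerSeq) -> Set[Seq]:
    """Join contained sequences x (from L_k) and y (from L_j) where x.end < y.start."""
    xk = [(s, subseq_positions(s, customer)) for s in large_k]
    xj = [(s, subseq_positions(s, customer)) for s in large_j]
    return {x + y
            for x, px in xk if px is not None
            for y, py in xj if py is not None
            if px[0] < py[1]}


c = [frozenset({1}), frozenset({2}), frozenset({3, 7}), frozenset({4})]
L2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
result = otf_generate(L2, L2, c)
```

With these inputs the join produces only (1, 2, 3, 4), matching the slide.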
Finding Sequential Patterns 4: DynamicSome Intermediate Phase
for ( k--; k > 1; k-- ) do
    if (L_k not yet determined) then
        if (L_{k-1} known) then
            C_k = new candidates generated from L_{k-1};
        else
            C_k = new candidates generated from C_{k-1};
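The "new candidates generated from" step can be sketched as an Apriori-style join followed by a prune. This is a hedged sketch rather than the authors' exact procedure: it assumes sequences are tuples of litemset ids, that two sequences are joined when they agree on everything but the last id, and that a candidate survives only if every subsequence obtained by dropping one id is in the input set. The function name apriori_generate mirrors the slides, while the L3 input below is illustrative.

```python
from typing import Iterable, Set, Tuple

Seq = Tuple[int, ...]


def apriori_generate(large_prev: Iterable[Seq]) -> Set[Seq]:
    """Generate length-k candidates from a set of length-(k-1) sequences."""
    large = set(large_prev)
    # Join step: pair sequences that share their first k-2 ids.
    joined = {p + (q[-1],) for p in large for q in large if p[:-1] == q[:-1]}

    def all_subsequences_present(cand: Seq) -> bool:
        # Drop each position in turn and check the shorter sequence is known.
        return all(cand[:i] + cand[i + 1:] in large for i in range(len(cand)))

    # Prune step: discard candidates with a missing (k-1)-subsequence.
    return {cand for cand in joined if all_subsequences_present(cand)}


# L4 as in the DynamicSome example: the only possible 5-candidate is pruned.
c5 = apriori_generate({(1, 2, 3, 4)})

# Illustrative L3 (an assumption for this sketch, not shown on the slides).
l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
c4 = apriori_generate(l3)
```

Under these assumptions c5 is empty, which agrees with the example in which C5 is found to be empty.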
Finding Sequential Patterns 4: DynamicSome Example
• Let step = 2. Use L2 and L2 as the arguments to otf-generate to get C4.
• This yields 2 candidate sequences:
C4           Support
<1 2 3 4>    2
<1 3 4 5>    1
• Only <1 2 3 4> is large:
L4           Support
<1 2 3 4>    2
• Pass L2 and L4 as arguments to otf-generate to get C6.
• C6 is found to be empty: C6 = ∅.
• In the intermediate phase, C3 is generated from L2 and C5 from L4 using apriori-generate.
• C5 is found to be empty (C5 = ∅), so only C3 is counted during the backward phase to get L3.
Performance: Synthetic Data
Performance: Execution Times
Performance: Scale-Up (Customers)
Performance: Scale-Up
Conclusion
• The problem of mining sequential patterns from a customer database was introduced.
• Two types of algorithms were introduced to find sequential patterns:
    o CountAll: AprioriAll
    o CountSome: AprioriSome, DynamicSome
• AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minimum support.
• AprioriAll and AprioriSome have excellent scale-up properties.
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules describe intra-transaction patterns, while sequential patterns describe inter-transaction patterns. Both ideas appear in the Apriori algorithms studied here: the large itemsets found within transactions (the same building blocks used for association rules) become the elements of the sequential patterns that the algorithms search for across transactions.
Final Exam Question 2: What is the major difference between the two algorithms CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality: minimum support is checked for every candidate sequence on every pass, but non-maximal sequences are only pruned afterwards.
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support: non-maximal sequences are pruned during the run, but support is not counted for every length k.
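The maximality check mentioned in this answer can be sketched in Python. This is a simple quadratic filter for illustration, not the pruning strategy of either algorithm. It assumes a sequence is a tuple of itemsets and that one sequence contains another when the smaller one's itemsets are subsets of distinct, order-preserving itemsets of the larger one. The names contains and maximal_only and the sample sequences are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

ItemSeq = Tuple[FrozenSet[int], ...]


def contains(big: ItemSeq, small: ItemSeq) -> bool:
    """True if each itemset of small is a subset of a later-and-later itemset of big."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal_only(sequences: Iterable[ItemSeq]) -> List[ItemSeq]:
    """Keep only sequences not contained in any other sequence of the collection."""
    seqs = list(sequences)
    return [s for s in seqs
            if not any(s != t and contains(t, s) for t in seqs)]


fs = frozenset
large = [
    (fs({30}), fs({90})),
    (fs({30}), fs({40, 70})),
    (fs({30}),),
    (fs({30}), fs({40})),
]
maximal = maximal_only(large)
```

In this sample the last two sequences are contained in longer ones, so only the first two remain.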
Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction with the set of large itemsets it contains, each represented by an integer, so two litemsets can be compared in constant time. This speeds up the repeated test of whether a candidate sequence is contained in a customer sequence, reducing the run time.
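A minimal Python sketch of the transformation follows, assuming litemsets have already been mapped to integer ids and that transactions containing no litemset are dropped. The function name transform, the id mapping, and the sample customer sequence are assumptions made for this illustration.

```python
from typing import Dict, FrozenSet, List, Sequence


def transform(customer_sequence: Sequence[FrozenSet[int]],
              litemset_ids: Dict[FrozenSet[int], int]) -> List[FrozenSet[int]]:
    """Replace each transaction by the ids of the litemsets it contains."""
    transformed = []
    for transaction in customer_sequence:
        ids = frozenset(i for litemset, i in litemset_ids.items()
                        if litemset <= transaction)
        if ids:  # a transaction with no litemset contributes nothing
            transformed.append(ids)
    return transformed


fs = frozenset
# Hypothetical litemset-to-integer mapping.
litemset_ids = {fs({30}): 1, fs({40}): 2, fs({70}): 3, fs({40, 70}): 4, fs({90}): 5}
# Hypothetical customer sequence of three transactions.
customer = [fs({10, 20}), fs({30}), fs({40, 60, 70})]
transformed = transform(customer, litemset_ids)
```

After this step, checking whether a litemset occurs in a transaction is a set-membership test on integers rather than a comparison of itemsets.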
Finding Sequential Patterns4 DynamicSome Example
bull pass as arg to otf-gen L2 and L4 to get C6
L4 sup
lt1 2 3 4gt 2
52
Finding Sequential Patterns4 DynamicSome Example
bull C6 is found to be empty
L4 sup
lt1 2 3 4gt 2C6 =
53
Finding Sequential Patterns4 DynamicSome Example
bull In the intermediate phase C3 is generated
C3
from L2 and C5 from L4
using apriori-generate
L4 sup
lt1 2 3 4gt 2C5
54
Finding Sequential Patterns4 DynamicSome Example
bull C5 is found to be empty so only C3 is counted during the backward phase to get L3
L3 C3
C5 =
55
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions56
Performance Synthetic Data
57
Performance Execution Times
58
Performance Scale-Up Customers
59
Performance Scale-Up
60
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions61
Conclusion
bull The problem of mining sequential patterns from a customer DB was introduced
bull Two types of algorithms were introduced to find sequential patternso CountAll -AprioriAllo CountSome -AprioriSome DynamicSome
bull AprioriAll and AprioriSome have comparable performance with AprioriSome slightly better for lower minisup
bull AprioriAll and AprioriSome have excellent scale-up properties 62
Outline
Introduction
Problem Description
Finding Sequential Patterns
Sequence Phase
Performance
Conclusion
Final Exam Questions63
Final Exam Question 1
Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both appear in the Apriori algorithms studied here: the algorithms first find the large itemsets, the same building block used for association rules, and then look for frequent ordered sequences of those itemsets.
Final Exam Question 2
What is the major difference between the two algorithms CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support and careless with respect to maximality: the minimum support is checked for each sequence on each pass, but maximal sequences must be identified in a later step.
CountSome (AprioriSome) is careful with respect to maximality but careless with respect to minimum support: non-maximal sequences are pruned at run time, but the minimum support is not tested at every value of k.
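The later maximality check that CountAll relies on can be sketched as below. This is a simple quadratic sketch for illustration only, not an efficient implementation: it assumes each sequence is a tuple of frozensets of items, and the names `is_contained` and `maximal_sequences` as well as the sample item numbers are hypothetical.

```python
def is_contained(a, b):
    """True if sequence a is contained in sequence b.

    Each itemset of a must be a subset of a distinct itemset of b,
    with the order of itemsets preserved.
    """
    i = 0
    for itemset in b:
        if i < len(a) and a[i] <= itemset:
            i += 1
    return i == len(a)


def maximal_sequences(large):
    """Keep only the sequences not contained in any other large sequence."""
    return [
        s for s in large
        if not any(s != t and is_contained(s, t) for t in large)
    ]


def seq(*itemsets):
    """Helper to write a sequence as a tuple of frozensets."""
    return tuple(frozenset(items) for items in itemsets)


# Illustrative large sequences (assumed sample data).
large = [
    seq({30}), seq({40}), seq({70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}),
    seq({40, 70}), seq({30}, {40, 70}),
]

for s in maximal_sequences(large):
    print([sorted(itemset) for itemset in s])
```

Only the sequences <(30) (90)> and <(30) (40 70)> survive for this sample, since every other sequence is contained in one of them.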
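Final Exam Question 3
Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation replaces each transaction with the set of large itemsets it contains, each mapped to an integer. Two large itemsets can then be compared in constant time, which speeds up the repeated tests of whether a sequence is contained in a customer sequence and so reduces the run time.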
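A minimal sketch of the transformation idea follows. It assumes a mapping from each large itemset to an integer id is already available; the function name `transform`, the mapping `litemset_ids`, and the item numbers are illustrative assumptions rather than the presenter's code.

```python
def transform(customer_sequence, litemset_ids):
    """Replace each transaction with the ids of the large itemsets it contains.

    Transactions containing no large itemset are dropped.
    """
    transformed = []
    for transaction in customer_sequence:
        ids = {
            lid
            for litemset, lid in litemset_ids.items()
            if litemset <= transaction
        }
        if ids:
            transformed.append(ids)
    return transformed


# Hypothetical mapping of large itemsets to integer ids.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

# One customer's transactions in time order (assumed sample data).
customer = [{10, 20}, {30}, {40, 60, 70}]
print(transform(customer, litemset_ids))
```

After this step, checking whether a candidate sequence of ids is supported by a customer only needs integer membership tests against each transformed transaction, instead of repeated itemset comparisons.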