7/03 Data Mining – Sequences
H. Liu (ASU) & G Dong (WSU)
7. Sequence Mining
• Sequences and Strings
• Recognition with Strings
• MM & HMM
• Sequence Association Rules
Sequences and Strings
• A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence
  – Sequences and strings are often used as synonyms
  – String elements (characters, letters, or symbols) are nominal
  – Text is a type of particularly long string
• |x| denotes the length of sequence x
  – |AGCTTC| is 6
• Any contiguous string that is part of x is called a substring, segment, or factor of x
  – GCT is a factor of AGCTTC
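A quick illustration of these definitions in Python (the variable name is just for the example; Python's `in` operator performs substring search):

```python
# |x| and the factor (substring) relation, using the slide's example string.
x = "AGCTTC"
print(len(x))        # 6: |AGCTTC| = 6
print("GCT" in x)    # True: GCT is a factor of AGCTTC
```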
Recognition with Strings
• String matching
  – Given x and text, determine whether x is a factor of text
• Edit distance (for inexact string matching)
  – Given two strings x and y, compute the minimum number of basic operations (character insertions, deletions and exchanges) needed to transform x into y
String Matching
• Given |text| >> |x|, with characters taken from an alphabet A
  – A can be {0, 1}, {0, 1, 2, …, 9}, {A, G, C, T}, or {A, B, …}
• A shift s is an offset needed to align the first character of x with character number s+1 in text
• Find if there exists a valid shift where there is a perfect match between the characters in x and the corresponding ones in text
Naïve (Brute-Force) String Matching
• Given A, x, text, n = |text|, m = |x|
    s = 0
    while s ≤ n−m
        if x[1…m] = text[s+1…s+m] then print "pattern occurs at shift" s
        s = s + 1
• Time complexity (worst case): O((n−m+1)m)
• One character shift at a time is not necessary
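The pseudocode above can be sketched as a runnable Python function (the name `naive_match` is ours, not from the slides; 0-based indexing replaces the slide's 1-based convention):

```python
def naive_match(x, text):
    """Return all valid shifts s where pattern x occurs in text.

    Brute-force matcher: worst-case time O((n - m + 1) * m)
    with n = len(text) and m = len(x).
    """
    n, m = len(text), len(x)
    shifts = []
    for s in range(n - m + 1):
        # Compare x against the m characters of text starting at shift s.
        if text[s:s + m] == x:
            shifts.append(s)
    return shifts

print(naive_match("GCT", "AGCTTCGCT"))  # [1, 6]
```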
Boyer-Moore and KMP
• See StringMatching.ppt and do not use the following alg
• Given A, x, text, n = |text|, m = |x|
    F(x) = last-occurrence function
    G(x) = good-suffix function
    s = 0
    while s ≤ n−m
        j = m
        while j > 0 and x[j] = text[s+j]
            j = j − 1
        if j = 0 then
            print "pattern occurs at shift" s
            s = s + G(0)
        else s = s + max[G(j), j − F(text[s+j])]
Edit Distance
• ED between x and y describes how many fundamental operations are required to transform x to y.
• Fundamental operations (x = 'excused', y = 'exhausted')
  – Substitutions, e.g. 'c' is replaced by 'h'
  – Insertions, e.g. 'a' is inserted into x after 'h'
  – Deletions, e.g. a character in x is deleted
• ED is one way of measuring similarity between two strings
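The minimum edit distance can be computed by dynamic programming; a sketch using the standard Levenshtein recurrence (the tabular formulation is ours, not taken verbatim from the slides):

```python
def edit_distance(x, y):
    """Minimum number of substitutions, insertions and deletions
    transforming x into y (Levenshtein distance)."""
    m, n = len(x), len(y)
    # d[i][j] = edit distance between prefixes x[:i] and y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(edit_distance("excused", "exhausted"))  # 3: one substitution, two insertions
```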
Classification using ED
• The nearest-neighbor algorithm can be applied for pattern recognition
  – Training: strings are stored together with their class labels
  – Classification (testing): a test string is compared to each stored string and an ED is computed; the nearest stored string's label is assigned to the test string
• The key is how to calculate ED
• An example of calculating ED
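A minimal sketch of this nearest-neighbor classifier; the labelled training strings below are hypothetical toy data:

```python
from functools import lru_cache

def edit_distance(x, y):
    """Levenshtein distance via memoized recursion."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if x[i - 1] == y[j - 1] else 1
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    return d(len(x), len(y))

def classify(test, training):
    """1-NN: return the label of the stored string nearest to test in ED."""
    return min(training, key=lambda pair: edit_distance(test, pair[0]))[1]

# Hypothetical training set of labelled strings.
training = [("AGCT", "geneA"), ("TTGA", "geneB")]
print(classify("AGTT", training))  # geneA (ED 1 to AGCT vs ED 4 to TTGA)
```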
Hidden Markov Model
• Markov Model: transitional states
• Hidden Markov Model: additional visible states
• Evaluation
• Decoding
• Learning
Markov Model
• The Markov property:
  – given the current state, the transition probability is independent of any previous states
• A simple Markov Model
  – State ω(t) at time t
  – Sequence of length T:
    • ω^T = {ω(1), ω(2), …, ω(T)}
  – Transition probability
    • P(ωj(t+1) | ωi(t)) = aij
  – It is not required that aij = aji
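A toy sketch of generating a state sequence from such a model; the 3×3 transition matrix below is hypothetical (note it is not symmetric, since aij need not equal aji):

```python
import random

# a[i][j] = P(state j at t+1 | state i at t); each row sums to 1.
a = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]

def sample_sequence(a, start, T, seed=0):
    """Generate a state sequence omega(1..T) from a Markov model."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(T - 1):
        # Markov property: the next state depends only on the current one.
        nxt = rng.choices(range(len(a)), weights=a[states[-1]])[0]
        states.append(nxt)
    return states

print(sample_sequence(a, start=0, T=10))
```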
Hidden Markov Model
• Visible states
  – V^T = {v(1), v(2), …, v(T)}
• Emitting a visible state vk(t)
  – P(vk(t) | ωj(t)) = bjk
• Only the visible states vk(t) are accessible; the states ωi(t) are unobservable
• A Markov model is ergodic if every state has a nonzero probability of occurring given some starting state
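As a toy illustration (all probabilities hypothetical), an HMM generator that exposes only the visible symbols while the hidden state path stays internal:

```python
import random

# Hypothetical HMM: 2 hidden states, 3 visible symbols.
a = [[0.8, 0.2],         # a[i][j] = P(omega_j at t+1 | omega_i at t)
     [0.4, 0.6]]
b = [[0.5, 0.4, 0.1],    # b[j][k] = P(v_k | omega_j)
     [0.1, 0.3, 0.6]]

def generate(a, b, start, T, seed=0):
    """Return only the emitted visible sequence; the hidden state path
    is used internally but never returned (it is unobservable)."""
    rng = random.Random(seed)
    state, visible = start, []
    for _ in range(T):
        visible.append(rng.choices(range(len(b[0])), weights=b[state])[0])
        state = rng.choices(range(len(a)), weights=a[state])[0]
    return visible

print(generate(a, b, start=0, T=8))
```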
Three Key Issues with HMM
• Evaluation
  – Given an HMM, complete with transition probabilities aij and emission probabilities bjk, determine the probability that a particular sequence of visible states V^T was generated by that model
• Decoding
  – Given an HMM and a set of observations V^T, determine the most likely sequence of hidden states ω^T that led to V^T
• Learning
  – Given the number of hidden and visible states and a set of training observations of visible symbols, determine the probabilities aij and bjk
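The evaluation problem can be sketched with the standard forward algorithm; the model values below are hypothetical toy numbers:

```python
def forward(a, b, pi, obs):
    """Evaluation: P(V^T | model) via the forward algorithm.

    a[i][j]  transition probabilities, b[j][k] emission probabilities,
    pi[i]    initial state distribution, obs a list of symbol indices.
    """
    n = len(a)
    # alpha[i] = P(observations so far, state i at the current time step)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    for k in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[j][k]
                 for j in range(n)]
    return sum(alpha)

a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
print(forward(a, b, pi, [0, 1, 0]))
```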
Other Sequential Pattern Mining Problems
• Sequence alignment (homology) and sequence assembly (genome sequencing)
• Trend analysis
  – Trend movement vs. cyclic variations, seasonal variations and random fluctuations
• Sequential pattern mining
  – Various kinds of sequences (weblogs)
  – Various methods: from GSP to PrefixSpan
• Periodicity analysis
  – Full periodicity, partial periodicity, cyclic association rules
Periodic Pattern
• Full periodic pattern
  – ABC ABC ABC
• Partial periodic pattern
  – ABC ADC ACC ABC
• Pattern hierarchy
  – ABC ABC ABC DE DE DE DE  ABC ABC ABC DE DE DE DE  ABC ABC ABC DE DE DE DE
  – A sequence of transactions matching the meta-pattern [ABC:3 | DE:4]
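The full and partial periodicity notions above can be checked mechanically; a small sketch (function names and the '*' wildcard convention are ours):

```python
def full_period(seq, p):
    """Full periodicity: every element repeats with period p."""
    return all(seq[i] == seq[i % p] for i in range(len(seq)))

def partial_period(seq, p, pattern):
    """Partial periodicity: only the non-'*' positions of the length-p
    pattern must match in every period; '*' positions may vary."""
    return all(pattern[i % p] in ('*', seq[i]) for i in range(len(seq)))

print(full_period("ABCABCABC", 3))            # True
print(partial_period("ABCADCACC", 3, "A*C"))  # True: A and C recur, middle varies
```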
Sequence Association Rule Mining
• SPADE (Sequential Pattern Discovery using Equivalence classes)
• Constrained sequence mining (SPIRIT)
Bibliography
• R.O. Duda, P.E. Hart, and D.G. Stork, 2001. Pattern Classification. 2nd Edition. Wiley Interscience.
[Figure: a three-state Markov model; arrows between states 1, 2 and 3 labeled with transition probabilities a11, a12, a13, a21, a22, a23, a31, a32, a33]
[Figure: the same three-state model extended to an HMM; each hidden state emits visible symbols v1–v4 with emission probabilities bjk (b11…b14, b21…b24, b31…b34)]
[Figure: trellis for HMM evaluation, unrolled over time t = 1 … T with hidden states 1 … c; the value at state 2 and time 2 is computed from a12, a22, a32, …, ac2 and the emission probability b2k of the observed symbol vk]
[Figure: worked numerical example of HMM evaluation for the observed sequence v3 v1 v3 v2 v0 over t = 0 … 4, with the forward probabilities filled in step by step]
[Figure: left-to-right HMM for the spoken word "Viterbi" with states 0–7 for the phonemes /v/ /i/ /t/ /e/ /r/ /b/ /i/ and silence /-/]
[Figure: Viterbi decoding trellis over t = 1 … T with states 0 … c; at each time step only the locally best path into each state is kept, yielding ωmax(1) … ωmax(T)]
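The decoding step (finding the most likely hidden-state path) can be sketched with the Viterbi algorithm; the model values below are hypothetical toy numbers:

```python
def viterbi(a, b, pi, obs):
    """Decoding: most likely hidden-state path for an observed
    symbol sequence, via the Viterbi algorithm."""
    n = len(a)
    # delta[i] = probability of the best path ending in state i so far
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    back = []                      # back-pointers, one list per time step
    for k in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: delta[i] * a[i][j])
            step.append(i_best)
            new_delta.append(delta[i_best] * a[i_best][j] * b[j][k])
        back.append(step)
        delta = new_delta
    # Backtrack from the best final state.
    path = [max(range(n), key=lambda i: delta[i])]
    for step in reversed(back):
        path.append(step[path[-1]])
    return path[::-1]

a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
print(viterbi(a, b, pi, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```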
[Figure: worked numerical example of Viterbi decoding for the observed sequence v3 v1 v3 v2 v0 over t = 0 … 4, with the best-path probabilities filled in step by step]