Sequence Data Mining:
Techniques and Applications
Sunita Sarawagi
IIT Bombay, http://www.it.iitb.ac.in/~sunita
Mark Craven
University of Wisconsin, http://www.biostat.wisc.edu/~craven
What is a sequence?
Ordered set of elements: s = a1, a2, ..., an
Each element ai could be:
Numerical
Categorical: domain Σ, a finite set of symbols, |Σ| = m
Multiple attributes
The length n of a sequence is not fixed
Order is determined by time or position and could be regular or irregular
Real-life sequences
Classical applications:
Speech: sequence of phonemes
Language: sequence of words and delimiters
Handwriting: sequence of strokes
Newer applications:
Bioinformatics:
Genes: sequence of 4 possible nucleotides, |Σ| = 4; example: AACTGACCTGGGCCCAATCC
Proteins: sequence of 20 possible amino acids, |Σ| = 20
Telecommunications: alarms, data packets
Retail data mining: customer behavior, sessions in an e-store (e.g., Amazon)
Intrusion detection
Intrusion detection
Intrusions could be detected at:
Network level (denial-of-service attacks, port scans, etc.) [KDD Cup 99]: sequence of TCP dumps
Host level (attacks on privileged programs like lpr, sendmail): sequence of system calls, where Σ = the set of all possible system calls, |Σ| ≈ 100
Example trace: open lseek lstat mmap execve ioctl ioctl close execve close unlink
Outline
Traditional mining operations on sequences:
Classification
Clustering
Finding repeated patterns
Primitives for handling sequence data
Sequence-specific mining operations:
Partial sequence classification (tagging)
Segmenting a sequence
Predicting the next symbol of a sequence
Applications in biological sequence mining
Classification of whole sequences
Given: a set of classes C and a number of example sequences in each class, train a model so that for an unseen sequence we can say to which class it belongs
Examples:
Given a set of protein families, find the family of a new protein
Given a sequence of packets, label the session as intrusion or normal
Given several utterances of a set of words, classify a new utterance to the right word
Conventional classification methods
Conventional classification methods assume record data:
fixed number of attributes
single record per instance to be classified (no order)
Sequences: variable length, order important
Adapting conventional classifiers to sequences:
Generative classifiers
Boundary-based classifiers
Distance-based classifiers
Kernel-based classifiers
Generative methods
For each class i, train a generative model Mi to maximize the likelihood over all training sequences in class i
Estimate Pr(ci) as the fraction of training instances in class i
For a new sequence x:
find Pr(x|ci)*Pr(ci) for each i using Mi
choose the i with the largest value of Pr(x|ci)*Pr(ci)
(Diagram: x is scored against each class model, yielding Pr(x|c1)*Pr(c1), Pr(x|c2)*Pr(c2), Pr(x|c3)*Pr(c3))
Need a generative model for sequence data
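A minimal sketch of this decision rule, assuming each trained model exposes a hypothetical log_likelihood method:

import math

def train_generative(sequences_by_class, fit_model):
    # sequences_by_class: dict class -> list of training sequences
    # fit_model: fits a generative model (e.g. a Markov chain) to a list
    total = sum(len(seqs) for seqs in sequences_by_class.values())
    models = {c: fit_model(seqs) for c, seqs in sequences_by_class.items()}
    priors = {c: len(seqs) / total for c, seqs in sequences_by_class.items()}
    return models, priors

def classify(x, models, priors):
    # choose the i with the largest log Pr(x|ci) + log Pr(ci)
    return max(models,
               key=lambda c: models[c].log_likelihood(x) + math.log(priors[c]))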
Boundary-based methods
Data: points in a fixed multi-dimensional space
Output of training: boundaries that define regions within which the same class is predicted
Application: tests on the boundaries to find the region of a new point
Examples: decision trees, neural networks, linear discriminants
Need to embed sequence data in a fixed coordinate space
Kernel-based classifiers
Define a function K(xi, x) that intuitively defines the similarity between two sequences and satisfies two properties:
K is symmetric, i.e., K(xi, x) = K(x, xi)
K is positive definite
Training: for each class c, learn a weight wic for each training sequence xi and a bias bc
Application: to predict the class of x, compute f(x, c) = Σi wic K(xi, x) + bc for each class c; the predicted class is the c with the highest value of f(x, c)
Well-known kernel classifiers:
Nearest-neighbor classifier
Support Vector Machines
Radial basis functions
Need kernel/similarity function
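A minimal sketch of this scoring rule, paired with a simple k-gram spectrum kernel as one possible choice of K (the parameter layout here is an assumption, not a fixed API):

from collections import Counter

def spectrum_kernel(a, b, k=3):
    # similarity = number of co-occurring k-grams, weighted by counts
    ca = Counter(tuple(a[i:i+k]) for i in range(len(a) - k + 1))
    cb = Counter(tuple(b[i:i+k]) for i in range(len(b) - k + 1))
    return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())

def kernel_score(x, support, weights, bias, K):
    # f(x, c) = sum_i w_ic * K(x_i, x) + b_c for one class c
    return sum(w * K(xi, x) for xi, w in zip(support, weights)) + bias

def predict(x, per_class_params, K=spectrum_kernel):
    # per_class_params: dict c -> (support sequences, weights, bias)
    return max(per_class_params,
               key=lambda c: kernel_score(x, *per_class_params[c], K))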
Sequence clustering
Given a set of sequences, create groups such that similar sequences are in the same group
Three kinds of clustering algorithms:
Distance-based (K-means, various hierarchical algorithms): need a similarity function
Model-based (Expectation Maximization algorithm): need generative models
Density-based (CLIQUE): need a dimensional embedding
Outline
Traditional mining on sequences
Primitives for handling sequence data:
Embed sequences in a fixed-dimensional space: all conventional record mining techniques will apply
Generative models for sequences: sequence classification (generative methods); clustering sequences (model-based approach)
Distance between two sequences: sequence classification (SVM and NN); clustering sequences (distance-based approach)
Sequence-specific mining operations
Applications
Embedding sequences in a fixed-dimensional space
Ignore order: each symbol maps to a dimension; extensively used in text classification and clustering
Extract aggregate features:
Real-valued elements: Fourier coefficients, wavelet coefficients, auto-regressive coefficients
Categorical data: number of symbol changes
Sliding-window techniques (k: window size):
Define a coordinate for each possible k-gram w (m^k coordinates); the w-th coordinate is the number of times w occurs in the sequence
(k,b) mismatch score: the w-th coordinate is the number of k-grams in the sequence within b mismatches of w
Define a coordinate for each of the k positions, giving multiple rows per sequence
Sliding window examples

Example trace: open lseek ioctl mmap execve ioctl ioctl open execve close mmap

One symbol per column:
sid  o  c  l  i  e  m
1    2  1  1  3  2  1
2    .. .. .. .. .. ..
3    .. .. .. .. .. ..

Sliding window, window size 3 (one row per trace):
sid  ioe  cli  oli  lie  lim  ...
1    1    0    1    0    1    ...
2    ..   ..   ..   ..   ..   ..
3    ..   ..   ..   ..   ..   ..

One coordinate per window position (multiple rows per trace):
sid  A1  A2  A3
1    o   l   i
1    l   i   m
1    i   m   e
1    ..  ..  ..
1    e   c   m

Mismatch scores, b = 1:
sid  ioe  cli  oli  lie  lim  ...
1    2    1    1    0    1    ...
2    ..   ..   ..   ..   ..   ..
3    ..   ..   ..   ..   ..   ..
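A minimal sketch of the k-gram embedding illustrated above (the mismatch variant is omitted for brevity):

from collections import Counter
from itertools import product

def kgram_embedding(trace, alphabet, k=3):
    # one coordinate per possible k-gram (m^k coordinates), valued by count
    counts = Counter(tuple(trace[i:i+k]) for i in range(len(trace) - k + 1))
    return [counts[g] for g in product(alphabet, repeat=k)]

calls = ["open", "lseek", "ioctl", "mmap", "execve", "ioctl",
         "ioctl", "open", "execve", "close", "mmap"]
vector = kgram_embedding(calls, sorted(set(calls)), k=3)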
Detecting attacks on privileged programs
Short sequences of system calls made during normal executions of a program are very consistent, yet different from the sequences of its abnormal executions
Two approaches:
STIDE (Warrender 1999): create a dictionary of the unique k-windows in normal traces, count what fraction of the windows in a new trace occur in it, and threshold (a sketch follows below)
RIPPER-based (Lee 1998): next...
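A minimal sketch of the STIDE idea, under the assumption that a trace is scored by the fraction of its k-windows never seen in normal data:

def stide_train(normal_traces, k=6):
    # dictionary of all unique k-windows occurring in normal traces
    return {tuple(t[i:i+k]) for t in normal_traces for i in range(len(t) - k + 1)}

def stide_score(trace, dictionary, k=6):
    # fraction of the trace's k-windows that are absent from the dictionary
    windows = [tuple(trace[i:i+k]) for i in range(len(trace) - k + 1)]
    if not windows:
        return 0.0
    return sum(1 for w in windows if w not in dictionary) / len(windows)

# flag the trace as abnormal if the score exceeds a tuned threshold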
Classification models on k-gram trace data
When both normal and abnormal data are available, the class label is normal/abnormal:

7-grams -> class
vtimes open seek read read read seek -> normal
lstat lstat lstat bind open close vtimes -> abnormal

When only normal traces are available, the class label is the k-th system call:

6 attributes -> class
vtimes open seek read read read -> seek
lstat lstat lstat bind open close -> vtimes

Learn rules to predict the class label [RIPPER]
Examples of output RIPPER rules
Both traces:
if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is normal
if the 6th system call is lseek and the 7th is sigvec, then the sequence is normal
if none of the above, then the sequence is abnormal
Only normal:
if the 3rd system call is lstat and the 4th is write, then the 7th is stat
if the 1st system call is sigblock and the 4th is bind, then the 7th is setsockopt
if none of the above, then the 7th is open
Experimental results on sendmail traces (percent of mismatching traces)

Trace             Only-normal  BOTH
sscp-1            13.5         32.2
sscp-2            13.6         30.4
sscp-3            13.6         30.4
syslog-remote-1   11.5         21.2
syslog-remote-2    8.4         15.6
syslog-local-1     6.1         11.1
syslog-local-2     8.0         15.9
decode-1           3.9          2.1
decode-2           4.2          2.0
sm565a             8.1          8.0
sm5x               8.2          6.5
sendmail           0.6          0.1

The output rule sets contain ~250 rules, each with 2 or 3 attribute tests
Score each trace by counting the fraction of mismatches and thresholding
Summary: only normal traces are sufficient to detect intrusions
More realistic experiments
Different programs need different thresholds
Simple methods (e.g., STIDE) work as well
Results are sensitive to window size
Is it possible to do better with sequence-specific methods?

            STIDE                   RIPPER
            threshold  %false-pos   threshold  %false-pos
Site-1 lpr  12         0.0          3          0.0016
Site-2 lpr  12         0.0013       4          0.0265
named       20         0.0019       10         0.0
xlock       20         0.00008      10         0.0

[from Warrender 99]
Outline
Traditional mining on sequences
Primitives for handling sequence data:
Embed sequences in a fixed-dimensional space: all conventional record mining techniques will apply
Distance between two sequences: sequence classification (SVM and NN); clustering sequences (distance-based approach)
Generative models for sequences: sequence classification (whole and partial); clustering sequences (model-based approach)
Sequence-specific mining operations
Applications in biological sequence mining
Probabilistic models for sequences
Independent model
One-level dependence (Markov chains)
Fixed memory (order-l Markov chains)
Variable memory models
More complicated models: Hidden Markov Models
First Order Markov Chains
Model structure: a state for each symbol in Σ, and edges between states with probabilities
(Diagram: states A and C; start -> A and C with probability 0.5 each; A -> A 0.1, A -> C 0.9, C -> A 0.4, C -> C 0.6)
Probability of a sequence s being generated from the model, for example:
Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|C) = 0.5 * 0.1 * 0.9 * 0.4
Training the transition probability between states: Pr(σ'|σ) = Count(σσ' in T) / Count(σ in T), over the training traces T
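A minimal sketch of training and scoring a first-order chain; the add-one smoothing is my addition to avoid zero probabilities:

from collections import defaultdict
import math

def train_markov_chain(sequences, alphabet):
    start = defaultdict(int)                        # counts of first symbols
    trans = defaultdict(lambda: defaultdict(int))   # bigram counts
    for s in sequences:
        start[s[0]] += 1
        for a, b in zip(s, s[1:]):
            trans[a][b] += 1
    V = len(alphabet)
    pr_start = {a: (start[a] + 1) / (len(sequences) + V) for a in alphabet}
    pr_trans = {a: {b: (trans[a][b] + 1) / (sum(trans[a].values()) + V)
                    for b in alphabet}
                for a in alphabet}
    return pr_start, pr_trans

def log_likelihood(s, pr_start, pr_trans):
    ll = math.log(pr_start[s[0]])
    for a, b in zip(s, s[1:]):
        ll += math.log(pr_trans[a][b])
    return ll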
Higher Order Markov Chains
l = memory of the sequence
Model: a state for each possible suffix of length l, giving |Σ|^l states, with edges between states carrying probabilities
(Diagram, l = 2: states AA, AC, CA, CC; start -> AA with 0.5; edges such as AA -C-> AC with 0.9, AA -A-> AA with 0.1, AC -A-> CA with 0.7, AC -C-> CC with 0.3, ...)
Pr(AACA) = Pr(AA) Pr(C|AA) Pr(A|AC) = 0.5 * 0.9 * 0.7
Training the model: Pr(σ|s) = count(sσ in T) / count(s in T)
Variable Memory models
Probabilistic Suffix Automata (PSA)
Model states: substrings of length no greater than l, where no string is a suffix of another
(Diagram: PSA states such as A, C, AC, CC, with start and edge probabilities consistent with the example below)
Calculating Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|AC) = 0.5 * 0.3 * 0.7 * 0.1
Training: not straightforward; eased by Prediction Suffix Trees
PSTs can be converted to PSA after training
Prediction Suffix Trees (PSTs)
Suffix trees with emission probabilities of the next observation attached to each tree node
Linear-time algorithms exist for constructing such PSTs from training data [Apostolico 2000]
(Diagram: a tree with root e (the empty context), children A and C, and deeper nodes AC and CC; each node carries the pair (Pr(A), Pr(C)), e.g. (0.28, 0.72) at the root, (0.3, 0.7) at node A, and (0.1, 0.9) at node AC)
Pr(AACA) = 0.28 * 0.3 * 0.7 * 0.1, using the deepest matching context for each symbol
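A minimal sketch of PST scoring via longest-matching-context lookup, with the tree represented as a plain dict (an assumption, not the paper's data structure):

def pst_prob(seq, tree, max_depth=2):
    # tree maps a context string ('' for the root) to next-symbol probabilities
    p = 1.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - max_depth):i]
        while ctx not in tree:      # back off to shorter suffixes
            ctx = ctx[1:]
        p *= tree[ctx][sym]
    return p

tree = {"":   {"A": 0.28, "C": 0.72},
        "A":  {"A": 0.3,  "C": 0.7},
        "AC": {"A": 0.1,  "C": 0.9}}
# pst_prob("AACA", tree) reproduces 0.28 * 0.3 * 0.7 * 0.1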
Hidden Markov Models
Doubly stochastic models
(Diagram: four states S1-S4 with transition probabilities 0.9, 0.5, 0.5, 0.8, 0.2, 0.1; each state carries an emission distribution over {A, C}, e.g. (0.6, 0.4), (0.3, 0.7), (0.5, 0.5), (0.9, 0.1))
Efficient dynamic programming algorithms exist for:
Finding Pr(S)
The highest-probability path P, i.e. the one maximizing Pr(S, P) (Viterbi)
Training the model (Baum-Welch algorithm)
HMMs for profiling system calls
Training:
Initial number of states = 40 (roughly the number of distinct system calls)
Train on normal traces
Testing: need to handle variable-length and online data
For each call, find the total probability of outputting it given all the calls before it
If this probability is below a threshold, flag the call as abnormal
A trace is abnormal if the fraction of abnormal calls is high
ROC curves comparing different methods
[from Warrender 99]
Outline
Traditional mining on sequences
Primitives for handling sequence data
Sequence-specific mining operations
Partial sequence classification (Tagging)
Segmenting a sequence
Predicting next symbol of a sequence
Applications in biological sequence mining
Partial sequence classification (Tagging)
The tagging problem:
Given:
A set of tags L
Training examples of sequences showing the breakup of the sequence into the set of tags
Learn to break up a sequence into tags (classification of parts of sequences)
Examples:
Text segmentation: break a sequence of words forming an address string into subparts like road, city name, etc.
Continuous speech recognition: identify words in continuous speech
Text sequence segmentation
Example: addresses, bibliographic records

4089 Whispering Pines Nobel Drive San Diego CA 92122
(tagged as house number | building | road | city | state | zip)

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
(tagged as author | year | title | journal | volume | page)
Approaches used for tagging
Learn to identify the start and end boundaries of each label
Method: for each label, build two classifiers accepting its two boundaries; any conventional classifier would do (rule-based is most common)
K-windows approach: for each label, train a classifier on k-windows; during testing, classify each k-window (see the sketch after this list)
Adapt state-based generative models like HMMs
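A minimal sketch of the k-windows approach, assuming a hypothetical per-label window classifier with a boolean predict method:

def tag_with_windows(seq, classifiers, k=3):
    # classifiers: dict label -> model whose predict(window) returns True/False
    tags = []
    for i in range(len(seq) - k + 1):
        window = tuple(seq[i:i+k])
        hits = [lab for lab, clf in classifiers.items() if clf.predict(window)]
        tags.append((i, hits))     # window start position and matching labels
    return tags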
State-based models for sequence tagging
Two approaches:
Separate model per tag, with special prefix and suffix states to capture the start and end of a tag
(Diagram: two four-state models, one for 'Road name' and one for 'Building name', each with its own prefix and suffix states)
Combined model over all tags
Example (IE):
Mahatma Gandhi Road Near Parkland ...
[Mahatma Gandhi Road Near: Landmark] Parkland ...
Naive model: one state per element
Nested model: each element is modeled by another HMM
Other approaches
Disadvantages of generative models (HMMs):
Maximizing the joint probability of the sequence and labels may not maximize tagging accuracy
Conditional independence of features is a restrictive assumption
Alternative: Conditional Random Fields
Maximize the conditional probability of labels given the sequence
Arbitrary overlapping features are allowed
Outline
Traditional mining on sequences
Primitives for handling sequence data
Sequence-specific mining operations
Partial sequence classification (Tagging)
Segmenting a sequence
Predicting next symbol of a sequence
Applications in biological sequence mining
Simpler problem: segmenting a 0/1 sequence
Players A and B
A has a set of coins with different biases
A repeatedly:
picks an arbitrary coin
tosses it an arbitrary number of times
B observes the H/T sequence, e.g. 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1, and guesses the transition points and biases
(Diagram: A picks, tosses, and returns coins; B sees only the 0/1 outcomes)
An MDL-based approach
Given n head/tail observations:
Can assume n different coins with bias 0 or 1: the data fits perfectly (with probability one), but many coins are needed
Or assume one coin: may fit the data poorly
The best segmentation is a compromise between model length and data fit, e.g.:
0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1  (estimated biases 1/4, 5/7, 1/3)
MDL
For each segment i:
L(Mi): cost of the model parameters (encoding Pr(head)) plus the segment boundary (log(sequence length))
L(Di|Mi): cost of describing the data in segment Si given model Mi: -log( h^H (1-h)^T ), where h = Pr(head), H = #heads, T = #tails
Goal: find the segmentation that leads to the smallest total cost, Σ over segments i of [ L(Mi) + L(Di|Mi) ]
How to find optimal segments
Sequence of 17 tosses: 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
Build a derived graph with 18 nodes (one per boundary position) and all possible edges; the cost of edge (i, j) = model cost + data cost of treating positions i..j as one segment
The optimal segmentation corresponds to the shortest path through this graph
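A minimal sketch of this construction, assuming the two-part cost above; dynamic programming over prefix positions plays the role of the shortest-path computation:

import math

def segment_cost(seq, i, j, n):
    # MDL cost of making seq[i:j] one segment
    H = sum(seq[i:j])
    T = (j - i) - H
    h = H / (j - i)
    data = 0.0
    if 0 < h < 1:   # -log likelihood of the segment under bias h
        data = -(H * math.log(h) + T * math.log(1 - h))
    return math.log(n) + data     # boundary cost + data cost

def best_segmentation(seq):
    n = len(seq)
    best = [0.0] + [math.inf] * n   # best[j]: min cost of segmenting seq[:j]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + segment_cost(seq, i, j, n)
            if c < best[j]:
                best[j], back[j] = c, i
    cuts, j = [], n
    while j > 0:                    # recover the segment boundaries
        cuts.append(j)
        j = back[j]
    return sorted(cuts)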
Non-independent models
In the previous method each segment is assumed to be independent of the others; this does not allow model reuse of the form:
0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0
The (k,h) segmentation problem:
Assume a fixed number h of models is to be used for segmenting an n-element sequence into k parts (k > h)
(k,k) segmentation is solvable by dynamic programming
The general (k,h) problem is NP-hard
Approximations: for (k,h) segmentation
1. Solve (k,k) to get k segments
2. Solve (n,h) to get h models (example: 2-medians)
3. Assign each segment to the best of the h models
A second variant (replace step 2 above): find summary statistics for each of the k segments and cluster them into h clusters
Example: A C T G G T T T A C C C C T G T G split into segments S1..S4, each assigned to one of two models M1, M2
Results of segmenting genetic sequences
(Figure from Gionis & Mannila 2003)
Two sequence mining applications
image from the DOE Human Genome Program
http://www.ornl.gov/hgmis
The roles of proteins
Figure from the DOE Human Genome Program
http://www.ornl.gov/hgmis
The roles of proteins
The amino-acid sequence of a protein determines its structure
The structure of a protein determines its function
Proteins play many key roles in cells:
structural support
storage of amino acids
transport of other substances
coordination of an organism's activities
response of cells to chemical stimuli
movement
protection against disease
selective acceleration of chemical reactions
Protein taxonomies
The SCOP and CATH databases provide hierarchical taxonomies of protein families
An alignment of globin family proteins
Figure from www-cryst.bioc.cam.ac.uk/~max/res_globin.html
The sequences in a family may vary in length
Some positions are more conserved than others
Profile HMMs
(Diagram: a profile HMM with match states m1-m3, insert states i0-i3, and delete states d1-d3 between start and end states)
Match states represent highly conserved positions
Insert states account for extra characters in some sequences
Delete states are silent; they account for characters missing in some sequences
Insert and match states have emission distributions over sequence characters, e.g. A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01, ...
Profile HMMs are commonly used to model families of sequences
Profile HMMs
To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest
Given a sequence x, use Bayes' rule to make the classification:

Pr(ci|x) = Pr(x|ci) Pr(ci) / Σj Pr(x|cj) Pr(cj)

(Diagram: x is scored by the profile HMMs of families such as β-galactosidase, β-glucanase, α-amylase, and β-amylase)
How Likely is a Given Sequence?
The joint probability of a sequence x1...xL and a state path π0...πN is

Pr(x1...xL, π0...πN) = a_{0,π1} × Π_{i=1..L} [ e_{πi}(xi) × a_{πi,πi+1} ]

where the a's are transition probabilities and the e's are emission probabilities; this is the quantity underlying Pr(x|ci)
How Likely Is A Given Sequence?
(Diagram: an HMM with begin state 0, emitting states 1-4, and end state 5; state 1 emits A 0.4, C 0.1, G 0.2, T 0.3; state 3 emits A 0.2, C 0.3, G 0.3, T 0.2; transitions include a_01 = 0.5, a_11 = 0.2, a_13 = 0.8, a_35 = 0.6)

Pr(AAC, π = 1,1,3) = a_01 e_1(A) a_11 e_1(A) a_13 e_3(C) a_35
                   = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6
How Likely is a Given Sequence?
the probability over all paths is:

Pr(x1...xL) = Σ over paths π of Pr(x1...xL, π0...πN)
but the number of paths can be exponential in the length of the sequence...
the Forward algorithm enables us to compute this efficiently using dynamic programming
How Likely is a Given Sequence: The Forward Algorithm
Define f_k(i) to be the probability of being in state k having observed the first i characters of x
We want to compute f_N(L), the probability of being in the end state having observed all of x
We can define this recursively
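A minimal sketch of the forward recursion for a fully emitting model (begin state indexed 0; internal silent states omitted for brevity):

def forward(x, states, a, e, a_end):
    # a[k][l]: transition probabilities, e[k][sym]: emission probabilities,
    # a_end[k]: transition into the end state; returns Pr(x) over all paths
    f = {k: a[0][k] * e[k][x[0]] for k in states}        # initialization
    for sym in x[1:]:
        f = {l: e[l][sym] * sum(f[k] * a[k][l] for k in states)
             for l in states}                            # recursion
    return sum(f[k] * a_end[k] for k in states)          # termination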
Training a profile HMM
The parameters of profile HMMs are typically trained using the Baum-Welch method (an EM algorithm)
Heuristic methods are used to determine the length of the model:
Initialize using an alignment of the training sequences (as in the globin example)
Iteratively make local adjustments to the length if delete or insert states are used too much by the training sequences
The Fisher kernel method for protein classification
Standard HMMs are generative models: training involves estimating Pr(x|ci)
Predictions based on Pr(x|ci) are made using Bayes' rule
Sometimes we can get more accurate predictions using discriminative methods, which try to optimize Pr(ci|x) directly
One example: the Fisher kernel method [Jaakkola et al. 00]
The Fisher kernel method for protein classification
Consider learning to discriminate proteins in class c1 from proteins in other families:
1. Train an HMM for c1
2. Use the HMM to map each protein sequence x into a fixed-length vector
3. Train a support vector machine (SVM) whose kernel function is based on the Euclidean distance between these vectors
The resulting discriminative model is given by:

D(x) = Σ_{i: xi in c1} λi K(x, xi) − Σ_{i: xi not in c1} λi K(x, xi)
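A minimal sketch of evaluating this decision function, with the λ weights assumed to come from SVM training:

def fisher_decision(x, positives, negatives, lam_pos, lam_neg, K):
    # D(x) = sum over positive support vectors - sum over negative ones
    pos = sum(l * K(x, xi) for xi, l in zip(positives, lam_pos))
    neg = sum(l * K(x, xi) for xi, l in zip(negatives, lam_neg))
    return pos - neg     # predict class c1 when D(x) > 0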
Profile HMM accuracy
(Figure from Jaakkola et al., ISMB 1999: classifying 2447 proteins into 33 families; the x-axis represents the median fraction of negative sequences that score as high as a positive sequence for a given family's model; the curves compare BLAST-based methods, profile HMMs, and profile HMMs with the Fisher kernel SVM)
The gene finding task
Given: an uncharacterized DNA sequence
Do: locate the genes in the sequence, including the coordinates of individual exons and introns
image from the UCSC Genome Browser
http://genome.ucsc.edu/
image from the DOE Human Genome Program
http://www.ornl.gov/hgmis
The structure of genes
Genes consist of alternating sequences of exons and introns
Introns are spliced out before the gene is translated into protein
(Diagram: intergenic region | exon | intron | exon | intron | exon | intergenic region, over a DNA sequence beginning ATG GAA ACC CGA TCG GGC AC ... and ending ... G TAA AGT CTA)
Exons consist of three-letter words, called codons
Each codon encodes a single amino acid (a character in a protein sequence)
The GENSCAN HMM for gene finding [Burge & Karlin 97]
(Diagram: each shape represents a functional unit of a gene or genomic region
Pairs of intron/exon units represent the different ways an intron can interrupt a coding sequence: after the 1st base in a codon, after the 2nd base, or after the 3rd base
A complementary submodel, not shown, detects genes on the opposite DNA strand)
The GENSCAN HMM
For each sequence type, GENSCAN models:
the length distribution
the sequence composition
Length distribution models vary depending on the sequence type:
Nonparametric (using histograms)
Parametric (using geometric distributions)
Fixed-length
Sequence composition models vary depending on the type:
5th-order inhomogeneous
5th-order homogeneous
Independent and 1st-order inhomogeneous
Tree-structured variable memory
Representing exons in GENSCAN
For exons, GENSCAN uses:
Histograms to represent exon lengths
5th-order, inhomogeneous Markov models to represent exon sequences
5th-order, inhomogeneous models can represent statistics about pairs of neighboring codons
A 5th-order Markov model for DNA
GCTAC
AAAAA
TTTTT
CTACG
CTACA
CTACC
CTACT
Pr(A | GCTAC)
start
Pr(GCTAC)
)|Pr()Pr)Pr( GCTACA(GCTACGCTACA=
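A minimal counting-based sketch of estimating such a model from sequences (the pseudocounts are my addition):

from collections import defaultdict

def train_order5(seqs, alphabet="ACGT", pseudo=1.0):
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(5, len(s)):
            counts[s[i-5:i]][s[i]] += 1     # context of 5 bases -> next base
    probs = {}
    for ctx, nxt in counts.items():
        z = sum(nxt.values()) + pseudo * len(alphabet)
        probs[ctx] = {b: (nxt[b] + pseudo) / z for b in alphabet}
    return probs

# after training, probs["GCTAC"]["A"] estimates Pr(A | GCTAC)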
Markov models for exons
For each word we evaluate, we'll want to consider its position with respect to the assumed codon framing; thus we'll want to use an inhomogeneous model to represent genes
Example sequence: G C T A C G G A G C T T C G G A G C
For the word GCTACG, its final G is in the 3rd codon position
For CTACGG, the final G is in the 1st position
For TACGGA, the final A is in the 2nd position
A 5th-order inhomogeneous model
(Diagram: three copies of the 5-mer state space, one per codon position; from GCTAC in position 1, transitions lead to CTACA/CTACC/CTACG/CTACT in position 2, from TACAx states in position 2 to position 3, and transitions from position 3 go back to states in position 1)
Inference with the gene-finding HMM
Given: an uncharacterized DNA sequence
Do: find the most probable path through the model for the sequence
This path will specify the coordinates of the predicted genes (including intron and exon boundaries)
The Viterbi algorithm is used to compute this path
Finding the most probable path: the Viterbi algorithm
Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k
We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state
We can define this recursively and use dynamic programming to find v_N(L) efficiently
Finding the most probable path: the Viterbi algorithm
Initialization:
v_0(0) = 1
v_k(0) = 0 for all states k that are not silent
The Viterbi algorithm
Recursion for emitting states (i = 1...L):
v_l(i) = e_l(xi) max_k [ v_k(i-1) a_kl ]
ptr_l(i) = argmax_k [ v_k(i-1) a_kl ]
Recursion for silent states:
v_l(i) = max_k [ v_k(i) a_kl ]
ptr_l(i) = argmax_k [ v_k(i) a_kl ]
(the ptr values keep track of the most probable path)
The Viterbi algorithm
Termination:
Pr(x, π*) = max_k ( v_k(L) a_kN )
π*_L = argmax_k ( v_k(L) a_kN )
To recover the most probable path, follow the ptr pointers back starting at π*_L
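A minimal sketch of these recursions for a model without internal silent states (begin state indexed 0, end state handled by a_end):

def viterbi(x, states, a, e, a_end):
    # a[k][l]: transition probabilities, e[k][sym]: emission probabilities,
    # a_end[k]: transition into the end state
    v = {k: a[0][k] * e[k][x[0]] for k in states}        # initialization
    ptr = []
    for sym in x[1:]:
        step, choice = {}, {}
        for l in states:
            k_best = max(states, key=lambda k: v[k] * a[k][l])
            choice[l] = k_best
            step[l] = e[l][sym] * v[k_best] * a[k_best][l]
        ptr.append(choice)
        v = step
    last = max(states, key=lambda k: v[k] * a_end[k])    # termination
    prob = v[last] * a_end[last]
    path = [last]
    for choice in reversed(ptr):                         # traceback
        path.append(choice[path[-1]])
    return prob, path[::-1]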
Parsing a DNA Sequence
CGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGAGAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA
The Viterbi path represents a parse of the given sequence, predicting exons, introns, etc.
Some lessons from these biological applications
HMMs provide state-of-the-art performance in protein classification and gene finding
HMMs can naturally classify and parse sequences of variable length
Much domain knowledge can be incorporated into the structure of the models:
Types, lengths, and ordering of sequence features
Appropriate amount of memory to represent various sequence features
Models can vary their representation across sequence features
Discriminative methods often provide predictive accuracy superior to generative methods
References
S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389-3402, 1997.
A. Apostolico and G. Bejerano. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. In Proceedings of RECOMB 2000. http://citeseer.nj.nec.com/apostolico00optimal.html
Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD 2001.
C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78-94, 1997.
Mary Elaine Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI 1999.
S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. VLDB, 1998.
M. Collins. Discriminative training methods for Hidden Markov Models: theory and experiments with perceptron algorithms. EMNLP 2002.
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
Eleazar Eskin, Wenke Lee, and Salvatore J. Stolfo. Modeling system calls for intrusion detection with dynamic window sizes. Proceedings of DISCEX II, June 2001. http://www.cs.columbia.edu/ids/publications/
D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI 2000.
A. Gionis and H. Mannila. Finding recurrent sources in sequences. ACM RECOMB 2003. http://www.cs.helsinki.fi/u/mannila/postscripts/p115-gionis.ps
Michael T. Goodrich. Efficient piecewise-approximation using the uniform metric. Symposium on Computational Geometry, 1994.
D. Haussler. Convolution kernels on discrete structures. Technical report, UC Santa Cruz, 1999.
K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846-856, 1998.
L. Lo Conte, S. Brenner, T. Hubbard, C. Chothia, and A. Murzin. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Research, 30:264-267, 2002.
Wenke Lee and Sal Stolfo. Data mining approaches for intrusion detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.
A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. ICML 2000.
Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel and Kai-Fu Lee (eds.), Readings in Speech Recognition. Los Altos, CA: Morgan Kaufmann, 1990, pages 267-296.
D. Ron, Y. Singer, and N. Tishby. The power of amnesia: learning probabilistic automata with variable memory length. Machine Learning, 25:117-149, 1996.
Christina Warrender, Stephanie Forrest, and Barak Pearlmutter. Detecting intrusions using system calls: alternative data models. 1999 IEEE Symposium on Security and Privacy, 1999.