Stochastic processes and Hidden Markov Models
Page 1: Stochastic processes and Hidden Markov Models

Stochastic processes and Hidden Markov Models

Dr Mauro Delorenzi and

Dr Frédéric Schütz

Swiss Institute of Bioinformatics

EMBnet course – Basel 23.3.2006


Introduction

A mainstream topic in bioinformatics is the problem of sequence annotation: given a sequence of DNA/RNA or protein, we want to identify "interesting" elements.

Examples:
– DNA/RNA: genes, promoters, splicing signals, segmentation of heterogeneous DNA, binding sites, etc.
– Proteins: coiled-coil domains, transmembrane domains, signal peptides, phosphorylation sites, etc.
– Generally: homologs, etc.

"The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster"
– http://www.fruitfly.org/GASP1/tutorial/presentation/

Page 2: Stochastic processes and Hidden Markov Models


Sequence annotation

The sequence of many of these interesting elements can be characterized statistically, so we are interested in modeling them.

By modeling, we mean finding statistical models that can:
– Accurately describe the observed elements of provided sequences;
– Accurately predict the presence of particular elements in new, unannotated, sequences;
– If possible, be readily interpretable and provide some insight into the actual biological process involved (i.e. not a black box).


Example: heterogeneity of DNA sequences

The nucleotide composition of segments of genomic DNA changes between different regions in a single organism.
– Example: coding regions in the human genome tend to be GC-rich.

Modeling the differences between different homogeneous regions is interesting because:
– These differences often have a biological meaning
– Many bioinformatics tools depend on the "background distribution" of nucleotides, often assumed to be constant.

Page 3: Stochastic processes and Hidden Markov Models


Modeling tools (quick review)

Among the different tools used for modeling sequences, we have (sorted by increasing complexity):
– Consensus sequences
– Regular expressions
– Position Specific Scoring Matrices (PSSM), or Weight Matrices
– Markov Models, Hidden Markov Models and other stochastic processes

These tools (in particular the stochastic processes) are also used for bioinformatics problems other than pure sequence analysis.


Consensus sequence

The exact sequence that corresponds to a certain region.

Example: transcription initiation in E. coli
– Transcription is initiated at the promoter; the sequence of the promoter is recognised by the sigma factor of RNA polymerase.
– For the sigma factor σ70, the consensus sequence of the promoter is given by (positions -35 and -10):

    TTGACA … TATAAT

This is very rigid and does not allow for any variation.

It also works well for restriction enzyme sites or, in general, for sites for which strict conservation is important (in the case of restriction sites, cutting of the DNA at a certain site is a question of "life and death" for the DNA).

Page 4: Stochastic processes and Hidden Markov Models


Example: binding site for TF p53

The Transcription Factor Binding Site (TFBS) for p53 has been described as having the consensus sequence

    GGA CATG CCC * GGG CATG TCT

where * represents a spacer of variable length.

In this case, the sequence is not entirely conserved; this is believed to allow the cell some flexibility in the level of response to different signals (which was not possible or desirable for restriction sites).


Example: binding site for TF p53

This flexibility translates into the need for more complicated models to describe the site.

Since the binding site is not entirely conserved, the consensus sequence represents only the nucleotides most frequently observed.

The protein could potentially bind to many other similar, but different, sites along the genome.

In theory, if the sites are not independent, the protein may not even bind to the actual consensus sequence!

Page 5: Stochastic processes and Hidden Markov Models


Patterns/Regular Expressions

Patterns attempt to explain observed motifs by trying to identify the most important combinations of positions and nucleotides/residues of a given site (to be compared with the consensus sequence, where the most important nucleotide/residue at each position was identified).

They are often described using the Regular Expression syntax.

Prosite database (developed at the SIB): http://www.expasy.org/prosite/


Example: Cys-Cys-His-His zinc finger DNA binding domain

Its characteristic motif has regular expression

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

where 'x' means any amino acid, '(2,4)' means between 2 and 4 occurrences, and '[…]' indicates a list of possible amino acids.

Example: 1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX
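For illustration, a Prosite-style pattern like this can be translated into a standard regular expression and tested against the example sequence. A minimal Python sketch (the translation and names below are ours, not part of the Prosite distribution):

    import re

    # Prosite pattern C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H translated into
    # regex syntax: 'x' -> '.', 'x(m,n)' -> '.{m,n}', '[...]' keeps the same meaning.
    ZINC_FINGER = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

    seq = "XYKCGLCERSFVEKSALSRHQRVHKNX"   # the 1ZNF example above
    match = ZINC_FINGER.search(seq)
    if match:
        print("motif found:", match.group(), "at positions", match.span())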

Page 6: Stochastic processes and Hidden Markov Models


Example: TFBS for p53

The TFBS has been described with the pattern

    →←  …  →←

where →← is the palindromic sequence

    (5') Pu-Pu-Pu-C-[AT]-[TA]-G-Py-Py-Py (3')

and "…" is a spacer of 0 to 14 nucleotides.

Note that this pattern (with the palindromic condition) cannot be expressed using a regular expression (at least not in a simple or general way).

J. Hoh et al. “The p53MH algorithm and its application in detecting p53-responsive genes”. PNAS, 99(13), June 2002, 8467-8472.


Example: TFBS for p53

The pattern approach clearly allows more flexibility than the consensus sequence; however, it is still too rigid, especially for sites that are not well conserved.

When applying the pattern, each possible amino acid or nucleotide at a given position has the same weight, i.e. it is not possible to specify whether one is more likely to appear than another.

Page 7: Stochastic processes and Hidden Markov Models


Position-Specific Scoring Matrices

“Stochastic consensus sequence”

Indicates the relative importance of a given nucleotide or amino acid at a certain position.

Usually built from an alignment of sequences corresponding to the domain we are interested in, and either a collection of sequences known not to contain the domain, or (most often), background probabilities for the different nucleotides or amino acids.


Building a PSSM (positions 1-6)

Counts from 242 known sites:

    Pos:    1     2     3     4     5     6
    A       9   214    63   142   118     8
    C      22     7    26    31    52    13
    G      18     2    29    38    29     5
    T     193    19   124    31    43   216

Relative frequencies fbl:

    Pos:    1     2     3     4     5     6
    A    0.04  0.88  0.26  0.59  0.49  0.03
    C    0.09  0.03  0.11  0.13  0.22  0.05
    G    0.07  0.01  0.12  0.16  0.12  0.02
    T    0.80  0.08  0.51  0.13  0.18  0.89

PSSM: log(fbl/pb), where pb are the background probabilities:

    Pos:     1      2      3      4      5      6
    A    -2.76   1.82   0.06   1.23   0.96  -2.92
    C    -1.46  -3.11  -1.22  -1.00  -0.22  -2.21
    G    -1.76  -5.00  -1.06  -0.67  -1.06  -3.58
    T     1.67  -1.66   1.04  -1.00  -0.49   1.84
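A minimal sketch of this construction in Python/numpy. A base-2 logarithm and a uniform background probability of 0.25 per nucleotide are assumed here; with these choices the numbers come out close to the PSSM shown above (small differences may be due to rounding or pseudocounts in the original):

    import numpy as np

    # Counts from the 242 known sites (rows A, C, G, T; columns = positions 1-6)
    counts = np.array([
        [  9, 214,  63, 142, 118,   8],   # A
        [ 22,   7,  26,  31,  52,  13],   # C
        [ 18,   2,  29,  38,  29,   5],   # G
        [193,  19, 124,  31,  43, 216],   # T
    ], dtype=float)

    background = np.full(4, 0.25)                 # background probabilities pb (assumed uniform)

    freqs = counts / counts.sum(axis=0)           # relative frequencies fbl
    pssm = np.log2(freqs / background[:, None])   # log(fbl / pb)

    print(np.round(pssm, 2))                      # e.g. about -2.75 for A at position 1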

Page 8: Stochastic processes and Hidden Markov Models


Scoring a sequence using a PSSM

Sequence: C T A T A A T C

PSSM (integer-scaled scores), positions 1-6:

    A   -38   19    1   12   10  -48
    C   -15  -38   -8  -10   -3  -32
    G   -13  -48   -6   -7  -10  -48
    T    17  -32    8   -9   -6   19

Move the matrix along the sequence and score each "window" (sum the scores of the observed nucleotides at each position):
– window CTATAA: sum = -93
– window TATAAT: sum = +85
– window ATAATC: sum = -95

Peaks should occur at the "true" sites.

Of course, in general, any threshold will have some false positive and false negative rate.
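A minimal sketch of this window scanning, using the integer-scaled matrix above (the helper name is ours):

    import numpy as np

    # Integer-scaled PSSM from the slide (rows A, C, G, T; 6 positions)
    pssm = np.array([
        [-38,  19,   1,  12,  10, -48],   # A
        [-15, -38,  -8, -10,  -3, -32],   # C
        [-13, -48,  -6,  -7, -10, -48],   # G
        [ 17, -32,   8,  -9,  -6,  19],   # T
    ])
    row = {"A": 0, "C": 1, "G": 2, "T": 3}

    def scan(sequence, matrix):
        """Score every window of the sequence against the PSSM."""
        width = matrix.shape[1]
        return [
            int(sum(matrix[row[base], pos] for pos, base in enumerate(sequence[i:i + width])))
            for i in range(len(sequence) - width + 1)
        ]

    print(scan("CTATAATC", pssm))   # [-93, 85, -95], the window scores shown above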


Sequence logo: graphical representation
Cys-Cys-His-His zinc finger DNA binding domain

The total height of each stack represents the degree of conservation of each position; the heights of the letters in a stack are proportional to their frequencies.

Page 9: Stochastic processes and Hidden Markov Models


PSSM for p53 binding site

Counts from 37 known sites (positions 1-20):

    Pos:   1    2    3    4    5    6    7     8    9   10    11   12    13   14   15   16   17    18   19   20
    A     14   11   26    0   28  2.5    0   0.5    0    3     6    2  11.5    0   27    4    0   0.5    1    2
    C      3    1    1   36    1  0.5    0  24.5   33   23     2    0   0.5   36    2    0    0   9.5   24   15
    G     16   24   10    0    0    0   37     0    0    0  23.5   34    25    0    2    1   37     0    0    3
    T      4    1    0    1    7   34    0    12    4   10   5.5    1     0    1    5   32    0    27   12   16

J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", 2002.


What is missing ?

PSSMs help to deal with the stochastic distributions of symbols at a given position.

However, they lack the ability to deal with the length distribution of the motif they describe.

Stochastic processes provide a more general framework to deal with these questions.

Page 10: Stochastic processes and Hidden Markov Models


Stochastic Process

A stochastic process X is a collection of random variables {Xt, t ∈T}

The variable t is often interpreted as time, and X(t) is the state of the process at time t.

A realization of X is called a sample path.

While this definition is quite general, there are a number of special cases that are of high interest in bioinformatics, in particular Markov processes.


Markov Chain

A Markov process is a stochastic process X with the following properties:
– The random variables Xt can take values in a finite list of states S = {s1, s2, …, sn} (often taken as {1, 2, …, n})
– P(Xt+1 = jt+1 | Xt = jt, Xt-1 = jt-1, …) = P(Xt+1 = jt+1 | Xt = jt)
  • This Markov property means that the probability of getting to the next state depends only on the current state, and not on the previous states; in other words, the Markov process is memoryless.
– We assume that the process is homogeneous, meaning that the transition probabilities are the same at all time points: P(Xt+1 = j | Xt = i) = pij
– The pij are usually represented as a matrix P, with rows indexed by the current state and columns by the next state, together with an initial probability vector π:

        p11  p12  …  p1n
    P = p21  p22  …  p2n
         …    …   …   …
        pn1  pn2  …  pnn

Page 11: Stochastic processes and Hidden Markov Models


Markov Chain: matrix form

Given a vector πt containing the probabilities of each state at time t, the probabilities of the states at time t+1 are given by

    πt+1 = πt P

Looking further into the future of the chain, the probabilities P(Xt+n = j | Xt = i) are given by

    πt+n = πt P^n

(matrix multiplication calculates the long-range probabilities), where P is the transition matrix above, with rows indexed by the current state and columns by the next state.


Stationary processes

Under certain conditions, there exists a stationary distribution for the states of the Markov model, that is, a distribution for which the probabilities of the different states do not change anymore.

A distribution π is stationary if π = πP.

If it exists, the stationary distribution can be found by solving this equation (along with the constraint that the elements of π must sum to 1).

An alternative is to calculate lim (n→∞) P^n: for an ergodic chain, each row of P^n converges to the stationary distribution.

Page 12: Stochastic processes and Hidden Markov Models


An example: mutations in DNA

A Markov chain can be used to model mutations in DNA, e.g. for use in evolutionary problems.

Different models are possible, depending on the constraints that we place on the possible mutations

See for example http://www.molecularevolution.org/resources/models/dnamodels.php

Example (fake):

S. Tavaré, “Some Probabilistic and Statistical Problems inthe Analysis of DNA Sequences”.

          A     C     G     T
    A  0.80  0.05  0.05  0.10
    C  0.05  0.70  0.20  0.05
    G  0.05  0.20  0.70  0.05
    T  0.10  0.05  0.05  0.80


Models for mutations in DNA

The stationary distribution for this process is given by the vector (0.25, 0.25, 0.25, 0.25), meaning that after a certain time the distribution of nucleotides will be uniform, regardless of the initial distribution. Of course, this does not take into account other factors such as natural selection, etc.

These substitution models are especially important for database searches of proteins, where they allow the algorithms to give more weight to sequences that are evolutionarily close, even if they are different from the input (cf. PAM matrices, etc.).
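A quick numerical check of this claim, iterating π ← πP from an arbitrary starting distribution (a minimal sketch using the made-up matrix above):

    import numpy as np

    # Made-up substitution matrix from the previous slide (rows/columns: A, C, G, T)
    P = np.array([
        [0.80, 0.05, 0.05, 0.10],
        [0.05, 0.70, 0.20, 0.05],
        [0.05, 0.20, 0.70, 0.05],
        [0.10, 0.05, 0.05, 0.80],
    ])

    pi = np.array([0.7, 0.1, 0.1, 0.1])     # arbitrary, non-uniform initial distribution
    for _ in range(200):                     # iterate pi <- pi P
        pi = pi @ P

    print(np.round(pi, 3))                   # -> [0.25 0.25 0.25 0.25]

    # Equivalently, every row of P^n converges to the stationary distribution
    print(np.round(np.linalg.matrix_power(P, 200)[0], 3))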

Page 13: Stochastic processes and Hidden Markov Models


Example: CpG islands

In the human genome, the frequency of the dinucleotide CG (written CpG to differentiate it from a C-G pair across the two strands of DNA) is lower than what would be expected from the frequencies of the C and G nucleotides, because the cytosine is usually methylated and is often mutated into a thymine (T).

However, in some regions of the DNA (in general around the promoter region of a gene), called CpG Islands, the methylation process is suppressed and a much higher frequency of CpG dinucleotides is observed.


Difference between the 2 examples

In the example concerning mutations of DNA, we have a single nucleotide that changes over time. If we look at several nucleotides, they can mutate independently, and a different Markov chain is associated with each of them.

In the example concerning CpG islands, we are looking at consecutive nucleotides and how they are linked using a single Markov chain.

ACTAGCTAGCTTGATCTGATCGACTGTGG

ACAAGGTAGCTAGACCTGATCGACAGTGG

TCAAGCTAGATAGACCTGCTCGTCGGTGG

.............................

A→C→T→A→G→C→T→A→G→C→T→T→G→A→T→C→T ...

Page 14: Stochastic processes and Hidden Markov Models


A Markov model for CpG islands

Roughly speaking, being in a CpG island means that C is more frequent than in the rest of the genome, and is more often followed by G.

This can be modeled using a Markov model.

Given a list of sequences from putative CpG islands, a Markov model ("+") can be derived

Given a list of sequences not part of a CpG island, a second Markov model (“-”) can be derived


Deriving the Markov models

Model "+" (inside CpG islands):

          A     C     G     T
    A  0.18  0.27  0.43  0.12
    C  0.17  0.37  0.27  0.19
    G  0.16  0.34  0.38  0.12
    T  0.08  0.36  0.38  0.18

Model "-" (outside CpG islands):

          A     C     G     T
    A  0.30  0.21  0.28  0.21
    C  0.32  0.30  0.08  0.30
    G  0.25  0.24  0.30  0.21
    T  0.18  0.24  0.29  0.29

Eddy et al., "Biological sequence analysis"

Data from 48 putative CpG islands, a total of about 60,000 nucleotides. To estimate the probabilities, calculate the observed frequencies of transitions from each nucleotide to any other (maximum likelihood estimator). Example: in CpG islands, 1,000 "A"s are observed, 270 of which are followed by a "C"; this gives an estimated probability P(C|A) = 0.27.
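The maximum-likelihood estimate is simply a table of normalised transition counts; a minimal sketch (the training sequences below are toy placeholders, not the original CpG-island data):

    from collections import Counter

    def estimate_transitions(sequences, alphabet="ACGT"):
        """Maximum-likelihood estimate of P(next | current) from a list of sequences."""
        counts = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):        # count every observed transition a -> b
                counts[(a, b)] += 1
        probs = {}
        for a in alphabet:
            total = sum(counts[(a, b)] for b in alphabet)
            probs[a] = {b: (counts[(a, b)] / total if total else 0.0) for b in alphabet}
        return probs

    # Toy usage; replace with sequences from annotated CpG islands / background regions
    plus_model = estimate_transitions(["CGCGGCGCCGCGC", "GCGCGGCGC"])
    print(plus_model["C"])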

Page 15: Stochastic processes and Hidden Markov Models


How to use the model ?

For a given sequence X0X1X2…Xn, we are interested in calculating the probability that one of our Markov models generated this particular sequence

This is easily done using the Markov property:
– P(X0, X1, X2, …, Xn) = P(X0) P(X1|X0) P(X2|X1) … P(Xn|Xn-1)

The different probabilities are provided either by the transition matrix or by an initial probability vector

The initial probability vector can correspond for example to the stationary distribution, here:

                  A     C     G     T
    Model "+"  0.16  0.34  0.35  0.15
    Model "-"  0.26  0.24  0.24  0.25


How to use the model ?

Example: A → C → G → T → A → C → G → T
P(CpG) = 0.16 × 0.27 × 0.27 × 0.12 × 0.08 × 0.27 × 0.27 × 0.12 = 3.6 × 10^-6

Simply calculating the probability of going through a certain path in the Markov chain may not be the best thing to do, because the probability will keep becoming smaller and smaller (the probability of seeing any particular sequence decreases quickly as the sequence length increases).

More interesting is the likelihood ratio (LR): the probability according to the "CpG island" model compared with the probability according to a "background model".

Page 16: Stochastic processes and Hidden Markov Models


Likelihood Ratio

The background distribution can be given by the “non-CpG island” Markov chain, or another distribution of probabilities.

The LR indicates whether we are more likely to be in a CpG island (LR > 1) or outside of one (LR < 1); equivalently, the log LR is positive or negative.

In our example:
– P(+) = 0.16 × 0.27 × 0.27 × 0.12 × 0.08 × 0.27 × 0.27 × 0.12 = 3.6 × 10^-6
– P(-) = 0.26 × 0.21 × 0.08 × 0.21 × 0.18 × 0.21 × 0.28 × 0.21 = 2.0 × 10^-6
– LR = (3.6 × 10^-6) / (2.0 × 10^-6) = 1.8

This is an example of why we indicated that the problem of DNA segmentation was important.
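A minimal sketch of this (log) likelihood-ratio computation under the two first-order Markov models, using the transition matrices and initial distributions quoted above:

    import numpy as np

    IDX = {b: i for i, b in enumerate("ACGT")}

    PLUS_T = np.array([[0.18, 0.27, 0.43, 0.12],    # "+" (CpG island) transitions
                       [0.17, 0.37, 0.27, 0.19],
                       [0.16, 0.34, 0.38, 0.12],
                       [0.08, 0.36, 0.38, 0.18]])
    MINUS_T = np.array([[0.30, 0.21, 0.28, 0.21],   # "-" (background) transitions
                        [0.32, 0.30, 0.08, 0.30],
                        [0.25, 0.24, 0.30, 0.21],
                        [0.18, 0.24, 0.29, 0.29]])
    PLUS_INIT = np.array([0.16, 0.34, 0.35, 0.15])
    MINUS_INIT = np.array([0.26, 0.24, 0.24, 0.25])

    def log_prob(seq, init, trans):
        """Log-probability of a sequence under a first-order Markov chain."""
        logp = np.log(init[IDX[seq[0]]])
        for a, b in zip(seq, seq[1:]):
            logp += np.log(trans[IDX[a], IDX[b]])
        return logp

    log_lr = log_prob("ACGTACGT", PLUS_INIT, PLUS_T) - log_prob("ACGTACGT", MINUS_INIT, MINUS_T)
    print(log_lr)   # > 0 points towards a CpG island, < 0 towards the background model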


LR in practice

Problems with the analysis of a long sequence of DNA with this method:
– Need to specify a sliding window on which the test will be applied

– If the window is too small, the results may not be significant (few data points used for the LR)

– If the window is too large, some CpG islands may be missed.

Hidden Markov Models provide a framework where we can use a single model to specify both the observations (nucleotides) and a state (“are we in a CpG island or not ?”) which we call hidden because it is not observed, and does not appear directly when observing the sequence.

Page 17: Stochastic processes and Hidden Markov Models


Hidden Markov Models

Described by Leonard E. Baum and colleagues in a series of 5 papers between 1966 and 1972

Originally used for Speech Recognition

Applied in genetics in the late 1980s

Now ubiquitous in bioinformatics:

Prediction of protein domains

Interpretation of peptide tandem MS data

Analysis of ChIP-chip microarray data

Multiple sequence alignment

Functional classification of proteins

Prediction of protein folding

Gene finding


HMM: Formal definition

Two processes {(St, Ot), t = 1, 2, …}, where
– St is the hidden state at time t

– Ot is the observation at time t

– Pr(St | St-1 , Ot-1 , St-2 , Ot-2 , …) = Pr (St | St-1 )(Markov Chain)

– Pr(Ot | St, St-1, Ot-1, St-2, Ot-2, …) = Pr (Ot | St )

Many variants are used (e.g. the distribution of Ot can depend on previous states S or previous observations O)

Page 18: Stochastic processes and Hidden Markov Models


Example: heterogeneity of DNA

2 states {AT,CG}, representing AT-rich or CG-rich regions

What we see is a series of symbols A,C,G,T.

What is hidden is the state: are we in an AT-rich or a CG-rich region?

Example (with made-up numbers) of a possible architecture.

Models are usually specified with a BEGIN and an END state (not shown here).

[Figure: two-state HMM. Emission probabilities: state AT emits A 0.40, C 0.10, G 0.10, T 0.40; state CG emits A 0.10, C 0.40, G 0.40, T 0.10. Transition probabilities (as used in the forward-algorithm example later): AT→AT 0.95, AT→CG 0.05, CG→CG 0.9, CG→AT 0.1.]


CpG islands, part 2

CpG islands can be modelled using a Hidden Markov Model

Parameters:
– 2 sets of states A, C, G and T; one set for "CpG island" and another one for "non-CpG island"
– Transition probabilities between the different states (inside the 2 sets, and between them)
– Observation probabilities

Page 19: Stochastic processes and Hidden Markov Models


CpG islands, part 2

[Figure: two sets of four states A, C, G, T, one set for "CpG island" and one for "non-CpG island", with transitions within and between the two sets.]

The green transitions are similar to the probabilities defined by our “+” and “-” Markov models for CpG islands.

The red transitions incorporate the probabilities of entering or leaving a CpG island (absent from the previous model).


Fitting of a HMM

If the sequence of states is known, the probabilities can be trained by maximum likelihood, exactly like our Markov models for DNA heterogeneity / CpG islands.
– However, we cannot use 2 separate sets of data for CpG islands and non-CpG islands: we require a set of annotated sequences with transitions from one state to the other, in order to estimate these transition probabilities.

In most cases, however, we know only the sequence of symbols, and not the underlying states.

In this case, there is no closed-form equation to estimate the parameters, and a (usually iterative) optimisation procedure must be used, starting from a random estimate of the parameters (starting point).

Page 20: Stochastic processes and Hidden Markov Models


Fitting of a HMM

Given the general architecture of an HMM (number of states, possible transitions), and a list of training sequences, the standard algorithm for estimating the transition and emission probabilities is the Baum-Welch training algorithm (based on the Expectation-Maximization (EM) algorithm)

An alternative training algorithm is the Viterbi training algorithm, which calculates the most probable paths for the training sequences and re-estimates the probabilities from these results.


Expectation-Maximization (EM) algorithm

The EM algorithm allows us to perform Maximum Likelihood estimation when some data is missing (in this case, the unknown sequence of states).

It requires a “starting point”, i.e. an initial guess of the parameters of the model

It consists of iterating two steps:
– E-step: using the current model, estimate the missing observations
– M-step: use the observations (real + estimated) to find a new model by maximum likelihood

It can be shown that the likelihood increases with each iteration, so the new model always gets better, until we reach a (hopefully global) maximum.

Used to solve many different problems in bioinformatics

Page 21: Stochastic processes and Hidden Markov Models


Questions

Given an HMM and a (new) sequence, we may want to know:
– What is the probability that the sequence was generated by this model (scoring problem)? Solved by the Forward algorithm.
– What is the most probable path of states followed by the HMM when generating this sequence (the "annotation" problem, or decoding)? Solved by the Viterbi algorithm.
– What is the most probable state for a given observation (symbol) (posterior probability)? Solved by the backward algorithm.

These elegant algorithms are based on a similar concept, dynamic programming; we will detail only the forward algorithm.


Forward algorithm

If the number of states increases, the number of possible paths increases exponentially. For most models, it is not possible to enumerate all the paths.

The "trick" is to recognise the following fact (based on the Markov property):
– The probability that the k-th character of a given sequence was generated by state i depends only on which state generated the (k-1)-th character, and on the transition probabilities between the states.
– The actual path that was followed before is not relevant.

This allows us to calculate the total probability recursively.

Page 22: Stochastic processes and Hidden Markov Models


Forward algorithm: example

What is the probability that our model for DNA heterogeneity produced the sequence ATG?

Naïve approach: 3 symbols, 2 possible states for each symbol, 8 paths to consider.

[Trellis: a start state S, the states AT and CG at each of the positions s1 = A, s2 = T, s3 = G, and an end state E; transition probabilities S→AT 0.8, S→CG 0.2, AT→AT 0.95, AT→CG 0.05, CG→CG 0.9, CG→AT 0.1, and probability 1.0 into E.]

α1(AT) = PAT(A) · 0.8 = 0.4 · 0.8 = 0.32
α1(CG) = PCG(A) · 0.2 = 0.1 · 0.2 = 0.02


Forward algorithm

α1(AT) = PAT(A) · 0.8 = 0.4 · 0.8 = 0.32
α1(CG) = PCG(A) · 0.2 = 0.1 · 0.2 = 0.02

α2(AT) = PAT(T) · (0.95 · α1(AT) + 0.1 · α1(CG)) = 0.1224
α2(CG) = PCG(T) · (0.05 · α1(AT) + 0.9 · α1(CG)) = 0.0034

[Same trellis as above, for s1 = A, s2 = T, s3 = G.]

Page 23: Stochastic processes and Hidden Markov Models


Forward algorithm

[Same trellis as above, for s1 = A, s2 = T, s3 = G.]

α1(AT) = PAT(A) · 0.8 = 0.4 · 0.8 = 0.32
α1(CG) = PCG(A) · 0.2 = 0.1 · 0.2 = 0.02
α2(AT) = PAT(T) · (0.95 · α1(AT) + 0.1 · α1(CG)) = 0.1224
α2(CG) = PCG(T) · (0.05 · α1(AT) + 0.9 · α1(CG)) = 0.0034
α3(AT) = PAT(G) · (0.95 · α2(AT) + 0.1 · α2(CG)) = 0.0116
α3(CG) = PCG(G) · (0.05 · α2(AT) + 0.9 · α2(CG)) = 0.0037

α = α3(AT) + α3(CG) = 0.0153
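A minimal numpy sketch of this forward recursion on the same two-state model (start probabilities 0.8/0.2, emission and transition values as in the slides); it reproduces the total probability 0.0153 computed above:

    import numpy as np

    start = np.array([0.8, 0.2])                    # S -> AT, S -> CG
    trans = np.array([[0.95, 0.05],                 # AT -> AT, AT -> CG
                      [0.10, 0.90]])                # CG -> AT, CG -> CG
    emit = {"A": np.array([0.4, 0.1]),              # P_AT(x), P_CG(x)
            "C": np.array([0.1, 0.4]),
            "G": np.array([0.1, 0.4]),
            "T": np.array([0.4, 0.1])}

    def forward(seq):
        """Total probability of a sequence, summed over all state paths."""
        alpha = start * emit[seq[0]]                # alpha_1
        for symbol in seq[1:]:
            alpha = emit[symbol] * (alpha @ trans)  # alpha_k from alpha_{k-1}
        return alpha.sum()

    print(round(forward("ATG"), 4))                 # -> 0.0153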


Viterbi algorithm and posterior decoding

For a given sequence, the Viterbi algorithm finds the most likely path across the model.

In some cases, this may not be the most relevant information (e.g. if two paths have very close probabilities). Posterior decoding using the backward algorithm can find the most probable state for a given observation, summed over all the possible paths (the Viterbi algorithm considers the maximum over all paths instead of the sum).

[Same trellis as above, for the observations A, T, G.]
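For comparison, a minimal Viterbi sketch for the same two-state model (same parameters as the forward sketch above); it returns the single most likely state path, working in log space to avoid numerical underflow:

    import numpy as np

    start = np.array([0.8, 0.2])
    trans = np.array([[0.95, 0.05],
                      [0.10, 0.90]])
    emit = {"A": np.array([0.4, 0.1]), "C": np.array([0.1, 0.4]),
            "G": np.array([0.1, 0.4]), "T": np.array([0.4, 0.1])}
    states = ["AT", "CG"]

    def viterbi(seq):
        """Most likely state path for a sequence."""
        v = np.log(start) + np.log(emit[seq[0]])    # best log-probability ending in each state
        back = []                                   # backpointers, one array per position
        for symbol in seq[1:]:
            scores = v[:, None] + np.log(trans)     # scores[i, j]: come from state i, move to j
            back.append(scores.argmax(axis=0))
            v = scores.max(axis=0) + np.log(emit[symbol])
        path = [int(v.argmax())]
        for ptr in reversed(back):                  # trace the best path backwards
            path.append(int(ptr[path[-1]]))
        return [states[i] for i in reversed(path)]

    print(viterbi("ATG"))                           # ['AT', 'AT', 'AT'] for this sequence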

Page 24: Stochastic processes and Hidden Markov Models


A longer example

Several hundred bases were generated using the model for heterogeneity of DNA

The resulting sequence was “decoded” using the Viterbi algorithm to find the sequence of states that is most likely.


Obs. seq: TTATTTAACTTAATAAATATGTCAATCAATTTTCTGCTTCAGTTCAGTAGGGGAACATCATACTTGGAAAGGAAATATAAReal state: 11111111111111111111111111111111111111111111111112222111111111111111111111111111Pred.state: 11111111111111111111111111111111111111111111111112222111111111111111111111111111

CTTTATTGCATATTAAGGCGCCACGGGGCCGGCGCGCGGGCATAAAATAGCTATTTTTCTTCGTATGTGAATGAATCCAT1111111111111111122222222222222222222222221111111122111111111111111111111111111111111111111111112222222222222222222222222111111111111111111111111111111111111111

TATCCAAATCATTTGATCCATTTAAATTTTATTATGTTTTTCAGCTGTAACAGTAAAGTTTTTACACGCAATTGTGAAGA1111111111111112111111111111111111111111111111111111111111211112222222211111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

ATCATCTCAAAAGAAATAAAAAATTTAAGGTGACCCGGGACGGCGCCTGAATTTAGAATCGCCGGGACCCCAGCGACTTG1111111111111111111111111111122222222222222222221111111111122222222222222222221111111111111111111111111111112222222222222222222111111111111222222222222222222222

CCACTTTGGGTTCGATATGAATATTATCAACGTGGGGGCCGACCGCGCTTAATAAAATATTATAGTGCAAAAATATAAAG2222111112211111111111111111111222222222222221222221111111111111111111111111111122111111111111111111111111111122222222222222222211111111111111111111111111111111

ATCTAATATTGCAGTTCAAATTTGTAATATATATTTTGCCGAAATATGTGCGCGTGGCCCTTTTGACTTAACATTATTCG1111111111112111111111112111111111111122221111111222222222222111111111111111111111111111111111111111111111111111111112222111111112222222222211111111111111111122

GCACAGCCCCTGCGCCGAAGAATTAACAATAGATATATTTCATATTTAATACGTAAAACAAAGATTACATTTGCATAGAT2222222222222222222221111111111111111111111111111111111111111111111111111111111122222222222222222111111111111111111111111111111111111111111111111111111111111111

TAGGCCTAACCTCGATGATGGTATATAACTATTAATTTGATAATAATAATGGGTTCAATCTATTTTTACGCCCCCACATA1111111111221111111111111111111111111111111111111112222221111111111112222222221111222211111111111111111111111111111111111111111111111111111111111111222222211111

Page 25: Stochastic processes and Hidden Markov Models


Some comments

The model is unable to find an "island" of one state "2" within a series of states "1" (or the opposite), because two state transitions in a row are much less likely than any observation.

Since we work with random processes, this is expected: a misleading sequence of observations can always appear at random.

Longer series in one state are more likely to be found, because the model has more evidence in hand to find them, and longer series are more likely to be real and less likely to happen just by chance.

Boundaries of regions are often incorrect, because a nucleotide at the boundary will be "attached" to the region it is most likely from.


Coupling of state and observation

One problem with HMMs is the fact that the states and observations are coupled. If a symbol is emitted for each state and the probability of staying in a given state is (1-p), the number of symbols emitted will follow a geometric distribution (mean 1/p).

Variants have been developed where state and observation are decoupled; they are called generalized HMMs (GHMMs), HMMs with duration, or hidden semi-Markov models. With a GHMM, states do not loop on themselves to repeat symbols; instead, each state can emit a number of symbols according to an arbitrary distribution.

[Figure: a state with self-loop probability 1-p and exit probability p.]

Page 26: Stochastic processes and Hidden Markov Models


Gene finding

Given all the genomes that have been sequenced in the past few years, gene finding is an important (and challenging) task.

By "gene finding" we can mean:
– Find whether a particular piece of DNA sequence is in a gene or not (segmentation)
– Find and annotate all the different parts of a gene
– … and anything in between


Page 27: Stochastic processes and Hidden Markov Models


A very simple model for prokaryotes

Calculate the frequency of appearance of bases (or codons) in genes, and in intergenic regions

Similar to the models we used for the heterogeneity of DNA (segmentation)

This could be modelled using an HMM with this architecture and the relevant observation probabilities.

[Figure: two states, "Non-coding region" and "Coding region", each with a self-loop (probabilities 1-p and 1-q) and transitions between them (probabilities p and q).]


More sophisticated model

A more sophisticated model uses frame-dependent composition

[Figure: a "Non-coding region" state and three coding states Position 1, Position 2, Position 3 (one per codon position); Position 1 → Position 2 → Position 3 with probability 1, Position 3 looping back to Position 1 (probability 1-q) or returning to the non-coding state (probability q), which enters Position 1 with probability p and stays non-coding with probability 1-p.]

Page 28: Stochastic processes and Hidden Markov Models


More sophisticated model (2)

We can modify our model so that it requires a translation initiation site (ATG), as well as a translation termination site (TAA, TAG or TGA). Note that "Coding region" and "Non-coding region" can themselves be HMMs (not detailed here).

[Figure: the non-coding state enters an A→T→G start-codon submodel, followed by the coding region, then one of three stop-codon submodels (T→A→A, T→A→G or T→G→A) leading back to the non-coding region.]


More sophisticated models

We can build more complicated models on top of the simpler ones:
– Promoters, second strand, etc.

For every model, we hope that it will be better at finding real genes than the previous ones

However, if the model gets more complicated, we have more parameters to estimate.

Page 29: Stochastic processes and Hidden Markov Models


Human gene finding

Finding genes in eukaryotic genomes is complicated by the presence of exons and introns.

Many gene finders have been developed specifically for finding genes in the human genome.

GENSCAN is currently one of the most popular and successful human gene finders.

It is based on a GHMM.

Burge et al., J. Mol. Biol., 1997


[Figure: the GENSCAN GHMM state diagram, with intergenic region, promoter, 5'UTR, exon states (E0, E1, E2 and initial/terminal/single-exon states), intron states (I0, I1, I2), 3'UTR and poly-A signal, mirrored on the forward (+) and reverse (-) strands, applied to an example human genomic sequence (positions 62001-63750).]

Page 30: Stochastic processes and Hidden Markov Models


Assessing performance

The performance of a predictive model can be assessed using a second dataset (the testing set), different from the one used to train the model (the learning or training set). We assume that we know the "truth" in each dataset.

Important quantities:
– TP (True Positives): number of positive predictions which are correct

– FP (False Positives): number of positive predictions which are incorrect (i.e. spurious prediction of a site)

– TN (True Negatives): number of negative predictions which are correct

– FN (False Negatives): number of negative predictions which are incorrect (i.e. true site, missed).


Assessing performance

Most of the time, the number of False Positives and False Negatives (i.e., the errors that we would like to minimise) is not fixed, but depends on a continuous score (e.g. the probability returned by the forward algorithm) on which we set a threshold.

Changing this threshold will change the number of errors; a higher threshold (i.e. a more stringent criterion for classifying an observation as "positive") will usually result in a lower number of False Positives, but a higher number of False Negatives.

Page 31: Stochastic processes and Hidden Markov Models


Specificity and sensitivity

Sensitivity is TP/(TP+FN), the proportion of true sites that are correctly found and annotated.

Specificity is generally defined as TN/(TN+FP), the proportion of false sites that are correctly annotated.

In some cases, specificity is instead defined as TP/(TP+FP), the proportion of all positive predictions that are correct. This is especially useful if the proportion of false sites correctly annotated is high, in which case TN/(TN+FP) will not be sensitive to changes in FP.

The false positive rate is 1 - specificity.

S.E. Cawley, A.I. Wirth and T.P. Speed, Mol. and Biochem. Parasitology, 118(2), 167-174 (2001).


Graphical representation

[Figure: distributions of scores for negative and positive results along a score axis, with a chosen threshold; observations above the threshold are classified as "positive" and those below as "negative", defining the TN, FN, FP and TP areas.]

The two distributions overlap, so it is impossible to use this score to perfectly discriminate between positive and negative results.

Page 32: Stochastic processes and Hidden Markov Models



Comparison of different models

Different models can be applied to a given statistical problem, as we have seen in the case of gene finding.

ROC (Receiver Operating Characteristic) curves help summarise the FP/FN rates of one or several models.

Page 33: Stochastic processes and Hidden Markov Models


ROC curve

[Figure: ROC curve, plotting the true positive rate (sensitivity) on the vertical axis against the false positive rate (1 - specificity) on the horizontal axis, both from 0% to 100%.]


ROC curve

[Same ROC axes as above.]

The Area Under the Curve (AUC) provides a way to compare the global performance of different models.

A perfect model would have an AUC of 1

A random model would have an AUC of 0.5 (no better than chance).

Models in between can have AUCs around 0.9 (excellent), 0.8 (good), 0.7 (fair), 0.6 (poor).
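As an illustration of how an ROC curve and its AUC are obtained in practice, a minimal sketch using scikit-learn; the labels and scores below are made-up stand-ins for, e.g., forward-algorithm scores on a labelled test set:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])          # 1 = real site
    y_score = np.array([2.3, 1.1, 0.4, -0.2, 0.8, -0.5, -1.0, -1.7, 0.1, -2.4])

    fpr, tpr, thresholds = roc_curve(y_true, y_score)           # one (FPR, TPR) point per threshold
    print("AUC =", roc_auc_score(y_true, y_score))              # area under the ROC curve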

Page 34: Stochastic processes and Hidden Markov Models


ROC curves and other methods

Problems with ROC curves:
– When comparing two curves, the comparison is not straightforward if the two curves cross once or several times.
– Most of the time, we are only interested in some regions of the graph, typically the low-FP region (we usually do not care about an algorithm that produces 85% false positives, even if it produces no false negatives).

– The comparison between models does not take into account the complexity of the model; a model A may be slightly better than model B, but if model A is much more complicated (higher number of parameters), the gain may not be worth the work. Comparison methods that penalise more complicated models also exist (for example, AIC or BIC).


What is still missing with HMMs ?

By default, HMMs use a Markov model of order 1, i.e. the transition and emission probabilities depend only on the current state.

This means that they cannot take into account dependencies between the sites.

Models of higher orders, which solve this problem, can be transformed into models of order 1 by increasing the number of states

Doing so is straightforward, but the complexity of the resulting models increases exponentially and they are quickly computationally infeasible.

In general, this problem is unsolved.

Page 35: Stochastic processes and Hidden Markov Models


What is still missing with HMMs ?

Example: probabilities of observation of nucleotides in DNA in a Markov model

Increase the number of states by having a state for each digram instead of each symbol

Order-1 model (one state per symbol):

          A     C     G     T
    A  0.18  0.27  0.43  0.12
    C  0.17  0.37  0.27  0.19
    G  0.16  0.34  0.38  0.12
    T  0.08  0.36  0.38  0.18

Order-1 encoding of a higher-order model (one state per digram):

           A     C     G     T
    AA  0.18  0.27  0.43  0.12
    AC  0.17  0.37  0.27  0.19
    AG  0.16  0.34  0.38  0.12
    AT  0.17  0.37  0.27  0.19
    CA  0.16  0.34  0.38  0.12
    CC  0.08  0.36  0.38  0.18
    …     …     …     …     …
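A minimal sketch of this state expansion: given second-order probabilities P(c | a, b), build an equivalent first-order chain whose states are digrams (the probability table used here is a made-up uniform placeholder):

    from itertools import product

    ALPHABET = "ACGT"

    def expand_to_first_order(p2):
        """Turn second-order probabilities p2[(a, b)][c] = P(c | a, b) into a
        first-order chain over digram states: (a, b) -> (b, c) with that probability."""
        trans = {}
        for a, b in product(ALPHABET, repeat=2):
            trans[(a, b)] = {(b, c): p2[(a, b)][c] for c in ALPHABET}
            # every other transition (a, b) -> (x, c) with x != b has probability 0
        return trans

    p2 = {(a, b): {c: 0.25 for c in ALPHABET} for a, b in product(ALPHABET, repeat=2)}
    first_order = expand_to_first_order(p2)
    print(first_order[("A", "C")])   # {('C', 'A'): 0.25, ('C', 'C'): 0.25, ...}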


Other stochastic models

Many other stochastic models are available, such as:
– Variable Length Markov Models: many variants; use models of order higher than 1, but only when needed, and simpler models otherwise; a compromise to limit the increase in complexity. Represented using trees.
– Permuted Variable Length Markov Models: similar to VLMMs, but the sites may be reordered, so that the "previous" site in the probabilities may not be the previous site in the sequence; this allows simple but long-range effects to be modelled.

Page 36: Stochastic processes and Hidden Markov Models


Some caveats…

Some care has to be taken, because these models can break down at any time, without warning!

Ideally, any prediction made with a statistical model should be confirmed experimentally.

However, these models have been hugely successful, as shown by our examples and references


Introduction: Coiled-coil domains

Coiled-coil domains (CCD) are a protein motif found in many types of proteins.

They consist of two or more identical strands of amino acid sequence forming alpha-helices that are wrapped around each other, forming one of the simplest tertiary structures.

Most coiled-coil domains contain a pattern called the heptad repeat: seven residues of the form abcdefg, where the a and d positions are generally hydrophobic.

Page 37: Stochastic processes and Hidden Markov Models


Coiled-coil domains

This apparent structure is a good candidate for statistical modeling.

Methods used for modeling:
– PSSM
– HMM

For the practical work, we will use an HMM to model CCDs.


References (web)

Wikipedia (http://en.wikipedia.org/) contains much information on HMMs and other topics

Introduction to Coiled Coils http://www.lifesci.sussex.ac.uk/research/woolfson/html/coils.html

Terry Speed's courses at the University of California, Berkeley: http://www.stat.berkeley.edu/users/terry/Classes/index.html

Page 38: Stochastic processes and Hidden Markov Models


References (General articles)

Sean R. Eddy, “What is a hidden Markov Model ?”, Nat. Biotech. 22, p. 1315-1316 (2004).

Karen Heyman, “Gene Finding with Hidden Markov Models”, The Scientist, 19(6), p. 26 (2005).

L.R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proc. IEEE. 77(2), p. 257-285 (1989).


References (Specific articles)

p53 transcription factor:
– J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", PNAS 99(13), 25 June 2002, 8467-8472.

– C.-L. Wei et al., “A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome”, Cell 124, 13 January 2006, 207-219.

Other stochastic models:
– X. Zhao, H. Huang and T.P. Speed, "Finding short DNA motifs using permuted Markov models", Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB 2004).

Page 39: Stochastic processes and Hidden Markov Models


References (Specific articles)

S. Tavaré, "Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences", Lectures on Mathematics and Life Sciences, American Mathematical Society, vol. 17, 1986.

L. Peshkin and M.S. Gelfand, "Segmentation of yeast DNA using hidden Markov models", Bioinformatics 15(12), p. 980-986 (1999).

C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA", J. Mol. Biol. 268, p. 78-94 (1997).

M. Delorenzi and T. Speed, "An HMM model for coiled-coil domains and a comparison with PSSM-based predictions", Bioinformatics 18(4), p. 617-625 (2002).


References (Books)

R. Durbin, S. Eddy, A. Krogh, G. Mitchison, "Biological sequence analysis", Cambridge University Press, 1998.

T. Koski, “Hidden Markov Models for Bioinformatics”, Kluwer Academic Publishers, 2001.

W.J. Ewens and G.R. Grant, “Statistical Methods in Bioinformatics” (2nd edition), Springer, 2005.

S.M. Ross, “Stochastic Processes” (2nd edition), John Wiley and Sons, 1996.

Page 40: Stochastic processes and Hidden Markov Models


Acknowledgements

Many slides and ideas were borrowed from talks and courses by Terry Speed (University of California, Berkeley, and WEHI, Melbourne, Australia).


More questions ?

[email protected]

[email protected]

