Stochastic processes and Hidden Markov Models
Page 1: Stochastic processes and Hidden Markov Models

Stochastic processes and Hidden Markov Models

Dr Mauro Delorenzi and

Dr Frédéric Schütz

Swiss Institute of Bioinformatics

EMBnet course – Basel 23.3.2006


Introduction

A mainstream topic in bioinformatics is the problem of sequence annotation: given a sequence of DNA/RNA or protein, we want to identify "interesting" elements.

Examples:
– DNA/RNA: genes, promoters, splicing signals, segmentation of heterogeneous DNA, binding sites, etc.
– Proteins: coiled-coil domains, transmembrane domains, signal peptides, phosphorylation sites, etc.
– Generally: homologs, etc.

"The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster"
– http://www.fruitfly.org/GASP1/tutorial/presentation/

Page 2: Stochastic processes and Hidden Markov Models


Sequence annotation

The sequence of many of these interesting elements can be characterized statistically, so we are interested in modeling them.

By modeling, we mean finding statistical models that can:
– Accurately describe the observed elements of provided sequences;
– Accurately predict the presence of particular elements in new, unannotated, sequences;
– If possible, be readily interpretable and provide some insight into the actual biological process involved (i.e. not a black box).


Example: heterogeneity of DNA sequences

The nucleotide composition of segments of genomic DNA changes between different regions in a single organism.
– Example: coding regions in the human genome tend to be GC-rich.

Modeling the differences between different homogeneous regions is interesting because:
– These differences often have a biological meaning
– Many bioinformatics tools depend on the "background distribution" of nucleotides, often assumed to be constant.

Page 3: Stochastic processes and Hidden Markov Models


Modeling tools (quick review)

Among the different tools used for modeling sequences, we have (sorted by increasing complexity):
– Consensus sequences
– Regular expressions
– Position Specific Scoring Matrices (PSSM), or Weight Matrices
– Markov Models, Hidden Markov Models and other stochastic processes

These tools (in particular the stochastic processes) are also used for bioinformatics problems other than pure sequence analysis.


Consensus sequence

The exact sequence that corresponds to a certain region.

Example: transcription initiation in E. coli
– Transcription is initiated at the promoter; the sequence of the promoter is recognised by the sigma factor of RNA polymerase.
– For the sigma factor σ70, the consensus sequence of the promoter is given by (positions -35 and -10):

    TTGACA … TATAAT

This is very rigid and does not allow for any variation.

It also works well for restriction enzyme sites or, in general, for sites for which strict conservation is important (in the case of restriction sites, cutting of the DNA at a certain site is a question of "life and death" for the DNA).

Page 4: Stochastic processes and Hidden Markov Models


Example: binding site for TF p53

The Transcription Factor Binding Site (TFBS) for p53 has been described as having the consensus sequence

    GGA CATG CCC * GGG CATG TCT

where * represents a spacer of variable length.

In this case, the sequence is not entirely conserved; this is believed to allow the cell some flexibility in the level of response to different signals (which was not possible or desirable for restriction sites).


Example: binding site for TF p53

This flexibility translates into the need for more complicated models to describe the site.

Since the binding site is not entirely conserved, the consensus sequence represents only the nucleotides most frequently observed.

The protein could potentially bind to many other similar, but different, sites along the genome.

In theory, if the sites are not independent, the protein may not even bind to the actual consensus sequence!

Page 5: Stochastic processes and Hidden Markov Models


Patterns/Regular Expressions

Patterns attempt to explain observed motifs by trying to identify the most important combinations of positions and nucleotides/residues of a given site (to be compared with the consensus sequence, where the most important nucleotide/residue at each position was identified).

They are often described using the Regular Expression syntax.

Prosite database (developed at the SIB): http://www.expasy.org/prosite/


Example: Cys-Cys-His-His zinc finger DNA binding domain

Its characteristic motif has regular expression

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

where 'x' means any amino acid, '(2,4)' means between 2 and 4 occurrences, and '[…]' indicates a list of possible amino acids.

Example: 1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX
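For illustration, a Prosite-style pattern like this can be translated into a standard regular expression and tested against the example sequence. A minimal Python sketch (the translation and names below are ours, not part of the Prosite distribution):

    import re

    # Prosite pattern C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H translated into
    # regex syntax: 'x' -> '.', 'x(m,n)' -> '.{m,n}', '[...]' keeps the same meaning.
    ZINC_FINGER = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

    seq = "XYKCGLCERSFVEKSALSRHQRVHKNX"   # the 1ZNF example above
    match = ZINC_FINGER.search(seq)
    if match:
        print("motif found:", match.group(), "at positions", match.span())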

Page 6: Stochastic processes and Hidden Markov Models


Example: TFBS for p53

The TFBS has been described with the pattern

    →←  …  →←

where →← is the palindromic sequence

    (5') Pu-Pu-Pu-C-[AT]-[TA]-G-Py-Py-Py (3')

and "…" is a spacer of 0 to 14 nucleotides.

Note that this pattern (with the palindromic condition) cannot be expressed using a regular expression (at least not in a simple or general way).

J. Hoh et al. “The p53MH algorithm and its application in detecting p53-responsive genes”. PNAS, 99(13), June 2002, 8467-8472.


Example: TFBS for p53

The pattern approach clearly allows more flexibility than the consensus sequence; however, it is still too rigid, especially for sites that are not well conserved.

When applying the pattern, each possible amino acid or nucleotide at a given position has the same weight, i.e. it is not possible to specify whether one is more likely to appear than another.

Page 7: Stochastic processes and Hidden Markov Models


Position-Specific Scoring Matrices

“Stochastic consensus sequence”

Indicates the relative importance of a given nucleotide or amino acid at a certain position.

Usually built from an alignment of sequences corresponding to the domain we are interested in, and either a collection of sequences known not to contain the domain, or (most often), background probabilities for the different nucleotides or amino acids.


Building a PSSM (positions 1-6)

Counts from 242 known sites:

    Pos:    1     2     3     4     5     6
    A       9   214    63   142   118     8
    C      22     7    26    31    52    13
    G      18     2    29    38    29     5
    T     193    19   124    31    43   216

Relative frequencies fbl:

    Pos:    1     2     3     4     5     6
    A    0.04  0.88  0.26  0.59  0.49  0.03
    C    0.09  0.03  0.11  0.13  0.22  0.05
    G    0.07  0.01  0.12  0.16  0.12  0.02
    T    0.80  0.08  0.51  0.13  0.18  0.89

PSSM: log(fbl/pb), where pb are the background probabilities:

    Pos:     1      2      3      4      5      6
    A    -2.76   1.82   0.06   1.23   0.96  -2.92
    C    -1.46  -3.11  -1.22  -1.00  -0.22  -2.21
    G    -1.76  -5.00  -1.06  -0.67  -1.06  -3.58
    T     1.67  -1.66   1.04  -1.00  -0.49   1.84
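A minimal sketch of this construction in Python/numpy. A base-2 logarithm and a uniform background probability of 0.25 per nucleotide are assumed here; with these choices the numbers come out close to the PSSM shown above (small differences may be due to rounding or pseudocounts in the original):

    import numpy as np

    # Counts from the 242 known sites (rows A, C, G, T; columns = positions 1-6)
    counts = np.array([
        [  9, 214,  63, 142, 118,   8],   # A
        [ 22,   7,  26,  31,  52,  13],   # C
        [ 18,   2,  29,  38,  29,   5],   # G
        [193,  19, 124,  31,  43, 216],   # T
    ], dtype=float)

    background = np.full(4, 0.25)                 # background probabilities pb (assumed uniform)

    freqs = counts / counts.sum(axis=0)           # relative frequencies fbl
    pssm = np.log2(freqs / background[:, None])   # log(fbl / pb)

    print(np.round(pssm, 2))                      # e.g. about -2.75 for A at position 1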

Page 8: Stochastic processes and Hidden Markov Models


Scoring a sequence using a PSSM

Sequence: C T A T A A T C

PSSM (integer-scaled scores), positions 1-6:

    A   -38   19    1   12   10  -48
    C   -15  -38   -8  -10   -3  -32
    G   -13  -48   -6   -7  -10  -48
    T    17  -32    8   -9   -6   19

Move the matrix along the sequence and score each "window" (sum the scores of the observed nucleotides at each position):
– window CTATAA: sum = -93
– window TATAAT: sum = +85
– window ATAATC: sum = -95

Peaks should occur at the "true" sites.

Of course, in general, any threshold will have some false positive and false negative rate.
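A minimal sketch of this window scanning, using the integer-scaled matrix above (the helper name is ours):

    import numpy as np

    # Integer-scaled PSSM from the slide (rows A, C, G, T; 6 positions)
    pssm = np.array([
        [-38,  19,   1,  12,  10, -48],   # A
        [-15, -38,  -8, -10,  -3, -32],   # C
        [-13, -48,  -6,  -7, -10, -48],   # G
        [ 17, -32,   8,  -9,  -6,  19],   # T
    ])
    row = {"A": 0, "C": 1, "G": 2, "T": 3}

    def scan(sequence, matrix):
        """Score every window of the sequence against the PSSM."""
        width = matrix.shape[1]
        return [
            int(sum(matrix[row[base], pos] for pos, base in enumerate(sequence[i:i + width])))
            for i in range(len(sequence) - width + 1)
        ]

    print(scan("CTATAATC", pssm))   # [-93, 85, -95], the window scores shown above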


Sequence logo: graphical representation
Cys-Cys-His-His zinc finger DNA binding domain

The total height of each stack represents the degree of conservation of each position; the heights of the letters in a stack are proportional to their frequencies.

Page 9: Stochastic processes and Hidden Markov Models


PSSM for p53 binding site

Counts from 37 known sites (positions 1-20):

    Pos:   1    2    3    4    5    6    7     8    9   10    11   12    13   14   15   16   17    18   19   20
    A     14   11   26    0   28  2.5    0   0.5    0    3     6    2  11.5    0   27    4    0   0.5    1    2
    C      3    1    1   36    1  0.5    0  24.5   33   23     2    0   0.5   36    2    0    0   9.5   24   15
    G     16   24   10    0    0    0   37     0    0    0  23.5   34    25    0    2    1   37     0    0    3
    T      4    1    0    1    7   34    0    12    4   10   5.5    1     0    1    5   32    0    27   12   16

J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", 2002.


What is missing ?

PSSMs help to deal with the stochastic distributions of symbols at a given position.

However, they lack the ability to deal with the length distribution of the motif they describe.

Stochastic processes provide a more general framework to deal with these questions.

Page 10: Stochastic processes and Hidden Markov Models


Stochastic Process

A stochastic process X is a collection of random variables {Xt, t ∈T}

The variable t is often interpreted as time, and X(t) is the state of the process at time t.

A realization of X is called a sample path.

While this definition is quite general, there are a number of special cases that are of high interest in bioinformatics, in particular Markov processes.


Markov Chain

A Markov process is a stochastic process X with the following properties:
– The random variables Xt can take values in a finite list of states S = {s1, s2, …, sn} (often taken as {1, 2, …, n})
– P(Xt+1 = jt+1 | Xt = jt, Xt-1 = jt-1, …) = P(Xt+1 = jt+1 | Xt = jt)
  • This Markov property means that the probability of getting to the next state depends only on the current state, and not on the previous states; in other words, the Markov process is memoryless.
– We assume that the process is homogeneous, meaning that the transition probabilities are the same at all time points: P(Xt+1 = j | Xt = i) = pij
– The pij are usually represented as a matrix P, with rows indexed by the current state and columns by the next state, together with an initial probability vector π:

        p11  p12  …  p1n
    P = p21  p22  …  p2n
         …    …   …   …
        pn1  pn2  …  pnn

Page 11: Stochastic processes and Hidden Markov Models


Markov Chain: matrix form

Given a vector πt containing the probabilities of each state at time t, the probabilities of the states at time t+1 are given by

    πt+1 = πt P

Looking further into the future of the chain, the probabilities P(Xt+n = j | Xt = i) are given by

    πt+n = πt P^n

(matrix multiplication calculates the long-range probabilities), where P is the transition matrix above, with rows indexed by the current state and columns by the next state.


Stationary processes

Under certain conditions, there exists a stationary distribution for the states of the Markov model, that is, a distribution for which the probabilities of the different states do not change anymore.

A distribution π is stationary if π = πP.

If it exists, the stationary distribution can be found by solving this equation (along with the constraint that the elements of π must sum to 1).

An alternative is to calculate lim (n→∞) P^n: for an ergodic chain, each row of P^n converges to the stationary distribution.

Page 12: Stochastic processes and Hidden Markov Models


An example: mutations in DNA

A Markov chain can be used to model mutations in DNA, e.g. for use in evolutionary problems.

Different models are possible, depending on the constraints that we place on the possible mutations

See for example http://www.molecularevolution.org/resources/models/dnamodels.php

Example (fake):

S. Tavaré, “Some Probabilistic and Statistical Problems inthe Analysis of DNA Sequences”.

          A     C     G     T
    A  0.80  0.05  0.05  0.10
    C  0.05  0.70  0.20  0.05
    G  0.05  0.20  0.70  0.05
    T  0.10  0.05  0.05  0.80


Models for mutations in DNA

The stationary distribution for this process is given by the vector (0.25, 0.25, 0.25, 0.25), meaning that after a certain time the distribution of nucleotides will be uniform, regardless of the initial distribution. Of course, this does not take into account other factors such as natural selection, etc.

These substitution models are especially important for database searches of proteins, where they allow the algorithms to give more weight to sequences that are evolutionarily close, even if they are different from the input (cf. PAM matrices, etc.).
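A quick numerical check of this claim, iterating π ← πP from an arbitrary starting distribution (a minimal sketch using the made-up matrix above):

    import numpy as np

    # Made-up substitution matrix from the previous slide (rows/columns: A, C, G, T)
    P = np.array([
        [0.80, 0.05, 0.05, 0.10],
        [0.05, 0.70, 0.20, 0.05],
        [0.05, 0.20, 0.70, 0.05],
        [0.10, 0.05, 0.05, 0.80],
    ])

    pi = np.array([0.7, 0.1, 0.1, 0.1])     # arbitrary, non-uniform initial distribution
    for _ in range(200):                     # iterate pi <- pi P
        pi = pi @ P

    print(np.round(pi, 3))                   # -> [0.25 0.25 0.25 0.25]

    # Equivalently, every row of P^n converges to the stationary distribution
    print(np.round(np.linalg.matrix_power(P, 200)[0], 3))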

Page 13: Stochastic processes and Hidden Markov Models


Example: CpG islands

In the human genome, the frequency of the dinucleotide CG (written CpG to differentiate it from a C-G pair across the two strands of DNA) is lower than what would be expected from the frequencies of the C and G nucleotides, because the cytosine is usually methylated and is often mutated into a thymine (T).

However, in some regions of the DNA (in general around the promoter region of a gene), called CpG Islands, the methylation process is suppressed and a much higher frequency of CpG dinucleotides is observed.


Difference between the 2 examples

In the example concerning mutations of DNA, we have a single nucleotide that changes over time. If we look at several nucleotides, they can mutate independently, and a different Markov chain is associated with each of them.

In the example concerning CpG islands, we are looking at consecutive nucleotides and how they are linked using a single Markov chain.

ACTAGCTAGCTTGATCTGATCGACTGTGG

ACAAGGTAGCTAGACCTGATCGACAGTGG

TCAAGCTAGATAGACCTGCTCGTCGGTGG

.............................

A→C→T→A→G→C→T→A→G→C→T→T→G→A→T→C→T ...

Page 14: Stochastic processes and Hidden Markov Models


A Markov model for CpG islands

Roughly speaking, being in a CpG island means that C is more frequent than in the rest of the genome, and is more often followed by G.

This can be modeled using a Markov model.

Given a list of sequences from putative CpG islands, a Markov model ("+") can be derived

Given a list of sequences not part of a CpG island, a second Markov model (“-”) can be derived


Deriving the Markov models

Model "+" (inside CpG islands):

          A     C     G     T
    A  0.18  0.27  0.43  0.12
    C  0.17  0.37  0.27  0.19
    G  0.16  0.34  0.38  0.12
    T  0.08  0.36  0.38  0.18

Model "-" (outside CpG islands):

          A     C     G     T
    A  0.30  0.21  0.28  0.21
    C  0.32  0.30  0.08  0.30
    G  0.25  0.24  0.30  0.21
    T  0.18  0.24  0.29  0.29

Eddy et al., "Biological sequence analysis"

Data from 48 putative CpG islands, a total of about 60,000 nucleotides. To estimate the probabilities, calculate the observed frequencies of transitions from each nucleotide to any other (maximum likelihood estimator). Example: in CpG islands, 1,000 "A"s are observed, 270 of which are followed by a "C"; this gives an estimated probability P(C|A) = 0.27.
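The maximum-likelihood estimate is simply a table of normalised transition counts; a minimal sketch (the training sequences below are toy placeholders, not the original CpG-island data):

    from collections import Counter

    def estimate_transitions(sequences, alphabet="ACGT"):
        """Maximum-likelihood estimate of P(next | current) from a list of sequences."""
        counts = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):        # count every observed transition a -> b
                counts[(a, b)] += 1
        probs = {}
        for a in alphabet:
            total = sum(counts[(a, b)] for b in alphabet)
            probs[a] = {b: (counts[(a, b)] / total if total else 0.0) for b in alphabet}
        return probs

    # Toy usage; replace with sequences from annotated CpG islands / background regions
    plus_model = estimate_transitions(["CGCGGCGCCGCGC", "GCGCGGCGC"])
    print(plus_model["C"])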

Page 15: Stochastic processes and Hidden Markov Models


How to use the model ?

For a given sequence X0X1X2…Xn, we are interested in calculating the probability that one of our Markov models generated this particular sequence

This is easily done using the Markov property:
– P(X0, X1, X2, …, Xn) = P(X0) P(X1|X0) P(X2|X1) … P(Xn|Xn-1)

The different probabilities are provided either by the transition matrix or by an initial probability vector

The initial probability vector can correspond for example to the stationary distribution, here:

                  A     C     G     T
    Model "+"  0.16  0.34  0.35  0.15
    Model "-"  0.26  0.24  0.24  0.25


How to use the model ?

Example: A → C → G → T → A → C → G → T
P(CpG) = 0.16 × 0.27 × 0.27 × 0.12 × 0.08 × 0.27 × 0.27 × 0.12 = 3.6 × 10^-6

Simply calculating the probability of going through a certain path in the Markov chain may not be the best thing to do, because the probability will keep becoming smaller and smaller (the probability of seeing any particular sequence decreases quickly as the sequence length increases).

More interesting is the likelihood ratio (LR): the probability according to the "CpG island" model compared with the probability according to a "background model".

Page 16: Stochastic processes and Hidden Markov Models


Likelihood Ratio

The background distribution can be given by the “non-CpG island” Markov chain, or another distribution of probabilities.

The LR indicates whether we are more likely to be in a CpG island (LR > 1) or outside of one (LR < 1); equivalently, the log LR is positive or negative.

In our example:
– P(+) = 0.16 × 0.27 × 0.27 × 0.12 × 0.08 × 0.27 × 0.27 × 0.12 = 3.6 × 10^-6
– P(-) = 0.26 × 0.21 × 0.08 × 0.21 × 0.18 × 0.21 × 0.28 × 0.21 = 2.0 × 10^-6
– LR = (3.6 × 10^-6) / (2.0 × 10^-6) = 1.8

This is an example of why we indicated that the problem of DNA segmentation was important.
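A minimal sketch of this (log) likelihood-ratio computation under the two first-order Markov models, using the transition matrices and initial distributions quoted above:

    import numpy as np

    IDX = {b: i for i, b in enumerate("ACGT")}

    PLUS_T = np.array([[0.18, 0.27, 0.43, 0.12],    # "+" (CpG island) transitions
                       [0.17, 0.37, 0.27, 0.19],
                       [0.16, 0.34, 0.38, 0.12],
                       [0.08, 0.36, 0.38, 0.18]])
    MINUS_T = np.array([[0.30, 0.21, 0.28, 0.21],   # "-" (background) transitions
                        [0.32, 0.30, 0.08, 0.30],
                        [0.25, 0.24, 0.30, 0.21],
                        [0.18, 0.24, 0.29, 0.29]])
    PLUS_INIT = np.array([0.16, 0.34, 0.35, 0.15])
    MINUS_INIT = np.array([0.26, 0.24, 0.24, 0.25])

    def log_prob(seq, init, trans):
        """Log-probability of a sequence under a first-order Markov chain."""
        logp = np.log(init[IDX[seq[0]]])
        for a, b in zip(seq, seq[1:]):
            logp += np.log(trans[IDX[a], IDX[b]])
        return logp

    log_lr = log_prob("ACGTACGT", PLUS_INIT, PLUS_T) - log_prob("ACGTACGT", MINUS_INIT, MINUS_T)
    print(log_lr)   # > 0 points towards a CpG island, < 0 towards the background model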


LR in practice

Problems with the analysis of a long sequence of DNA with this method:
– Need to specify a sliding window on which the test will be applied

– If the window is too small, the results may not be significant (few data points used for the LR)

– If the window is too large, some CpG islands may be missed.

Hidden Markov Models provide a framework where we can use a single model to specify both the observations (nucleotides) and a state (“are we in a CpG island or not ?”) which we call hidden because it is not observed, and does not appear directly when observing the sequence.

Page 17: Stochastic processes and Hidden Markov Models


Hidden Markov Models

Described by Leonard E. Baum and colleagues in a series of 5 papers between 1966 and 1972

Originally used for Speech Recognition

Applied in genetics in the late 1980s

Now ubiquitous in bioinformatics:

Prediction of protein domains

Interpretation of peptide tandem MS data

Analysis of ChIP-chip microarray data

Multiple sequence alignment

Functional classification of proteins

Prediction of protein folding

Gene finding


HMM: Formal definition

Two processes {(St, Ot), t = 1, 2, …}, where
– St is the hidden state at time t

– Ot is the observation at time t

– Pr(St | St-1 , Ot-1 , St-2 , Ot-2 , …) = Pr (St | St-1 )(Markov Chain)

– Pr(Ot | St, St-1, Ot-1, St-2, Ot-2, …) = Pr (Ot | St )

Many variants are used (e.g. the distribution of Ot can depend on previous states S or previous observations O)

Page 18: Stochastic processes and Hidden Markov Models


Example: heterogeneity of DNA

2 states {AT,CG}, representing AT-rich or CG-rich regions

What we see is a series of symbols A,C,G,T.

What is hidden is the state: are we in an AT-rich or a CG-rich region?

Example (with made-up numbers) of a possible architecture.

Models are usually specified with a BEGIN and an END state (not shown here).

[Figure: two-state HMM. Emission probabilities: state AT emits A 0.40, C 0.10, G 0.10, T 0.40; state CG emits A 0.10, C 0.40, G 0.40, T 0.10. Transition probabilities (as used in the forward-algorithm example later): AT→AT 0.95, AT→CG 0.05, CG→CG 0.9, CG→AT 0.1.]


CpG islands, part 2

CpG islands can be modelled using a Hidden Markov Model

Parameters:
– 2 sets of states A, C, G and T; one set for "CpG island" and another one for "non-CpG island"
– Transition probabilities between the different states (inside the 2 sets, and between them)
– Observation probabilities

Page 19: Stochastic processes and Hidden Markov Models


CpG islands, part 2

[Figure: two sets of four states A, C, G, T, one set for "CpG island" and one for "non-CpG island", with transitions within and between the two sets.]

The green transitions are similar to the probabilities defined by our “+” and “-” Markov models for CpG islands.

The red transitions incorporate the probabilities of entering or leaving a CpG island (absent from the previous model).


Fitting of a HMM

If the sequence of states is known, the probabilities can be trained by maximum likelihood, exactly like our Markov models for DNA heterogeneity / CpG islands.
– However, we cannot use 2 separate sets of data for CpG islands and non-CpG islands: we require a set of annotated sequences with transitions from one state to the other, in order to estimate these transition probabilities.

In most cases, however, we know only the sequence of symbols, and not the underlying states.

In this case, there is no closed-form equation to estimate the parameters, and a (usually iterative) optimisation procedure must be used, starting from a random estimate of the parameters (starting point).

Page 20: Stochastic processes and Hidden Markov Models


Fitting of a HMM

Given the general architecture of an HMM (number of states, possible transitions), and a list of training sequences, the standard algorithm for estimating the transition and emission probabilities is the Baum-Welch training algorithm (based on the Expectation-Maximization (EM) algorithm)

An alternative training algorithm is the Viterbi training algorithm, which calculates the most probable paths for the training sequences and re-estimates the probabilities from these results.


Expectation-Maximization (EM) algorithm

The EM algorithm allows us to perform Maximum Likelihood estimation when some data is missing (in this case, the unknown sequence of states).

It requires a “starting point”, i.e. an initial guess of the parameters of the model

It consists of iterating two steps:
– E-step: using the current model, estimate the missing observations
– M-step: use the observations (real + estimated) to find a new model by maximum likelihood

It can be shown that the likelihood increases with each iteration, so the new model always gets better, until we reach a (hopefully global) maximum.

Used to solve many different problems in bioinformatics

Page 21: Stochastic processes and Hidden Markov Models


Questions

Given an HMM and a (new) sequence, we may want to know:
– What is the probability that the sequence was generated by this model (scoring problem)? Solved by the Forward algorithm.
– What is the most probable path of states followed by the HMM when generating this sequence (the "annotation" problem, or decoding)? Solved by the Viterbi algorithm.
– What is the most probable state for a given observation (symbol) (posterior probability)? Solved by the backward algorithm.

These elegant algorithms are based on a similar concept, dynamic programming; we will detail only the forward algorithm.


Forward algorithm

If the number of states increases, the number of possible paths increases exponentially. For most models, it is not possible to enumerate all the paths.

The "trick" is to recognise the following fact (based on the Markov property):
– The probability that the k-th character of a given sequence was generated by state i depends only on which state generated the (k-1)-th character, and on the transition probabilities between the states.
– The actual path that was followed before is not relevant.

This allows us to calculate the total probability recursively.

Page 22: Stochastic processes and Hidden Markov Models


Forward algorithm: example

What is the probability that our model for DNA heterogeneity produced the sequence ATG?

Naïve approach: 3 symbols, 2 possible states for each symbol, 8 paths to consider.

[Trellis: a start state S, the states AT and CG at each of the positions s1 = A, s2 = T, s3 = G, and an end state E; transition probabilities S→AT 0.8, S→CG 0.2, AT→AT 0.95, AT→CG 0.05, CG→CG 0.9, CG→AT 0.1, and probability 1.0 into E.]

α1(AT) = PAT(A) · 0.8 = 0.4 · 0.8 = 0.32
α1(CG) = PCG(A) · 0.2 = 0.1 · 0.2 = 0.02


Forward algorithm

α1(AT) = PAT(A) · 0.8 = 0.4 · 0.8 = 0.32
α1(CG) = PCG(A) · 0.2 = 0.1 · 0.2 = 0.02

α2(AT) = PAT(T) · (0.95 · α1(AT) + 0.1 · α1(CG)) = 0.1224
α2(CG) = PCG(T) · (0.05 · α1(AT) + 0.9 · α1(CG)) = 0.0034

[Same trellis as above, for s1 = A, s2 = T, s3 = G.]

Page 23: Stochastic processes and Hidden Markov Models


Forward algorithm

[Same trellis as above, for s1 = A, s2 = T, s3 = G.]

α1(AT) = PAT(A) · 0.8 = 0.4 · 0.8 = 0.32
α1(CG) = PCG(A) · 0.2 = 0.1 · 0.2 = 0.02
α2(AT) = PAT(T) · (0.95 · α1(AT) + 0.1 · α1(CG)) = 0.1224
α2(CG) = PCG(T) · (0.05 · α1(AT) + 0.9 · α1(CG)) = 0.0034
α3(AT) = PAT(G) · (0.95 · α2(AT) + 0.1 · α2(CG)) = 0.0116
α3(CG) = PCG(G) · (0.05 · α2(AT) + 0.9 · α2(CG)) = 0.0037

α = α3(AT) + α3(CG) = 0.0153
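A minimal numpy sketch of this forward recursion on the same two-state model (start probabilities 0.8/0.2, emission and transition values as in the slides); it reproduces the total probability 0.0153 computed above:

    import numpy as np

    start = np.array([0.8, 0.2])                    # S -> AT, S -> CG
    trans = np.array([[0.95, 0.05],                 # AT -> AT, AT -> CG
                      [0.10, 0.90]])                # CG -> AT, CG -> CG
    emit = {"A": np.array([0.4, 0.1]),              # P_AT(x), P_CG(x)
            "C": np.array([0.1, 0.4]),
            "G": np.array([0.1, 0.4]),
            "T": np.array([0.4, 0.1])}

    def forward(seq):
        """Total probability of a sequence, summed over all state paths."""
        alpha = start * emit[seq[0]]                # alpha_1
        for symbol in seq[1:]:
            alpha = emit[symbol] * (alpha @ trans)  # alpha_k from alpha_{k-1}
        return alpha.sum()

    print(round(forward("ATG"), 4))                 # -> 0.0153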


Viterbi algorithm and posterior decoding

For a given sequence, the Viterbi algorithm finds the most likely path across the model.

In some cases, this may not be the most relevant information (e.g. if two paths have very close probabilities). Posterior decoding using the backward algorithm can find the most probable state for a given observation, summed over all the possible paths (the Viterbi algorithm considers the maximum over all paths instead of the sum).

[Same trellis as above, for the observations A, T, G.]
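For comparison, a minimal Viterbi sketch for the same two-state model (same parameters as the forward sketch above); it returns the single most likely state path, working in log space to avoid numerical underflow:

    import numpy as np

    start = np.array([0.8, 0.2])
    trans = np.array([[0.95, 0.05],
                      [0.10, 0.90]])
    emit = {"A": np.array([0.4, 0.1]), "C": np.array([0.1, 0.4]),
            "G": np.array([0.1, 0.4]), "T": np.array([0.4, 0.1])}
    states = ["AT", "CG"]

    def viterbi(seq):
        """Most likely state path for a sequence."""
        v = np.log(start) + np.log(emit[seq[0]])    # best log-probability ending in each state
        back = []                                   # backpointers, one array per position
        for symbol in seq[1:]:
            scores = v[:, None] + np.log(trans)     # scores[i, j]: come from state i, move to j
            back.append(scores.argmax(axis=0))
            v = scores.max(axis=0) + np.log(emit[symbol])
        path = [int(v.argmax())]
        for ptr in reversed(back):                  # trace the best path backwards
            path.append(int(ptr[path[-1]]))
        return [states[i] for i in reversed(path)]

    print(viterbi("ATG"))                           # ['AT', 'AT', 'AT'] for this sequence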

Page 24: Stochastic processes and Hidden Markov Models


A longer example

Several hundred bases were generated using the model for heterogeneity of DNA

The resulting sequence was “decoded” using the Viterbi algorithm to find the sequence of states that is most likely.


Obs. seq: TTATTTAACTTAATAAATATGTCAATCAATTTTCTGCTTCAGTTCAGTAGGGGAACATCATACTTGGAAAGGAAATATAAReal state: 11111111111111111111111111111111111111111111111112222111111111111111111111111111Pred.state: 11111111111111111111111111111111111111111111111112222111111111111111111111111111

CTTTATTGCATATTAAGGCGCCACGGGGCCGGCGCGCGGGCATAAAATAGCTATTTTTCTTCGTATGTGAATGAATCCAT1111111111111111122222222222222222222222221111111122111111111111111111111111111111111111111111112222222222222222222222222111111111111111111111111111111111111111

TATCCAAATCATTTGATCCATTTAAATTTTATTATGTTTTTCAGCTGTAACAGTAAAGTTTTTACACGCAATTGTGAAGA1111111111111112111111111111111111111111111111111111111111211112222222211111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

ATCATCTCAAAAGAAATAAAAAATTTAAGGTGACCCGGGACGGCGCCTGAATTTAGAATCGCCGGGACCCCAGCGACTTG1111111111111111111111111111122222222222222222221111111111122222222222222222221111111111111111111111111111112222222222222222222111111111111222222222222222222222

CCACTTTGGGTTCGATATGAATATTATCAACGTGGGGGCCGACCGCGCTTAATAAAATATTATAGTGCAAAAATATAAAG2222111112211111111111111111111222222222222221222221111111111111111111111111111122111111111111111111111111111122222222222222222211111111111111111111111111111111

ATCTAATATTGCAGTTCAAATTTGTAATATATATTTTGCCGAAATATGTGCGCGTGGCCCTTTTGACTTAACATTATTCG1111111111112111111111112111111111111122221111111222222222222111111111111111111111111111111111111111111111111111111112222111111112222222222211111111111111111122

GCACAGCCCCTGCGCCGAAGAATTAACAATAGATATATTTCATATTTAATACGTAAAACAAAGATTACATTTGCATAGAT2222222222222222222221111111111111111111111111111111111111111111111111111111111122222222222222222111111111111111111111111111111111111111111111111111111111111111

TAGGCCTAACCTCGATGATGGTATATAACTATTAATTTGATAATAATAATGGGTTCAATCTATTTTTACGCCCCCACATA1111111111221111111111111111111111111111111111111112222221111111111112222222221111222211111111111111111111111111111111111111111111111111111111111111222222211111

Page 25: Stochastic processes and Hidden Markov Models


Some comments

The model is unable to find an "island" of one state "2" within a series of states "1" (or the opposite), because two state transitions in a row are much less likely than any observation.

Since we work with random processes, this is expected: a misleading sequence of observations can always appear at random.

Longer series in one state are more likely to be found, because the model has more evidence in hand to find them, and longer series are more likely to be real and less likely to happen just by chance.

Boundaries of regions are often incorrect, because a nucleotide at the boundary will be "attached" to the region it is most likely from.


Coupling of state and observation

One problem with HMMs is the fact that the states and observations are coupled. If a symbol is emitted for each state and the probability of staying in a given state is (1-p), the number of symbols emitted will follow a geometric distribution (mean 1/p).

Variants have been developed where state and observation are decoupled; they are called generalized HMMs (GHMMs), HMMs with duration, or hidden semi-Markov models. With a GHMM, states do not loop on themselves to repeat symbols; instead, each state can emit a number of symbols according to an arbitrary distribution.

[Figure: a state with self-loop probability 1-p and exit probability p.]

Page 26: Stochastic processes and Hidden Markov Models


Gene finding

Given all the genomes that have been sequenced in the past few years, gene finding is an important (and challenging) task.

By "gene finding" we can mean:
– Find whether a particular piece of DNA sequence is in a gene or not (segmentation)
– Find and annotate all the different parts of a gene
– … and anything in between


Page 27: Stochastic processes and Hidden Markov Models


A very simple model for prokaryotes

Calculate the frequency of appearance of bases (or codons) in genes, and in intergenic regions

Similar to the models we used for the heterogeneity of DNA (segmentation)

This could be modelled using an HMM with this architecture and the relevant observation probabilities.

[Figure: two states, "Non-coding region" and "Coding region", each with a self-loop (probabilities 1-p and 1-q) and transitions between them (probabilities p and q).]


More sophisticated model

A more sophisticated model uses frame-dependent composition

[Figure: a "Non-coding region" state and three coding states Position 1, Position 2, Position 3 (one per codon position); Position 1 → Position 2 → Position 3 with probability 1, Position 3 looping back to Position 1 (probability 1-q) or returning to the non-coding state (probability q), which enters Position 1 with probability p and stays non-coding with probability 1-p.]

Page 28: Stochastic processes and Hidden Markov Models


More sophisticated model (2)

We can modify our model so that it requires a translation initiation site (ATG), as well as a translation termination site (TAA, TAG or TGA). Note that "Coding region" and "Non-coding region" can themselves be HMMs (not detailed here).

[Figure: the non-coding state enters an A→T→G start-codon submodel, followed by the coding region, then one of three stop-codon submodels (T→A→A, T→A→G or T→G→A) leading back to the non-coding region.]


More sophisticated models

We can build more complicated models on top of the simpler ones:
– Promoters, second strand, etc.

For every model, we hope that it will be better at finding real genes than the previous ones

However, if the model gets more complicated, we have more parameters to estimate.

Page 29: Stochastic processes and Hidden Markov Models


Human gene finding

Finding genes in eukaryotic genomes is complicated by the presence of exons and introns.

Many gene finders have been developed specifically for finding genes in the human genome.

GENSCAN is currently one of the most popular and successful human gene finders.

It is based on a GHMM.

Burge et al., J. Mol. Biol., 1997


[Figure: the GENSCAN GHMM state diagram, with intergenic region, promoter, 5'UTR, exon states (E0, E1, E2 and initial/terminal/single-exon states), intron states (I0, I1, I2), 3'UTR and poly-A signal, mirrored on the forward (+) and reverse (-) strands, applied to an example human genomic sequence (positions 62001-63750).]

Page 30: Stochastic processes and Hidden Markov Models


Assessing performance

The performance of a predictive model can be assessed using a second dataset (the testing set), different from the one used to train the model (the learning or training set). We assume that we know the "truth" in each dataset.

Important quantities:
– TP (True Positives): number of positive predictions which are correct

– FP (False Positives): number of positive predictions which are incorrect (i.e. spurious prediction of a site)

– TN (True Negatives): number of negative predictions which are correct

– FN (False Negatives): number of negative predictions which are incorrect (i.e. true site, missed).


Assessing performance

Most of the time, the number of False Positives and False Negatives (i.e., the errors that we would like to minimise) is not fixed, but depends on a continuous score (e.g. the probability returned by the forward algorithm) on which we set a threshold.

Changing this threshold will change the number of errors; a higher threshold (i.e. a more stringent criterion for classifying an observation as "positive") will usually result in a lower number of False Positives, but a higher number of False Negatives.

Page 31: Stochastic processes and Hidden Markov Models


Specificity and sensitivity

Sensitivity is TP/(TP+FN), the proportion of true sites that are correctly found and annotated.

Specificity is generally defined as TN/(TN+FP), the proportion of false sites that are correctly annotated.

In some cases, specificity is instead defined as TP/(TP+FP), the proportion of all positive predictions that are correct. This is especially useful if the proportion of false sites correctly annotated is high, in which case TN/(TN+FP) will not be sensitive to changes in FP.

The false positive rate is 1 - specificity.

S.E. Cawley, A.I. Wirth and T.P. Speed, Mol. and Biochem. Parasitology, 118(2), 167-174 (2001).


Graphical representation

[Figure: distributions of scores for negative and positive results along a score axis, with a chosen threshold; observations above the threshold are classified as "positive" and those below as "negative", defining the TN, FN, FP and TP areas.]

The two distributions overlap, so it is impossible to use this score to perfectly discriminate between positive and negative results.

Page 32: Stochastic processes and Hidden Markov Models



Comparison of different models

Different models can be applied to a given statistical problem, as we have seen in the case of gene finding.

ROC (Receiver Operating Characteristic) curves help summarise the FP/FN rates of one or several models.

Page 33: Stochastic processes and Hidden Markov Models


ROC curve

[Figure: ROC curve, plotting the true positive rate (sensitivity) on the vertical axis against the false positive rate (1 - specificity) on the horizontal axis, both from 0% to 100%.]


ROC curve

[Same ROC axes as above.]

The Area Under the Curve (AUC) provides a way to compare the global performance of different models.

A perfect model would have an AUC of 1

A random model would have an AUC of 0.5 (no better than chance).

Models in between can have AUCs around 0.9 (excellent), 0.8 (good), 0.7 (fair), 0.6 (poor).
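As an illustration of how an ROC curve and its AUC are obtained in practice, a minimal sketch using scikit-learn; the labels and scores below are made-up stand-ins for, e.g., forward-algorithm scores on a labelled test set:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])          # 1 = real site
    y_score = np.array([2.3, 1.1, 0.4, -0.2, 0.8, -0.5, -1.0, -1.7, 0.1, -2.4])

    fpr, tpr, thresholds = roc_curve(y_true, y_score)           # one (FPR, TPR) point per threshold
    print("AUC =", roc_auc_score(y_true, y_score))              # area under the ROC curve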

Page 34: Stochastic processes and Hidden Markov Models


ROC curves and other methods

Problems with ROC curves:
– When comparing two curves, the comparison is not straightforward if the two curves cross once or several times.
– Most of the time, we are only interested in some regions of the graph, typically the low-FP region (we usually do not care about an algorithm that produces 85% false positives, even if it produces no false negatives).

– The comparison between models does not take into account the complexity of the model; a model A may be slightly better than model B, but if model A is much more complicated (higher number of parameters), the gain may not be worth the work. Comparison methods that penalise more complicated models also exist (for example, AIC or BIC).


What is still missing with HMMs ?

By default, HMMs use a Markov model of order 1, i.e. the transition and emission probabilities depend only on the current state.

This means that they cannot take into account dependencies between the sites.

Models of higher orders, which solve this problem, can be transformed into models of order 1 by increasing the number of states

Doing so is straightforward, but the complexity of the resulting models increases exponentially and they are quickly computationally infeasible.

In general, this problem is unsolved.

Page 35: Stochastic processes and Hidden Markov Models


What is still missing with HMMs ?

Example: probabilities of observation of nucleotides in DNA in a Markov model

Increase the number of states by having a state for each digram instead of each symbol

Order-1 model (one state per symbol):

          A     C     G     T
    A  0.18  0.27  0.43  0.12
    C  0.17  0.37  0.27  0.19
    G  0.16  0.34  0.38  0.12
    T  0.08  0.36  0.38  0.18

Order-1 encoding of a higher-order model (one state per digram):

           A     C     G     T
    AA  0.18  0.27  0.43  0.12
    AC  0.17  0.37  0.27  0.19
    AG  0.16  0.34  0.38  0.12
    AT  0.17  0.37  0.27  0.19
    CA  0.16  0.34  0.38  0.12
    CC  0.08  0.36  0.38  0.18
    …     …     …     …     …
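A minimal sketch of this state expansion: given second-order probabilities P(c | a, b), build an equivalent first-order chain whose states are digrams (the probability table used here is a made-up uniform placeholder):

    from itertools import product

    ALPHABET = "ACGT"

    def expand_to_first_order(p2):
        """Turn second-order probabilities p2[(a, b)][c] = P(c | a, b) into a
        first-order chain over digram states: (a, b) -> (b, c) with that probability."""
        trans = {}
        for a, b in product(ALPHABET, repeat=2):
            trans[(a, b)] = {(b, c): p2[(a, b)][c] for c in ALPHABET}
            # every other transition (a, b) -> (x, c) with x != b has probability 0
        return trans

    p2 = {(a, b): {c: 0.25 for c in ALPHABET} for a, b in product(ALPHABET, repeat=2)}
    first_order = expand_to_first_order(p2)
    print(first_order[("A", "C")])   # {('C', 'A'): 0.25, ('C', 'C'): 0.25, ...}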


Other stochastic models

Many other stochastic models are available, such as:
– Variable Length Markov Models: many variants; use models of order higher than 1, but only when needed, and simpler models otherwise; a compromise to limit the increase in complexity. Represented using trees.
– Permuted Variable Length Markov Models: similar to VLMMs, but the sites may be reordered, so that the "previous" site in the probabilities may not be the previous site in the sequence; this allows simple but long-range effects to be modelled.

Page 36: Stochastic processes and Hidden Markov Models


Some caveats…

Some care has to be taken, because these models can break down at any time, without warning!

Ideally, any prediction made with a statistical model should be confirmed experimentally.

However, these models have been hugely successful, as shown by our examples and references


Introduction: Coiled-coil domains

Coiled-coil domains (CCD) are a protein motif found in many types of proteins.

They consist of two or more identical strands of amino acid sequence forming alpha-helices that are wrapped around each other, forming one of the simplest tertiary structures.

Most coiled-coil domains contain a pattern called the heptad repeat: seven residues of the form abcdefg, where the a and d positions are generally hydrophobic.

Page 37: Stochastic processes and Hidden Markov Models


Coiled-coil domains

This apparent structure is a good candidate for statistical modeling.

Methods used for modeling:
– PSSM
– HMM

For the practical work, we will use an HMM to model CCDs.


References (web)

Wikipedia (http://en.wikipedia.org/) contains much information on HMMs and other topics

Introduction to Coiled Coils http://www.lifesci.sussex.ac.uk/research/woolfson/html/coils.html

Terry Speed's courses at the University of California, Berkeley: http://www.stat.berkeley.edu/users/terry/Classes/index.html

Page 38: Stochastic processes and Hidden Markov Models


References (General articles)

Sean R. Eddy, “What is a hidden Markov Model ?”, Nat. Biotech. 22, p. 1315-1316 (2004).

Karen Heyman, “Gene Finding with Hidden Markov Models”, The Scientist, 19(6), p. 26 (2005).

L.R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proc. IEEE. 77(2), p. 257-285 (1989).


References (Specific articles)

p53 transcription factor:
– J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", PNAS 99(13), 25 June 2002, 8467-8472.

– C.-L. Wei et al., “A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome”, Cell 124, 13 January 2006, 207-219.

Other stochastic models:
– X. Zhao, H. Huang and T.P. Speed, "Finding short DNA motifs using permuted Markov models", Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB 2004).

Page 39: Stochastic processes and Hidden Markov Models


References (Specific articles)

S. Tavaré, "Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences", Lectures on Mathematics and Life Sciences, American Mathematical Society, vol. 17, 1986.

L. Peshkin and M.S. Gelfand, "Segmentation of yeast DNA using hidden Markov models", Bioinformatics 15(12), p. 980-986 (1999).

C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA", J. Mol. Biol. 268, p. 78-94 (1997).

M. Delorenzi and T. Speed, "An HMM model for coiled-coil domains and a comparison with PSSM-based predictions", Bioinformatics 18(4), p. 617-625 (2002).


References (Books)

R. Durbin, S. Eddy, A. Krogh, G. Mitchison, "Biological sequence analysis", Cambridge University Press, 1998.

T. Koski, “Hidden Markov Models for Bioinformatics”, Kluwer Academic Publishers, 2001.

W.J. Ewens and G.R. Grant, “Statistical Methods in Bioinformatics” (2nd edition), Springer, 2005.

S.M. Ross, “Stochastic Processes” (2nd edition), John Wiley and Sons, 1996.

Page 40: Stochastic processes and Hidden Markov Models


Acknowledgements

Many slides and ideas were borrowed from talks and courses by Terry Speed (University of California, Berkeley, and WEHI, Melbourne, Australia).


More questions ?

[email protected]

[email protected]

