Page 1: Markov Models and HMM in Genome Analysis - Unistra · Markov Models and HMM in Genome Analysis Bernard PRUM La genopole – Evry – France prum@genopole.cnrs.fr ... Pierre BREZELEC

Markov Models and HMM in Genome Analysis

Bernard PRUM

La genopole, Evry, France (prum@genopole.cnrs.fr)

20th anniversary of the Magistère de Mathématiques, Strasbourg, 20 September 2007


Why Markov Models?

A biological sequence:

X = (X1, X2, … , Xn)

where Xk ∈ A = { t , c , a , g } or {A C D E F G H I K L M N P Q R S T V W Y}

A very common tool for analyzing these sequences is the Markov Model (MM)

P(Xk = v | Xj , j < k) = P(Xk = v | Xk–1), denoted by π(u , v) if Xk–1 = u, for u, v ∈ A


All the properties of the sequence (periodicity, « exceptional character » of a word, etc.) depend on

• the frequency of each letter → sufficient statistic for model M0 (Bernoulli),

• the frequency of the 2-words tt, tc, ta, tg, …, gg → sufficient statistic for model M1 (Markov).

Example: HIV1 (rows: first letter; columns: second letter; last column: row total)

        t      c      a      g    total
t     548    342    684    590     2164
c     470    413    795     95     1773
a     713    561   1112   1024     3410
g     432    457    820    661     2370

Markov models: not mimicking Nature, but taking into account essential information.
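As an illustration of how the 2-word counts act as a sufficient statistic for M1, the sketch below turns the HIV1 table into estimated transition probabilities π(u, v) = N(uv) / N(u·). The function names are hypothetical and the row/column reading of the table (first letter in rows) is an assumption.

```python
LETTERS = "tcag"

# 2-word counts for HIV1 as read from the table above
# (assumed layout: row = first letter, column = second letter, order t, c, a, g).
COUNTS = {
    "t": [548, 342, 684, 590],
    "c": [470, 413, 795, 95],
    "a": [713, 561, 1112, 1024],
    "g": [432, 457, 820, 661],
}

def count_2words(seq):
    """Count overlapping 2-words in a sequence (sufficient statistic for M1)."""
    counts = {u: [0] * len(LETTERS) for u in LETTERS}
    for u, v in zip(seq, seq[1:]):
        counts[u][LETTERS.index(v)] += 1
    return counts

def estimate_transitions(counts):
    """Maximum-likelihood estimate pi(u, v) = N(uv) / N(u.)."""
    pi = {}
    for u, row in counts.items():
        total = sum(row)
        pi[u] = {v: n / total for v, n in zip(LETTERS, row)}
    return pi

pi_hat = estimate_transitions(COUNTS)
for u in LETTERS:
    print(u, " ".join(f"{pi_hat[u][v]:.3f}" for v in LETTERS))
```

With these counts the estimate of π(c, g) is 95 / 1773, far below the other entries of the c row, which is the kind of information a letter-frequency model (M0) cannot capture.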


Results MM


Parsimonious Markov Models

When we model a sequence in order to find exceptional motifs or for annotation, we have to estimate the parameters of the model, and the more parameters we have, the worse the estimation.

In a Markov Model of order m, there are 4^m predictors (the m-words), hence 3 × 4^m parameters.

In the M2 model, there are 16 predictors and 48 parameters.

In M5, there are 1024 predictors and 3072 parameters.
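The counting argument can be checked with a few lines of code. This is only a sketch; the function name is hypothetical.

```python
def markov_model_size(m, alphabet_size=4):
    """Number of predictors (m-words) and free parameters of an order-m model."""
    predictors = alphabet_size ** m
    # each predictor carries a law on the alphabet, i.e. alphabet_size - 1 free values
    parameters = (alphabet_size - 1) * predictors
    return predictors, parameters

for m in range(6):
    print(m, markov_model_size(m))
```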


PMM – 2

Various models pursue “parsimony”:
MTD = Mixed Transition Distribution
VLMC = Variable Length Markov Chain
PMM = Parsimonious Markov Models

[Figure: dependence tree; its leaves are the predictors below, where [tc] means "t or c"]

P(Xt = u | Xt–1 = [tc])

P(Xt = u | Xt–1 = a, Xt–2 = c)

P(Xt = u | Xt–1 = a, Xt–2 = [tag])

P(Xt = u | Xt–1 = g)

A Parsimonious Markov Model is defined by
• such a dependence tree τ,
• for each leaf (= for each predictor), a law on A.


PMM – 3

More precisely: in the tree of predictors (*), below any node, any partition of A = { t , c , a , g } may appear.

(*) The different predictors appear in this tree as the paths from the leaves to the root.

Hence, there are 15 possibilities below each node.
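The figure 15 is the number of partitions of a 4-element set (the Bell number B4). A short enumeration, given here as an illustrative sketch with a hypothetical helper name, confirms it:

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # either `first` forms a block of its own ...
        yield [[first]] + partition
        # ... or it joins one of the existing blocks
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]

partitions = list(set_partitions(list("tcag")))
print(len(partitions))
for p in partitions:
    print(" | ".join("".join(block) for block in p))
```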

More models, fewer parameters


Application

This picture plots BIC(MM) – BIC(PMM) against the size of the bacterial genome. For all the bacteria, PMM fits better than the classical MM.


Approach using FDA

Recent results (Gregory Nuel) concern the use of «Finite (Deterministic) Automata» in the statistics of words or patterns.

To a word, we may associate an FDA.
Example 1: on {a,b}, w = aaab
States: b, a, aa, aaa, aaab

[Figure: transition graph of the automaton on these states, with edges labelled a and b]

This can be generalized if “one” word w is replaced by a motif (finite family of words) or even a language.


Approach using FDA

This automaton is especially dedicated to the study of the word w (the motif, ...): if we “run” a sequence on this graph, the automaton counts the occurrences of w (the motif, ...).
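A minimal sketch of such an automaton for a single word follows. It is not Nuel's program: the construction below is the textbook prefix automaton, with states numbered by the length of the matched prefix (state 0 plays the role of the state "b" in the example w = aaab), and all function names are hypothetical.

```python
def build_word_automaton(word, alphabet):
    """delta[state][letter] -> next state.

    State i means: the longest suffix of what has been read that is also
    a prefix of `word` has length i. State len(word) signals an occurrence.
    """
    n = len(word)
    delta = []
    for i in range(n + 1):
        row = {}
        for letter in alphabet:
            candidate = word[:i] + letter
            k = min(n, len(candidate))
            # shrink until the suffix of `candidate` is a prefix of `word`
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            row[letter] = k
        delta.append(row)
    return delta

def count_occurrences(seq, word, alphabet):
    """Run `seq` on the automaton and count (overlapping) occurrences of `word`."""
    delta = build_word_automaton(word, alphabet)
    state, count = 0, 0
    for letter in seq:
        state = delta[state][letter]
        if state == len(word):
            count += 1
    return count

print(count_occurrences("aaabaaaab", "aaab", "ab"))
```

The sequence is read once, one table lookup per letter, whatever the length of the word.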

It turns out to be VERY efficient: “wordcount”, a program in EMBOSS, needs 4352 seconds to count the occurrences of all 12-words in E. coli; Nuel’s program achieves this task in 9.86 seconds.

The prosite motif

[LIVMF]GE.[GAS][LIVM].(5-11)R[STAQ]A.[LIVAM].[STACV]

(some 10^12 words) is treated by an FDA of 329 (30) states in M0 and 1393 (78) states in M1.


Approach using FDA

If the sequence X is a Markov chain, we then have another MC running on this graph.

Even for “rather complicated motifs”, this makes it possible to obtain the law of “all” statistics of words:
- exact law of the first occurrence of a motif (taking into account the “starting point”),
- exact law of the number of occurrences of the motif,
- in particular the expectation and variance of these laws, opening the possibility of Gaussian, Poisson, ... approximations (and an exhaustive study of the quality of these approximations),
- law of a motif M conditionally on the number of occurrences of another one, M’.
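To make the idea concrete, here is a sketch of the exact law of the number of occurrences of a single word, obtained by propagating probabilities on the pair (automaton state, count so far). It assumes independent letters (model M0) to stay short; under M1 one would also have to track the previous letter. The automaton construction and all names are illustrative assumptions, not the author's implementation.

```python
from collections import defaultdict

def build_word_automaton(word, alphabet):
    """Prefix automaton: state i = length of the longest matched prefix of `word`."""
    n = len(word)
    delta = []
    for i in range(n + 1):
        row = {}
        for letter in alphabet:
            candidate = word[:i] + letter
            k = min(n, len(candidate))
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            row[letter] = k
        delta.append(row)
    return delta

def occurrence_count_law(word, letter_probs, n):
    """Exact law of the number of overlapping occurrences of `word`
    in a sequence of n independent letters with law `letter_probs`."""
    delta = build_word_automaton(word, list(letter_probs))
    dist = {(0, 0): 1.0}  # (automaton state, count) -> probability
    for _ in range(n):
        new_dist = defaultdict(float)
        for (state, count), p in dist.items():
            for letter, q in letter_probs.items():
                nxt = delta[state][letter]
                hit = 1 if nxt == len(word) else 0
                new_dist[(nxt, count + hit)] += p * q
        dist = new_dist
    law = defaultdict(float)
    for (_, count), p in dist.items():
        law[count] += p
    return dict(law)

law = occurrence_count_law("ab", {"a": 0.5, "b": 0.5}, 3)
mean = sum(c * p for c, p in law.items())
print(law, mean)
```

Expectation and variance follow directly from the returned law, which is what allows the quality of Gaussian or Poisson approximations to be checked.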


Other field for « genomic MC »

Markov chains are a central tool for the inference of genetic networks

= a central domain for the next 20 years


Hidden Markov Models


Hidden Markov Models

An important criticism of Markov modeling is its stationarity: a well-known theorem says that, under weak conditions,

P(Xk = u) → µ(u) (when k → ∞)

(and the rate of convergence is exponential.)
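The convergence can be observed numerically. The sketch below uses a hypothetical 2-letter transition matrix chosen only for illustration; its stationary law is µ = (0.8, 0.2) and the distance to µ is halved at each step.

```python
def step(dist, P):
    """One step of the chain: new_dist(v) = sum_u dist(u) * P[u][v]."""
    n = len(P)
    return [sum(dist[u] * P[u][v] for u in range(n)) for v in range(n)]

# Hypothetical transition matrix (not taken from the slides).
P = [[0.9, 0.1],
     [0.4, 0.6]]

dist = [1.0, 0.0]  # the chain starts from the first letter with probability 1
history = [dist]
for _ in range(50):
    dist = step(dist, P)
    history.append(dist)

mu = [0.8, 0.2]    # solves mu P = mu for this matrix
errors = [abs(d[0] - mu[0]) for d in history]
print(errors[:6])
```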

But biological sequences are not homogeneous. There are g+c rich segments / g+c poor segments (isochores).

One may presume (and verify) that the rules of succession of letters differ in coding parts / non-coding parts.

Is it possible to take advantage of this problem and to develop a tool for the analysis of heterogeneity?

=> annotation


HMM – 2

Suppose that d states alternate along the sequence:

[Figure: the sequence divided into successive segments Sk = 1, Sk = 2, Sk = 1, Sk = 2, Sk = 1]

and in each state we have a MC:
if Sk = 1, then P(Xk = v | Xk–1 = u) = π1(u ; v)
if Sk = 2, then P(Xk = v | Xk–1 = u) = π2(u ; v)

and (more technical than biological - see HSMM)

P(Sk = y | Sk–1 = x) = π0(x ; y)

Our objectives
• Estimate the parameters π1, π2, π0
• Allocate a state in {1, 2} to each position


HMM – 3

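Use the likelihood!!

$$L(\theta) = \sum_{S_1 S_2 \dots S_n} \mu_0(S_1)\, \mu_{S_1}(X_1) \prod_{k=2}^{n} \pi_0(S_{k-1}, S_k)\, \pi_{S_k}(X_{k-1}, X_k)$$

Each term is a product over the n positions (n = length of the sequence). The sum runs over all possibilities S1S2...Sn; there are s^n terms:

2^10 000 ≈ 10^3 000. Despair!!!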
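The sketch below evaluates this likelihood in two ways on a toy model: by summing over all hidden paths, as the formula is written, and by a standard forward recursion costing O(n s²). The forward recursion is a well-known alternative way of computing the same sum, not something shown on the slide, and every parameter value is hypothetical.

```python
from itertools import product

# Hypothetical 2-state model on the alphabet {a, b}.
mu0 = [0.5, 0.5]                                   # law of S1
mu = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}]  # law of X1 in each state
pi0 = [[0.9, 0.1], [0.2, 0.8]]                     # hidden-state transitions
pi = [                                             # letter transitions in each state
    {"a": {"a": 0.8, "b": 0.2}, "b": {"a": 0.6, "b": 0.4}},
    {"a": {"a": 0.3, "b": 0.7}, "b": {"a": 0.1, "b": 0.9}},
]

def likelihood_brute_force(x):
    """Sum over all s^n hidden paths, term by term."""
    total = 0.0
    for path in product(range(len(mu0)), repeat=len(x)):
        p = mu0[path[0]] * mu[path[0]][x[0]]
        for k in range(1, len(x)):
            p *= pi0[path[k - 1]][path[k]] * pi[path[k]][x[k - 1]][x[k]]
        total += p
    return total

def likelihood_forward(x):
    """Same quantity with f_k(v) = P(X_1..X_k, S_k = v), updated position by position."""
    s = len(mu0)
    f = [mu0[v] * mu[v][x[0]] for v in range(s)]
    for k in range(1, len(x)):
        f = [sum(f[u] * pi0[u][v] for u in range(s)) * pi[v][x[k - 1]][x[k]]
             for v in range(s)]
    return sum(f)

x = "abbabaab"
print(likelihood_brute_force(x), likelihood_forward(x))
print(len(str(2 ** 10000)), "decimal digits in 2^10000")
```

The brute-force version is only usable for very short sequences, which is the point of the "despair" remark.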


HMM – 4

Idea: use an E.M. algorithm

What is an E.M. algorithm?

Population:
men: height ~ N(µH ; σH²)
women: height ~ N(µF ; σF²)

[Figure: the two height distributions, with unknown parameters marked "?"]


HMM – 4

We do not know which points are men and which are women.

We arbitrarily allocate points to each category.

[Figure: the data points, arbitrarily coloured as men or women]


It is now very easy to estimate the expectation µ and the variance σ² for each category, and then to obtain the density of each (erroneous!) law:

[Figure: the data points with the two estimated densities]


Then, for a given point, it is possible, using Bayes' formula, to obtain an estimate of P(• = man) and P(• = woman).

[Figure: the data points, one of them highlighted]

For the indicated point, say P(man) = 2 P(woman).


First idea:

We then allocate colours (man/woman): we decide “man” iff P(man) > P(woman).

[Figure: the data points recoloured by this rule]

Bad!


Better:

[Figure: the data points with their weights]

We give each point a weight
• within men: P(man)
• within women: P(woman)
(EM)

OR

we choose man/woman at random according to this probability (SEM)

and we iterate again.
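The weighting scheme described above can be written out for the height example. This is a minimal sketch: the data are simulated with hypothetical parameters (means 160 and 180, standard deviation 5), and the starting values are a crude guess from the data range rather than an arbitrary allocation of the points.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, n_iter=100):
    """EM for a mixture of two Gaussians; returns (means, std devs, proportions)."""
    mu = [min(data), max(data)]            # crude starting values
    spread = (max(data) - min(data)) / 4
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E step: Bayes formula gives each point its probability of belonging to each group
        resp = []
        for x in data:
            dens = [weight[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: weighted means, variances and proportions
        for j in range(2):
            nj = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigma[j] = math.sqrt(var)
            weight[j] = nj / len(data)
    return mu, sigma, weight

random.seed(0)
data = ([random.gauss(160, 5) for _ in range(500)]
        + [random.gauss(180, 5) for _ in range(500)])
mu, sigma, weight = em_two_gaussians(data)
print(mu, sigma, weight)
```

Replacing the weights in the M step by a random draw of one group per point, with these same probabilities, gives the stochastic variant (SEM).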


The EM algorithm

We do not know where Sk = 1 and where Sk = 2.

[Figure: the sequence with its unknown segments Sk = 1, Sk = 2, Sk = 1, Sk = 2, Sk = 1]

Step n° 1

We make an arbitrary allocation of these states.

“Knowing” the states, it is straightforward to compute the parameters.

Example: π1(c,g) = % of the c which are followed by a g in the green parts (state 1).

π0(x,y) = % of positions in state x which are followed by a position in state y.


Baum & Churchill formula

Step n° 2

Define
the predictive probability $\alpha_k(v) = P(S_k = v \mid X_1^{k-1})$
the filtered probability $\beta_k(v) = P(S_k = v \mid X_1^{k})$
the estimated probability $\varphi_k(v) = P(S_k = v \mid X_1^{n})$

Bayes formula

Forward (k = 2 to n):

$$\alpha_k(v) = \sum_u \beta_{k-1}(u)\, \pi_0(u, v)$$

$$\beta_{k-1}(u) = \frac{\alpha_{k-1}(u)\, \pi_u(X_{k-2}, X_{k-1})}{\sum_w \alpha_{k-1}(w)\, \pi_w(X_{k-2}, X_{k-1})}$$

Backward (k = n to 2):

$$\varphi_{k-1}(u) = \beta_{k-1}(u) \sum_v \pi_0(u, v)\, \frac{\varphi_k(v)}{\alpha_k(v)}$$
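A sketch of these recursions on a toy model is given below. Positions are numbered from 0 in the code, the first position uses the initial laws µ0 and µ_S(X1) in place of a transition, and the backward pass divides by the predictive probability of the next state, which is the standard form of the smoothing step. All parameter values and names are hypothetical.

```python
# Hypothetical 2-state model on the alphabet {a, b}.
mu0 = [0.5, 0.5]                                   # law of S1
mu = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}]  # law of X1 in each state
pi0 = [[0.9, 0.1], [0.2, 0.8]]                     # hidden-state transitions
pi = [                                             # letter transitions in each state
    {"a": {"a": 0.8, "b": 0.2}, "b": {"a": 0.6, "b": 0.4}},
    {"a": {"a": 0.3, "b": 0.7}, "b": {"a": 0.1, "b": 0.9}},
]

def forward_backward(x):
    """Return predictive (alpha), filtered (beta) and smoothed (phi) probabilities."""
    s, n = len(mu0), len(x)
    alpha, beta = [], []
    for k in range(n):                      # forward pass
        if k == 0:
            a = list(mu0)
            emit = [mu[v][x[0]] for v in range(s)]
        else:
            a = [sum(beta[k - 1][u] * pi0[u][v] for u in range(s)) for v in range(s)]
            emit = [pi[v][x[k - 1]][x[k]] for v in range(s)]
        unnorm = [a[v] * emit[v] for v in range(s)]
        total = sum(unnorm)
        alpha.append(a)
        beta.append([w / total for w in unnorm])
    phi = [None] * n                        # backward pass
    phi[n - 1] = beta[n - 1]
    for k in range(n - 1, 0, -1):
        phi[k - 1] = [
            beta[k - 1][u] * sum(pi0[u][v] * phi[k][v] / alpha[k][v] for v in range(s))
            for u in range(s)
        ]
    return alpha, beta, phi

x = "abbabaab"
alpha, beta, phi = forward_backward(x)
for k, p in enumerate(phi):
    print(k + 1, x[k], [round(v, 3) for v in p])
```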


And Sk?

Bad idea: Sk = arg max ϕk(v) (*)

First good idea: keep the distribution ϕk(v) [EM]

Second good idea: draw Sk according to ϕk(v) [SEM]

(*) except, maybe, at the end of the algorithm (freezing)


Annotation


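SHMM

New defect: the transition between states is described using a Markov model

$$\pi_0 = \begin{pmatrix} 1 - p & p \\ q & 1 - q \end{pmatrix}$$

Consequence: the lengths of the segments of ‘a given colour’ are r.v. with an exponential law (geometric, since positions are discrete).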


SHMM

This does not correspond to reality!! Histograms of ‘biological segments’ (after smoothing) look more like

[Figure: density g(h) plotted against h]

It is easy to make the probability of leaving the state depend on h, so as to get the suitable law:

$$p(h) = \frac{g(h)}{1 - G(h)}$$
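The formula can be illustrated in discrete form. The sketch below assumes that h counts positions spent in the state (h = 1, 2, ...) and takes the leaving probability as P(L = h | L ≥ h), the discrete analogue of g(h) / (1 − G(h)). The function names and the example laws are hypothetical.

```python
def leaving_probabilities(g):
    """g[h-1] = P(L = h) for h = 1, 2, ...; returns p(h) = P(L = h | L >= h)."""
    probs, survival = [], 1.0       # survival = P(L >= h)
    for gh in g:
        probs.append(gh / survival if survival > 0 else 1.0)
        survival -= gh
    return probs

def segment_law(p):
    """Inverse map: law of the segment length when the state is left
    with probability p[h-1] after h positions."""
    g, survival = [], 1.0
    for ph in p:
        g.append(survival * ph)
        survival *= 1 - ph
    return g

# A geometric law gives a constant leaving probability (the plain HMM case).
geometric = [0.2 * 0.8 ** (h - 1) for h in range(1, 30)]
print(leaving_probabilities(geometric)[:5])

# A hypothetical bell-shaped law gives a leaving probability that grows with h.
bell = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]
print(leaving_probabilities(bell))
```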


Searching nucleosomes


Searching nucleosomes

What are nucleosomes?

In eukaryotes (only), an important part of the chromosomes forms chromatin, a state where the double helix winds round “beads”, forming a necklace:

[Figure: chromatin fibre; scale bar 10 nm]

Each bead is called a nucleosome. Its core is a complex of 8 proteins (an octamer) called histones (H2A, H2B, H3, H4). DNA winds twice around this core and is locked by another histone (H1). The total weight of the histones is approximately equal to the weight of the DNA.


Nucleosomes

Curvature within curvature

Back to one nucleosome: the DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each one made of about 10 nucleotides.

Total = 146 nt within each nucleosome.

Depending on the position (“in” vs “out”), the curvature satisfies different constraints.


Bendability

Following an idea (Baldi, Lavery, ...), we introduce an index of bendability; it depends on the succession of 2, 3, 4, ... nucleotides.

[Figure: a short DNA sequence (letters a, g, c, t) shown straight and bent by an angle θ]


There exist various tables which indicate the bendability of di-, tri- or even tetra-nucleotides (PNUC, DNase, ...).

We used PNUC-3 (*):

PNUC table (columns: 2nd letter)

1st letter      a      t      g      c    3rd letter
a             0.0    2.8    3.3    5.2    a
t             7.3    7.3   10.0   10.0    a
g             3.0    6.4    6.2    7.5    a
c             3.3    2.2    8.3    5.4    a
a             0.7    0.7    5.8    5.8    t
t             2.8    0.0    5.2    3.3    t
g             5.3    3.7    5.4    7.5    t
c             6.7    5.2    5.4    5.4    t
a             5.2    6.7    5.4    5.4    g
t             2.2    3.3    5.4    8.3    g
g             5.4    6.5    6.0    7.5    g
c             4.2    4.2    4.7    4.7    g
a             3.7    5.3    7.5    5.4    c
t             6.4    3.0    7.5    6.2    c
g             5.6    5.6    8.2    8.2    c
c             6.5    5.4    7.5    6.0    c

Examples: PNUC(cga) = 8.3, PNUC(tcg) = 8.3

(*) Goodsell, Dickerson, NAR 22 (1994)
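As a sketch of how such a table can be used, the code below stores the values as read from the slide (rows keyed by 1st and 3rd letter, columns by 2nd letter; this reading of the layout is an assumption), looks up a tri-nucleotide, and lists the values along a sequence. The function names and the example sequence are hypothetical.

```python
SECOND = "atgc"  # column order of the table (2nd letter)

# (1st letter, 3rd letter) -> values for 2nd letter a, t, g, c, as read from the slide
PNUC = {
    ("a", "a"): [0.0, 2.8, 3.3, 5.2], ("t", "a"): [7.3, 7.3, 10.0, 10.0],
    ("g", "a"): [3.0, 6.4, 6.2, 7.5], ("c", "a"): [3.3, 2.2, 8.3, 5.4],
    ("a", "t"): [0.7, 0.7, 5.8, 5.8], ("t", "t"): [2.8, 0.0, 5.2, 3.3],
    ("g", "t"): [5.3, 3.7, 5.4, 7.5], ("c", "t"): [6.7, 5.2, 5.4, 5.4],
    ("a", "g"): [5.2, 6.7, 5.4, 5.4], ("t", "g"): [2.2, 3.3, 5.4, 8.3],
    ("g", "g"): [5.4, 6.5, 6.0, 7.5], ("c", "g"): [4.2, 4.2, 4.7, 4.7],
    ("a", "c"): [3.7, 5.3, 7.5, 5.4], ("t", "c"): [6.4, 3.0, 7.5, 6.2],
    ("g", "c"): [5.6, 5.6, 8.2, 8.2], ("c", "c"): [6.5, 5.4, 7.5, 6.0],
}

def pnuc(tri):
    """Bendability value of a tri-nucleotide such as 'cga'."""
    first, second, third = tri
    return PNUC[(first, third)][SECOND.index(second)]

def reverse_complement(seq):
    comp = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(comp[b] for b in reversed(seq))

def bendability_profile(seq):
    """PNUC value of every overlapping tri-nucleotide along the sequence."""
    return [pnuc(seq[i:i + 3]) for i in range(len(seq) - 2)]

print(pnuc("cga"), pnuc(reverse_complement("cga")))
print(bendability_profile("aagcttcga"))
```

A tri-nucleotide and its reverse complement describe the same piece of double helix, which is why the two example values on the slide coincide.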


HMM for nucleosomes

Better: consider a different state for “each” position in the nucleosome (say 146 states)

[Figure: chain of states 1, 2, 3, ..., 145, 146]

and the repetition of r (say 4) identical states for the between-nucleosome regions (= spacer). These sibling states give the law of the length of the between-nucleosome regions a Gamma form, which is not geometric!!
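The effect of chaining r identical states can be checked numerically: the spacer length is then a sum of r independent geometric sojourn times, whose law (a negative binomial, the discrete counterpart of a Gamma law) has its mode away from 1. The sketch below assumes each of the r states is left with the same probability p at every position; the values p = 0.2 and r = 4 are purely illustrative.

```python
def geometric_pmf(p, max_len):
    """P(L = h) for h = 0..max_len with L >= 1 (index 0 has probability 0)."""
    return [0.0] + [p * (1 - p) ** (h - 1) for h in range(1, max_len + 1)]

def convolve(f, g, max_len):
    """Law of the sum of two independent lengths, truncated at max_len."""
    out = [0.0] * (max_len + 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            if i + j <= max_len:
                out[i + j] += fi * gj
    return out

def spacer_length_pmf(p, r, max_len):
    """Law of the total time spent in r chained states, each left with probability p."""
    pmf = geometric_pmf(p, max_len)
    for _ in range(r - 1):
        pmf = convolve(pmf, geometric_pmf(p, max_len), max_len)
    return pmf

pmf = spacer_length_pmf(p=0.2, r=4, max_len=200)
mode = max(range(len(pmf)), key=pmf.__getitem__)
mean = sum(h * q for h, q in enumerate(pmf))
print(mode, round(mean, 3))
```

With a single state (r = 1) the most likely length is always 1, which is the defect of the plain geometric law.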


A no-nuc state

[Figure: state diagram with a no-nuc state, the spacer states, and the nucleosome core states 1, 2, ..., 70]

Trifonov (99) as well as Rando (05) underline that there are ‘no’ nucleosomes in gene promoters (accessibility). We then introduce, “before” the nucleosome, a “no-nucleosome” state.

Ioshikhes, Trifonov, Zhang, Proc. Natl Acad. Sci. 96 (1999)
Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005)


Acknowledgements

Labo «Statistique et Génome»
Christophe AMBROISE, Maurice BAUDRY, Etienne BIRMELE, Pierre BREZELEC, Cécile COT, Marie-Odile DELORME, Claudine DEVAUCHELLE, Yolande DIAZ, Mark HOEBEKE, Mickael GUEDJ, Alex GROSMAN, François KÉPÈS, Sophie LEBRE, Catherine MATIAS, Vincent MIELE, Florence MURI-MAJOUE, Grégory NUEL, Franck PICARD, Jean-Loup RISLER, Karène RISSON, Anne-Sophie TOCQUET, Nicolas VERGNE
Sec: Michèle ILBERT

Labo MIG – INRA
François RODOLPHE, Sophie SCHBATH, Élisabeth de TURCKHEIM

Labo AGRO
Stéphane ROBIN

Lab’ Rouen
Dominique CELLIER


A bit of advertising

A presentation of Markov models (and HMM) and of their application in genomics (annotation, alignments, ...) can be found in:

NUEL, Grégory & PRUM, Bernard: Analyse statistique des séquences biologiques (modélisation markovienne, alignements et motifs), Hermès Sciences, 2007

