Analysis of biological sequences using Markov Chains and Hidden Markov Models Bernard PRUM, La...

Analysis of biological sequences using Markov Chains and Hidden Markov Models

Bernard PRUM

La genopole – Evry – France


Colloque T.A.G – LAPTH

Annecy – 8-10 November 2006

Why Markov Models ?

A biological sequence :

X = (X1, X2, … , Xn)

where Xk ∈ A = { t , c , a , g } or {A C D E F G H I K L M N P Q R S T V W Y}

A very common tool for analyzing these sequences is the Markov Model (MM)

P(Xk = v | Xj , j < k) = P(Xk = v | Xk–1)   for all u, v ∈ A

denoted by π(u , v) if Xk–1 = u
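As an illustration, the transition probabilities π(u , v) can be estimated from a sequence by relative frequencies of consecutive letter pairs. This is a minimal sketch, not the author's code ; the function name `estimate_transitions` and the toy sequence are assumptions made for the example.

```python
from collections import Counter

ALPHABET = "acgt"

def estimate_transitions(seq, alphabet=ALPHABET):
    """Estimate pi(u, v) = P(X_k = v | X_{k-1} = u) by relative frequencies."""
    pair_counts = Counter(zip(seq, seq[1:]))  # number of times u is followed by v
    pi = {}
    for u in alphabet:
        row_total = sum(pair_counts[(u, v)] for v in alphabet)
        for v in alphabet:
            # A letter never seen as a predecessor gets an all-zero row here.
            pi[(u, v)] = pair_counts[(u, v)] / row_total if row_total else 0.0
    return pi

# Toy sequence, chosen only for illustration.
pi = estimate_transitions("acgtacgtaacc")
print(pi[("a", "c")])
```

Each row of the estimated matrix sums to 1 whenever the corresponding letter occurs at least once with a successor.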

Why MM ? – 2

Example :

[Figure: E. coli, the Rec BCD complex, viruses, and the bacterium's own genome carrying chi sites]

A complex, called Rec BCD, protects the cell against viruses

To avoid the destruction of the cell's own genome, a password, gctggtgg (called chi), is present along the genome. When Rec BCD bumps into a chi, it stops its destruction. For this protection to be efficient, the number of occurrences of chi is much higher than the number predicted by a Markov model.
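The comparison between observed and predicted counts can be sketched as follows for an order-1 Markov model fitted to the sequence itself. This is only an illustrative approximation : it uses the empirical letter frequencies as the law of the first letter, and the function names are hypothetical.

```python
from collections import Counter

def expected_count_m1(seq, word):
    """Approximate expected count of `word` under an order-1 Markov model
    whose parameters are estimated from `seq` (assumption: the first letter
    of the word follows the empirical letter frequencies)."""
    n = len(seq)
    letters = Counter(seq)
    pairs = Counter(zip(seq, seq[1:]))
    predecessors = Counter(seq[:-1])  # letters that have a successor
    prob = letters[word[0]] / n
    for u, v in zip(word, word[1:]):
        if predecessors[u] == 0:
            return 0.0
        prob *= pairs[(u, v)] / predecessors[u]
    return (n - len(word) + 1) * prob

def observed_count(seq, word):
    """Number of (possibly overlapping) occurrences of `word` in `seq`."""
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq.startswith(word, i))

# Toy periodic sequence, for illustration only (not genomic data).
toy = "ab" * 50
expected = expected_count_m1(toy, "ab")
observed = observed_count(toy, "ab")
```

On a real genome one would call the same two functions with the word gctggtgg and compare the two numbers ; judging whether the gap is exceptional requires the variance of the count as well.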

Results MM

Parsimonious Markov Models

When we model a sequence in order to find exceptional motifs or for annotation, we have to estimate the parameters of the model, and the more parameters we have, the worse the estimation.

In a Markov Model of order m, there are 4^m predictors (the m-words), hence 3 x 4^m parameters

In the M2 model, there are 16 predictors and 48 parameters

In M5, there are 1024 predictors, 3072 parameters
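The counts above follow directly from the formula ; a short sketch (the function name is an assumption) reproduces them and can be reused for other alphabets, such as the 20 amino acids.

```python
def markov_model_size(order, alphabet_size=4):
    """Number of predictors (m-words) and of free parameters of an order-m model."""
    predictors = alphabet_size ** order
    # Each predictor carries a law on the alphabet: alphabet_size - 1 free parameters.
    parameters = (alphabet_size - 1) * predictors
    return predictors, parameters

for m in (2, 5):
    print(m, markov_model_size(m))
# 2 (16, 48)
# 5 (1024, 3072)
```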

PMM – 2

A first restriction consists in adapting the length of the past to the data : we use a long past when the sequence shows that this is necessary, and a short past when the sequence allows the economy. These models are called

VLMC = Variable Length Markov Chains

In this VLMC, there are 12 predictors :

aa ca ga ta   ac cc gc tc   g   at tt [gc]t

There are 36 parameters

Notation : [gc] denotes « g or c » ; [act] denotes « a or c or t »

PMM – 3

But it is not obvious that, for the prediction of Xk , Xk–p becomes less and less informative as p increases.

As an example (*), let us consider the ’jumper’ model

P(Xt = v | past) = P(Xt = v | Xt – 2)

(the dependence ‘jumps’ over Xt – 1)

it corresponds to this tree (4 predictors, 12 parameters)

(*) this model is not as academic as it seems : for example, in a coding region (periodic model depending on the phase), the 2nd position in a codon strongly depends on the 2nd position in the previous codon (cf. hydrophobicity)

PMM – 4

More general (?) example :

In this PMM there are 8 predictors (24 parameters) :

a[ac] c[ac] g[ac] t[ac]   g   at [cg]t tt

These models are called

PMM = Parsimonious Markov Models

PMM – 5

More precisely : in the tree of predictors (*), below any node, all the partitions of A = { t , c , a , g } may appear

(*) : the different predictors appear in this tree as the paths from the leaves to the root

Hence, there are 15 possibilities below each node.
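The figure 15 is the number of partitions of a 4-element set. A small recursive enumeration (a sketch ; the function name is an assumption) makes this explicit.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition

partitions = list(set_partitions(list("tcag")))
print(len(partitions))  # 15
```

Each partition corresponds to one way of grouping the four letters below a node, for example [tc], a, g.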

PMM – 6

A Parsimonious Markov Model (PMM) is defined by

• such a dependence tree

• for each leaf (= for each predictor) a law on A

P(Xt = u | Xt – 1 = [tc])

P(Xt = u | Xt – 1 = a, Xt – 2 = c)

P(Xt = u | Xt – 1 = a, Xt – 2 = [tag])

P(Xt = u | Xt – 1 = g)

+

PMM – 7

We will only work with finite order PMM : the longest predictor contains, say, m letters (the depth of the tree is m)

Obviously a PMM of order m is a MM of order m

Note : the number of PMM increases very quickly with m : in the 4-letter alphabet and for m = 5 there are some 10^85 trees

Notations : τ denotes a tree of predictors

W its set of predictors (the leaves of τ)

For w ∈ W , π(w, u) = P(Xt = u | w)

–––––––

Statistics on PMM

For a fixed tree τ, the likelihood is obviously

L(π) = … ∏w∈W ∏u∈A π(w, u)^N(wu)

where N(wu) is the number of occurrences of the predictor w followed by the letter u.

(The dots correspond to the first letters in the sequence. We will not care about them today)

Which leads to the classical MLE

π̂(w, u) = N(wu) / N(w+)     (where N(w+) = ∑v N(wv))

The difficulty arises when we want to choose the tree : a problem of choice of model (among, for example, 10^85 models)

Statistics – 2

Therefore we adopt a Bayesian approach

A priori law :

• on the tree, let us choose the uniform law (it can be changed)

• on the transition parameters, it is natural to choose a Dirichlet law, which is conjugate :

if, for w ∈ W , a priori P(π(w,•)) ∝ ∏u π(w, u)^α(w, u)

then, a posteriori P(π(w,•)) ∝ ∏u π(w, u)^(α(w, u) + N(wu))

The MAP estimator of π(w, u) remains the same as before, except that N(wu) has to be replaced by

N’(wu) = N(wu) + α(w, u)
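A minimal sketch of this estimator for one fixed predictor w is given below. The function name, the dictionary layout and the choice of a prior exponent equal to 1 for every letter are assumptions made for the example. The formula N’(wu) / N’(w+) is the posterior mode when the prior density is written with the exponent itself, as above ; with the usual Dirichlet parametrization (exponent minus 1) the same ratio is the posterior mean.

```python
def map_estimate(counts, alpha):
    """counts[u] = N(wu) and alpha[u] = prior exponent, for one predictor w.
    Returns the estimates N'(wu) / N'(w+)."""
    smoothed = {u: counts.get(u, 0) + alpha[u] for u in alpha}  # N'(wu)
    total = sum(smoothed.values())                               # N'(w+)
    return {u: smoothed[u] / total for u in smoothed}

# Hypothetical counts after some predictor w; g and t were never observed.
counts = {"a": 3, "c": 1}
alpha = {u: 1.0 for u in "acgt"}
estimate = map_estimate(counts, alpha)
```

Unlike the plain MLE, letters that were never observed after w still receive a positive probability.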

Statistics – 3

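The use of Bayes formula then gives, as a posterior law on the trees τ,

ln P(τ | X) = ∑w S(w)   (up to an additive constant)

where the sum is taken over all the predictors w in the tree and

S(w) = ∑u ln Γ(N(wu)) – ln Γ(∑u N(wu))

with the counts N(wu) including the prior terms, i.e. N’(wu), so that every argument of Γ is positive.

(Γ is such that Γ(k+1) = k ! , k ∈ N)

Writing the posterior law in this way shows that P(τ | X) may be maximized in a recursive way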
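The additive structure of the score can be sketched with the log-Gamma function from the standard library. This is an illustration under stated assumptions : a constant prior term `alpha` is added to every count, and the function names and the toy counts are hypothetical.

```python
from math import lgamma

def predictor_score(counts, alpha=1.0, alphabet="acgt"):
    """S(w) = sum_u ln Gamma(N'(wu)) - ln Gamma(sum_u N'(wu)),
    with N'(wu) = N(wu) + alpha (alpha is an assumed constant prior term)."""
    smoothed = [counts.get(u, 0) + alpha for u in alphabet]
    return sum(lgamma(x) for x in smoothed) - lgamma(sum(smoothed))

def tree_score(counts_by_predictor, alpha=1.0):
    """Sum of S(w) over the predictors (leaves) of a tree."""
    return sum(predictor_score(c, alpha) for c in counts_by_predictor.values())

# Two candidate trees on hypothetical counts: predictors g and c kept apart,
# or fused into the single predictor [gc].
separate = {"g": {"a": 10, "t": 2}, "c": {"a": 9, "t": 3}}
fused = {"[gc]": {"a": 19, "t": 5}}
scores = (tree_score(separate), tree_score(fused))
```

Because the score is a sum over leaves, the best subtree below each node can be chosen independently of the rest of the tree, which is what makes a recursive maximization possible.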

Application to real genomes

We fitted MM and PMM for the orders m = 3 , 4 and 5

• on the set of the 224 complete bacterial genomes published to date

• on their coding regions (CDS)

To compare the adequacy of these models, we computed the BIC criterion for each model M :

BIC(M) = 2 L(M) – nb_param(M) . ln n

where L(M) denotes here the maximized log-likelihood of the model M and n the length of the sequence.

“ The higher the BIC, the better the model ”
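The criterion is a one-line computation once the maximized log-likelihood is known. In the sketch below the function name is an assumption and the numerical values are made up purely to show the trade-off ; they are not results from the study.

```python
from math import log

def bic(log_likelihood, n_params, n):
    """BIC(M) = 2 L(M) - nb_param(M) * ln n, with L(M) the maximized log-likelihood."""
    return 2.0 * log_likelihood - n_params * log(n)

# Made-up values: a full order-3 model (192 parameters) against a hypothetical
# parsimonious model with fewer parameters and a slightly lower log-likelihood.
n = 1_000_000
bic_full = bic(-1_350_000.0, 192, n)
bic_parsimonious = bic(-1_350_200.0, 120, n)
```

With these made-up numbers the parsimonious model wins, because the penalty saved on 72 parameters exceeds twice the loss in log-likelihood.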

Application – 2

This picture plots BIC(PMM) – BIC(MM) against the size of the bacterial genome. For all the bacteria, PMM fits better than the classical MM


Approach using FDA

Recent results (Gregory Nuel) concern the use of «Finite (Deterministic) Automata» (FDA) in the statistics of words or patterns

To a word, we may associate an FDA :

Example 1 : on {a,b}, w = aaab

States : b, a, aa, aaa, aaab

[Figure: transition graph of the automaton on these states]

This can be generalized if “one” word w is replaced by a motif (finite family of words) or even a language.
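A generic construction of such an automaton for a single word can be sketched as follows. It is a textbook prefix automaton, not Nuel's program ; the function names are assumptions. States are numbered by the length of the matched prefix, so that state 0 plays the role of the state b of the example, and state 4 is aaab.

```python
def build_word_automaton(word, alphabet):
    """Deterministic automaton whose state is the length of the longest prefix
    of `word` that is a suffix of the text read so far."""
    delta = {}
    for state in range(len(word) + 1):
        for letter in alphabet:
            candidate = word[:state] + letter
            k = min(len(word), len(candidate))
            # Longest suffix of `candidate` that is also a prefix of `word`.
            while k > 0 and not candidate.endswith(word[:k]):
                k -= 1
            delta[(state, letter)] = k
    return delta

def count_occurrences(text, word, alphabet):
    """Run `text` through the automaton; each visit to the final state is one occurrence."""
    delta = build_word_automaton(word, alphabet)
    state, count = 0, 0
    for letter in text:
        state = delta[(state, letter)]
        if state == len(word):
            count += 1
    return count

print(count_occurrences("aaabaaaab", "aaab", "ab"))  # 2
```

The text is read once, letter by letter, whatever the length of the word ; overlapping occurrences are counted.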

Approach using FDA

This automaton is especially dedicated to the study of the word w (the motif, ...) : if we “run” a sequence on this graph, the automaton counts the occurrences of w (the motif, ...)

It turns out to be VERY efficient : “wordcount”, a program in EMBOSS, needs 4352 seconds to count the occurrences of all 12-words in E. coli ; Nuel’s program achieves this task in 9.86 seconds

The prosite motif

[LIVMF]GE.[GAS][LIVM].(5-11)R[STAQ]A.[LIVAM].[STACV]

(some 10^12 words) is treated by an FDA of 329 (30) states in M0 and 1393 (78) states in M1

Approach using FDA

If the sequence X is a Markov chain, we then have another MC running on this graph.

Even for “rather complicated motifs”, this allows us to get the law of “all” statistics of words :

- exact law of the first occurrence of a motif (taking into account the “starting point”),

- exact law of the number of occurrences of the motif,

- in particular the expectation and variance of these laws, opening the possibility of Gaussian, Poisson, ... approximations (and an exhaustive study of the quality of these approximations),

- law of a motif M conditionally on the number of occurrences of another one, M’.

Hidden Markov Models

2nd Part :

Hidden Markov Models

An important criticism against Markov modeling is its stationarity : a well-known theorem says that, under weak conditions,

P(Xk = u) → µ(u) (when k → ∞)

(and the rate of convergence is exponential.)
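The convergence can be observed numerically on a toy chain. The two-letter transition matrix below is hypothetical and chosen only for illustration ; its stationary law is µ(a) = 5/6 and the distance to it shrinks by a factor 0.4 at each step.

```python
def step(dist, pi):
    """One step of the chain: new[v] = sum_u dist[u] * pi[u][v]."""
    letters = list(dist)
    return {v: sum(dist[u] * pi[u][v] for u in letters) for v in letters}

# Hypothetical transition matrix on the alphabet {a, b}.
pi = {"a": {"a": 0.9, "b": 0.1},
      "b": {"a": 0.5, "b": 0.5}}

dist = {"a": 0.0, "b": 1.0}  # the chain starts from the letter b
history = []                 # successive values of P(X_k = a)
for k in range(30):
    history.append(dist["a"])
    dist = step(dist, pi)
```

After a few dozen steps the law of Xk no longer depends on the starting letter, which is why a single homogeneous chain cannot represent regions of different composition.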

But biological sequences are not homogeneous.

There are g+c rich segments / g+c poor segments (isochores).

One may presume (and verify) that the rules of succession of letters differ in coding parts / non-coding parts.

Is it possible to take advantage of this problem and to develop a tool for the analysis of heterogeneity ? => annotation

HMM – 2

Suppose that d states alternate along the sequence

And in each state we have a MC :

if Sk = 1, then P(Xk = v | Xk–1 = u) = π1(u ; v)

if Sk = 2, then P(Xk = v | Xk–1 = u) = π2(u ; v)

and (more technical than biological - see HSMM)

P(Sk = y | Sk–1 = x) = π0(x ; y)

[Figure: a sequence cut into successive segments with hidden states Sk = 1, Sk = 2, Sk = 1, Sk = 2, Sk = 1]

Our objectives :

• Estimate the parameters π1, π2, π0

• Allocate a state {1, 2} to each position

HMM – 3

Use the likelihood !!

L(π) = ∑ µ0(S1) µS1(X1) ∏k=2..n π0(Sk–1, Sk) πSk(Xk–1, Xk)

where each product has n terms (the length of the sequence) and the sum runs over all possibilities S1S2...Sn ; there are s^n terms for s states.

For s = 2 and n = 10 000 : 2^10 000 ≈ 10^3 000. Despair !!!
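The standard way around this explosion, which this extract does not show, is the forward recursion : it reorganizes the same sum so that it costs of the order of n x s^2 operations. The sketch below compares it with the direct sum on a tiny example. All function names and numerical parameters are hypothetical, and no rescaling is done, so it is only suitable for short sequences.

```python
from itertools import product

def likelihood_forward(x, mu0, mu, pi0, pi):
    """Forward recursion. mu0[s]: law of S1; mu[s][v]: law of X1 in state s;
    pi0[r][s]: state transitions; pi[s][u][v]: letter transitions in state s."""
    states = list(mu0)
    f = {s: mu0[s] * mu[s][x[0]] for s in states}
    for k in range(1, len(x)):
        f = {s: sum(f[r] * pi0[r][s] for r in states) * pi[s][x[k - 1]][x[k]]
             for s in states}
    return sum(f.values())

def likelihood_brute_force(x, mu0, mu, pi0, pi):
    """Direct evaluation of the sum over the s^n state paths (tiny n only)."""
    total = 0.0
    for path in product(list(mu0), repeat=len(x)):
        p = mu0[path[0]] * mu[path[0]][x[0]]
        for k in range(1, len(x)):
            p *= pi0[path[k - 1]][path[k]] * pi[path[k]][x[k - 1]][x[k]]
        total += p
    return total

# Hypothetical parameters: 2 states, alphabet {a, b}.
mu0 = {1: 0.5, 2: 0.5}
mu = {1: {"a": 0.5, "b": 0.5}, 2: {"a": 0.2, "b": 0.8}}
pi0 = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.8}}
pi = {1: {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}},
      2: {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}}

x = "abbab"
fast = likelihood_forward(x, mu0, mu, pi0, pi)
slow = likelihood_brute_force(x, mu0, mu, pi0, pi)
```

Both functions return the same value ; only the first one remains usable when n reaches the thousands, provided the recursion is rescaled or carried out in logarithms.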

Annotation

H.M.M. (continued)

Searching nucleosome positions



In eukaryotes (only), an important part of the chromosomes forms chromatin, a state where the double helix winds around “beads”, forming a necklace :

Each bead is called a nucleosome. Its core is a complex involving 8 proteins (an octamer) called histones (H2A, H2B, H3, H4). DNA winds twice around this core and is locked by another histone (H1). The total weight of the histones is approximately equal to the weight of the DNA.

[Figure ; scale bar : 10 nm]

Curvature within curvature



The DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each one made with about 10 nucleotides.

Total = 146 nt within each nucleosome.

Depending on the position (“in” vs “out”), the curvature satisfies different constraints

Nuc and “no-nuc” states

[Figure: state diagram with a no-nuc state, a spacer state, and nucleosome core states 1, 2, ..., 70]

Trifonov (99) as well as Rando (05) underline that there are ‘no’ nucleosomes in the gene promoters (accessibility). They introduce, “before” the nucleosome, a “no-nucleosome” state.

Ioshikhes, Trifonov, Zhang, Proc. Natl Acad. Sc. 96 (1999)

Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005)

Bendability

Following an idea (Baldi, Lavery, ...) we introduce an index of bendability ; it depends on the succession of 2, 3, 4, ... nucleotides.


PNUC table (rows : 1st and 3rd letters ; columns : 2nd letter)

1st  3rd |    a     t     g     c
 a    a  |   0.0   2.8   3.3   5.2
 t    a  |   7.3   7.3  10.0  10.0
 g    a  |   3.0   6.4   6.2   7.5
 c    a  |   3.3   2.2   8.3   5.4
 a    t  |   0.7   0.7   5.8   5.8
 t    t  |   2.8   0.0   5.2   3.3
 g    t  |   5.3   3.7   5.4   7.5
 c    t  |   6.7   5.2   5.4   5.4
 a    g  |   5.2   6.7   5.4   5.4
 t    g  |   2.2   3.3   5.4   8.3
 g    g  |   5.4   6.5   6.0   7.5
 c    g  |   4.2   4.2   4.7   4.7
 a    c  |   3.7   5.3   7.5   5.4
 t    c  |   6.4   3.0   7.5   6.2
 g    c  |   5.6   5.6   8.2   8.2
 c    c  |   6.5   5.4   7.5   6.0

PNUC(cga) = 8.3

There exist various tables which indicate the bendability of di-, tri- or even tetra-nucleotides (PNUC, DNase, ...)

We used PNUC-3 (*) : for example, PNUC(tcg) = 8.3

(*) Goodsell, Dickerson, NAR 22 (1994)

Scan of K3 of yeast

Sometimes it works :



What about positions ?

We represent (*) parts of the chromosome K3 of Yeast



The green curve (“proba” of the no-nuc state) increases between genes (promoters)

The red curve (“proba” of the nucleosome state) appears periodically in genes.

(*) using the software MuGeN, by Mark Hoebeke

Acknowledgements

Labo «Statistique et Génome»        Labo MIG – INRA
Christophe AMBROISE                  Philippe BESSIÈRE
Maurice BAUDRY                       François RODOLPHE
Etienne BIRMELE                      Sophie SCHBATH
Cécile COT                           Élisabeth de TURCKHEIM
Emmanuelle DELLA-CHIESA
Mark HOEBEKE
Mickael GUEDJ
François KÉPÈS                       Labo AGRO
Sophie LEBRE                         Jean-Noël BACRO
Catherine MATIAS                     Jean-Jacques DAUDIN
Vincent MIELE                        Stéphane ROBIN
Florence MURI-MAJOUBE
Grégory NUEL
Franck PICARD                        Lab’ Rouen
Hugues RICHARD                       Dominique CELLIER
Anne-Sophie TOCQUET                  Sabine MERCIER
Nicolas VERGNE

Sec : Michèle ILBERT
