
Hidden Markov Models

A profile HMM

What is a HMM?

• It's a model. It condenses information.
• It models 1D discrete data.
• It is a directed graph.
• Nodes emit discrete data.
• Edges transition between nodes.
• What's hidden about it? The node identity.
• Who is Markov?


Markov processes

A Markov process is any process where the next item in the list depends on the current item. The dimension can be time, sequence position, etc.

Modeling protein secondary structure using Markov chains

"States" connected by "transitions":

H = helix
E = extended (strand)
L = loop

(diagram: the three states H, E, L, connected by transitions)

Setting the parameters of a Markov model from data.

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLLEEEEELLLLLLLLLLLEEEEEEEEELLLLLEEEEEEEEELLLLLLLLEEEEEELLLLLEEEEEELLLLLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLLLLLLEEEELLLLEEEELLLLEEEEEEEELLLLLLEEEEEEEEELLLLLLEELLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLEEEEEELLLLLLLLLLEEEEEELLLLLEEELLLLLLLLLLLLLEEEEEEEEELLLEEEEEELLLLLLLLLLLLLLLLLLLHHHHHHHHHHHHHLLLLLLLEELHHHHHHHHHHLLLLLLHHHHHHHHHHHLLLLLLLELHHHHHHHHHHHHLLLLLHHHHHHHHHHHHHLLLLLEEELHHHHHHHHHHLLLLLLHHHHHHHHHHEELLLLLLHHHHHHHHHHHLLLLLLLHHHHHHHHHHHHHHHHHHHHHHHHHHHLLLLLLHHHHHHHHHHHHHHHHHHLLLHHHHHHHHHHHHHHLLLLEEEELLLLLLLLLLLLLLLLEEEELLLLHHHHHHHHHHHHHHHLLLLLLLLEELLLLLHHHHHHHHHHHHHHLLLLLLEEEEELLLLLLLLLLHHHHHHHHHHHHHHHHHHHHHHHLLLLLHHHHHHHHHLLLLHHHHHHHLLHHHHHHHHHHHHHHHHHHHH

The string above is secondary structure data. Count the pairs to get the transition probability:

P(L|E) = P(EL)/P(E) = counts(EL)/counts(E)

counts(E) = counts(EE) + counts(EL) + counts(EH)

Therefore: P(E|E) + P(L|E) + P(H|E) = 1.

A transition matrix

        H     E     L
  H   .93   .01   .06
  E   .01   .80   .19
  L   .04   .06   .90

Rows are the current state q(t-1), columns are the next state q(t); each entry is P(q(t) | q(t-1)).

**This is a "first-order" MM. Transition probabilities depend on only the current state.

(diagram: the three states H, E, L with all nine transitions labeled P(H|H), P(E|H), P(L|H), P(H|E), P(E|E), P(L|E), P(H|L), P(E|L), P(L|L))

P(S|λ), the probability of a sequence, given the model.

P(“HHEELL”| λ)

= P(H) P(H|H) P(E|H) P(E|E) P(L|E) P(L|L)
= (.33)(.93)(.01)(.80)(.19)(.90)
= 4.2E-4

(using the H/E/L transition matrix above)

P(“HHHHHH” | λ) =0.69

P(“HEHEHE” | λ) =1E-6

Probability discriminates between realistic and unrealistic sequences

(figure: the model λ assigns high probability to common protein secondary structure strings and low probability to strings that are not protein secondary structure)
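This calculation is easy to script. A minimal Python sketch (the transition matrix and the 0.33 start probability are taken from the slides; treating the start distribution as uniform is an assumption):

# Probability of a secondary structure string under the
# first-order Markov chain above.
A = {
    'H': {'H': 0.93, 'E': 0.01, 'L': 0.06},
    'E': {'H': 0.01, 'E': 0.80, 'L': 0.19},
    'L': {'H': 0.04, 'E': 0.06, 'L': 0.90},
}
pi = {'H': 0.33, 'E': 0.33, 'L': 0.33}   # assumed uniform start

def chain_prob(seq):
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]                # P(cur | prev)
    return p

print(chain_prob("HHEELL"))   # ~4.2E-4
print(chain_prob("HEHEHE"))   # tiny: H<->E flips are rare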

What is the maximum likelihood model given a dataset of sequences?

Dataset (five sequences):

HHEELL
HHEELL
HHEELL
HHEELL
HHEELL

Count the state pairs:

        H   E   L
  H     1   1   0
  E     0   1   1
  L     0   0   1

Normalize by row:

        H     E     L
  H   0.5   0.5     0
  E     0   0.5   0.5
  L     0     0   1.0

This is the maximum likelihood model.
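Counting pairs and normalizing by row translates directly into code. A sketch, using the slide's dataset:

from collections import Counter

data = ["HHEELL"] * 5
states = "HEL"

pairs = Counter()
for seq in data:
    for prev, cur in zip(seq, seq[1:]):
        pairs[prev + cur] += 1           # count the state pairs

# Normalize each row so that the three probabilities sum to 1.
A = {}
for prev in states:
    row = sum(pairs[prev + cur] for cur in states)
    A[prev] = {cur: (pairs[prev + cur] / row if row else 0.0)
               for cur in states}

for prev in states:
    print(prev, A[prev])   # H: .5/.5/0   E: 0/.5/.5   L: 0/0/1.0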

"A model should be as simple as possible but not simpler" --Einstein

(figure: frequency versus helix length (1-10 residues), comparing real helix length data (L. Pal et al., J. Mol. Biol. (2003) 326, 273-291) against synthetic helix length data generated from this model)

Is this model too simple?

L

(diagram: four chained helix states H H H H, connected to E and L)

A Markov chain for proteins where helices are always exactly 4 residues long: a pseudo-higher-order HMM.

(diagram: the same chain, with a transition that lets the helix extend)

A Markov chain for proteins where helices are always at least 4 residues long.

Can you draw a Markov chain where helices are always a multiple of 4 long?

Transition matrix (reconstructed from the slide; rows = current state, columns = next state, reading the 0.5 entry as H4→H4):

        H1   H2   H3   H4    E    L
  H1     0    1    0    0    0    0
  H2     0    0    1    0    0    0
  H3     0    0    0    1    0    0
  H4     0    0    0  0.5  0.1  0.4
  E    0.2    0    0    0  0.7  0.1
  L    0.2    0    0    0  0.2  0.6

Calculate the probability of EHHHHHLLE.
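A worked answer, assuming the matrix reconstruction above (in particular, that the 0.5 entry lets H4 repeat): the only pathway for EHHHHHLLE is E, H1, H2, H3, H4, H4, L, L, E, so

P(EHHHHHLLE) = π_E × a(E→H1) a(H1→H2) a(H2→H3) a(H3→H4) a(H4→H4) a(H4→L) a(L→L) a(L→E)
             = π_E × (0.2)(1)(1)(1)(0.5)(0.4)(0.6)(0.2)
             = 0.0048 π_E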

Example application: A Markov chain for CpG islands

P(ATCGCGTA...) = π_A a_AT a_TC a_CG a_GC a_CG a_GT a_TA ...

(diagram: a "saturated" 4-state MM over A, C, G, T, with transitions between every pair of bases)

CpG islands

DNA is methylated on C to protect against endonucleases. Using mass spectrometry we can find regions of DNA that are methylated and regions that are not. Regions that are protected from methylation may be functionally important, e.g. transcription factor binding sites.

During the course of evolution, methylated CpGs get mutated to TpGs:

NNNCGNNN --> NNNTGNNN

(figure: a stretch of DNA with methylated ("-") and unmethylated ("+") regions)

Using Markov chains for discrimination:

CpG Islands in human chromosome sequences

From Durbin, Eddy, Krogh and Mitchison, "Biological Sequence Analysis" (1998), p. 50.

Two models: CpG rich = "+", CpG poor = "-".

P(CGCG|+) = πC(0.274)(0.339)(0.274) = πC 0.0255

P(CGCG|-) = πC(0.078)(0.246)(0.078) = πC 0.0015

Transitions within a CpG island (+ model):

  +      A      C      G      T
  A   0.180  0.274  0.426  0.120
  C   0.171  0.368  0.274  0.188
  G   0.161  0.339  0.385  0.125
  T   0.079  0.355  0.384  0.182

Transitions outside a CpG island (- model):

  -      A      C      G      T
  A   0.300  0.205  0.285  0.210
  C   0.322  0.298  0.078  0.302
  G   0.248  0.246  0.298  0.208
  T   0.177  0.239  0.292  0.292

Rabiner's notation

a_yx = P(x|y) = P(y,x)/P(y) = F(y,x)/F(y)

...the conditional probability of x given y, or the transition probability of state y to state x (F denotes a count).

π_x = P(x) = F(x)/N

...the unconditional probability of being in state x (used to start a state pathway).

b_x(y) = P(y|x)

...the conditional probability of emitting character y given state x.

The log likelihood ratio (LLR)

Log-likelihood ratios for transitions:

log Π_{i=1..L} [ a+_{x(i-1)x(i)} / a-_{x(i-1)x(i)} ] = Σ_{i=1..L} log [ a+_{x(i-1)x(i)} / a-_{x(i-1)x(i)} ] = Σ_{i=1..L} β_{x(i-1)x(i)}

Comparing two MMs

  β      A      C      G      T
  A  -0.740  0.419  0.580 -0.803
  C  -0.913  0.302  1.812 -0.685
  G  -0.624  0.461  0.331 -0.730
  T  -1.169  0.573  0.393 -0.679

Sum the LLRs. If the result is positive, it's a CpG island; otherwise not.

LLR(CGCG)=1.812 + 0.461 + 1.812 = 4.085 yes


A hidden Markov model can have multiple paths for a sequence

In Hidden Markov models (HMM), there is no one-to-one correspondence between the state and the emitted symbol.

(diagram: a "+" model and a "-" model, each a 4-state chain over A, C, G, T, with transitions between the + and - models)

Combining two Markov chains makes a hidden Markov model.

Probability of a sequence using a HMM

Nucleotide sequence (S): C G C G

State sequences (Q) and their joint probabilities P(sequence, path):

  C+ G+ C+ G+     π_C+ a_C+G+ a_G+C+ a_C+G+
  C- G- C- G-     π_C- a_C-G- a_G-C- a_C-G-
  C+ G+ C- G-     π_C+ a_C+G+ a_G+C- a_C-G-
  C+ G- C- G+     π_C+ a_C+G- a_G-C- a_C-G+
  etc....

P(CGCG|λ) = Σ_{all paths Q} P(CGCG, Q)

Different state sequences can produce the same emitted sequence

Each state sequence has a probability. The sum of all state sequences that emit CGCG is the P(CGCG).

Three HMM Algorithms

1. The Viterbi algorithm: get the optimal state pathway. Maximum joint prob.

2. The Forward/Backward algorithm: get the probability of each state at each position. Sum over all joint probs.

3. Expectation/Maximization: refine the parameters of the model using the data

Parallel HMM: emits secondary structure and amino acid

A probability distribution is a set of probabilities (0 ≤ p ≤ 1) that sum to 1.

(diagram: states H, E, L, each holding a "marble bag")

Each state emits one amino acid from the marble bag on each visit. The marble bag represents a probability distribution over amino acids, b_H(i) for state H: a profile.

Back to secondary structure prediction....

States emit both an amino acid and a secondary structure symbol.

(diagram: the H/E/L model λ, with a state sequence (secondary structure) above an amino acid sequence)

Given an amino acid sequence, what is the most probable state sequence?

HMM data structure for parallel HMM

in fortran...

type HMMNODE
   integer :: id
   type (HMMEDGE), pointer :: a(:)   ! outgoing edges
   real :: b(20)                     ! amino acid emission profile
   real, pointer :: emit(:)          ! optional extra emissions
   logical :: emitting
end type HMMNODE

type HMMEDGE
   integer :: id
   real :: p                         ! transition probability
   type (HMMNODE), pointer :: q      ! node this edge points to
end type HMMEDGE

type (HMMNODE), pointer :: hmm_root

A linked list... hmm_root should be the "begin" state, with hmm_root%emitting set to .false. If emitting is .true., the state emits the amino acid profile b, and optionally something else called emit(:).

What is the LLR that this seq is a CpG Island?

LLR = Σ_{i=1..L} β_{x(i-1)x(i)}

  β      A      C      G      T
  A  -0.740  0.419  0.580 -0.803
  C  -0.913  0.302  1.812 -0.685
  G  -0.624  0.461  0.331 -0.730
  T  -1.169  0.573  0.393 -0.679

ATGTCTTAGCGCGATCAGCGAAAGCCACG

LLR = _______________

In class exercise: what’s the LLR?
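A short Python sketch you could use to check the exercise (the β values are copied from the table above):

beta = {   # log-likelihood ratios, beta[prev][next]
    'A': {'A': -0.740, 'C': 0.419, 'G': 0.580, 'T': -0.803},
    'C': {'A': -0.913, 'C': 0.302, 'G': 1.812, 'T': -0.685},
    'G': {'A': -0.624, 'C': 0.461, 'G': 0.331, 'T': -0.730},
    'T': {'A': -1.169, 'C': 0.573, 'G': 0.393, 'T': -0.679},
}

def llr(seq):
    """Sum the transition LLRs; a positive result suggests a CpG island."""
    return sum(beta[p][c] for p, c in zip(seq, seq[1:]))

print(llr("CGCG"))                            # 1.812 + 0.461 + 1.812 = 4.085
print(llr("ATGTCTTAGCGCGATCAGCGAAAGCCACG"))   # the in-class exercise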

HMM: assigning the states given the sequence is not as easy.

Typically, when using a HMM, the task is to determine the optimal state pathway given the sequence. The state pathway provides some predictive feature, such as secondary structure, or splice site/not splice site, or CpG island/not CpG island, etc.

In principle, we can do this task by trying all state pathways Q and choosing the optimal one. In practice, this is usually impossible, because the number of pathways increases as the number of states to the power of the length, i.e. O(N^T) for N states and sequence length T.

How do we do it, then?

Joint probability of a sequence and pathway

Q = {q1, q2, q3, ... qT} = sequence of Markov states, or pathway
S = {s1, s2, s3, ... sT} = sequence of amino acids or nucleotides
T = length of S and Q

Joint probability of a pathway and sequence, given a HMM λ:

S = A  G  P  L  V  D
Q = H  H  E  E  E  L
P = π_H b_H(A) × a_HH b_H(G) × a_HE b_E(P) × a_EE b_E(L) × a_EE b_E(V) × a_EL b_L(D)

(diagram: a trellis of H/E/L states, one column per sequence position, with the pathway Q traced through it)

Maximize this joint probability over pathways Q.

Joint probability: general expression

P(S,Q | λ) = π_q1 Π_{t=1..T} [ b_qt(s_t) a_qt,qt+1 ]

**when t = T, there is no q_{T+1}. Use a = 1.

(diagram: the H/E/L trellis over the sequence A G P L V D)

General expression for pathway Q through HMM λ.
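The general expression translates directly into code. A sketch (pi, a, and b stand for whatever parameters your model λ holds; the dict-of-dicts layout is an assumption):

def joint_prob(S, Q, pi, a, b):
    """P(S,Q|lambda) = pi[q1] * product over t of b_qt(s_t) * a_qt,qt+1,
    with a = 1 implied at the final position."""
    p = pi[Q[0]]
    for t in range(len(S)):
        p *= b[Q[t]][S[t]]             # emission of s_t by state q_t
        if t + 1 < len(S):
            p *= a[Q[t]][Q[t + 1]]     # transition q_t -> q_t+1
    return p

For example, joint_prob("AGPLVD", "HHEEEL", pi, a, b) reproduces the product on the slide above.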

The Viterbi algorithm: the maximum probability path

(diagram: a trellis with Markov states l on the vertical axis and sequence positions t = 1, 2, 3, ..., T-1, T on the horizontal axis; at t = T, the last position, the traceback arrows from the MAX give the optimal state sequence)

Plot state versus position. Each v is a MAX over the whole previous column of v's. Recursive: we save the value v and also a traceback arrow Trc as we go along.

For all states k:

v_k(t) = MAX_l [ v_l(t-1) a_lk ] b_k(s_t)
Trc_k(t) = ARGMAX_l [ v_l(t-1) a_lk ]

Exercise: Write the Viterbi algorithm

v_k(t) = MAX_l [ v_l(t-1) a_lk ] b_k(s_t)
Trc_k(t) = ARGMAX_l [ v_l(t-1) a_lk ]

(grid: states 1..L versus positions 1..T)

initialize v_k(1) = π_k b_k(s_1)
for t=2,T {
   for k=1,L {
      ...
   }
}
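One possible solution, sketched in Python rather than the skeleton's pseudocode (probability space for clarity; for long sequences use log space, as discussed later):

def viterbi(S, states, pi, a, b):
    """Return the maximum-probability state pathway for sequence S."""
    v = {k: pi[k] * b[k][S[0]] for k in states}     # v_k(1)
    trc = []                                        # traceback arrows
    for s in S[1:]:
        arrows = {}
        v_new = {}
        for k in states:
            best = max(states, key=lambda l: v[l] * a[l][k])
            arrows[k] = best                        # Trc_k(t)
            v_new[k] = v[best] * a[best][k] * b[k][s]
        v = v_new
        trc.append(arrows)
    # Trace back from the best final state.
    path = [max(states, key=lambda k: v[k])]
    for arrows in reversed(trc):
        path.append(arrows[path[-1]])
    return "".join(reversed(path)), max(v.values())

With states = "HEL" and the parallel-HMM parameters of the earlier slides, viterbi("AGPLVD", "HEL", pi, a, b) would return the most probable secondary structure string.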

The Forward algorithm: all paths to a state

α_k(t) = the sum of P over all paths up to state k at position t:

α_k(t) = Σ_l α_l(t-1) a_lk b_k(s_t)

At the end of the sequence, when t = T, the sum of α_k(T) over k equals the total probability of the sequence given the model, P(S|λ).

"Forward" stands for "forward recursion". After the first column, each α depends on the whole previous column of α's. (α is the forward probability; a is the 'arrow', the transition between states.)

(diagram: a trellis of Markov states l versus sequence positions 1, 2, 3, ..., with all paths converging on state k at t)

The Backward algorithm: all paths from a state

β_k(t) = the sum over all paths from state k at t to the end of the sequence:

β_k(t) = Σ_l a_kl b_l(s_t+1) β_l(t+1)

"Backward" stands for "backward recursion". The algorithm starts at t = T, the end of the sequence. (The transitions are still forward.) Each β depends on the whole next column of β's.

At the beginning of the sequence, when t = 1, summing π_k b_k(s_1) β_k(1) over k gives the total probability of the sequence given the model, P(S|λ).

(diagram: a trellis of Markov states l versus sequence positions ..., T-2, T-1, T, with all paths leaving state k at t)

Exercise: Write the Forward algorithm

α_k(t) = Σ_l α_l(t-1) a_lk b_k(s_t)

(grid: states 1..L versus positions 1..T)

initialize α_k(1) = π_k b_k(s_1)
for t=2,T {
   for k=1,L {
      ...
   }
}
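A matching solution sketch for the Forward exercise:

def forward(S, states, pi, a, b):
    """Return P(S|lambda), summed over all state pathways."""
    alpha = {k: pi[k] * b[k][S[0]] for k in states}          # alpha_k(1)
    for s in S[1:]:
        alpha = {k: sum(alpha[l] * a[l][k] for l in states) * b[k][s]
                 for k in states}
    return sum(alpha.values())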

Forward/Backward algorithm: all paths through a state

γ_k(t) = α_k(t) × β_k(t)

γ_k(t), after normalizing over the states k, is the total probability of state k at position t, given the sequence S and the model λ. State k at position t is the bottleneck through which all the paths counted in γ_k(t) must travel.

(diagram: forward paths from positions 1, 2, 3, ... entering state k at t, and backward paths toward T-2, T-1, T leaving it)
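A sketch of γ from the two recursions (no scaling, so only suitable for short sequences; see the scaling slides later):

def gammas(S, states, pi, a, b):
    """gamma[t][k] = P(state k at position t | S, lambda)."""
    T = len(S)
    # Forward pass.
    alpha = [{k: pi[k] * b[k][S[0]] for k in states}]
    for t in range(1, T):
        alpha.append({k: sum(alpha[t-1][l] * a[l][k] for l in states) * b[k][S[t]]
                      for k in states})
    # Backward pass; beta_k(T) = 1.
    beta = [{k: 1.0 for k in states} for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = {k: sum(a[k][l] * b[l][S[t+1]] * beta[t+1][l] for l in states)
                   for k in states}
    pS = sum(alpha[T-1][k] for k in states)          # P(S|lambda)
    return [{k: alpha[t][k] * beta[t][k] / pS for k in states}
            for t in range(T)]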

Expectation/Maximization: refining the model

Example: refining b_k(G), i.e. the number of Gly's in the kth marble bag.

First calculate P(k|t) for all states k and all positions t using Forward/Backward. Then:

Step 1) Sum the probability of state k over every position t where the amino acid is G.
Step 2) Normalize it by dividing by the sum of P(k) over all positions.
Step 3) Set b_k(G) to that value: it estimates P(G|k).
Step 4) Repeat for all states k in λ and all 20 amino acids.

Recalculate P(k|t) using the new parameters, and repeat steps 1-4. Iterate to convergence. All parameters can be refined simultaneously.

Expectation/Maximization: refining the model

Example: refining b_k(G). To count the glycines, we calculate the Forward/Backward value for state k at every glycine (G) in the database, then sum them:

P(k|t,S,λ) = Σ (all paths through k at t) = γ_k(t) = α_k(t) × β_k(t)

b'_k(G) = Σ γ_k(t), summed over all G in all sequences S

(example database, G positions marked +):

S D K P H S G L K V S D E
S D K P H S S I K G S D E
S D K P Q G L K V S D E F F
S D K P H S E E E G S D E
K P G L K V S D E G Q G Q
D G L K V S D E G W W N N
K S G I N C L K V H R S D E
S D K P H S G M G L K E A
S D K P H G L K V S D E

b_k is then normalized to sum to 1 over all 20 amino acids.

Expectation/Maximization: refining the model

Example: refining a_jk, the probability of a transition from state j to state k.

Step 1) Get the probability of ending in state j at t --> α_j(t)
Step 2) Get the probability of starting in state k at t+1 --> β_k(t+1)
Step 3) Multiply these by the current a_jk (and the emission b_k(s_t+1)).
Step 4) Do Steps 1-3 for all positions t and all sequences S, and sum:

a'_jk = Σ_S Σ_t α_j(t) a_jk b_k(s_t+1) β_k(t+1)

After summing, the a' are normalized to sum to 1, and a_jk in the new model is reset to a'_jk. Do steps 1-4 using the new model. Repeat until convergence.
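One re-estimation pass for the transitions, sketched under the standard Baum-Welch conventions (note the emission term b_k(s_t+1), which the slide's shorthand leaves implicit):

def reestimate_a(seqs, states, pi, a, b):
    """a'_jk, proportional to the sum over S and t of
    alpha_j(t) * a_jk * b_k(s_t+1) * beta_k(t+1)."""
    num = {j: {k: 0.0 for k in states} for j in states}
    for S in seqs:
        T = len(S)
        alpha = [{k: pi[k] * b[k][S[0]] for k in states}]
        for t in range(1, T):
            alpha.append({k: sum(alpha[t-1][l] * a[l][k] for l in states) * b[k][S[t]]
                          for k in states})
        beta = [{k: 1.0 for k in states} for _ in range(T)]
        for t in range(T - 2, -1, -1):
            beta[t] = {k: sum(a[k][l] * b[l][S[t+1]] * beta[t+1][l] for l in states)
                       for k in states}
        pS = sum(alpha[T-1][k] for k in states)
        for t in range(T - 1):
            for j in states:
                for k in states:
                    num[j][k] += alpha[t][j] * a[j][k] * b[k][S[t+1]] * beta[t+1][k] / pS
    # Normalize each row to sum to 1.
    return {j: {k: num[j][k] / sum(num[j].values()) for k in states}
            for j in states}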


Draw this HMM

What is the probability of this sequence?

Unrolling the Viterbi algorithm: underflow problems, solved by going to log space

Algorithm:

v_k(t) = MAX_l v_l(t-1) a_lk b_k(s_t)

Algorithm unrolled:

v_k(2) = b_q1(s_1) a_q1,k b_k(s_2)
v_k(3) = b_q1(s_1) a_q1,q2 b_q2(s_2) a_q2,q3 b_q3(s_3)
v_k(4) = b_q1(s_1) a_q1,q2 b_q2(s_2) a_q2,q3 b_q3(s_3) a_q3,q4 b_q4(s_4)
...

Why will this calculation fail for v_k(100)? Hint: try multiplying 200 numbers (all between 0 and 1) together.

Viterbi algorithm underflow: log space solution

Algorithm:

v_k(t) = MAX_l v_l(t-1) a_lk b_k(s_t)

Log space:

Log(v_k(t)) = MAX_l [ Log(v_l(t-1)) + Log(a_lk) + Log(b_k(s_t)) ]

The Forward algorithm: underflow problem

Algorithm:

α_k(t) = Σ_l α_l(t-1) a_lk b_k(s_t)

Algorithm unrolled:

α_k(2) = b_k(s_2) [ α_1(1) a_1k + α_2(1) a_2k + α_3(1) a_3k + α_4(1) a_4k ]

α_k(3) = b_k(s_3) [ b_1(s_2) [α_1(1) a_11 + α_2(1) a_21 + α_3(1) a_31 + α_4(1) a_41] a_1k
                  + b_2(s_2) [α_1(1) a_12 + α_2(1) a_22 + α_3(1) a_32 + α_4(1) a_42] a_2k
                  + b_3(s_2) [α_1(1) a_13 + α_2(1) a_23 + α_3(1) a_33 + α_4(1) a_43] a_3k
                  + b_4(s_2) [α_1(1) a_14 + α_2(1) a_24 + α_3(1) a_34 + α_4(1) a_44] a_4k ]
...

We can't do this one in log space, because Log(a+b) can't be simplified.

Solution: dynamic scaling to keep the numbers within a reasonable range.

The Forward algorithm: underflow problem, scaling solution

α_k(t) = Σ_l α_l(t-1) a_lk b_k(s_t)

Let α'_k(2) = c_2 Σ_l α_l(1) a_lk b_k(s_2), where c_2 is any scale factor.
And α'_k(3) = c_3 Σ_l α'_l(2) a_lk b_k(s_3), where c_3 is any scale factor.
And so on, for all sequence positions t:

α'_k(t) = c_t Σ_l α'_l(t-1) a_lk b_k(s_t)

If we choose c_t = 1/Σ_k α'_k(t) (the reciprocal of the column sum before scaling), then the scaled α'_k(t) sum to 1 at each position, staying in a reasonable range.
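The scaling recipe in code, a sketch: each column of α is rescaled to sum to 1, and P(S|λ) is recovered from the saved scale factors (kept in log form so it stays in range):

import math

def forward_scaled(S, states, pi, a, b):
    """Scaled Forward: returns log P(S|lambda) without underflow."""
    alpha = {k: pi[k] * b[k][S[0]] for k in states}
    log_p = 0.0
    for t in range(len(S)):
        if t > 0:
            alpha = {k: sum(alpha[l] * a[l][k] for l in states) * b[k][S[t]]
                     for k in states}
        colsum = sum(alpha.values())
        alpha = {k: alpha[k] / colsum for k in states}   # c_t = 1/colsum
        log_p += math.log(colsum)                        # log P = sum_t log(1/c_t)
    return log_p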


The Forward/Backward algorithm: the scaling solution does not change gamma.

Re-writing: α'_k(t) = c_t Σ_l (Π_{i=1..t-1} c_i) α_l(t-1) a_lk b_k(s_t)

So: α'_k(t) = (Π_{i=1..t} c_i) α_k(t)

If we apply the same scale factors to the backward value β, then: β'_k(t) = (Π_{i=t..T} c_i) β_k(t)

Then the calculation of the a posteriori value γ is:

γ'_k(t) = α'_k(t) β'_k(t) = (Π_{i=1..t} c_i) α_k(t) (Π_{i=t..T} c_i) β_k(t)
        = (Π_{i=1..t} c_i)(Π_{i=t..T} c_i) α_k(t) β_k(t)
        = (Π_{i=1..T} c_i) c_t α_k(t) β_k(t) = C_T c_t α_k(t) β_k(t) = C_T c_t γ_k(t)

where C_T = Π_{i=1..T} c_i. Since γ is normalized to sum to 1:

γ'_k(t) = C_T c_t γ_k(t) / Σ_k C_T c_t γ_k(t) = γ_k(t)

Gamma from the scaled summation is the same as gamma unscaled.

How do you define the TOPOLOGY of the HMM?

how much wood would a wood chuck chuck if a wood chuck would chuck wood?

can you can a can as a canner can can a can?

I wish to wish the wish you wish to wish, but if you wish the wish the witch wishes, I won't wish the wish you wish to wish.

1 tgattggtct ctctgccacc gggagatttc cttatttgga ggtgatggag gatttcagga
61 tttgggggat tttaggatta taggattacg ggattttagg gttctaggat tttaggatta
121 tggtatttta ggatttactt gattttggga ttttaggatt gagggatttt agggtttcag
181 gatttcggga tttcaggatt ttaagttttc ttgattttat gattttaaga ttttaggatt
241 tacttgattt tgggatttta ggattacggg attttagggt ttcaggattt cgggatttca
301 ggattttaag ttttcttgat tttatgattt taagatttta ggatttactt gattttggga
361 ttttaggatt acgggatttt agggtgctca ctatttatag aactttcatg gtttaacata
421 ctgaatataa atgctctgct gctctcgctg atgtcattgt tctcataata cgttcctttg

Transposable elements: junk dealers

Transposable elements ("jumping genes") lead to rapid germline variation. (Barbara McClintock; transposase, transposasome; "Out standing in her field")

Excision of a transposon may leave a "scar".

(diagram: a cruciform structure at the excision site, and the repaired DNA with a copied TR and an added IR. TR = tandem repeat, IR = inverted repeat.)

Millions of years of accumulated TE "scars".

Some genomes contain a large accumulation of transposon scars.

Estimated transposable element-associated DNA content in selected genomes (the remainder is everything else):

  H. sapiens       35%
  Z. mays         >50%
  Drosophila        2%
  Arabidopsis      15%
  C. elegans       1.8%
  S. cerevisiae    3.1%

How do you recognize a repeat sequence?

• High scoring self-alignments
• High dot plot density
• Compositional bias

A repeat region in a dot plot.

Types of repeat sequences

Satellites -- 1000+ bp, in heterochromatin: centromeres, telomeres.

Simple Sequence Repeats (SSRs), in euchromatin:
  Minisatellites -- ~15 bp (VNTR)
  Microsatellites -- 2-6 bp

(heterochromatin = compact, light bands; euchromatin = loose, dark bands)

541 gagccactag tgcttcattc tctcgctcct actagaatga acccaagatt gcccaggccc
601 aggtgtgtgt gtgtgtgtgt gtgtgtgtgt gtgtgtgtgt gtatagcaga gatggtttcc
661 taaagtaggc agtcagtcaa cagtaagaac ttggtgccgg aggtttgggg tcctggccct
721 gccactggtt ggagagctga tccgcaagct gcaagacctc tctatgcttt ggttctctaa
781 ccgatcaaat aagcataagg tcttccaacc actagcattt ctgtcataaa atgagcactg
841 tcctatttcc aagctgtggg gtcttgagga gatcatttca ctggccggac cccatttcac

A microsatellite in a dog (Canis familiaris) gene.

Minisatellite

1 tgattggtct ctctgccacc gggagatttc cttatttgga ggtgatggag gatttcagga

61 tttgggggat tttaggatta taggattacg ggattttagg gttctaggat tttaggatta
121 tggtatttta ggatttactt gattttggga ttttaggatt gagggatttt agggtttcag
181 gatttcggga tttcaggatt ttaagttttc ttgattttat gattttaaga ttttaggatt
241 tacttgattt tgggatttta ggattacggg attttagggt ttcaggattt cgggatttca
301 ggattttaag ttttcttgat tttatgattt taagatttta ggatttactt gattttggga
361 ttttaggatt acgggatttt agggtgctca ctatttatag aactttcatg gtttaacata
421 ctgaatataa atgctctgct gctctcgctg atgtcattgt tctcataata cgttcctttg

This 8 bp tandem repeat has a consensus sequence AGGATTTT, but is almost never a perfect match to the consensus.

ACRONYMS for satellites and transposons

SSR    Short Sequence Repeat
STR    Short Tandem Repeat
VNTR   Variable Number Tandem Repeat
LTR    Long Terminal Repeat
LINE   Long Interspersed Nuclear Element
SINE   Short Interspersed Nuclear Element
MITE   Miniature Inverted repeat Transposable Element (class III TE)
TE     Transposable Element
IS     Insertion Sequence
IR     Inverted Repeat
RT     Reverse Transcriptase
TPase  Transposase
Alu    11% of primate genome (SINE)
LINE1  14.6% of human genome
Tn7, Tn3, Tn10, Mu, IS50   transposons or transposable bacteriophage
retroposon = retrotransposon

Class I TE uses RT. Class II TE uses TPase. Class III TEs are MITEs* (*Class III are now merged with Class II TEs.)

fun with bioinformatics jargon

How significant is that?

Please give me a number for...

...how likely the data would not have been the result of chance,...

...as opposed to...

...a specific inference. Thanks.

(How) do you align repeat sequences?

A: Don't align. Mask them out instead.

B: Dynamic programming with a special null model. (Use an EVD to fit the random scores.)

Remember: low complexity repeat sequences will have high-scoring alignments randomly. For example, a random A/T repeat...

ATTTATATAATTAATATATAAATATAATAAATAT
aligned to
TATTATATATATATATATATTATATATATATATA

...the random score has >50% identity!!

How do I align repeat sequences?

• Align using dynamic programming.
• Assess significance using an extreme value distribution fit to random data from a HMM.
• Results are e-values. Use e-values to build multiple sequence alignments, phylogenetic trees, etc.

Generating random low complexity sequences, to estimate significance

Null model = P(random alignment).

Dinucleotide composition model: generate random sequences based on the dinucleotide model, and align them to generate a random score distribution.

(diagram: a 4-state MM over A, C, G, T; inset: frequency versus alignment score)

Trinucleotide composition model:

(diagram: four 4-state groups, "after A", "after C", "after G", "after T"; only the arrows into the 4 "after A" states are shown)

Getting expectation values for low complexity/repeat sequences

Motif null model (a grammatical model): repeats are possibly misspelled words.

(diagram: an 8-character misspelled-word repeat model, A G K V T T T H, with an N state for occasional extra character(s))


Try this: create a HMM for a microsatellite.

• Using Netscape, go to the NCBI database and download the nucleotide sequence with GenBank identifier (gi) 21912445.
• Import it into UGENE.
• Find the microsatellite that starts at around position 330. Draw a motif HMM.
• Generate and align random microsatellite sequences. What are the scores?

MARCOIL predicts coiled coils

MARCOIL (Delorenzi & Speed, 2002) consists of 9 groups of 7 states. Each of the 7 states models a position in the helix, a-g. There are 4 special pre-coil and 4 post-coil groups, one repeating coil group (5), and one generic state (0).

(diagram: helix positions a-g; the groups and the connections between groups)

HMM for dicodon preferences: gene design

• Codon preferences exist due to differences in [tRNA] in the cell.

• Di-codon preferences exist due to interactions between neighboring tRNAs on the ribosome.

• Di-codon preferences are preserved* in the DNA sequences of ORFs in the genome.

• To find the optimal set of codons for a protein sequence, design a HMM based on codons, emitting amino acids in parallel. A parallel HMM. Maximum likelihood transitions between codons are the dicodon frequencies.

• Use Viterbi to assign codons to an amino acid sequence.

v_k(t) = MAX_l v_l(t-1) a_lk b_k(s_t)

where l runs over all codons, the transition probabilities a_lk are the dicodon frequencies, and the emissions are 1.00 or 0.00 (each codon state emits exactly one amino acid).

TMHMM -- transmembrane helices

        TM helices        Other regions
        count   freq      count   freq     Over-rep
  I      1826   0.120      2187   0.046      2.61
  F      1370   0.090      1854   0.039      2.31
  L      2562   0.168      4156   0.087      1.93
  V      1751   0.115      2935   0.061      1.89
  M       616   0.040      1201   0.025      1.60
  W       414   0.027       819   0.017      1.59
  A      1657   0.109      3382   0.071      1.54
  Y       615   0.040      1616   0.034      1.18
  G      1243   0.082      3352   0.070      1.17
  C       289   0.019       960   0.020      0.95
  T       755   0.050      2852   0.060      0.83
  S       806   0.053      3410   0.071      0.75
  P       423   0.028      2640   0.055      0.51
  H       121   0.008      1085   0.023      0.35
  N       250   0.016      2279   0.048      0.33
  Q       141   0.009      2054   0.043      0.21
  D       104   0.007      2551   0.053      0.13
  E       110   0.007      2983   0.062      0.11
  K        78   0.005      2651   0.055      0.09
  R        83   0.005      2933   0.061      0.08

  Tot   15214   1.000     47900   1.000

TMHMM (Krogh, 2001) models transmembrane helices in eukaryotic and inner-bacterial membranes. The first attempt was a composition-based model with 2 states.

(figure: example of a prediction result using Viterbi (V) and Forward/Backward (F/B); TM helices marked, M states in red)

TMHMM -- 15 state version

The new version has specific pre-helix and post-helix states and a more reasonable M-length distribution, ranging from 14-28+, not 1-28+.

Better F/B prediction, but V is worse. Why?

Connect HMMs by their begin/end states: HMMs within HMMs

We can define a very simple HMM for secondary structure (macro-states H, E, L). But here, each state is a macro-state emitting a variable-length string of amino acids from an internal HMM.

(diagram: the topology of the helix (H) unit used by Asai et al. [1993] to predict secondary structure; the periodicity of amphipathic helices is approximately modeled by the cycle of states, and states 1 and 5 represent the start and end of the helix, respectively)

Other topologies were explored for the E and L macro-states.

Modularity: the internal HMMs connect through their begin and end states.

Applications of metagenomics

• Soil health: A community of microorganisms is required to recycle nutrients. Farmers want to know what is there.
• Water pollution: Microorganism populations respond to the content of the water.**
• Human health and nutrition: The microbial communities of the gut, nose/throat, skin, and vagina are indicators of infected state and risk, and may be biomarkers of cancer.
• Paleobiology, paleogenomics: DNA from the frozen mammoth and iceman reveals their diet and phylogeny.
• Forensic science?

** A stream contaminated by acid coal mine runoff and discarded metal accumulates iron hydroxide and evolves its own bacterial community.

Next-gen sequencing of un-amplified, un-cultured DNA

• nodes = short sequence reads
• edge = sequential order
• path = genome
• Edges with only one occurrence can be pruned (errors tend NOT to occur in the same place twice).
• Bubbles and tips may be pruned by the "Tour bus" algorithm.
• Ambiguous paths may represent multiple strains of a species, or very similar species, and may be separated out using abundances.

ACGGCTCGCTAATCGCTAATTGTGACTGC

GACTGCAAAGGCTAGAAAAGGCTAGACCG

Meta genome assembly with abundance data

Species A: 1000 copies. Species B: 100 copies. (close homologs)

(figure: overlapping reads from the two species, each labeled with its abundance, 1000x or 100x; where the species differ, e.g. CGCTAATTGTGACTGC vs. CGCTAATTGCGACTGC and AAAGGCTAGACCG vs. AAAGGCTAGACGG, the variants separate by abundance)

The relative abundance of mutations should match the relative abundance of the species, and can be used to resolve ambiguous assemblies: this T goes with this G.

Using relative abundance for path finding in De Bruijn graphs

• Draw an edge only where the overlap is exact (a code sketch of this step follows below).
• Initialize all edge weights = number of occurrences (arrow thickness).
• Identify branched pathways, where there is more than one way to connect two vertices.
• Classify branches by occurrence weight.
• Find a path that stays within an occurrence class.

(figure: a De Bruijn graph for GCTAATTGC/T ... GACTGCAAAGGCTAGACG/CGTCA, with nodes such as CTAATTGT, TAATTGTG, CTAATTGC, TAATTGCG, and edge weights drawn as arrow thickness)
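A toy sketch of the first two steps, exact-overlap edges weighted by occurrence counts (the reads and k are placeholders):

from collections import Counter

def debruijn(reads, k):
    """Nodes are (k-1)-mers; an edge u->v is drawn only where the
    overlap is exact, with weight = number of occurrences."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

reads = ["CGCTAATTGTGACTGC", "CGCTAATTGCGACTGC"]   # two close strains
for (u, v), w in sorted(debruijn(reads, 8).items()):
    print(u, "->", v, " weight", w)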

UGENE exercise: meta genome assembly

1. Download the reference genomes (two) and the reads from the course web site (metagenomics_ref, metagenomics_reads).
2. In UGENE: Tools / Align to reference / Align short reads...
   1. Under "Reference sequence" enter the metagenomics_ref filename.
   2. Under "Short reads" add the metagenomics_reads file.
   3. Set mismatches allowed to 10%.
   4. Check "align reverse complement" and "Use best-mode".
   5. Start.
3. Determine the number of species present.
   1. Random errors generally do not repeat, so ignore mutations that don't occur ≥ twice.
   2. Find multiple strains by looking at the relative abundances of bases in aligned positions.
      1. Assuming two strains, how well are the abundances estimated by the sample?
      2. What bases go together in change positions?

Profile HMMs: models for MSAs

State emissions:

M = match state, emits one character from a specific profile.
I = insert state, emits one character from the background profile.
D = delete state, non-emitting. A connector.
Begin = non-emitting. Source state.
End = non-emitting. Sink state.

All π(q) = 0, except π(Begin) = 1.

To get the score of a sequence against a profile HMM, we use the F/B algorithm to get P(End). This is the measure of how well the sequence fits the model. Then we can test several models.

Make a HMM from Blast data

                                                                      Score    E
Sequences producing significant alignments:                           (bits)   Value

gi|18977279|ref|NP_578636.1| (NC_003413) hypothetical protein [P...    136    5e-32
gi|14521217|ref|NP_126692.1| (NC_000868) hypothetical protein [P...     59    8e-09
gi|14591052|ref|NP_143127.1| (NC_000961) hypothetical protein [P...     56    8e-08
gi|18313751|ref|NP_560418.1| (NC_003364) translation elongation ...     42    9e-04
gi|729396|sp|P41203|EF1A_DESMO Elongation factor 1-alpha (EF-1-a...     40    0.007
gi|1361925|pir||S54734 translation elongation factor aEF-1 alpha...     39    0.008
gi|18312680|ref|NP_559347.1| (NC_003364) translation initiation ...     37    0.060

QUERY      3  GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V  59
18977279   2  GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V  58
14521217   1  -MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V  53
14591052   1  -MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V  52
18313751 243  --------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V  274
729396   236  --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV  268
1361925  239  --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV  271
18312680 487  -----------------------------------IVGV-KVL-AGTIKPGVT----L-V  504

QUERY     60  --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT  109
18977279  59  --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT  108
14521217  54  --RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF--  100
14591052  53  --KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY--  99
18313751 275  VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL-----  322
729396   269  --FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------  314
1361925  272  --FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------  317
18312680 505  --KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY--  555

Make a HMM from Blast data

GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V
GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V
-MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V
-MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V
--------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V
--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV
--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV
-----------------------------------IVGV-KVL-AGTIKPGVT----L-V

--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT
--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT
--RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF--
--KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY--
VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL-----
--FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------
--FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------
--KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY--

(diagram: the profile HMM built from the alignment, with begin and end states, Match states, Insert states, and Delete states)

Paths through a profile HMM are like paths through an alignment matrix: a path through an edit graph corresponds to a path through a profile HMM.

Generating a profile HMM from a multiple sequence alignment

VGA--H
V----N
VEA--D
VKG---
VYS--T
FNA--N
IAGADN

Make four match states, basing the model on, say, the first sequence (four residues, four match states):

begin  M  M  M  M  end


Add insertion states where there are insertions (red):

(diagram: begin → M → M → M → M → end, with a self-looping insert state I between the third and fourth match states)

Add deletion states where there are deletions (red dashes):

(diagram: begin → M → M → M → M → end, with the insert state I and a parallel chain of non-emitting delete states D that skip match states)

...now optimize using expectation maximization.

Getting profiles for every Match state

VGA--H     w1
V----N     w2
VEA--D     w3
VKG---     w4
VYS--T     w5
FNA--N     w6
IAGADN     w7

Count the frequency of each amino acid, scaled by the sequence weights w. For the first match state:

b_M1(V) = (w1+w2+w3+w4+w5) / (w1+w2+w3+w4+w5+w6+w7)

In general:

P(V) = [ Σ_{i: si=V} wi ] / [ Σ_{all i} wi ]
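A sketch of the weighted count for one match column (the equal weights are hypothetical):

def match_profile(column, weights):
    """P(x) = sum of the weights of sequences with residue x in this
    column, divided by the total weight (gaps excluded)."""
    total = sum(w for s, w in zip(column, weights) if s != '-')
    prob = {}
    for s, w in zip(column, weights):
        if s != '-':
            prob[s] = prob.get(s, 0.0) + w / total
    return prob

column = "VVVVVFI"        # first match column of the alignment above
weights = [1.0] * 7       # hypothetical sequence weights w1..w7
print(match_profile(column, weights))   # {'V': 5/7, 'F': 1/7, 'I': 1/7}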

Calculating the probability of a sequence given the model: P(s|λ)

(diagram: the profile HMM with begin, Match, Insert, Delete, and end states)

Sum forward (the Forward algorithm) using the sequence s:

For each Match state, multiply by the transition (a) and the profile value b_M(s_i), and increment i.
For each Deletion state, multiply by a; do not increment i.
For each Insertion state, multiply by a, and increment i.

Picking a parent sequence

• The parent defines the number of Match states

• A Match state should conserve the chemical nature of the sidechain as much as possible.

• A Match state implies structural similarity.

Homolog detection using a library of profile HMMs

MYSEQUENCE

(diagram: a library of four profile HMMs, λ1, λ2, λ3, λ4)

Get P(S|λ) for each λ. Pick the model with the max P.

Added value

In DP, we assumed insertions and deletions were equally probable, and that the probability was independent of position.

With Profile HMMs we allow insertions and deletions to have different probabilities, and to be dependent on the position.

HMMs give better alignments than DP.

Pfam: Protein families

A searchable database of multiple sequence alignments and profile-HMMs.

  database   sequence alignments from   curated?
  Pfam-A     UniProtKB                  yes
  Pfam-B     ADDA (Holm)                no

PFAM visualization of a profile HMM:

• Logos for Match states:
  -- AAs stacked by probability
  -- color is AA type
  -- width is relative contribution
  -- height is Shannon entropy
• Pink bars for insert states:
  -- a thin line means no I state
  -- dark pink: M->I
  -- light pink: I->I

Bayes Block Alignment (BBA)

(diagram: K blocks, k = 1, 2, ..., K, each containing M, I, and D states, connected from start to end)

Maximum number of indels = K = 20 or L/10, whichever is less.

Find all alignments that have at most K gaps. (Sankoff, 1972; Zhu, Liu & Lawrence, 1998)

Algorithm for BBA forward probabilities:

M[k,i,j] = LR(i,j) × Σ { M[k,i-1,j-1], I[k-1,i-1,j-1], D[k-1,i-1,j-1] }

I[k,i,j] = Σ { M[k,i,j-1], I[k,i,j-1] }

D[k,i,j] = Σ { M[k,i-1,j], I[k,i-1,j], D[k,i-1,j] }

where LR(i,j) = likelihood ratio = substitution probability.

M

I

D

The forward product must be scaled

Underflow problems....

Solution: re-scaling at each step and saving the scale factors.

See slide 5

Algorithm for BBA forward sums: M

Illustration of the paths into M[k,i,j], the alignment of i to j after k gaps:

M[k,i,j] = LR(i,j) × Σ { M[k,i-1,j-1] (match after k gaps), I[k-1,i-1,j-1], D[k-1,i-1,j-1] (indel after k blocks) }

MID

MID

I[k,i,j] = Σ M[k,i,j-1]

I[k,i,j-1]

D[k,i,j] = ΣM[k,i-1,j]

I[k,i-1,j] D[k,i-1,j]

after k blocks after k blocks

Algorithm for BBA forward sums: I Algorithm for BBA forward sums: D

"Sampleback" instead of traceback. Start the sampleback at the lower right, at M(K,I,J), D(K,I,J), or I(K,I,J), then do the following:

i=I; j=J; k=K; histogram(..)=0
while (i>0 and j>0) do
   if current state is M then
      add 1 to histogram(i,j)
      y = D(k-1,i-1,j-1) + I(k-1,i-1,j-1) + M(k,i-1,j-1)
      x = random number, 0 ≤ x ≤ 1
      if (x < D(k-1,i-1,j-1)/y) then
         next state is D(k-1,i-1,j-1)
      else if (x < (D(k-1,i-1,j-1) + I(k-1,i-1,j-1))/y) then
         next state is I(k-1,i-1,j-1)
      else
         next state is M(k,i-1,j-1)
      end if
   else if current state is I, ... (try filling this in)
   else if current state is D, ... (try filling this in)
end do

Illustration of one sample-back

(diagram: the K blocks of M, I, D states from start to end; small orange boxes show one sampled path through the blocks)

MMDDMMMIIMDDD
AGCGCGC~~TTCA
AG~~CGCCCT~~~

The alignment starts in any upper-left box and ends in any lower-right box. Do this 10,000 times.

The result is a histogram, P(i,j): the probability of a match for every i, j. In the example below, the proteins are NOT homologous, but short stretches of similarity are found nonetheless.

(figure: a heat map of P(i,j) over i and j)

Testing DP versus BBA for hard cases

Above: optimal DP alignments of 1R69 to 1NEQ (distantly homologous proteins) using various substitution matrices. The last line, "BAliBase", is the true alignment.

Left: Bayesian Adaptive Alignment results for the same pair. The true alignment is in black outline. Color indicates the probability of a match between positions in the sequences (white: zero probability; blue: low P; red: high P). Sometimes the true alignment is sub-optimal. (Huang & Bystroff, 2006)

(plot: error rate versus coverage. BBA is better than DP.)

Review

• What is a Markov Model?
• What is a Hidden Markov Model?
• What kind of problem can a HMM solve?
• How do you find the best state pathway given a sequence?
• What does "state pathway" mean?
• How do you find the probability of a state q at sequence position k given a sequence s and a HMM λ?
• What does it mean for a state to "emit"?
• What use are non-emitting states?
• How is the joint probability of a sequence and a pathway calculated?
• What algorithm can be used to determine the maximum likelihood value for a parameter of a HMM?
• What is a profile HMM?
• How is the topology of a profile HMM initialized? How is it refined?
