Prof. Yechiam Yemini (YY)
Computer Science Department, Columbia University
Chapter 4: Hidden Markov Models
4.1 Introduction to HMM
Overview
- Markov models of sequence structures
- Introduction to Hidden Markov Models (HMM)
- HMM algorithms; the Viterbi decoder
Durbin chapters 3-5
The Challenges
- Biological sequences have modular structure
  - Genes: exons, introns
  - Promoter regions: modules
  - Proteins: domains, folds, structural parts, active parts
- How do we identify informative regions?
  - How do we find & map genes?
  - How do we find & map promoter regions?
Mapping Protein Regions
[Figure: protein structure with individual residues labeled (e.g., Y253, E255, T315, M351, H396); marked regions: ATP binding domain, activation domain, binding site. Legend: active components, interfaces.]
Statistical Sequence Analysis
- Example: CpG islands indicate important regions
  - CG (denoted CpG) is typically transformed by methylation into TG
  - Promoter/start regions of genes suppress methylation
  - This leads to higher CpG density
  - How do we find CpG islands?
- Example: active protein regions are statistically similar
  - Evolution conserves structural motifs but varies sequences
- Simple comparison techniques are insufficient:
  - Global/local alignment
  - Consensus sequences
The challenge: analyzing statistical features of regions
Review of Markovian Modeling
- Recall: a Markov chain is described by transition probabilities
  π(n+1) = π(n)·A (π(n) a row vector), where
  π(i,n) = Prob{S(n)=i} is the state probability
  A(i,j) = Prob{S(n+1)=j | S(n)=i} is the transition probability
- Markov chains describe statistical evolution
  - In time: evolutionary change depends on the previous state only
  - In space: change depends on neighboring sites only
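A minimal sketch of this update rule in Python (numpy assumed; the 2-state matrix below is an illustrative assumption, not from the slides):

    import numpy as np

    # Illustrative 2-state chain: A[i][j] = Prob{S(n+1)=j | S(n)=i},
    # so each row sums to 1 (values assumed for demonstration).
    A = np.array([[0.9, 0.1],
                  [0.3, 0.7]])

    pi = np.array([1.0, 0.0])   # start in state 0 with certainty
    for n in range(5):
        pi = pi @ A             # pi(n+1) = pi(n) A  (row-vector form)
        print(n + 1, pi)

Iterating the update drives π(n) toward the chain's stationary distribution.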
From Markov To Hidden Markov Models (HMM)
- Nature uses different statistics for evolving different regions
  - Gene regions: CpG, promoters, introns/exons…
  - Protein regions: active sites, interfaces, hydrophobic/hydrophilic…
- How can we tell regions apart?
  - Sample sequences have different statistics
  - Model regions as Markovian states emitting observed sequences…
- Example: CpG islands
  - Model: two connected MCs, one for CpG and one for normal sequence
  - The MC is hidden; only sample sequences are seen
  - Detect transitions to/from the CpG MC
  - Similar to a dishonest casino: transitions from fair to biased dice
Hidden Markov Models
HMM Basics
- A Markov chain: states & transition probabilities A = [a(i,j)]
- Observable symbols for each state, O(i)
- A probability e(i,X) of emitting symbol X at state i
[Figure: two-state HMM over states F and B. Transitions: a(F,F)=.6, a(F,B)=.4, a(B,F)=.8, a(B,B)=.2. Emissions: e(F,H)=0.5, e(F,T)=0.5, e(B,H)=0.9, e(B,T)=0.1.]
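A minimal sketch of this HMM as data plus a sampler (Python; the .5/.5 start probabilities are taken from the later decoding slides):

    import random

    # The two-state coin HMM from the figure: F = fair, B = biased.
    hmm = {
        "start": {"F": 0.5, "B": 0.5},
        "trans": {"F": {"F": 0.6, "B": 0.4},
                  "B": {"F": 0.8, "B": 0.2}},
        "emit":  {"F": {"H": 0.5, "T": 0.5},
                  "B": {"H": 0.9, "T": 0.1}},
    }

    def sample(hmm, n):
        """Sample a hidden path and the observed sequence it emits."""
        pick = lambda d: random.choices(list(d), weights=list(d.values()))[0]
        state, path, seq = pick(hmm["start"]), [], []
        for _ in range(n):
            path.append(state)
            seq.append(pick(hmm["emit"][state]))
            state = pick(hmm["trans"][state])
        return "".join(path), "".join(seq)

    print(sample(hmm, 10))   # e.g. ('BFBFFFBFBF', 'HHTHTTHHHH')

Only the second string is observable; recovering the first is the decoding problem treated below.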
Coin Example
- Two-state MC {F,B}: F = fair coin, B = biased coin
- Emission probabilities can be described in state boxes or through separate emission boxes
- Example: transmembrane proteins, with hydrophilic/hydrophobic regions

[Figure: the same coin HMM drawn both ways: emissions written inside the state boxes (e(F,H)=.5, e(F,T)=.5, e(B,H)=.9, e(B,T)=.1), or as emission boxes over the symbols H, T (.5/.5 for F, .9/.1 for B); transitions .6, .4, .8, .2 as before.]
HMM Profile Example (Non-gapped)
HMM built from ten aligned sequences (one per row):

TAGTAT
TAATAT
TAATAT
TATAAT
TAATAT
TATTAT
GAATAT
GAATCT
TAGCAT
TATTAT

Per-site nucleotide counts:

     1   2   3   4   5   6
A    0  10   5   1   9   0
C    0   0   0   1   1   0
G    2   0   2   0   0   0
T    8   0   3   8   0  10

A state per site: S → 1 → 2 → 3 → 4 → 5 → 6 → E, every transition probability 1. Per-site emission probabilities (counts/10):

site 1: A=.0, C=.0, G=.2, T=.8
site 2: A=1., C=.0, G=.0, T=.0
site 3: A=.5, C=.0, G=.2, T=.3
site 4: A=.1, C=.1, G=.0, T=.8
site 5: A=.9, C=.1, G=.0, T=.0
site 6: A=.0, C=.0, G=.0, T=1.
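A minimal sketch of scoring a sequence against this profile (since all transitions have probability 1, the likelihood is just the product of per-site emissions):

    # Per-site emission tables of the non-gapped profile above.
    profile = [
        {"A": 0.0, "C": 0.0, "G": 0.2, "T": 0.8},
        {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
        {"A": 0.5, "C": 0.0, "G": 0.2, "T": 0.3},
        {"A": 0.1, "C": 0.1, "G": 0.0, "T": 0.8},
        {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.0},
        {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    ]

    def profile_prob(seq):
        """P(seq | profile) = product of per-site emission probabilities."""
        p = 1.0
        for site, x in zip(profile, seq):
            p *= site[x]
        return p

    print(profile_prob("TATTAT"))   # .8 * 1 * .3 * .8 * .9 * 1 ≈ 0.173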
How Do We Model Gaps?
- A gap can result from "deletion" or "insertion"
- Deletion = hidden delete state; insertion = hidden insert state
[Figures: two profile HMMs for the gapped alignments below. A silent delete state (emitting "-" with probability 1) lets a path skip a match state; an insert state (emitting A, C, G, T uniformly at .25) lets a path emit extra symbols between match states. Transition probabilities (.7/.3, .8/.2, 1.) split the flow between the match path and the gap states.]

Alignment 1:    Alignment 2:
TAGTAT          TAGTAT
TAATCT          TAATCT
TAAT-T          TAATAT
TATT-T          TATTAT
TAAT-T          TAATGT
TATT-T          TATT-T
GAAT-T          GAATAT
GAAT-T          GAAT-T
TAGC-T          TAGC-T
TATT-T          TATTCT
Profile HMM

[Figure: profile HMM built from the alignment below, showing transition probabilities, output probabilities, and an insertion state]

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Profile alignment:
- E.g., what is the most likely path to generate ACATATC?
- How likely is ACATATC to be generated by this profile?
In General: HMM Sequence Profile
[Figure: general profile HMM from S to E: a backbone of match states (M), with delete states (D) above and insert states (i) below]
HMM For CpG Islands
[Figure: two connected Markov chains: a CpG generator over states A+, C+, G+, T+ and a regular-sequence model over A-, C-, G-, T-]

Transition probabilities within the "+" (CpG) model:

  +     A     C     G     T
  A   .180  .274  .426  .120
  C   .171  .368  .274  .188
  G   .161  .339  .375  .125
  T   .079  .355  .384  .182

Transition probabilities within the "-" (regular) model:

  -     A     C     G     T
  A   .300  .205  .285  .210
  C   .322  .298  .078  .302
  G   .248  .246  .298  .208
  T   .177  .239  .292  .292

(Rows are the current symbol, columns the next symbol; note the much higher C→G probability in the "+" model.)
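A minimal log-odds scorer built directly on these tables (the standard approach, as in Durbin ch. 3; base-2 logs): positive scores mean the dinucleotide statistics look CpG-like.

    from math import log2

    # Transition tables from the slide (rows/columns in A, C, G, T order).
    order = "ACGT"
    plus  = [[.180, .274, .426, .120],
             [.171, .368, .274, .188],
             [.161, .339, .375, .125],
             [.079, .355, .384, .182]]
    minus = [[.300, .205, .285, .210],
             [.322, .298, .078, .302],
             [.248, .246, .298, .208],
             [.177, .239, .292, .292]]

    def cpg_score(seq):
        """Sum of log2 odds a+(x,y)/a-(x,y) over consecutive symbol pairs."""
        s = 0.0
        for x, y in zip(seq, seq[1:]):
            s += log2(plus[order.index(x)][order.index(y)] /
                      minus[order.index(x)][order.index(y)])
        return s

    print(cpg_score("CGCG"))   # ≈ +4.1: CpG-like
    print(cpg_score("TATA"))   # ≈ -3.1: background-like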
Modeling Gene Structure With HMM
- Genes are organized into sequential functional regions
- Regions have distinct statistical behaviors

[Figure: gene structure; splice sites mark the boundaries between regions]
HMM Gene Models
- HMM "state" = region; Markov transitions between regions
- Emissions over {A,C,T,G}; regions have different emission probabilities

[Figure: HMM gene model; e.g., one region emits with A=.29, C=.31, G=.04, T=.36]
Computing Probabilities on HMM
- Path = a sequence of states
  E.g., X = FFBBBF
  Path probability: 0.5·0.6·0.4·(0.2)²·0.8 = 3.84×10⁻³
- Probability of a sequence emitted by a path, p(S|X):
  E.g., p(HHHHHH|FFBBBF) = p(H|F)p(H|F)p(H|B)p(H|B)p(H|B)p(H|F) = (0.5)³(0.9)³ ≈ 0.09
- Note: usually one avoids multiplications and computes logarithms to minimize error propagation
[Figure: the coin HMM with a Start state: Start→F = .5, Start→B = .5; transitions a(F,F)=.6, a(F,B)=.4, a(B,F)=.8, a(B,B)=.2; emissions e(F,H)=0.5, e(F,T)=0.5, e(B,H)=0.9, e(B,T)=0.1]
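A minimal sketch reproducing both computations (Python; parameters as in the figure):

    # Coin HMM parameters from the figure above.
    start = {"F": 0.5, "B": 0.5}
    trans = {"F": {"F": 0.6, "B": 0.4}, "B": {"F": 0.8, "B": 0.2}}
    emit  = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}

    def path_prob(path):
        """P(X): start probability times the chain of transition probabilities."""
        p = start[path[0]]
        for i, j in zip(path, path[1:]):
            p *= trans[i][j]
        return p

    def emission_prob(seq, path):
        """P(S|X): product of per-step emission probabilities."""
        p = 1.0
        for x, s in zip(seq, path):
            p *= emit[s][x]
        return p

    print(path_prob("FFBBBF"))                # 0.00384
    print(emission_prob("HHHHHH", "FFBBBF"))  # ≈ 0.0911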
The Three Computational Problems of HMM
- Decoding: what is the most likely sequence of transitions & emissions that generated a given observed sequence?
- Likelihood: how likely is an observed sequence to have been generated by a given HMM?
- Learning: how should transition and emission probabilities be learned from observed sequences?
The Decoding Problem: Viterbi's Decoder
- Input: an observed sequence S
- Output: a hidden path X maximizing P(S|X)
- Key idea (Viterbi): map to a dynamic programming problem
  - Describe the problem as optimizing a path over a grid
  - DP search: (a) compute the "price" of forward paths; (b) backtrack
  - Complexity: O(m²n) (m = number of states, n = sequence length)
[Figure: DP grid with states (s1…s5 / A1…A6) on the vertical axis and positions 1…18 of the observed sequence on the horizontal axis; a path across the grid is a state sequence]
Viterbi’s Decoder
- F(i,k) = probability of the most likely path to state i generating S1…Sk
- Forward recursion:
  F(i,k+1) = e(i,Sk+1) · max_j { F(j,k) · a(j,i) }
- Initialization: F(0,0) = 1, F(i,0) = 0 for i ≠ 0
- Backtracking: start with the highest F(i,n) and backtrack
[Figure: trellis column k → k+1 for symbol Sk+1; the best path to state i at step k+1 extends the best path to some state j at step k, e.g., with score F(6,k)·a(6,i)]
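A minimal sketch of this decoder (Python, probability space, dict-based HMM as in the earlier sketches; real implementations work in log space, see the computational note below):

    def viterbi(seq, start, trans, emit):
        """Return the most likely hidden path for seq and its probability."""
        states = list(start)
        # F[i] = probability of the best path ending in state i;
        # back holds, per step, the predecessor of each state on its best path.
        F = {i: start[i] * emit[i][seq[0]] for i in states}
        back = []
        for x in seq[1:]:
            prev, F, ptr = F, {}, {}
            for i in states:
                j = max(states, key=lambda s: prev[s] * trans[s][i])
                F[i] = emit[i][x] * prev[j] * trans[j][i]
                ptr[i] = j
            back.append(ptr)
        end = max(states, key=F.get)      # highest final score
        path = [end]
        for ptr in reversed(back):        # backtrack
            path.append(ptr[path[-1]])
        return "".join(reversed(path)), F[end]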
Example: Dishonest Coin Tossing
What is the most likely sequence of transitions & emissions to explain the observation S = HHHHHH?
[Figure: Viterbi trellis for S = HHHHHH on the coin HMM, with Start→F = .5, Start→B = .5]

Edge weights a(j,i)·e(i,H):
  into F: from F .6×.5 = .3, from B .8×.5 = .4
  into B: from F .4×.9 = .36, from B .2×.9 = .18

Trellis values F(i,k):

   k:    1     2     3     4     5      6
   F:   .25   .18   .054  .026  .0078  .0037
   B:   .45   .09   .065  .019  .0093  .0028

Backtracking from the highest final value, F(F,6) ≈ .0037, recovers the most likely path B F B F B F, with probability w = .5·(.9)·(.4)³·(.36)² ≈ .0037.
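Running the decoder sketch from the previous slide on this model reproduces the trellis (assumes viterbi() and the start/trans/emit dicts defined earlier):

    path, p = viterbi("HHHHHH", start, trans, emit)
    print(path, p)   # BFBFBF 0.0037... (matches the trellis above)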
Example: CpG Islands
Given: observed sequence CGCG. What is the likely state sequence generating it?
[Figure: Viterbi trellis for CGCG; rows are the eight hidden states A+, C+, G+, T+, A-, C-, G-, T-; columns are the observed symbols C G C G, with a Start state at position 0]
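A sketch of decoding CGCG with the eight-state model, reusing viterbi() and the plus/minus tables from the sketches above. The probability of switching between the "+" and "-" chains is not given in these slides, so the 0.1 below (and the uniform start) is an illustrative assumption:

    order = "ACGT"
    switch = 0.1                                    # assumed +/- switching probability
    states = [x + s for s in "+-" for x in order]   # A+, C+, ..., T-
    start = {i: 1.0 / len(states) for i in states}  # assumed uniform start
    tables = {"+": plus, "-": minus}

    trans = {}
    for s in "+-":
        for x in order:
            i = x + s
            trans[i] = {}
            for t in "+-":
                for y in order:
                    # Destination model's table, scaled by the stay/switch probability.
                    p = tables[t][order.index(x)][order.index(y)]
                    trans[i][y + t] = p * ((1 - switch) if s == t else switch)

    # Each state deterministically emits its own letter.
    emit = {i: {x: 1.0 if i[0] == x else 0.0 for x in order} for i in states}

    path, p = viterbi("CGCG", start, trans, emit)
    print(path)   # C+G+C+G+ : the CpG ("+") chain explains CGCG best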
Computational Note
- Computing probability products propagates errors
- Instead of multiplying probabilities, add log-likelihoods: define f(i,k) = log F(i,k)
  f(i,k+1) = log e(i,Sk+1) + max_j { f(j,k) + log a(j,i) }
- Or define the weight w(j,i,k) = log e(i,Sk+1) + log a(j,i) to get the standard DP formulation
  f(i,k+1) = max_j { f(j,k) + w(j,i,k) }
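The same decoder in log space (a sketch; natural logs rather than the base-2 logs of the next slide, with log 0 guarded by -inf):

    from math import log, inf

    def viterbi_log(seq, start, trans, emit):
        """Log-space Viterbi: add log-probabilities instead of multiplying."""
        lg = lambda p: log(p) if p > 0 else -inf
        states = list(start)
        f = {i: lg(start[i]) + lg(emit[i][seq[0]]) for i in states}
        back = []
        for x in seq[1:]:
            prev, f, ptr = f, {}, {}
            for i in states:
                j = max(states, key=lambda s: prev[s] + lg(trans[s][i]))
                f[i] = lg(emit[i][x]) + prev[j] + lg(trans[j][i])
                ptr[i] = j
            back.append(ptr)
        end = max(states, key=f.get)
        path = [end]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return "".join(reversed(path)), f[end]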
Example
What is the most likely sequence of transitions & emissions to explain the observation S = HHHHHH? (using base-2 logs)
[Figure: log-domain Viterbi trellis for S = HHHHHH]

Base-2 log emissions: lg e(F,H) = lg e(F,T) = -1; lg e(B,H) = -0.15; lg e(B,T) = -3.32
Base-2 log transitions: lg a(F,F) = -0.74, lg a(F,B) = -1.32, lg a(B,F) = -0.32, lg a(B,B) = -2.32; Start: -1 to each state

Weights W(j,i,H) = lg a(j,i) + lg e(i,H):

  from\to        F                     B
  Start         -2                    -1.15
  F        -0.74-1 = -1.74      -1.32-0.15 = -1.47
  B        -0.32-1 = -1.32      -2.32-0.15 = -2.47

First trellis columns, via f(i,k+1) = max_j { f(j,k) + w(j,i,k) }:
  f(F,1) = -2,  f(B,1) = -1.15
  f(F,2) = max(-2-1.74, -1.15-1.32) = -2.47
  f(B,2) = max(-2-1.47, -1.15-2.47) = -3.47
Concluding Notes
- Viterbi decoding recovers the hidden pathway of an observed sequence
- The hidden pathway explains the underlying structure
  - E.g., identify CpG islands
  - E.g., align a sequence against a profile
  - E.g., determine gene structure…
- This leaves the two other HMM computational problems:
  - How do we extract an HMM model from observed sequences?
  - How do we compute the likelihood of a given sequence?