Prof. Yechiam Yemini (YY)
Computer Science Department, Columbia University
Chapter 4: Hidden Markov Models
4.1 Introduction to HMM
Overview
- Markov models of sequence structures
- Introduction to Hidden Markov Models (HMM)
- HMM algorithms; the Viterbi decoder
Durbin chapters 3-5
The Challenges
- Biological sequences have modular structure
  - Genes: exons, introns
  - Promoter regions: modules
  - Proteins: domains, folds, structural parts, active parts
- How do we identify informative regions?
  - How do we find & map genes?
  - How do we find & map promoter regions?
Mapping Protein Regions
[Figure: protein structure with individual residues labeled (e.g., Y253, E255, T315, M351, H396); marked regions: ATP binding domain, activation domain, binding site. Legend: active components, interfaces.]
Statistical Sequence Analysis
- Example: CpG islands indicate important regions
  - CG (denoted CpG) is typically transformed by methylation into TG
  - Promoter/start regions of genes suppress methylation
  - This leads to higher CpG density
  - How do we find CpG islands?
- Example: active protein regions are statistically similar
  - Evolution conserves structural motifs but varies sequences
- Simple comparison techniques are insufficient:
  - Global/local alignment
  - Consensus sequences
The challenge: analyzing statistical features of regions
Review of Markovian Modeling
- Recall: a Markov chain is described by transition probabilities
  π(n+1) = π(n)·A (π(n) a row vector), where
  π(i,n) = Prob{S(n)=i} is the state probability
  A(i,j) = Prob{S(n+1)=j | S(n)=i} is the transition probability
- Markov chains describe statistical evolution
  - In time: evolutionary change depends on the previous state only
  - In space: change depends on neighboring sites only
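A minimal sketch of this update rule in Python (numpy assumed; the 2-state matrix below is an illustrative assumption, not from the slides):

    import numpy as np

    # Illustrative 2-state chain: A[i][j] = Prob{S(n+1)=j | S(n)=i},
    # so each row sums to 1 (values assumed for demonstration).
    A = np.array([[0.9, 0.1],
                  [0.3, 0.7]])

    pi = np.array([1.0, 0.0])   # start in state 0 with certainty
    for n in range(5):
        pi = pi @ A             # pi(n+1) = pi(n) A  (row-vector form)
        print(n + 1, pi)

Iterating the update drives π(n) toward the chain's stationary distribution.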
From Markov To Hidden Markov Models (HMM)
- Nature uses different statistics for evolving different regions
  - Gene regions: CpG, promoters, introns/exons…
  - Protein regions: active sites, interfaces, hydrophobic/hydrophilic…
- How can we tell regions apart?
  - Sample sequences have different statistics
  - Model regions as Markovian states emitting observed sequences…
- Example: CpG islands
  - Model: two connected MCs, one for CpG and one for normal sequence
  - The MC is hidden; only sample sequences are seen
  - Detect transitions to/from the CpG MC
  - Similar to a dishonest casino: transitions from fair to biased dice
Hidden Markov Models
HMM Basics
- A Markov chain: states & transition probabilities A = [a(i,j)]
- Observable symbols for each state, O(i)
- A probability e(i,X) of emitting symbol X at state i
[Figure: two-state HMM over states F and B. Transitions: a(F,F)=.6, a(F,B)=.4, a(B,F)=.8, a(B,B)=.2. Emissions: e(F,H)=0.5, e(F,T)=0.5, e(B,H)=0.9, e(B,T)=0.1.]
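A minimal sketch of this HMM as data plus a sampler (Python; the .5/.5 start probabilities are taken from the later decoding slides):

    import random

    # The two-state coin HMM from the figure: F = fair, B = biased.
    hmm = {
        "start": {"F": 0.5, "B": 0.5},
        "trans": {"F": {"F": 0.6, "B": 0.4},
                  "B": {"F": 0.8, "B": 0.2}},
        "emit":  {"F": {"H": 0.5, "T": 0.5},
                  "B": {"H": 0.9, "T": 0.1}},
    }

    def sample(hmm, n):
        """Sample a hidden path and the observed sequence it emits."""
        pick = lambda d: random.choices(list(d), weights=list(d.values()))[0]
        state, path, seq = pick(hmm["start"]), [], []
        for _ in range(n):
            path.append(state)
            seq.append(pick(hmm["emit"][state]))
            state = pick(hmm["trans"][state])
        return "".join(path), "".join(seq)

    print(sample(hmm, 10))   # e.g. ('BFBFFFBFBF', 'HHTHTTHHHH')

Only the second string is observable; recovering the first is the decoding problem treated below.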
Coin Example
- Two-state MC {F,B}: F = fair coin, B = biased coin
- Emission probabilities can be described in state boxes or through separate emission boxes
- Example: transmembrane proteins, with hydrophilic/hydrophobic regions

[Figure: the same coin HMM drawn both ways: emissions written inside the state boxes (e(F,H)=.5, e(F,T)=.5, e(B,H)=.9, e(B,T)=.1), or as emission boxes over the symbols H, T (.5/.5 for F, .9/.1 for B); transitions .6, .4, .8, .2 as before.]
HMM Profile Example (Non-gapped)
HMM built from ten aligned sequences (one per row):

TAGTAT
TAATAT
TAATAT
TATAAT
TAATAT
TATTAT
GAATAT
GAATCT
TAGCAT
TATTAT

Per-site nucleotide counts:

     1   2   3   4   5   6
A    0  10   5   1   9   0
C    0   0   0   1   1   0
G    2   0   2   0   0   0
T    8   0   3   8   0  10

A state per site: S → 1 → 2 → 3 → 4 → 5 → 6 → E, every transition probability 1. Per-site emission probabilities (counts/10):

site 1: A=.0, C=.0, G=.2, T=.8
site 2: A=1., C=.0, G=.0, T=.0
site 3: A=.5, C=.0, G=.2, T=.3
site 4: A=.1, C=.1, G=.0, T=.8
site 5: A=.9, C=.1, G=.0, T=.0
site 6: A=.0, C=.0, G=.0, T=1.
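A minimal sketch of scoring a sequence against this profile (since all transitions have probability 1, the likelihood is just the product of per-site emissions):

    # Per-site emission tables of the non-gapped profile above.
    profile = [
        {"A": 0.0, "C": 0.0, "G": 0.2, "T": 0.8},
        {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
        {"A": 0.5, "C": 0.0, "G": 0.2, "T": 0.3},
        {"A": 0.1, "C": 0.1, "G": 0.0, "T": 0.8},
        {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.0},
        {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    ]

    def profile_prob(seq):
        """P(seq | profile) = product of per-site emission probabilities."""
        p = 1.0
        for site, x in zip(profile, seq):
            p *= site[x]
        return p

    print(profile_prob("TATTAT"))   # .8 * 1 * .3 * .8 * .9 * 1 ≈ 0.173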
How Do We Model Gaps?
- A gap can result from "deletion" or "insertion"
- Deletion = hidden delete state; insertion = hidden insert state
[Figures: two profile HMMs for the gapped alignments below. A silent delete state (emitting "-" with probability 1) lets a path skip a match state; an insert state (emitting A, C, G, T uniformly at .25) lets a path emit extra symbols between match states. Transition probabilities (.7/.3, .8/.2, 1.) split the flow between the match path and the gap states.]

Alignment 1:    Alignment 2:
TAGTAT          TAGTAT
TAATCT          TAATCT
TAAT-T          TAATAT
TATT-T          TATTAT
TAAT-T          TAATGT
TATT-T          TATT-T
GAAT-T          GAATAT
GAAT-T          GAAT-T
TAGC-T          TAGC-T
TATT-T          TATTCT
Profile HMM

[Figure: profile HMM built from the alignment below, showing transition probabilities, output probabilities, and an insertion state]

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Profile alignment:
- E.g., what is the most likely path to generate ACATATC?
- How likely is ACATATC to be generated by this profile?
In General: HMM Sequence Profile
[Figure: general profile HMM from S to E: a backbone of match states (M), with delete states (D) above and insert states (i) below]
HMM For CpG Islands
[Figure: two connected Markov chains: a CpG generator over states A+, C+, G+, T+ and a regular-sequence model over A-, C-, G-, T-]

Transition probabilities within the "+" (CpG) model:

  +     A     C     G     T
  A   .180  .274  .426  .120
  C   .171  .368  .274  .188
  G   .161  .339  .375  .125
  T   .079  .355  .384  .182

Transition probabilities within the "-" (regular) model:

  -     A     C     G     T
  A   .300  .205  .285  .210
  C   .322  .298  .078  .302
  G   .248  .246  .298  .208
  T   .177  .239  .292  .292

(Rows are the current symbol, columns the next symbol; note the much higher C→G probability in the "+" model.)
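A minimal log-odds scorer built directly on these tables (the standard approach, as in Durbin ch. 3; base-2 logs): positive scores mean the dinucleotide statistics look CpG-like.

    from math import log2

    # Transition tables from the slide (rows/columns in A, C, G, T order).
    order = "ACGT"
    plus  = [[.180, .274, .426, .120],
             [.171, .368, .274, .188],
             [.161, .339, .375, .125],
             [.079, .355, .384, .182]]
    minus = [[.300, .205, .285, .210],
             [.322, .298, .078, .302],
             [.248, .246, .298, .208],
             [.177, .239, .292, .292]]

    def cpg_score(seq):
        """Sum of log2 odds a+(x,y)/a-(x,y) over consecutive symbol pairs."""
        s = 0.0
        for x, y in zip(seq, seq[1:]):
            s += log2(plus[order.index(x)][order.index(y)] /
                      minus[order.index(x)][order.index(y)])
        return s

    print(cpg_score("CGCG"))   # ≈ +4.1: CpG-like
    print(cpg_score("TATA"))   # ≈ -3.1: background-like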
Modeling Gene Structure With HMM
- Genes are organized into sequential functional regions
- Regions have distinct statistical behaviors

[Figure: gene structure; splice sites mark the boundaries between regions]
HMM Gene Models
- HMM "state" = region; Markov transitions between regions
- Emissions over {A,C,T,G}; regions have different emission probabilities

[Figure: HMM gene model; e.g., one region emits with A=.29, C=.31, G=.04, T=.36]
Computing Probabilities on HMM
- Path = a sequence of states
  E.g., X = FFBBBF
  Path probability: 0.5·0.6·0.4·(0.2)²·0.8 = 3.84×10⁻³
- Probability of a sequence emitted by a path, p(S|X):
  E.g., p(HHHHHH|FFBBBF) = p(H|F)p(H|F)p(H|B)p(H|B)p(H|B)p(H|F) = (0.5)³(0.9)³ ≈ 0.09
- Note: usually one avoids multiplications and computes logarithms to minimize error propagation
[Figure: the coin HMM with a Start state: Start→F = .5, Start→B = .5; transitions a(F,F)=.6, a(F,B)=.4, a(B,F)=.8, a(B,B)=.2; emissions e(F,H)=0.5, e(F,T)=0.5, e(B,H)=0.9, e(B,T)=0.1]
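A minimal sketch reproducing both computations (Python; parameters as in the figure):

    # Coin HMM parameters from the figure above.
    start = {"F": 0.5, "B": 0.5}
    trans = {"F": {"F": 0.6, "B": 0.4}, "B": {"F": 0.8, "B": 0.2}}
    emit  = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}

    def path_prob(path):
        """P(X): start probability times the chain of transition probabilities."""
        p = start[path[0]]
        for i, j in zip(path, path[1:]):
            p *= trans[i][j]
        return p

    def emission_prob(seq, path):
        """P(S|X): product of per-step emission probabilities."""
        p = 1.0
        for x, s in zip(seq, path):
            p *= emit[s][x]
        return p

    print(path_prob("FFBBBF"))                # 0.00384
    print(emission_prob("HHHHHH", "FFBBBF"))  # ≈ 0.0911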
The Three Computational Problems of HMM
- Decoding: what is the most likely sequence of transitions & emissions that generated a given observed sequence?
- Likelihood: how likely is an observed sequence to have been generated by a given HMM?
- Learning: how should transition and emission probabilities be learned from observed sequences?
The Decoding Problem: Viterbi's Decoder
- Input: an observed sequence S
- Output: a hidden path X maximizing P(S|X)
- Key idea (Viterbi): map to a dynamic programming problem
  - Describe the problem as optimizing a path over a grid
  - DP search: (a) compute the "price" of forward paths; (b) backtrack
  - Complexity: O(m²n) (m = number of states, n = sequence length)
[Figure: DP grid with states (s1…s5 / A1…A6) on the vertical axis and positions 1…18 of the observed sequence on the horizontal axis; a path across the grid is a state sequence]
Viterbi’s Decoder
- F(i,k) = probability of the most likely path to state i generating S1…Sk
- Forward recursion:
  F(i,k+1) = e(i,Sk+1) · max_j { F(j,k) · a(j,i) }
- Initialization: F(0,0) = 1, F(i,0) = 0 for i ≠ 0
- Backtracking: start with the highest F(i,n) and backtrack
[Figure: trellis column k → k+1 for symbol Sk+1; the best path to state i at step k+1 extends the best path to some state j at step k, e.g., with score F(6,k)·a(6,i)]
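A minimal sketch of this decoder (Python, probability space, dict-based HMM as in the earlier sketches; real implementations work in log space, see the computational note below):

    def viterbi(seq, start, trans, emit):
        """Return the most likely hidden path for seq and its probability."""
        states = list(start)
        # F[i] = probability of the best path ending in state i;
        # back holds, per step, the predecessor of each state on its best path.
        F = {i: start[i] * emit[i][seq[0]] for i in states}
        back = []
        for x in seq[1:]:
            prev, F, ptr = F, {}, {}
            for i in states:
                j = max(states, key=lambda s: prev[s] * trans[s][i])
                F[i] = emit[i][x] * prev[j] * trans[j][i]
                ptr[i] = j
            back.append(ptr)
        end = max(states, key=F.get)      # highest final score
        path = [end]
        for ptr in reversed(back):        # backtrack
            path.append(ptr[path[-1]])
        return "".join(reversed(path)), F[end]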
Example: Dishonest Coin Tossing
What is the most likely sequence of transitions & emissions to explain the observation S = HHHHHH?
[Figure: Viterbi trellis for S = HHHHHH on the coin HMM, with Start→F = .5, Start→B = .5]

Edge weights a(j,i)·e(i,H):
  into F: from F .6×.5 = .3, from B .8×.5 = .4
  into B: from F .4×.9 = .36, from B .2×.9 = .18

Trellis values F(i,k):

   k:    1     2     3     4     5      6
   F:   .25   .18   .054  .026  .0078  .0037
   B:   .45   .09   .065  .019  .0093  .0028

Backtracking from the highest final value, F(F,6) ≈ .0037, recovers the most likely path B F B F B F, with probability w = .5·(.9)·(.4)³·(.36)² ≈ .0037.
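Running the decoder sketch from the previous slide on this model reproduces the trellis (assumes viterbi() and the start/trans/emit dicts defined earlier):

    path, p = viterbi("HHHHHH", start, trans, emit)
    print(path, p)   # BFBFBF 0.0037... (matches the trellis above)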
Example: CpG Islands
Given: observed sequence CGCG. What is the likely state sequence generating it?
[Figure: Viterbi trellis for CGCG; rows are the eight hidden states A+, C+, G+, T+, A-, C-, G-, T-; columns are the observed symbols C G C G, with a Start state at position 0]
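A sketch of decoding CGCG with the eight-state model, reusing viterbi() and the plus/minus tables from the sketches above. The probability of switching between the "+" and "-" chains is not given in these slides, so the 0.1 below (and the uniform start) is an illustrative assumption:

    order = "ACGT"
    switch = 0.1                                    # assumed +/- switching probability
    states = [x + s for s in "+-" for x in order]   # A+, C+, ..., T-
    start = {i: 1.0 / len(states) for i in states}  # assumed uniform start
    tables = {"+": plus, "-": minus}

    trans = {}
    for s in "+-":
        for x in order:
            i = x + s
            trans[i] = {}
            for t in "+-":
                for y in order:
                    # Destination model's table, scaled by the stay/switch probability.
                    p = tables[t][order.index(x)][order.index(y)]
                    trans[i][y + t] = p * ((1 - switch) if s == t else switch)

    # Each state deterministically emits its own letter.
    emit = {i: {x: 1.0 if i[0] == x else 0.0 for x in order} for i in states}

    path, p = viterbi("CGCG", start, trans, emit)
    print(path)   # C+G+C+G+ : the CpG ("+") chain explains CGCG best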
Computational Note
- Computing probability products propagates errors
- Instead of multiplying probabilities, add log-likelihoods: define f(i,k) = log F(i,k)
  f(i,k+1) = log e(i,Sk+1) + max_j { f(j,k) + log a(j,i) }
- Or define the weight w(j,i,k) = log e(i,Sk+1) + log a(j,i) to get the standard DP formulation
  f(i,k+1) = max_j { f(j,k) + w(j,i,k) }
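The same decoder in log space (a sketch; natural logs rather than the base-2 logs of the next slide, with log 0 guarded by -inf):

    from math import log, inf

    def viterbi_log(seq, start, trans, emit):
        """Log-space Viterbi: add log-probabilities instead of multiplying."""
        lg = lambda p: log(p) if p > 0 else -inf
        states = list(start)
        f = {i: lg(start[i]) + lg(emit[i][seq[0]]) for i in states}
        back = []
        for x in seq[1:]:
            prev, f, ptr = f, {}, {}
            for i in states:
                j = max(states, key=lambda s: prev[s] + lg(trans[s][i]))
                f[i] = lg(emit[i][x]) + prev[j] + lg(trans[j][i])
                ptr[i] = j
            back.append(ptr)
        end = max(states, key=f.get)
        path = [end]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return "".join(reversed(path)), f[end]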
Example
What is the most likely sequence of transitions & emissions to explain the observation S = HHHHHH? (using base-2 logs)
[Figure: log-domain Viterbi trellis for S = HHHHHH]

Base-2 log emissions: lg e(F,H) = lg e(F,T) = -1; lg e(B,H) = -0.15; lg e(B,T) = -3.32
Base-2 log transitions: lg a(F,F) = -0.74, lg a(F,B) = -1.32, lg a(B,F) = -0.32, lg a(B,B) = -2.32; Start: -1 to each state

Weights W(j,i,H) = lg a(j,i) + lg e(i,H):

  from\to        F                     B
  Start         -2                    -1.15
  F        -0.74-1 = -1.74      -1.32-0.15 = -1.47
  B        -0.32-1 = -1.32      -2.32-0.15 = -2.47

First trellis columns, via f(i,k+1) = max_j { f(j,k) + w(j,i,k) }:
  f(F,1) = -2,  f(B,1) = -1.15
  f(F,2) = max(-2-1.74, -1.15-1.32) = -2.47
  f(B,2) = max(-2-1.47, -1.15-2.47) = -3.47
Concluding Notes
- Viterbi decoding recovers the hidden pathway of an observed sequence
- The hidden pathway explains the underlying structure
  - E.g., identify CpG islands
  - E.g., align a sequence against a profile
  - E.g., determine gene structure…
- This leaves the two other HMM computational problems:
  - How do we extract an HMM model from observed sequences?
  - How do we compute the likelihood of a given sequence?