1 Introduction to Hidden Markov Models

1.1 A quick biological example

Some organisms have a genome that contains a higher GC content in protein-coding regions than in non-coding regions. This can be modeled using a State Machine, which can be used to identify coding regions by their GC content.

[State-machine diagram: a Begin state leading to two states, Coding and Noncoding, with transitions between them. Emission probabilities — Coding: Pr(A)=0.2, Pr(C)=0.3, Pr(G)=0.3, Pr(T)=0.2; Noncoding: Pr(A)=0.3, Pr(C)=0.2, Pr(G)=0.2, Pr(T)=0.3.]

• From the current state, the machine will pick the next state according to its transition probability distribution.

• It then outputs a base according to the new state’s probability distribution on bases.

• The model is said to “generate” the observations (outputs); a small simulation of this generative process is sketched after this list.

State variables:          q0   q1   q2   q3   q4   q5   q6   q7
States:                   B    N    N    N    C    C    C    C
Observations:                  A    T    T    G    A    C    G
Observation variables:         O1   O2   O3   O4   O5   O6   O7

• Because HMMs are often used to model time series, such as speech sounds, the different positions in the input sequence are often referred to as “times” and represented with the subscript t.
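
To make the generative process concrete, here is a minimal Python sketch of the model above. The emission probabilities are taken from the diagram; the transition probabilities (including the probability of leaving the Begin state) are not given there, so the values below are assumptions chosen only to make the sketch runnable.

import random

# Emission probabilities from the state-machine diagram above.
emit = {
    "Coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "Noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

# Transition probabilities are NOT specified in the diagram; these numbers
# are assumptions for illustration only.
trans = {
    "Begin":     {"Coding": 0.5, "Noncoding": 0.5},
    "Coding":    {"Coding": 0.9, "Noncoding": 0.1},
    "Noncoding": {"Coding": 0.1, "Noncoding": 0.9},
}

def generate(T):
    """Generate T bases: pick the next state, then emit from the new state."""
    state, states, bases = "Begin", [], []
    for _ in range(T):
        # Pick the next state according to the current state's transition distribution.
        state = random.choices(list(trans[state]), weights=list(trans[state].values()))[0]
        states.append(state)
        # Output a base according to the new state's distribution on bases.
        bases.append(random.choices(list(emit[state]), weights=list(emit[state].values()))[0])
    return states, bases

states, bases = generate(10)
print(states)
print(bases)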

1.2 HMM Applications

HMMs and their variants have many applications in sequence analysis:


• Predicting Exons and Introns in genomic DNA.

• Identifying functional motifs (domains) in proteins (profile HMM).

• Aligning two sequences (pair HMM).

1.3 Abstract definition of an HMM

An HMM consists of:

• A set of states {S1, . . . , SN} and a special state Begin,

• An alphabet of observation symbols {A1, . . . , AM}, e.g. A, C, G, T for DNA.

• A probability distribution on pairs of sequences (O1...T, q1...T), where O1...T ∈ {A1, . . . , AM}^T (an observation sequence) and q1...T ∈ {S1, . . . , SN}^T (a state sequence), such that:

1. The observation at time t, Ot, is independent of all other observations and states, given the state at time t, qt.

2. The state at time t, qt, is independent of all earlier states and observations, given the state at time t−1, qt−1.

In formal notation:

1. Pr(Ot = Ak | q1...T, O1...t−1, Ot+1...T) = Pr(Ot = Ak | qt)

2. Pr(qt = Sk | q1...t−1, O1...t−1) = Pr(qt = Sk | qt−1)

1.4 Graphical Model Representation

• Direct influences (dependencies) are shown as edges.

• If vertex u lies on the path from x to y, then x and y are independent given u: Pr(x|y, u) = Pr(x|u).


1.5 Factoring the joint distribution on state sequences and observed sequences

Pr(O1...T, q1...T) = Pr(q1) · Pr(O1|q1) · Pr(q2|q1, O1) · Pr(O2|q2, O1, q1) · . . .   [Chain rule]
                   = Pr(q1) · Pr(O1|q1) · ∏_{t=2}^{T} Pr(qt|qt−1) · Pr(Ot|qt)   [HMM independence]
                   = ∏_{t=1}^{T} Pr(qt|qt−1) · Pr(Ot|qt),

where the value of q0 is always a special state designated BEGIN (with probability one).

So an HMM can be completely specified by:

1. Pr(q1 = Si|q0 = BEGIN) for i = 1, . . . , N, the initial state probabilities. To make the notation more compact, we designate these probabilities with the symbols π1 . . . πN.

2. Pr(Ot = Ak|qt = Si) for i = 1, . . . , N, k = 1, . . . , M, the probability of outputting Ak from state Si. To make the notation more compact, we designate these probabilities with the symbols bi(Ak).

3. Pr(qt = Sj|qt−1 = Si) for i = 1, . . . , N, j = 1, . . . , N, the probability that the HMM will transition to state Sj when in state Si. We designate these probabilities with the symbols aij.

These probabilities are the same for all t; they do not change!
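
Concretely, an HMM is nothing more than the three tables π, a, and b, and the factorization above turns directly into a product over t. A minimal Python sketch using the GC-content model of Section 1.1 (emission probabilities from the diagram; the transition and initial probabilities are assumed, as before):

# pi: initial state probabilities, a: transitions, b: emissions.
states = ["Coding", "Noncoding"]
pi = {"Coding": 0.5, "Noncoding": 0.5}                   # assumed
a = {"Coding":    {"Coding": 0.9, "Noncoding": 0.1},     # assumed
     "Noncoding": {"Coding": 0.1, "Noncoding": 0.9}}
b = {"Coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "Noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def joint_probability(obs, path):
    """Pr(O_1...T, q_1...T) = prod_t Pr(q_t|q_{t-1}) * Pr(O_t|q_t), with q_0 = Begin."""
    p = pi[path[0]] * b[path[0]][obs[0]]                 # t = 1: transition out of Begin
    for t in range(1, len(obs)):
        p *= a[path[t - 1]][path[t]] * b[path[t]][obs[t]]
    return p

# Joint probability of the state/observation example from Section 1.1.
path = ["Noncoding"] * 3 + ["Coding"] * 4
print(joint_probability("ATTGACG", path))                # ~4.3e-06 with these assumed numbers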

2 The Decoding Problem

Given a series of observations O1...T, the decoding problem consists of finding the most likely state sequence, i.e.,

• The sequence of states that maximizes

    Pr(q1...T | O1...T) = Pr(q1...T, O1...T) / Pr(O1...T)

• Pr(O1...T) is constant (the observations are given), so the same state sequence maximizes Pr(q1...T, O1...T).
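
Since Pr(O1...T) is fixed once the observations are seen, the decoding problem can in principle be solved by brute force: enumerate every state sequence and keep the one with the largest joint probability. This takes time exponential in T, which is exactly what the Viterbi algorithm of the next section avoids, but it is a useful check on small examples. A sketch, reusing the GC-content HMM with the assumed transition and initial probabilities from the earlier sketches (repeated so this snippet runs on its own):

from itertools import product

# Assumed parameters (emissions from the Section 1.1 diagram; the rest assumed).
states = ["Coding", "Noncoding"]
pi = {"Coding": 0.5, "Noncoding": 0.5}
a = {"Coding":    {"Coding": 0.9, "Noncoding": 0.1},
     "Noncoding": {"Coding": 0.1, "Noncoding": 0.9}}
b = {"Coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "Noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def joint_probability(obs, path):
    p = pi[path[0]] * b[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= a[path[t - 1]][path[t]] * b[path[t]][obs[t]]
    return p

def brute_force_decode(obs):
    # Enumerate all N^T state sequences -- feasible only for tiny T.
    return max(product(states, repeat=len(obs)),
               key=lambda path: joint_probability(obs, path))

print(brute_force_decode("ATTGACG"))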

2.1 The Viterbi Algorithm: Mechanics

• Dynamic Programming (DP) algorithm for determining the most likely sequence of states given a sequence of observations.


Figure 1: A simple HMM diagram

Example. A series of coin tosses using one fair coin and one coin with heads on both sides, switched with probability 0.25.

• What is the most likely sequence of states after you’ve seen T? How about TH? THH?

• Can be answered by filling out a dynamic programming table (Viterbi) as shown in Figure 2.

• If δt(i) is the table entry in row i and column t, then

    δt(i) ← bi(Ot) · max_{j=1...N} aji δt−1(j),

  and δ1(i) ← bi(O1) πi.

• Find a maximum cell in the final column

• Trace back to source of max in the previous column. In Figure 2, the traceback path from each state is shown as a solid line.

Example. Figure 3 shows the same Viterbi matrix as in Figure 2, but with the calculations in general notation.


Figure 2: A partially completed Viterbi table. The traceback path from each state is shown asa solid line. If the traceback goes to state C1 the red line is solid; if it goes to state C2 the blueline is solid.


Figure 3: A partially completed Viterbi table in general notation


To obtain the most likely state sequence after the observations THH, find the state (row) that gives the maximum δ in the third column. In this case, δ3(1) = 3/2^6 is larger than δ3(2) = 3^2/2^8, so the final state of the most likely state sequence is state 1 (Coin 1). The second-to-last state is also state 1 because a1,1 δ2(1) > a2,1 δ2(2).
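
The recursion and traceback translate directly into a few lines of Python. The sketch below uses the coin HMM of this section; since the numbers of Figure 1 are not reproduced in the text, the initial probabilities (0.5 each) and the labeling of the two coins are assumptions.

# Viterbi: delta_1(i) = b_i(O_1) pi_i;  delta_t(i) = b_i(O_t) max_j a_ji delta_{t-1}(j).
states = ["Fair", "TwoHeaded"]
pi = {"Fair": 0.5, "TwoHeaded": 0.5}                     # assumed
a = {"Fair":      {"Fair": 0.75, "TwoHeaded": 0.25},
     "TwoHeaded": {"Fair": 0.25, "TwoHeaded": 0.75}}
b = {"Fair":      {"H": 0.5, "T": 0.5},
     "TwoHeaded": {"H": 1.0, "T": 0.0}}

def viterbi(obs):
    delta = [{i: pi[i] * b[i][obs[0]] for i in states}]  # column t = 1
    back = [{}]                                          # back-pointers, per column
    for t in range(1, len(obs)):
        col, ptr = {}, {}
        for i in states:
            # Best predecessor j for state i at time t.
            best_j = max(states, key=lambda j: a[j][i] * delta[t - 1][j])
            col[i] = b[i][obs[t]] * a[best_j][i] * delta[t - 1][best_j]
            ptr[i] = best_j
        delta.append(col)
        back.append(ptr)
    # Traceback: start at a maximum cell in the final column.
    state = max(states, key=lambda i: delta[-1][i])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path)), max(delta[-1].values())

print(viterbi("THH"))   # (['Fair', 'TwoHeaded', 'TwoHeaded'], 0.046875)

With these assumed parameters, the winning entry in the final column of the table for THH is 3/2^6 = 0.046875 and the runner-up is 3^2/2^8 ≈ 0.0352, matching the values quoted above.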

Exercise 2.1. For the HMM shown in Figure 1, fill out the 4th column of the table shown in Figure 2, assuming the 4th outcome was heads. Indicate which traceback pointer is most likely from each state, which state the traceback should start in, and what is the most likely state sequence.

Exercise 2.2. Follow the instructions for the previous exercise but assume the 4th outcome was tails.

2.2 Viterbi Algorithm: Correctness

Definition (argmax). Let D be a set, f : D → R a function, and S ⊆ D. argmax_{x∈S} f(x) is a member of S on which f takes its maximum value.

• E.g. argmax_{x∈R} [−(x − 1)^2] = 1, but max_{x∈R} [−(x − 1)^2] = 0 (attained at x = 1).

• Note: for any constant C > 0, argmax_{x∈R} C·f(x) = argmax_{x∈R} f(x).

Theorem (Viterbi Traceback). The final state of the most likely state sequence is argmax_i δT(i). If q∗t+1 is the (t + 1)th state of the most likely state sequence, then argmax_i Pr(q∗t+1|qt = Si) δt(i) is the tth state of the most likely state sequence.

To prove this, we must first have another theorem.

Theorem (Viterbi Recursion).

    δt(i) = max_{q1...t−1} Pr(q1...t−1, qt = Si, O1...t),

the joint probability of the most likely state sequence ending in state Si at time t and all the observations up to time t.

Proof. By induction. Base case.

δ1(i) = bi(O1) πi
      = Pr(O1|q1 = Si) Pr(q1 = Si)
      = Pr(q1 = Si, O1)


Now suppose the claim is true for times 1 . . . t−1. This supposition is called the induction hypothesis. Using the induction hypothesis, we will show it is true for time t. Several lines are left blank as exercises.

δt(i) = bi(Ot) max_{j=1...N} aji δt−1(j)   [Defn. of δt(i)]
      = bi(Ot) max_{j=1...N} aji max_{q1...t−2} Pr(qt−1 = Sj, q1...t−2, O1...t−1)   [Induction hyp.]
      = max_{j=1...N} max_{q1...t−2} bi(Ot) aji Pr(qt−1 = Sj, q1...t−2, O1...t−1)   [Move constants]
      = max_{j=1...N} max_{q1...t−2} bi(Ot) Pr(qt = Si|qt−1 = Sj) Pr(qt−1 = Sj, q1...t−2, O1...t−1)   [Defn. aji]
      = See exercises   [HMM indep.]
      = max_{j=1...N} max_{q1...t−2} bi(Ot) Pr(q1...t−2, qt−1 = Sj, qt = Si, O1...t−1)   [Chain rule]
      = max_{q1...t−1} bi(Ot) Pr(q1...t−1, qt = Si, O1...t−1)   [Absorb max_{j=1...N} into max_{q1...t−1}]
      = max_{q1...t−1} Pr(Ot|qt = Si) Pr(q1...t−1, qt = Si, O1...t−1)   [Defn. bi(Ot)]
      = max_{q1...t−1} Pr(Ot|q1...t−1, qt = Si, O1...t−1) Pr(q1...t−1, qt = Si, O1...t−1)   [HMM indep.]
      = See exercises   [Chain rule]

Proof. Viterbi Traceback. The last state of the most likely state sequence is:

    argmax_i max_{q1...T−1} Pr(q1...T−1, qT = Si, O1...T),

which, by the Recursion Theorem, is the same as:

    argmax_i δT(i).

This establishes the base case for a proof by induction where we prove something is true at an earlier time if it is true at a later time.

Suppose q∗t+1 is the (t + 1)th state of the most likely state sequence. This is the induction hypothesis. Given the induction hypothesis, we will prove that argmax_i Pr(q∗t+1|qt = Si) δt(i) is the tth state of the most likely state sequence.


argmax_i Pr(q∗t+1|qt = Si) δt(i)
  = argmax_i Pr(q∗t+1|qt = Si) max_{q1...t−1} Pr(q1...t−1, qt = Si, O1...t)   [Recursion Thm.]
  = argmax_i max_{q1...t−1} Pr(q∗t+1|qt = Si) Pr(q1...t−1, qt = Si, O1...t)   [Move const.]
  = argmax_i max_{q1...t−1} Pr(q∗t+1|q1...t−1, qt = Si, O1...t) Pr(q1...t−1, qt = Si, O1...t)   [See exercises]
  = argmax_i max_{q1...t−1} Pr(q∗t+1, q1...t−1, qt = Si, O1...t)   [Chain Rule]
  = argmax_i [max_{qt+2...T} Pr(qt+2...T, Ot+1...T|q∗t+1)] max_{q1...t−1} Pr(q∗t+1, q1...t−1, qt = Si, O1...t)   [Const. in argmax]
  = argmax_i max_{q1...t−1} max_{qt+2...T} Pr(qt+2...T, Ot+1...T|q∗t+1) Pr(q∗t+1, q1...t−1, qt = Si, O1...t)   [Move const.]
  = argmax_i max_{q1...t−1, qt+2...T} Pr(qt+2...T, Ot+1...T|q∗t+1, q1...t−1, qt = Si, O1...t) Pr(q∗t+1, q1...t−1, qt = Si, O1...t)   [HMM Indep.]
  = argmax_i max_{q1...t−1, qt+2...T} Pr(qt+2...T, Ot+1...T, q∗t+1, q1...t−1, qt = Si, O1...t)   [See exercises]
  = argmax_i max_{q1...t−1, qt+2...T} Pr(qt+2...T, q∗t+1, q1...t−1, qt = Si, O1...T),   [Notation]

which is the tth state of the most likely state sequence.

Exercise 2.3. Two lines in the proof of the Viterbi Recursion theorem in the course notes were left blank, though the justifications were provided. Please fill in those two lines.

Exercise 2.4. Two justification lines in the proof of the Viterbi Traceback theorem were also left blank. Please fill in those two lines.

Exercise 2.5. There is a casino that sometimes uses a fair die and sometimes uses a loaded die. You get to see the outcomes of the rolls, but not which die is used. For each die, you are given the probability of the faces, which are numbered 1 to 6. You are also given the probabilities of switching dice and the probability of starting with each die type – in short, an HMM.

Suppose you know that an Occasionally Dishonest Casino never uses the loaded die more than 2 times in a row. If you do nothing about this, the state sequence returned by the Viterbi algorithm may be an impossible one with 3 or more consecutive uses of the loaded die. This may happen if, by chance, some fair rolls that look more likely under the loaded model are adjacent to loaded rolls that also look more likely under the loaded model.

Your friend says, “No problem! As you trace back, if you have already gone through 2 consecutive instances of the ‘loaded’ state, automatically trace back to the ‘fair’ state next, ignoring the calculated deltas.”

This is easy to implement, and it will give you a state sequence with no runs of 3 or more “loaded” states. However, it is not guaranteed to be the most likely sequence with no runs of 3 or more “loaded” states.


1. Explain in simple English why this may not yield the most likely state sequence without 3 consecutive “loaded” states. Give an example where the result is not optimal, using the Occasionally Dishonest Casino. Specify the complete HMM and demonstrate, using the begin, transition, and emission probabilities, that the suggested algorithm does not produce the most likely state sequence with at most two consecutive “loaded” states.

2. Point out where the proofs of the Traceback Theorem and the Viterbi Recursion Theorem break down when the constraint forbidding more than 2 consecutive loaded rolls is imposed. I.e., why do the guarantees provided by these theorems no longer hold when this constraint is imposed?

(An alternative algorithm would be to change the transition probabilities on the forward pass of Viterbi. Whenever the traceback from the loaded state at time t goes through the loaded state at t − 1 and t − 2, the transition probability from loaded at t − 1 to loaded at t is set to 0, forcing the traceback away from the path through loaded at t − 1 and t − 2. This may give a different state sequence than the first algorithm. However, it is again not guaranteed to be the most likely state sequence with no runs of 3 or more consecutive loaded states. The reasons are the same.)

3. It is possible to correctly model the casino that never uses the loaded die more than twice in a row with an HMM, just not the two-state HMM we’ve talked about so far. What is the structure of the correct HMM for this situation, in terms of the number of states, their interpretations, and the allowable transitions among them?

2.3 Application of 4 rules for manipulating conditional probability expressions

• Goal: given some independence assumptions, simplify a joint probability Pr(x1, x2, . . . , xn). (A short worked example follows the list below.)

1. Use chain rule to factor the joint distribution into a product of conditional distributions.

2. Manipulate factors so that the independence relations can be used to replace complex expressions by simpler ones by:

– Using Bayes Rule to swap a variable behind the conditioning bar with one in front.

– Introducing a new variable in front of the conditioning bar and summing it out.


– Using an exhaustive conditionalization to introduce a new variable behind the conditioning bar.
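
For example, introducing the state variable qt in front of the conditioning bar and summing it out, then applying the chain rule and the HMM independence assumptions, gives the probability of the next observation given the current state (an illustrative derivation in the notation of Section 1.5):

Pr(Ot|qt−1 = Sj) = ∑_{i=1}^{N} Pr(Ot, qt = Si|qt−1 = Sj)   [Introduce qt, sum out]
                 = ∑_{i=1}^{N} Pr(qt = Si|qt−1 = Sj) Pr(Ot|qt = Si, qt−1 = Sj)   [Chain rule]
                 = ∑_{i=1}^{N} aji Pr(Ot|qt = Si)   [HMM indep., Defn. aji]
                 = ∑_{i=1}^{N} aji bi(Ot)   [Defn. bi(Ot)]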

3 Summing Out States

Used to compute:

• Pr(O1...T ), the probability of generating O1...T by any state sequence.

• Probability of being in state Si at time t, Pr(qt = Si|O1...T ).

• Most likely state at time t, argmaxi Pr(qt = Si|O1...T ).

Note: Most likely state sequence (Viterbi) ≠ sequence of most likely states.

Definition. The (i, t) entry of the forward table, αt(i), is defined by the recursive calculation

    α1(i) ← bi(O1) πi

    αt(i) ← bi(Ot) ∑_{j=1}^{N} aji αt−1(j),   t = 2 . . . T.

Theorem (Correctness of the forward algorithm).

    αt(i) = Pr(O1...t, qt = Si),   t = 1 . . . T.

Proof. By induction on t. Base case: t = 1

α1(i) = bi(O1) πi   [Defn. α1(i)]
      = Pr(O1|q1 = Si) Pr(q1 = Si|q0 = Begin)   [Defn. b, π]
      = Pr(O1|q1 = Si) Pr(q1 = Si)   [Pr(q0 = Begin) = 1]
      = Pr(O1, q1 = Si)   [Chain Rule]

Induction hypothesis. Let t be any integer between 2 and T and suppose we have proven that

αk(i) = Pr(O1...k, qk = Si)

for 1 ≤ k ≤ t− 1. Then we can show that αt(i) = Pr(O1...t, qt = Si) as follows.


αt(i) = bi(Ot) ∑_{j=1}^{N} aji αt−1(j)   [Defn. α]
      = Pr(Ot|qt = Si) ∑_{j=1}^{N} Pr(qt = Si|qt−1 = Sj) αt−1(j)   [Defn. b, a]
      = Pr(Ot|qt = Si) ∑_{j=1}^{N} Pr(qt = Si|qt−1 = Sj) Pr(O1...t−1, qt−1 = Sj)   [Induction hyp.]
      =    [HMM indep.]
      =    [C.R.]
      = Pr(Ot|qt = Si) Pr(qt = Si, O1...t−1)   [Summing out qt−1]
      =    [HMM indep.]
      = Pr(qt = Si, O1...t)   [C.R.]

Remark. The sum of the final alphas is the probability of the observations.

    ∑_{i=1}^{N} αT(i) = ∑_{i=1}^{N} Pr(O1...T, qT = Si) = Pr(O1...T)
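
The forward recursion is the Viterbi recursion with the max replaced by a sum. A sketch in Python, using the same assumed coin HMM as in the Viterbi example; summing the final column gives Pr(O1...T).

# Forward algorithm: alpha_1(i) = b_i(O_1) pi_i;
# alpha_t(i) = b_i(O_t) * sum_j a_ji alpha_{t-1}(j);  Pr(O) = sum_i alpha_T(i).
states = ["Fair", "TwoHeaded"]
pi = {"Fair": 0.5, "TwoHeaded": 0.5}                     # assumed
a = {"Fair":      {"Fair": 0.75, "TwoHeaded": 0.25},
     "TwoHeaded": {"Fair": 0.25, "TwoHeaded": 0.75}}
b = {"Fair":      {"H": 0.5, "T": 0.5},
     "TwoHeaded": {"H": 1.0, "T": 0.0}}

def forward(obs):
    alpha = [{i: pi[i] * b[i][obs[0]] for i in states}]
    for t in range(1, len(obs)):
        alpha.append({i: b[i][obs[t]] * sum(a[j][i] * alpha[t - 1][j] for j in states)
                      for i in states})
    return alpha

alpha = forward("THH")
print(sum(alpha[-1].values()))   # Pr(O_1...T), summed over all state sequences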

Exercise 3.1. Complete the proof of the correctness of the forward algorithm by filling in the missing lines in the course notes.

4 Posterior Decoding

• Viterbi finds the state sequence q∗1...T that maximizes Pr(q1...T |O1...T )

• For a given t, q∗t may not maximize Pr(qt|O1...T ).

• We may care only about some parts of the sequence.

– E.g. whether a particular genomic region is transcribed.

– The rest of the genome is just evidence

• Posterior decoding is choosing the most likely state at each point in time, argmax_i Pr(qt = Si|O1...T), regardless of the states chosen for other times.

• To carry out posterior decoding, we calculate Pr(qt = Si|O1...T) for each state i and time t using the forward-backward algorithm.


Definition. The (i, t) entry of the backward table, βt(i), is defined by the recursive calculation

    βT(i) ← 1

    βt(i) ← ∑_{j=1}^{N} aij bj(Ot+1) βt+1(j),   for t = 1 . . . T − 1.

Theorem (Correctness of the backward algorithm).

    βt(i) = Pr(Ot+1...T|qt = Si),   for t = 1 . . . T − 1.

Proof. By induction on t. Base case: t = T − 1.

βT−1(i) = ∑_{j=1}^{N} aij bj(OT) βT(j)   [Defn. of βt]
        = ∑_{j=1}^{N} Pr(qT = Sj|qT−1 = Si) Pr(OT|qT = Sj)   [Defn. of a, b, βT]
        = ∑_{j=1}^{N} Pr(qT = Sj|qT−1 = Si) Pr(OT|qT = Sj, qT−1 = Si)   [HMM Indep.]
        = ∑_{j=1}^{N} Pr(OT, qT = Sj|qT−1 = Si)   [C.R.]
        = Pr(OT|qT−1 = Si)   [Summing out qT]

Induction hypothesis. Let t be any integer between 1 and T − 1 and suppose we have proven that

βk(i) = Pr(Ok+1...T |qk = Si),

for t + 1 ≤ k ≤ T − 1. Then we can show that βt(i) = Pr(Ot+1...T|qt = Si) as follows.

[See exercises.]

Theorem. The posterior probability of state Si at time t is:

    Pr(qt = Si|O1...T) = βt(i) αt(i) / ∑_k βt(k) αt(k)   (1)


Proof.

Pr(qt = Si|O1...T) = Pr(qt = Si, O1...T) / Pr(O1...T)   [Defn. Cond. Prob.]
                   = Pr(qt = Si, O1...T) / ∑_k Pr(qt = Sk, O1...T)   [Summing out]

Pr(qt = Si, O1...T) = Pr(Ot+1...T|qt = Si, O1...t) · Pr(qt = Si, O1...t)   [Chain Rule]
                    = Pr(Ot+1...T|qt = Si) · Pr(qt = Si, O1...t)   [HMM Indep.]
                    = βt(i) αt(i), so   [Defn. α, β]

Pr(qt = Si|O1...T) = βt(i) αt(i) / ∑_k βt(k) αt(k).

• Algorithm: precompute α and β matrices.

• Compute

    argmax_{i=1...N} Pr(qt = Si|O1...T) = argmax_{i=1...N} αt(i) βt(i).

  Note that the denominator ∑_k βt(k) αt(k) is a constant that does not depend on i, so it does not affect which i gives the maximum value. Therefore, it can be dropped from within the argmax.

• Time = O(RT), where R = the number of non-zero transitions.
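
Posterior decoding then needs only the α and β tables. A self-contained sketch using the same assumed coin HMM (forward() is repeated so the snippet runs on its own):

# Backward table and posterior decoding for the assumed coin HMM.
states = ["Fair", "TwoHeaded"]
pi = {"Fair": 0.5, "TwoHeaded": 0.5}                     # assumed
a = {"Fair":      {"Fair": 0.75, "TwoHeaded": 0.25},
     "TwoHeaded": {"Fair": 0.25, "TwoHeaded": 0.75}}
b = {"Fair":      {"H": 0.5, "T": 0.5},
     "TwoHeaded": {"H": 1.0, "T": 0.0}}

def forward(obs):
    alpha = [{i: pi[i] * b[i][obs[0]] for i in states}]
    for t in range(1, len(obs)):
        alpha.append({i: b[i][obs[t]] * sum(a[j][i] * alpha[t - 1][j] for j in states)
                      for i in states})
    return alpha

def backward(obs):
    T = len(obs)
    beta = [{i: 1.0 for i in states} for _ in range(T)]   # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = {i: sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in states)
                   for i in states}
    return beta

def posterior(obs):
    alpha, beta = forward(obs), backward(obs)
    post = []
    for t in range(len(obs)):
        z = sum(alpha[t][k] * beta[t][k] for k in states)     # = Pr(O_1...T) at every t
        post.append({i: alpha[t][i] * beta[t][i] / z for i in states})
    return post

for t, dist in enumerate(posterior("THH"), start=1):
    print(t, dist, max(dist, key=dist.get))   # most likely state at each time t

The state printed for each t is the most likely state at that time; the resulting sequence of most likely states need not be the same as the sequence Viterbi returns (see Exercise 4.2).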

Exercise 4.1. Derive the backward probabilities (β) from scratch. This is very similar to the derivation of the forward probabilities that you did in the previous assignment.

Exercise 4.2. Manually compute the forward (alpha), backward (beta), and posterior probabilities of each state of the model in Figure 1 for observation sequences “TH”, “THH”, and “THHH”. Turn in all three tables for all three input sequences. Also, note the most likely state for each coin-flip in each sequence.

1. Does seeing more coin-flips change the alphas calculated for the earlier coin-flips? How about the betas?

2. Does seeing more coin-flips change the most likely states for earlier coin-flips?

3. Does posterior decoding give the same state sequence as Viterbi for all three cases? If not, please try to explain any differences between the two. Why does it make sense that the most likely state sequence and the sequence of most likely states would be different in these cases?


5 HMM Parameter Estimation from Labeled Data

• Given an “example run” (O1...T , q1...T ) of the HMM

– E.g. DNA sequence with known exon-intron structures.

• Use maximum likelihood

    bi(Aj) = count_{t=1...T}(Ot = Aj, qt = Si) / count_{t=1...T}(qt = Si)

    aij = count_{t=0...T−1}(qt = Si, qt+1 = Sj) / count_{t=0...T−1}(qt = Si)

• Or use a Bayesian estimator (e.g., add pseudocounts to the counts).
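
The counting formulas above amount to a few lines with collections.Counter. The sketch below estimates a and b from a single labeled run over hypothetical states F and L; for brevity it ignores the Begin transition at t = 0 (the initial-state estimate) and adds no pseudocounts.

from collections import Counter

def estimate(obs, path):
    """Maximum-likelihood a and b from one labeled run (O_1...T, q_1...T)."""
    state_counts = Counter(path)                         # count(q_t = S_i), t = 1..T
    out_counts = Counter(path[:-1])                      # count(q_t = S_i), t = 1..T-1
    emit_counts = Counter(zip(path, obs))                # count(q_t = S_i, O_t = A_k)
    trans_counts = Counter(zip(path, path[1:]))          # count(q_t = S_i, q_{t+1} = S_j)
    b = {(i, o): n / state_counts[i] for (i, o), n in emit_counts.items()}
    a = {(i, j): n / out_counts[i] for (i, j), n in trans_counts.items()}
    return a, b

# A toy labeled run: observations with their (known) states.
a_hat, b_hat = estimate("THHTH", ["F", "F", "L", "L", "F"])
print(a_hat)   # a(F,F) = a(F,L) = a(L,L) = a(L,F) = 1/2
print(b_hat)   # b_F(T) = 1/3, b_F(H) = 2/3, b_L(H) = b_L(T) = 1/2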

6 HMM Parameter Estimation from Unlabeled Data

Use Expectation Maximization. State labels are hidden data.

• Start with arbitrary parameters

• Use them to compute posterior distribution on states at each time.

• Repeat until convergence:

– Expectation: Compute expected counts from posterior distributions.

– Maximization: Use Max. Likelihood to re-estimate parameters.

• Called Forward-Backward or Baum-Welch re-estimation.

• Likelihood of observations guaranteed to increase every cycle.

• Moves toward local (not necessarily global) max of likelihood.

• Define γt(i) = Pr(qt = Si|O1...T), the posterior probability of Si at time t.

• Define ξt(i, j) = Pr(qt = Si, qt+1 = Sj|O1...T), the posterior probability of the transition Si → Sj at time t.

  – Note γt(i) = ∑_{j=1}^{N} ξt(i, j).

  – ξt(i, j) = αt(i) aij bj(Ot+1) βt+1(j) / ∑_k αT(k).   [Proof is left as an exercise.]


• ∑_{t=0}^{T−1} ξt(i, j) is the expected number of Si → Sj transitions while generating O1...T.

• ∑_{t=0}^{T−1} γt(i) = ∑_{t=0}^{T−1} ∑_{j=1}^{N} ξt(i, j) is the expected number of times in Si.

• Re-estimation: same as for labeled data, but use expected counts:

    a′ij = E[# Si → Sj transitions] / E[# times in Si] = ∑_{t=0}^{T−1} ξt(i, j) / ∑_{t=0}^{T−1} γt(i)   (2)

    bi(Ak) = E[# times in Si when Ak is emitted] / E[# times in Si] = ∑_{t s.t. Ot=Ak} γt(i) / ∑_{t=1}^{T} γt(i)   (3)
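
One E step plus one M step of Baum-Welch, written directly from the definitions of γ and ξ and from equations (2) and (3). The two-state HMM and its starting parameters below are arbitrary (as the algorithm allows), and for brevity the sketch handles a single observation sequence and re-estimates only a and b, dropping the t = 0 Begin term from the sums (i.e., it does not re-estimate the initial probabilities).

# One Baum-Welch (EM) re-estimation step for a two-state HMM over {H, T}.
states = ["S1", "S2"]
pi = {"S1": 0.5, "S2": 0.5}                              # arbitrary starting values
a = {"S1": {"S1": 0.8, "S2": 0.2},
     "S2": {"S1": 0.2, "S2": 0.8}}
b = {"S1": {"H": 0.5, "T": 0.5},
     "S2": {"H": 0.9, "T": 0.1}}

def forward(obs):
    alpha = [{i: pi[i] * b[i][obs[0]] for i in states}]
    for t in range(1, len(obs)):
        alpha.append({i: b[i][obs[t]] * sum(a[j][i] * alpha[t - 1][j] for j in states)
                      for i in states})
    return alpha

def backward(obs):
    beta = [{i: 1.0 for i in states} for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = {i: sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in states)
                   for i in states}
    return beta

def baum_welch_step(obs):
    alpha, beta = forward(obs), backward(obs)
    p_obs = sum(alpha[-1][i] for i in states)            # Pr(O_1...T)
    # E step: posterior state (gamma) and transition (xi) probabilities.
    gamma = [{i: alpha[t][i] * beta[t][i] / p_obs for i in states}
             for t in range(len(obs))]
    xi = [{(i, j): alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / p_obs
           for i in states for j in states}
          for t in range(len(obs) - 1)]
    # M step: equations (2) and (3) with expected counts.
    new_a = {i: {j: sum(x[i, j] for x in xi) / sum(g[i] for g in gamma[:-1])
                 for j in states} for i in states}
    new_b = {i: {o: sum(g[i] for g, ot in zip(gamma, obs) if ot == o) /
                    sum(g[i] for g in gamma)
                 for o in "HT"} for i in states}
    return new_a, new_b

new_a, new_b = baum_welch_step("THHHHTHH")
print(new_a)
print(new_b)

Iterating this step (and, if desired, also re-estimating the initial probabilities) increases the likelihood of the observations on every cycle and converges to a local maximum.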

Exercise 6.1. Recall that, by definition

    ξt(i, j) = Pr(qt = Si, qt+1 = Sj|O1...T) = Pr(qt = Si, qt+1 = Sj, O1...T) / Pr(O1...T).

Provide a derivation showing

    ξt(i, j) = αt(i) aij bj(Ot+1) βt+1(j) / ∑_k αT(k).
