The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1
Computational Statistics with Application to Bioinformatics
Prof. William H. Press
Spring Term, 2008
The University of Texas at Austin
Unit 16: Hidden Markov Models
Unit 16: Hidden Markov Models (Summary)
• Markov models
  – discrete states, discrete time steps
  – transition between states at each time step with specified probabilities
  – how to find periodicity or reducibility
    • take a high power of the matrix by the successive-squares method
  – "irreducibility and aperiodicity imply ergodicity"
• Hidden Markov Models
  – we don't observe the states, but instead "symbols" emitted from them
    • each state has its own probability distribution over symbols
  – the forward-backward algorithm estimates the states from the data
    • the forward (backward) pass incorporates past (future) data at each time
  – example: gene finding in the galaxy Zyzyx
    • (fewer irrelevant complications than in our galaxy)
    • use NR3's HMM class
• Baum-Welch re-estimation
  – uses the data to improve the estimate of the transition and symbol probabilities
  – it's an EM method!
  – pure "unsupervised learning"
  – we try it on the Zyzyx data
    • can find the right answers starting from amazingly crude guesses!
• Hidden Semi-Markov Models [aka Generalized HMMs (GHMMs)]
  – give each state a specified residence-time distribution
  – can be implemented by expanding the number of states in an HMM
  – we try it on Zyzyx and get somewhat better gene finding
Markov Models
• Directed graph
  – may (usually does) have loops
• Discrete time steps
• At each time step, the state advances
  – with probabilities labeled on the outgoing edges
  – self-loops also OK
• "Markov" because there is no memory
  – it knows only what state it is in now
• Markov models are especially important because there exists a fast algorithm for "parsing" their state from observed data
  – so-called "Hidden Markov Models" (HMMs)
The transition matrix entry A_ij is the probability of going from state i to state j. If s is the population vector (the probability of being in each state), then one time step advances it by

s ← A^T s

(the transpose appears because of the way "from/to" are defined). A (right) stochastic matrix has non-negative entries with rows summing to 1.
Note the two different ways of drawing the same Markov model:
– left: a directed graph with loops
– right: adding the time dimension gives a directed graph with no loops (can't go backward in time)
Every Markov model has at least one stationary (equilibrium) state.

A^T s = s ⇐⇒ (A^T − 1) s = 0

Is there a nullspace? Yes, because the columns of A^T − 1 all sum to zero, hence are linearly dependent. But how do we know that this s has nonnegative components?

Define a vector s+ with components s+_i ≡ |s_i|. Then, for all j,

∑_i A_ij s+_i = ∑_i |A_ij| |s_i| ≥ |∑_i A_ij s_i| = |s_j| = s+_j

Summing over j,

∑_j ∑_i A_ij s+_i = ∑_i (∑_j A_ij) s+_i = ∑_i s+_i = ∑_j s+_j

so the ≥ must be an = in the sum, and hence also for each term, implying

∑_i A_ij s+_i = s+_j

Therefore s+ is the desired (nonnegative) stationary state, qed. (This is actually a simple special case of the Perron-Frobenius theorem.)
Does every Markov model eventually reach a unique equilibrium from all starting distributions (ergodicity)?
Not necessarily. Two things can go wrong:
1. More than one equilibrium distribution. (Fails test of "irreducibility".)
2. Limit cycles. (Fails test of "aperiodicity".)
The theorem is: Irreducibility and aperiodicity imply ergodicity.
Easy to diagnose a particular Markov model numerically by taking a high power of its transition matrix (by successive squaring).

Say, take A^T to the power 2^32: only 32 successive squarings, i.e. 32 matrix multiplies of ~M^3 operations each.

If the columns of the result are all identical, then it's ergodic. (Done.)

Otherwise:
1. Zero rows are states that become unpopulated. Ignore their corresponding columns.
2. See if the remaining columns are self-reproducing under A^T (eigenvectors of unit eigenvalue). When yes, they are equilibria.
3. When no, they are part of limit cycles.
So, this example has an unpopulated stateand no equilibria.
In a Hidden Markov Model, we don’t get to observe the states, but instead we see a “symbol” that each state probabilistically emits when it is entered
sequence of (hidden) states
sequence of (observed) symbols
What can we say about the sequence of states, given a sequence of observations?
Let's try to estimate the probability of being in a certain state at a certain time.

Define α_t(i) as the probability of state i at time t given (only) the data up to and including t: the "forward estimate". Written out directly, it is a huge sum over all possible paths, each term being the likelihood (or Bayes probability with uniform prior) of that exact path and the exact observed data.

As written, this is computationally infeasible. But it satisfies an easy recurrence!

The Forward-Backward algorithm.
Define β_t(i) as the probability of state i at time t given (only) the data to the future of t: the "backward estimate". Initialize with a uniform prior, since there is no data to the future of N−1. Now, there is a backward recurrence!

And the grand estimate using all the data is ("forward-backward algorithm")

P_t(i) ∝ α_t(i) β_t(i)

The normalizing sum L = ∑_i α_t(i) β_t(i) is the likelihood (or Bayes probability) of the data. Actually, it's independent of t! You could use its numerical value to compare different models.

Worried about multiplying the α's and β's as independent probabilities? Markov guarantees that they are conditionally independent given i, and P_t(i) ∝ P_t(data | i).
Let’s work a biologically motivated example
In the galaxy Zyzyx, the Qiqiqi lifeform has a genome consisting of a linear sequence of amino acids, each chosen from 26 chemical possibilities, denoted A-Z.
Genes alternate with intergenic regions.
In intergenic regions, all of A-Z are equiprobable. In genes, the vowels AEIOU are more frequent.

Genes always end with Z.

The length distribution of genes and intergenic regions is known (has been measured).

Can we find the genes?

On Earth, it's 20 amino acids, with the additional complication of a genetic code mapping three-base codons (from the four-letter alphabet ACGT) into one a.a. Our example thus simplifies by having no ambiguity of reading frame, and also no ambiguity of strand.
qnkekxkdlscovjfehvesmdeelnzlzjeknvjgetyuhgvxlvjnvqlmcermojkrtuczgrbmpwrjtynonxveblrjuqiydehpzujdogaensduoermiadaaustihpialkxicilgktottxxwawjvenowzsuacnppiharwpqviuammkpzwwjboofvmrjwrtmzmcxdkclvkyvkizmckmpvwfoorbvvrnvuzfwszqithlkubjruoyyxgwvfgxzlzbkuwmkmzgmnsyb
(pvowel = 0.45)

[In the slide, the same sequence is repeated with color-coding that marks which regions are genes and which are intergenes.]
The model: state 0 is intergene (I), state 1 is gene (G), state 2 is the gene-terminating Z. The code looks like this (NR3 C++ fragment):

Int i,j,mstat=3;
MatDoub b(mstat,26,0.), a(mstat,mstat,0.);
a[0][0] = 1.-1./250.;     // intergene self-loop: mean length 250
a[0][1] = 1.-a[0][0];
a[1][1] = 1.-1./50.;      // gene self-loop: mean length 50
a[1][2] = 1.-a[1][1];
a[2][0] = 1.;             // the Z state always returns to intergene
for (i=0;i<26;i++) b[0][i] = 1./26.;      // intergene: all symbols equiprobable
for (i=0;i<5;i++) b[1][i] = pvowel/5.;    // gene: vowels (indexed first) more frequent
for (i=5;i<26;i++) b[1][i] = (1.-pvowel)/21.;
b[2][25] = 1.;            // the Z state always emits Z
HMM hmm(a,b,seq);
hmm.forwardbackward();
Embed in a mex-file and return hmm.pstate to Matlab. Matlab has its own HMM functions, but I haven't mastered them. (Could someone do this and show this example?)

The forward-backward results on the previous data are:

[pstate pcorrect loglike merit] = hmmmex(0.45,1);
plot(1:260,pstate(1:260,2),'b')
hold on
plot(1:260,pstate(1:260,3),'r')

[Figure: P(state G) in blue and P(state Z) in red. Annotations mark the actual start of the gene, the right "z", another "z", and a stretch with enough chance excess vowels to make the call not completely sure!]

All the C++ code for this example is on the course website as "hmmmex.cpp" – but beware, it's not cleaned up or made pretty!
Bayesian re-estimation of the transition and output matrices (Baum-Welch re-estimation)

Given the data, we can re-estimate A as the expected number of i → j transitions divided by the expected number of i states. Estimating these as averages over the data (with y_t the observed symbol at time t),

Â_ij = [ ∑_t α_t(i) A_ij b_j(y_{t+1}) β_{t+1}(j) ] / [ ∑_t α_t(i) β_t(i) ]

(note that the likelihood L cancels between numerator and denominator). The backward recurrence says that ∑_j A_ij b_j(y_{t+1}) β_{t+1}(j) and β_t(i) are equal, which is why the denominator is indeed the expected number of i states.
Similarly, re-estimate b as the expected number of i states emitting symbol k divided by the expected number of i states:

b̂_ik = [ ∑_{t : y_t = k} α_t(i) β_t(i) ] / [ ∑_t α_t(i) β_t(i) ]
Hatted A and b are improved estimates of the hidden parameters. With them, you can go back and re-estimate α and β. And so forth.

Does this remind you of the EM method? It should! It is another special case. One can prove, by the same kind of convexity argument as previously, that Baum-Welch re-estimation always increases the overall likelihood L, iteratively to an (as usual, possibly only local) maximum.
Notice that re-estimation doesn’t require any additional information, or any training data. It is pure “unsupervised learning”.
Before (previous result) vs. after re-estimation (data size N = 10^5).

[Figure: how the log-likelihood increases with iteration number.]

Parsing (forward-backward) can work on even small fragments, but re-estimation takes a lot of data.
On many problems, re-estimation can hill-climb to the “right” answer from amazingly crude initial guesses
a[0][0] = 1.-1./100.;     // genes and intergenes alternate, each about 100 long
a[0][1] = 1.-a[0][0];
a[1][1] = 1.-1./100.;
a[1][2] = 1.-a[1][1];
a[2][0] = 1.;             // there is a one-symbol gene-end marker
for (i=0;i<26;i++) {
    b[0][i] = 1./26.;     // but we don't know anything about which symbols are
    b[1][i] = 1./26.;     // preferred in genes, end-genes, or intergenes
    b[2][i] = 1./26.;
}

[Figure: log-likelihood increases monotonically; accuracy also shown (in this example we know the "right" answers!). A period of stagnation is not unusual. The 1st step's accuracy is an artifact: it is calling all states as 0, which happens to be true ~90% of the time.]
Final estimates of the transition and symbol probability matrices:

A (rows and columns ordered: state I, state G, state Z):
0.99397 0.00603 0.00000
0.00000 0.96991 0.03009
1.00000 0.00000 0.00000

1/0.00603 = 166 and 1/0.03009 = 33.2 (why these values?)

b (columns: state I, state G, state Z):
A 0.03746 0.09039 0.00245
E 0.03935 0.09196 0.00218
I 0.03684 0.08903 0.00266
O 0.03875 0.08992 0.00109
U 0.03740 0.09214 0.00144
B 0.03772 0.02590 0.00096
C 0.03891 0.02716 0.00686
D 0.03945 0.02792 0.00140
F 0.03862 0.02515 0.00037
G 0.03888 0.02505 0.00057
H 0.03884 0.02874 0.00116
J 0.03652 0.02926 0.00188
K 0.03838 0.02777 0.00069
L 0.03836 0.02673 0.00113
M 0.03823 0.02822 0.00035
N 0.03885 0.02639 0.00005
P 0.03888 0.02677 0.00493
Q 0.03880 0.02743 0.00572
R 0.04055 0.02844 0.00115
S 0.03933 0.02862 0.00769
T 0.03923 0.02393 0.00053
V 0.03924 0.02772 0.00057
W 0.03862 0.02826 0.00308
X 0.03820 0.02945 0.00039
Y 0.03871 0.02759 0.00045
Z 0.03586 0.00006 0.95026

Yes, it discovered all the vowels. It discovered Z, but didn't quite figure out that state Z always emits Z.
An obvious flaw in the model: Self-loops in Markov models must always give (discrete approximation of) exponentially distributed residence times
But Qiqiqi genes and intergenes are roughly gamma-law distributed in length. (In fact, they're exactly gamma-law, because that's how I constructed the genome – not from a Markov model!)

mug = 50.; sigg = 10.; mui = 250.;
ag = SQR(mug/sigg);
bg = mug/SQR(sigg);
Gammadev rani(2.,2./mui,ran.int32());
Gammadev rang(ag,bg,ran.int32());

Can we make the results more accurate by somehow incorporating length info?

(A self-loop exit is like an event in a Poisson process, and the waiting time to an event in a Poisson process is exponentially distributed.)
Generalized Hidden Markov Model (GHMM), also called Hidden Semi-Markov Model (HSMM)
the idea is to impose (or learn by re-estimation) an arbitrary probability distribution for the residency time τ in each state
can be thought of as an ordinary HMM where every state gets expanded into a “timer” cluster
output symbol probabilities identical for all states in a timer(equal those of the state before it was expanded)
Two kinds of timer clusters:
– arbitrary distribution with τ ≤ n: τ ∼ [p1, (1−p1)p2, (1−p1)(1−p2)p3, . . .]
– Gamma-law distribution: τ ∼ Gamma(α, p), with ⟨τ⟩ = α/p
So, our intergene-gene-Z example becomes:
Int i,j,n01=n0+n1,mstat=n01+1,niter=40;Doub len0=250.,len1=50.;MatDoub b(mstat,26,0.), a(mstat,mstat,0.);for (i=0;i<n0;i++) {
a[i][i] = 1.-n0/len0;a[i][i+1] = 1.-a[i][i];
}for (i=n0;i<n01;i++) {
a[i][i] = 1.-n1/len1;a[i][i+1] = 1.-a[i][i];
}a[n01][0] = 1.;for (j=0;j<n01;j++) for (i=0;i<26;i++) b[j][i] = 1./26.;b[n01][25] = 1.;
input values for n0 and n1 (we’ll try various choices)initialize the model like this:
initial guess for lengths (need not be this perfect)
Gamma-law timers
tell it about Z, but not about vowels
We can use NR3's HMM class for GHMMs by the kludge of averaging the output probabilities after each Baum-Welch re-estimation:

HMM hmm(a,b,seq);
hmm.forwardbackward();
for (i=1;i<niter;i++) {
    hmm.baumwelch();
    collapse_to_ghmm(hmm,n0,n1);
    hmm.forwardbackward();
}

with

void collapse_to_ghmm(HMM &hmm, Int n0, Int n1) {
    Int i,j,n01=n0+n1;
    Doub sum;
    for (j=0;j<26;j++) {
        for (sum=0.,i=0;i<n0;i++) sum += hmm.b[i][j];
        sum /= n0;
        for (i=0;i<n0;i++) hmm.b[i][j] = sum;
        for (sum=0.,i=n0;i<n01;i++) sum += hmm.b[i][j];
        sum /= n1;
        for (i=n0;i<n01;i++) hmm.b[i][j] = sum;
    }
}

See hmmmex.cpp. Actually this is not quite right, because it should be a weighted average by the number of times each state is occupied. The right way to do this would be to overload hmm.baumwelch with a slightly modified version that does the average properly. In this example the effect would be negligible.
So how well do we do? Accuracy with respect to genes is shown as the confusion table

TP FN
FP TN

n0 = n1 = 1 (previous HMM):
accuracy = 0.9629
table = 0.1466 0.0227
        0.0144 0.8163

n0 = 2, n1 = 5:
accuracy = 0.9690
table = 0.1498 0.0195
        0.0115 0.8192

n0 = 3, n1 = 8:
accuracy = 0.9726
table = 0.1518 0.0175
        0.0099 0.8208

Typically, for this example, it's starting a gene ~5 too late or ~3 too early.

sensitivity = TP/(TP+FN)
specificity = TN/(FP+TN)
For whole genes (length ~50), the sensitivity and specificity are basically 1.0000, because, with pvowel=0.45, the gene is highly statistically significant. What the HMM or GHMM does well is to call the boundaries as exactly as possible.
Obviously there’s a theoretical bound on the achievable accuracy, given that the exact sequence of an FP or FN might also occur as a TP or TN. Can you calculate or estimate the bound?