The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1
Computational Statistics with Application to Bioinformatics
Prof. William H. Press
Spring Term, 2008
The University of Texas at Austin
Unit 16: Hidden Markov Models
Unit 16: Hidden Markov Models (Summary)
• Markov models
  – discrete states, discrete time steps
  – transition between states at each time step with specified probabilities
  – how to find periodicity or reducibility
    • take a high power of the matrix by the successive-squares method
  – "irreducibility and aperiodicity imply ergodicity"
• Hidden Markov Models
  – we don't observe the states, but instead "symbols" emitted from them
    • each state has its own probability distribution over symbols
  – the forward-backward algorithm estimates the states from the data
    • the forward (backward) pass incorporates past (future) data at each time
  – example: gene finding in the galaxy Zyzyx
    • (fewer irrelevant complications than in our galaxy)
    • use NR3's HMM class
• Baum-Welch re-estimation
  – uses the data to improve the estimate of the transition and symbol probabilities
  – it's an EM method!
  – pure "unsupervised learning"
  – we try it on the Zyzyx data
    • can find the right answers starting from amazingly crude guesses!
• Hidden Semi-Markov Models [aka Generalized HMMs (GHMMs)]
  – give each state a specified residence-time distribution
  – can be implemented by expanding the number of states in an HMM
  – we try it on Zyzyx and get somewhat better gene finding
Markov Models
• Directed graph
  – may (usually does) have loops
• Discrete time steps
• At each time step, the state advances
  – with probabilities labeled on the outgoing edges
  – self-loops also OK
• "Markov" because there is no memory
  – it knows only what state it is in now
• Markov models are especially important because there exists a fast algorithm for "parsing" their state from observed data
  – so-called "Hidden Markov Models" (HMMs)
The transition matrix entry A_ij is the probability of going from state i to state j. If s is the population vector (the probability of being in each state), then one time step advances it by

s ← A^T s

(the transpose appears because of the way "from/to" are defined). A (right) stochastic matrix has non-negative entries with rows summing to 1.
Note the two different ways of drawing the same Markov model:
– left: a directed graph with loops
– right: adding the time dimension gives a directed graph with no loops (can't go backward in time)
Every Markov model has at least one stationary (equilibrium) state.

A^T s = s ⇐⇒ (A^T − 1) s = 0

Is there a nullspace? Yes, because the columns of A^T − 1 all sum to zero, hence are linearly dependent. But how do we know that this s has nonnegative components?

Define a vector s+ with components s+_i ≡ |s_i|. Then, for all j,

∑_i A_ij s+_i = ∑_i |A_ij| |s_i| ≥ |∑_i A_ij s_i| = |s_j| = s+_j

Summing over j,

∑_j ∑_i A_ij s+_i = ∑_i (∑_j A_ij) s+_i = ∑_i s+_i = ∑_j s+_j

so the ≥ must be an = in the sum, and hence also for each term, implying

∑_i A_ij s+_i = s+_j

Therefore s+ is the desired (nonnegative) stationary state, qed. (This is actually a simple special case of the Perron-Frobenius theorem.)
Does every Markov model eventually reach a unique equilibrium from all starting distributions (ergodicity)?
Not necessarily. Two things can go wrong:
1. More than one equilibrium distribution. (Fails test of "irreducibility".)
2. Limit cycles. (Fails test of "aperiodicity".)
The theorem is: Irreducibility and aperiodicity imply ergodicity.
Easy to diagnose a particular Markov model numerically by taking a high power of its transition matrix (by successive squaring).

Say, take A^T to the power 2^32: only 32 successive squarings, i.e. 32 matrix multiplies of ~M^3 operations each.

If the columns of the result are all identical, then it's ergodic. (Done.)

Otherwise:
1. Zero rows are states that become unpopulated. Ignore their corresponding columns.
2. See if the remaining columns are self-reproducing under A^T (eigenvectors of unit eigenvalue). When yes, they are equilibria.
3. When no, they are part of limit cycles.
So, this example has an unpopulated stateand no equilibria.
In a Hidden Markov Model, we don’t get to observe the states, but instead we see a “symbol” that each state probabilistically emits when it is entered
sequence of (hidden) states
sequence of (observed) symbols
What can we say about the sequence of states, given a sequence of observations?
Let's try to estimate the probability of being in a certain state at a certain time.

Define α_t(i) as the probability of state i at time t given (only) the data up to and including t: the "forward estimate". Written out directly, it is a huge sum over all possible paths, each term being the likelihood (or Bayes probability with uniform prior) of that exact path and the exact observed data.

As written, this is computationally infeasible. But it satisfies an easy recurrence!

The Forward-Backward algorithm.
Define β_t(i) as the probability of state i at time t given (only) the data to the future of t: the "backward estimate". Initialize with a uniform prior, since there is no data to the future of N−1. Now, there is a backward recurrence!

And the grand estimate using all the data is ("forward-backward algorithm")

P_t(i) ∝ α_t(i) β_t(i)

The normalizing sum L = ∑_i α_t(i) β_t(i) is the likelihood (or Bayes probability) of the data. Actually, it's independent of t! You could use its numerical value to compare different models.

Worried about multiplying the α's and β's as independent probabilities? Markov guarantees that they are conditionally independent given i, and P_t(i) ∝ P_t(data | i).
Let’s work a biologically motivated example
In the galaxy Zyzyx, the Qiqiqi lifeform has a genome consisting of a linear sequence of amino acids, each chosen from 26 chemical possibilities, denoted A-Z.
Genes alternate with intergenic regions.
In intergenic regions, all of A-Z are equiprobable. In genes, the vowels AEIOU are more frequent.

Genes always end with Z.

The length distribution of genes and intergenic regions is known (has been measured).

Can we find the genes?

On Earth, it's 20 amino acids, with the additional complication of a genetic code mapping three-base codons (from the four-letter alphabet ACGT) into one a.a. Our example thus simplifies by having no ambiguity of reading frame, and also no ambiguity of strand.
qnkekxkdlscovjfehvesmdeelnzlzjeknvjgetyuhgvxlvjnvqlmcermojkrtuczgrbmpwrjtynonxveblrjuqiydehpzujdogaensduoermiadaaustihpialkxicilgktottxxwawjvenowzsuacnppiharwpqviuammkpzwwjboofvmrjwrtmzmcxdkclvkyvkizmckmpvwfoorbvvrnvuzfwszqithlkubjruoyyxgwvfgxzlzbkuwmkmzgmnsyb
(pvowel = 0.45)

[In the slide, the same sequence is repeated with color-coding that marks which regions are genes and which are intergenes.]
The model: state 0 is intergene (I), state 1 is gene (G), state 2 is the gene-terminating Z. The code looks like this (NR3 C++ fragment):

Int i,j,mstat=3;
MatDoub b(mstat,26,0.), a(mstat,mstat,0.);
a[0][0] = 1.-1./250.;     // intergene self-loop: mean length 250
a[0][1] = 1.-a[0][0];
a[1][1] = 1.-1./50.;      // gene self-loop: mean length 50
a[1][2] = 1.-a[1][1];
a[2][0] = 1.;             // the Z state always returns to intergene
for (i=0;i<26;i++) b[0][i] = 1./26.;      // intergene: all symbols equiprobable
for (i=0;i<5;i++) b[1][i] = pvowel/5.;    // gene: vowels (indexed first) more frequent
for (i=5;i<26;i++) b[1][i] = (1.-pvowel)/21.;
b[2][25] = 1.;            // the Z state always emits Z
HMM hmm(a,b,seq);
hmm.forwardbackward();
Embed in a mex-file and return hmm.pstate to Matlab. Matlab has its own HMM functions, but I haven't mastered them. (Could someone do this and show this example?)

The forward-backward results on the previous data are:

[pstate pcorrect loglike merit] = hmmmex(0.45,1);
plot(1:260,pstate(1:260,2),'b')
hold on
plot(1:260,pstate(1:260,3),'r')

[Figure: P(state G) in blue and P(state Z) in red. Annotations mark the actual start of the gene, the right "z", another "z", and a stretch with enough chance excess vowels to make the call not completely sure!]

All the C++ code for this example is on the course website as "hmmmex.cpp" – but beware, it's not cleaned up or made pretty!
Bayesian re-estimation of the transition and output matrices (Baum-Welch re-estimation)

Given the data, we can re-estimate A as the expected number of i → j transitions divided by the expected number of i states. Estimating these as averages over the data (with y_t the observed symbol at time t),

Â_ij = [ ∑_t α_t(i) A_ij b_j(y_{t+1}) β_{t+1}(j) ] / [ ∑_t α_t(i) β_t(i) ]

(note that the likelihood L cancels between numerator and denominator). The backward recurrence says that ∑_j A_ij b_j(y_{t+1}) β_{t+1}(j) and β_t(i) are equal, which is why the denominator is indeed the expected number of i states.
Similarly, re-estimate b as the expected number of i states emitting symbol k divided by the expected number of i states:

b̂_ik = [ ∑_{t : y_t = k} α_t(i) β_t(i) ] / [ ∑_t α_t(i) β_t(i) ]
Hatted A and b are improved estimates of the hidden parameters. With them, you can go back and re-estimate α and β. And so forth.

Does this remind you of the EM method? It should! It is another special case. One can prove, by the same kind of convexity argument as previously, that Baum-Welch re-estimation always increases the overall likelihood L, iteratively to an (as usual, possibly only local) maximum.
Notice that re-estimation doesn’t require any additional information, or any training data. It is pure “unsupervised learning”.
Before (previous result) vs. after re-estimation (data size N = 10^5).

[Figure: how the log-likelihood increases with iteration number.]

Parsing (forward-backward) can work on even small fragments, but re-estimation takes a lot of data.
On many problems, re-estimation can hill-climb to the “right” answer from amazingly crude initial guesses
a[0][0] = 1.-1./100.;     // genes and intergenes alternate, each about 100 long
a[0][1] = 1.-a[0][0];
a[1][1] = 1.-1./100.;
a[1][2] = 1.-a[1][1];
a[2][0] = 1.;             // there is a one-symbol gene-end marker
for (i=0;i<26;i++) {
    b[0][i] = 1./26.;     // but we don't know anything about which symbols are
    b[1][i] = 1./26.;     // preferred in genes, end-genes, or intergenes
    b[2][i] = 1./26.;
}

[Figure: log-likelihood increases monotonically; accuracy also shown (in this example we know the "right" answers!). A period of stagnation is not unusual. The 1st step's accuracy is an artifact: it is calling all states as 0, which happens to be true ~90% of the time.]
Final estimates of the transition and symbol probability matrices:

A (rows and columns ordered: state I, state G, state Z):
0.99397 0.00603 0.00000
0.00000 0.96991 0.03009
1.00000 0.00000 0.00000

1/0.00603 = 166 and 1/0.03009 = 33.2 (why these values?)

b (columns: state I, state G, state Z):
A 0.03746 0.09039 0.00245
E 0.03935 0.09196 0.00218
I 0.03684 0.08903 0.00266
O 0.03875 0.08992 0.00109
U 0.03740 0.09214 0.00144
B 0.03772 0.02590 0.00096
C 0.03891 0.02716 0.00686
D 0.03945 0.02792 0.00140
F 0.03862 0.02515 0.00037
G 0.03888 0.02505 0.00057
H 0.03884 0.02874 0.00116
J 0.03652 0.02926 0.00188
K 0.03838 0.02777 0.00069
L 0.03836 0.02673 0.00113
M 0.03823 0.02822 0.00035
N 0.03885 0.02639 0.00005
P 0.03888 0.02677 0.00493
Q 0.03880 0.02743 0.00572
R 0.04055 0.02844 0.00115
S 0.03933 0.02862 0.00769
T 0.03923 0.02393 0.00053
V 0.03924 0.02772 0.00057
W 0.03862 0.02826 0.00308
X 0.03820 0.02945 0.00039
Y 0.03871 0.02759 0.00045
Z 0.03586 0.00006 0.95026

Yes, it discovered all the vowels. It discovered Z, but didn't quite figure out that state Z always emits Z.
An obvious flaw in the model: Self-loops in Markov models must always give (discrete approximation of) exponentially distributed residence times
But Qiqiqi genes and intergenes are roughly gamma-law distributed in length. (In fact, they're exactly gamma-law, because that's how I constructed the genome – not from a Markov model!)

mug = 50.; sigg = 10.; mui = 250.;
ag = SQR(mug/sigg);
bg = mug/SQR(sigg);
Gammadev rani(2.,2./mui,ran.int32());
Gammadev rang(ag,bg,ran.int32());

Can we make the results more accurate by somehow incorporating length info?

(A self-loop exit is like an event in a Poisson process, and the waiting time to an event in a Poisson process is exponentially distributed.)
Generalized Hidden Markov Model (GHMM), also called Hidden Semi-Markov Model (HSMM)
the idea is to impose (or learn by re-estimation) an arbitrary probability distribution for the residency time τ in each state
can be thought of as an ordinary HMM where every state gets expanded into a “timer” cluster
output symbol probabilities identical for all states in a timer(equal those of the state before it was expanded)
Two kinds of timer clusters:
– arbitrary distribution with τ ≤ n: τ ∼ [p1, (1−p1)p2, (1−p1)(1−p2)p3, . . .]
– Gamma-law distribution: τ ∼ Gamma(α, p), with ⟨τ⟩ = α/p
So, our intergene-gene-Z example becomes:
Int i,j,n01=n0+n1,mstat=n01+1,niter=40;Doub len0=250.,len1=50.;MatDoub b(mstat,26,0.), a(mstat,mstat,0.);for (i=0;i<n0;i++) {
a[i][i] = 1.-n0/len0;a[i][i+1] = 1.-a[i][i];
}for (i=n0;i<n01;i++) {
a[i][i] = 1.-n1/len1;a[i][i+1] = 1.-a[i][i];
}a[n01][0] = 1.;for (j=0;j<n01;j++) for (i=0;i<26;i++) b[j][i] = 1./26.;b[n01][25] = 1.;
input values for n0 and n1 (we’ll try various choices)initialize the model like this:
initial guess for lengths (need not be this perfect)
Gamma-law timers
tell it about Z, but not about vowels
We can use NR3's HMM class for GHMMs by the kludge of averaging the output probabilities after each Baum-Welch re-estimation:

HMM hmm(a,b,seq);
hmm.forwardbackward();
for (i=1;i<niter;i++) {
    hmm.baumwelch();
    collapse_to_ghmm(hmm,n0,n1);
    hmm.forwardbackward();
}

with

void collapse_to_ghmm(HMM &hmm, Int n0, Int n1) {
    Int i,j,n01=n0+n1;
    Doub sum;
    for (j=0;j<26;j++) {
        for (sum=0.,i=0;i<n0;i++) sum += hmm.b[i][j];
        sum /= n0;
        for (i=0;i<n0;i++) hmm.b[i][j] = sum;
        for (sum=0.,i=n0;i<n01;i++) sum += hmm.b[i][j];
        sum /= n1;
        for (i=n0;i<n01;i++) hmm.b[i][j] = sum;
    }
}

See hmmmex.cpp. Actually this is not quite right, because it should be a weighted average by the number of times each state is occupied. The right way to do this would be to overload hmm.baumwelch with a slightly modified version that does the average properly. In this example the effect would be negligible.
So how well do we do? Accuracy with respect to genes is shown as the confusion table

TP FN
FP TN

n0 = n1 = 1 (previous HMM):
accuracy = 0.9629
table = 0.1466 0.0227
        0.0144 0.8163

n0 = 2, n1 = 5:
accuracy = 0.9690
table = 0.1498 0.0195
        0.0115 0.8192

n0 = 3, n1 = 8:
accuracy = 0.9726
table = 0.1518 0.0175
        0.0099 0.8208

Typically, for this example, it's starting a gene ~5 too late or ~3 too early.

sensitivity = TP/(TP+FN)
specificity = TN/(FP+TN)
For whole genes (length ~50), the sensitivity and specificity are basically 1.0000, because, with pvowel=0.45, the gene is highly statistically significant. What the HMM or GHMM does well is to call the boundaries as exactly as possible.
Obviously there’s a theoretical bound on the achievable accuracy, given that the exact sequence of an FP or FN might also occur as a TP or TN. Can you calculate or estimate the bound?