Introduction to Hidden Markov
Models (HMMs)
But first, some probability and statistics
background
Important Topics

1. Random Variables and Probability
2. Probability Distributions
3. Parameter Estimation
4. Hypothesis Testing
5. Likelihood
6. Conditional Probability
7. Stochastic Processes
8. Inference for Stochastic Processes
Probability
The probability of a particular event occurring is
the frequency of that event over a very long series
of repetitions.
• P(tossing a head) = 0.50
• P(rolling a 6) = 0.167
• P(average age in a population sample is greater than 21) = 0.25
Random Variables
A random variable is a quantity that cannot be
measured or predicted with absolute accuracy.
p = fraction of nucleotides that are either G or C
X = age of an individual
Y = length of a gene
Probability Distributions
• The distribution of a random variable describes the
  possible values of the variable and the probabilities
  of each value.
• For discrete random variables, the distribution can be
  enumerated; for continuous ones we describe the
  distribution with a function.
Examples of Distributions

Binomial: X ~ Bin(3, 0.5)

x    P(X = x)
0    0.125
1    0.375
2    0.375
3    0.125

Normal: Z ~ N(µ, σ²)

P(a < Z < b) = ∫_a^b f(z; µ, σ²) dz

f(z; µ, σ²) = (1 / (σ√(2π))) e^{−(z − µ)² / (2σ²)}
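The binomial table above can be reproduced directly; a minimal sketch using only the Python standard library:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# X ~ Bin(3, 0.5): reproduces the table above
for x in range(4):
    print(x, binom_pmf(x, 3, 0.5))
```

For the normal distribution, `statistics.NormalDist(mu, sigma).cdf(b)` gives P(Z < b) without integrating the density by hand.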
Parameter Estimation
One of the primary goals of statistical inference is to
estimate unknown parameters.
For example, using a sample taken from the target
population, we might estimate the population mean
using several different statistics: the sample mean,
the sample median, or the sample mode. Different
statistics have different sampling properties.
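To see that different statistics have different sampling properties, consider a small hypothetical sample (the data below are invented for illustration): the sample mean chases an outlier while the sample median does not.

```python
import statistics

# Hypothetical ages sampled from a target population; 60 is an outlier
sample = [18, 19, 21, 22, 23, 25, 60]

mean = statistics.mean(sample)      # sensitive to the outlier
median = statistics.median(sample)  # robust to the outlier
print(mean, median)
```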
Hypothesis Testing
A second goal of statistical inference is testing the
validity of hypotheses about parameters using sample
data.
H_O: p = 0.5
H_A: p > 0.5
If the observed frequency is much greater
than 0.5, we should reject the null
hypothesis in favor of the alternative
hypothesis.
How do we decide what “much greater” is?
Likelihood
For our purposes, it is sufficient to define the
likelihood function as
L(θ) = Pr(data; parameter values) = Pr(X; θ)
Analyses based on the likelihood function are
well-studied and usually have excellent statistical
properties.
Maximum Likelihood
Estimation
The maximum likelihood estimate of an unknown
parameter is defined to be the value of that parameter
that maximizes the likelihood function:
θ̂ = argmax_θ L(θ) = argmax_θ Pr(X; θ)

We say that θ̂ is the maximum likelihood estimate of θ.
Example: Binomial Probability

If X ~ Bin(n, p), then

L(p) = [n! / (x!(n − x)!)] p^x (1 − p)^(n − x)

Some simple calculus shows that the MLE of p is p̂ = x/n,
the frequency of “successes” in our sample of size n.

If we had been unable to do the calculus, we could
still have found the MLE by plotting the likelihood:

[Plot: the likelihood p^7 (1 − p)^3 as a function of p,
for x = 7 successes in n = 10 trials; the curve peaks at
p = 0.7.]
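When the calculus is awkward, the plotting idea is easy to code as a grid search; a sketch for x = 7 successes in n = 10 trials (the same likelihood p^7 (1 − p)^3 behind the plotted curve):

```python
def likelihood(p, x=7, n=10):
    # L(p) is proportional to p^x (1-p)^(n-x); the binomial
    # coefficient does not change where the maximum occurs
    return p ** x * (1 - p) ** (n - x)

# Evaluate the likelihood on a fine grid of p values and take the argmax
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # the MLE x/n = 0.7
```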
Likelihood Ratio Tests

Consider testing the hypothesis:

H_O: θ = θ_O
H_A: θ > θ_O

The likelihood ratio test statistic is:

Λ = [max over H_A of L(θ)] / [max over H_O of L(θ)]
  = L(θ̂_A) / L(θ̂_O)
Distribution of the Likelihood
Ratio Test Statistic
Under quite general conditions,
2 ln Λ ~ χ²_{n-1}
where n-1 is the difference between the number of
free parameters in the two hypotheses.
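As a concrete check of this approximation, consider hypothetical binomial data (x = 7 successes in n = 10 trials, an invented example) and test H_O: p = 0.5 against the unrestricted alternative, a one-parameter difference:

```python
from math import log

x, n = 7, 10          # hypothetical data
p_hat = x / n         # MLE under the alternative

def loglik(p):
    # binomial log-likelihood, dropping the constant binomial coefficient
    return x * log(p) + (n - x) * log(1 - p)

# 2 ln(Lambda): twice the log-likelihood gap between the two hypotheses
stat = 2 * (loglik(p_hat) - loglik(0.5))
print(stat)  # ≈ 1.65, below the chi-squared(1) 5% cutoff of 3.84
```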
The Parametric Bootstrap

Why we need it: The conditions necessary for the
asymptotic chi-squared distribution are not always
satisfied.

What it is: A simulation-based approach for evaluating
the p-value of a statistical test (often a likelihood
ratio test).

Parametric Bootstrap Procedure

1. Compute the LRT statistic using the observed data.
2. Use the parameters estimated under the null hypothesis
   to simulate a new dataset of the same size as the
   observed data.
3. Compute the LRT statistic for the simulated dataset.
4. Repeat steps 2 and 3, say, 1000 times.
5. Construct a histogram of the simulated LRT statistics.
6. The p-value for the test is the frequency of simulated
   LRT statistics that exceed the observed LRT statistic.
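The six steps above can be sketched as follows, using invented data (x = 16 successes in n = 20 trials) and a binomial LRT of H_O: p = 0.5; the data and sample sizes are illustrative assumptions, not part of the original slides:

```python
import random
from math import log

def loglik(p, x, n):
    # binomial log-likelihood with the 0 * log(0) = 0 convention
    out = 0.0
    if x:
        out += x * log(p)
    if n - x:
        out += (n - x) * log(1 - p)
    return out

def lrt(x, n, p0=0.5):
    """2 ln(Lambda) for H_O: p = p0 versus the unrestricted MLE."""
    return 2 * (loglik(x / n, x, n) - loglik(p0, x, n))

random.seed(1)
x_obs, n = 16, 20
observed = lrt(x_obs, n)                      # step 1: LRT for the observed data
sims = []
for _ in range(1000):                         # step 4: repeat 1000 times
    # step 2: simulate a same-size dataset under the null (p = 0.5)
    x_sim = sum(random.random() < 0.5 for _ in range(n))
    sims.append(lrt(x_sim, n))                # step 3: LRT for the simulated data
# step 5 would histogram `sims`; step 6: frequency of simulated LRTs
# at or above the observed LRT
p_value = sum(s >= observed for s in sims) / len(sims)
print(observed, p_value)
```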
Conditional Probability
The conditional probability of event A given that
event B has happened is
Pr(A | B) = Pr(A and B) / Pr(B)

For one roll of a fair die:

Pr(2 | even number rolled) = P(2 and even) / P(even)
                           = (1/6) / (1/2) = 1/3
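The die example can be verified by enumerating the sample space; a small sketch with exact fractions:

```python
from fractions import Fraction

outcomes = set(range(1, 7))   # one roll of a fair die
even = {2, 4, 6}
two_and_even = {2} & even     # the event "2 and even"

p_even = Fraction(len(even), len(outcomes))
p_two_and_even = Fraction(len(two_and_even), len(outcomes))

# Conditional probability: Pr(A | B) = Pr(A and B) / Pr(B)
p_two_given_even = p_two_and_even / p_even
print(p_two_given_even)  # 1/3
```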
Stochastic Processes
A stochastic process is a series of random variables
measured over time. Values in the future typically
depend on current values.
• Closing value of the stock market
• Annual per capita murder rate
• Current temperature

Example: evolution of a DNA sequence over time.

t = 0   ACGGTTACGGATTGTCGAA
t = 1   ACaGTTACGGATTGTCGAA
t = 2   ACaGTTACGGATgGTCGAA
t = 3   ACcGTTACGGATgGTCGAA
Inference for Stochastic
Processes
We often need to make inferences that involve the
changes in molecular genetic sequences over time.
Given a model for the process of sequence evolution,
likelihood analyses can be performed.
Pr(sequence_i → sequence_j; θ, t)
Introduction to Hidden Markov
Models (HMMs)
HMM: Hidden Markov Model

• Does this sequence come from a particular “class”?
  – Does the sequence contain a beta sheet?
• What can we determine about the internal composition
  of the sequence if it belongs to this class?
  – Assuming this sequence contains a gene, where are
    the splice sites?
Example: A Dishonest Casino
• Suppose a casino usually uses “fair” dice (the
  probability of any side is 1/6), but occasionally
  changes briefly to unfair dice (the probability of a 6
  is 1/2; all other sides have probability 1/10).
• We only observe the results of the tosses.
• Can we identify the tosses made with the biased dice?
The data we actually observe look like the following:
2 6 3 4 4 1 3 6 6 5 6 3 6 6 6 1 3 5 2 6 2 4 5
Which (if any) tosses were made using an unfair
die?
Scenario 1: F F F F F F F F F F F F F F F F F F F F F F F
Scenario 2: F F F F F F U U U U F F U U U F F F F F F F F
If the tosses were made with a fair die (scenario 1), the
probability of observing the series of tosses is:

Pr(data) = (1/6)^23 = 1.266 × 10^-18

If the indicated tosses were made with an unfair die
(scenario 2), then the series of tosses has probability:

Pr(data) = (1/6)^16 × (1/2)^5 × (1/10)^2 = 1.108 × 10^-16

2 6 3 4 4 1 3 6 6 5 6 3 6 6 6 1 3 5 2 6 2 4 5

The series of tosses is 87.5 times more probable under
scenario 2 than under scenario 1.
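The two scenario probabilities and their ratio check out numerically:

```python
# Scenario 1: all 23 tosses from the fair die
p_fair = (1 / 6) ** 23

# Scenario 2: 16 fair tosses, plus 7 unfair tosses
# (five 6s at probability 1/2 each, two non-6s at 1/10 each)
p_mixed = (1 / 6) ** 16 * (1 / 2) ** 5 * (1 / 10) ** 2

print(p_fair)            # ≈ 1.266e-18
print(p_mixed)           # ≈ 1.108e-16
print(p_mixed / p_fair)  # ≈ 87.5
```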
The dishonest-casino HMM:

Hidden states: Fair, Unfair

Emission probabilities:
  Fair:   P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
  Unfair: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

Transition probabilities:
  Fair → Fair 0.9,     Fair → Unfair 0.1
  Unfair → Unfair 0.5, Unfair → Fair 0.5
The Likelihood Function

Pr(X; θ) = Σ_π Pr(π) Pr(X | π; θ)   (sum over all hidden-state paths π)

• Pr(X; θ): the probability of the data, in terms of one or
  more unknown parameters; compute via the forward algorithm.
• Pr(π): the probability of the hidden states (may depend on
  one or more unknown parameters).
• Pr(X | π; θ): the probability of the data GIVEN the hidden
  states, in terms of one or more unknown parameters.
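A sketch of the forward algorithm for the casino model; the transition and emission values follow the diagram above, and the uniform start distribution is an assumption:

```python
states = ["F", "U"]
trans = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.5, "U": 0.5}}
emit = {"F": {s: 1 / 6 for s in range(1, 7)},
        "U": {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2}}
start = {"F": 0.5, "U": 0.5}  # assumed uniform initial distribution

def forward(obs):
    """Pr(X; theta), summing over all hidden-state paths."""
    # f[k] = probability of the observations so far AND being in state k
    f = {k: start[k] * emit[k][obs[0]] for k in states}
    for x in obs[1:]:
        f = {k: sum(f[j] * trans[j][k] for j in states) * emit[k][x]
             for k in states}
    return sum(f.values())

obs = [2, 6, 3, 4, 4, 1, 3, 6, 6, 5, 6, 3, 6, 6, 6, 1, 3, 5, 2, 6, 2, 4, 5]
print(forward(obs))
```

This sums the same path probabilities as the likelihood function, but in time linear in the sequence length rather than exponential.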
Predicting the Hidden States

1. The most probable state path (compute via the Viterbi
   algorithm):

   π̂ = argmax_π P(X, π; θ) = argmax_π Pr(π) Pr(X | π; θ)

2. Posterior state probabilities (compute via the forward
   and backward algorithms):

   Pr(π_i = k | X; θ) = P(X, π_i = k; θ) / P(X; θ)
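The Viterbi recursion for the same casino model (again assuming a uniform start distribution) can be sketched as follows; it tracks, for each state, the best path ending there and backtracks at the end:

```python
states = ["F", "U"]
trans = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.5, "U": 0.5}}
emit = {"F": {s: 1 / 6 for s in range(1, 7)},
        "U": {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2}}
start = {"F": 0.5, "U": 0.5}  # assumed uniform initial distribution

def viterbi(obs):
    """Most probable hidden-state path, by dynamic programming."""
    v = {k: start[k] * emit[k][obs[0]] for k in states}  # best path prob ending in k
    back = []                                            # backpointers per step
    for x in obs[1:]:
        ptr, nxt = {}, {}
        for k in states:
            best = max(states, key=lambda j: v[j] * trans[j][k])
            ptr[k] = best
            nxt[k] = v[best] * trans[best][k] * emit[k][x]
        back.append(ptr)
        v = nxt
    last = max(states, key=v.get)
    path = [last]
    for ptr in reversed(back):                           # backtrack
        path.append(ptr[path[-1]])
    return path[::-1]

obs = [2, 6, 3, 4, 4, 1, 3, 6, 6, 5, 6, 3, 6, 6, 6, 1, 3, 5, 2, 6, 2, 4, 5]
print("".join(viterbi(obs)))
```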
Simple Gene Structure Model

[Diagram: 5’ UTR, then exons and introns between the start
codon and the stop codon, then 3’ UTR.]

HMM Example: Gene Regions

[Diagram: states Start, Stop, Exon, Intron, 5’ UTR, 3’ UTR.]

Content sensor: a region of residues with similar
properties (introns, exons).

Signal sensor: a specific signal sequence; might be a
consensus sequence (start and stop codons).
Basic Gene-finding HMM

States (connected in a state diagram):

• B: Begin sequence
• S: Start translation
• D: Donor splice site
• A: Acceptor splice site
• T: Stop translation
• F: End sequence
• 5’: 5’ untranslated region
• EI: Initial exon
• ES: Single exon
• E: Exon
• I: Intron
• EF: Final exon
• 3’: 3’ untranslated region
OK, what do we do with it now?
The HMM must first be “trained” using a database of
known genes.
Consensus sequences for all signal sensors are
needed.
Compositional rules (i.e., emission probabilities) and
length distributions are necessary for content sensors.
Transition probabilities between all connected states
must be estimated.
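Transition probabilities are typically estimated by counting observed state changes in the training data; a toy sketch with an invented exon/intron annotation (real training uses a database of known genes):

```python
from collections import Counter

# Hypothetical labeled state path: E = exon, I = intron
path = "EEEEIIIIEEIIIEEEE"

# Count adjacent state pairs, then normalize each row to get the MLE
pairs = Counter(zip(path, path[1:]))
states = sorted(set(path))
trans = {a: {b: pairs[(a, b)] / sum(pairs[(a, c)] for c in states)
             for b in states}
         for a in states}
print(trans)
```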
GENSCAN

Chris Burge and Samuel Karlin (1997) “Prediction of
Complete Gene Structures in Human Genomic DNA.”
J. Mol. Biol. 268:78-94.
Basic features of GENSCAN

• HMM description of human genomic sequences, including:
  – Transcriptional, translational, and splicing signals
  – Length distributions and compositional features of
    introns, exons, and intergenic regions
  – Distinct model parameters for regions with different
    GC compositions
Accuracy per nucleotide         Accuracy per exon
Sn     Sp     AC     CC         Sn     Sp     Avg    ME     WE
0.93   0.93   0.91   0.92       0.78   0.81   0.80   0.09   0.05

Sn: Sensitivity = Prob(true nucleotide or exon predicted in a gene)
Sp: Specificity = Prob(predicted nucleotide or exon is in a gene)
AC and CC are overall measures of accuracy, including both
positive and negative predictions.
ME: Prob(true exon is completely missed in the prediction)
WE: Prob(predicted exon is not in a gene)
A A C T - - C T A
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A
A G C T C - C G A
PROSITE regular expression [AT][ATG][CT][T][CT]*[CT][GT][AG]
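The PROSITE pattern translates almost directly into a Python regular expression (the character classes carry over; `*` marks the optional column):

```python
import re

# The PROSITE expression above, as a Python regex
pattern = re.compile(r"[AT][ATG][CT]T[CT]*[CT][GT][AG]")

# Ungapped rows of the alignment match (zero optional residues),
# and so does an all-T query with one optional T
print(bool(pattern.fullmatch("AACTCTA")))   # True
print(bool(pattern.fullmatch("TTTTTTTG")))  # True
```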
Profile HMMs
A A C T - - C T A
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A
A G C T C - C G A
[AT][ATG][CT][T][CT]*[CT][GT][AG]
T T T T T T T G
The sequence matches the PROSITE expression at
every position, but does it really “match” the
profile?
A A C T - - C T A
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A

Profile built from the five aligned sequences: seven match
states (M1–M7, one per ungapped column) and one insert state
(I, between M4 and M5), with these emission probabilities:

     M1   M2   M3   M4   I    M5   M6   M7
A    .8   .4   0    0    0    0    0    .8
C    0    0    .8   0    .75  .8   0    0
G    0    .4   0    0    0    0    .6   .2
T    .2   .2   .2   1    .25  .2   .4   0

Transition probabilities: every match-to-match step is 1.0
except out of M4, where M4 → M5 = 0.4 and M4 → I = 0.6;
from the insert state, I → M5 = 0.75 and I → I = 0.25.
T T T T T T T G

Scoring the query against the profile (path M1 M2 M3 M4 → I
→ M5 M6 M7), multiplying each emission probability by the
transition probability into the next state:

Pr = .2 × 1 × .2 × 1 × .2 × 1 × 1 × .6 × .25 × .75 × .2 × 1 × .4 × 1 × .2
   ≈ 1.44 × 10^-5
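The path probability can be checked by multiplying the emissions and transitions read off the profile above (the path through the insert state is an interpretation of the figure):

```python
# Query T T T T T T T G along the path M1 M2 M3 M4 -> I -> M5 M6 M7
emissions = [0.2, 0.2, 0.2, 1.0, 0.25, 0.2, 0.4, 0.2]  # P(residue) at each state
transitions = [1.0, 1.0, 1.0, 0.6, 0.75, 1.0, 1.0]     # M4->I = 0.6, I->M5 = 0.75

score = 1.0
for e in emissions:
    score *= e
for t in transitions:
    score *= t
print(score)  # ≈ 1.44e-05
```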
General Profile HMM Structure

[Diagram: Begin and End states linked by a chain of match
states (Mj), with insert states (Ij) above and delete states
(Dj) below the chain.]