Lecture 12: Hidden Markov Models
Machine Learning
Andrew Rosenberg
March 12, 2010
Last Time
Clustering
Today
Hidden Markov Models
Dice Example
Imagine a game of dice.
When the croupier rolls 4,5,6 you win.
When the croupier rolls 1,2,3 you lose.
Model the likelihood of winning.
IID multinomials
The Moody Croupier
Now imagine that the croupier cheats.
There are three dice.
One fair (fair)
One good for the house (bad)
One good for you (good)
The Moody Croupier
Model the likelihood of winning.
IID multinomials
Latent variable
[Graphical model: latent die choice $q_i$ generates observation $x_i$, for $i \in \{0, \ldots, n-1\}$]
The Moody Croupier
Model the likelihood of winning.
IID multinomials
Latent variable
Allow a prior over die choices
[Graphical model: prior $\theta$ over die choice $q_i$; each $q_i$ generates observation $x_i$, for $i \in \{0, \ldots, n-1\}$]
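A minimal numeric sketch of this setup (the prior and the per-die face probabilities below are made-up values, not from the lecture): the likelihood of a roll is the mixture $p(x) = \sum_q p(q)\, p(x \mid q)$ over the three dice.

import numpy as np

# Prior over which die the croupier picks: fair, bad (for you), good (for you).
theta = np.array([0.6, 0.2, 0.2])
# Per-die multinomial over the six faces (illustrative values).
faces = np.array([
    [1/6] * 6,                               # fair die
    [0.25, 0.25, 0.25, 0.10, 0.10, 0.05],    # "bad" die favours the losing faces 1-3
    [0.05, 0.10, 0.10, 0.25, 0.25, 0.25],    # "good" die favours the winning faces 4-6
])

def p_roll(x):
    """Marginal probability of rolling face x (0-indexed): sum_q p(q) p(x | q)."""
    return float(theta @ faces[:, x])

# Probability of winning (rolling a 4, 5 or 6) under the mixture.
p_win = sum(p_roll(x) for x in (3, 4, 5))
print(p_win)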
The Moody Croupier
Now what if the dealer is moody?
The dealer doesn’t like to change the die that often.
The dealer doesn’t like to switch from the good die to the bad die.
No longer iid!
The die he uses at time t is dependent on the die used at t − 1
[Graphical model: a Markov chain of die choices $q_0, q_1, \ldots, q_{T-1}$ with parameters $\theta$; each $q_t$ emits an observation $x_t$]
Sequential Modeling
[Graphical model: state chain $q_0 \rightarrow q_1 \rightarrow \cdots \rightarrow q_{T-1}$, with each $q_t$ emitting $x_t$]
Temporal or sequence model.
Markov Assumption
future $\perp\!\!\!\perp$ past $\mid$ present
$p(q_t \mid q_{t-1}, q_{t-2}, q_{t-3}, \ldots, q_0) = p(q_t \mid q_{t-1})$
Get the complete likelihood from the graphical model.
$p(q, x) = p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(x_t \mid q_t)$
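A sketch of reading this factorization off the chain (the parameter values below are illustrative, not from the lecture): the complete likelihood of a fully observed state/emission sequence is one initial term, one transition term per step, and one emission term per step.

import numpy as np

pi = np.array([0.5, 0.3, 0.2])               # p(q_0)
A = np.array([[0.5, 0.3, 0.2],               # p(q_t | q_{t-1}); rows sum to 1
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
B = np.array([[1/6] * 6,                     # p(x_t | q_t); rows sum to 1
              [0.25, 0.25, 0.25, 0.10, 0.10, 0.05],
              [0.05, 0.10, 0.10, 0.25, 0.25, 0.25]])

def complete_likelihood(q, x):
    """p(q, x) for an observed state sequence q and emission sequence x."""
    p = pi[q[0]] * B[q[0], x[0]]
    for t in range(1, len(x)):
        p *= A[q[t - 1], q[t]] * B[q[t], x[t]]
    return p

print(complete_likelihood(q=[0, 0, 1], x=[5, 2, 0]))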
Sequential Modeling
future $\perp\!\!\!\perp$ past $\mid$ present
$p(q_t \mid q_{t-1}, q_{t-2}, q_{t-3}, \ldots, q_0) = p(q_t \mid q_{t-1})$
Get the complete likelihood from the graphical model.
$p(q, x) = p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(x_t \mid q_t)$
$p(q_t \mid q_{t-1})$?
HMMs as state machines
HMMs have two variables: state q and emission x
In general the state is an unobserved latent variable.
Can consider HMMs as stochastic automata – weighted finite-state machines.
HMM state machine
[State-transition diagram: states good, fair, and bad. Each state has self-transition probability 0.5; the remaining mass goes to the other two states with probabilities 0.3 and 0.2.]
HMMs as state machines
HMMs have two variables: state q and emission x
In general the state is an unobserved latent variable.
No observation of q directly. Only a related emission distribution – a “doubly-stochastic automaton”.
HMM Applications
Speech Recognition (Rabiner): phonemes from audio cepstral vectors
Language (Jelinek): part of speech tag from words
Biology (Baldi): splice site from gene sequence
Gesture (Starner): word from hand coordinates
Emotion (Picard): emotion from EEG
Types of Variables
Continuous States
E.g., Kalman filters: $p(q_t \mid q_{t-1}) = \mathcal{N}(q_t \mid A q_{t-1}, Q)$
Discrete States
E.g., Finite state machine
$p(q_t \mid q_{t-1}) = \prod_{i=0}^{M-1} \prod_{j=0}^{M-1} [\alpha_{ij}]^{q^i_{t-1} q^j_t}$
Continuous Observations
E.g., time series data: $p(x_t \mid q_t) = \mathcal{N}(x_t \mid \mu_{q_t}, \Sigma_{q_t})$
Discrete Observations
E.g. strings
$p(x_t \mid q_t) = \prod_{i=0}^{M-1} \prod_{j=0}^{N-1} [\eta_{ij}]^{q^i_t x^j_t}$
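A small sketch of the two observation models side by side, with made-up parameters (and assuming scipy is available): a discrete emission CPT $\eta$ indexed by (state, symbol), and a per-state Gaussian $\mathcal{N}(x_t \mid \mu_{q_t}, \Sigma_{q_t})$.

import numpy as np
from scipy.stats import multivariate_normal

eta = np.array([[0.7, 0.2, 0.1],             # p(symbol | state) for discrete observations
                [0.1, 0.3, 0.6]])
mu = [np.zeros(2), np.ones(2)]               # per-state means for continuous observations
Sigma = [np.eye(2), 0.5 * np.eye(2)]         # per-state covariances

state = 1
print(eta[state, 2])                                                  # discrete: p(x_t = 2 | q_t = 1)
print(multivariate_normal(mu[state], Sigma[state]).pdf([1.0, 0.8]))   # continuous density at x_t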
HMM Parameters
M states and N-class observations.
Complete likelihood from the graphical model:
$p(q, x) = p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(x_t \mid q_t)$
Marginalize over unobserved hidden states
$p(x) = \sum_{q_0} \cdots \sum_{q_{T-1}} p(q, x)$
CPTs are reused: θ = {π, η, α}
$p(q_t \mid q_{t-1}) = \prod_{i=0}^{M-1} \prod_{j=0}^{M-1} [\alpha_{ij}]^{q^i_{t-1} q^j_t}$
$p(x_t \mid q_t) = \prod_{i=0}^{M-1} \prod_{j=0}^{N-1} [\eta_{ij}]^{q^i_t x^j_t}$
$p(q_0) = \prod_{i=0}^{M-1} [\pi_i]^{q^i_0}$
$\sum_{j=0}^{M-1} \alpha_{ij} = 1 \qquad \sum_{j=0}^{N-1} \eta_{ij} = 1 \qquad \sum_{i=0}^{M-1} \pi_i = 1$
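A sketch of the parameter set $\theta = \{\pi, \alpha, \eta\}$ in code, with illustrative values for M = 3 states and N = 6 observation classes, checking the three normalization constraints above.

import numpy as np

M, N = 3, 6
pi = np.array([0.5, 0.3, 0.2])                 # p(q_0 = i)
alpha = np.array([[0.5, 0.3, 0.2],             # alpha[i, j] = p(q_t = j | q_{t-1} = i)
                  [0.2, 0.5, 0.3],
                  [0.3, 0.2, 0.5]])
eta = np.full((M, N), 1.0 / N)                 # eta[i, j] = p(x_t = j | q_t = i), uniform here

assert np.isclose(pi.sum(), 1.0)               # sum_i pi_i = 1
assert np.allclose(alpha.sum(axis=1), 1.0)     # sum_j alpha_ij = 1 for every i
assert np.allclose(eta.sum(axis=1), 1.0)       # sum_j eta_ij = 1 for every i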
HMM Operations
Evaluate
Evaluate the likelihood of a model given data.
Decode
Identify the most likely sequence of states
Max Likelihood
Estimate the parameters.
JTA on HMM
[Figure: junction tree for the HMM chain – cliques $\psi(q_t, q_{t+1})$ and $\psi(q_t, x_t)$, with separators $\phi(q_t)$ along the chain and $\zeta(q_t)$ to the emissions]
JTA on HMMs
Initialization
$\psi(q_0, x_0) = p(q_0)\, p(x_0 \mid q_0) = p(x_0, q_0)$
$\psi(q_t, q_{t+1}) = p(q_{t+1} \mid q_t) = A_{q_t, q_{t+1}}$
$\psi(q_t, x_t) = p(x_t \mid q_t)$
$Z = 1$
$\phi(q_t) = 1$
$\zeta(q_t) = 1$
JTA on HMMs
Update
Collect up from the leaves – the ζ separators don’t change.
$\zeta^*(q_t) = \sum_{x_t} \psi(q_t, x_t) = \sum_{x_t} p(x_t \mid q_t) = 1$
$\psi^*(q_{t-1}, q_t) = \frac{\zeta^*(q_t)}{\zeta(q_t)}\, \psi(q_{t-1}, q_t) = \psi(q_{t-1}, q_t)$
JTA on HMMs
Update
Collect left to right over the φ separators – the state potentials become marginals.
$\phi^*(q_0) = \sum_{x_0} \psi(q_0, x_0) = p(q_0)$
$\phi^*(q_t) = \sum_{q_{t-1}} \psi^*(q_{t-1}, q_t) = p(q_t)$
$\psi^*(q_0, q_1) = \frac{\phi^*(q_0)}{\phi(q_0)}\, \psi(q_0, q_1) = p(q_0, q_1)$
JTA on HMMs
Distribute
Distribute to separators
$\zeta^{**}(q_t) = \sum_{q_{t-1}} \psi^*(q_{t-1}, q_t) = \sum_{q_{t-1}} p(q_{t-1}, q_t) = p(q_t)$
$\psi^{**}(q_t, x_t) = \frac{\zeta^{**}(q_t)}{\zeta^*(q_t)}\, \psi(q_t, x_t) = \frac{p(q_t)}{1}\, p(x_t \mid q_t) = p(x_t, q_t)$
Introduction of Evidence
$p(q \mid \bar{x}) \propto p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(\bar{x}_t \mid q_t)$
Observe a sequence of data.
Potentials become slices
$\psi(q_t, \bar{x}_t) = p(\bar{x}_t \mid q_t)$
$\zeta^*(q_t) = \psi(q_t, \bar{x}_t) = p(\bar{x}_t \mid q_t)$
$\zeta^*(q_t) \neq \sum_{x_t} \psi(q_t, x_t)$
Collect ζ separators bottom up: $\zeta^*(q_t) = \psi(q_t, \bar{x}_t) = p(\bar{x}_t \mid q_t)$
Collect φ separators to the right: $\phi^*(q_0) = \sum_{x_0} \psi(q_0, x_0)\,\delta(x_0 - \bar{x}_0) = p(q_0, \bar{x}_0)$
JTA collect
Collect up and to the right, updating each potential by its left and bottom separators.
$\psi^*(q_t, q_{t+1}) = \frac{\phi^*(q_t)}{1}\, \frac{\zeta^*(q_{t+1})}{1}\, \psi(q_t, q_{t+1}) = \phi^*(q_t)\, p(\bar{x}_{t+1} \mid q_{t+1})\, \alpha_{q_t q_{t+1}}$
$\phi^*(q_{t+1}) = \sum_{q_t} \psi^*(q_t, q_{t+1}) = \sum_{q_t} \phi^*(q_t)\, p(\bar{x}_{t+1} \mid q_{t+1})\, \alpha_{q_t q_{t+1}}$
Note:
$\phi^*(q_0) = p(\bar{x}_0, q_0)$
$\phi^*(q_1) = \sum_{q_0} p(\bar{x}_0, q_0)\, p(\bar{x}_1 \mid q_1)\, p(q_1 \mid q_0) = p(\bar{x}_0, \bar{x}_1, q_1)$
$\phi^*(q_2) = \sum_{q_1} p(\bar{x}_0, \bar{x}_1, q_1)\, p(\bar{x}_2 \mid q_2)\, p(q_2 \mid q_1) = p(\bar{x}_0, \bar{x}_1, \bar{x}_2, q_2)$
$\phi^*(q_{t+1}) = \sum_{q_t} p(\bar{x}_0, \ldots, \bar{x}_t, q_t)\, p(\bar{x}_{t+1} \mid q_{t+1})\, p(q_{t+1} \mid q_t) = p(\bar{x}_0, \ldots, \bar{x}_{t+1}, q_{t+1})$
Evaluation
Compute the likelihood of the sequence.
Collection is sufficient.
From previous slide
$\phi^*(q_{t+1}) = \sum_{q_t} p(\bar{x}_0, \ldots, \bar{x}_t, q_t)\, p(\bar{x}_{t+1} \mid q_{t+1})\, p(q_{t+1} \mid q_t) = p(\bar{x}_0, \ldots, \bar{x}_{t+1}, q_{t+1})$
So the rightmost node gives:
$\phi^*(q_{T-1}) = p(\bar{x}_0, \ldots, \bar{x}_{T-1}, q_{T-1})$
The likelihood just requires marginalization over qT−1.
$p(\bar{x}_0, \ldots, \bar{x}_{T-1}) = \sum_{q_{T-1}} p(\bar{x}_0, \ldots, \bar{x}_{T-1}, q_{T-1}) = \sum_{q_{T-1}} \phi^*(q_{T-1})$
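A sketch of evaluation computed this way (variable names mirror the slides, and the parameters are assumed to be arrays like the illustrative pi, alpha, eta above): the collect recursion is a forward pass over t, and summing the final separator over $q_{T-1}$ gives the likelihood. In practice one would work in log space or rescale to avoid underflow on long sequences.

import numpy as np

def evaluate(x, pi, alpha, eta):
    """Likelihood p(xbar_0, ..., xbar_{T-1}) via the collect (forward) recursion."""
    T, M = len(x), len(pi)
    phi = np.zeros((T, M))
    phi[0] = pi * eta[:, x[0]]                    # phi*(q_0) = p(xbar_0, q_0)
    for t in range(T - 1):
        # phi*(q_{t+1}) = sum_{q_t} phi*(q_t) p(xbar_{t+1} | q_{t+1}) alpha_{q_t q_{t+1}}
        phi[t + 1] = (phi[t] @ alpha) * eta[:, x[t + 1]]
    return phi[-1].sum()                          # marginalize over q_{T-1}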
Distribute
But the potentials cannot be read as marginals without the Distribute step of the JTA.
Last state of collection:
$\psi^*(q_{T-2}, q_{T-1}) = \frac{\phi^*(q_{T-2})}{1}\, \frac{\zeta^*(q_{T-1})}{1}\, \psi(q_{T-2}, q_{T-1}) = \phi^*(q_{T-2})\, p(\bar{x}_{T-1} \mid q_{T-1})\, \alpha_{q_{T-2} q_{T-1}}$
Distribute ** along the state nodes to the left.
Distribute ** down from state nodes to observation nodes.
Update the potentials.
$\psi^{**}(q_{T-2}, q_{T-1}) = \psi^*(q_{T-2}, q_{T-1})$
$\phi^{**}(q_t) = \sum_{q_{t+1}} \psi^{**}(q_t, q_{t+1})$
$\zeta^{**}(q_{t+1}) = \sum_{q_t} \psi^{**}(q_t, q_{t+1})$
$\psi^{**}(q_t, q_{t+1}) = \frac{\phi^{**}(q_{t+1})}{\phi^*(q_{t+1})}\, \psi^*(q_t, q_{t+1})$
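For intuition, a sketch of what collect plus distribute buys you: after a backward sweep, normalizing gives the smoothed posteriors $p(q_t \mid \bar{x}_0, \ldots, \bar{x}_{T-1})$. This is the standard forward-backward computation written directly rather than in JTA notation; parameter names follow the earlier sketches.

import numpy as np

def posteriors(x, pi, alpha, eta):
    """Smoothed state posteriors p(q_t | xbar_0..xbar_{T-1}), one row per time step."""
    T, M = len(x), len(pi)
    fwd = np.zeros((T, M))                        # fwd[t, j] = p(xbar_0..xbar_t, q_t = j)
    fwd[0] = pi * eta[:, x[0]]
    for t in range(T - 1):
        fwd[t + 1] = (fwd[t] @ alpha) * eta[:, x[t + 1]]
    bwd = np.zeros((T, M))                        # bwd[t, i] = p(xbar_{t+1}..xbar_{T-1} | q_t = i)
    bwd[-1] = 1.0
    for t in range(T - 2, -1, -1):
        bwd[t] = alpha @ (eta[:, x[t + 1]] * bwd[t + 1])
    gamma = fwd * bwd                             # proportional to p(q_t, xbar_0..xbar_{T-1})
    return gamma / gamma.sum(axis=1, keepdims=True)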
Decoding
Decode: given $x_0, \ldots, x_{T-1}$, identify the most likely $q_0, \ldots, q_{T-1}$.
Now that the JTA is finished, we have marginals in the potentials and separators:
$\phi^{**}(q_t) \propto p(q_t \mid \bar{x}_0, \ldots, \bar{x}_{T-1})$
$\zeta^{**}(q_{t+1}) \propto p(q_{t+1} \mid \bar{x}_0, \ldots, \bar{x}_{T-1})$
$\psi^{**}(q_t, q_{t+1}) \propto p(q_t, q_{t+1} \mid \bar{x}_0, \ldots, \bar{x}_{T-1})$
Need to find the most likely path from $q_0$ to $q_{T-1}$.
Argmax JTA: run the JTA, but rather than sums in the update rule, use the max operator. Then find the largest entry in the separators:
$\hat{q}_t = \arg\max_{q_t} \phi^{**}(q_t)$
Viterbi Decoding
Finding an optimal state sequence can be intractable.
There are $M^T$ possible paths, for M states and T time steps.
T can easily be on the order of 1000 in speech recognition.
Construct a lattice of state transitions.
Viterbi Decoding
Only continue to explore paths with likelihood greater than some threshold, or only continue to explore the top N paths.
Also known as beam search
Polynomial time algorithm to approximately decode a lattice.
Algorithm (a code sketch follows below):
Initialize paths at every state.
For each transition follow only the most likely edge.
or
Initialize paths at every state.
For each transition follow only those paths that have a likelihood over some threshold.
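A sketch of exact max-product (Viterbi) decoding over the lattice, done in log space; the beam-search variants above would simply prune low-scoring states or paths at each time step instead of keeping all M. Parameter names (pi, alpha, eta) follow the earlier sketches.

import numpy as np

def viterbi(x, pi, alpha, eta):
    """Most likely state sequence q_0..q_{T-1} for observations x."""
    T, M = len(x), len(pi)
    score = np.zeros((T, M))                   # best log-probability of a path ending in state j at time t
    back = np.zeros((T, M), dtype=int)         # backpointers to the best predecessor state
    score[0] = np.log(pi) + np.log(eta[:, x[0]])
    for t in range(1, T):
        cand = score[t - 1][:, None] + np.log(alpha)   # cand[i, j]: best path into i, then i -> j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + np.log(eta[:, x[t]])
    path = [int(score[-1].argmax())]           # best final state, then trace the backpointers
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]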
Viterbi decoding
[Figure: lattice of state transitions over time]
Maximum Likelihood
Training parameters with observed states.
Maximum likelihood (as ever).
$l(\theta) = \log p(q, \bar{x})$
$= \log \left( p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(\bar{x}_t \mid q_t) \right)$
$= \log p(q_0) + \sum_{t=1}^{T-1} \log p(q_t \mid q_{t-1}) + \sum_{t=0}^{T-1} \log p(\bar{x}_t \mid q_t)$
$= \log \prod_{i=0}^{M-1} [\pi_i]^{q^i_0} + \sum_{t=1}^{T-1} \log \prod_{i=0}^{M-1} \prod_{j=0}^{M-1} [\alpha_{ij}]^{q^i_{t-1} q^j_t} + \sum_{t=0}^{T-1} \log \prod_{i=0}^{M-1} \prod_{j=0}^{N-1} [\eta_{ij}]^{q^i_t \bar{x}^j_t}$
$= \sum_{i=0}^{M-1} q^i_0 \log \pi_i + \sum_{t=1}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} q^i_{t-1} q^j_t \log \alpha_{ij} + \sum_{t=0}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} q^i_t \bar{x}^j_t \log \eta_{ij}$
Introduce Lagrange multipliers, take partials, set to zero.
Maximum Likelihood
Training parameters with observed states.
Maximum likelihood – as ever.
$l(\theta) = \sum_{i=0}^{M-1} q^i_0 \log \pi_i + \sum_{t=1}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} q^i_{t-1} q^j_t \log \alpha_{ij} + \sum_{t=0}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} q^i_t \bar{x}^j_t \log \eta_{ij}$
Introduce Lagrange multipliers, take partials, set to zero.
$\sum_{i=0}^{M-1} \pi_i = 1 \qquad \sum_{j=0}^{M-1} \alpha_{ij} = 1 \qquad \sum_{j=0}^{N-1} \eta_{ij} = 1$
$\hat{\pi}_i = q^i_0$
$\hat{\alpha}_{ij} = \frac{\sum_{t=0}^{T-2} q^i_t q^j_{t+1}}{\sum_{k=0}^{M-1} \sum_{t=0}^{T-2} q^i_t q^k_{t+1}}$
$\hat{\eta}_{ij} = \frac{\sum_{t=0}^{T-1} q^i_t \bar{x}^j_t}{\sum_{k=0}^{N-1} \sum_{t=0}^{T-1} q^i_t \bar{x}^k_t}$
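A sketch of these closed-form estimates as normalized counts over a single fully observed sequence (the function and variable names are illustrative; states that never occur in the data would need smoothing to avoid dividing by zero).

import numpy as np

def fit_observed(q, x, M, N):
    """ML estimates of pi, alpha, eta from an observed state sequence q and emissions x."""
    pi_hat = np.zeros(M)
    pi_hat[q[0]] = 1.0                         # with one sequence, pi_hat is the indicator of q_0
    A_counts = np.zeros((M, M))
    E_counts = np.zeros((M, N))
    for t in range(len(q) - 1):
        A_counts[q[t], q[t + 1]] += 1          # count transitions q_t -> q_{t+1}
    for t in range(len(q)):
        E_counts[q[t], x[t]] += 1              # count emissions q_t -> x_t
    alpha_hat = A_counts / A_counts.sum(axis=1, keepdims=True)   # normalize rows (smooth if a row is all zero)
    eta_hat = E_counts / E_counts.sum(axis=1, keepdims=True)
    return pi_hat, alpha_hat, eta_hat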
Expectation Maximization
However, we may not have observed state sequences.
The Moody Croupier
Need to do unsupervised learning (clustering) on the states.
Maximize the expected likelihood given a guess for p(q).
Expectation Maximization – covered when we move to unsupervised techniques.
Bye
Next
Perceptron and Neural Networks