Hidden Markov Models
By Marc Sobel
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)2
Introduction
Modeling dependencies in the input: observations are no longer iid.
Sequences:
Temporal: in speech, phonemes in a word (dictionary) and words in a sentence (syntax, semantics of the language); in handwriting, pen movements.
Spatial: in a DNA sequence, base pairs.
Discrete Markov Process
N states: $S_1, S_2, \ldots, S_N$. State at “time” t: $q_t = S_i$
First-order Markov: $P(q_{t+1}=S_j \mid q_t=S_i, q_{t-1}=S_k, \ldots) = P(q_{t+1}=S_j \mid q_t=S_i)$
Transition probabilities: $a_{ij} \equiv P(q_{t+1}=S_j \mid q_t=S_i)$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$
Initial probabilities: $\pi_i \equiv P(q_1=S_i)$, with $\sum_{i=1}^{N} \pi_i = 1$
Time-based Models
The models typically examined by statistics: simple parametric distributions and discrete distribution estimates.
These are typically based on what is called the “independence assumption”: each data point is independent of the others, and there is no time-sequencing or ordering.
What if the data has correlations based on its order, like a time-series?
Applications of time based models
Sequential pattern recognition is a relevant problem in a number of disciplines:
Human-computer interaction: speech recognition
Bioengineering: ECG and EEG analysis
Robotics: mobile robot navigation
Bioinformatics: DNA base sequence alignment
Andrei Andreyevich Markov
Born: 14 June 1856 in Ryazan, Russia
Died: 20 July 1922 in Petrograd (now St Petersburg), Russia
Markov is particularly remembered for his study of Markov chains, sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.
Markov random processes
A random sequence has the Markov property if the distribution of its next state, given its entire history, depends only on its current state. Any random process having this property is called a Markov random process.
For observable state sequences (state is known from data), this leads to a Markov chain model.
For non-observable states, this leads to a Hidden Markov Model (HMM).
Chain Rule & Markov Property
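Bayes rule, applied repeatedly:
$$P(q_1,\ldots,q_t) = P(q_t \mid q_1,\ldots,q_{t-1})\,P(q_1,\ldots,q_{t-1})$$
$$P(q_1,\ldots,q_t) = P(q_t \mid q_1,\ldots,q_{t-1})\,P(q_{t-1} \mid q_1,\ldots,q_{t-2})\,P(q_1,\ldots,q_{t-2})$$
$$P(q_1,\ldots,q_t) = P(q_1)\prod_{i=2}^{t} P(q_i \mid q_1,\ldots,q_{i-1})$$
Markov property:
$$P(q_i \mid q_1,\ldots,q_{i-1}) = P(q_i \mid q_{i-1}) \quad \text{for } i > 1$$
$$P(q_1,\ldots,q_t) = P(q_1)\prod_{i=2}^{t} P(q_i \mid q_{i-1}) = P(q_1)\,P(q_2 \mid q_1)\cdots P(q_t \mid q_{t-1})$$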
A Markov System
Has N states, called s1, s2, ..., sN
There are discrete timesteps, t=0, t=1, …
[Figure: a Markov system with N = 3 states (s1, s2, s3), shown at t=0]
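Example: Balls and Urns (Markov process with a non-hidden observation process: a stochastic automaton)
Three urns, each full of balls of one color. S1: red, S2: blue, S3: green
$$\Pi = [0.5,\ 0.2,\ 0.3]^T \qquad A = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$$
$$O = \{S_1, S_1, S_3, S_3\}$$
$$P(O \mid A, \Pi) = P(S_1)\,P(S_1 \mid S_1)\,P(S_3 \mid S_1)\,P(S_3 \mid S_3) = \pi_1\, a_{11}\, a_{13}\, a_{33} = 0.5 \cdot 0.4 \cdot 0.3 \cdot 0.8 = 0.048$$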
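The product above can be checked numerically. The following is a minimal sketch, assuming 0-based state indices (S1 is index 0); the function name `sequence_probability` is illustrative and not part of the original slides.

```python
import numpy as np

# Values from the balls-and-urns example: initial and transition probabilities.
pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def sequence_probability(states, pi, A):
    """P(q_1, ..., q_T) for an observable Markov chain (0-based state indices)."""
    prob = pi[states[0]]                      # initial probability of the first state
    for prev, curr in zip(states[:-1], states[1:]):
        prob *= A[prev, curr]                 # one transition factor per step
    return prob

# O = {S1, S1, S3, S3} corresponds to indices 0, 0, 2, 2.
p = sequence_probability([0, 0, 2, 2], pi, A)
print(p)  # approximately 0.048
```

Because the states are observed directly, the probability is a single product with one factor per time step; no summation over hidden paths is needed.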
A plot of 100 observed numbers for the stochastic automaton
Histogram for the stochastic automaton: the proportions reflect the stationary distribution of the chain
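To make the link between the histogram and the stationary distribution concrete, here is a hedged sketch that solves πA = π for the three-urn transition matrix and compares the result with the empirical state proportions of a simulated chain. The helper names, the random seed, and the chain length are assumptions chosen for illustration.

```python
import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def stationary_distribution(A):
    """Solve pi A = pi together with sum(pi) = 1 as a least-squares system."""
    n = A.shape[0]
    # Balance equations (A^T - I) pi = 0, stacked with the normalization row.
    M = np.vstack([A.T - np.eye(n), np.ones((1, n))])
    rhs = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return pi

def simulate_chain(A, start, steps, rng):
    """Simulate `steps` states of the chain, starting from state index `start`."""
    states = [start]
    for _ in range(steps - 1):
        states.append(rng.choice(A.shape[0], p=A[states[-1]]))
    return np.array(states)

pi_stat = stationary_distribution(A)
rng = np.random.default_rng(0)                 # arbitrary seed
states = simulate_chain(A, start=0, steps=20000, rng=rng)
empirical = np.bincount(states, minlength=3) / len(states)

print(pi_stat)    # approximately [0.1818, 0.2727, 0.5455]
print(empirical)  # close to pi_stat for a long chain
```

The exact solution is (2/11, 3/11, 6/11); a short run such as the 100 observations plotted on the slide will match it only roughly.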
Hidden Markov Models
States are not observable.
Discrete observations {v1, v2, ..., vM} are recorded; each is a probabilistic function of the state.
Emission probabilities: bj(m) ≡ P(Ot=vm | qt=Sj)
Example: In each urn, there are balls of different colors, but with different probabilities.
For each observation sequence, there are multiple state sequences.
From Markov To Hidden Markov
The previous model assumes that each state can be uniquely associated with an observable event.
Once an observation is made, the state of the system is then trivially retrieved.
This model, however, is too restrictive to be of practical use for most realistic problems.
To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state.
Each state can produce a number of outputs according to a unique probability distribution, and each distinct output can potentially be generated at any state.
These are known as Hidden Markov Models (HMMs), because the state sequence is not directly observable; it can only be estimated from the sequence of observations produced by the system.
The coin-toss problem
To illustrate the concept of an HMM, consider the following scenario:
Assume that you are placed in a room with a curtain.
Behind the curtain there is a person performing a coin-toss experiment.
This person selects one of several coins and tosses it: heads (H) or tails (T).
The person tells you the outcome (H, T), but not which coin was used each time.
Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}
The coins represent the states; these are hidden because you do not know which coin was tossed each time.
The outcome of each toss represents an observation.
A “likely” sequence of coins may be inferred from the observations, but this state sequence will not be unique.
Speech Recognition
We record the sound signals associated with words.
We’d like to identify the speech recognition features associated with pronouncing these words.
The features are the states and the sound signals are the observations.
The Coin Toss Example – 2 coins
From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins
The urn-ball problem
To further illustrate the concept of an HMM, consider this scenario:
You are placed in the same room with a curtain.
Behind the curtain there are N urns, each containing a large number of balls with M different colors.
The person behind the curtain selects an urn according to an internal random process, then randomly grabs a ball from the selected urn.
He shows you the ball, and places it back in the urn.
This process is repeated over and over.
Questions: How would you represent this experiment with an HMM? What are the states? Why are the states hidden? What are the observations?
Doubly Stochastic System: The Urn-and-Ball Model
O = {green, blue, green, yellow, red, ..., blue}
How can we determine the appropriate model for the observation sequence given the system above?
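The "doubly stochastic" structure can be illustrated by generating data from such a system: one random process picks the urn, a second picks the ball. The sketch below is an illustration only; the matrices `pi`, `A`, and `B`, the three-color coding, and the function name `sample_hmm` are assumed values, not taken from the figure.

```python
import numpy as np

def sample_hmm(pi, A, B, T, rng):
    """Draw a hidden urn path and the observed ball colors for T draws."""
    n_states, n_symbols = B.shape
    states = np.empty(T, dtype=int)
    obs = np.empty(T, dtype=int)
    states[0] = rng.choice(n_states, p=pi)
    obs[0] = rng.choice(n_symbols, p=B[states[0]])
    for t in range(1, T):
        # First random process: choose the next urn given the current urn.
        states[t] = rng.choice(n_states, p=A[states[t - 1]])
        # Second random process: draw a ball color from the chosen urn.
        obs[t] = rng.choice(n_symbols, p=B[states[t]])
    return states, obs

# Illustrative (assumed) parameters: 3 urns, 3 colors coded 0, 1, 2.
pi = np.array([1.0, 0.0, 0.0])                 # always start at the first urn
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.3, 0.5],
              [0.2, 0.5, 0.3]])

rng = np.random.default_rng(1)                 # arbitrary seed
states, obs = sample_hmm(pi, A, B, T=50, rng=rng)
print(obs[:10])     # what the observer sees
print(states[:10])  # what stays behind the curtain
```

Only `obs` would be available to the observer in the room; recovering `states` or the parameters from `obs` alone is what the later problems address.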
Four Basic Problems of HMMs
1. Evaluation: Given $\lambda$ and $O$, calculate $P(O \mid \lambda)$
2. State sequence: Given $\lambda$ and $O$, find $Q^*$ such that $P(Q^* \mid O, \lambda) = \max_Q P(Q \mid O, \lambda)$
3. Learning: Given $X = \{O^k\}_k$, find $\lambda^*$ such that $P(X \mid \lambda^*) = \max_\lambda P(X \mid \lambda)$
4. Statistical inference: Given $X = \{O^k\}_k$ and observation distributions $P(X \mid \theta_\lambda)$ for different $\lambda$'s, estimate the $\theta$ parameters.
(Rabiner, 1989)
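The slide states the problems without algorithms. For Problem 1, the standard approach described by Rabiner (1989) is the forward recursion; the sketch below is one possible implementation, checked against a brute-force sum over all state paths. The two-state model values and the function names are assumptions for illustration.

```python
import itertools
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(O | lambda) via the forward recursion; obs holds 0-based symbol indices."""
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] b_j(O_t)
    return alpha.sum()

def brute_force_likelihood(pi, A, B, obs):
    """Same quantity, obtained by summing the joint probability over every state path."""
    total = 0.0
    for path in itertools.product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

# Illustrative (assumed) two-state, three-symbol model.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2]

print(forward_likelihood(pi, A, B, obs))      # approximately 0.03628
print(brute_force_likelihood(pi, A, B, obs))  # same value, exponential cost
```

The recursion costs on the order of N²T operations, whereas the brute-force sum visits all N^T paths. This unscaled version will underflow on long sequences; rescaling `alpha` at each step or working in log space is the usual remedy.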
Example: Balls and Urns (HMM): Learning I
Three urns, each full of balls of different colors. S1: state 1, S2: state 2, S3: state 3; start at urn 1.
Observation probabilities (rows: urn 1, urn 2, urn 3; columns: red, green, blue) and state transition probabilities (rows and columns: urn 1, urn 2, urn 3):
$$B = \begin{bmatrix} 0.5 & 0.2 & 0.3 \\ 0.2 & 0.3 & 0.5 \\ 0.2 & 0.5 & 0.3 \end{bmatrix} \qquad A = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$$
With $S_0$ = urn 1, subsequent states $S_1, S_2, S_3$, and observed ball colors $b_0, b_1, b_2, b_3$, the joint probability of the observations and the states is
$$P(O, S \mid A, B) = B[S_0, b_0] \prod_{t=1}^{3} a_{S_{t-1}, S_t}\, B[S_t, b_t]$$
[The slide's numerical evaluation of this product is not recoverable from the extracted text.]
Baum-Welch EM for Hidden Markov Models
We use the notation $q_t$ for the probability of the result at time t; $a_{i[t-1],i[t]}$ for the probability of going from the observed state at time t-1 to the observed state at time t; $n_i$ for the observed number of results i; and $n_{s,t}$ for the number of transitions from s to t. The log-likelihood is
$$\log L = \sum_t \log(q_t) + \sum_t \log a_{i[t-1],i[t]} = \sum_i n_i \log(q_i) + \sum_{s,t} n_{s,t} \log a_{s,t}$$
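Baum-Welch EM for HMMs
The constraints are that:
$$\sum_i q_i = 1; \qquad \sum_t a_{s,t} = 1$$
So, differentiating under constraints (with Lagrange multipliers $\mu$ and $\mu_s$) we get:
$$0 = \frac{n_i}{q_i} - \mu; \qquad 0 = \frac{n_{s,t}}{a_{s,t}} - \mu_s$$
$$\hat{q}_i = \frac{n_i}{n}; \qquad \hat{a}_{s,t} = \frac{n_{s,t}}{n_s}$$
where $n = \sum_i n_i$ and $n_s = \sum_t n_{s,t}$.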
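The resulting estimates are relative frequencies: the share of each result for q̂, and row-normalized transition counts for â. A minimal sketch of that counting step, assuming the sequence is coded as 0-based integers and that n_s is taken as the number of transitions leaving s; the function name and the toy sequence are illustrative.

```python
import numpy as np

def count_estimates(seq, n_symbols):
    """Relative-frequency estimates q_hat[i] = n_i / n and a_hat[s, t] = n_st / n_s."""
    seq = np.asarray(seq)
    n_i = np.bincount(seq, minlength=n_symbols)          # occurrences of each result
    q_hat = n_i / n_i.sum()
    n_st = np.zeros((n_symbols, n_symbols))
    for s, t in zip(seq[:-1], seq[1:]):
        n_st[s, t] += 1                                   # count transitions s -> t
    row_totals = n_st.sum(axis=1, keepdims=True)          # n_s: transitions leaving s
    # Rows with no outgoing transitions are left as zeros rather than dividing by zero.
    a_hat = np.divide(n_st, row_totals, out=np.zeros_like(n_st), where=row_totals > 0)
    return q_hat, a_hat

q_hat, a_hat = count_estimates([0, 0, 1, 2, 2, 2, 1, 0], n_symbols=3)
print(q_hat)  # [0.375 0.25  0.375]
print(a_hat)
```

When the states are hidden, these counts are not available directly; the EM iteration replaces them with their expected values under the current parameters.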
Observed colored balls in the HMM model
EM results
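We have:
$$\hat{b} = (0.2500,\ 0.4300,\ 0.3200)$$
$$\hat{B} = \begin{bmatrix} 0.4545 & 0.1818 & 0.3636 \\ 0.1875 & 0.6250 & 0.1875 \\ 0.1475 & 0.0328 & 0.8197 \end{bmatrix}; \qquad B = \begin{bmatrix} .4 & .3 & .3 \\ .2 & .6 & .2 \\ .1 & .1 & .8 \end{bmatrix}$$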
More General Elements of an HMM
N: Number of states
M: Number of observation symbols
A = [aij]: N by N state transition probability matrix
B = bj(m): N by M observation probability matrix
Π = [πi]: N by 1 initial state probability vector
λ = (A, B, Π), parameter set of HMM
Particle Evaluation
At stage t, simulate the new state $s_t^{(j)}$ from the former state $s_{t-1}^{(j)}$ using the distribution $s_t^{(j)} \sim B(s_{t-1}^{(j)}, :)$, and
weight the result by $w_t^{(j)} = A(s_t^{(j)}, :)$. The resulting weight for the j'th particle is:
$$W_t^{(j)} = w_t^{(j)}\, W_{t-1}^{(j)}$$
We should use standard residual resampling. The result gets 50 percent accuracy [Note: I haven’t perfected good residual sampling].
Particle Results: based on 50 observations
Viterbi’s Algorithm
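$$\delta_t(i) \equiv \max_{q_1 q_2 \cdots q_{t-1}} p(q_1 q_2 \cdots q_{t-1},\ q_t = S_i,\ O_1 \cdots O_t \mid \lambda)$$
Initialization: $\delta_1(i) = \pi_i\, b_i(O_1)$, $\psi_1(i) = 0$
Recursion: $\delta_t(j) = \max_i \delta_{t-1}(i)\, a_{ij}\, b_j(O_t)$, $\psi_t(j) = \arg\max_i \delta_{t-1}(i)\, a_{ij}$
Termination: $p^* = \max_i \delta_T(i)$, $q_T^* = \arg\max_i \delta_T(i)$
Path backtracking: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t = T-1, T-2, \ldots, 1$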
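The four steps (initialization, recursion, termination, backtracking) translate almost line for line into code. The sketch below is one possible implementation with 0-based indices; the two-state model and observation sequence are assumed values used only to exercise it.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable state path and its probability p* (0-based indices)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A             # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)                 # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]   # recursion
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                      # termination
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]              # backtracking
    return path, delta[-1].max()

# Illustrative (assumed) two-state, three-symbol model.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

path, p_star = viterbi(pi, A, B, [0, 1, 2])
print(path)    # [0 0 1]
print(p_star)  # approximately 0.01512
```

As with the forward recursion, products of many probabilities underflow on long sequences, so practical implementations usually maximize sums of log probabilities instead.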
Viterbi learning versus the actual state (estimate =3; 62% accuracy)
General EM
At each step assume k states:
$$p(x_t \mid \theta_1), \ldots, p(x_t \mid \theta_k)$$
with p known and the $\theta$'s unknown. We use the terminology $Z_1, \ldots, Z_t$ for the (unobserved) states.
Then the EM equation (with the $\pi$'s the stationary probabilities of the states):
$$Q(\theta, \theta') = \sum_t \sum_{s=1}^{k} \log p(X_t \mid \theta_s)\, P(Z_t = s \mid X_t, \theta');$$
$$P(Z_t = s \mid X_t, \theta) = \frac{\pi_s\, P(X_t \mid \theta_s)}{\sum_{s'=1}^{k} \pi_{s'}\, P(X_t \mid \theta_{s'})}$$
EM Equations
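We have,
$$\sum_t P(Z_t = s \mid X_t, \theta)\, \frac{\partial \log p(X_t \mid \theta_s)}{\partial \theta_s} = 0 \quad (s = 1, \ldots, k)$$
So, in the Poisson hidden case (with observed counts $N_t$) we have:
$$\hat{\theta}_s = \frac{\sum_t N_t\, P(Z_t = s \mid X_t, \theta)}{\sum_t P(Z_t = s \mid X_t, \theta)} \quad (s = 1, \ldots, k)$$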
Binomial hidden model
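We have (with $N_t$ the observed count out of $n$ trials at time t):
$$\hat{p}_s = \frac{\sum_t N_t\, P(Z_t = s \mid N_t, p)}{n \sum_t P(Z_t = s \mid N_t, p)} \quad (s = 1, \ldots, k)$$
$$P(Z_t = s \mid N_t, p) = \frac{\pi_s\, p_s^{N_t} (1 - p_s)^{n - N_t}}{\sum_{s'} \pi_{s'}\, p_{s'}^{N_t} (1 - p_{s'})^{n - N_t}}$$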
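One EM update for a binomial hidden model can be sketched as follows. This is a hedged illustration under stated assumptions: each observation is a count out of a fixed number of trials `n`, the state weights `pi` are held fixed (for example at the stationary probabilities), and the function and variable names are hypothetical.

```python
import numpy as np

def binomial_em_step(counts, n, p, pi):
    """One EM update of the success probabilities p[s] with fixed state weights pi[s]."""
    counts = np.asarray(counts, dtype=float)
    # E-step: resp[t, s] = P(Z_t = s | count_t, p). The binomial coefficient is the
    # same for every state, so it cancels in the ratio and is omitted.
    lik = p[None, :] ** counts[:, None] * (1 - p[None, :]) ** (n - counts[:, None])
    resp = pi[None, :] * lik
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted success fraction for each state.
    p_new = (resp * counts[:, None]).sum(axis=0) / (n * resp.sum(axis=0))
    return p_new, resp

# Illustrative (assumed) data: counts out of n = 10 trials, two hidden states.
counts = [2, 3, 8, 7, 1, 9]
p = np.array([0.3, 0.6])       # starting guesses, strictly between 0 and 1
pi = np.array([0.5, 0.5])
for _ in range(20):
    p, resp = binomial_em_step(counts, 10, p, pi)
print(p)
```

Repeating the step moves the two success probabilities toward the low-count and high-count groups; the result depends on the starting guesses, as is usual for EM.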
Coin-Tossing Model
Coin 1: 0.2000 0.8000
Coin 2: 0.7000 0.3000
Coin 3: 0.5000 0.5000
State Matrix:
        C1      C2      C3
Coin 1  0.4000  0.3000  0.3000
Coin 2  0.2000  0.6000  0.2000
Coin 3  0.1000  0.1000  0.8000
Coin tossing model: results
Maximum Likelihood Model
Stationary distribution for states is: 0.1818 0.2727 0.5455
Therefore using a binomial hidden HMM we get:
MCMC approach
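Update the posterior distributions for the parameters and the (unobserved) state variables, alternating between:
the parameters $\theta_i$, drawn from their posterior given the observations $X_1, \ldots, X_T$ and the current states $Z$;
each state $Z_s$, drawn from its conditional distribution given the other states and the parameters $\theta$.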
Continuous Observations
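Discrete:
$$P(O_t \mid q_t = S_j, \lambda) = \prod_{m=1}^{M} b_j(m)^{r_m^t}, \qquad r_m^t = \begin{cases} 1 & \text{if } O_t = v_m \\ 0 & \text{otherwise} \end{cases}$$
Gaussian mixture (discretize using k-means):
$$P(O_t \mid q_t = S_j, \lambda) = \sum_{l=1}^{L} P(\mathcal{G}_{jl})\, p(O_t \mid q_t = S_j, \mathcal{G}_l, \lambda), \qquad p(O_t \mid q_t = S_j, \mathcal{G}_l, \lambda) \sim \mathcal{N}(\mu_l, \Sigma_l)$$
Continuous:
$$P(O_t \mid q_t = S_j, \lambda) \sim \mathcal{N}(\mu_j, \sigma_j^2)$$
Use EM to learn parameters, e.g.,
$$\hat{\mu}_j = \frac{\sum_t \gamma_t(j)\, O_t}{\sum_t \gamma_t(j)}$$
where $\gamma_t(j) = P(q_t = S_j \mid O, \lambda)$.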
HMM with Input
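Input-dependent observations:
$$P(O_t \mid q_t = S_j, x^t, \lambda) \sim \mathcal{N}\big(g_j(x^t \mid \theta_j),\ \sigma_j^2\big)$$
Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996):
$$P(q_{t+1} = S_j \mid q_t = S_i, x^t)$$
Time-delay input:
$$x^t = f(O_{t-\tau}, \ldots, O_{t-1})$$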
Model Selection in HMM
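Left-to-right HMMs:
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & 0 \\ 0 & a_{22} & a_{23} & a_{24} \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{bmatrix}$$
In classification, for each $C_i$, estimate $P(O \mid \lambda_i)$ by a separate HMM and use Bayes’ rule:
$$P(\lambda_i \mid O) = \frac{P(O \mid \lambda_i)\, P(\lambda_i)}{\sum_j P(O \mid \lambda_j)\, P(\lambda_j)}$$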
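The classification rule can be sketched in a few lines once each class HMM has produced a log-likelihood for the sequence. The function name, the example log-likelihood values, and the priors below are hypothetical; the subtraction of the maximum is a standard numerical safeguard and not part of the slide.

```python
import numpy as np

def class_posteriors(log_likelihoods, priors):
    """P(lambda_i | O) from per-class log P(O | lambda_i) and priors P(lambda_i)."""
    log_joint = np.asarray(log_likelihoods, dtype=float) + np.log(priors)
    log_joint -= log_joint.max()        # shift so exponentiation cannot underflow to all zeros
    joint = np.exp(log_joint)
    return joint / joint.sum()          # normalize over classes (the denominator of Bayes' rule)

# Hypothetical log-likelihoods of one sequence under three class HMMs.
log_liks = [-120.3, -118.9, -125.0]
priors = [0.5, 0.3, 0.2]
post = class_posteriors(log_liks, priors)
print(post)
print(int(post.argmax()))  # index of the most probable class
```

Working with log-likelihoods matters here because P(O | λ) for a long sequence is typically far too small to represent directly as a floating-point number.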