Hidden Markov Models
and
Sequential Data
Sequential Data
• Often arise through measurement of time series
  • Snowfall measurements on successive days in Buffalo
  • Rainfall measurements in Cherrapunji
  • Daily values of a currency exchange rate
  • Acoustic features at successive time frames in speech recognition
• Non-time series
  • Nucleotide base pairs in a strand of DNA
  • Sequence of characters in an English sentence
  • Parts of speech of successive words
Sound Spectrogram of Spoken Words: "Bayes Theorem"
• Plot of the intensity of the spectral coefficients versus time index
• Successive observations of the speech spectrum are highly correlated
Task of Making a Sequence of Decisions
• Processes in time: states at time t are influenced by the state at time t-1
• In many time-series applications, e.g., financial forecasting, we wish to predict the next value from previous values
• Impractical to consider general dependence of future observations on all previous observations
  • Complexity would grow without limit as the number of observations increases
• Markov models assume dependence on only the most recent observations
Model Assuming Independence
• Simplest model: treat the observations as independent
• Corresponds to a graph without links
Markov Model
• Most general Markov model for observations {xn}
• Product rule expresses the joint distribution of a sequence of observations:

  p(x1, ..., xN) = ∏_{n=1}^{N} p(xn | x1, ..., xn-1)
First Order Markov Model
• Chain of observations {xn}
• Distribution p(xn | xn-1) is conditioned on the previous observation
• Joint distribution for a sequence of N variables:

  p(x1, ..., xN) = p(x1) ∏_{n=2}^{N} p(xn | xn-1)

• It can be verified (using the product rule from above) that

  p(xn | x1, ..., xn-1) = p(xn | xn-1)

• If the model is used to predict the next observation, the distribution of the prediction depends only on the preceding observation and is independent of earlier observations
Second Order Markov Model
• Conditional distribution of observation xn depends on the values of the two previous observations xn-1 and xn-2
• Joint distribution:

  p(x1, ..., xN) = p(x1) p(x2 | x1) ∏_{n=3}^{N} p(xn | xn-1, xn-2)

• Each observation is influenced by the previous two observations
Introducing Latent Variables
• For each observation xn, introduce a latent variable zn
• zn may be of a different type or dimensionality than the observed variable
• The latent variables form the Markov chain
• This gives the "state-space model"
• Joint distribution for this model:

  p(x1, ..., xN, z1, ..., zN) = p(z1) [ ∏_{n=2}^{N} p(zn | zn-1) ] ∏_{n=1}^{N} p(xn | zn)

[Figure: chain of latent variables, each connected to its corresponding observation]
Two Models Described by this Graph
1. If the latent variables are discrete: Hidden Markov Model
   (the observed variables in an HMM may be discrete or continuous)
2. If both latent and observed variables are Gaussian, we obtain linear dynamical systems
Latent Variable with Three Discrete States
• Transition probabilities aij are represented by a matrix
• Note: this is not a graphical model, since the nodes are not separate variables but states of a single variable
• It can be unfolded over time to get a trellis diagram
Markov Model for the Production of Spoken Words
• States represent phonemes
• Production of the word "cat" is represented by the states /k/, /a/, /t/
• Transitions: /k/ to /a/, /a/ to /t/, /t/ to a silent state
• Although only the correct "cat" sound is represented by the model, other transitions could be introduced, e.g., /k/ followed by /t/

[Figure: Markov model for the word "cat" with states /k/, /a/, /t/]
First-Order Markov Models (MMs)
• State at time t: ω(t)
• A particular sequence of length T: ω^T = {ω(1), ω(2), ω(3), ..., ω(T)}
  e.g., ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}
• Note: the system can revisit a state at different steps, and not every state needs to be visited
• The model for the production of any sequence is described by the transition probabilities
  P(ωj(t+1) | ωi(t)) = aij
  which is a time-independent probability of being in state ωj at step t+1 given that the state at step t was ωi
• There is no requirement that the transition probabilities be symmetric
Particular Model
• θ = {aij}
• Given model θ, the probability that the model generated the sequence ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4} is
  P(ω^6 | θ) = a14 · a42 · a22 · a21 · a14
• Can also include the a priori probability of the first state, P(ω(1) = ωi)
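As a concrete illustration of this product of transition probabilities, here is a minimal Python sketch (mine, not from the slides); the function name and the placeholder transition matrix are assumptions for illustration only.

    def sequence_probability(states, a, prior=None):
        """P(omega(1), ..., omega(T) | theta) for a first-order Markov model.

        states : list of state indices, e.g. [1, 4, 2, 2, 1, 4]
        a      : transition probabilities, a[i][j] = P(omega_j at t+1 | omega_i at t)
        prior  : optional P(omega(1) = omega_i); if None the first state is taken
                 as given, as in the slide's example
        """
        p = prior[states[0]] if prior is not None else 1.0
        for i, j in zip(states, states[1:]):    # successive pairs (omega(t), omega(t+1))
            p *= a[i][j]
        return p

    # omega^6 = {omega_1, omega_4, omega_2, omega_2, omega_1, omega_4}
    # gives a14 * a42 * a22 * a21 * a14, matching the slide
    a = {1: {1: 0.2, 2: 0.3, 3: 0.1, 4: 0.4},   # placeholder values, illustration only
         2: {1: 0.3, 2: 0.2, 3: 0.4, 4: 0.1},
         3: {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25},
         4: {1: 0.1, 2: 0.5, 3: 0.2, 4: 0.2}}
    print(sequence_probability([1, 4, 2, 2, 1, 4], a))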
• Discrete states = nodes, transition probabilities = links
• In a first-order discrete-time HMM, at step t the system is in state ω(t)
• The state at step t+1 is a random function that depends on the state at step t and on the transition probabilities
First Order Hidden Markov Models
• Perceiver does not have access to the states ω(t)
• Instead we measure properties of the emitted sound
• Need to augment Markov model to allow for visible states (symbols)
First Order Hidden Markov Models
• Visible states (symbols): V^T = {v(1), v(2), v(3), ..., v(T)}
  For instance, V^6 = {v5, v1, v1, v5, v2, v3}
• In any state ωj(t), the probability of emitting symbol vk(t) is bjk

[Figure: three hidden units in the HMM; visible states and their emission probabilities shown in red]
Hidden Markov Model Computation
• Finite state machines with transition probabilities are called Markov networks
• Strictly causal: probabilities depend only on previous states
• A Markov model is ergodic if every state has a non-zero probability of occurring given some starting state
• A final or absorbing state is one which, if entered, is never left
Three Basic Problems for HMMs
Given an HMM with transition and symbol probabilities:
• Problem 1: The evaluation problem
  • Determine the probability that a particular sequence of symbols V^T was generated by that model
• Problem 2: The decoding problem
  • Given a sequence of symbols V^T, determine the most likely sequence of hidden states ω^T that led to the observations
• Problem 3: The learning problem
  • Given a coarse structure of the model (number of states and number of symbols) but not the probabilities aij and bjk, determine these parameters
Problem 1: Evaluation Problem
• Probability that the model produces a sequence V^T of visible states:

  P(V^T) = Σ_{r=1}^{r_max} P(V^T | ω_r^T) P(ω_r^T)

  where each r indexes a particular sequence of T hidden states
  ω_r^T = {ω(1), ω(2), ..., ω(T)}
  (V^T is the visible sequence; ω_r^T the hidden states)
• In the general case of c hidden states there will be r_max = c^T possible terms
Evaluation Problem Formula
• Probability that the model produces a sequence V^T of visible states:

  P(V^T) = Σ_{r=1}^{r_max} P(V^T | ω_r^T) P(ω_r^T)

• The two factors are

  P(V^T | ω_r^T) = ∏_{t=1}^{T} P(v(t) | ω(t))          (1)

  P(ω_r^T) = ∏_{t=1}^{T} P(ω(t) | ω(t-1))              (2)

  because (1) the output probabilities depend only on the hidden states and (2) this is a first-order hidden Markov process. Substituting,

  P(V^T) = Σ_{r=1}^{r_max} ∏_{t=1}^{T} P(v(t) | ω(t)) P(ω(t) | ω(t-1))

• Interpretation: the probability of the sequence V^T is the sum, over all r_max possible sequences of hidden states, of the conditional probability that the system has made a particular transition multiplied by the probability that it then emitted the visible symbol in the target sequence.
Computationally Simpler Evaluation Algorithm
• Calculate

  P(V^T) = Σ_{r=1}^{r_max} ∏_{t=1}^{T} P(v(t) | ω(t)) P(ω(t) | ω(t-1))

  recursively, because each term P(v(t) | ω(t)) P(ω(t) | ω(t-1)) involves only v(t), ω(t) and ω(t-1)
• Define:
  αj(t): the probability that the model is in state ωj(t) and has generated the target sequence up to step t
  (where bjk(v(t)) means the symbol probability bjk corresponding to v(t))
• Therefore αj(t) can be computed recursively from the αi(t-1)
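The recursion itself appears in the slides as a figure; a standard reconstruction (not copied from the slide), assuming a known start state as in the examples below, is

  αj(0) = 1 if ωj is the start state, and 0 otherwise
  αj(t) = [ Σ_i αi(t-1) aij ] bjk(v(t))    for t = 1, ..., T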
HMM Forward
[Figure: pseudocode of the forward algorithm, where bjk(v(t)) means the symbol probability bjk corresponding to v(t)]
• Computational complexity: O(c²T)
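A minimal Python sketch of this forward evaluation (my own illustration, not the slide's pseudocode), assuming states indexed 0..c-1, a transition matrix a[i][j], an emission matrix b[j][k], and a known start state:

    def forward(visible, a, b, start):
        """Return P(V^T): the probability that the model emits the symbol sequence `visible`."""
        c = len(a)                              # number of hidden states
        alpha = [0.0] * c
        alpha[start] = 1.0                      # alpha_j(0): 1 for the start state, 0 otherwise
        for symbol in visible:                  # steps t = 1 .. T
            alpha = [b[j][symbol] * sum(alpha[i] * a[i][j] for i in range(c))
                     for j in range(c)]         # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_j,v(t)
        return sum(alpha)                       # overall cost is O(c^2 T)

With an explicit absorber state and a final null symbol, as in the example that follows, the final sum collapses to the α value of the absorber state.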
Time-Reversed Version of the Forward Algorithm
Trellis: Unfolding of the HMM through time
Computation of Probabilities by the Forward Algorithm
• In the evaluation trellis we accumulate values
• In the decoding trellis we keep only maximum values
Example of a Hidden Markov Model
• Four states with an explicit absorber state
• Five symbols with a unique null symbol

Transition probabilities aij:
        ω0    ω1    ω2    ω3
  ω0    1     0     0     0
  ω1    0.2   0.3   0.1   0.4
  ω2    0.2   0.5   0.2   0.1
  ω3    0.8   0.1   0     0.1

Symbol probabilities bjk:
        v0    v1    v2    v3    v4
  ω0    1     0     0     0     0
  ω1    0     0.3   0.4   0.1   0.2
  ω2    0     0.1   0.1   0.7   0.1
  ω3    0     0.5   0.2   0.1   0.2

Example problem: compute the probability of generating the sequence V^4 = {v1, v3, v2, v0}. Assume ω1 is the start state.
• At t = 0 the state is ω1, thus α1(0) = 1 and αj(0) = 0 for j ≠ 1
• Arrows show the calculation of αj(1)
• Top arrow: α0(1) = α1(0) a10 b01 = 1 × [0.2 × 0] = 0
aij:
        ω0    ω1    ω2    ω3
  ω0    1     0     0     0
  ω1    0.2   0.3   0.1   0.4
  ω2    0.2   0.5   0.2   0.1
  ω3    0.8   0.1   0     0.1

bjk:
        v0    v1    v2    v3    v4
  ω0    1     0     0     0     0
  ω1    0     0.3   0.4   0.1   0.2
  ω2    0     0.1   0.1   0.7   0.1
  ω3    0     0.5   0.2   0.1   0.2
[Trellis figure: the visible symbol at each step is shown above the trellis; each arrow from ωi to ωj contributes a factor αi(t) aij bjk]
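As a check (mine, not part of the slides), the same forward computation can be run on this example; it should give a probability of roughly 0.0011:

    # Forward evaluation of the example above (an illustrative check, not from the slides)
    a = [[1.0, 0.0, 0.0, 0.0],          # a[i][j], rows/columns ordered omega_0 .. omega_3
         [0.2, 0.3, 0.1, 0.4],
         [0.2, 0.5, 0.2, 0.1],
         [0.8, 0.1, 0.0, 0.1]]
    b = [[1.0, 0.0, 0.0, 0.0, 0.0],     # b[j][k], symbols v0 .. v4
         [0.0, 0.3, 0.4, 0.1, 0.2],
         [0.0, 0.1, 0.1, 0.7, 0.1],
         [0.0, 0.5, 0.2, 0.1, 0.2]]
    visible = [1, 3, 2, 0]              # V^4 = {v1, v3, v2, v0}

    alpha = [0.0, 1.0, 0.0, 0.0]        # start in omega_1
    for symbol in visible:
        alpha = [b[j][symbol] * sum(alpha[i] * a[i][j] for i in range(4)) for j in range(4)]
    print(sum(alpha))                   # roughly 0.0011; all remaining mass is in the absorber omega_0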
Problem 2: Decoding Problem
• Given a sequence of visible states V^T, the decoding problem is to find the most probable sequence of hidden states
• Expressed mathematically: find the single "best" sequence of hidden states ω̂(1), ω̂(2), ..., ω̂(T) such that

  ω̂(1), ω̂(2), ..., ω̂(T) = argmax_{ω(1),...,ω(T)} P[ ω(1), ω(2), ..., ω(T), v(1), v(2), ..., v(T) | θ ]

• Note that the summation of the evaluation problem is changed to argmax, since we want to find the best case
Viterbi Algorithm
Notes:
1. If aij and bjk are replaced by log probabilities, we add terms rather than multiply them
2. The best path is maintained for each node
3. Computational complexity is O(c²T)
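A compact Python sketch of Viterbi decoding (my own illustration, using the same indexing assumptions as the forward sketch above; not the slides' pseudocode):

    def viterbi(visible, a, b, start):
        """Most probable hidden-state path for the symbol sequence `visible`."""
        c = len(a)
        delta = [0.0] * c
        delta[start] = 1.0                      # best-path probability ending in each state
        backpointers = []                       # best predecessor of each state at each step
        for symbol in visible:
            back = [max(range(c), key=lambda i: delta[i] * a[i][j]) for j in range(c)]
            delta = [delta[back[j]] * a[back[j]][j] * b[j][symbol] for j in range(c)]
            backpointers.append(back)
        # Backtrack from the most probable final state
        state = max(range(c), key=lambda j: delta[j])
        path = [state]
        for back in reversed(backpointers):
            state = back[state]
            path.append(state)
        path.reverse()                          # path[0] is the start state; path[1:] aligns with `visible`
        return path, max(delta)

Using log probabilities, as in note 1, turns the products into sums and avoids numerical underflow for long sequences.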
Viterbi Decoding Trellis
Example of Decoding
• Four states with an explicit absorber state; five symbols with a unique null symbol
• What is the most likely state sequence that generated the symbol sequence V^4 = {v1, v3, v2, v0}? Assume ω1 is the start state.

aij:
        ω0    ω1    ω2    ω3
  ω0    1     0     0     0
  ω1    0.2   0.3   0.1   0.4
  ω2    0.2   0.5   0.2   0.1
  ω3    0.8   0.1   0     0.1

bjk:
        v0    v1    v2    v3    v4
  ω0    1     0     0     0     0
  ω1    0     0.3   0.4   0.1   0.2
  ω2    0     0.1   0.1   0.7   0.1
  ω3    0     0.5   0.2   0.1   0.2
Example 4. HMM Decoding
• Note: the transition between ω3 and ω2 is forbidden by the model, yet the decoding algorithm gives it a non-zero probability
Problem 3: Learning Problem
• Goal: determine the model parameters aij and bjk from an ensemble of training samples
• There is no known method for obtaining an optimal or most likely set of parameters
• A good solution is straightforward: the Forward-Backward algorithm
Forward-Backward Algorithm
• An instance of the generalized expectation-maximization algorithm
• We do not know the states that held when the symbols were generated
• The approach is to iteratively update the weights in order to better explain the observed training sequences
Backward Probabilities
• αi(t) is the probability that the model is in state ωi(t) and has generated the target sequence up to step t
• Analogously, βi(t) is the probability that the model is in state ωi(t) and will generate the remainder of the given target sequence, from t+1 to T
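The backward recursion (standard form, reconstructed rather than copied from the slide's figure) mirrors the forward one:

  βi(T) = 1 if ωi is the known final (absorber) state, and 0 otherwise
  βi(t) = Σ_j aij bjk(v(t+1)) βj(t+1)    for t = T-1, ..., 0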
• Computation proceeds backward through the trellis
Backward Evaluation Algorithm
• This is used in learning: parameter estimation
Estimating the aij and bjk
• αi(t) and βi(t) are merely estimates of their true values, since we don't know the actual values of aij and bjk
• The probability of a transition between ωi(t-1) and ωj(t), given that the model generated the entire training sequence V^T by any path, is:
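The slide gives this quantity as a figure; its standard form (reconstructed, not copied from the slide) is

  γij(t) = αi(t-1) aij bjk(v(t)) βj(t) / P(V^T | θ)

where the denominator is the probability that the model generated the entire sequence V^T by any path.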
Calculating Improved Estimate of aij
• Numerator is the expected number of transitions between state ωi(t-1) and ωj(t)
• Denominator is the total expected number of transitions from ωi
Calculating Improved Estimate of bjk
• Ratio between the expected frequency with which the particular symbol vk is emitted and the expected frequency of emitting any symbol
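Using γij(t), the standard re-estimation formulas (reconstructed here; the slides show them as figures) are

  âij = [ Σ_{t=1}^{T} γij(t) ] / [ Σ_{t=1}^{T} Σ_k γik(t) ]

  b̂jk = [ Σ_{t : v(t) = vk} Σ_i γij(t) ] / [ Σ_{t=1}^{T} Σ_i γij(t) ]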
Learning Algorithm
• Start with rough estimates of aij and bjk
• Calculate improved estimates using the formulas above
• Repeat until the change in the estimated parameter values is sufficiently small
This is the Baum-Welch or Forward-Backward algorithm.
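For concreteness, here is a compact single-sequence Baum-Welch sketch in Python/NumPy (my own illustration under simplifying assumptions: a known start state, no absorber-state handling, random values standing in for the "rough estimates", and a fixed number of iterations rather than a convergence test):

    import numpy as np

    def baum_welch(visible, c, m, start, iterations=20, seed=0):
        """Re-estimate a (c x c) and b (c x m) from one symbol sequence `visible`."""
        rng = np.random.default_rng(seed)
        T = len(visible)
        v = np.array(visible)
        a = rng.random((c, c)); a /= a.sum(axis=1, keepdims=True)   # rough initial estimates
        b = rng.random((c, m)); b /= b.sum(axis=1, keepdims=True)
        for _ in range(iterations):
            # Forward pass: alpha[t, j] is proportional to P(v(1..t+1), state j)
            alpha = np.zeros((T, c))
            alpha[0] = a[start] * b[:, v[0]]
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ a) * b[:, v[t]]
            # Backward pass: beta[t, i] = P(remaining symbols | state i)
            beta = np.ones((T, c))
            for t in range(T - 2, -1, -1):
                beta[t] = a @ (b[:, v[t + 1]] * beta[t + 1])
            p_v = alpha[-1].sum()                      # P(V^T | current model)
            occupancy = alpha * beta / p_v             # P(state j when v(t) is emitted | V^T)
            # Expected transitions i -> j while emitting v(t+1)
            xi = np.zeros((T - 1, c, c))
            for t in range(T - 1):
                xi[t] = np.outer(alpha[t], b[:, v[t + 1]] * beta[t + 1]) * a / p_v
            # Re-estimation (the formulas above)
            a = xi.sum(axis=0) / occupancy[:-1].sum(axis=0)[:, None]
            for k in range(m):
                b[:, k] = occupancy[v == k].sum(axis=0)
            b /= occupancy.sum(axis=0)[:, None]
        return a, b

In practice the expected counts are accumulated over the whole ensemble of training sequences before re-estimating, and iteration stops when the parameter changes (or the overall probability of the training data, as noted below) stabilize.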
Convergence
• Requires several presentations of each training sequence (fewer than five is common in speech)
• Another stopping criterion: the overall probability that the learned model could have generated the training data
HMM Word Recognition
Two approaches:
1. A single HMM models all possible words
   • Each state corresponds to a letter of the alphabet (letters a-z)
   • Letter transition probabilities are calculated for each pair of letters
   • Letter confusion probabilities are the symbol probabilities
   • The decoding problem gives the most likely word
2. Separate HMMs are used to model each word
   • The evaluation problem gives the probability of the observation, which is used as a class-conditional probability
HMM Spoken Word Recognition
• Each word, e.g., "cat", "dog", etc., has an associated HMM
• For a test utterance, determine which model has the highest probability
• HMMs for speech are left-to-right models
• The HMM produces a class-conditional probability, so it is useful to compute the probability of the model given the sequence using Bayes rule: the likelihood is computed by the forward algorithm and combined with the prior probability of the word
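The Bayes-rule combination annotated in the slide's figure can be written (reconstructed, not copied from the slide) as

  P(model | V^T) = P(V^T | model) P(model) / P(V^T)

where the likelihood P(V^T | model) is computed by the forward algorithm and P(model) is the prior probability of the word.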
Cursive Word Recognition (not HMM)
[Figure: graph over pre-segmentation points 1-8; edges carry letter hypotheses with confidences such as i[.8], l[.8], u[.5], v[.2], w[.7], w[.6], m[.3], m[.2], m[.1], i[.7], u[.3], r[.4], o[.5], d[.8]]
• Image segment from 1 to 3 is "u" with 0.5 confidence
• Image segment 1 to 4 is "w" with 0.7 confidence
• Image segment 1 to 5 is "w" with 0.6 confidence and "m" with 0.3 confidence
• The unit is a pre-segment (cusp at bottom)
Best path in graph from segment 1 to 8: w o r d
Summary: Key Algorithms for HMMs
• Problem 1: HMM Forward algorithm
  • Computes the probability that the model generated an observed sequence of symbols
• Problem 2: Viterbi algorithm
  • Computes the optimal (most likely) state sequence in an HMM given a sequence of observed outputs
• Problem 3: Baum-Welch algorithm
  • Finds the HMM parameters A, B, and Π with the maximum likelihood of generating the given symbol sequence in the observation vector