Hidden Markov Models

Reference: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, L. R. Rabiner, Proceedings of the IEEE, Vol. 77, No. 2, 1989
Outline
• Introduction
• Markov Models
• Hidden Markov Models
• Forward/Backward Algorithms
• Viterbi Algorithm
• Baum-Welch Estimation Algorithm
Introduction
• Input consists of a sequence of signals
• Types of signal models
  • Deterministic models: sine wave, sum of exponentials
  • Statistical models: Gaussian process, Markov process, hidden Markov process
• Examples of applications
  • Speech recognition
  • Word-sense disambiguation
  • DNA sequence modeling
  • Text modeling and information extraction
Markov Models
• States are observable
Weather Model
• States: R (Rainy), C (Cloudy), S (Sunny)
• State transition probability matrix A = {aij}
• What is the probability of observing O = SSRRSCS given that today is S?
Weather Model
• Basic rule: P(A, B) = P(A|B) P(B)
• Markov chain rule: P(q1, q2, … , qT) = P(q1) P(q2|q1) P(q3|q2) … P(qT|qT-1)
Weather Model
• Observation sequence O:
  O = (S, S, S, R, R, S, C, S)
• By the chain rule:
  P(O | model) = P(S) P(S|S) P(S|S) P(R|S) P(R|R) P(S|R) P(C|S) P(S|C)
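For concreteness, a worked evaluation using the transition probabilities from Rabiner's weather example (the matrix values are not reproduced above, so these numbers are an assumed illustration): aSS = 0.8, aSR = 0.1, aSC = 0.1, aRR = 0.4, aRS = 0.3, aCS = 0.2. Then

  P(O | model) = 1 · aSS · aSS · aSR · aRR · aRS · aSC · aCS
               = 1 · 0.8 · 0.8 · 0.1 · 0.4 · 0.3 · 0.1 · 0.2 ≈ 1.536 × 10^-4

where the leading 1 is P(q1 = S), since we are told that today is sunny.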
• Initial probability: πi = P(q1 = i)
Duration
• What is the probability that the sequence remains in state i for exactly d time units?
  pi(d) = (aii)^(d-1) (1 - aii)
• The duration density is exponential
• Expected value of the duration d in state i:
  E[di] = Σd d·pi(d) = 1 / (1 - aii)
Hidden Markov Models
• States are not observable
• Observations are probabilistic functions of states
• State transitions are still probabilistic
Coin Toss Models
• Scenario: you are in a room with a barrier through which you cannot see what is happening. On the other side of the barrier, another person is performing a coin-tossing experiment (with one or more coins). He will only tell you the result of each coin flip.
• The problem: how do we build a model to explain the observed sequence of heads and tails?
Coin Toss Models
• Observation: a sequence of heads and tails
• Build an HMM to explain the observed sequence
  • What do the states correspond to?
  • How many states? (How many coins?)
  • What are the parameters?
One-Coin Model
• Each state corresponds to a side of the coin (the observation generator)
• This is an observable Markov model
• It corresponds to a 1-state HMM
Two-Coin Model
• Each state corresponds to a biased coin
• This is a hidden Markov model
• The transition matrix governs which coin is flipped next; transitions may themselves be decided by a separate set of independent coin tosses
Three-Coin Model
Model Selection
• Which model best matches the actual observations?
  • 1-coin model: 1 unknown parameter
  • 2-coin model: 4 unknown parameters
  • 3-coin model: 9 unknown parameters
• Larger HMMs will match the data better than smaller HMMs
• Practical considerations impose strong limitations on the size of the model
Urn and Ball Model
Headers of Scientific Papers
• Citation index
• Citation database
• Each state corresponds to one component of the paper header
• Application: information extraction
DNA Sequence Modeling
• Each state corresponds to one position
• Application: profile HMM
Elements of an HMM
• Q = {1, 2, … , N}: set of hidden states
• V = {1, 2, … , M}: set of observation symbols
• A: state transition probability matrix
  aij = P(qt+1 = j | qt = i)
• B: observation symbol probability distribution
  bj(k) = P(ot = k | qt = j)
• π: initial state distribution
  πi = P(q1 = i)
• λ: the entire model, λ = (A, B, π)
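A minimal sketch of how these elements might be held in code; the array layout and the toy numbers are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Toy HMM parameters lambda = (A, B, pi); the numbers are made-up placeholders.
N, M = 2, 3                      # number of hidden states and observation symbols
A  = np.array([[0.7, 0.3],       # A[i, j]  = a_ij   = P(q_{t+1} = j | q_t = i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],  # B[j, k]  = b_j(k) = P(o_t = k | q_t = j)
               [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # pi[i]    = P(q_1 = i)
```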
Sequence Generator
• To generate a sequence of T observations O = (o1, o2, … , oT) (a code sketch follows below):
  1. Choose an initial state q1 = Si according to the initial state distribution π, and set t = 1
  2. Choose ot = vk according to the symbol probability distribution in state Si, i.e. bi(k)
  3. Transit to a new state qt+1 = Sj according to the state transition probability distribution for state Si, i.e. aij
  4. Set t = t + 1; go to step 2 if t ≤ T; otherwise terminate the procedure
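A minimal sketch of this generator under the λ = (A, B, π) array layout above (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def generate_sequence(A, B, pi, T, rng=None):
    """Sample T observations from an HMM lambda = (A, B, pi).

    A[i, j] = P(q_{t+1}=j | q_t=i), B[j, k] = P(o_t=k | q_t=j), pi[i] = P(q_1=i).
    Returns (observations, states) as index arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, M = len(pi), B.shape[1]
    states, obs = [], []
    q = rng.choice(N, p=pi)                   # step 1: initial state from pi
    for _ in range(T):
        obs.append(rng.choice(M, p=B[q]))     # step 2: emit a symbol from b_q(.)
        states.append(q)
        q = rng.choice(N, p=A[q])             # step 3: transit according to a_q.
    return np.array(obs), np.array(states)
```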
Execution of an HMM
• A state sequence corresponds to a path through the grid (trellis) of states over time
Three Basic Problems
• Compute the probability that the model generates the observation sequence
• Find the optimal state sequence that generates the observation sequence
• Learn an HMM that best fits the observation sequences
Basic Problem 1
• Given observation O = (o1, o2, … , oT) and model λ = (A, B, π), efficiently compute P(O|λ)
  • P(O|λ) is the probability that O is produced by λ
  • Hidden states complicate the probability evaluation
• Given two models λ1 and λ2, the probability (score) can be used to choose the better one
  • λi models some protein family, O denotes a protein: find the most probable protein family for O
  • Speech recognition
  • Online handwritten character recognition
Basic Problem 2
• Given observation O = (o1, o2, … , oT) and model λ = (A, B, π), find the optimal state sequence q = (q1, q2, … , qT)
  • Uncovers the hidden part of the model
  • An optimality criterion has to be chosen (e.g. maximum likelihood)
  • Finds an “explanation” for the data
• O is the header of a scientific paper: find the title, author, publication date, … of the paper
  • A fundamental problem in citation index generation
• Word-sense disambiguation, gene finding
Basic Problem 3
• Given observation O = (o1, o2, … , oT), estimate model parameters λ = (A, B, π) that maximize P(O|λ)
  • Trains the model
  • Find the best topology
  • Find the best parameters
Word Speech Recognizer
• The speech signal of each word is represented as a time sequence of coded spectral vectors
• Build an HMM for each word; the training data consists of codebook index sequences from one or more talkers
• An unknown word is recognized by choosing the word whose model scores highest (i.e. gives the highest likelihood)
Solution to Problem 1
• Problem: compute P(o1, o2, … , oT | λ)
• Consider a state sequence q = (q1, q2, … , qT)
• Assume observations are independent given the states:
  P(O|q, λ) = Πt=1..T P(ot|qt, λ) = bq1(o1) bq2(o2) … bqT(oT)
  P(q|λ) = πq1 aq1q2 aq2q3 … aqT-1qT
  P(O|λ) = Σq P(O|q, λ) P(q|λ)
• There are N^T state sequences, each requiring O(T) work
  • Complexity: O(T·N^T); for N = 5, T = 100, T·N^T = 100 × 5^100 ≈ 10^72
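For contrast with the efficient algorithms that follow, a sketch of the brute-force evaluation that literally sums over all N^T state sequences (feasible only for tiny T; array conventions as in the λ = (A, B, π) sketch above):

```python
import itertools
import numpy as np

def brute_force_likelihood(A, B, pi, obs):
    """P(O|lambda) by enumerating every state sequence: O(T * N^T) work."""
    N, T = len(pi), len(obs)
    total = 0.0
    for q in itertools.product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0], obs[0]]            # pi_{q1} * b_{q1}(o_1)
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
        total += p                                # sum over all state sequences q
    return total
```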
Forward Algorithm: Intuition
• αt(i) = P(o1, o2, … , ot, qt = i | λ): the probability of observing the partial sequence (o1, o2, … , ot) and being in state i at time t
Forward Algorithm
• Forward variable: αt(i) = P(o1, o2, … , ot, qt = i | λ)
  • αt(i) is the probability of observing the partial sequence (o1, o2, … , ot) such that the state at time t is Si
• Initialization: α1(i) = πi bi(o1)
• Induction: αt+1(j) = [ Σi=1..N αt(i) aij ] bj(ot+1)
• Termination: P(O|λ) = Σi=1..N αT(i)
• Complexity: O(N²T)
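A minimal sketch of the forward recursion as code, a direct transcription of the equations above (the numerically safer scaled version described in Rabiner's tutorial is omitted for brevity):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Compute P(O | lambda) with the forward algorithm.

    A[i, j] = a_ij, B[j, k] = b_j(k), pi[i] = pi_i; obs is a list of symbol indices.
    Returns (P(O|lambda), alpha) where alpha[t, i] = alpha_t(i).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                           # initialization: pi_i b_i(o_1)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]   # induction step
    return alpha[-1].sum(), alpha                          # termination: sum_i alpha_T(i)
```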
Backward Algorithm: Intuition
• βt(i) = P(ot+1, ot+2, … , oT | qt = i, λ): the probability of observing the partial sequence (ot+1, ot+2, … , oT) given that the state at time t is i
Backward Algorithm
• Backward variable: βt(i) = P(ot+1, ot+2, … , oT | qt = i, λ)
  • βt(i) is the probability of observing the partial sequence (ot+1, ot+2, … , oT) given that the state at time t is i
• Initialization: βT(i) = 1
• Induction: βt(i) = Σj=1..N aij bj(ot+1) βt+1(j)
• Termination: P(O|λ) = Σi=1..N πi bi(o1) β1(i)
• Complexity: O(N²T)
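A matching sketch of the backward recursion, with the same array conventions as the forward sketch above:

```python
import numpy as np

def backward(A, B, pi, obs):
    """Compute P(O | lambda) with the backward algorithm.

    Returns (P(O|lambda), beta) where beta[t, i] = beta_t(i).
    """
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                           # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])       # induction step
    return (pi * B[:, obs[0]] * beta[0]).sum(), beta         # termination
```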
Combining Forward and Backward
• P(O, qt = i | λ)
  = P(o1, … , oT, qt = i | λ)
  = P(o1, … , ot, qt = i | λ) · P(ot+1, … , oT | o1, … , ot, qt = i, λ)
  = P(o1, … , ot, qt = i | λ) · P(ot+1, … , oT | qt = i, λ)
  = αt(i) βt(i)
• Therefore P(O|λ) = Σi=1..N αt(i) βt(i), for any 1 ≤ t ≤ T
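A quick numerical check of this identity, reusing the forward and backward sketches above (the toy parameters are placeholders):

```python
import numpy as np

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
obs = [0, 2, 1, 0]

p_fwd, alpha = forward(A, B, pi, obs)    # from the forward sketch above
p_bwd, beta  = backward(A, B, pi, obs)   # from the backward sketch above

# sum_i alpha_t(i) beta_t(i) equals P(O|lambda) at every time step t
for t in range(len(obs)):
    assert np.isclose((alpha[t] * beta[t]).sum(), p_fwd)
assert np.isclose(p_fwd, p_bwd)
```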
Solution to Problem 2
• Find the most likely path
• Find the path that maximizes the likelihood P(q1, q2, … , qT | O, λ), which is equivalent to maximizing P(q1, q2, … , qT, O | λ)
• Define δt(i) = max over q1, … , qt-1 of P(q1, … , qt-1, qt = i, o1, … , ot | λ)
  • δt(i) is the probability of the single best path ending in state i at time t
• By induction: δt+1(j) = [ maxi δt(i) aij ] · bj(ot+1)
Viterbi Algorithm
• δt(i) = max over q1, … , qt-1 of P(q1, … , qt-1, qt = i, o1, … , ot | λ)
• δt+1(j) = [ maxi δt(i) aij ] · bj(ot+1)
• P* = max over 1 ≤ i ≤ N of δT(i)
Viterbi Algorithm
• Initialization: δ1(i) = πi bi(o1), ψ1(i) = 0
• Recursion: δt(j) = [ maxi δt-1(i) aij ] bj(ot), ψt(j) = argmaxi [ δt-1(i) aij ]
• Termination: P* = maxi δT(i), qT* = argmaxi δT(i)
• Path (state sequence) backtracking: qt* = ψt+1(qt+1*), for t = T-1, … , 1
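A minimal sketch of the Viterbi recursion with backtracking, using the same array conventions as the earlier sketches:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return (P*, best_path) for the most likely state sequence."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # recursion
    path = [int(delta[-1].argmax())]                  # termination: best final state
    for t in range(T - 1, 0, -1):                     # backtracking via psi
        path.append(int(psi[t][path[-1]]))
    return delta[-1].max(), path[::-1]
```

In practice the recursion is usually carried out with log probabilities to avoid underflow on long sequences.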
Solution to Problem 3
• Estimate λ = (A, B, π) to maximize P(O|λ)
• No analytic method exists because of the problem's complexity; an iterative method is used
• ξt(i, j) is the probability of being in state i at time t, and in state j at time t+1:
  ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O|λ)
           = αt(i) aij bj(ot+1) βt+1(j) / Σk=1..N Σl=1..N αt(k) akl bl(ot+1) βt+1(l)
Expectation Maximization
• a'ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
• b'j(k) = (expected number of times in state j and observing symbol k) / (expected number of times in state j)
• The reestimated model λ' satisfies P(O|λ') ≥ P(O|λ): the likelihood never decreases
Expectation Maximization
• ξt(i, j) = P(qt = i, qt+1 = j | O, λ) = P(qt = i, qt+1 = j, O | λ) / P(O|λ)
• a'ij = Σt=1..T-1 ξt(i, j) / Σt=1..T-1 Σk=1..N ξt(i, k)
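A minimal sketch of one Baum-Welch reestimation step, built on the forward/backward sketches above (single observation sequence, no scaling; names are illustrative, not from the slides):

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch (EM) iteration on a single sequence; returns (A', B', pi')."""
    T, N = len(obs), len(pi)
    p_O, alpha = forward(A, B, pi, obs)       # forward sketch above
    _,   beta  = backward(A, B, pi, obs)      # backward sketch above
    # xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / p_O
    # gamma[t, i] = P(q_t = i | O, lambda)
    gamma = np.vstack([xi.sum(axis=2), xi[-1].sum(axis=0)])
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi
```

Iterating this step until the likelihood stops improving gives the locally optimal λ' the slide refers to.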
Part-of-Speech Tagging

POS Tagging
• Labeling each word in a sentence with its appropriate part of speech, e.g. noun, verb, adjective, …
  • The-AT representative-NN put-VBD chairs-NNS on-IN the-AT table-NN.
  • The-AT representative-JJ put-NN chairs-VBZ on-IN the-AT table-NN.
Information Sources in Tagging
• Context information
  • a new play
  • play football
• Syntagmatic information
  • AT JJ NN is common
  • AT JJ VBP is extremely rare
• Lexical information
  • The tag distribution of a word is extremely uneven
  • Basic tag vs. derived tags
  • A dumb tagger (always choosing the basic tag) achieves 90% accuracy
Summary
• A rule-based tagger using syntagmatic patterns is about 77% accurate (Greene and Rubin, 1971)
• A dumb tagger (basic tag only) is about 90% accurate (Charniak, 1993)
• An HMM tagger is about 97% accurate
HMM Taggers
• The states of the HMM are tags
• Transition probability:
  P(tk | tj) = C(tj, tk) / C(tj)
• Emission probability:
  P(wl | tj) = C(wl, tj) / C(tj)
• Tag sequence: argmax over t1..n of P(t1..n | w1..n), computed with the Viterbi algorithm
Transition Probability
• Tag bigram counts C(tj, tk) (row = first tag tj, column = second tag tk):

  tj \ tk      AT     BEZ      IN      NN     VB     PRD
  AT            0       0       0   48636      0      19
  BEZ        1973       0     426     187      0      38
  IN        43322       0    1325   17314      0     185
  NN         1067    3720   42470   11773    614   21392
  VB         6072      42    4758    1476    129    1522
  PRD        8016      75    4656    1329    954       0
Emission Probability
• P(wl | tj) = C(wl, tj) / C(tj), estimated from word–tag counts C(wl, tj):

  wl \ tj      AT     BEZ      IN      NN     VB     PRD
  bear          0       0       0      10     43       0
  is            0   10065       0       0      0       0
  move          0       0       0      36    133       0
  on            0       0    5484       0      0       0
  president     0       0       0     382      0       0
  progress      0       0       0     108      4       0
  the       69016       0       0       0      0       0
  .             0       0       0       0      0   48809
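A quick worked estimate from the counts above, assuming (as an approximation) that C(AT) equals the AT row total of the bigram table; the true tag count would come from the corpus itself:

```python
# Transition estimate P(NN | AT) = C(AT, NN) / C(AT)
C_AT_NN = 48636
C_AT = 0 + 0 + 0 + 48636 + 0 + 19      # AT row total used as a stand-in for C(AT)
print(C_AT_NN / C_AT)                   # ≈ 0.9996: under this reduced tag set, AT is almost always followed by NN
```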