Hidden Markov Models
IP notice: slides from Dan Jurafsky
Outline
Markov Chains
Hidden Markov Models
Three Algorithms for HMMs: the Forward Algorithm, the Viterbi Algorithm, the Baum-Welch (EM) Algorithm
Applications: the Ice Cream Task, Part-of-Speech Tagging
Chomsky Grammars
Distinguish grammatical English from ungrammatical English:
John thinks Sara hit the boy
* The hit thinks Sara John boy
John thinks the boy was hit by Sara
Who does John think Sara hit?
John thinks Sara hit the boy and the girl
* Who does John think Sara hit the boy and?
John thinks Sara hit the boy with the bat
What does John think Sara hit the boy with?
Colorless green ideas sleep furiously.
* Green sleep furiously ideas colorless.
Acceptors and Transformers
Chomsky's grammars are about which utterances are acceptable
Other research programs are aimed at transforming utterances:
Translate an English sentence into Japanese…
Transform a speech waveform into transcribed words…
Compress a sentence, summarize a text…
Transform a syntactic analysis into a semantic analysis…
Generate a text from a semantic representation…
Strings and Trees
Early on, trees were realized to be a useful tool in describing what is grammatical:
A sentence is a noun phrase (NP) followed by a verb phrase (VP)
A noun phrase is a determiner (DT) followed by a noun (NN)
A noun phrase is a noun phrase (NP) followed by a prepositional phrase (PP)
A PP is a preposition (IN) followed by an NP
A string is acceptable if it has an acceptable tree…
Transformations may take place at the tree level…
[Parse-tree sketch: S → NP VP; NP → NP PP; PP → IN NP]
Natural Language Processing
1980s: Many tree-based grammatical formalisms
1990s: Regression to string-based formalisms
Hidden Markov Models (HMMs), Finite-State Acceptors (FSAs) and Transducers (FSTs)
N-gram models for accepting sentences [e.g., Jelinek 90]
Taggers and other statistical transformations [e.g., Church 88]
Machine translation [e.g., Brown et al 93]
Software toolkits implementing generic weighted FST operations [e.g., Mohri, Pereira, Riley 00]
Natural Language Processing
2000s: Emerging interest in tree-based probabilistic models
Machine translation [Wu 97, Yamada & Knight 02, Melamed 03, Chiang 05, …]
Summarization [Knight & Marcu 00, …]
Paraphrasing [Pang et al 03, …]
Question answering [Echihabi & Marcu 03, …]
Natural language generation [Bangalore & Rambow 00, …]
FSAs and FSTs
Finite-State (String) Transducer (FST)
[Diagram, stepped through over several slides: an FST with states q, q2, q3, q4, and qfinal, and arcs labeled k : ε, n : N, i : AY, g : ε, h : ε, t : T. Reading the original input k n i g h t one symbol at a time, the transducer moves through its states and emits the transformation N AY T.]
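As a rough illustration of what the diagram computes, here is a minimal Python sketch of such a transducer. Only the arc labels (k : ε, n : N, i : AY, g : ε, h : ε, t : T) come from the slides; the exact state-to-state wiring below is an assumption made for illustration.

# Minimal finite-state transducer sketch: maps the input string "knight"
# to the phoneme-like output N AY T.  The state wiring is assumed.
TRANSITIONS = {
    ("q",  "k"): ("q2",     ""),    # k : epsilon
    ("q2", "n"): ("q3",     "N"),   # n : N
    ("q3", "i"): ("q4",     "AY"),  # i : AY
    ("q4", "g"): ("q4",     ""),    # g : epsilon
    ("q4", "h"): ("q4",     ""),    # h : epsilon
    ("q4", "t"): ("qfinal", "T"),   # t : T
}

def transduce(word, start="q", final="qfinal"):
    """Follow the arcs for each input symbol, collecting the output symbols."""
    state, output = start, []
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return None                      # input not accepted
        state, out = TRANSITIONS[(state, symbol)]
        if out:
            output.append(out)
    return output if state == final else None

print(transduce("knight"))   # ['N', 'AY', 'T']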
Definitions
A weighted finite-state automaton (WFSA):
An FSA with probabilities on the arcs
The probabilities on the arcs leaving any state must sum to one
A Markov chain (or observable Markov Model):
A special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through
Markov chains can't represent inherently ambiguous problems
Useful for assigning probabilities to unambiguous sequences
Weighted Finite-State Transducer
FST: an FSA whose state transitions are labeled with both input and output symbols
A weighted transducer puts weights on transitions in addition to the input and output symbols
Weights may encode probabilities, durations, penalties, ...
Used in speech recognition
Tutorial at http://www.cs.nyu.edu/~mohri/pub/hbka.pdf
Markov chain for weather
Markov chain for words
Markov chain = “First-order observable Markov Model”
A set of states Q: q1, q2 … qN; the state at time t is qt
Transition probabilities: a set of probabilities A = a01 a02 … an1 … ann
Each aij represents the probability of transitioning from state i to state j
Distinguished start and end states
a_{ij} = P(q_t = j \mid q_{t-1} = i), \qquad 1 \le i, j \le N
\sum_{j=1}^{N} a_{ij} = 1, \qquad 1 \le i \le N
Markov chain = “First-order observable Markov Model”
Markov Assumption: the current state depends only on the previous state
P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})
Another representation for the start state
Instead of a start state: a special initial probability vector π
An initial probability distribution over start states
Constraints:
\pi_i = P(q_1 = i), \qquad 1 \le i \le N
\sum_{j=1}^{N} \pi_j = 1
The weather model using π
The weather model: specific example
Markov chain for weather
What is the probability of 4 consecutive warm days?
Sequence is warm-warm-warm-warm, i.e., state sequence is 3-3-3-3
P(3, 3, 3, 3) = \pi_3\, a_{33}\, a_{33}\, a_{33} = 0.2 \times (0.6)^3 = 0.0432
How about?
Hot hot hot hot
Cold hot cold hot
What does the difference in these probabilities tell you about the real-world weather info encoded in the figure?
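A small Python sketch of the warm-day computation, using only the two numbers that survive in the text (π₃ = 0.2 and a₃₃ = 0.6); the rest of the figure's transition matrix is not reproduced here.

# Probability of n consecutive days in state 3 ("warm"), using the slide's
# values pi_3 = 0.2 and a_33 = 0.6; other entries of the figure are omitted.
pi_3 = 0.2
a_33 = 0.6

def prob_warm_run(n):
    """P(state 3 repeated n times) = pi_3 * a_33^(n-1)."""
    return pi_3 * a_33 ** (n - 1)

print(prob_warm_run(4))   # 0.0432 (up to floating-point rounding)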
HMM for Ice Cream
You are a climatologist in the year 2799, studying global warming
You can't find any records of the weather in Baltimore, MD for the summer of 2008
But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
Our job: figure out how hot it was
Hidden Markov Model
For Markov chains, the output symbols are the same as the states
See hot weather: we're in state hot
But in named-entity or part-of-speech tagging (and speech recognition and other things), the output symbols are words, while the hidden states are something else:
Part-of-speech tags
Named entity tags
So we need an extension!
A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states
This means we don't know which state we are in
Hidden Markov Models
Assumptions
Markov assumption:
P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})
Output-independence assumption:
P(o_t \mid o_1 \ldots o_{t-1}, q_1 \ldots q_t) = P(o_t \mid q_t)
Eisner task
Given: ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, …
Produce: weather sequence: H, C, H, H, H, C, …
HMM for ice cream
Different types of HMM structure
Bakis = left-to-right
Ergodic = fully connected
The Three Basic Problems for HMMs
Problem 1 (Evaluation): Given the observation sequence O = (o_1 o_2 … o_T) and an HMM model λ = (A, B), how do we efficiently compute P(O | λ), the probability of the observation sequence given the model?
Problem 2 (Decoding): Given the observation sequence O = (o_1 o_2 … o_T) and an HMM model λ = (A, B), how do we choose a corresponding state sequence Q = (q_1 q_2 … q_T) that is optimal in some sense (i.e., best explains the observations)?
Problem 3 (Learning): How do we adjust the model parameters λ = (A, B) to maximize P(O | λ)?
Jack Ferguson at IDA in the 1960s
Problem 1: computing the observation likelihood
Given the following HMM:
How likely is the sequence 3 1 3?
How to compute likelihood
For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities
But for an HMM, we don't know what the states are!
So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence
Suppose we knew the weather and wanted to predict how much ice cream Jason would eat, i.e., P(3 1 3 | H H C)
Computing likelihood of 3 1 3 given a hidden state sequence
Computing the joint probability of observation and state sequence
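The worked numbers on these two slides are missing from the extraction, but the symbolic forms follow directly from the output-independence and Markov assumptions above (with a_{ij} the transition probabilities and b_j(o) the emission probabilities):

P(3\,1\,3 \mid H\,H\,C) = b_H(3)\, b_H(1)\, b_C(3)

P(3\,1\,3,\; H\,H\,C) = \pi_H\, b_H(3)\, a_{HH}\, b_H(1)\, a_{HC}\, b_C(3)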
Computing total likelihood of 3 1 3
We would need to sum over
Hot hot cold
Hot hot hot
Hot cold hot
…
How many possible hidden state sequences are there for this sequence?
How about in general for an HMM with N hidden states and a sequence of T observations? N^T
So we can't just do a separate computation for each hidden state sequence
Instead: the Forward algorithm
A kind of dynamic programming algorithm
Just like Minimum Edit Distance
Uses a table to store intermediate values
Idea: compute the likelihood of the observation sequence
By summing over all possible hidden state sequences
But doing this efficiently
By folding all the sequences into a single trellis
The forward algorithm
The goal of the forward algorithm is to compute
P(o_1, o_2 \ldots o_T, q_T = q_F \mid \lambda)
We'll do this by recursion
The forward algorithm
Each cell of the forward algorithm trellis, α_t(j):
Represents the probability of being in state j
After seeing the first t observations
Given the automaton
Each cell thus expresses the following probability:
\alpha_t(j) = P(o_1, o_2 \ldots o_t, q_t = j \mid \lambda)
The Forward Recursion
The Forward Trellis
We update each cell
The Forward Algorithm
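A minimal Python sketch of the forward computation, using the recursion α_t(j) = Σ_i α_{t−1}(i) a_{ij} b_j(o_t). The parameter values below are illustrative placeholders, not the figure's actual numbers, and the sketch terminates by summing the final column rather than using an explicit final state.

# Forward algorithm sketch.  pi, A, B are illustrative placeholders.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}                                      # initial probabilities
A  = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}    # transition probabilities
B  = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}  # emission probabilities

def forward(observations):
    """Return P(O | lambda), summing over all hidden state paths via the trellis."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = {j: pi[j] * B[j][observations[0]] for j in states}
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in observations[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][o]
                 for j in states}
    # Termination: sum the last column of the trellis
    return sum(alpha.values())

print(forward([3, 1, 3]))   # likelihood of the ice-cream sequence 3 1 3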
Decoding
Given an observation sequence 3 1 3
And an HMM
The task of the decoder: to find the best hidden state sequence
Given the observation sequence O = (o_1 o_2 … o_T) and an HMM model λ = (A, B), how do we choose a corresponding state sequence Q = (q_1 q_2 … q_T) that is optimal in some sense (i.e., best explains the observations)?
Decoding
One possibility:
For each hidden state sequence Q (HHH, HHC, HCH, …), compute P(O|Q)
Pick the highest one
Why not? N^T
Instead: the Viterbi algorithm
It is again a dynamic programming algorithm
It uses a similar trellis to the Forward algorithm
Viterbi intuition
We want to compute the joint probability of the observation sequence together with the best state sequence:
\max_{q_0, q_1, \ldots, q_T} P(q_0, q_1, \ldots, q_T,\ o_1, o_2, \ldots, o_T,\ q_T = q_F \mid \lambda)
Viterbi Recursion
The Viterbi trellis
Viterbi intuition
Process the observation sequence left to right
Filling out the trellis
Each cell:
v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)
Viterbi Algorithm
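A matching sketch of Viterbi decoding, reusing the illustrative pi, A, B from the forward sketch above; each cell keeps the best incoming path via v_t(j) = max_i v_{t−1}(i) a_{ij} b_j(o_t), plus a backpointer for the backtrace.

def viterbi(observations):
    """Return the most probable hidden state sequence (Viterbi decoding)."""
    # Initialization
    v = {j: pi[j] * B[j][observations[0]] for j in states}
    backpointers = []
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for o in observations[1:]:
        bp, new_v = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: v[i] * A[i][j])
            bp[j] = best_i
            new_v[j] = v[best_i] * A[best_i][j] * B[j][o]
        backpointers.append(bp)
        v = new_v
    # Termination and backtrace
    last = max(states, key=lambda s: v[s])
    path = [last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi([3, 1, 3]))   # e.g. ['H', 'H', 'H'] under the assumed parameters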
Viterbi backtrace
Training an HMM
Forward-backward or Baum-Welch algorithm (Expectation Maximization)
Backward probability:
\beta_t(i) = P(o_{t+1}, o_{t+2} \ldots o_T \mid q_t = i, \lambda)
\beta_T(i) = a_{i,F}, \qquad 1 \le i \le N
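The recursion that fills the backward trellis from right to left, consistent with these definitions, is:

\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad 1 \le i \le N,\ 1 \le t < T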
function FORWARD-BACKWARD(observations of len T, output vocabulary V, hidden state set Q) returns HMM = (A, B)
initialize A and B
iterate until convergence
  E-step
  M-step
return A, B
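A rough sketch of one such iteration in Python, fleshing out the E-step / M-step skeleton above. It reuses the illustrative states, pi, A, B from the forward sketch; a real implementation would add scaling (or work in log space), train over many observation sequences, and also re-estimate pi, which is kept fixed here for brevity.

def forward_trellis(obs):
    """All alpha_t(j) values (not just the final sum)."""
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    return alpha

def backward_trellis(obs):
    """beta_t(i): probability of the observations after time t, given state i at t."""
    beta = [{i: 1.0 for i in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j][o] * nxt[j] for j in states)
                        for i in states})
    return beta

def baum_welch_step(obs):
    """One EM iteration: the E-step computes state/transition posteriors,
    the M-step re-estimates A and B from the expected counts."""
    T = len(obs)
    alpha, beta = forward_trellis(obs), backward_trellis(obs)
    likelihood = sum(alpha[-1][j] for j in states)
    # E-step: gamma_t(i) = P(q_t = i | O),  xi_t(i, j) = P(q_t = i, q_{t+1} = j | O)
    gamma = [{i: alpha[t][i] * beta[t][i] / likelihood for i in states}
             for t in range(T)]
    xi = [{(i, j): alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
           for i in states for j in states}
          for t in range(T - 1)]
    # M-step: expected transition counts / expected visits; expected emission counts
    new_A = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
                 for j in states} for i in states}
    new_B = {j: {o: sum(g[j] for t, g in enumerate(gamma) if obs[t] == o)
                    / sum(g[j] for g in gamma)
                 for o in set(obs)} for j in states}
    return new_A, new_B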
Hidden Markov Models for Part-of-Speech Tagging
Part-of-speech tagging
8 (ish) traditional English parts of speech:
Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS
We'll use POS most frequently
Assuming that you know what these are
POS examples
N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those
POS Tagging example
WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
POS Tagging
Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
These examples from Dekang Lin
POS tagging as a sequence classification task
We are given a sentence (an “observation” or “sequence of observations”)
Secretariat is expected to race tomorrow
She promised to back the bill
What is the best sequence of tags which corresponds to this sequence of observations?
Probabilistic view:
Consider all possible sequences of tags
Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 … w_n
Getting to HMM
We want, out of all sequences of n tags t_1 … t_n, the single tag sequence such that P(t_1 … t_n | w_1 … w_n) is highest
Hat ^ means “our estimate of the best one”
Argmax_x f(x) means “the x such that f(x) is maximized”
Getting to HMM
This equation is guaranteed to give us the best tag sequence
But how to make it operational? How to compute this value?
Intuition of Bayesian classification:
Use Bayes rule to transform it into a set of other probabilities that are easier to compute
Using Bayes Rule
Likelihood and prior
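The equations on these two slides did not survive extraction; the standard decomposition for a bigram HMM tagger, applying Bayes' rule and then the two independence assumptions, is:

\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})

The denominator P(w_1^n) can be dropped because it is the same for every candidate tag sequence; the likelihood is then approximated with the output-independence assumption and the prior with the Markov (bigram) assumption.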
Two kinds of probabilities (1)
Tag transition probabilities P(t_i | t_{i-1})
Determiners likely to precede adjectives and nouns
That/DT flight/NN
The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high
But P(DT|JJ) to be low
Compute P(NN|DT) by counting in a labeled corpus:
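The counting formula itself is missing from the extracted slide; the standard maximum-likelihood estimate is:

P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}, \qquad \text{e.g.}\quad P(\mathrm{NN} \mid \mathrm{DT}) = \frac{C(\mathrm{DT},\, \mathrm{NN})}{C(\mathrm{DT})}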
Two kinds of probabilities (2)
Word likelihood probabilities P(w_i | t_i)
VBZ (3sg Pres verb) likely to be “is”
Compute P(is|VBZ) by counting in a labeled corpus:
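Again the formula is missing from the extraction; the standard count-based estimate is:

P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}, \qquad \text{e.g.}\quad P(\textit{is} \mid \mathrm{VBZ}) = \frac{C(\mathrm{VBZ},\, \textit{is})}{C(\mathrm{VBZ})}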
POS tagging: likelihood and prior
An Example: the verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO) P(race|VB) P(NR|VB) = .00000027
P(NN|TO) P(race|NN) P(NR|NN) = .00000000032
So we (correctly) choose the verb reading
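A two-line check of this arithmetic (the probabilities themselves are the slide's, taken as given):

# Multiply out the two candidate tag sequences for "race" using the slide's numbers.
p_vb = 0.83    * 0.00012 * 0.0027    # P(VB|TO) * P(race|VB) * P(NR|VB)
p_nn = 0.00047 * 0.00057 * 0.0012    # P(NN|TO) * P(race|NN) * P(NR|NN)
print(p_vb, p_nn)    # ~2.7e-07 vs ~3.2e-10, so the VB (verb) reading wins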
Transitions between the hidden states of the HMM, showing the A probabilities
B observation likelihoods for the POS HMM
The A matrix for the POS HMM
The B matrix for the POS HMM
Viterbi intuition: we are looking for the best ‘path’
[Trellis diagram, repeated over several slides, for the sentence “promised to back the bill” (positions S1–S5): each word position has a column of candidate tags (VBD, VBN, TO, VB, JJ, NN, RB, DT, NNP), and Viterbi searches for the best path through these columns.]
Slide from Dekang Lin
Viterbi example
Outline
Markov Chains
Hidden Markov Models
Three Algorithms for HMMs: the Forward Algorithm, the Viterbi Algorithm, the Baum-Welch (EM) Algorithm
Applications: the Ice Cream Task, Part-of-Speech Tagging
Next time: Named Entity Tagging