Hidden Markov Models
IP notice: slides from Dan Jurafsky
Outline
Markov Chains
Hidden Markov Models
Three Algorithms for HMMs: the Forward Algorithm, the Viterbi Algorithm, the Baum-Welch (EM) Algorithm
Applications: the Ice Cream Task, Part-of-Speech Tagging
Chomsky Grammars
Distinguish grammatical English from ungrammatical English:
John thinks Sara hit the boy
* The hit thinks Sara John boy
John thinks the boy was hit by Sara
Who does John think Sara hit?
John thinks Sara hit the boy and the girl
* Who does John think Sara hit the boy and?
John thinks Sara hit the boy with the bat
What does John think Sara hit the boy with?
Colorless green ideas sleep furiously.
* Green sleep furiously ideas colorless.
Acceptors and Transformers
Chomsky's grammars are about which utterances are acceptable
Other research programs are aimed at transforming utterances:
Translate an English sentence into Japanese…
Transform a speech waveform into transcribed words…
Compress a sentence, summarize a text…
Transform a syntactic analysis into a semantic analysis…
Generate a text from a semantic representation…
Strings and Trees
Early on, trees were realized to be a useful tool in describing what is grammatical:
A sentence is a noun phrase (NP) followed by a verb phrase (VP)
A noun phrase is a determiner (DT) followed by a noun (NN)
A noun phrase is a noun phrase (NP) followed by a prepositional phrase (PP)
A PP is a preposition (IN) followed by an NP
A string is acceptable if it has an acceptable tree…
Transformations may take place at the tree level…
[Parse-tree sketch: S → NP VP; NP → NP PP; PP → IN NP]
Natural Language Processing
1980s: Many tree-based grammatical formalisms
1990s: Regression to string-based formalisms
Hidden Markov Models (HMMs), Finite-State Acceptors (FSAs) and Transducers (FSTs)
N-gram models for accepting sentences [e.g., Jelinek 90]
Taggers and other statistical transformations [e.g., Church 88]
Machine translation [e.g., Brown et al 93]
Software toolkits implementing generic weighted FST operations [e.g., Mohri, Pereira, Riley 00]
Natural Language Processing
2000s: Emerging interest in tree-based probabilistic models
Machine translation [Wu 97, Yamada & Knight 02, Melamed 03, Chiang 05, …]
Summarization [Knight & Marcu 00, …]
Paraphrasing [Pang et al 03, …]
Question answering [Echihabi & Marcu 03, …]
Natural language generation [Bangalore & Rambow 00, …]
FSAs and FSTs
Finite-State (String) Transducer (FST)
[Diagram, stepped through over several slides: an FST with states q, q2, q3, q4, and qfinal, and arcs labeled k : ε, n : N, i : AY, g : ε, h : ε, t : T. Reading the original input k n i g h t one symbol at a time, the transducer moves through its states and emits the transformation N AY T.]
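As a rough illustration of what the diagram computes, here is a minimal Python sketch of such a transducer. Only the arc labels (k : ε, n : N, i : AY, g : ε, h : ε, t : T) come from the slides; the exact state-to-state wiring below is an assumption made for illustration.

# Minimal finite-state transducer sketch: maps the input string "knight"
# to the phoneme-like output N AY T.  The state wiring is assumed.
TRANSITIONS = {
    ("q",  "k"): ("q2",     ""),    # k : epsilon
    ("q2", "n"): ("q3",     "N"),   # n : N
    ("q3", "i"): ("q4",     "AY"),  # i : AY
    ("q4", "g"): ("q4",     ""),    # g : epsilon
    ("q4", "h"): ("q4",     ""),    # h : epsilon
    ("q4", "t"): ("qfinal", "T"),   # t : T
}

def transduce(word, start="q", final="qfinal"):
    """Follow the arcs for each input symbol, collecting the output symbols."""
    state, output = start, []
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return None                      # input not accepted
        state, out = TRANSITIONS[(state, symbol)]
        if out:
            output.append(out)
    return output if state == final else None

print(transduce("knight"))   # ['N', 'AY', 'T']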
Definitions
A weighted finite-state automaton (WFSA):
An FSA with probabilities on the arcs
The probabilities on the arcs leaving any state must sum to one
A Markov chain (or observable Markov Model):
A special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through
Markov chains can't represent inherently ambiguous problems
Useful for assigning probabilities to unambiguous sequences
Weighted Finite-State Transducer
FST: an FSA whose state transitions are labeled with both input and output symbols
A weighted transducer puts weights on transitions in addition to the input and output symbols
Weights may encode probabilities, durations, penalties, ...
Used in speech recognition
Tutorial at http://www.cs.nyu.edu/~mohri/pub/hbka.pdf
Markov chain for weather
Markov chain for words
Markov chain = “First-order observable Markov Model”
A set of states Q: q1, q2 … qN; the state at time t is qt
Transition probabilities: a set of probabilities A = a01 a02 … an1 … ann
Each aij represents the probability of transitioning from state i to state j
Distinguished start and end states
a_{ij} = P(q_t = j \mid q_{t-1} = i), \qquad 1 \le i, j \le N
\sum_{j=1}^{N} a_{ij} = 1, \qquad 1 \le i \le N
Markov chain = “First-order observable Markov Model”
Markov Assumption: the current state depends only on the previous state
P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})
Another representation for the start state
Instead of a start state: a special initial probability vector π
An initial probability distribution over start states
Constraints:
\pi_i = P(q_1 = i), \qquad 1 \le i \le N
\sum_{j=1}^{N} \pi_j = 1
The weather model using π
The weather model: specific example
Markov chain for weather
What is the probability of 4 consecutive warm days?
Sequence is warm-warm-warm-warm, i.e., state sequence is 3-3-3-3
P(3, 3, 3, 3) = \pi_3\, a_{33}\, a_{33}\, a_{33} = 0.2 \times (0.6)^3 = 0.0432
How about?
Hot hot hot hot
Cold hot cold hot
What does the difference in these probabilities tell you about the real-world weather info encoded in the figure?
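A small Python sketch of the warm-day computation, using only the two numbers that survive in the text (π₃ = 0.2 and a₃₃ = 0.6); the rest of the figure's transition matrix is not reproduced here.

# Probability of n consecutive days in state 3 ("warm"), using the slide's
# values pi_3 = 0.2 and a_33 = 0.6; other entries of the figure are omitted.
pi_3 = 0.2
a_33 = 0.6

def prob_warm_run(n):
    """P(state 3 repeated n times) = pi_3 * a_33^(n-1)."""
    return pi_3 * a_33 ** (n - 1)

print(prob_warm_run(4))   # 0.0432 (up to floating-point rounding)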
HMM for Ice Cream
You are a climatologist in the year 2799, studying global warming
You can't find any records of the weather in Baltimore, MD for the summer of 2008
But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
Our job: figure out how hot it was
Hidden Markov Model
For Markov chains, the output symbols are the same as the states
See hot weather: we're in state hot
But in named-entity or part-of-speech tagging (and speech recognition and other things), the output symbols are words, while the hidden states are something else:
Part-of-speech tags
Named entity tags
So we need an extension!
A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states
This means we don't know which state we are in
Hidden Markov Models
Assumptions
Markov assumption:
P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})
Output-independence assumption:
P(o_t \mid o_1 \ldots o_{t-1}, q_1 \ldots q_t) = P(o_t \mid q_t)
Eisner task
Given: ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, …
Produce: weather sequence: H, C, H, H, H, C, …
HMM for ice cream
Different types of HMM structure
Bakis = left-to-right
Ergodic = fully connected
The Three Basic Problems for HMMs
Problem 1 (Evaluation): Given the observation sequence O = (o_1 o_2 … o_T) and an HMM model λ = (A, B), how do we efficiently compute P(O | λ), the probability of the observation sequence given the model?
Problem 2 (Decoding): Given the observation sequence O = (o_1 o_2 … o_T) and an HMM model λ = (A, B), how do we choose a corresponding state sequence Q = (q_1 q_2 … q_T) that is optimal in some sense (i.e., best explains the observations)?
Problem 3 (Learning): How do we adjust the model parameters λ = (A, B) to maximize P(O | λ)?
Jack Ferguson at IDA in the 1960s
Problem 1: computing the observation likelihood
Given the following HMM:
How likely is the sequence 3 1 3?
How to compute likelihood
For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities
But for an HMM, we don't know what the states are!
So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence
Suppose we knew the weather and wanted to predict how much ice cream Jason would eat, i.e., P(3 1 3 | H H C)
Computing likelihood of 3 1 3 given a hidden state sequence
Computing the joint probability of observation and state sequence
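The worked numbers on these two slides are missing from the extraction, but the symbolic forms follow directly from the output-independence and Markov assumptions above (with a_{ij} the transition probabilities and b_j(o) the emission probabilities):

P(3\,1\,3 \mid H\,H\,C) = b_H(3)\, b_H(1)\, b_C(3)

P(3\,1\,3,\; H\,H\,C) = \pi_H\, b_H(3)\, a_{HH}\, b_H(1)\, a_{HC}\, b_C(3)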
Computing total likelihood of 3 1 3
We would need to sum over
Hot hot cold
Hot hot hot
Hot cold hot
…
How many possible hidden state sequences are there for this sequence?
How about in general for an HMM with N hidden states and a sequence of T observations? N^T
So we can't just do a separate computation for each hidden state sequence
Instead: the Forward algorithm
A kind of dynamic programming algorithm
Just like Minimum Edit Distance
Uses a table to store intermediate values
Idea: compute the likelihood of the observation sequence
By summing over all possible hidden state sequences
But doing this efficiently
By folding all the sequences into a single trellis
The forward algorithm
The goal of the forward algorithm is to compute
P(o_1, o_2 \ldots o_T, q_T = q_F \mid \lambda)
We'll do this by recursion
The forward algorithm
Each cell of the forward algorithm trellis, α_t(j):
Represents the probability of being in state j
After seeing the first t observations
Given the automaton
Each cell thus expresses the following probability:
\alpha_t(j) = P(o_1, o_2 \ldots o_t, q_t = j \mid \lambda)
The Forward Recursion
The Forward Trellis
We update each cell
The Forward Algorithm
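A minimal Python sketch of the forward computation, using the recursion α_t(j) = Σ_i α_{t−1}(i) a_{ij} b_j(o_t). The parameter values below are illustrative placeholders, not the figure's actual numbers, and the sketch terminates by summing the final column rather than using an explicit final state.

# Forward algorithm sketch.  pi, A, B are illustrative placeholders.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}                                      # initial probabilities
A  = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}    # transition probabilities
B  = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}  # emission probabilities

def forward(observations):
    """Return P(O | lambda), summing over all hidden state paths via the trellis."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = {j: pi[j] * B[j][observations[0]] for j in states}
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in observations[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][o]
                 for j in states}
    # Termination: sum the last column of the trellis
    return sum(alpha.values())

print(forward([3, 1, 3]))   # likelihood of the ice-cream sequence 3 1 3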
Decoding
Given an observation sequence 3 1 3
And an HMM
The task of the decoder: to find the best hidden state sequence
Given the observation sequence O = (o_1 o_2 … o_T) and an HMM model λ = (A, B), how do we choose a corresponding state sequence Q = (q_1 q_2 … q_T) that is optimal in some sense (i.e., best explains the observations)?
Decoding
One possibility:
For each hidden state sequence Q (HHH, HHC, HCH, …), compute P(O|Q)
Pick the highest one
Why not? N^T
Instead: the Viterbi algorithm
It is again a dynamic programming algorithm
It uses a similar trellis to the Forward algorithm
Viterbi intuition
We want to compute the joint probability of the observation sequence together with the best state sequence:
\max_{q_0, q_1, \ldots, q_T} P(q_0, q_1, \ldots, q_T,\ o_1, o_2, \ldots, o_T,\ q_T = q_F \mid \lambda)
Viterbi Recursion
The Viterbi trellis
Viterbi intuition
Process the observation sequence left to right
Filling out the trellis
Each cell:
v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)
Viterbi Algorithm
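A matching sketch of Viterbi decoding, reusing the illustrative pi, A, B from the forward sketch above; each cell keeps the best incoming path via v_t(j) = max_i v_{t−1}(i) a_{ij} b_j(o_t), plus a backpointer for the backtrace.

def viterbi(observations):
    """Return the most probable hidden state sequence (Viterbi decoding)."""
    # Initialization
    v = {j: pi[j] * B[j][observations[0]] for j in states}
    backpointers = []
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for o in observations[1:]:
        bp, new_v = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: v[i] * A[i][j])
            bp[j] = best_i
            new_v[j] = v[best_i] * A[best_i][j] * B[j][o]
        backpointers.append(bp)
        v = new_v
    # Termination and backtrace
    last = max(states, key=lambda s: v[s])
    path = [last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi([3, 1, 3]))   # e.g. ['H', 'H', 'H'] under the assumed parameters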
Viterbi backtrace
Training an HMM
Forward-backward or Baum-Welch algorithm (Expectation Maximization)
Backward probability:
\beta_t(i) = P(o_{t+1}, o_{t+2} \ldots o_T \mid q_t = i, \lambda)
\beta_T(i) = a_{i,F}, \qquad 1 \le i \le N
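The recursion that fills the backward trellis from right to left, consistent with these definitions, is:

\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad 1 \le i \le N,\ 1 \le t < T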
function FORWARD-BACKWARD(observations of len T, output vocabulary V, hidden state set Q) returns HMM = (A, B)
initialize A and B
iterate until convergence
  E-step
  M-step
return A, B
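A rough sketch of one such iteration in Python, fleshing out the E-step / M-step skeleton above. It reuses the illustrative states, pi, A, B from the forward sketch; a real implementation would add scaling (or work in log space), train over many observation sequences, and also re-estimate pi, which is kept fixed here for brevity.

def forward_trellis(obs):
    """All alpha_t(j) values (not just the final sum)."""
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    return alpha

def backward_trellis(obs):
    """beta_t(i): probability of the observations after time t, given state i at t."""
    beta = [{i: 1.0 for i in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j][o] * nxt[j] for j in states)
                        for i in states})
    return beta

def baum_welch_step(obs):
    """One EM iteration: the E-step computes state/transition posteriors,
    the M-step re-estimates A and B from the expected counts."""
    T = len(obs)
    alpha, beta = forward_trellis(obs), backward_trellis(obs)
    likelihood = sum(alpha[-1][j] for j in states)
    # E-step: gamma_t(i) = P(q_t = i | O),  xi_t(i, j) = P(q_t = i, q_{t+1} = j | O)
    gamma = [{i: alpha[t][i] * beta[t][i] / likelihood for i in states}
             for t in range(T)]
    xi = [{(i, j): alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
           for i in states for j in states}
          for t in range(T - 1)]
    # M-step: expected transition counts / expected visits; expected emission counts
    new_A = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
                 for j in states} for i in states}
    new_B = {j: {o: sum(g[j] for t, g in enumerate(gamma) if obs[t] == o)
                    / sum(g[j] for g in gamma)
                 for o in set(obs)} for j in states}
    return new_A, new_B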
Hidden Markov Models for Part-of-Speech Tagging
Part-of-speech tagging
8 (ish) traditional English parts of speech:
Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS
We'll use POS most frequently
Assuming that you know what these are
POS examples
N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those
POS Tagging example
WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
POS Tagging
Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
These examples from Dekang Lin
POS tagging as a sequence classification task
We are given a sentence (an “observation” or “sequence of observations”)
Secretariat is expected to race tomorrow
She promised to back the bill
What is the best sequence of tags which corresponds to this sequence of observations?
Probabilistic view:
Consider all possible sequences of tags
Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 … w_n
Getting to HMM
We want, out of all sequences of n tags t_1 … t_n, the single tag sequence such that P(t_1 … t_n | w_1 … w_n) is highest
Hat ^ means “our estimate of the best one”
Argmax_x f(x) means “the x such that f(x) is maximized”
Getting to HMM
This equation is guaranteed to give us the best tag sequence
But how to make it operational? How to compute this value?
Intuition of Bayesian classification:
Use Bayes rule to transform it into a set of other probabilities that are easier to compute
Using Bayes Rule
Likelihood and prior
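The equations on these two slides did not survive extraction; the standard decomposition for a bigram HMM tagger, applying Bayes' rule and then the two independence assumptions, is:

\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})

The denominator P(w_1^n) can be dropped because it is the same for every candidate tag sequence; the likelihood is then approximated with the output-independence assumption and the prior with the Markov (bigram) assumption.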
Two kinds of probabilities (1)
Tag transition probabilities P(t_i | t_{i-1})
Determiners likely to precede adjectives and nouns
That/DT flight/NN
The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high
But P(DT|JJ) to be low
Compute P(NN|DT) by counting in a labeled corpus:
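The counting formula itself is missing from the extracted slide; the standard maximum-likelihood estimate is:

P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}, \qquad \text{e.g.}\quad P(\mathrm{NN} \mid \mathrm{DT}) = \frac{C(\mathrm{DT},\, \mathrm{NN})}{C(\mathrm{DT})}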
Two kinds of probabilities (2)
Word likelihood probabilities P(w_i | t_i)
VBZ (3sg Pres verb) likely to be “is”
Compute P(is|VBZ) by counting in a labeled corpus:
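Again the formula is missing from the extraction; the standard count-based estimate is:

P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}, \qquad \text{e.g.}\quad P(\textit{is} \mid \mathrm{VBZ}) = \frac{C(\mathrm{VBZ},\, \textit{is})}{C(\mathrm{VBZ})}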
POS tagging: likelihood and prior
An Example: the verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO) P(race|VB) P(NR|VB) = .00000027
P(NN|TO) P(race|NN) P(NR|NN) = .00000000032
So we (correctly) choose the verb reading
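A two-line check of this arithmetic (the probabilities themselves are the slide's, taken as given):

# Multiply out the two candidate tag sequences for "race" using the slide's numbers.
p_vb = 0.83    * 0.00012 * 0.0027    # P(VB|TO) * P(race|VB) * P(NR|VB)
p_nn = 0.00047 * 0.00057 * 0.0012    # P(NN|TO) * P(race|NN) * P(NR|NN)
print(p_vb, p_nn)    # ~2.7e-07 vs ~3.2e-10, so the VB (verb) reading wins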
Transitions between the hidden states of the HMM, showing the A probabilities
B observation likelihoods for the POS HMM
The A matrix for the POS HMM
The B matrix for the POS HMM
Viterbi intuition: we are looking for the best ‘path’
[Trellis diagram, repeated over several slides, for the sentence “promised to back the bill” (positions S1–S5): each word position has a column of candidate tags (VBD, VBN, TO, VB, JJ, NN, RB, DT, NNP), and Viterbi searches for the best path through these columns.]
Slide from Dekang Lin
Viterbi example
Outline
Markov Chains
Hidden Markov Models
Three Algorithms for HMMs: the Forward Algorithm, the Viterbi Algorithm, the Baum-Welch (EM) Algorithm
Applications: the Ice Cream Task, Part-of-Speech Tagging
Next time: Named Entity Tagging