Hidden Markov Models and Dynamic Programming


University of Oslo : Department of Informatics

Hidden Markov Models and Dynamic Programming

Stephan Oepen Jonathon Read

Date: October 14, 2011 · Venue: INF4820, Department of Informatics, University of Oslo

Topics

Last week
- Parts-of-speech (POS)
- A symbolic approach to POS tagging
- Stochastic POS tagging

Today
- Hidden Markov Models
- Computing likelihoods using the Forward algorithm
- Decoding hidden states using the Viterbi algorithm

Markov chains

Definition
- Q = q_1 q_2 \ldots q_N, a set of N states
- q_0, q_F, special start and final states
- A, a transition probability matrix, where a_{ij} is the probability of moving from state i to state j:

  A = \begin{pmatrix} a_{01} & \cdots & a_{0N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix}
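As a concrete sketch, the transition matrix A can be stored as nested dictionaries and used to score a state sequence. The two weather states and all probabilities below are invented for illustration, not taken from the lecture.

```python
# Illustrative Markov chain: two weather states plus a special start state.
# All transition probabilities here are made up for demonstration.
A = {
    "start": {"hot": 0.8, "cold": 0.2},
    "hot":   {"hot": 0.6, "cold": 0.4},
    "cold":  {"hot": 0.5, "cold": 0.5},
}

def sequence_probability(states, A):
    """P(q1 ... qT) = a_{0,q1} * product over t of a_{q_{t-1}, q_t}."""
    prob = A["start"][states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob

print(sequence_probability(["hot", "hot", "cold"], A))  # 0.8 * 0.6 * 0.4 = 0.192
```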

Hidden Markov models (HMMs)

Markov chains are useful for computing the probability of observable event sequences. Often, however, we are interested in events that are hidden.

Definition
- Q = q_1 q_2 \ldots q_N, a set of N states
- q_0, q_F, special start and final states
- A, a transition probability matrix, where a_{ij} is the probability of moving from state i to state j:

  A = \begin{pmatrix} a_{01} & \cdots & a_{0N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix}

- O = o_1 o_2 \ldots o_T, a sequence of T observations
- B = b_i(o_t), a sequence of observation likelihoods, where b_i(o_t) is the probability of observation o_t being generated from state i
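Extending the sketch above, an HMM λ = (A, B) adds emission probabilities B and lets us score one hidden-state sequence jointly with the observations. All probabilities are illustrative placeholders, not the lecture's values.

```python
# A minimal HMM for the ice-cream example: hidden weather states,
# observed ice-cream counts. Probabilities are invented for illustration.
A = {"start": {"hot": 0.8, "cold": 0.2},
     "hot":   {"hot": 0.6, "cold": 0.4},
     "cold":  {"hot": 0.5, "cold": 0.5}}
# B[state][obs] = b_state(obs): P(number of ice creams | weather)
B = {"hot":  {1: 0.2, 2: 0.4, 3: 0.4},
     "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint_probability(obs, states):
    """P(O, Q) = P(Q) * P(O | Q) for a single hidden-state sequence Q."""
    prob = A["start"][states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        prob *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return prob

print(joint_probability([3, 1, 3], ["hot", "hot", "cold"]))
```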

Ice cream and global warming

Missing records of weather in Baltimore for Summer 2007I likelihood of hot/cold weather given yesterday’s weatherI Jason’s diary, listing how many ice creams he ate each dayI number of ice creams he tends to eat, given the weather

Computing likelihoods

Task
Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O \mid λ).

Compute the sum over all possible state sequences:

P(O) = \sum_Q P(O, Q) = \sum_Q P(O \mid Q)\, P(Q)

For example, the ice cream sequence 3 1 3:

P(3\,1\,3) = P(3\,1\,3, \text{cold cold cold}) + P(3\,1\,3, \text{cold cold hot}) + P(3\,1\,3, \text{hot hot cold}) + \ldots

With N hidden states and T observations there are N^T possible state sequences, so the naive computation is O(N^T \cdot T).
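The brute-force sum can be written directly by enumerating every hidden-state sequence; with the illustrative two-state ice-cream HMM used earlier (probabilities invented), 3 observations already mean 2^3 = 8 sequences.

```python
from itertools import product

# Brute-force P(O): enumerate all N^T hidden-state sequences and sum P(O, Q).
# Probabilities are illustrative; the point is the exponential enumeration.
A = {"start": {"hot": 0.8, "cold": 0.2},
     "hot":   {"hot": 0.6, "cold": 0.4},
     "cold":  {"hot": 0.5, "cold": 0.5}}
B = {"hot":  {1: 0.2, 2: 0.4, 3: 0.4},
     "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint(obs, states):
    prob = A["start"][states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        prob *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return prob

def brute_force_likelihood(obs):
    # product(..., repeat=T) yields all N^T hidden-state sequences
    return sum(joint(obs, q) for q in product(["hot", "cold"], repeat=len(obs)))

print(brute_force_likelihood([3, 1, 3]))  # sums over 2^3 = 8 sequences
```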

The Forward algorithm

Employs dynamic programming: storing and reusing the results of partial computations in a trellis α.

Each cell in the trellis stores the probability of being in state q_j after seeing the first t observations:

\alpha_t(j) = P(o_1 \ldots o_t, q_t = j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)

Calculating a single element in the trellis
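A single cell can be traced by hand. Using the illustrative two-state ice-cream HMM from before (probabilities invented), α_2(hot) combines both entries of the α_1 column:

```python
# Filling one trellis cell following alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t).
# The HMM probabilities here are invented for illustration.
A = {"start": {"hot": 0.8, "cold": 0.2},
     "hot":   {"hot": 0.6, "cold": 0.4},
     "cold":  {"hot": 0.5, "cold": 0.5}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

obs = [3, 1, 3]
# Initialisation: alpha_1(j) = a_{0,j} * b_j(o_1)
alpha1 = {s: A["start"][s] * B[s][obs[0]] for s in ("hot", "cold")}
# One cell at t = 2: sum over both predecessor states, then emit o_2
alpha2_hot = sum(alpha1[i] * A[i]["hot"] for i in ("hot", "cold")) * B["hot"][obs[1]]
print(alpha1, alpha2_hot)
```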

Pseudocode for the Forward algorithm

Input: observations of length T, state-graph of length N
Output: forward-probability

create a probability matrix forward[N+2, T]
foreach state s from 1 to N do
    forward[s, 1] ← a_{0,s} × b_s(o_1)
end
foreach time step t from 2 to T do
    foreach state s from 1 to N do
        forward[s, t] ← \sum_{s'=1}^{N} forward[s', t−1] × a_{s',s} × b_s(o_t)
    end
end
forward[q_F, T] ← \sum_{s=1}^{N} forward[s, T] × a_{s,q_F}
return forward[q_F, T]
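A direct Python rendering of this pseudocode, assuming the same illustrative dictionary HMM as in the earlier sketches; instead of an explicit final state q_F it simply sums the last trellis column.

```python
# Forward algorithm: computes P(O) in O(N^2 * T) time via the trellis alpha.
# The HMM probabilities are invented for illustration.
A = {"start": {"hot": 0.8, "cold": 0.2},
     "hot":   {"hot": 0.6, "cold": 0.4},
     "cold":  {"hot": 0.5, "cold": 0.5}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs, states, A, B):
    """Return P(O) by filling the forward trellis column by column."""
    # Initialisation column: alpha_1(j) = a_{0,j} * b_j(o_1)
    alpha = [{s: A["start"][s] * B[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        # Recurrence: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        alpha.append({s: sum(alpha[t - 1][i] * A[i][s] for i in states) * B[s][obs[t]]
                      for s in states})
    # Termination: sum over the final column (no explicit q_F state here)
    return sum(alpha[-1][s] for s in states)

print(forward([3, 1, 3], ["hot", "cold"], A, B))
```

This agrees with the brute-force sum over all state sequences, but touches each trellis cell only once.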

An example of the Forward algorithm

The ice cream HMM

Decoding hidden states

Task
Given an HMM λ = (A, B) and a sequence of observations O = o_1, o_2, \ldots, o_T, find the most probable corresponding sequence of hidden states Q = q_1, q_2, \ldots, q_T.

v_t(j) = \max_{q_1, \ldots, q_{t-1}} P(o_1 \ldots o_t, q_1 \ldots q_{t-1}, q_t = j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)

and additionally keep a backtrace to whichever state lay on the most probable path to the current state:

\beta_t(j) = \arg\max_{i=1}^{N} v_{t-1}(i)\, a_{ij}

Pseudocode for the Viterbi algorithm

Input: observations of length T, state-graph of length N
Output: best-path

create a path probability matrix viterbi[N+2, T]
create a path backpointer matrix backpointer[N+2, T]
foreach state s from 1 to N do
    viterbi[s, 1] ← a_{0,s} × b_s(o_1)
    backpointer[s, 1] ← 0
end
foreach time step t from 2 to T do
    foreach state s from 1 to N do
        viterbi[s, t] ← \max_{s'=1}^{N} viterbi[s', t−1] × a_{s',s} × b_s(o_t)
        backpointer[s, t] ← \arg\max_{s'=1}^{N} viterbi[s', t−1] × a_{s',s}
    end
end
viterbi[q_F, T] ← \max_{s=1}^{N} viterbi[s, T] × a_{s,q_F}
backpointer[q_F, T] ← \arg\max_{s=1}^{N} viterbi[s, T] × a_{s,q_F}
return the path by following backpointers from backpointer[q_F, T]

An example of the Viterbi algorithm

The ice cream HMM

(A Practical Tip)

- When multiplying many small probabilities, we risk getting values that are too close to zero to be represented: underflow.
- It is often helpful to work in "log-space". Since the logarithm is monotonic, \log(\max f) = \max(\log f), so maximization is unaffected.
- It reduces multiplication to addition: \log \prod_i P_i = \sum_i \log P_i
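A quick demonstration of both points: 1000 factors of 0.01 multiply to 10^-2000, far below the smallest positive double, while the sum of their logs is a perfectly ordinary number.

```python
import math

# Underflow vs. log-space: multiplying many small probabilities
# collapses to 0.0, while summing their logs stays representable.
probs = [0.01] * 1000

prob = 1.0
for p in probs:
    prob *= p            # underflows to exactly 0.0

log_prob = sum(math.log(p) for p in probs)   # 1000 * ln(0.01)
print(prob, log_prob)
```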

Evaluation

- Using a manually labeled test set as our gold standard, we can compute the accuracy of our model: the percentage of tags in the test set that the tagger gets right.
- Compare the accuracy to some reference models: an upper bound and a baseline.
- An upper-bound ceiling can be based on, e.g., how well humans would do on the task, or on assuming an "oracle".
- A lower-bound baseline can be based on the accuracy expected from, e.g., random choice, always picking the tag with the highest frequency, or applying a unigram model.
- Standard hypothesis tests can be applied to test the statistical significance of any differences.
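Accuracy and a most-frequent-tag baseline can be sketched in a few lines; the five-token tag sequences below are invented purely for illustration.

```python
from collections import Counter

# Tagging accuracy against a gold standard, plus a most-frequent-tag
# baseline. The tiny example data is invented for illustration.
gold      = ["DT", "NN", "VB", "NN", "NN"]
predicted = ["DT", "NN", "NN", "NN", "NN"]

# Accuracy: fraction of positions where the tagger matches the gold tag
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(accuracy)

# Baseline: always predict the single most frequent tag in the gold data
baseline_tag = Counter(gold).most_common(1)[0][0]
baseline = sum(g == baseline_tag for g in gold) / len(gold)
print(baseline_tag, baseline)
```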
