Page 1: Lecture 12: Hidden Markov Models

Machine Learning

Andrew Rosenberg

March 12, 2010

Page 2: Last Time

Clustering

Page 3: Today

Hidden Markov Models

Page 4: Dice Example

Imagine a game of dice.

When the croupier rolls 4, 5, or 6, you win.

When the croupier rolls 1, 2, or 3, you lose.

Model the likelihood of winning.

IID multinomials
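The IID view can be checked by simulation. A minimal sketch (not from the slides; the fair-die assumption and function name are mine):

```python
import random

def win_probability(n_rolls=100_000, seed=0):
    """Estimate P(win) for a fair die: win on 4, 5, 6; lose on 1, 2, 3."""
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) >= 4)
    return wins / n_rolls

p = win_probability()  # close to 0.5 for a fair die
```

Each roll is an independent draw from the same multinomial, so the empirical win rate converges to the true probability.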

Page 5: The Moody Croupier

Now imagine that the croupier cheats.

There are three dice.

One fair die (fair)
One good for the house (bad)
One good for you (good)

Page 6: The Moody Croupier

Model the likelihood of winning.

IID multinomials

Latent variable

[Plate diagram: i ∈ {0, ..., n − 1}; latent q_i emits x_i]

Page 7: The Moody Croupier

Model the likelihood of winning.

IID multinomials

Latent variable

Allow a prior over die choices

[Plate diagram: i ∈ {0, ..., n − 1}; prior θ → latent q_i → emission x_i]

Page 8: The Moody Croupier

Now what if the dealer is moody?

The dealer doesn’t like to change the die that often.

The dealer doesn’t like to switch from the good die to the bad die.

No longer IID!

The die he uses at time t is dependent on the die used at t − 1.

[Chain diagram: θ → states q_0 → q_1 → q_2 → ... → q_{T−1}; each q_t emits x_t]
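The moody-croupier model can be simulated as a chain. A sketch only: the state names, "sticky" transition probabilities, and per-die win probabilities below are illustrative assumptions, since the slide gives no numbers:

```python
import random

STATES = ["fair", "bad", "good"]
PRIOR = {"fair": 1 / 3, "bad": 1 / 3, "good": 1 / 3}   # theta: prior over dice
TRANS = {                                              # p(q_t | q_{t-1}): sticky, and
    "fair": {"fair": 0.6, "bad": 0.2, "good": 0.2},    # reluctant to reach "good"
    "bad":  {"fair": 0.2, "bad": 0.7, "good": 0.1},
    "good": {"fair": 0.3, "bad": 0.1, "good": 0.6},
}
P_WIN = {"fair": 0.5, "bad": 0.3, "good": 0.7}         # p(x_t = win | q_t)

def sample_sequence(T, seed=0):
    """Draw a state sequence q and win/lose observations x from the chain."""
    rng = random.Random(seed)
    def draw(dist):
        r, acc = rng.random(), 0.0
        for k, p in dist.items():
            acc += p
            if r < acc:
                return k
        return k
    q = [draw(PRIOR)]
    for _ in range(1, T):
        q.append(draw(TRANS[q[-1]]))
    x = [int(rng.random() < P_WIN[s]) for s in q]
    return q, x

q, x = sample_sequence(10)
```

Because each die is drawn from the previous die's transition row, the draws are no longer IID: consecutive states are correlated.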

Page 9: Sequential Modeling

[Chain diagram: q_0 → q_1 → q_2 → ... → q_{T−1}; each q_t emits x_t]

Temporal or sequence model.

Markov Assumption

future ⊥⊥ past | present

p(q_t | q_{t-1}, q_{t-2}, q_{t-3}, \ldots, q_0) = p(q_t | q_{t-1})

Get the overall likelihood from the graphical model.

p(q, x) = p(q_0) \prod_{t=1}^{T-1} p(q_t | q_{t-1}) \prod_{t=0}^{T-1} p(x_t | q_t)
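The factored likelihood above can be evaluated directly in log space. A sketch; the two-state parameter values below are hypothetical, not from the slides:

```python
import math

def joint_log_likelihood(q, x, pi, A, B):
    """log p(q, x) = log p(q0) + sum_t log p(q_t|q_{t-1}) + sum_t log p(x_t|q_t)."""
    ll = math.log(pi[q[0]])
    for t in range(1, len(q)):
        ll += math.log(A[q[t - 1]][q[t]])   # transition terms
    for t in range(len(x)):
        ll += math.log(B[q[t]][x[t]])       # emission terms
    return ll

# Hypothetical two-state, two-symbol HMM
pi = {"a": 0.5, "b": 0.5}
A = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}
B = {"a": {0: 0.7, 1: 0.3}, "b": {0: 0.4, 1: 0.6}}
ll = joint_log_likelihood(["a", "a", "b"], [0, 1, 1], pi, A, B)
```

Working in logs avoids underflow, which matters once T grows into the hundreds.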

Page 10: Sequential Modeling

future ⊥⊥ past | present

p(q_t | q_{t-1}, q_{t-2}, q_{t-3}, \ldots, q_0) = p(q_t | q_{t-1})

Get the overall likelihood from the graphical model.

p(q, x) = p(q_0) \prod_{t=1}^{T-1} p(q_t | q_{t-1}) \prod_{t=0}^{T-1} p(x_t | q_t)

What form does p(q_t | q_{t-1}) take?

Page 11: HMMs as state machines

HMMs have two variables: state q and emission x.

In general the state is an unobserved latent variable.

Can consider HMMs as stochastic automata – weighted finite-state machines.

Page 12: HMM state machine

[State-machine diagram: states good, fair, bad; each has self-transition probability .5, with the remaining mass split as .2 and .3 over transitions to the other two states]

Page 13: HMMs as state machines

HMMs have two variables: state q and emission x

In general the state is an unobserved latent variable.

No observation of q directly. Only a related emission distribution: a “doubly stochastic automaton”.

Page 14: HMM Applications

Speech Recognition (Rabiner): phonemes from audio cepstral vectors

Language (Jelinek): part of speech tag from words

Biology (Baldi): splice site from gene sequence

Gesture (Starner): word from hand coordinates

Emotion (Picard): emotion from EEG

Page 15: Types of Variables

Continuous States

E.g., Kalman filters: p(q_t | q_{t-1}) = N(q_t | A q_{t-1}, Q)

Discrete States

E.g., finite state machines: p(q_t | q_{t-1}) = \prod_{i=0}^{M-1} \prod_{j=0}^{M-1} [\alpha_{ij}]^{q_{t-1}^i q_t^j}

Continuous Observations

E.g., time series data: p(x_t | q_t) = N(x_t | \mu_{q_t}, \Sigma_{q_t})

Discrete Observations

E.g., strings: p(x_t | q_t) = \prod_{i=0}^{M-1} \prod_{j=0}^{N-1} [\eta_{ij}]^{q_t^i x_t^j}

Page 16: HMM Parameters

M states and N-class observations.

Complete likelihood from the graphical model:

p(q, x) = p(q_0) \prod_{t=1}^{T-1} p(q_t | q_{t-1}) \prod_{t=0}^{T-1} p(x_t | q_t)

Marginalize over the unobserved hidden states:

p(x) = \sum_{q_0} \cdots \sum_{q_{T-1}} p(q, x)

CPTs are reused: \theta = \{\pi, \alpha, \eta\}

p(q_t | q_{t-1}) = \prod_{i=0}^{M-1} \prod_{j=0}^{M-1} [\alpha_{ij}]^{q_{t-1}^i q_t^j}

p(x_t | q_t) = \prod_{i=0}^{M-1} \prod_{j=0}^{N-1} [\eta_{ij}]^{q_t^i x_t^j}

p(q_0) = \prod_{i=0}^{M-1} [\pi_i]^{q_0^i}

Constraints:

\sum_{j=0}^{M-1} \alpha_{ij} = 1 \qquad \sum_{j=0}^{N-1} \eta_{ij} = 1 \qquad \sum_{i=0}^{M-1} \pi_i = 1
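The exponent notation treats each q_t as a one-hot indicator vector, so the double product over i, j picks out exactly one entry of the table. A quick numeric check of that identity (the α values are illustrative):

```python
import numpy as np

# With q_{t-1} and q_t one-hot, prod_i prod_j alpha[i,j] ** (q_{t-1}[i] * q_t[j])
# reduces to a single table lookup, since every other exponent is zero.
alpha = np.array([[0.70, 0.20, 0.10],
                  [0.30, 0.40, 0.30],
                  [0.25, 0.25, 0.50]])   # illustrative; rows sum to 1

q_prev = np.array([0, 1, 0])  # state 1 at time t-1
q_cur = np.array([0, 0, 1])   # state 2 at time t

product_form = np.prod(alpha ** np.outer(q_prev, q_cur))
lookup_form = alpha[1, 2]     # p(q_t = 2 | q_{t-1} = 1)
```

The same trick applies to the η and π factors, which is what makes the log-likelihood on the Maximum Likelihood slides collapse into simple indicator-weighted sums.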

Page 17: HMM Operations

Evaluate

Evaluate the likelihood of a model given data.

Decode

Identify the most likely sequence of states

Max Likelihood

Estimate the parameters.

Page 18: JTA on HMM

Junction Tree

Page 19: JTA on HMMs

Initialization

\psi(q_0, x_0) = p(q_0) p(x_0 | q_0)

\psi(q_t, q_{t+1}) = p(q_{t+1} | q_t) = A_{q_t, q_{t+1}}

\psi(q_t, x_t) = p(x_t | q_t)

Z = 1

\phi(q_t) = 1

\zeta(q_t) = 1

Page 20: JTA on HMMs

Update

Collect up from leaves – the ζ separators don’t change.

\zeta^*(q_t) = \sum_{x_t} \psi(q_t, x_t) = \sum_{x_t} p(x_t | q_t) = 1

\psi^*(q_{t-1}, q_t) = \frac{\zeta^*(q_t)}{\zeta(q_t)} \psi(q_{t-1}, q_t) = \psi(q_{t-1}, q_t)

Page 21: JTA on HMMs

Update

Collect left-to-right over the φ separators – state potentials become marginals.

\phi^*(q_0) = \sum_{x_0} \psi(q_0, x_0) = p(q_0)

\phi^*(q_t) = \sum_{q_{t-1}} \psi(q_{t-1}, q_t) = p(q_t)

\psi^*(q_0, q_1) = \frac{\phi^*(q_0)}{\phi(q_0)} \psi(q_0, q_1) = p(q_0, q_1)

Page 22: JTA on HMMs

Distribute

Distribute to separators

\zeta^{**}(q_t) = \sum_{q_{t-1}} \psi^*(q_{t-1}, q_t) = \sum_{q_{t-1}} p(q_{t-1}, q_t) = p(q_t)

\psi^{**}(q_t, x_t) = \frac{\zeta^{**}(q_t)}{\zeta^*(q_t)} \psi(q_t, x_t) = \frac{p(q_t)}{1} p(x_t | q_t) = p(x_t, q_t)

Page 23: Introduction of Evidence

Observe a sequence of data \bar{x}.

p(q, \bar{x}) = p(q_0) \prod_{t=1}^{T-1} p(q_t | q_{t-1}) \prod_{t=0}^{T-1} p(\bar{x}_t | q_t)

Potentials become slices:

\psi(q_t, \bar{x}_t) = p(\bar{x}_t | q_t)

\zeta^*(q_t) = \psi(q_t, \bar{x}_t) = p(\bar{x}_t | q_t), \quad \text{so now} \quad \zeta^*(q_t) \neq \sum_{x_t} \psi(q_t, x_t)

Collect ζ separators bottom up: \zeta^*(q_t) = \psi(q_t, \bar{x}_t) = p(\bar{x}_t | q_t)

Collect φ separators to the right: \phi^*(q_0) = \sum_{x_0} \psi(q_0, x_0) \delta(x_0 - \bar{x}_0) = p(q_0, \bar{x}_0)

Page 24: JTA collect

Collecting up and to the left, updating potentials by the left and bottom separators:

\psi^*(q_t, q_{t+1}) = \frac{\phi^*(q_t)}{1} \frac{\zeta^*(q_{t+1})}{1} \psi(q_t, q_{t+1}) = \phi^*(q_t) \, p(\bar{x}_{t+1} | q_{t+1}) \, \alpha_{q_t q_{t+1}}

\phi^*(q_{t+1}) = \sum_{q_t} \psi^*(q_t, q_{t+1}) = \sum_{q_t} \phi^*(q_t) \, p(\bar{x}_{t+1} | q_{t+1}) \, \alpha_{q_t q_{t+1}}

Note:

\phi^*(q_0) = p(\bar{x}_0, q_0)

\phi^*(q_1) = \sum_{q_0} p(\bar{x}_0, q_0) p(\bar{x}_1 | q_1) p(q_1 | q_0) = p(\bar{x}_0, \bar{x}_1, q_1)

\phi^*(q_2) = \sum_{q_1} p(\bar{x}_0, \bar{x}_1, q_1) p(\bar{x}_2 | q_2) p(q_2 | q_1) = p(\bar{x}_0, \bar{x}_1, \bar{x}_2, q_2)

\phi^*(q_{t+1}) = \sum_{q_t} p(\bar{x}_0, \ldots, \bar{x}_t, q_t) p(\bar{x}_{t+1} | q_{t+1}) p(q_{t+1} | q_t) = p(\bar{x}_0, \ldots, \bar{x}_{t+1}, q_{t+1})
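This collect recursion is exactly the forward pass: each φ* update multiplies in one transition and one emission. A vectorized sketch (the two-state parameters are illustrative):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Left-to-right collect pass: returns phi*(q_{T-1}) = p(x_0..x_{T-1}, q_{T-1})."""
    phi = pi * B[:, obs[0]]          # phi*(q_0) = p(x_0, q_0)
    for x in obs[1:]:
        phi = (phi @ A) * B[:, x]    # sum_{q_t} phi*(q_t) alpha_{q_t q_{t+1}} p(x|q_{t+1})
    return phi

# Illustrative two-state, two-symbol parameters
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])

phi_final = forward(pi, A, B, [0, 1, 0])
likelihood = phi_final.sum()         # p(x) = sum over q_{T-1}, as on the next slide
```

Each step is an M × M matrix-vector product, so the whole pass costs O(M²T) rather than the M^T of brute-force enumeration.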

Page 25: Evaluation

Compute the likelihood of the sequence. Collection is sufficient.

From the previous slide:

\phi^*(q_{t+1}) = \sum_{q_t} p(\bar{x}_0, \ldots, \bar{x}_t, q_t) p(\bar{x}_{t+1} | q_{t+1}) p(q_{t+1} | q_t) = p(\bar{x}_0, \ldots, \bar{x}_{t+1}, q_{t+1})

So the rightmost node gives:

\phi^*(q_{T-1}) = p(\bar{x}_0, \ldots, \bar{x}_{T-1}, q_{T-1})

The likelihood just requires marginalization over q_{T-1}:

p(\bar{x}_0, \ldots, \bar{x}_{T-1}) = \sum_{q_{T-1}} p(\bar{x}_0, \ldots, \bar{x}_{T-1}, q_{T-1}) = \sum_{q_{T-1}} \phi^*(q_{T-1})

Page 26: Distribute

But the potentials cannot be read as marginals without the Distribute step of the JTA.

Last state of collection:

\psi^*(q_{T-2}, q_{T-1}) = \frac{\phi^*(q_{T-2})}{1} \frac{\zeta^*(q_{T-1})}{1} \psi(q_{T-2}, q_{T-1}) = \phi^*(q_{T-2}) \, p(\bar{x}_{T-1} | q_{T-1}) \, \alpha_{q_{T-2} q_{T-1}}

Distribute ** along the state nodes to the left.

Distribute ** down from state nodes to observation nodes.

Update potentials:

\psi^{**}(q_{T-2}, q_{T-1}) = \psi^*(q_{T-2}, q_{T-1})

\phi^{**}(q_t) = \sum_{q_{t+1}} \psi^{**}(q_t, q_{t+1})

\zeta^{**}(q_{t+1}) = \sum_{q_t} \psi^{**}(q_t, q_{t+1})

\psi^{**}(q_t, q_{t+1}) = \frac{\phi^{**}(q_{t+1})}{\phi^*(q_{t+1})} \psi^*(q_t, q_{t+1})

Page 27: Decoding

Decode: given x_0, \ldots, x_{T-1}, identify the most likely q_0, \ldots, q_{T-1}.

Now that the JTA is finished, we have marginals in the potentials and separators:

\phi^{**}(q_t) \propto p(q_t | \bar{x}_0, \ldots, \bar{x}_{T-1})

\zeta^{**}(q_{t+1}) \propto p(q_{t+1} | \bar{x}_0, \ldots, \bar{x}_{T-1})

\psi^{**}(q_t, q_{t+1}) \propto p(q_t, q_{t+1} | \bar{x}_0, \ldots, \bar{x}_{T-1})

Need to find the most likely path from q_0 to q_{T-1}.

Argmax JTA: run the JTA, but use the max operator rather than sums in the update rule. Then find the largest entry in the separators:

\hat{q}_t = \operatorname{argmax}_{q_t} \phi^{**}(q_t)

Page 28: Viterbi Decoding

Finding an optimal state sequence by brute force is intractable: there are M^T possible paths for M states and T time steps, and T can easily be on the order of 1000 in speech recognition.

Construct a lattice of state transitions.
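The max-product pass over such a lattice can be sketched in a few lines; this is standard Viterbi dynamic programming, with illustrative two-state parameters (not from the slides):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Exact max-product decoding over the lattice: O(M^2 T), not M^T."""
    T, M = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])      # best log-score ending in each state
    back = np.zeros((T, M), dtype=int)             # backpointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)        # scores[i, j]: come from i, land in j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]                   # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi(pi, A, B, [0, 0, 1, 1])
```

Keeping only the best incoming edge per state at each time step is what collapses the M^T path enumeration down to a polynomial-time sweep.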

Page 29: Viterbi Decoding

Only continue to explore paths with likelihood greater than some threshold, or only continue to explore the top N paths.

Also known as beam search.

A polynomial-time algorithm to approximately decode a lattice.

Algorithm:

Initialize paths at every state.

For each transition, follow only the most likely edge.

or

Initialize paths at every state.

For each transition, follow only those paths that have a likelihood over some threshold.

Page 30: Viterbi decoding

Page 31: Maximum Likelihood

Training parameters with observed states.

Maximum likelihood (as ever).

l(\theta) = \log p(q, \bar{x})

= \log \left( p(q_0) \prod_{t=1}^{T-1} p(q_t | q_{t-1}) \prod_{t=0}^{T-1} p(\bar{x}_t | q_t) \right)

= \log p(q_0) + \sum_{t=1}^{T-1} \log p(q_t | q_{t-1}) + \sum_{t=0}^{T-1} \log p(\bar{x}_t | q_t)

= \log \prod_{i=0}^{M-1} [\pi_i]^{q_0^i} + \sum_{t=1}^{T-1} \log \prod_{i=0}^{M-1} \prod_{j=0}^{M-1} [\alpha_{ij}]^{q_{t-1}^i q_t^j} + \sum_{t=0}^{T-1} \log \prod_{i=0}^{M-1} \prod_{j=0}^{N-1} [\eta_{ij}]^{q_t^i \bar{x}_t^j}

= \sum_{i=0}^{M-1} q_0^i \log \pi_i + \sum_{t=1}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} q_{t-1}^i q_t^j \log \alpha_{ij} + \sum_{t=0}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} q_t^i \bar{x}_t^j \log \eta_{ij}

Introduce Lagrange multipliers, take partials, set to zero.

Page 32: Maximum Likelihood

Training parameters with observed states.

Maximum likelihood – as ever.

l(\theta) = \sum_{i=0}^{M-1} q_0^i \log \pi_i + \sum_{t=1}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} q_{t-1}^i q_t^j \log \alpha_{ij} + \sum_{t=0}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} q_t^i \bar{x}_t^j \log \eta_{ij}

Introduce Lagrange multipliers for the constraints, take partials, set to zero.

Constraints:

\sum_{i=0}^{M-1} \pi_i = 1 \qquad \sum_{j=0}^{M-1} \alpha_{ij} = 1 \qquad \sum_{j=0}^{N-1} \eta_{ij} = 1

Estimates:

\hat{\pi}_i = q_0^i

\hat{\alpha}_{ij} = \frac{\sum_{t=0}^{T-2} q_t^i q_{t+1}^j}{\sum_{k=0}^{M-1} \sum_{t=0}^{T-2} q_t^i q_{t+1}^k}

\hat{\eta}_{ij} = \frac{\sum_{t=0}^{T-1} q_t^i \bar{x}_t^j}{\sum_{k=0}^{N-1} \sum_{t=0}^{T-1} q_t^i \bar{x}_t^k}
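Since the states are one-hot indicators, these estimates reduce to counting and normalizing. A sketch with hypothetical helper names and a tiny hand-made data set:

```python
def normalize(row):
    """Divide a count row by its total (left unchanged if the total is zero)."""
    s = sum(row)
    return [c / s for c in row] if s else row

def estimate_params(state_seqs, obs_seqs, M, N):
    """Closed-form ML estimates: normalized initial/transition/emission counts."""
    pi = [0.0] * M
    alpha = [[0.0] * M for _ in range(M)]
    eta = [[0.0] * N for _ in range(M)]
    for q, x in zip(state_seqs, obs_seqs):
        pi[q[0]] += 1                        # counts q_0^i
        for t in range(1, len(q)):
            alpha[q[t - 1]][q[t]] += 1       # counts q_{t-1}^i q_t^j
        for qt, xt in zip(q, x):
            eta[qt][xt] += 1                 # counts q_t^i x_t^j
    return normalize(pi), [normalize(r) for r in alpha], [normalize(r) for r in eta]

# One observed sequence: states [0,0,1,1,1], emissions [0,1,1,0,1]
pi_hat, A_hat, eta_hat = estimate_params([[0, 0, 1, 1, 1]], [[0, 1, 1, 0, 1]], M=2, N=2)
```

Each estimate is just the empirical fraction of the corresponding event, which is what the Lagrange-multiplier derivation delivers.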

Page 33: Expectation Maximization

However, we may not have observed state sequences.

The Moody Croupier

Need to do unsupervised learning (clustering) on the states.

Maximize the expected likelihood given a guess for p(q).

Expectation Maximization – covered when we move to unsupervised techniques.

Page 34: Bye

Next

Perceptron and Neural Networks
