
Hidden Markov Models

1

10-601 Introduction to Machine Learning

Matt Gormley
Lecture 20

Nov. 7, 2018

Machine Learning Department, School of Computer Science, Carnegie Mellon University

Reminders

• Homework 6: PAC Learning / Generative Models
  – Out: Wed, Oct 31
  – Due: Wed, Nov 7 at 11:59pm (1 week)

• Homework 7: HMMs
  – Out: Wed, Nov 7
  – Due: Mon, Nov 19 at 11:59pm

2

HMM Outline
• Motivation
  – Time Series Data
• Hidden Markov Model (HMM)
  – Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]
  – Background: Markov Models
  – From Mixture Model to HMM
  – History of HMMs
  – Higher-order HMMs
• Training HMMs
  – (Supervised) Likelihood for HMM
  – Maximum Likelihood Estimation (MLE) for HMM
  – EM for HMM (aka. Baum-Welch algorithm)
• Forward-Backward Algorithm
  – Three Inference Problems for HMM
  – Great Ideas in ML: Message Passing
  – Example: Forward-Backward on 3-word Sentence
  – Derivation of Forward Algorithm
  – Forward-Backward Algorithm
  – Viterbi algorithm

3

This Lecture

Last Lecture

SUPERVISED LEARNING FOR HMMS

4

HMM Parameters:

Hidden Markov Model

6

X1 X2 X3 X4 X5

Y1 Y2 Y3 Y4 Y5

Transition probabilities (row = current state, column = next state):
      O     S     C
O    .9    .08   .02
S    .2    .7    .1
C    .9     0    .1

Emission probabilities (row = state, column = observed travel time):
      1min   2min   3min   …
O     .1     .2     .3     …
S     .01    .02    .03    …
C      0      0      0     …

Initial probabilities:
O    .8
S    .1
C    .1
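To make the tables above concrete, here is a small illustration (my own, not from the slides) that stores the tunnel-example parameters as numpy arrays; the emission table on the slide continues past 3min, so only the columns shown are filled in:

```python
import numpy as np

states = ["O", "S", "C"]           # hidden states from the slide's tables
obs    = ["1min", "2min", "3min"]  # observed travel times (the slide's table continues with "...")

# Initial probabilities, one per hidden state
init = np.array([0.8, 0.1, 0.1])

# Transition probabilities: trans[i, j] = p(next state = states[j] | current state = states[i])
trans = np.array([
    [0.9, 0.08, 0.02],
    [0.2, 0.7,  0.1 ],
    [0.9, 0.0,  0.1 ],
])

# Emission probabilities (truncated): emit[i, k] = p(obs[k] | states[i])
emit = np.array([
    [0.1,  0.2,  0.3 ],
    [0.01, 0.02, 0.03],
    [0.0,  0.0,  0.0 ],
])
```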

Training HMMs

Whiteboard:
– (Supervised) Likelihood for an HMM
– Maximum Likelihood Estimation (MLE) for HMM

7

Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models

8

[Figure: the two pieces, a transition model (Yt → Yt+1) and an emission model (Yt → Xt)]

HMM Parameters:

Assumption:
Generative Story:

Hidden Markov Model

9

X1 X2 X3 X4 X5

Y0 Y1 Y2 Y3 Y4 Y5

y0 = START

For notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption.

Joint Distribution:

Hidden Markov Model

10

X1 X2 X3 X4 X5

Y0 Y1 Y2 Y3 Y4 Y5

y0 = START
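The joint distribution equation itself is not reproduced in this transcript; a standard form, consistent with folding the initial probabilities into the transition distribution via y0 = START, is:

$$p(x_1, \dots, x_T, y_1, \dots, y_T) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\; p(x_t \mid y_t), \qquad y_0 = \text{START}$$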

Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models

11

[Figure: the two pieces, a transition model (Yt → Yt+1) and an emission model (Yt → Xt)]
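As a concrete illustration of that decomposition, here is a minimal count-and-normalize sketch of the supervised MLE (my own code with hypothetical names such as mle_hmm, not taken from the lecture): the transition distribution is estimated from tag-to-tag counts, with a <START> symbol folded in as above, and the emission distribution from tag-to-word counts.

```python
from collections import Counter, defaultdict

def mle_hmm(tag_seqs, word_seqs, start="<START>"):
    """Supervised MLE for an HMM: normalize transition and emission counts."""
    trans_counts = defaultdict(Counter)   # trans_counts[prev_tag][tag]
    emit_counts  = defaultdict(Counter)   # emit_counts[tag][word]
    for tag_seq, word_seq in zip(tag_seqs, word_seqs):
        prev = start
        for tag, word in zip(tag_seq, word_seq):
            trans_counts[prev][tag] += 1
            emit_counts[tag][word]  += 1
            prev = tag
    # Normalize each row of counts into a conditional distribution.
    trans = {prev: {t: c / sum(row.values()) for t, c in row.items()}
             for prev, row in trans_counts.items()}
    emit  = {tag: {w: c / sum(row.values()) for w, c in row.items()}
             for tag, row in emit_counts.items()}
    return trans, emit

# Tiny example in the format of the POS data used later in the lecture:
tags_ex  = [["n", "v", "p", "d", "n"]]
words_ex = [["time", "flies", "like", "an", "arrow"]]
trans, emit = mle_hmm(tags_ex, words_ex)
```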

HMMs: History

• Markov chains: Andrey Markov (1906)

– Random walks and Brownian motion

• Used in Shannon’s work on information theory (1948)

• Baum-Welch learning algorithm: late 60’s, early 70’s.

– Used mainly for speech in 60s-70s.

• Late 80’s and 90’s: David Haussler (major player in learning theory in 80’s) began to use HMMs for modeling biological sequences

• Mid-late 1990’s: Dayne Freitag/Andrew McCallum

– Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.

– McCallum: multinomial Naïve Bayes for text

– With McCallum, IE using HMMs on CORA

• …

13
Slide from William Cohen

Higher-order HMMs

• 1st-order HMM (i.e. bigram HMM)

• 2nd-order HMM (i.e. trigram HMM)

• 3rd-order HMM

14

[Figures: graphical models for the 1st-, 2nd-, and 3rd-order HMMs, each over states Y1 … Y5, observations X1 … X5, and a <START> state]
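Written out (a standard factorization, not copied from the slides), the three models condition each tag on increasingly more history while the emissions stay the same:

$$
\begin{aligned}
\text{1st-order:} \quad & p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1})\, p(x_t \mid y_t) \\
\text{2nd-order:} \quad & p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1}, y_{t-2})\, p(x_t \mid y_t) \\
\text{3rd-order:} \quad & p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1}, y_{t-2}, y_{t-3})\, p(x_t \mid y_t)
\end{aligned}
$$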

BACKGROUND: MESSAGE PASSING

15

Great Ideas in ML: Message Passing

3 behind you

2 behind you

1 behind you

4 behind you

5 behind you

1 before you

2 before you

there's 1 of me

3 before you

4 before you

5 before you

Count the soldiers

16
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing

3 behind you

2 before you

there's 1 of me

Belief: Must be 2 + 1 + 3 = 6 of us

only see my incoming messages (2, 1, 3)

Count the soldiers

17
adapted from MacKay (2003) textbook

2 before you

Great Ideas in ML: Message Passing

4 behind you

1 before you

there's 1 of me

only see my incoming messages

Count the soldiers

18
adapted from MacKay (2003) textbook

Belief: Must be 2 + 1 + 3 = 6 of us (incoming messages: 2, 1, 3)

Belief: Must be 1 + 1 + 4 = 6 of us (incoming messages: 1, 1, 4)

Great Ideas in ML: Message Passing

7 here

3 here

11 here (= 7+3+1)

1 of me

Each soldier receives reports from all branches of tree

19
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing

3 here

3 here

7 here (= 3+3+1)

Each soldier receives reports from all branches of tree

20
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing

7 here

3 here

11 here (= 7+3+1)

Each soldier receives reports from all branches of tree

21
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing

7 here

3 here

3 here

Belief: Must be 14 of us

Each soldier receives reports from all branches of tree

22
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

7 here

3 here

3 here

Belief: Must be 14 of us

23
adapted from MacKay (2003) textbook

THE FORWARD-BACKWARD ALGORITHM

24

Inference for HMMs

Whiteboard– Three Inference Problems for an HMM

1. Evaluation: Compute the probability of a given sequence of observations

2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations

3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations

25

Dataset for Supervised Part-of-Speech (POS) Tagging

26

Data: $\mathcal{D} = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$

Sample 1 (y(1), x(1)):   n v p d n   /   time flies like an arrow
Sample 2 (y(2), x(2)):   n n v d n   /   time flies like an arrow
Sample 3 (y(3), x(3)):   n v p n n   /   flies fly with their wings
Sample 4 (y(4), x(4)):   p n n v v   /   with time you will see

time flies like an arrow

<START> n v p d n

Hidden Markov Model

28

A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.

Transition probabilities (row = previous tag, column = next tag):
     v    n    p    d
v   .1   .4   .2   .3
n   .8   .1   .1    0
p   .2   .3   .2   .3
d   .2   .8    0    0

Emission probabilities (row = tag, column = word):
     time  flies  like  …
v    .2    .5     .2    …
n    .3    .4     .2    …
p    .1    .1     .3    …
d    .1    .2     .1    …

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)

X1 X2 X3

Y1 Y2 Y3

29

find preferred tags

Could be adjective or verb / Could be noun or verb / Could be verb or noun

Forward-Backward Algorithm

30

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm

31

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm

32

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …

Y1 Y2 Y3
X1 X2 X3
find preferred tags

33

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …

Forward-Backward Algorithm

Y1 Y2 Y3
X1 X2 X3
find preferred tags

34

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …

Forward-Backward Algorithm

Y1 Y2 Y3
X1 X2 X3
find preferred tags

35

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …

Forward-Backward Algorithm

Emission factors (row = tag, column = word):
     find  pref.  tags  …
v    3     5      3
n    4     5      2
a    0.1   0.2    0.1

Transition factors (row = previous tag, column = next tag):
     v     n     a
v    1     6     4
n    8     4     0.1
a    0.1   8     0

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Viterbi Algorithm: Most Probable Assignment

36

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product

B(a, END)
A(tags, n)

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Viterbi Algorithm: Most Probable Assignment

37

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path

B(a, END)
A(tags, n)

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm: Finds Marginals

38

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm: Finds Marginals

39

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm: Finds Marginals

40

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = v) = (1/Z) * total weight of all paths through v

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm: Finds Marginals

41

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n

Y1 Y2 Y3
X1 X2 X3
find preferred tags

α2(n) = total weight of these path prefixes

(found by dynamic programming: matrix-vector products)

Forward-Backward Algorithm: Finds Marginals

42

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Y1 Y2 Y3
X1 X2 X3
find preferred tags

β2(n) = total weight of these path suffixes

(found by dynamic programming: matrix-vector products)

Forward-Backward Algorithm: Finds Marginals

43

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Y1 Y2 Y3
X1 X2 X3
find preferred tags

α2(n) = total weight of these path prefixes
β2(n) = total weight of these path suffixes

Forward-Backward Algorithm: Finds Marginals

44

α2(n) = (a + b + c)    β2(n) = (x + y + z)

Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths

Y1 Y2 Y3
X1 X2 X3
find preferred tags

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Forward-Backward Algorithm: Finds Marginals

45

total weight of all paths through n = α2(n) × A(pref., n) × β2(n)

“belief that Y2 = n”

Oops! The weight of a path through a state also includes a weight at that state. So α(n)·β(n) isn't enough. The extra weight is the opinion of the emission probability at this variable.

Y1 Y2 Y3
X1 X2 X3
find preferred tags

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Forward-Backward Algorithm: Finds Marginals

46

total weight of all paths through v = α2(v) × A(pref., v) × β2(v)

“belief that Y2 = n”
“belief that Y2 = v”

Y1 Y2 Y3
X1 X2 X3
find preferred tags

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Forward-Backward Algorithm: Finds Marginals

47

total weight of all paths through a = α2(a) × A(pref., a) × β2(a)

“belief that Y2 = n”
“belief that Y2 = v”
“belief that Y2 = a”

sum = Z (total weight of all paths)

Beliefs:          Marginals (divide by Z = 0.5):
v   0.1           v   0.2
n   0             n   0
a   0.4           a   0.8
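Written as one formula (a reconstruction using the notation of the preceding slides, where A(pref., t) is the emission weight for the word "preferred" under tag t), the marginal is the belief divided by Z:

$$p(Y_2 = t \mid \mathbf{x}) \;=\; \frac{\alpha_2(t)\, A(\text{pref.}, t)\, \beta_2(t)}{Z}, \qquad Z = \sum_{t' \in \{v, n, a\}} \alpha_2(t')\, A(\text{pref.}, t')\, \beta_2(t')$$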

X1 X2 X3

Y1 Y2 Y3

48

find preferred tags

Could be adjective or verb / Could be noun or verb / Could be verb or noun

Forward-Backward Algorithm

Inference for HMMs

Whiteboard:
– Derivation of Forward algorithm
– Forward-backward algorithm
– Viterbi algorithm

49

Derivation of Forward Algorithm

50

Definition:

Derivation:
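The definition and derivation were worked on the whiteboard; a standard reconstruction of the forward quantity and its recursion (using the START convention above) is:

$$\alpha_t(k) \;\triangleq\; p(x_1, \dots, x_t,\, y_t = k) \;=\; p(x_t \mid y_t = k) \sum_{j} p(y_t = k \mid y_{t-1} = j)\, \alpha_{t-1}(j)$$

with $\alpha_1(k) = p(y_1 = k \mid \text{START})\, p(x_1 \mid y_1 = k)$, so the evaluation problem is solved by $p(x_1, \dots, x_T) = \sum_k \alpha_T(k)$.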

Forward-Backward Algorithm

51
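The forward-backward derivation itself was done on the whiteboard; below is a minimal numpy sketch of the two passes (my own illustrative code, not the lecture's). One convention note: unlike the lattice pictures above, here the emission weight at position t is folded into alpha[t], so alpha * beta / Z directly gives the marginals.

```python
import numpy as np

def forward_backward(init, trans, emit, obs):
    """
    init:  (K,)   initial state distribution (the START row, in the lecture's convention)
    trans: (K, K) trans[i, j] = p(y_t = j | y_{t-1} = i)
    emit:  (K, V) emit[k, v]  = p(x_t = v | y_t = k)
    obs:   length-T list of observation indices
    Returns alpha, beta, and the per-position marginals p(y_t | x).
    """
    T, K = len(obs), len(init)
    alpha = np.zeros((T, K))
    beta  = np.zeros((T, K))

    # Forward pass: alpha[t, k] = p(x_1..x_t, y_t = k)
    alpha[0] = init * emit[:, obs[0]]
    for t in range(1, T):
        alpha[t] = emit[:, obs[t]] * (alpha[t - 1] @ trans)

    # Backward pass: beta[t, k] = p(x_{t+1}..x_T | y_t = k)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])

    Z = alpha[T - 1].sum()        # p(x_1..x_T): solves the evaluation problem
    marginals = alpha * beta / Z  # p(y_t = k | x_1..x_T)
    return alpha, beta, marginals
```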

Viterbi Algorithm

52
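Similarly, a minimal Viterbi sketch under the same parameterization (again my own illustration, not code from the lecture):

```python
import numpy as np

def viterbi(init, trans, emit, obs):
    """Most probable hidden-state sequence for the given observations."""
    T, K = len(obs), len(init)
    delta = np.zeros((T, K))            # best path score ending in state k at time t
    back  = np.zeros((T, K), dtype=int) # backpointers to the best previous state

    delta[0] = init * emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans   # scores[i, j]: prev state i -> state j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, obs[t]]

    # Trace back the highest-scoring path.
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```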

Inference in HMMs

What is the computational complexity of inference for HMMs?

• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T), where K is the number of hidden states and T is the sequence length

• The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T*K^2)
– Thanks to dynamic programming!

53

Shortcomings of Hidden Markov Models

• HMMs capture dependencies between each state and only its corresponding observation

– NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental stages), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.

• Mismatch between learning objective function and prediction objective function

– HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)

© Eric Xing @ CMU, 2005-2015
54

Y1 Y2 … … … Yn

X1 X2 … … … Xn

START

MBR DECODING

55

Inference for HMMs

– Three Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
4. MBR Decoding: Find the lowest loss sequence of hidden states, given a sequence of observations (Viterbi decoding is a special case)

56

Minimum Bayes Risk Decoding

• Suppose we are given a loss function ℓ(ŷ, y) and are asked for a single tagging
• How should we choose just one from our probability distribution p(y|x)?
• A minimum Bayes risk (MBR) decoder h(x) returns the variable assignment with minimum expected loss under the model's distribution:

57

$$h_\theta(x) = \operatorname*{argmin}_{\hat{y}}\; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\left[\ell(\hat{y}, y)\right] = \operatorname*{argmin}_{\hat{y}} \sum_{y} p_\theta(y \mid x)\, \ell(\hat{y}, y)$$

Minimum Bayes Risk Decoding

Consider some example loss functions:

58

The 0-1 loss function returns 1 only if the two assignments are identical and 0 otherwise:

$$\ell(\hat{y}, y) = 1 - I(\hat{y}, y)$$

The MBR decoder is:

$$h_\theta(x) = \operatorname*{argmin}_{\hat{y}} \sum_{y} p_\theta(y \mid x)\left(1 - I(\hat{y}, y)\right) = \operatorname*{argmax}_{\hat{y}}\; p_\theta(\hat{y} \mid x)$$

which is exactly the Viterbi decoding problem!

Minimum Bayes Risk Decoding

Consider some example loss functions:

59

The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments:

$$\ell(\hat{y}, y) = \sum_{i=1}^{V} \left(1 - I(\hat{y}_i, y_i)\right)$$

The MBR decoder is:

$$\hat{y}_i = h_\theta(x)_i = \operatorname*{argmax}_{\hat{y}_i}\; p_\theta(\hat{y}_i \mid x)$$

This decomposes across variables and requires the variable marginals.
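Because the Hamming-loss MBR decoder decomposes into a per-position argmax over marginals, it is a one-liner on top of forward-backward. The sketch below assumes the hypothetical forward_backward helper from the earlier sketch, not anything provided by the lecture:

```python
def mbr_decode_hamming(init, trans, emit, obs):
    """Minimum Bayes risk decoding under Hamming loss:
    pick the marginally most probable state at each position."""
    # forward_backward() is the illustrative helper defined in the earlier sketch.
    _, _, marginals = forward_backward(init, trans, emit, obs)
    return marginals.argmax(axis=1).tolist()
```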

Learning Objectives
Hidden Markov Models

You should be able to…
1. Show that structured prediction problems yield high-computation inference problems
2. Define the first order Markov assumption
3. Draw a Finite State Machine depicting a first order Markov assumption
4. Derive the MLE parameters of an HMM
5. Define the three key problems for an HMM: evaluation, decoding, and marginal computation
6. Derive a dynamic programming algorithm for computing the marginal probabilities of an HMM
7. Interpret the forward-backward algorithm as a message passing algorithm
8. Implement supervised learning for an HMM
9. Implement the forward-backward algorithm for an HMM
10. Implement the Viterbi algorithm for an HMM
11. Implement a minimum Bayes risk decoder with Hamming loss for an HMM

60
