Hidden Markov Models
10-601 Introduction to Machine Learning
Matt Gormley
Lecture 20
Nov. 7, 2018
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Reminders
• Homework 6: PAC Learning / Generative Models
  – Out: Wed, Oct 31
  – Due: Wed, Nov 7 at 11:59pm (1 week)
• Homework 7: HMMs
  – Out: Wed, Nov 7
  – Due: Mon, Nov 19 at 11:59pm
HMM Outline
• Motivation
  – Time Series Data
• Hidden Markov Model (HMM)
  – Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]
  – Background: Markov Models
  – From Mixture Model to HMM
  – History of HMMs
  – Higher-order HMMs
• Training HMMs
  – (Supervised) Likelihood for HMM
  – Maximum Likelihood Estimation (MLE) for HMM
  – EM for HMM (aka. Baum-Welch algorithm)
• Forward-Backward Algorithm
  – Three Inference Problems for HMM
  – Great Ideas in ML: Message Passing
  – Example: Forward-Backward on 3-word Sentence
  – Derivation of Forward Algorithm
  – Forward-Backward Algorithm
  – Viterbi algorithm
This Lecture
Last Lecture
SUPERVISED LEARNING FOR HMMS
Hidden Markov Model

HMM Parameters (Squirrel Hill Tunnel example):

(Graphical model: hidden states Y1 Y2 Y3 Y4 Y5 with observations X1 X2 X3 X4 X5.)

Transition probabilities (row = current state, column = next state):
     O    S    C
O   .9  .08  .02
S   .2   .7   .1
C   .9    0   .1

Emission probabilities (row = state, column = observed travel time):
    1min  2min  3min  …
O    .1    .2    .3
S   .01   .02   .03
C     0     0     0

Initial probabilities:
O  .8
S  .1
C  .1
Training HMMs
Whiteboard
– (Supervised) Likelihood for an HMM
– Maximum Likelihood Estimation (MLE) for HMM
Supervised Learning for HMMs

Learning an HMM decomposes into solving two (independent) Mixture Models: one over transitions (Yt → Yt+1) and one over emissions (Yt → Xt).

HMM Parameters:
Hidden Markov Model

(Graphical model: hidden states Y0 Y1 Y2 Y3 Y4 Y5 with observations X1 X2 X3 X4 X5.)

Assumption: y0 = START. For notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption.

Generative Story: for t = 1, …, T, draw y_t from the transition distribution given y_{t-1} (row y_{t-1} of B), then draw x_t from the emission distribution given y_t (row y_t of A).

Joint Distribution:
p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t), \quad \text{with } y_0 = \text{START}
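To make the factorization concrete, here is a minimal Python sketch (illustrative only; the function name and dictionary-of-dictionaries parameterization are assumptions, not the course's code) that evaluates the joint probability of an observed (x, y) pair using the tunnel parameters from the earlier slide:

def hmm_joint_prob(x, y, B, A, start="START"):
    # p(x, y) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t), with y_0 = START
    prob = 1.0
    prev = start
    for x_t, y_t in zip(x, y):
        prob *= B[prev].get(y_t, 0.0) * A[y_t].get(x_t, 0.0)
        prev = y_t
    return prob

# Transition (B), emission (A), and initial probabilities from the tunnel example,
# with the initial distribution folded into B as the START row.
B = {"START": {"O": .8, "S": .1, "C": .1},
     "O": {"O": .9, "S": .08, "C": .02},
     "S": {"O": .2, "S": .7, "C": .1},
     "C": {"O": .9, "S": .0, "C": .1}}
A = {"O": {"1min": .1, "2min": .2, "3min": .3},
     "S": {"1min": .01, "2min": .02, "3min": .03},
     "C": {"1min": .0, "2min": .0, "3min": .0}}
print(hmm_joint_prob(["2min", "3min"], ["O", "O"], B, A))  # 0.8 * 0.2 * 0.9 * 0.3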
Supervised Learning for HMMs

Learning an HMM decomposes into solving two (independent) Mixture Models: one over transitions (Yt → Yt+1) and one over emissions (Yt → Xt).
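Because the supervised likelihood factors into independent transition and emission models, MLE reduces to count-and-normalize. A hedged sketch (function and variable names are assumptions, not the assignment's API):

from collections import Counter, defaultdict

def hmm_mle(tag_seqs, word_seqs, start="START"):
    # Supervised MLE: normalize transition counts (B) and emission counts (A).
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for tags, words in zip(tag_seqs, word_seqs):
        prev = start
        for tag, word in zip(tags, words):
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    B = {p: {t: c / sum(cnt.values()) for t, c in cnt.items()} for p, cnt in trans.items()}
    A = {t: {w: c / sum(cnt.values()) for w, c in cnt.items()} for t, cnt in emit.items()}
    return B, A

# Tiny usage example:
B, A = hmm_mle([["n", "v", "p", "d", "n"]], [["time", "flies", "like", "an", "arrow"]])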
HMMs: History
• Markov chains: Andrey Markov (1906)
– Random walks and Brownian motion
• Used in Shannon’s work on information theory (1948)
• Baum-Welch learning algorithm: late 60's, early 70's.
  – Used mainly for speech in 60s-70s.
• Late 80's and 90's: David Haussler (major player in learning theory in 80's) began to use HMMs for modeling biological sequences
• Mid-late 1990's: Dayne Freitag/Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …
Slide from William Cohen
Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM)
• 2nd-order HMM (i.e. trigram HMM)
• 3rd-order HMM
(Graphical models for the 1st-, 2nd-, and 3rd-order HMMs over states Y1 … Y5 and observations X1 … X5, differing only in how many previous states each Yt depends on.)
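For reference, the transition structures these diagrams depict can be written as (standard factorizations; the slide itself shows only the graphical models):

1st-order:  p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)
2nd-order:  p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1}, y_{t-2}) \, p(x_t \mid y_t)
3rd-order:  p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1}, y_{t-2}, y_{t-3}) \, p(x_t \mid y_t)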
BACKGROUND: MESSAGE PASSING
Great Ideas in ML: Message Passing

Count the soldiers: soldiers standing in a line pass counts along ("1 behind you", "2 behind you", … "5 behind you" in one direction; "1 before you", … "5 before you" in the other), and each soldier contributes "there's 1 of me".

adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing

Count the soldiers: a soldier who only sees his incoming messages ("2 before you", "3 behind you", plus "there's 1 of me") forms the belief: must be 2 + 1 + 3 = 6 of us.

adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing

Count the soldiers: the next soldier, who also only sees his own incoming messages ("1 before you", "4 behind you", plus "there's 1 of me"), forms the belief: must be 1 + 1 + 4 = 6 of us, the same total the previous soldier got from 2 + 1 + 3.

adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree. A soldier with incoming reports "7 here" and "3 here" sends "11 here (= 7 + 3 + 1)" up the remaining branch; another with two reports of "3 here" sends "7 here (= 3 + 3 + 1)". Combining the reports "7 here", "3 here", and "3 here" with himself, a soldier forms the belief: must be 14 of us.

adapted from MacKay (2003) textbook
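The counting trick translates directly into code. A minimal Python sketch (an illustration, not from the slides): on a chain, each soldier's belief combines the forward message (soldiers before), the backward message (soldiers behind), and himself.

def count_soldiers(n):
    # forward[i]: soldiers before position i; backward[i]: soldiers behind position i.
    forward = [0] * n
    backward = [0] * n
    for i in range(1, n):
        forward[i] = forward[i - 1] + 1
    for i in range(n - 2, -1, -1):
        backward[i] = backward[i + 1] + 1
    # Belief at each position: before + me + behind (every entry equals n).
    return [forward[i] + 1 + backward[i] for i in range(n)]

print(count_soldiers(6))  # [6, 6, 6, 6, 6, 6]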
THE FORWARD-BACKWARD ALGORITHM
Inference for HMMs
Whiteboard
– Three Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
Dataset for Supervised Part-of-Speech (POS) Tagging

Data: D = {x(n), y(n)} for n = 1, …, N

Sample 1 (x(1), y(1)):  time/n  flies/v  like/p  an/d  arrow/n
Sample 2 (x(2), y(2)):  time/n  flies/n  like/v  an/d  arrow/n
Sample 3 (x(3), y(3)):  flies/n  fly/v  with/p  their/n  wings/n
Sample 4 (x(4), y(4)):  with/p  time/n  you/n  will/v  see/v
Hidden Markov Model

A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.

Example sentence and tags:  time/n  flies/v  like/p  an/d  arrow/n

Transition probabilities (row = previous tag, column = next tag):
    v   n   p   d
v  .1  .4  .2  .3
n  .8  .1  .1   0
p  .2  .3  .2  .3
d  .2  .8   0   0

Emission probabilities (row = tag, column = word):
   time  flies  like  …
v   .2     .5    .2
n   .3     .4    .2
p   .1     .1    .3
d   .1     .2    .1

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
Forward-Backward Algorithm

Example: "find preferred tags", with hidden tags Y1 Y2 Y3 and observed words X1 = find, X2 = preferred, X3 = tags.
Each word is ambiguous: "find" could be verb or noun, "preferred" could be adjective or verb, and "tags" could be noun or verb.
Forward-Backward Algorithm

Lattice: each of Y1, Y2, Y3 can take the values v, n, or a, with special START and END states at either end; the observed words are X1 = find, X2 = preferred, X3 = tags.
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …

Emission factors (row = tag, column = word):
    find  pref.  tags  …
v     3     5     3
n     4     5     2
a   0.1   0.2   0.1

Transition factors:
      v    n    a
v     1    6    4
n     8    4  0.1
a   0.1    8    0
Viterbi Algorithm: Most Probable Assignment

(Same lattice: edge weights are transition factors such as B(a, END); node weights are emission factors such as A(tags, n).)
• So p(v a n) = (1/Z) * product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product
Viterbi Algorithm: Most Probable Assignment

• So p(v a n) = (1/Z) * product weight of one path
Forward-Backward Algorithm: Finds Marginals

• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a
• Likewise, p(Y2 = n) = (1/Z) * total weight of all paths through n, and p(Y2 = v) = (1/Z) * total weight of all paths through v
Forward-Backward Algorithm: Finds Marginals

α2(n) = total weight of these path prefixes (paths from START into n at position 2)
β2(n) = total weight of these path suffixes (paths from n at position 2 out to END)
(each found by dynamic programming: matrix-vector products)
Forward-Backward Algorithm: Finds Marginals

α2(n) = total weight of these path prefixes  (a + b + c)
β2(n) = total weight of these path suffixes  (x + y + z)
Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths
Forward-Backward Algorithm: Finds Marginals

total weight of all paths through n  =  α2(n) × A(pref., n) × β2(n)   ("belief that Y2 = n")

Oops! The weight of a path through a state also includes a weight at that state. So α(n)·β(n) isn't enough. The extra weight is the opinion of the emission probability at this variable.
Forward-Backward Algorithm: Finds Marginals

total weight of all paths through v  =  α2(v) × A(pref., v) × β2(v)   ("belief that Y2 = v")
Forward-Backward Algorithm: Finds Marginals

total weight of all paths through a  =  α2(a) × A(pref., a) × β2(a)   ("belief that Y2 = a")

Beliefs at Y2 (unnormalized):  v 0.1   n 0   a 0.4
sum = Z (total weight of all paths) = 0.5
Divide by Z to get marginal probs:  v 0.2   n 0   a 0.8
Forward-Backward Algorithm

(Recap of the running example: "find preferred tags", where "find" could be verb or noun, "preferred" could be adjective or verb, and "tags" could be noun or verb.)
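Written as one formula, the belief computation above is (keeping the lattice convention that the emission weight A(x_2, k) is not folded into α or β):

p(Y_2 = k \mid \mathbf{x}) = \frac{1}{Z} \, \alpha_2(k) \, A(x_2, k) \, \beta_2(k),
\qquad Z = \sum_{k'} \alpha_2(k') \, A(x_2, k') \, \beta_2(k')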
Inference for HMMs
Whiteboard
– Derivation of Forward algorithm
– Forward-backward algorithm
– Viterbi algorithm
Derivation of Forward Algorithm
Derivation:
Definition:
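Since the definition and derivation were developed on the whiteboard, here is a hedged Python sketch of the forward algorithm under the standard textbook definition α_t(k) = p(x_1, …, x_t, y_t = k) (this convention folds the emission at position t into α, unlike the lattice pictures above); B, A, and start are the assumed parameter names from the earlier sketches:

def forward(x, states, B, A, start="START"):
    # alpha[t][k] = p(x_1, ..., x_t, y_t = k)
    T = len(x)
    alpha = [{} for _ in range(T)]
    for k in states:
        alpha[0][k] = B[start].get(k, 0.0) * A[k].get(x[0], 0.0)
    for t in range(1, T):
        for k in states:
            alpha[t][k] = A[k].get(x[t], 0.0) * sum(
                alpha[t - 1][j] * B[j].get(k, 0.0) for j in states)
    likelihood = sum(alpha[T - 1].values())  # the Evaluation problem: p(x)
    return alpha, likelihood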
Forward-Backward Algorithm
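A matching sketch of the full forward-backward algorithm (again illustrative, not the course's reference implementation), which adds a backward pass β_t(k) = p(x_{t+1}, …, x_T | y_t = k) and combines the two to get the per-position marginals (the Marginals problem):

def forward_backward(x, states, B, A, start="START"):
    T = len(x)
    # Forward pass: alpha[t][k] = p(x_1..x_t, y_t = k).
    alpha = [{} for _ in range(T)]
    for k in states:
        alpha[0][k] = B[start].get(k, 0.0) * A[k].get(x[0], 0.0)
    for t in range(1, T):
        for k in states:
            alpha[t][k] = A[k].get(x[t], 0.0) * sum(
                alpha[t - 1][j] * B[j].get(k, 0.0) for j in states)
    Z = sum(alpha[T - 1].values())  # p(x)
    # Backward pass: beta[t][k] = p(x_{t+1}..x_T | y_t = k).
    beta = [{} for _ in range(T)]
    for k in states:
        beta[T - 1][k] = 1.0
    for t in range(T - 2, -1, -1):
        for k in states:
            beta[t][k] = sum(B[k].get(j, 0.0) * A[j].get(x[t + 1], 0.0) * beta[t + 1][j]
                             for j in states)
    # Marginals: p(y_t = k | x) = alpha[t][k] * beta[t][k] / Z.
    return [{k: alpha[t][k] * beta[t][k] / Z for k in states} for t in range(T)]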
Viterbi Algorithm
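And a sketch of Viterbi decoding in the same notation, replacing the forward pass's sum with a max and keeping backpointers to recover the most likely state sequence (the Decoding problem):

def viterbi(x, states, B, A, start="START"):
    T = len(x)
    delta = [{} for _ in range(T)]    # delta[t][k] = best score of any path ending in k at t
    backptr = [{} for _ in range(T)]
    for k in states:
        delta[0][k] = B[start].get(k, 0.0) * A[k].get(x[0], 0.0)
    for t in range(1, T):
        for k in states:
            best = max(states, key=lambda j: delta[t - 1][j] * B[j].get(k, 0.0))
            delta[t][k] = delta[t - 1][best] * B[best].get(k, 0.0) * A[k].get(x[t], 0.0)
            backptr[t][k] = best
    # Trace back from the best final state.
    y = [max(states, key=lambda k: delta[T - 1][k])]
    for t in range(T - 1, 0, -1):
        y.append(backptr[t][y[-1]])
    return list(reversed(y))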
Inference in HMMs
What is the computational complexity of inference for HMMs?
• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T), where K is the number of hidden states and T the sequence length
• The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T·K^2)
  – Thanks to dynamic programming!
Shortcomings of Hidden Markov Models
• HMMs capture dependencies between each state and only its corresponding observation
  – NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental stages), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective function
– HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
© Eric Xing @ CMU, 2005-2015
(Graphical model: START → Y1 → Y2 → … → Yn with emissions X1 … Xn.)
MBR DECODING
Inference for HMMs
– Four Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
4. MBR Decoding: Find the lowest-loss sequence of hidden states, given a sequence of observations (Viterbi decoding is a special case)
Minimum Bayes Risk Decoding
• Suppose we are given a loss function ℓ(y', y) and are asked for a single tagging
• How should we choose just one from our probability distribution p(y|x)?
• A minimum Bayes risk (MBR) decoder h(x) returns the variable assignment with minimum expected loss under the model's distribution:

h_\theta(x) = \operatorname{argmin}_{\hat{y}} \; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[\ell(\hat{y}, y)]
            = \operatorname{argmin}_{\hat{y}} \; \sum_{y} p_\theta(y \mid x) \, \ell(\hat{y}, y)
Minimum Bayes Risk Decoding
Consider some example loss functions:

The 0-1 loss function returns 1 only if the two assignments are identical and 0 otherwise:

\ell(\hat{y}, y) = 1 - \mathbb{I}(\hat{y}, y)

The MBR decoder is:

h_\theta(x) = \operatorname{argmin}_{\hat{y}} \; \sum_{y} p_\theta(y \mid x) \, (1 - \mathbb{I}(\hat{y}, y))
            = \operatorname{argmax}_{\hat{y}} \; p_\theta(\hat{y} \mid x)

which is exactly the Viterbi decoding problem!
Minimum Bayes Risk Decoding
Consider some example loss functions:

The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments:

\ell(\hat{y}, y) = \sum_{i=1}^{V} (1 - \mathbb{I}(\hat{y}_i, y_i))

The MBR decoder is:

\hat{y}_i = h_\theta(x)_i = \operatorname{argmax}_{\hat{y}_i} \; p_\theta(\hat{y}_i \mid x)

This decomposes across variables and requires the variable marginals.
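A minimal sketch of the Hamming-loss MBR decoder (posterior decoding): it picks the marginally most probable tag at each position, so it can reuse the forward_backward() sketch shown earlier (assumed to be in scope); this is an illustration, not the assignment's reference solution.

def mbr_decode_hamming(x, states, B, A, start="START"):
    # Hamming-loss MBR: argmax of the per-position marginal p(y_i | x) at each i.
    marginals = forward_backward(x, states, B, A, start)  # from the earlier sketch
    return [max(states, key=lambda k: marginals[t][k]) for t in range(len(x))]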
Learning Objectives: Hidden Markov Models

You should be able to…
1. Show that structured prediction problems yield high-computation inference problems
2. Define the first order Markov assumption
3. Draw a Finite State Machine depicting a first order Markov assumption
4. Derive the MLE parameters of an HMM
5. Define the three key problems for an HMM: evaluation, decoding, and marginal computation
6. Derive a dynamic programming algorithm for computing the marginal probabilities of an HMM
7. Interpret the forward-backward algorithm as a message passing algorithm
8. Implement supervised learning for an HMM
9. Implement the forward-backward algorithm for an HMM
10. Implement the Viterbi algorithm for an HMM
11. Implement a minimum Bayes risk decoder with Hamming loss for an HMM