
Hidden Markov Models

1

10-601 Introduction to Machine Learning

Matt Gormley
Lecture 20

Nov. 7, 2018

Machine Learning Department, School of Computer Science, Carnegie Mellon University

Reminders

• Homework 6: PAC Learning / Generative Models
  – Out: Wed, Oct 31
  – Due: Wed, Nov 7 at 11:59pm (1 week)

• Homework 7: HMMs
  – Out: Wed, Nov 7
  – Due: Mon, Nov 19 at 11:59pm

2

HMM Outline
• Motivation
  – Time Series Data
• Hidden Markov Model (HMM)
  – Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]
  – Background: Markov Models
  – From Mixture Model to HMM
  – History of HMMs
  – Higher-order HMMs
• Training HMMs
  – (Supervised) Likelihood for HMM
  – Maximum Likelihood Estimation (MLE) for HMM
  – EM for HMM (aka. Baum-Welch algorithm)
• Forward-Backward Algorithm
  – Three Inference Problems for HMM
  – Great Ideas in ML: Message Passing
  – Example: Forward-Backward on 3-word Sentence
  – Derivation of Forward Algorithm
  – Forward-Backward Algorithm
  – Viterbi algorithm

3

This Lecture

Last Lecture

SUPERVISED LEARNING FOR HMMS

4

HMM Parameters:

Hidden Markov Model

6

X1 X2 X3 X4 X5

Y1 Y2 Y3 Y4 Y5

Transition probabilities (row = current state, column = next state):
      O     S     C
O    .9    .08   .02
S    .2    .7    .1
C    .9     0    .1

Emission probabilities (row = state, column = observed travel time):
      1min   2min   3min   …
O     .1     .2     .3     …
S     .01    .02    .03    …
C      0      0      0     …

Initial probabilities:
O    .8
S    .1
C    .1
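To make the tables above concrete, here is a small illustration (my own, not from the slides) that stores the tunnel-example parameters as numpy arrays; the emission table on the slide continues past 3min, so only the columns shown are filled in:

```python
import numpy as np

states = ["O", "S", "C"]           # hidden states from the slide's tables
obs    = ["1min", "2min", "3min"]  # observed travel times (the slide's table continues with "...")

# Initial probabilities, one per hidden state
init = np.array([0.8, 0.1, 0.1])

# Transition probabilities: trans[i, j] = p(next state = states[j] | current state = states[i])
trans = np.array([
    [0.9, 0.08, 0.02],
    [0.2, 0.7,  0.1 ],
    [0.9, 0.0,  0.1 ],
])

# Emission probabilities (truncated): emit[i, k] = p(obs[k] | states[i])
emit = np.array([
    [0.1,  0.2,  0.3 ],
    [0.01, 0.02, 0.03],
    [0.0,  0.0,  0.0 ],
])
```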

Training HMMs

Whiteboard:
– (Supervised) Likelihood for an HMM
– Maximum Likelihood Estimation (MLE) for HMM

7

Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models

8

[Figure: the two pieces, a transition model (Yt → Yt+1) and an emission model (Yt → Xt)]

HMM Parameters:

Assumption:
Generative Story:

Hidden Markov Model

9

X1 X2 X3 X4 X5

Y0 Y1 Y2 Y3 Y4 Y5

y0 = START

For notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption.

Joint Distribution:

Hidden Markov Model

10

X1 X2 X3 X4 X5

Y0 Y1 Y2 Y3 Y4 Y5

y0 = START
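The joint distribution equation itself is not reproduced in this transcript; a standard form, consistent with folding the initial probabilities into the transition distribution via y0 = START, is:

$$p(x_1, \dots, x_T, y_1, \dots, y_T) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\; p(x_t \mid y_t), \qquad y_0 = \text{START}$$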

Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models

11

[Figure: the two pieces, a transition model (Yt → Yt+1) and an emission model (Yt → Xt)]
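As a concrete illustration of that decomposition, here is a minimal count-and-normalize sketch of the supervised MLE (my own code with hypothetical names such as mle_hmm, not taken from the lecture): the transition distribution is estimated from tag-to-tag counts, with a <START> symbol folded in as above, and the emission distribution from tag-to-word counts.

```python
from collections import Counter, defaultdict

def mle_hmm(tag_seqs, word_seqs, start="<START>"):
    """Supervised MLE for an HMM: normalize transition and emission counts."""
    trans_counts = defaultdict(Counter)   # trans_counts[prev_tag][tag]
    emit_counts  = defaultdict(Counter)   # emit_counts[tag][word]
    for tag_seq, word_seq in zip(tag_seqs, word_seqs):
        prev = start
        for tag, word in zip(tag_seq, word_seq):
            trans_counts[prev][tag] += 1
            emit_counts[tag][word]  += 1
            prev = tag
    # Normalize each row of counts into a conditional distribution.
    trans = {prev: {t: c / sum(row.values()) for t, c in row.items()}
             for prev, row in trans_counts.items()}
    emit  = {tag: {w: c / sum(row.values()) for w, c in row.items()}
             for tag, row in emit_counts.items()}
    return trans, emit

# Tiny example in the format of the POS data used later in the lecture:
tags_ex  = [["n", "v", "p", "d", "n"]]
words_ex = [["time", "flies", "like", "an", "arrow"]]
trans, emit = mle_hmm(tags_ex, words_ex)
```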

HMMs: History

• Markov chains: Andrey Markov (1906)

– Random walks and Brownian motion

• Used in Shannon’s work on information theory (1948)

• Baum-Welch learning algorithm: late 60’s, early 70’s.

– Used mainly for speech in 60s-70s.

• Late 80’s and 90’s: David Haussler (major player in learning theory in 80’s) began to use HMMs for modeling biological sequences

• Mid-late 1990’s: Dayne Freitag/Andrew McCallum

– Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.

– McCallum: multinomial Naïve Bayes for text

– With McCallum, IE using HMMs on CORA

• …

13
Slide from William Cohen

Higher-order HMMs

• 1st-order HMM (i.e. bigram HMM)

• 2nd-order HMM (i.e. trigram HMM)

• 3rd-order HMM

14

[Figures: graphical models for the 1st-, 2nd-, and 3rd-order HMMs, each over states Y1 … Y5, observations X1 … X5, and a <START> state]
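Written out (a standard factorization, not copied from the slides), the three models condition each tag on increasingly more history while the emissions stay the same:

$$
\begin{aligned}
\text{1st-order:} \quad & p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1})\, p(x_t \mid y_t) \\
\text{2nd-order:} \quad & p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1}, y_{t-2})\, p(x_t \mid y_t) \\
\text{3rd-order:} \quad & p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1}, y_{t-2}, y_{t-3})\, p(x_t \mid y_t)
\end{aligned}
$$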

BACKGROUND: MESSAGE PASSING

15

Great Ideas in ML: Message Passing

3 behind you

2 behind you

1 behind you

4 behind you

5 behind you

1 before you

2 before you

there's 1 of me

3 before you

4 before you

5 before you

Count the soldiers

16
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing

3 behind you

2 before you

there's 1 of me

Belief: Must be 2 + 1 + 3 = 6 of us

only see my incoming messages (2, 1, 3)

Count the soldiers

17
adapted from MacKay (2003) textbook

2 before you

Great Ideas in ML: Message Passing

4 behind you

1 before you

there's 1 of me

only see my incoming messages

Count the soldiers

18
adapted from MacKay (2003) textbook

Belief: Must be 2 + 1 + 3 = 6 of us (incoming messages: 2, 1, 3)

Belief: Must be 1 + 1 + 4 = 6 of us (incoming messages: 1, 1, 4)

Great Ideas in ML: Message Passing

7 here

3 here

11 here (= 7+3+1)

1 of me

Each soldier receives reports from all branches of tree

19
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing

3 here

3 here

7 here (= 3+3+1)

Each soldier receives reports from all branches of tree

20
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing

7 here

3 here

11 here (= 7+3+1)

Each soldier receives reports from all branches of tree

21
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing

7 here

3 here

3 here

Belief: Must be 14 of us

Each soldier receives reports from all branches of tree

22
adapted from MacKay (2003) textbook

Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

7 here

3 here

3 here

Belief: Must be 14 of us

23
adapted from MacKay (2003) textbook

THE FORWARD-BACKWARD ALGORITHM

24

Inference for HMMs

Whiteboard– Three Inference Problems for an HMM

1. Evaluation: Compute the probability of a given sequence of observations

2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations

3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations

25

Dataset for Supervised Part-of-Speech (POS) Tagging

26

Data: $\mathcal{D} = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$

Sample 1 (y(1), x(1)):   n v p d n   /   time flies like an arrow
Sample 2 (y(2), x(2)):   n n v d n   /   time flies like an arrow
Sample 3 (y(3), x(3)):   n v p n n   /   flies fly with their wings
Sample 4 (y(4), x(4)):   p n n v v   /   with time you will see

time flies like an arrow

<START> n v p d n

Hidden Markov Model

28

A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.

Transition probabilities (row = previous tag, column = next tag):
     v    n    p    d
v   .1   .4   .2   .3
n   .8   .1   .1    0
p   .2   .3   .2   .3
d   .2   .8    0    0

Emission probabilities (row = tag, column = word):
     time  flies  like  …
v    .2    .5     .2    …
n    .3    .4     .2    …
p    .1    .1     .3    …
d    .1    .2     .1    …

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)

X1 X2 X3

Y1 Y2 Y3

29

find preferred tags

Could be adjective or verb / Could be noun or verb / Could be verb or noun

Forward-Backward Algorithm

30

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm

31

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm

32

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …

Y1 Y2 Y3
X1 X2 X3
find preferred tags

33

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …

Forward-Backward Algorithm

Y1 Y2 Y3
X1 X2 X3
find preferred tags

34

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …

Forward-Backward Algorithm

Y1 Y2 Y3
X1 X2 X3
find preferred tags

35

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …

Forward-Backward Algorithm

Emission factors (row = tag, column = word):
     find  pref.  tags  …
v    3     5      3
n    4     5      2
a    0.1   0.2    0.1

Transition factors (row = previous tag, column = next tag):
     v     n     a
v    1     6     4
n    8     4     0.1
a    0.1   8     0

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Viterbi Algorithm: Most Probable Assignment

36

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product

B(a, END)
A(tags, n)

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Viterbi Algorithm: Most Probable Assignment

37

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path

B(a, END)
A(tags, n)

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm: Finds Marginals

38

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm: Finds Marginals

39

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm: Finds Marginals

40

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = v) = (1/Z) * total weight of all paths through v

Y1 Y2 Y3
X1 X2 X3
find preferred tags

Forward-Backward Algorithm: Finds Marginals

41

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n

Y1 Y2 Y3
X1 X2 X3
find preferred tags

α2(n) = total weight of these path prefixes

(found by dynamic programming: matrix-vector products)

Forward-Backward Algorithm: Finds Marginals

42

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Y1 Y2 Y3
X1 X2 X3
find preferred tags

β2(n) = total weight of these path suffixes

(found by dynamic programming: matrix-vector products)

Forward-Backward Algorithm: Finds Marginals

43

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Y1 Y2 Y3
X1 X2 X3
find preferred tags

α2(n) = total weight of these path prefixes
β2(n) = total weight of these path suffixes

Forward-Backward Algorithm: Finds Marginals

44

α2(n) = (a + b + c)    β2(n) = (x + y + z)

Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths

Y1 Y2 Y3
X1 X2 X3
find preferred tags

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Forward-Backward Algorithm: Finds Marginals

45

total weight of all paths through n = α2(n) × A(pref., n) × β2(n)

“belief that Y2 = n”

Oops! The weight of a path through a state also includes a weight at that state. So α(n)·β(n) isn't enough. The extra weight is the opinion of the emission probability at this variable.

Y1 Y2 Y3
X1 X2 X3
find preferred tags

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Forward-Backward Algorithm: Finds Marginals

46

total weight of all paths through v = α2(v) × A(pref., v) × β2(v)

“belief that Y2 = n”
“belief that Y2 = v”

Y1 Y2 Y3
X1 X2 X3
find preferred tags

[Lattice: Y1, Y2, Y3 each range over the tags {v, n, a}; paths run from START to END]

Forward-Backward Algorithm: Finds Marginals

47

total weight of all paths through a = α2(a) × A(pref., a) × β2(a)

“belief that Y2 = n”
“belief that Y2 = v”
“belief that Y2 = a”

sum = Z (total weight of all paths)

Beliefs:          Marginals (divide by Z = 0.5):
v   0.1           v   0.2
n   0             n   0
a   0.4           a   0.8
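Written as one formula (a reconstruction using the notation of the preceding slides, where A(pref., t) is the emission weight for the word "preferred" under tag t), the marginal is the belief divided by Z:

$$p(Y_2 = t \mid \mathbf{x}) \;=\; \frac{\alpha_2(t)\, A(\text{pref.}, t)\, \beta_2(t)}{Z}, \qquad Z = \sum_{t' \in \{v, n, a\}} \alpha_2(t')\, A(\text{pref.}, t')\, \beta_2(t')$$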

X1 X2 X3

Y1 Y2 Y3

48

find preferred tags

Could be adjective or verb / Could be noun or verb / Could be verb or noun

Forward-Backward Algorithm

Inference for HMMs

Whiteboard:
– Derivation of Forward algorithm
– Forward-backward algorithm
– Viterbi algorithm

49

Derivation of Forward Algorithm

50

Definition:

Derivation:
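The definition and derivation were worked on the whiteboard; a standard reconstruction of the forward quantity and its recursion (using the START convention above) is:

$$\alpha_t(k) \;\triangleq\; p(x_1, \dots, x_t,\, y_t = k) \;=\; p(x_t \mid y_t = k) \sum_{j} p(y_t = k \mid y_{t-1} = j)\, \alpha_{t-1}(j)$$

with $\alpha_1(k) = p(y_1 = k \mid \text{START})\, p(x_1 \mid y_1 = k)$, so the evaluation problem is solved by $p(x_1, \dots, x_T) = \sum_k \alpha_T(k)$.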

Forward-Backward Algorithm

51
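The forward-backward derivation itself was done on the whiteboard; below is a minimal numpy sketch of the two passes (my own illustrative code, not the lecture's). One convention note: unlike the lattice pictures above, here the emission weight at position t is folded into alpha[t], so alpha * beta / Z directly gives the marginals.

```python
import numpy as np

def forward_backward(init, trans, emit, obs):
    """
    init:  (K,)   initial state distribution (the START row, in the lecture's convention)
    trans: (K, K) trans[i, j] = p(y_t = j | y_{t-1} = i)
    emit:  (K, V) emit[k, v]  = p(x_t = v | y_t = k)
    obs:   length-T list of observation indices
    Returns alpha, beta, and the per-position marginals p(y_t | x).
    """
    T, K = len(obs), len(init)
    alpha = np.zeros((T, K))
    beta  = np.zeros((T, K))

    # Forward pass: alpha[t, k] = p(x_1..x_t, y_t = k)
    alpha[0] = init * emit[:, obs[0]]
    for t in range(1, T):
        alpha[t] = emit[:, obs[t]] * (alpha[t - 1] @ trans)

    # Backward pass: beta[t, k] = p(x_{t+1}..x_T | y_t = k)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])

    Z = alpha[T - 1].sum()        # p(x_1..x_T): solves the evaluation problem
    marginals = alpha * beta / Z  # p(y_t = k | x_1..x_T)
    return alpha, beta, marginals
```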

Viterbi Algorithm

52
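Similarly, a minimal Viterbi sketch under the same parameterization (again my own illustration, not code from the lecture):

```python
import numpy as np

def viterbi(init, trans, emit, obs):
    """Most probable hidden-state sequence for the given observations."""
    T, K = len(obs), len(init)
    delta = np.zeros((T, K))            # best path score ending in state k at time t
    back  = np.zeros((T, K), dtype=int) # backpointers to the best previous state

    delta[0] = init * emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans   # scores[i, j]: prev state i -> state j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, obs[t]]

    # Trace back the highest-scoring path.
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```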

Inference in HMMs

What is the computational complexity of inference for HMMs?

• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T), where K is the number of hidden states and T is the sequence length

• The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T*K^2)
– Thanks to dynamic programming!

53

Shortcomings of Hidden Markov Models

• HMMs capture dependencies between each state and only its corresponding observation

– NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental stages), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.

• Mismatch between learning objective function and prediction objective function

– HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)

© Eric Xing @ CMU, 2005-2015
54

Y1 Y2 … … … Yn

X1 X2 … … … Xn

START

MBR DECODING

55

Inference for HMMs

– Three Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
4. MBR Decoding: Find the lowest loss sequence of hidden states, given a sequence of observations (Viterbi decoding is a special case)

56

Minimum Bayes Risk Decoding

• Suppose we are given a loss function ℓ(ŷ, y) and are asked for a single tagging
• How should we choose just one from our probability distribution p(y|x)?
• A minimum Bayes risk (MBR) decoder h(x) returns the variable assignment with minimum expected loss under the model's distribution:

57

$$h_\theta(x) = \operatorname*{argmin}_{\hat{y}}\; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\left[\ell(\hat{y}, y)\right] = \operatorname*{argmin}_{\hat{y}} \sum_{y} p_\theta(y \mid x)\, \ell(\hat{y}, y)$$

Minimum Bayes Risk Decoding

Consider some example loss functions:

58

The 0-1 loss function returns 1 only if the two assignments are identical and 0 otherwise:

$$\ell(\hat{y}, y) = 1 - I(\hat{y}, y)$$

The MBR decoder is:

$$h_\theta(x) = \operatorname*{argmin}_{\hat{y}} \sum_{y} p_\theta(y \mid x)\left(1 - I(\hat{y}, y)\right) = \operatorname*{argmax}_{\hat{y}}\; p_\theta(\hat{y} \mid x)$$

which is exactly the Viterbi decoding problem!

Minimum Bayes Risk Decoding

Consider some example loss functions:

59

The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments:

$$\ell(\hat{y}, y) = \sum_{i=1}^{V} \left(1 - I(\hat{y}_i, y_i)\right)$$

The MBR decoder is:

$$\hat{y}_i = h_\theta(x)_i = \operatorname*{argmax}_{\hat{y}_i}\; p_\theta(\hat{y}_i \mid x)$$

This decomposes across variables and requires the variable marginals.
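Because the Hamming-loss MBR decoder decomposes into a per-position argmax over marginals, it is a one-liner on top of forward-backward. The sketch below assumes the hypothetical forward_backward helper from the earlier sketch, not anything provided by the lecture:

```python
def mbr_decode_hamming(init, trans, emit, obs):
    """Minimum Bayes risk decoding under Hamming loss:
    pick the marginally most probable state at each position."""
    # forward_backward() is the illustrative helper defined in the earlier sketch.
    _, _, marginals = forward_backward(init, trans, emit, obs)
    return marginals.argmax(axis=1).tolist()
```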

Learning Objectives
Hidden Markov Models

You should be able to…
1. Show that structured prediction problems yield high-computation inference problems
2. Define the first order Markov assumption
3. Draw a Finite State Machine depicting a first order Markov assumption
4. Derive the MLE parameters of an HMM
5. Define the three key problems for an HMM: evaluation, decoding, and marginal computation
6. Derive a dynamic programming algorithm for computing the marginal probabilities of an HMM
7. Interpret the forward-backward algorithm as a message passing algorithm
8. Implement supervised learning for an HMM
9. Implement the forward-backward algorithm for an HMM
10. Implement the Viterbi algorithm for an HMM
11. Implement a minimum Bayes risk decoder with Hamming loss for an HMM

60
