Lecture 13: Hidden Markov Model - Shuai Li


Lecture 13: Hidden Markov Model

Shuai Li

John Hopcroft Center, Shanghai Jiao Tong University

https://shuaili8.github.io

https://shuaili8.github.io/Teaching/VE445/index.html


A Markov system

• There are 𝑁 states 𝑆1, 𝑆2, … , 𝑆𝑁, and the time steps are discrete, 𝑑 = 0, 1, 2, …

• On the 𝑑-th time step the system is in exactly one of the available states. Call it 𝑞𝑑

• Between each time step, the next state is chosen only based on the information provided by the current state 𝑞𝑑

• The current state determines the probability distribution for the next state (a small simulation sketch follows below)
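Below is a minimal Python/NumPy sketch of such a Markov system. The three states and the transition probabilities are made-up values for illustration only, not taken from the lecture.

```python
import numpy as np

# Hypothetical 3-state Markov system (states S1, S2, S3); values are assumptions.
# Row i of A holds the distribution of the next state given the current state Si.
A = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.3, 0.3, 0.4],
])

rng = np.random.default_rng(0)

def simulate(A, q0, T):
    """Run the chain for T steps starting from state index q0."""
    q, path = q0, [q0]
    for _ in range(T):
        q = rng.choice(len(A), p=A[q])  # next state depends only on the current state
        path.append(int(q))
    return path

print(simulate(A, q0=2, T=10))  # e.g. start in S3 (index 2)
```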

Example

• Three states

• Current state: 𝑆3

Example (cont.)

• Three states

• Current state: 𝑆2

Example (cont.)

• Three states

• The transition matrix

Example (cont.)

Markovian property

• 𝑞𝑑+1 is independent of 𝑞𝑑−1, 𝑞𝑑−2, … , 𝑞0 given 𝑞𝑑

• In other words: 𝑃(𝑞𝑑+1 = 𝑆𝑗 | 𝑞𝑑 = 𝑆𝑖, 𝑞𝑑−1, … , 𝑞0) = 𝑃(𝑞𝑑+1 = 𝑆𝑗 | 𝑞𝑑 = 𝑆𝑖)

Example 2


Markovian property


Example

• A human and a robot wander around randomly on a grid

Note: N (num. states) = 18 * 18 = 324

Example (cont.)

• Each time step the human/robot moves randomly to an adjacent cell

• Typical questions:

• "What's the expected time until the human is crushed like a bug?"

• "What's the probability that the robot will hit the left wall before it hits the human?"

• "What's the probability the robot crushes the human on the next time step?"

Example (cont.)

• The current time is 𝑑, and the human remains uncrushed. What's the probability of crushing occurring at time 𝑑 + 1?

• If the robot is blind:

• We can compute this in advance

• If the robot is omnipotent (i.e. if the robot knows the current state):

• We can compute this directly

• If the robot has some sensors, but incomplete state information:

• Hidden Markov Models are applicable

𝑃(𝑞𝑑 = 𝑠) -- A clumsy solution

• Step 1: Work out how to compute 𝑃(𝑄) for any path 𝑄 = 𝑞1𝑞2⋯𝑞𝑑

• Step 2: Use this knowledge to get 𝑃(𝑞𝑑 = 𝑠)

𝑃(𝑞𝑑 = 𝑠) -- A cleverer solution

• For each state 𝑆𝑖, define 𝑝𝑑(𝑖) = 𝑃(𝑞𝑑 = 𝑆𝑖) to be the probability of state 𝑆𝑖 at time 𝑑

• Easy to do inductive computation (see the sketch below)

𝑎𝑖𝑗 = 𝑃(𝑞𝑑+1 = 𝑆𝑗 | 𝑞𝑑 = 𝑆𝑖)

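A minimal sketch of this inductive computation in Python/NumPy; the transition matrix and the initial distribution are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Hypothetical 3-state transition matrix, A[i, j] = P(q_{t+1} = Sj | q_t = Si).
A = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.3, 0.3, 0.4],
])
p = np.array([0.0, 0.0, 1.0])  # p_0(i): assume we start in S3 with probability 1

t = 10
for _ in range(t):
    # p_{t+1}(j) = sum_i p_t(i) * A[i, j]  -- one O(N^2) update per time step
    p = p @ A

print(p)  # p[i] = P(q_t = S_i) after t steps
```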

Complexity comparison

• Cost of computing 𝑝𝑑(𝑖) for all states 𝑆𝑖 is now 𝑂(𝑑𝑁²)

• Why?

• The first method costs 𝑂(𝑁ᵗ)

• Why?

• This is the power of dynamic programming, which is widely used for HMMs

Example (cont.)

• It's currently time 𝑑, and the human remains uncrushed. What's the probability of crushing occurring at time 𝑑 + 1?

• If the robot is blind:

• We can compute this in advance

• If the robot is omnipotent (i.e. if the robot knows the state at time 𝑑):

• We can compute this directly

• If the robot has some sensors, but incomplete state information:

• Hidden Markov Models are applicable

Hidden state

• The previous example tries to estimate 𝑃(𝑞𝑑 = 𝑆𝑖) unconditionally (no other information)

• Suppose we can observe something that's affected by the true state

What the robot sees (uncorrupted data)

What the robot sees (corrupted data)

Noisy observation of hidden state

• Let's denote the observation at time 𝑑 by 𝑂𝑑

• 𝑂𝑑 is noisily determined depending on the current state

• Assume that 𝑂𝑑 is conditionally independent of 𝑞𝑑−1, 𝑞𝑑−2, … , 𝑞0, 𝑂𝑑−1, 𝑂𝑑−2, … , 𝑂1, 𝑂0 given 𝑞𝑑

• In other words: 𝑃(𝑂𝑑 = 𝑋 | 𝑞𝑑 = 𝑆𝑖, any earlier history) = 𝑃(𝑂𝑑 = 𝑋 | 𝑞𝑑 = 𝑆𝑖)

Example


Example (cont.)


Hidden Markov models

• The robot with noisy sensors is a good example

• Question 1: (Evaluation) State estimation:

• What is 𝑃(𝑞𝑑 = 𝑆𝑖 | 𝑂1, … , 𝑂𝑑)?

• Question 2: (Inference) Most probable path:

• Given 𝑂1, … , 𝑂𝑑, what is the most probable path of the states? And what is the probability?

• Question 3: (Learning) Learning HMMs:

• Given 𝑂1, … , 𝑂𝑑, what is the maximum likelihood HMM that could have produced this string of observations?

• MLE

Application of HMM

• Robot planning + sensing when there's uncertainty

• Speech recognition/understanding

• Phones → Words, Signal → phones

• Human genome project

• Consumer decision modeling

• Economics and finance

• …

Basic operations in HMMs

• For an observation sequence 𝑂 = 𝑂1, … , 𝑂𝑇, three basic HMM operations are:

T = # timesteps, N = # states

Formal definition of HMM

• The states are labeled 𝑆1, 𝑆2, … , 𝑆𝑁

• For a particular trial, let

• 𝑇 be the number of observations

• 𝑁 be the number of states

• 𝑀 be the number of possible observations

• 𝜋1, 𝜋2, … , 𝜋𝑁 are the starting state probabilities

• 𝑂 = 𝑂1…𝑂𝑇 is a sequence of observations

• 𝑄 = 𝑞1𝑞2⋯𝑞𝑑 is a path of states

• Then 𝜆 = (𝑁, 𝑀, {𝜋𝑖}, {𝑎𝑖𝑗}, {𝑏𝑖(𝑗)}) is the specification of an HMM

➢ The definitions of 𝑎𝑖𝑗 and 𝑏𝑖(𝑗) are introduced on the next slide

Formal definition of HMM (cont.)

• The definition of 𝑎𝑖𝑗 and 𝑏𝑖(𝑗):

• 𝑎𝑖𝑗 = 𝑃(𝑞𝑑+1 = 𝑆𝑗 | 𝑞𝑑 = 𝑆𝑖) is the state transition probability

• 𝑏𝑖(𝑗) = 𝑃(𝑂𝑑 = 𝑗 | 𝑞𝑑 = 𝑆𝑖) is the probability of observing symbol 𝑗 in state 𝑆𝑖

Example

• Start randomly in state 1 or 2

• Choose one of the output symbols in each state at random

Example (cont.)

• Start randomly in state 1 or 2

• Choose one of the output symbols in each state at random.

• Let's generate a sequence of observations:
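The actual start, transition, and output probabilities of this example are given in the slide figures, which did not survive in this transcript. The sketch below only illustrates the generative process with placeholder values (three states, output symbols X/Y/Z, starting in state 1 or 2); all of the numbers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder HMM parameters (illustrative assumptions, not the lecture's values).
pi = np.array([0.5, 0.5, 0.0])           # start randomly in state 1 or 2
A = np.array([[0.0, 0.5, 0.5],           # A[i, j] = P(q_{t+1} = Sj | q_t = Si)
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
B = np.array([[0.5, 0.5, 0.0],           # B[i, k] = P(O_t = symbol k | q_t = Si)
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
symbols = ['X', 'Y', 'Z']

def generate(T):
    """Sample a state path and an observation sequence of length T."""
    q = rng.choice(3, p=pi)
    states, obs = [], []
    for _ in range(T):
        states.append(int(q))
        obs.append(symbols[rng.choice(3, p=B[q])])
        q = rng.choice(3, p=A[q])
    return states, obs

print(generate(3))  # e.g. an observation sequence such as X X Z
```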


Probability of a series of observations

• What is 𝑃(𝑂) = 𝑃(𝑂1𝑂2𝑂3) = 𝑃(𝑂1 = 𝑋 ∧ 𝑂2 = 𝑋 ∧ 𝑂3 = 𝑍)?

• Slow, stupid way: 𝑃(𝑂) = Σ𝑄 𝑃(𝑂 ∧ 𝑄) = Σ𝑄 𝑃(𝑂|𝑄)𝑃(𝑄), summing over all paths 𝑄 of length 3

• How do we compute 𝑃(𝑄) for an arbitrary path 𝑄?

• How do we compute 𝑃(𝑂|𝑄) for an arbitrary path 𝑄?

Probability of a series of observations (cont.)

• 𝑃(𝑄) for an arbitrary path 𝑄:

• 𝑃(𝑄) = 𝑃(𝑞1𝑞2⋯𝑞𝑑) = 𝑃(𝑞1)𝑃(𝑞2|𝑞1)𝑃(𝑞3|𝑞2)⋯𝑃(𝑞𝑑|𝑞𝑑−1), by the Markovian property

Probability of a series of observations (cont.)

• 𝑃(𝑂|𝑄) for an arbitrary path 𝑄:

• 𝑃(𝑂|𝑄) = 𝑃(𝑂1𝑂2⋯𝑂𝑑 | 𝑞1𝑞2⋯𝑞𝑑) = 𝑃(𝑂1|𝑞1)𝑃(𝑂2|𝑞2)⋯𝑃(𝑂𝑑|𝑞𝑑), since each observation depends only on the current state
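A minimal sketch of these two products, reusing the placeholder parameters from the generation sketch above (the values remain illustrative assumptions).

```python
import numpy as np

# Same placeholder parameters as in the generation sketch (assumed values).
pi = np.array([0.5, 0.5, 0.0])
A = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
sym = {'X': 0, 'Y': 1, 'Z': 2}

def prob_path(Q):
    """P(Q) = pi_{q1} * prod_t a_{q_{t-1} q_t}  (Markovian property)."""
    p = pi[Q[0]]
    for prev, cur in zip(Q, Q[1:]):
        p *= A[prev, cur]
    return p

def prob_obs_given_path(O, Q):
    """P(O|Q) = prod_t b_{q_t}(O_t)  (observations conditionally independent given the states)."""
    p = 1.0
    for o, q in zip(O, Q):
        p *= B[q, sym[o]]
    return p

Q = [0, 1, 2]                  # one arbitrary path: S1 -> S2 -> S3
O = ['X', 'X', 'Z']
print(prob_path(Q), prob_obs_given_path(O, Q))
```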


Probability of a series of observations (cont.)

• Computational complexity of the slow, stupid answer:

• 𝑃(𝑂) would require 27 𝑃(𝑄) computations and 27 𝑃(𝑂|𝑄) computations

• A sequence of 20 observations would need 3²⁰ ≈ 3.5 billion 𝑃(𝑄) and 3.5 billion 𝑃(𝑂|𝑄) computations

• So we have to find some smarter answer

Probability of a series of observations (cont.)

• Smart answer (based on dynamic programming)

• Given observations 𝑂1𝑂2…𝑂𝑇

• Define the forward variable 𝛼𝑑(𝑖) = 𝑃(𝑂1𝑂2…𝑂𝑑 ∧ 𝑞𝑑 = 𝑆𝑖 | 𝜆)

• In the example, what is 𝛼2(3)?

𝛼𝑑(𝑖): easy to define recursively

• 𝛼1(𝑖) = 𝜋𝑖 𝑏𝑖(𝑂1)

• 𝛼𝑑+1(𝑗) = 𝑏𝑗(𝑂𝑑+1) Σ𝑖 𝛼𝑑(𝑖) 𝑎𝑖𝑗
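This recursion is the forward algorithm. A minimal Python/NumPy sketch follows, again using the placeholder parameters from the earlier sketches (assumed values) and the observed sequence X X Z.

```python
import numpy as np

def forward(O, pi, A, B):
    """Forward algorithm: row t-1 of alpha holds alpha_t(i); also returns P(O) = sum_i alpha_T(i)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]  # alpha_{t+1}(j) = b_j(O_{t+1}) * sum_i alpha_t(i) a_ij
    return alpha, alpha[-1].sum()

# Placeholder parameters (illustrative assumptions); observations X X Z encoded as X=0, Y=1, Z=2.
pi = np.array([0.5, 0.5, 0.0])
A = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
alpha, prob_O = forward([0, 0, 2], pi, A, B)
print(alpha)                 # each row costs O(N^2) to compute, so O(T N^2) overall
print("P(O) =", prob_O)
```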


𝛼𝑑(𝑖) in the example

• We see 𝑂1𝑂2𝑂3 = 𝑋𝑋𝑍

Easy question

• We can cheaply compute 𝛼𝑑(𝑖) = 𝑃(𝑂1𝑂2…𝑂𝑑 ∧ 𝑞𝑑 = 𝑆𝑖)

• (How) can we cheaply compute 𝑃(𝑂1𝑂2…𝑂𝑑)?

• (How) can we cheaply compute 𝑃(𝑞𝑑 = 𝑆𝑖 | 𝑂1𝑂2…𝑂𝑑)?

Easy question (cont.)

• We can cheaply compute 𝛼𝑑(𝑖) = 𝑃(𝑂1𝑂2…𝑂𝑑 ∧ 𝑞𝑑 = 𝑆𝑖)

• We can cheaply compute 𝑃(𝑂1𝑂2…𝑂𝑑) = Σ𝑖 𝛼𝑑(𝑖)

• We can cheaply compute 𝑃(𝑞𝑑 = 𝑆𝑖 | 𝑂1𝑂2…𝑂𝑑) = 𝛼𝑑(𝑖) / Σ𝑗 𝛼𝑑(𝑗)

Recall: Hidden Markov models

• The robot with noisy sensors is a good example

• Question 1: (Evaluation) State estimation:

• What is 𝑃(𝑞𝑑 = 𝑆𝑖 | 𝑂1, … , 𝑂𝑑)?

• Question 2: (Inference) Most probable path:

• Given 𝑂1, … , 𝑂𝑑, what is the most probable path of the states? And what is the probability?

• Question 3: (Learning) Learning HMMs:

• Given 𝑂1, … , 𝑂𝑑, what is the maximum likelihood HMM that could have produced this string of observations?

• MLE

Most probable path (MPP) given observations


Efficient MPP computation

• We're going to compute the following variables:

• 𝛿𝑑(𝑖) = max over 𝑞1𝑞2⋯𝑞𝑑−1 of 𝑃(𝑞1𝑞2⋯𝑞𝑑−1 ∧ 𝑞𝑑 = 𝑆𝑖 ∧ 𝑂1𝑂2…𝑂𝑑)

• It's the probability of the path of length 𝑑 − 1 with the maximum chance of doing all these things: occurring, ending up in state 𝑆𝑖, and producing output 𝑂1…𝑂𝑑

• Define: mpp𝑑(𝑖) = that path

• So: 𝛿𝑑(𝑖) = Prob(mpp𝑑(𝑖))

The Viterbi algorithm



The Viterbi algorithm (cont.)

• Summary
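The Viterbi recursion itself appears in the slide figures. Below is a minimal sketch of the standard dynamic-programming form, once more with the assumed placeholder parameters (not the lecture's numbers).

```python
import numpy as np

def viterbi(O, pi, A, B):
    """Return the most probable state path for observations O and its probability."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))            # delta_t(i): prob. of best path ending in S_i producing O_1..O_t
    back = np.zeros((T, N), dtype=int)  # best predecessor of each state at each step
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A      # scores[i, j] = delta_{t-1}(i) * a_ij
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    # Backtrack the most probable path.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

# Placeholder parameters (illustrative assumptions); observations X X Z (X=0, Y=1, Z=2).
pi = np.array([0.5, 0.5, 0.0])
A = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
print(viterbi([0, 0, 2], pi, A, B))
```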


Recall: Hidden Markov models

• The robot with noisy sensors is a good example

• Question 1: (Evaluation) State estimation:

• What is 𝑃(𝑞𝑑 = 𝑆𝑖 | 𝑂1, … , 𝑂𝑑)?

• Question 2: (Inference) Most probable path:

• Given 𝑂1, … , 𝑂𝑑, what is the most probable path of the states? And what is the probability?

• Question 3: (Learning) Learning HMMs:

• Given 𝑂1, … , 𝑂𝑑, what is the maximum likelihood HMM that could have produced this string of observations?

• MLE

Inferring an HMM

• Remember, we've been computing quantities like 𝑃(𝑂1…𝑂𝑇 | 𝜆)

• That "𝜆" is the notation for our HMM parameters

• Now we want to estimate 𝜆 from the observations

• As usual, we could use the maximum likelihood estimate: choose 𝜆 to maximize 𝑃(𝑂1…𝑂𝑇 | 𝜆)

Max likelihood HMM estimation

• Define:


EM for HMMs

• If we knew 𝜆, we could estimate expectations of quantities such as:

• Expected number of times in state 𝑖

• Expected number of transitions 𝑖 → 𝑗

• If we knew the quantities such as:

• Expected number of times in state 𝑖

• Expected number of transitions 𝑖 → 𝑗

• We could compute the max likelihood estimate of 𝜆 = ({𝜋𝑖}, {𝑎𝑖𝑗}, {𝑏𝑖(𝑗)}) (a sketch of this step follows below)
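A minimal sketch of this second direction (expected counts → maximum likelihood estimates). The count arrays below are placeholders standing in for what an E-step would produce; their values are assumptions for illustration.

```python
import numpy as np

# Placeholder expected counts (assumed values; normally produced by EM's E-step).
exp_start = np.array([3.0, 5.0, 2.0])      # expected number of times starting in state i
exp_trans = np.array([[2.0, 6.0, 2.0],     # expected number of transitions i -> j
                      [4.0, 1.0, 5.0],
                      [3.0, 3.0, 4.0]])
exp_emit = np.array([[5.0, 4.0, 1.0],      # expected number of times state i emits symbol k
                     [2.0, 6.0, 2.0],
                     [1.0, 3.0, 6.0]])

# M-step: the max likelihood estimates are the normalized expected counts.
pi_hat = exp_start / exp_start.sum()
A_hat = exp_trans / exp_trans.sum(axis=1, keepdims=True)
B_hat = exp_emit / exp_emit.sum(axis=1, keepdims=True)
print(pi_hat, A_hat, B_hat, sep="\n")
```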



EM for HMMs

• Bad news:

• There are lots of local minima

• Good news:

• The local minima are usually adequate models of the data

• Notice:

• EM does not estimate the number of states. That must be given.

• Often, HMMs are forced to have some links with zero probability. This is done by setting 𝑎𝑖𝑗 = 0 in the initial estimate 𝜆(0)

• Easy extension of everything seen today:

• HMMs with real-valued outputs