Hidden Markov Models
The three basic HMM problems
(note: change in notation)
Mitch Marcus
CSE 391
Parameters of an HMM
States: a set of states $S = \{s_1, \ldots, s_N\}$
Transition probabilities: $A = a_{1,1}, a_{1,2}, \ldots, a_{N,N}$. Each $a_{i,j}$ represents the probability of transitioning from state $s_i$ to $s_j$.
Emission probabilities: a set B of functions of the form $b_i(o_t)$, the probability of observation $o_t$ being emitted by $s_i$.
Initial state distribution: $\pi_i$ is the probability that $s_i$ is a start state.
(This and later slides follow the classic formulation by Rabiner and Juang,
following Ferguson, as adapted by Manning and Schütze. Slides
adapted from Dorr. Note the change in notation!!)
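As a concrete picture of these parameters, here is a minimal sketch in Python using NumPy arrays; the specific numbers, the choice of N = 2 states, and the vocabulary of M = 3 symbols are invented for illustration and are not from the slides:

```python
import numpy as np

# A toy HMM with N = 2 states and a vocabulary of M = 3 symbols.
# All numbers here are invented for illustration.

# Transition probabilities: A[i, j] = a_ij = P(q_{t+1} = s_j | q_t = s_i).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Emission probabilities: B[i, k] = b_i(v_k) = P(o_t = v_k | q_t = s_i).
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

# Initial state distribution: pi[i] = P(q_1 = s_i).
pi = np.array([0.6, 0.4])

# Every distribution must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```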
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence $O = o_1, \ldots, o_T$ and an HMM model $\lambda = (A, B, \pi)$, how do we compute the probability of O given the model, $P(O \mid \lambda)$?
Problem 2 (Decoding): Given the observation sequence $O = o_1, \ldots, o_T$ and an HMM model $\lambda = (A, B, \pi)$, how do we find the state sequence that best explains the observations?
Problem 3 (Learning): How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?
Problem 1: Probability of an Observation Sequence
Q: What is $P(O \mid \lambda)$?
A: The sum of the probabilities of all possible state sequences in the HMM.
• The probability of each state sequence is itself the product of the state-transition and emission probabilities along it.
Naïve computation is very expensive. Given T observations and N states, there are $N^T$ possible state sequences.
• (For T = 10 and N = 10, that is 10 billion different paths!!)
Solution: dynamic programming, linear in T!
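To make the cost concrete, here is a deliberately naïve sketch in Python, my own illustration, assuming the array layout from the earlier example and observations given as integer symbol indices:

```python
import itertools
import numpy as np

def naive_likelihood(A, B, pi, obs):
    """P(O | lambda) by brute-force enumeration of all N**T state paths."""
    N, T = len(pi), len(obs)
    total = 0.0
    # One term per state sequence:
    # pi[q1] * b_{q1}(o1) * prod_t a_{q(t-1), qt} * b_{qt}(ot)
    for path in itertools.product(range(N), repeat=T):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total
```

Even for modest T this loop runs $N^T$ times, which is exactly why the trellis-based dynamic programming on the next slides matters.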
The Crucial Data Structure: The Trellis
Forward Probabilities:
$\alpha_t(i) = P(o_1 \ldots o_t,\; q_t = s_i \mid \lambda)$
For a given HMM $\lambda$, given that the state is $s_i$ at time t (with change of notation: some arbitrary time), what is the probability that the partial observation $o_1 \ldots o_t$ has been generated?
The Forward algorithm computes $\alpha_t(i)$ for $1 \le i \le N$, $1 \le t \le T$ in time $O(N^2 T)$ using the trellis.
Forward Algorithm: Induction step
$\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t)$
where, as before, $\alpha_t(i) = P(o_1 \ldots o_t,\; q_t = s_i \mid \lambda)$
Forward Algorithm
Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$
Induction: $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t), \quad 2 \le t \le T,\; 1 \le j \le N$
Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
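These three steps translate almost directly into Python; a sketch rather than a definitive implementation, assuming the NumPy parameter layout from the earlier example:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns P(O | lambda) and the alpha trellis.

    obs is a sequence of integer symbol indices (an assumption of this sketch).
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha[0] = pi * B[:, obs[0]]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return alpha[-1].sum(), alpha
```

For long sequences the $\alpha$ values underflow, so practical implementations rescale each time step or work in log space.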
Forward Algorithm Complexity
The naïve approach requires exponential time to evaluate all $N^T$ state sequences.
The Forward algorithm, using dynamic programming, takes $O(N^2 T)$ computations.
Backward Probabilities:
$\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)$
For a given HMM $\lambda$, given that the state is $s_i$ at time t, what is the probability that the partial observation $o_{t+1} \ldots o_T$ will be generated?
Analogous to the forward probability, just in the other direction.
The Backward algorithm computes $\beta_t(i)$ for $1 \le i \le N$, $1 \le t \le T$ in time $O(N^2 T)$ using the trellis.
Backward Probabilities
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$
where, as before, $\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)$
Backward Algorithm
Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$
Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1,\; 1 \le i \le N$
Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$
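The same kind of transcription works for the backward pass; again a sketch under the same assumptions as the forward() example:

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward algorithm: returns P(O | lambda) and the beta trellis."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    # Initialization: beta_T(i) = 1
    beta[-1] = 1.0
    # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # Termination: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return (pi * B[:, obs[0]] * beta[0]).sum(), beta
```

On the same inputs, forward() and backward() should return the same likelihood $P(O \mid \lambda)$, which makes a convenient correctness check.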
Problem 2: Decoding
The Forward algorithm gives the sum of all
paths through an HMM efficiently.
Here, we want to find the highest probability
path.
We want to find the state sequence $Q = q_1 \ldots q_T$ such that
$Q = \arg\max_{Q'} P(Q' \mid O, \lambda)$
Viterbi Algorithm
Just like the forward algorithm, but instead of
summing over transitions from incoming
states, compute the maximum
Forward: $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t)$
Viterbi Recursion: $\delta_t(j) = \left[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\right] b_j(o_t)$
Core Idea of Viterbi Algorithm
Not quite what we want….
The Viterbi recursion
$\delta_t(j) = \left[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\right] b_j(o_t)$
computes the maximum probability of a path ending in state j at time t, given that the partial observation $o_1 \ldots o_t$ has been generated.
But we want the path itself that gives the maximum probability.
Solution:
1. Keep backpointers
2. Find $\arg\max_{1 \le j \le N} \delta_T(j)$
3. Chase backpointers from that state at time T to find the state sequence (backwards)
Viterbi Algorithm
Initialization: $\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$
Induction: $\delta_t(j) = \left[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\right] b_j(o_t), \quad 2 \le t \le T,\; 1 \le j \le N$
(Backpointers) $\psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}, \quad 2 \le t \le T,\; 1 \le j \le N$
Termination: (Final state!) $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
Backpointer path: $q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1$
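These steps also translate nearly line for line into Python; a sketch under the same assumptions as the earlier examples, with the backpointers $\psi$ kept as an integer matrix:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm: returns the best state path and its probability."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # delta[t, j]: best path prob. ending in j at t
    psi = np.zeros((T, N), dtype=int)  # psi[t, j]: best predecessor of j at t
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta[0] = pi * B[:, obs[0]]
    # Induction: maximize over incoming states instead of summing
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)       # psi_t(j) = argmax_i delta_{t-1}(i) * a_ij
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Termination, then chase backpointers from the final state
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[-1].max()
```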
Problem 3: Learning
Up to now we’ve assumed that we know the underlying model $\lambda = (A, B, \pi)$.
Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive
• Training data is different from the current data
We want to maximize the parameters with respect to the current data, i.e., we’re looking for a model $\lambda'$ such that
$\lambda' = \arg\max_{\lambda} P(O \mid \lambda)$
Problem 3: Learning (If Time Allows…)
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\lambda'$ such that $\lambda' = \arg\max_{\lambda} P(O \mid \lambda)$.
But it is possible to find a local maximum:
Given an initial model $\lambda$, we can always find a model $\lambda'$ such that $P(O \mid \lambda') \ge P(O \mid \lambda)$.
Forward-Backward (Baum-Welch) algorithm
Key Idea: parameter re-estimation by hill-climbing.
The FB algorithm iteratively re-estimates the parameters, yielding a new $\lambda'$ at each iteration:
1. Initialize $\lambda$ to a random set of values
2. Estimate $P(O \mid \lambda)$, filling out the trellis for both the Forward and the Backward algorithms
3. Re-estimate $\lambda$ using both trellises, yielding a new estimate $\lambda'$
Theorem: $P(O \mid \lambda') \ge P(O \mid \lambda)$
Parameter Re-estimation
Three parameters need to be re-estimated:
• Initial state distribution: $\pi_i$
• Transition probabilities: $a_{i,j}$
• Emission probabilities: $b_i(o_t)$
Re-estimating Transition Probabilities: Step 1
What’s the probability of being in state $s_i$ at time t and going to state $s_j$, given the current model and parameters?
$\xi_t(i, j) = P(q_t = s_i,\; q_{t+1} = s_j \mid O, \lambda)$
Re-estimating Transition Probabilities: Step 1
$\xi_t(i, j) = P(q_t = s_i,\; q_{t+1} = s_j \mid O, \lambda)$
$\xi_t(i, j) = \dfrac{\alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$
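This ratio falls directly out of the two trellises; a sketch in Python, assuming the forward() and backward() helpers sketched earlier:

```python
import numpy as np

def xi_matrix(A, B, alpha, beta, obs, t):
    """xi_t(i, j): probability of being in s_i at t and s_j at t+1, given O."""
    # Numerator: alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j), for all (i, j)
    num = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
    # The denominator normalizes over all (i, j) pairs; it equals P(O | lambda)
    return num / num.sum()
```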
Re-estimating Transition Probabilities: Step 2
The intuition behind the re-estimation equation for transition probabilities is
$\hat{a}_{i,j} = \dfrac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}$
Formally:
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}$
Re-estimating Transition Probabilities
Defining
$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$
as the probability of being in state $s_i$, given the complete observation O, we can say:
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
Re-estimating Initial State Probabilities
Initial state distribution: $\pi_i$ is the probability that $s_i$ is a start state.
Re-estimation is easy:
$\hat{\pi}_i = \text{expected number of times in state } s_i \text{ at time } 1$
Formally: $\hat{\pi}_i = \gamma_1(i)$
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
$\hat{b}_i(k) = \dfrac{\text{expected number of times in state } s_i \text{ observing symbol } v_k}{\text{expected number of times in state } s_i}$
Formally:
$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$
where $\delta(o_t, v_k) = 1$ if $o_t = v_k$, and 0 otherwise.
Note that $\delta$ here is the Kronecker delta function and is not related to the $\delta$ in the discussion of the Viterbi algorithm!!
The Updated Model
Coming from $\lambda = (A, B, \pi)$ we get to $\lambda' = (\hat{A}, \hat{B}, \hat{\pi})$ by the following update rules:
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad \hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} \qquad \hat{\pi}_i = \gamma_1(i)$
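Putting the update rules together, one full re-estimation pass might look like the following sketch, composed from the formulas above and reusing the hypothetical forward(), backward(), and xi_matrix() helpers; the step that derives $\gamma_T$ directly from the trellises is my own addition, needed because $\xi$ is only defined up to T-1:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Forward-Backward re-estimation step; returns (A_hat, B_hat, pi_hat)."""
    N, M = B.shape
    T = len(obs)
    _, alpha = forward(A, B, pi, obs)
    _, beta = backward(A, B, pi, obs)
    # xi[t] is the N x N matrix xi_t(i, j), for t = 1 .. T-1
    xi = np.array([xi_matrix(A, B, alpha, beta, obs, t) for t in range(T - 1)])
    # gamma_t(i) = sum_j xi_t(i, j), shape (T-1, N)
    gamma = xi.sum(axis=2)
    # a_hat_ij = sum_t xi_t(i, j) / sum_t gamma_t(i)
    A_hat = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
    # gamma at the final time step comes straight from alpha and beta
    last = alpha[-1] * beta[-1]
    gamma_full = np.vstack([gamma, last / last.sum()])  # shape (T, N)
    # b_hat_i(k): gamma mass at the times where v_k was observed, normalized
    obs = np.asarray(obs)
    B_hat = np.zeros((N, M))
    for k in range(M):
        B_hat[:, k] = gamma_full[obs == k].sum(axis=0)
    B_hat /= gamma_full.sum(axis=0)[:, None]
    # pi_hat_i = gamma_1(i)
    pi_hat = gamma_full[0]
    return A_hat, B_hat, pi_hat
```

Iterating this step until $P(O \mid \lambda)$ stops improving climbs to the local maximum promised by the theorem on the earlier slide.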
Expectation Maximization
The forward-backward algorithm is an instance of
the more general EM algorithm
• The E Step: Compute the forward and backward probabilities for a given model
• The M Step: Re-estimate the model parameters