
Hidden Markov Models and Sequential Data

Sequential Data
• Often arise through measurement of time series
  • Snowfall measurements on successive days in Buffalo
  • Rainfall measurements in Cherrapunji
  • Daily values of a currency exchange rate
  • Acoustic features at successive time frames in speech recognition
• Non-time series
  • Nucleotide base pairs in a strand of DNA
  • Sequence of characters in an English sentence
  • Parts of speech of successive words


Sound Spectrogram of Spoken Words
• Example: the spoken words “Bayes Theorem”
• Plot of the intensity of the spectral coefficients versus time index
• Successive observations of the speech spectrum are highly correlated

Task of Making a Sequence of Decisions
• Processes in time: states at time t are influenced by a state at time t-1
• In many time series applications, e.g., financial forecasting, we wish to predict the next value from previous values
• It is impractical to consider general dependence of future observations on all previous observations
  • Complexity would grow without limit as the number of observations increases
• Markov models assume dependence on only the most recent observations


Model Assuming Independence
• Simplest model: treat the observations as independent
• Corresponds to a graph without links

Markov Model
• Most general Markov model for observations {x_n}
• Use the product rule to express the joint distribution of the sequence of observations:

  p(x_1, ..., x_N) = Π_{n=1}^{N} p(x_n | x_1, ..., x_{n-1})

First Order Markov Model
• Chain of observations {x_n}
• Distribution p(x_n | x_{n-1}) is conditioned on the previous observation
• Joint distribution for a sequence of N variables:

  p(x_1, ..., x_N) = p(x_1) Π_{n=2}^{N} p(x_n | x_{n-1})

• It can be verified (using the product rule from above) that

  p(x_n | x_1, ..., x_{n-1}) = p(x_n | x_{n-1})

• If the model is used to predict the next observation, the distribution of the prediction depends only on the immediately preceding observation and is independent of all earlier observations (see the sketch below)
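To make the factorization concrete, here is a minimal sketch that evaluates p(x_1) Π_{n=2}^{N} p(x_n | x_{n-1}) for a discrete chain. The two-state “weather” example and its numbers are hypothetical, not from the slides.

```python
# Minimal sketch: joint probability of a sequence under a first-order Markov model.
# The two-state "weather" chain and its numbers are hypothetical, for illustration only.

initial = {"rain": 0.4, "dry": 0.6}                       # p(x_1)
transition = {                                            # p(x_n | x_{n-1})
    "rain": {"rain": 0.7, "dry": 0.3},
    "dry":  {"rain": 0.2, "dry": 0.8},
}

def joint_probability(sequence):
    """p(x_1, ..., x_N) = p(x_1) * prod_{n=2}^{N} p(x_n | x_{n-1})."""
    prob = initial[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= transition[prev][curr]
    return prob

print(joint_probability(["dry", "dry", "rain", "rain"]))  # 0.6 * 0.8 * 0.2 * 0.7 = 0.0672
```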

Second Order Markov Model
• Conditional distribution of observation x_n depends on the values of the two previous observations x_{n-1} and x_{n-2}

  p(x_1, ..., x_N) = p(x_1) p(x_2 | x_1) Π_{n=3}^{N} p(x_n | x_{n-1}, x_{n-2})

• Each observation is influenced by the previous two observations

Introducing Latent Variables
• For each observation x_n, introduce a latent variable z_n
• z_n may be of a different type or dimensionality to the observed variable
• The latent variables form the Markov chain
• Gives the “state-space model”

  [Figure: chain of latent variables z_n, each emitting an observation x_n]

• Joint distribution for this model (see the sampling sketch below):

  p(x_1, ..., x_N, z_1, ..., z_N) = p(z_1) [ Π_{n=2}^{N} p(z_n | z_{n-1}) ] Π_{n=1}^{N} p(x_n | z_n)
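Because the joint distribution factorizes this way, a sequence can be generated by ancestral sampling: draw each latent state from the transition distribution, then draw the observation from the emission distribution. A minimal sketch; the two latent states, three symbols, and all probabilities below are hypothetical.

```python
import random

# Ancestral sampling from the state-space model
# p(x_1..x_N, z_1..z_N) = p(z_1) * prod p(z_n | z_{n-1}) * prod p(x_n | z_n).
initial_z = {"s1": 0.5, "s2": 0.5}                     # p(z_1)
trans_z   = {"s1": {"s1": 0.9, "s2": 0.1},             # p(z_n | z_{n-1})
             "s2": {"s1": 0.2, "s2": 0.8}}
emit_x    = {"s1": {"a": 0.6, "b": 0.3, "c": 0.1},     # p(x_n | z_n)
             "s2": {"a": 0.1, "b": 0.2, "c": 0.7}}

def draw(dist):
    """Sample a key of `dist` with probability proportional to its value."""
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def sample_sequence(length):
    z = draw(initial_z)
    states, symbols = [z], [draw(emit_x[z])]
    for _ in range(length - 1):
        z = draw(trans_z[z])                           # latent Markov chain
        states.append(z)
        symbols.append(draw(emit_x[z]))                # observation given latent state
    return states, symbols

print(sample_sequence(10))
```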

Two Models Described by this Graph

  [Figure: the same state-space graph of latent variables and observations]

1. If the latent variables are discrete: Hidden Markov Model (the observed variables in an HMM may be discrete or continuous)
2. If both latent and observed variables are Gaussian, we obtain linear dynamical systems

Latent Variable with Three Discrete States
• Transition probabilities a_ij are represented by a matrix
• Not a graphical model, since the nodes are not separate variables but states of a single variable
• This can be unfolded over time to get a trellis diagram

Markov Model for the Production of Spoken Words
• States represent phonemes
• Production of the word “cat”:
  • Represented by states /k/ /a/ /t/
  • Transitions from /k/ to /a/, from /a/ to /t/, and from /t/ to a silent state
• Although only the correct “cat” sound is represented by the model, other transitions can be introduced, e.g., /k/ followed by /t/

  [Figure: Markov model for the word “cat” with states /k/, /a/, /t/]


First-order Markov Models (MMs)
• State at time t: ω(t)
• Particular sequence of length T: ω^T = {ω(1), ω(2), ω(3), ..., ω(T)}
  e.g., ω^6 = {ω_1, ω_4, ω_2, ω_2, ω_1, ω_4}
• Note: the system can revisit a state at different steps, and not every state needs to be visited
• The model for the production of any sequence is described by the transition probabilities
  P(ω_j(t+1) | ω_i(t)) = a_ij
  which is a time-independent probability of having state ω_j at step t+1 given that the state at step t was ω_i
• There is no requirement that the transition probabilities be symmetric
• Particular model: θ = {a_ij}. Given model θ, the probability that the model generated the sequence ω^6 = {ω_1, ω_4, ω_2, ω_2, ω_1, ω_4} is
  P(ω^6 | θ) = a_14 · a_42 · a_22 · a_21 · a_14
  (see the sketch below)
• Can include an a priori probability of the first state, P(ω(1) = ω_i)
• Discrete states = nodes, transition probabilities = links
• In a first-order discrete-time HMM, at step t the system is in state ω(t); the state at step t+1 is a random function that depends on the state at step t and on the transition probabilities
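As an arithmetic check of the expression above, a few lines suffice to evaluate P(ω^6 | θ) as a product of transition-matrix entries. The 4×4 matrix below is a placeholder with rows summing to 1, since the slide gives no numerical values.

```python
import numpy as np

# Placeholder transition matrix theta = {a_ij}; the slide gives no numbers.
a = np.array([[0.25, 0.25, 0.25, 0.25],
              [0.10, 0.40, 0.30, 0.20],
              [0.30, 0.30, 0.20, 0.20],
              [0.20, 0.20, 0.20, 0.40]])

def sequence_probability(states, a):
    """Product of transition probabilities along a state sequence (states are 1-indexed)."""
    prob = 1.0
    for i, j in zip(states, states[1:]):
        prob *= a[i - 1, j - 1]
    return prob

# P(omega^6 | theta) = a_14 * a_42 * a_22 * a_21 * a_14
print(sequence_probability([1, 4, 2, 2, 1, 4], a))
```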


First Order Hidden Markov Models

• Perceiver does not have access to the states ω(t)

• Instead we measure properties of the emitted sound

• Need to augment Markov model to allow for visible states (symbols)

First Order Hidden Markov Models
• Visible states (symbols): V^T = {v(1), v(2), v(3), ..., v(T)}
• For instance V^6 = {v_5, v_1, v_1, v_5, v_2, v_3}
• In any state ω_j(t), the probability of emitting symbol v_k(t) is b_jk

  [Figure: three hidden units in an HMM; visible states and their emission probabilities shown in red]

Hidden Markov Model Computation

• Finite state machines with transition probabilities are called Markov networks
• Strictly causal: probabilities depend only on previous states
• A Markov model is ergodic if every state has a non-zero probability of occurring given some starting state
• A final or absorbing state is one which, once entered, is never left


Three Basic Problems for HMMs
• Given an HMM with transition and symbol probabilities:
• Problem 1: The evaluation problem
  • Determine the probability that a particular sequence of symbols V^T was generated by that model
• Problem 2: The decoding problem
  • Given a sequence of symbols V^T, determine the most likely sequence of hidden states ω^T that led to the observations
• Problem 3: The learning problem
  • Given a coarse structure of the model (the number of states and the number of symbols) but not the probabilities a_ij and b_jk, determine these parameters

Problem 1: Evaluation Problem
• Probability that the model produces a sequence V^T of visible states:

  P(V^T) = Σ_{r=1}^{r_max} P(V^T | ω_r^T) P(ω_r^T)

  where each r indexes a particular sequence of T hidden states,
  ω_r^T = {ω(1), ω(2), ..., ω(T)}

• In the general case of c hidden states there will be r_max = c^T possible terms

Evaluation Problem Formula
• Probability that the model produces a sequence V^T of visible states:

  P(V^T) = Σ_{r=1}^{r_max} P(V^T | ω_r^T) P(ω_r^T)

• Because (1) the output probabilities depend only upon the hidden states and (2) it is a first-order hidden Markov process:

  (1)  P(V^T | ω_r^T) = Π_{t=1}^{T} P(v(t) | ω(t))
  (2)  P(ω_r^T) = Π_{t=1}^{T} P(ω(t) | ω(t-1))

• Substituting:

  P(V^T) = Σ_{r=1}^{r_max} Π_{t=1}^{T} P(v(t) | ω(t)) P(ω(t) | ω(t-1))

• Interpretation: the probability of the sequence V^T is equal to the sum, over all r_max possible sequences of hidden states, of the conditional probability that the system has made a particular transition, multiplied by the probability that it then emitted the visible symbol in the target sequence (a brute-force transcription is sketched below)
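A literal implementation of this sum enumerates all r_max = c^T hidden-state sequences, so it is exponentially expensive, but it makes the formula explicit. A minimal sketch; the convention that a single known start state plays the role of ω(0) is an assumption carried over from the worked example later in these slides.

```python
from itertools import product
import numpy as np

def evaluate_brute_force(a, b, visible, start):
    """P(V^T) = sum over all c^T hidden paths of prod_t P(v(t)|w(t)) * P(w(t)|w(t-1)).
       a[i, j]: transition probability, b[j, k]: symbol probability,
       visible: list of symbol indices, start: index of the assumed state at step 0."""
    c = a.shape[0]
    total = 0.0
    for path in product(range(c), repeat=len(visible)):   # every hidden-state sequence
        prob, prev = 1.0, start
        for state, symbol in zip(path, visible):
            prob *= a[prev, state] * b[state, symbol]      # transition * emission
            prev = state
        total += prob
    return total
```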

Computationally Simpler Evaluation Algorithm
• Calculate

  P(V^T) = Σ_{r=1}^{r_max} Π_{t=1}^{T} P(v(t) | ω(t)) P(ω(t) | ω(t-1))

  recursively, because each term P(v(t) | ω(t)) P(ω(t) | ω(t-1)) involves only v(t), ω(t) and ω(t-1)
• Define: α_j(t) is the probability that the model is in state ω_j(t) and has generated the target sequence up to step t
• Here b_jk v(t) means the symbol probability b_jk corresponding to v(t)

HMM Forward
• b_jk v(t) means the symbol probability b_jk corresponding to v(t)
• Computational complexity: O(c²T)
• The Backward algorithm is the time-reversed version of the Forward algorithm
• Trellis: unfolding of the HMM through time
• Computation of probabilities by the Forward algorithm:
  • In the evaluation trellis we only accumulate values
  • In the decoding trellis we only keep maximum values
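The recursion itself appears only as a figure on the slide, so the sketch below implements the standard forward recursion α_j(t) = b_j,v(t) Σ_i α_i(t-1) a_ij with α(0) concentrated on a known start state; the start-state convention (rather than a prior over initial states) is an assumption taken from the worked example that follows. The cost is the quoted O(c²T).

```python
import numpy as np

def forward(a, b, visible, start):
    """Forward (evaluation) pass in O(c^2 T).
       a[i, j]: transition probabilities, b[j, k]: symbol probabilities,
       visible: list of symbol indices, start: index of the known initial state.
       Returns P(V^T) and the full table alpha[t, j]."""
    c, T = a.shape[0], len(visible)
    alpha = np.zeros((T + 1, c))
    alpha[0, start] = 1.0                      # alpha_j(0): 1 for the start state, 0 otherwise
    for t, symbol in enumerate(visible, start=1):
        # alpha_j(t) = b_{j, v(t)} * sum_i alpha_i(t-1) * a_{ij}
        alpha[t] = b[:, symbol] * (alpha[t - 1] @ a)
    return alpha[T].sum(), alpha
```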

Example of Hidden Markov Model
• Four states, with an explicit absorber state ω_0, and five symbols, with a unique null symbol v_0

  a_ij =        ω_0    ω_1    ω_2    ω_3
        ω_0     1      0      0      0
        ω_1     0.2    0.3    0.1    0.4
        ω_2     0.2    0.5    0.2    0.1
        ω_3     0.8    0.1    0      0.1

  b_jk =        v_0    v_1    v_2    v_3    v_4
        ω_0     1      0      0      0      0
        ω_1     0      0.3    0.4    0.1    0.2
        ω_2     0      0.1    0.1    0.7    0.1
        ω_3     0      0.5    0.2    0.1    0.2

• Example problem: compute the probability of generating the sequence V^4 = {v_1, v_3, v_2, v_0}
• Assume ω_1 is the start state

• At t = 0 the system is in state ω_1, thus α_1(0) = 1 and α_j(0) = 0 for j ≠ 1
• Arrows in the trellis show the calculation of α_j(1)
• Top arrow: α_0(1) = α_1(0) · [a_10 · b_0,v(1)] = 1 × [0.2 · 0] = 0
• The full forward pass for this example is sketched below

  [Trellis figure: the visible symbol at each step, the local computation α_i(t) a_ij b_jk along each arrow, and the a_ij and b_jk matrices repeated from the previous slide]
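For reference, the forward pass for this example can be run in a few self-contained lines using the matrices transcribed above; with these values the final summed α, i.e., P(V^4), comes out to approximately 0.0011.

```python
import numpy as np

# Transition a[i, j] and symbol b[j, k] matrices transcribed from the example slide.
a = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.2, 0.3, 0.1, 0.4],
              [0.2, 0.5, 0.2, 0.1],
              [0.8, 0.1, 0.0, 0.1]])
b = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.3, 0.4, 0.1, 0.2],
              [0.0, 0.1, 0.1, 0.7, 0.1],
              [0.0, 0.5, 0.2, 0.1, 0.2]])

visible = [1, 3, 2, 0]              # V^4 = {v1, v3, v2, v0}
alpha = np.zeros(4)
alpha[1] = 1.0                      # start in state omega_1

for symbol in visible:
    alpha = b[:, symbol] * (alpha @ a)   # alpha_j(t) = b_{j,v(t)} * sum_i alpha_i(t-1) a_ij
    print(np.round(alpha, 6))

print("P(V^4) =", round(alpha.sum(), 4))   # approximately 0.0011
```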

Problem 2: Decoding Problem

• Given a sequence of visible states V^T, the decoding problem is to find the most probable sequence of hidden states.

• Expressed mathematically: find the single “best” sequence of hidden states ω̂(1), ω̂(2), ..., ω̂(T) such that

  {ω̂(1), ω̂(2), ..., ω̂(T)} = argmax_{ω(1), ..., ω(T)} P( ω(1), ω(2), ..., ω(T), v(1), v(2), ..., v(T) | θ )

• Note that the summation is changed to argmax, since we want to find the best case

Viterbi Algorithm
• Notes:
  1. If a_ij and b_jk are replaced by log probabilities, we add terms rather than multiply them
  2. The best path is maintained for each node
  3. Computational complexity: O(c²T) (versus O(c^T T) for exhaustive search)
• A sketch of the recursion follows below

  [Figure: Viterbi decoding trellis]
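The slides give the algorithm as a trellis figure; as a sketch, the standard Viterbi recursion below keeps, at every trellis node, only the maximum path probability and a backpointer. The known-start-state convention is again an assumption from the worked example; in practice the products would be replaced by sums of log probabilities, as note 1 suggests.

```python
import numpy as np

def viterbi(a, b, visible, start):
    """Most probable hidden-state sequence for the observed symbol indices.
       Keeps only the maximum value (and a backpointer) at each trellis node."""
    c, T = a.shape[0], len(visible)
    delta = np.zeros((T + 1, c))            # best path probability ending in state j at step t
    back = np.zeros((T + 1, c), dtype=int)  # best predecessor of state j at step t
    delta[0, start] = 1.0
    for t, symbol in enumerate(visible, start=1):
        scores = delta[t - 1][:, None] * a             # scores[i, j] = delta_i(t-1) * a_ij
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * b[:, symbol]   # times the emission probability
    path = [int(delta[T].argmax())]                    # best final state
    for t in range(T, 0, -1):                          # follow backpointers
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))                        # position 0 is the start state
```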

Example of Decoding
• Four states with an explicit absorber state and five symbols with a unique null symbol; the transition matrix a_ij and symbol matrix b_jk are the same as in the evaluation example above
• What is the most likely state sequence that generated the particular symbol sequence V^4 = {v_1, v_3, v_2, v_0}? Assume ω_1 is the start state

  [Figure: decoding trellis for this example]

• Note: the transition between ω_3 and ω_2 is forbidden by the model, yet the decoding algorithm gives it a non-zero probability

Problem 3: Learning Problem
• Goal: to determine the model parameters a_ij and b_jk from an ensemble of training samples
• There is no known method for obtaining an optimal or most likely set of parameters
• A good, straightforward solution: the Forward-Backward algorithm

Forward-Backward Algorithm
• An instance of the generalized expectation-maximization algorithm
• We do not know the states that hold when the symbols are generated
• The approach is to iteratively update the weights in order to better explain the observed training sequences

Backward Probabilities
• α_i(t) is the probability that the model is in state ω_i(t) and has generated the target sequence up to step t
• Analogously, β_i(t) is the probability that the model is in state ω_i(t) and will generate the remainder of the given target sequence, from t+1 to T
• Computation proceeds backward through the trellis

Backward Evaluation Algorithm
• This is used in learning: parameter estimation (a sketch of the recursion follows below)
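The recursion on the slide is given as a figure; the sketch below implements the standard time-reversed recursion β_i(t) = Σ_j a_ij b_j,v(t+1) β_j(t+1), initialised with β_i(T) = 1. That boundary condition is a common convention and an assumption here; the slide's exact boundary case (e.g., with an absorber state) may differ.

```python
import numpy as np

def backward(a, b, visible):
    """Backward pass: beta[t, i] = P(v(t+1), ..., v(T) | state omega_i at step t).
       Recursion: beta_i(t) = sum_j a_ij * b_{j, v(t+1)} * beta_j(t+1), with beta_i(T) = 1."""
    c, T = a.shape[0], len(visible)
    beta = np.ones((T + 1, c))
    for t in range(T - 1, -1, -1):           # proceed backward through the trellis
        symbol = visible[t]                  # visible[t] is v(t+1) in the slides' indexing
        beta[t] = a @ (b[:, symbol] * beta[t + 1])
    return beta
```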

Estimating the a_ij and b_jk
• α_i(t) and β_i(t) are merely estimates of their true values, since we don't know the actual values of a_ij and b_jk
• The probability of a transition between ω_i(t-1) and ω_j(t), given that the model generated the entire training sequence V^T by any path, is

  γ_ij(t) = α_i(t-1) a_ij b_jk v(t) β_j(t) / P(V^T | θ)

Calculating Improved Estimate of a_ij
• The numerator is the expected number of transitions between state ω_i(t-1) and state ω_j(t)
• The denominator is the total expected number of transitions from ω_i

Calculating Improved Estimate of b_jk
• The ratio between the expected frequency with which the particular symbol v_k is emitted and that for any symbol

Learning Algorithm
• Start with rough estimates of a_ij and b_jk
• Calculate improved estimates using the formulas above (a sketch of one re-estimation step follows below)
• Repeat until there is a sufficiently small change in the estimated values of the parameters
• This is the Baum-Welch or Forward-Backward algorithm
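The sketch below performs one Baum-Welch re-estimation step for a single training sequence, reusing the forward and backward passes from the earlier sketches. It is a simplified illustration under stated assumptions (known start state, single sequence, no smoothing of zero counts), not the slides' exact formulation.

```python
import numpy as np

def baum_welch_step(a, b, visible, start):
    """One forward-backward (Baum-Welch) re-estimation step for one training sequence."""
    c, T = a.shape[0], len(visible)

    # Forward and backward tables (same conventions as the earlier sketches).
    alpha = np.zeros((T + 1, c)); alpha[0, start] = 1.0
    for t, sym in enumerate(visible, start=1):
        alpha[t] = b[:, sym] * (alpha[t - 1] @ a)
    beta = np.ones((T + 1, c))
    for t in range(T - 1, -1, -1):
        beta[t] = a @ (b[:, visible[t]] * beta[t + 1])
    seq_prob = alpha[T].sum()                         # P(V^T | current parameters)

    # gamma[t, i, j]: probability of the transition omega_i -> omega_j that emits visible[t],
    # given that the model generated the entire sequence V^T.
    gamma = np.zeros((T, c, c))
    for t, sym in enumerate(visible):
        gamma[t] = (alpha[t][:, None] * a * b[:, sym][None, :] * beta[t + 1][None, :]) / seq_prob

    # New a_ij: expected i -> j transitions over expected transitions out of i.
    a_new = gamma.sum(axis=0)
    a_new /= a_new.sum(axis=1, keepdims=True)

    # New b_jk: expected emissions of symbol k from state j over all emissions from j.
    state_post = gamma.sum(axis=1)     # probability of being in state j when visible[t] is emitted
    b_new = np.zeros_like(b)
    for t, sym in enumerate(visible):
        b_new[:, sym] += state_post[t]
    b_new /= b_new.sum(axis=1, keepdims=True)

    return a_new, b_new, seq_prob
```

In practice the step is repeated, and the returned sequence probability can serve as the stopping criterion mentioned on the convergence slide.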

Convergence
• Requires several presentations of each training sequence (fewer than 5 is common in speech)
• Another stopping criterion: the overall probability that the learning model could have generated the training data

HMM Word Recognition
• Two approaches:
• Approach 1: one HMM can model all possible words
  • Each state corresponds to a letter of the alphabet
  • Letter transition probabilities are calculated for each pair of letters
  • Letter confusion probabilities are the symbol probabilities
  • The decoding problem gives the most likely word
• Approach 2: separate HMMs are used to model each word
  • The evaluation problem gives the probability of the observation, which is used as a class-conditional probability

  [Figure: letter states a-z]

HMM Spoken Word Recognition
• Each word, e.g., cat, dog, etc., has an associated HMM
• For a test utterance, determine which model has the highest probability
• HMMs for speech are left-to-right models
• The HMM produces a class-conditional probability
• Thus it is useful to compute the probability of the model given the sequence using Bayes rule (a sketch follows below)
  • The class-conditional term is computed by the Forward algorithm
  • The other term shown is the prior probability of the sequence
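A hedged sketch of the one-HMM-per-word approach: score the test utterance's symbol sequence with the forward algorithm under each word's HMM and combine with a prior via Bayes rule. The word inventory, priors, and the (a, b, start) packaging of each model are illustrative assumptions, not part of the slides.

```python
import numpy as np

def forward_prob(a, b, visible, start):
    """P(V^T | word model): the same forward recursion as in the evaluation sketch."""
    alpha = np.zeros(a.shape[0]); alpha[start] = 1.0
    for sym in visible:
        alpha = b[:, sym] * (alpha @ a)
    return alpha.sum()

def recognize(word_models, word_priors, visible):
    """word_models: {word: (a, b, start_state)}.  Returns the most probable word and the
       posterior P(word | V^T) from Bayes rule; the priors are illustrative placeholders."""
    likelihoods = {w: forward_prob(a, b, visible, start)
                   for w, (a, b, start) in word_models.items()}
    joint = {w: likelihoods[w] * word_priors[w] for w in word_models}
    evidence = sum(joint.values())                     # P(V^T): the shared denominator
    posteriors = {w: p / evidence for w, p in joint.items()}
    return max(posteriors, key=posteriors.get), posteriors
```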

Cursive Word Recognition (not HMM)
• The unit is a pre-segment (cusp at the bottom); segmentation points are numbered 1 to 8

  [Figure: graph over pre-segments with candidate letters and confidences on the edges, e.g., i[.8], l[.8], u[.5], v[.2], w[.6], m[.3], w[.7], i[.7], u[.3], m[.2], m[.1], r[.4], d[.8], o[.5]]

• Image segment from 1 to 3 is u with 0.5 confidence
• Image segment from 1 to 4 is w with 0.7 confidence
• Image segment from 1 to 5 is w with 0.6 confidence and m with 0.3 confidence
• Best path in the graph from segment 1 to 8: "w o r d"

Summary: Key Algorithms for HMM
• Problem 1: HMM Forward algorithm
• Problem 2: Viterbi algorithm
  • An algorithm to compute the optimal (most likely) state sequence in an HMM given a sequence of observed outputs
• Problem 3: Baum-Welch algorithm
  • An algorithm to find the HMM parameters A, B, and Π with the maximum likelihood of generating the given symbol sequence in the observation vector