An introduction to Hidden Markov Models
Arnaud Hubaux
Computer Science Institute, FUNDP - Namur
March 20, 2007
Academic year 2006-2007
Arnaud Hubaux An introduction to Hidden Markov Models
Outline
Part I: Concepts
Part II: Basic problems
Part III: Case study
Part IV: Conclusion
Part I
Concepts
Outline
1 Introduction
  Overview
  Stochastic process
  Markov chain
2 HMM
Overview
First described in statistical papers in the mid-1960s
stochastic process → Markov process → HMM
One of the first applications was speech recognition, in the mid-1970s
Applications to the analysis of biological sequences (DNA) started in the mid-1980s
Nowadays they are used in:
1 gesture and body motion recognition
2 optical character recognition
3 bioinformatics
4 information extraction
5 . . .
Stochastic process
Definition ([Bil02])
A discrete-time stochastic process is a collection {Xt : 1 ≤ t ≤ T, t ∈ IN} of random variables ordered by the discrete time index t.
In general, the distribution of each variable Xt can be arbitrary and different for each t.
There may also be arbitrary conditional independence relationships between different subsets of variables of the process.
Markov chain: Definition
Definition ([Bil02])
A collection of discrete-valued random variables {Qt : t ≥ 1} forms an nth-order Markov chain if Qt depends only on the n previous states, i.e.

P[Qt = qt | Qt−1 = qt−1, Qt−2 = qt−2, . . . , Q1 = q1] = P[Qt = qt | Qt−1 = qt−1, Qt−2 = qt−2, . . . , Qt−n = qt−n]

∀q1, . . . , qt and n < t. In particular, a chain satisfying

P[Qt = qt | Qt−1 = qt−1, Qt−2 = qt−2, . . . , Q1 = q1] = P[Qt = qt | Qt−1 = qt−1]

is a first-order Markov chain.
Markov chain: Definition (cont'd)
The event {Qt = qi} is interpreted as the chain being in state qi at time t
The event {Qt = qi, Qt+1 = qj} is the transition from state qi to state qj at time t
A discrete-time homogeneous Markov chain can be seen as a finite state automaton (FSA) with conditional probabilities on the transitions

Definition
A Markov process is a stochastic process that has the Markov property.
The term Markov chain is used to mean a discrete-time Markov process.
Markov chain: Transition probabilities
The statistical evolution of a Markov chain is determined by the state transition probabilities:

aij(t) = P[Qt = qj | Qt−1 = qi]

⇒ a function of the state succession and of the current time

If the chain is time-independent, i.e. aij(t) does not depend on the time t, the chain is (time-)homogeneous and:

∀t : aij(t) = aij

In a homogeneous chain, the (stochastic) transition matrix is A, where ∀i, j:

aij = (A)ij

is the probability of going from state i to state j, with

∀i : ∑j aij = 1 ∧ aij ≥ 0
Markov chain: Example
Given the transition matrix:
A = | 0.2 0.4 0.4 |
    | 0.1 0.7 0.2 |
    | 0.5 0.5 0.0 |
the FSA corresponding to this homogeneous first-order Markov chain is:
[Figure: three-state FSA with states S1, S2, S3; each arc is labelled with the corresponding entry of A]
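The chain above can be simulated directly. A minimal sketch in Python (state indices 0, 1, 2 stand for S1, S2, S3; the helper names are illustrative, not from the slides):

```python
import random

# Transition matrix of the example homogeneous first-order Markov chain.
A = [[0.2, 0.4, 0.4],
     [0.1, 0.7, 0.2],
     [0.5, 0.5, 0.0]]

def sample_next(state, A, rng):
    """Draw the next state index according to row `state` of A."""
    r, acc = rng.random(), 0.0
    for j, p in enumerate(A[state]):
        acc += p
        if r < acc:
            return j
    return len(A[state]) - 1  # guard against rounding

def simulate(A, start, steps, seed=0):
    """Generate a state trajectory of the given length."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(sample_next(path[-1], A, rng))
    return path
```

Since a33 = 0, a simulated trajectory never stays in S3 for two consecutive steps.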
Outline
1 Introduction
2 HMM
  Definitions
  Basic problems
  Implementation problems
Definitions: Introductory example
Let us consider a dishonest casino where you can play betting games such as coin tossing. Tosses are supposed to be independent. Imagine now that the casino has two coins:

the first coin, c1, is fair

the second coin, c2, is unfair, with a probability of head (H) of 0.3 and of tail (T) of 0.7

The coin flipped is changed at every play with a probability of 0.1, to avoid arousing the players' suspicion.
Note that each coin has a probability of 0.5 of being chosen for the first play.
How is it possible to model such a problem?
How can we, given an observed result, determine the coins that have been tossed?
Definitions: Notations
Let us define:
1 A set S = {S1, S2, . . . , SN} of N states, where the state at time t is noted qt
2 An alphabet V = {v1, v2, . . . , vM} of size M corresponding to the output symbols
3 A state transition probability distribution matrix A = {aij} where
  aij = P[qt = Sj | qt−1 = Si], 1 ≤ i, j ≤ N
4 An observation symbol probability distribution matrix B = {bj(k)} where
  bj(k) = P[vk at t | qt = Sj], 1 ≤ j ≤ N, 1 ≤ k ≤ M
5 An initial state distribution vector π = {πi} where
  πi = P[q1 = Si], 1 ≤ i ≤ N
Definitions: HMM
Definition ([Rab90])
A complete specification of an HMM consists of:

1 the probability measures A, B and π, noted
  λ = {A, B, π}
  where λ is called an HMM model
2 the observation symbol alphabet V
3 the model parameters N and M
Definitions: Observation sequence
Definition ([Rab90])
An HMM can be used to generate an observation sequence

O = O1O2 . . . OT

where Ot ∈ V, 1 ≤ t ≤ T, as follows:

1 choose q1 = Si according to π
2 set t = 1
3 choose Ot = vk according to bi(k) of state Si
4 transit to qt+1 = Sj according to aij of state Si
5 if t < T then set t = t + 1 and go to 3, else end
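These five steps translate directly into a short sampler; a sketch (function names are illustrative, not from the slides):

```python
import random

def draw(dist, rng):
    """Sample an index from a discrete probability distribution."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1  # guard against rounding

def generate(A, B, pi, T, seed=0):
    """Follow the five steps above: pick q1 from pi, then alternately
    emit Ot from b_q(k) and move to the next state via a_qj."""
    rng = random.Random(seed)
    q = draw(pi, rng)                 # step 1: choose q1 according to pi
    obs = []
    for _ in range(T):                # steps 2-5
        obs.append(draw(B[q], rng))   # emit Ot according to b_q(k)
        q = draw(A[q], rng)           # transit according to a_qj
    return obs
```

With a fully deterministic model (0/1 transition and emission probabilities) the sampler is easy to check: starting in state 0 and alternating states, it emits 0, 1, 0, 1, . . . regardless of the seed.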
Definitions: Introductory example (cont'd)
For the previous example, we have:

1 the number of states N = 2

2 the alphabet V = {H, T} and thus M = 2

3 λ = {A, B, π} with

A = | 0.9 0.1 |   B = | 0.5 0.5 |   π = | 0.5 |
    | 0.1 0.9 |       | 0.3 0.7 |       | 0.5 |

where A is the transition matrix associated with the switches between c1 and c2, B is the observation symbol probability matrix and π the initial state probability vector.
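Encoded in Python, the casino model looks as follows (a sketch; state 0 is c1, state 1 is c2, symbol 0 is H, symbol 1 is T):

```python
# Casino model: state 0 = fair coin c1, state 1 = biased coin c2;
# symbols: 0 = head (H), 1 = tail (T).
A  = [[0.9, 0.1],   # keep c1 / switch to c2
      [0.1, 0.9]]   # switch to c1 / keep c2
B  = [[0.5, 0.5],   # c1 is fair
      [0.3, 0.7]]   # c2: P(H) = 0.3, P(T) = 0.7
pi = [0.5, 0.5]     # either coin equally likely at the first play

def check_stochastic(rows):
    """Every probability row must be non-negative and sum to 1."""
    return all(p >= 0 for row in rows for p in row) and \
           all(abs(sum(row) - 1.0) < 1e-12 for row in rows)

# Marginal probability of seeing a head on the first toss:
p_head = sum(pi[i] * B[i][0] for i in range(2))
```

`check_stochastic` verifies the row constraints from the transition-probability slide; `p_head` is the marginal 0.5 × 0.5 + 0.5 × 0.3 = 0.4 of a head on the first toss.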
Definitions: Representations of HMM
FSA: shows only the state transitions
[Figure: three-state FSA]

⇒ easy analysis of the HMM topology

Distribution dependence: shows the dependence between the observed symbols and the hidden states

[Figure: chain of hidden states . . . qt+1, qt+2, qt+3, qt+4, . . . , each emitting the corresponding observation Ot+1, Ot+2, Ot+3, Ot+4]

⇒ better view of the dependence between the distributions
Definitions: Representations of HMM (cont'd)
Time step evolution: shows the collection of states and the transitions between states at each successive time step

[Figure: trellis with states S1, S2, S3 replicated at time steps t1, t2, t3, . . .]

⇒ makes it possible to represent non-homogeneous Markov chains
Definitions: Models of HMM
Ergodic model: every state can be reached from any state in a single step, i.e. ∀i, j : aij ≠ 0

[Figure: fully connected three-state FSA]

Left-Right/Bakis model: as time increases, the state index increases, i.e. ∀i, j : j < i → aij = 0 and

πi = { 0 if i ≠ 1
     { 1 if i = 1

[Figure: left-right chain S1 → S2 → S3 → S4]
Definitions: HMM variants
Null transition model: some transitions produce no output, i.e. jump from one state to another without producing an observable symbol

[Figure: left-right chain S1 → S2 → S3 → S4 with additional null transitions]

Tied states model: tie states whose parameters are the same. This has the advantage of reducing the number of parameters to alter when training the model

State duration model: add an explicit state duration (instead of a simple self-loop)
. . .
Basic problems [Rab90]: Overview
1 Evaluation problem
  How to efficiently compute P(O|λ) for some given O and λ?
2 State sequence problem
  How to find the state sequence Q = q1q2 . . . qT which best matches some given O and λ?
3 Training problem
  How to adjust the parameters of λ to maximize P(O|λ) for a given O?
Implementation problems [Rab90]: Overview
Scaling
  How to avoid precision loss (computer limits) due to huge computations?
  → scaling procedures
Multiple observation sequences
  How to ensure that the observation sequences run through every state of the model?
  → use multiple observation sequences
Initial estimates
  How to initialise the values of A, B and π?
  → A, π: random or uniform
  → B: manual
Implementation problems (cont'd): Overview
Insufficient training
  How is it possible to ensure that the model has been sufficiently trained?
  → increase the size of the observation set
  → reduce the size of the model
  → segment the model training
Model choice
  How to choose the model architecture, size and observation symbols?
  → trial/error
  → best practice
Part II
Basic Problems
Outline
3 Evaluation problem
  Simple approach
  Forward/Backward procedure
4 State sequence problem
5 Training problem
Simple approach
Given O = O1O2 . . . OT and λ = {A, B, π}, enumerate every possible state sequence of length T.
For a given state sequence with q1 as initial state

Q = q1q2 . . . qT

the probability of observing O for this sequence is

P(O|Q, λ) = ∏_{t=1}^{T} P(Ot|qt, λ)

which is equal to

P(O|Q, λ) = bq1(O1) bq2(O2) . . . bqT(OT)

and the probability of the sequence itself is

P(Q|λ) = πq1 aq1q2 aq2q3 . . . aqT−1qT
Simple approach (cont’d)
The joint probability of O and Q, i.e. their simultaneous occurrence, is

P(O, Q|λ) = P(O|Q, λ) P(Q|λ)

From this we can derive

P(O|λ) = ∑_{all Q} P(O|Q, λ) P(Q|λ)
       = ∑_{q1q2...qT} πq1 bq1(O1) aq1q2 bq2(O2) . . . aqT−1qT bqT(OT)

Such an approach implies:

1 (2T − 1)N^T multiplications
2 N^T − 1 additions

where N^T is the number of possible state sequences

⇒ O(2T · N^T) complexity
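For tiny models the enumeration is easy to write down; a sketch (exponential in T, so only usable for very small T):

```python
from itertools import product

def brute_force_evaluate(A, B, pi, obs):
    """P(O|lambda) by summing P(O|Q)P(Q) over all N^T state sequences,
    exactly as in the expansion above (exponential cost)."""
    N, T = len(pi), len(obs)
    total = 0.0
    for Q in product(range(N), repeat=T):
        p = pi[Q[0]] * B[Q[0]][obs[0]]
        for t in range(1, T):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][obs[t]]
        total += p
    return total
```

With the casino model of the introductory example and O = H, H this gives P(O|λ) = 0.168, summed over the four possible state sequences.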
Forward/Backward procedure: Forward variable definition
Definition
Consider the forward variable αt(i) defined as

αt(i) = P(O1O2 . . . Ot, qt = Si | λ)

which is the probability of observing O1O2 . . . Ot until time t and of being in state Si at time t, given the model λ
Forward/Backward procedure: Forward procedure algorithm
αt(i) can be computed inductively:

1 Initialization:
  α1(i) = πi bi(O1), 1 ≤ i ≤ N
2 Induction:
  αt+1(j) = [∑_{i=1}^{N} αt(i) aij] bj(Ot+1), 1 ≤ t ≤ T − 1, 1 ≤ j ≤ N
3 Termination:
  P(O|λ) = ∑_{i=1}^{N} αT(i)
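The three steps map directly onto code; a sketch:

```python
def forward(A, B, pi, obs):
    """Forward procedure: returns the full alpha table and P(O|lambda)."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                       # initialization
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(T - 1):                   # induction
        for j in range(N):
            alpha[t + 1][j] = sum(alpha[t][i] * A[i][j]
                                  for i in range(N)) * B[j][obs[t + 1]]
    return alpha, sum(alpha[T - 1])          # termination
```

On the casino model with O = H, H this returns the same 0.168 as full enumeration, but at O(N²T) cost.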
Forward/Backward procedure: Forward procedure evaluation
This procedure requires:
1 N(N + 1)(T − 1) + N multiplications
2 N(N − 1)(T − 1) additions
⇒ O(N2T ) complexity
For N = 5 and T = 100:

                 Forward   Simple
# computations   ≈ 2500    ≈ 10^72
Forward/Backward procedure: Forward procedure lattice
The computation of αt(i) can be implemented in terms of:

states i

a lattice of observations t

[Figure: lattice with states 1 . . . N on the vertical axis and observation times t = 0, 1, 2, . . . , T on the horizontal axis]
Forward/Backward procedure: Backward variable definition
Definition
Consider the backward variable βt(i) defined as

βt(i) = P(Ot+1Ot+2 . . . OT | qt = Si, λ)

which is the probability of the partial observation sequence from t + 1 to T, given state Si at time t and the model λ
Forward/Backward procedure: Backward procedure algorithm
βt(i) can be computed inductively:

1 Initialization:
  βT(i) = 1, 1 ≤ i ≤ N
2 Induction:
  βt(i) = ∑_{j=1}^{N} aij bj(Ot+1) βt+1(j), t = T − 1, T − 2, . . . , 1, 1 ≤ i ≤ N
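A matching sketch for the backward pass; the consistency check at the end uses the identity P(O|λ) = ∑_i πi bi(O1) β1(i):

```python
def backward(A, B, obs):
    """Backward procedure: beta[t][i] = P(O_{t+1}...O_T | q_t = S_i, lambda)."""
    N, T = len(A), len(obs)
    beta = [[0.0] * N for _ in range(T)]
    beta[T - 1] = [1.0] * N                  # initialization
    for t in range(T - 2, -1, -1):           # induction, t = T-1, ..., 1
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def evaluate_backward(A, B, pi, obs):
    """P(O|lambda) recovered from the backward variables."""
    beta = backward(A, B, obs)
    return sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
```

On the casino model with O = H, H this agrees with the forward procedure's 0.168.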
Outline
3 Evaluation problem
4 State sequence problem
  Problem specification
  Individually most likely state method
  Viterbi algorithm
5 Training problem
Problem specification
Observation: there is more than one way of defining the optimal state sequence.
Solutions:

1 choose the individually most likely states qt
2 find the single best state sequence

Solution 1 maximizes the number of correct states by choosing the most likely state for each t.
Solution 2 is the most widely used method; the state sequence is determined with the Viterbi algorithm.
Individually most likely state method
Let us define the probability of being in state Si at time t, given a model λ and an observation sequence O:

γt(i) = P(qt = Si | O, λ)
      = αt(i) βt(i) / P(O|λ)
      = αt(i) βt(i) / ∑_{i=1}^{N} αt(i) βt(i)

where ∑_{i=1}^{N} γt(i) = 1.

The most likely state qt at time t is

qt = argmax_{1≤i≤N} [γt(i)], 1 ≤ t ≤ T

⇒ no regard to the probability of occurrence of the sequence of states
⇒ invalid state sequences may appear
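Combining the two passes gives the individually most likely states; a sketch (forward/backward repeated here so the block is self-contained):

```python
def forward(A, B, pi, obs):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

def backward(A, B, obs):
    N, T = len(A), len(obs)
    beta = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return beta

def most_likely_states(A, B, pi, obs):
    """gamma_t(i) = alpha_t(i) beta_t(i) / sum_i alpha_t(i) beta_t(i);
    pick argmax_i gamma_t(i) independently for each t."""
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    N = len(pi)
    states = []
    for t in range(len(obs)):
        gamma = [alpha[t][i] * beta[t][i] for i in range(N)]
        norm = sum(gamma)
        gamma = [g / norm for g in gamma]          # normalizes to 1
        states.append(max(range(N), key=lambda i: gamma[i]))
    return states
```

On the casino model, a run of four tails makes the biased coin the individually most likely state at every step.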
Viterbi algorithm: Introduction
Goal: find the best state sequence Q = q1q2 . . . qT for the corresponding observation O = O1O2 . . . OT.
Let us define

δt(i) = max_{q1q2...qt−1} P[q1q2 . . . qt−1, qt = Si, O1O2 . . . Ot | λ]

which is the highest probability along a single path, at time t, which accounts for the first t observations and ends in state Si, with induction step

δt+1(j) = [max_i δt(i) aij] bj(Ot+1)

where the arguments maximizing δt+1(j) for each t and j are stored in an array ψt(j)
Viterbi algorithm: Algorithm
1 Initialization:
  δ1(i) = πi bi(O1), 1 ≤ i ≤ N
  ψ1(i) = 0
2 Recursion:
  δt(j) = max_{1≤i≤N} [δt−1(i) aij] bj(Ot), 2 ≤ t ≤ T, 1 ≤ j ≤ N
  ψt(j) = argmax_{1≤i≤N} [δt−1(i) aij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
3 Termination:
  P* = max_{1≤i≤N} [δT(i)]
  q*T = argmax_{1≤i≤N} [δT(i)]
4 State sequence backtracking:
  q*t = ψt+1(q*t+1), t = T − 1, T − 2, . . . , 1
Viterbi algorithm: Algorithm (cont'd)
The first 3 steps are quite similar to the forward procedure, with the ∑ replaced by a max
The computation may also be implemented with a lattice
Viterbi algorithm: Example [Oco06]
Let us define:

the alphabet V = {a, b, c}

the model λ = {A, B, π} given by

A = | 0.3 0.3 0.4 |   B = | 0.25 0.35 0.40 |   π = | 0.2 |
    | 0.4 0.4 0.2 |       | 0.10 0.25 0.65 |       | 0.3 |
    | 0.1 0.6 0.3 |       | 0.50 0.45 0.05 |       | 0.5 |
Viterbi algorithm: Example (cont'd)
Goal: find the path q1, q2, q3 which best matches O = a, b, c
Solution:

Initialisation:
δ1(1) = 0.2 × 0.25 = 0.05
δ1(2) = 0.3 × 0.1 = 0.03
δ1(3) = 0.5 × 0.5 = 0.25
ψ1 = [0 0 0]

Recursion:
δ2(1) = max[0.05 × 0.3, 0.03 × 0.4, 0.25 × 0.1] × 0.35 = 0.00875
δ2(2) = max[0.05 × 0.3, 0.03 × 0.4, 0.25 × 0.6] × 0.25 = 0.0375
δ2(3) = max[0.05 × 0.4, 0.03 × 0.2, 0.25 × 0.3] × 0.45 = 0.03375
ψ2 = [3 3 3]
Viterbi algorithm: Example (cont'd)
Solution:

Recursion:
δ3(1) = max[0.00875 × 0.3, 0.0375 × 0.4, 0.03375 × 0.1] × 0.40 = 0.006
δ3(2) = max[0.00875 × 0.3, 0.0375 × 0.4, 0.03375 × 0.6] × 0.65 = 0.0131625
δ3(3) = max[0.00875 × 0.4, 0.0375 × 0.2, 0.03375 × 0.3] × 0.05 = 0.00050625
ψ3 = [2 3 3]

Termination:
P* = max[0.006, 0.0131625, 0.00050625] = 0.0131625
q*T = argmax[0.006, 0.0131625, 0.00050625] = 2
Viterbi algorithm: Example (cont'd)
Solution:

Backtracking:

[Figure: trellis over observations a, b, c with states S1, S2, S3; the surviving path is read off from the ψ arrays, with final values δ3 = (0.006, 0.0131625, 0.00050625)]

⇒ Maximum probability path: q1 = S3, q2 = S3 and q3 = S2
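The worked example can be checked with a short implementation of the four steps; a sketch (0-based indices: states 0, 1, 2 stand for S1, S2, S3 and symbols 0, 1, 2 for a, b, c):

```python
def viterbi(A, B, pi, obs):
    """Best state sequence and its probability (steps 1-4 above)."""
    N, T = len(pi), len(obs)
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]   # initialization
    psi = [[0] * N]
    for t in range(1, T):                                # recursion
        drow, prow = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            prow.append(best)
            drow.append(delta[t - 1][best] * A[best][j] * B[j][obs[t]])
        delta.append(drow)
        psi.append(prow)
    p_star = max(delta[T - 1])                           # termination
    q = [delta[T - 1].index(p_star)]
    for t in range(T - 1, 0, -1):                        # backtracking
        q.insert(0, psi[t][q[0]])
    return p_star, q
```

With A, B, π and O = a, b, c as above, it returns P* = 0.0131625 and the path S3, S3, S2.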
Outline
3 Evaluation problem
4 State sequence problem
5 Training problem
  Problem specification
  HMM training
Problem specification
Observation: adjusting the model λ parameters to maximize the probability of the observed sequence is the most intricate of the three problems

Issue: there is no known way to solve the problem analytically

Solution: an iterative procedure that maximizes P(O|λ) by adjusting the λ parameters: the Baum-Welch method
(another alternative: gradient techniques)
Baum-Welch Method
Let us define

ξt(i, j) = P(qt = Si, qt+1 = Sj | O, λ)

which is the probability of being in state Si at time t and in state Sj at time t + 1.
It can also be defined in terms of the forward/backward variables:

ξt(i, j) = P(qt = Si, qt+1 = Sj, O | λ) / P(O|λ)
         = αt(i) aij bj(Ot+1) βt+1(j) / P(O|λ)
         = αt(i) aij bj(Ot+1) βt+1(j) / ∑_{i=1}^{N} ∑_{j=1}^{N} αt(i) aij bj(Ot+1) βt+1(j)
Baum-Welch Method (cont’d)
Let us define

γt(i) = ∑_{j=1}^{N} ξt(i, j)

which is the probability of being in state Si at time t for a given O and λ.
From this we can derive:

1 the expected number of transitions from Si, from 1 to T − 1, i.e.
  ∑_{t=1}^{T−1} γt(i)
2 the expected number of transitions from Si to Sj, from 1 to T − 1, i.e.
  ∑_{t=1}^{T−1} ξt(i, j)
Baum-Welch Method (cont’d)
From the previous formulas, we can derive re-estimations π', a'ij and b'j(k) of π, A and B:

1 the expected frequency (number of times) in state Si at time t = 1:

  π'i = γ1(i)   s.t. ∑_{i=1}^{N} π'i = 1

2 the ratio (expected number of transitions from Si to Sj) / (expected number of transitions from Si):

  a'ij = ∑_{t=1}^{T−1} ξt(i, j) / ∑_{t=1}^{T−1} γt(i)   s.t. ∑_{j=1}^{N} a'ij = 1, 1 ≤ i ≤ N

3 the ratio (expected number of times in Sj observing symbol vk) / (expected number of times in Sj):

  b'j(k) = ∑_{t=1, Ot=vk}^{T} γt(j) / ∑_{t=1}^{T} γt(j)   s.t. ∑_{k=1}^{M} b'j(k) = 1, 1 ≤ j ≤ N
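One full re-estimation pass can be sketched as follows (forward/backward included so the block stands alone; this is a single-sequence version without the scaling mentioned among the implementation problems):

```python
def forward(A, B, pi, obs):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

def backward(A, B, obs):
    N, T = len(A), len(obs)
    beta = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return beta

def baum_welch_step(A, B, pi, obs):
    """One re-estimation pass: compute xi and gamma from alpha/beta,
    then apply the three re-estimation formulas above."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    p_obs = sum(alpha[T - 1])
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)] for t in range(T)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_A, new_B, new_pi
```

Each pass keeps π', A' and B' stochastic and never decreases P(O|λ), which can be checked with the forward procedure.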
Baum-Welch Method (cont’d)
The re-estimated model λ' = {A', B', π'} may imply:

1 P(O|λ') < P(O|λ): model λ is more likely than λ'
2 P(O|λ') = P(O|λ): λ defines a critical point of the likelihood function
3 P(O|λ') > P(O|λ): model λ' is more likely than λ
  1 iteratively use λ' as the new model
    ⇒ the re-estimation procedure will improve the probability of observing O
  2 stop when some limiting point is reached
The final result is called the maximum likelihood estimate of the HMM
Issue: avoid local maxima
Baum-Welch Method (cont’d)
In order to maximize P(O|λ), one can use Baum's auxiliary function:

Q(λ, λ') = ∑_Q P(Q|O, λ) log[P(O, Q|λ')]

Maximizing over λ' leads to increased likelihood, i.e.:

max_{λ'} [Q(λ, λ')] ⇒ P(O|λ') ≥ P(O|λ)

In the end, the likelihood function converges to a critical point
Part III
Case Study
Outline
6 Information extraction
  Context
  Model
  Solution
7 Face recognition
Context [SMR99]
Imagine you have a library of computer science papers at your disposal. What kind of information could you extract from their headers?
title
author names
affiliation
address
notes
. . .
How could we tag such words?
Model
Each element of this list will be considered as a class that we want to extract. Each class is represented by one or more states in the model. Each state emits words from a class-specific distribution.
[Figure: HMM over header classes (start, title, author, affiliation, address, email, date, pubnum, keyword, note, abstract, end) with the learned transition probabilities on the arcs]
We can learn by training:
1 class-specific distribution
2 state transition probabilities
Solution
Label the elements from headers with classes. In order to achieve this goal:

1 treat each word from the header as an observation
2 recover the most likely state sequence with the Viterbi algorithm
3 assign the states to the words as their class tags
Outline
6 Information extraction
7 Face recognition
  Face recognition
  Model
  Solution
Context [NH]
Imagine you have a list of staff members.
An identification card containing information such as the name, the function and the face is associated with each member.
How could you develop a system that automatically recognizes the workers as they enter the company building?
More precisely, how is it possible to recognize elements such as the:
hair
forehead
eyes
nose
mouth
and find matches with the saved information?
Model
As these elements appear in a fixed order, we can define the following HMM model:
[Figure: left-right HMM SH → SF → SE → SN → SM, with self-loops a11 . . . a55 and forward transitions a12, a23, a34, a45]
Where states S mean:
SH : the hair
SF : the forehead
SE : the eyes
SN : the nose
SM : the mouth
Model (cont'd)
Let us define:

W: the image width
H: the image height
L: the segment height
P: the height of the overlap between segments
T: the number of segments constituting the image, given by:

T = (H − L)/(L − P) + 1

which must be chosen very carefully.

Each staff member is represented by an HMM face model in the DB.

[Figure: face image of width W and height H, scanned with overlapping horizontal segments of height L and overlap P]
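The segment count formula can be wrapped in a small helper; a sketch (the example values H = 128, L = 16, P = 12 are my own illustration, not from the slides):

```python
def segment_count(H, L, P):
    """T = (H - L) / (L - P) + 1, the number of overlapping
    horizontal segments covering an image of height H."""
    if L <= P:
        raise ValueError("segment height L must exceed the overlap P")
    if (H - L) % (L - P) != 0:
        raise ValueError("parameters do not tile the image height exactly")
    return (H - L) // (L - P) + 1
```

For instance, segment_count(128, 16, 12) gives (128 − 16)/(16 − 12) + 1 = 29 segments.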
Solution: Training
Face segments are converted into 2D-DCT coefficients, which form the observation vector O.
Discrete Cosine Transform: transforms an image from the spatial domain to the frequency domain.
Each HMM is trained with 5 different instances of the same face.
[Figure: training flow: training data → block extraction → feature extraction → model initialization → model re-estimation until convergence → model parameters]
Solution: Recognition
O is obtained as in the training phase.
The matching face f is such that:

λf = argmax_{1≤n≤N} P(O|λn)
[Figure: recognition flow: test image → block extraction → feature extraction → probability computation P(O|λ1), . . . , P(O|λN) → maximum selection → recognized model]
Solution: Results
Horizontal lines show classes identified with the Viterbi algorithm
Crossed images represent incorrect classifications
Part IV
Conclusion
Outline
8 Summary
9 References
Summary
We have seen that HMMs:

strongly rely on statistical theory
can be used in many fields
the choice of model highly influences the results
initialization and training are not an easy business
training is unsupervised
Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, volume 77, pages 267–296, San Francisco, CA, USA, 1990. Morgan Kaufmann Publishers Inc.

Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Learning hidden Markov model structure for information extraction. In AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.

A. Nefian and M. Hayes. Hidden Markov models for face recognition. In ICASSP '98, pages 2721–2724, 1998.

Jeff Bilmes. What HMM can do. Technical Report UWEETR-2002-0003, Dept. of EE, University of Washington, January 2002.
Daniel Ocone. Hidden Markov models. In Discrete and Probabilistic Models in Biology, chapter 8, 2006.

Thomas G. Dietterich. Machine learning for sequential data: A review. In Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, pages 15–30, London, UK, 2002. Springer-Verlag.
Thank you for your attention.
Any questions?