Gaussian Mixture Model (GMM) and

Hidden Markov Model (HMM)

Samudravijaya K

Most of the slides are taken from S. Umesh's tutorial on ASR (WiSSAP 2006).


Pattern Recognition

[Block diagram: input signal → signal processing → model generation (training) and pattern matching (testing) → output]

GMM: static patterns

HMM: sequential patterns


Basic Probability

Joint and Conditional probability

p(A,B) = p(A|B) p(B) = p(B|A) p(A)

Bayes’ rule

p(A|B) = p(B|A) p(A) / p(B)

If the A_i are mutually exclusive and exhaustive events,

p(B) = ∑_i p(B|A_i) p(A_i)

p(A|B) = p(B|A) p(A) / ∑_i p(B|A_i) p(A_i)


Normal Distribution

Many phenomena are described by a Gaussian pdf

p(x|θ) = (1/√(2πσ²)) exp( −(x − µ)² / (2σ²) )   (1)

The pdf is parameterised by θ = [µ, σ²], where µ is the mean and σ² is the variance.

A convenient pdf: second-order statistics are sufficient.

Example: Heights of Pygmies ⇒ Gaussian pdf with µ = 4ft & std-dev(σ) = 1ft

OR: Heights of bushmen ⇒ Gaussian pdf with µ = 6ft & std-dev(σ) = 1ft

Question: If we arbitrarily pick a person from a population ⇒ what is the probability of the height being a particular value?


If I arbitrarily pick a Pygmy, say x, then

Pr(height of x = 4′1′′) = (1/√(2π·1)) exp( −(4′1′′ − 4′)² / (2·1) )   (2)

Note: Here the mean and variance are fixed; only the observations, x, change.

Also see: Pr(x = 4′1′′) ≫ Pr(x = 5′) AND Pr(x = 4′1′′) ≫ Pr(x = 3′)
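As a small illustrative sketch (not part of the original slides), the snippet below evaluates the Gaussian density of eq. (1) for a height of 4′1′′ under the two population models described above; the numeric values simply restate the example.

```python
# Illustrative sketch: density of a height under the Pygmy model (mu = 4 ft)
# and the Bushman model (mu = 6 ft), both with sigma = 1 ft.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density, eq. (1)."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

height = 4 + 1.0 / 12.0                            # 4'1" expressed in feet
print(gaussian_pdf(height, mu=4.0, sigma=1.0))     # high density under the Pygmy model
print(gaussian_pdf(height, mu=6.0, sigma=1.0))     # much lower under the Bushman model
```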


Conversely: given that a person's height is 4′1′′ ⇒ the person is more likely to be a Pygmy than a Bushman.

If we observe the heights of many persons, say 3′6′′, 4′1′′, 3′8′′, 4′5′′, 4′7′′, 4′, 6′5′′,

and all are from the same population (i.e. either Pygmy or Bushman),

⇒ then we become more certain that the population is Pygmy.

The more the observations, the better our decision.


Likelihood Function

x[0], x[1], . . . , x[N − 1]

⇒ set of independent observations from pdf parameterised by θ.

Previous Example: x[0], x[1], . . . , x[N − 1] are heights observed and

θ is the mean of density which is unknown (σ2 assumed known).

L(X; θ) = p(x[0], . . . , x[N−1]; θ) = ∏_{i=0}^{N−1} p(x[i]; θ)

        = (1 / (2πσ²)^{N/2}) exp( −(1/(2σ²)) ∑_{i=0}^{N−1} (x[i] − θ)² )   (3)

L(X; θ) is a function of θ and is called the likelihood function.

Given x[0], . . . , x[N−1], what can we say about the value of θ, i.e. what is the best estimate of θ?


Maximum Likelihood Estimation

Example: We know height of a person x[0] = 4′4′′.

Most likely to have come from which pdf ⇒ θ = 3′, 4′6′′ or 6′ ?

Take the maximum of L(x[0]; θ = 3′), L(x[0]; θ = 4′6′′) and L(x[0]; θ = 6′) ⇒ choose θ = 4′6′′.

If θ is just a parameter, we choose arg max_θ L(x[0]; θ).


Maximum Likelihood Estimator

Given x[0], x[1], . . . , x[N − 1] and a pdf parameterised by θ = [θ1, θ2, . . . , θm−1]ᵀ,

we form the likelihood function L(X; θ) = ∏_{i=0}^{N−1} p(x[i]; θ)

θ_MLE = arg max_θ L(X; θ)

For height problem:

⇒ one can show θ_MLE = (1/N) ∑_i x[i]

⇒ Estimate of mean of Gaussian = sample mean of measured heights.
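A brief numerical sketch of this result (not from the slides): with made-up height observations and known σ, a grid search over θ lands on the sample mean.

```python
# Sketch: numerically check that the sample mean maximizes the Gaussian
# log-likelihood when sigma^2 is known (the observations are invented).
import numpy as np

heights = np.array([3.5, 4.08, 3.67, 4.42, 4.58, 4.0])   # hypothetical heights (feet)
sigma = 1.0

def log_likelihood(theta):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (heights - theta) ** 2 / (2 * sigma ** 2))

candidates = np.linspace(2.0, 6.0, 4001)
best = candidates[np.argmax([log_likelihood(t) for t in candidates])]
print(best, heights.mean())   # grid maximizer agrees with the sample mean (up to grid resolution)
```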


Bayesian Estimation

• MLE ⇒ θ is assumed unknown but deterministic

• Bayesian Approach: θ is assumed random with pdf p(θ) ⇒ Prior Knowledge.

posterior: p(θ|x) = p(x|θ) p(θ) / p(x) ∝ p(x|θ) · p(θ), where p(θ) is the prior.

• Height problem: the unknown mean is random ⇒ pdf Gaussian N(γ, ν²)

p(µ) = (1/√(2πν²)) exp( −(µ − γ)² / (2ν²) )

Then: µ_Bayesian = (σ²γ + nν² x̄) / (σ² + nν²)

⇒ Weighted average of sample mean and a prior mean
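A small sketch of this weighted-average formula; the observations, prior mean γ and prior variance ν² below are invented purely for illustration.

```python
# Sketch of the Bayesian (posterior-mean) estimate of mu with a Gaussian prior
# N(gamma, nu^2); all numbers below are made up.
import numpy as np

x = np.array([4.1, 3.9, 4.3, 4.0])       # hypothetical height observations
n, xbar = len(x), x.mean()
sigma2 = 1.0                              # known observation variance
gamma, nu2 = 4.5, 0.25                    # prior mean and prior variance

mu_bayes = (sigma2 * gamma + n * nu2 * xbar) / (sigma2 + n * nu2)
print(xbar, mu_bayes)                     # posterior estimate is pulled from xbar toward gamma
```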


Gaussian Mixture Model

p(x) = α p(x|N(µ1; σ1)) + (1 − α) p(x|N(µ2; σ2))

p(x) = ∑_{m=1}^{M} w_m p(x | N(µ_m; σ_m)),   with ∑_m w_m = 1

Characteristics of GMM:

Just as ANNs are universal approximators of functions, GMMs are universal approximators of densities (provided a sufficient number of mixture components is used); this holds for diagonal-covariance GMMs as well.


General Assumption in GMM

• Assume that there are M components.

• Each component generates data from a Gaussian with mean µm and covariance

matrix Σm.


GMM

Consider the following probability density function shown in solid blue

It is useful to parameterise or “model” this seemingly arbitrary “blue” pdf


Gaussian Mixture Model (Contd.)

Actually – pdf is a mixture of 3 Gaussians, i.e.

p(x) = c1 N(x; µ1, σ1) + c2 N(x; µ2, σ2) + c3 N(x; µ3, σ3),   with ∑ ci = 1   (4)

pdf parameters: c1, c2, c3, µ1, µ2, µ3, σ1, σ2, σ3


Observation from GMM

Experiment: An urn contains balls of 3 different colours: red, blue or green. Behind a curtain, a person picks a ball from the urn.

If red ball ⇒ generate x[i] from N(x; µ1, σ1)

If blue ball ⇒ generate x[i] from N(x; µ2, σ2)

If green ball ⇒ generate x[i] from N(x; µ3, σ3)

We have access only to observations x[0], x[1], . . . , x[N − 1]

Therefore: p(x[i]; θ) = c1 N(x[i]; µ1, σ1) + c2 N(x[i]; µ2, σ2) + c3 N(x[i]; µ3, σ3)

but we do not know which component (colour) x[i] comes from!

Can we estimate the parameters θ = [c1 c2 c3 µ1 µ2 µ3 σ1 σ2 σ3]ᵀ from the observations?

arg max_θ p(X; θ) = arg max_θ ∏_i p(x[i]; θ)   (5)
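The urn experiment can be sketched in a few lines; the weights, means and standard deviations below are hypothetical, and the component labels y are exactly the "hidden" information discussed on the following slides.

```python
# Sketch of the urn experiment: each observation is generated by first drawing a
# hidden component (ball colour) and then sampling from that component's Gaussian.
import numpy as np

rng = np.random.default_rng(0)
c  = np.array([0.3, 0.5, 0.2])            # mixture weights c1, c2, c3 (sum to 1)
mu = np.array([-2.0, 0.0, 3.0])           # component means (invented)
sd = np.array([0.5, 1.0, 0.8])            # component standard deviations (invented)

N = 1000
y = rng.choice(3, size=N, p=c)            # hidden component labels (never observed)
x = rng.normal(mu[y], sd[y])              # observed data x[0..N-1]
# In practice only x is available; estimating theta = (c, mu, sd) from x alone
# is exactly the problem EM solves on the following slides.
```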


Estimation of Parameters of GMM

Easier Problem: We know the component for each observation

Obs: x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]

Comp. 1 2 2 1 3 1 3 3 2 2 3 3 3

X1 = {x[0], x[3], x[5]} belongs to p1(x; µ1, σ1)

X2 = {x[1], x[2], x[8], x[9]} belongs to p2(x; µ2, σ2)

X3 = {x[4], x[6], x[7], x[10], x[11], x[12]} belongs to p3(x; µ3, σ3)

From X1 = {x[0], x[3], x[5]}:   c1 = 3/13   and   µ1 = (1/3)(x[0] + x[3] + x[5])

σ1² = (1/3){(x[0] − µ1)² + (x[3] − µ1)² + (x[5] − µ1)²}

In practice we do not know which observation comes from which pdf.

⇒ How do we solve for arg max_θ p(X; θ)?


Incomplete & Complete Data

x[0], x[1], . . . , x[N − 1] ⇒ incomplete data,

Introduce another set of variables y[0], y[1], . . . , y[N − 1]

such that y[i] = 1 if x[i] ∈ p1, y[i] = 2 if x[i] ∈ p2 and y[i] = 3 if x[i] ∈ p3

Obs: x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]

Comp. 1 2 2 1 3 1 3 3 2 2 3 3 3

miss: y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9] y[10] y[11] y[12]

y[i] = missing data—unobserved data ⇒ information about component

z = (x; y) is complete data ⇒ observations and which density they come from

p(z;θ) = p(x, y;θ)

⇒ θ_MLE = arg max_θ p(z; θ)

Question: But how do we find which observation belongs to which density ?


Given observation x[0] and θg, what is the probability of x[0] coming from first

distribution?

p(y[0] = 1 | x[0]; θ^g)
 = p(y[0] = 1, x[0]; θ^g) / p(x[0]; θ^g)
 = p(x[0] | y[0] = 1; µ1^g, σ1^g) · p(y[0] = 1) / ∑_{j=1}^{3} p(x[0] | y[0] = j; θ^g) p(y[0] = j)
 = p(x[0] | y[0] = 1; µ1^g, σ1^g) · c1^g / [ p(x[0] | y[0] = 1; µ1^g, σ1^g) c1^g + p(x[0] | y[0] = 2; µ2^g, σ2^g) c2^g + . . . ]

All parameters are known ⇒ we can calculate p(y[0] = 1|x[0];θg)

(Similarly calculate p(y[0] = 2 | x[0]; θ^g) and p(y[0] = 3 | x[0]; θ^g).)

Which density? ⇒ y[0] = arg max_i p(y[0] = i | x[0]; θ^g)   (hard allocation)


Parameter Estimation for Hard Allocation

                         x[0]    x[1]    x[2]    x[3]    x[4]    x[5]    x[6]
p(y[j] = 1 | x[j]; θ^g)  0.5     0.6     0.2     0.1     0.2     0.4     0.2
p(y[j] = 2 | x[j]; θ^g)  0.25    0.3     0.75    0.3     0.7     0.5     0.6
p(y[j] = 3 | x[j]; θ^g)  0.25    0.1     0.05    0.6     0.1     0.1     0.2
Hard assignment          y[0]=1  y[1]=1  y[2]=2  y[3]=3  y[4]=2  y[5]=2  y[6]=2

Updated parameters: c1 = 2/7, c2 = 4/7, c3 = 1/7 (different from the initial guess!)

Similarly (for a Gaussian) find µ_i, σ_i² for the i-th pdf.


Parameter Estimation for Soft Assignment

                         x[0]    x[1]    x[2]    x[3]    x[4]    x[5]    x[6]
p(y[j] = 1 | x[j]; θ^g)  0.5     0.6     0.2     0.1     0.2     0.4     0.2
p(y[j] = 2 | x[j]; θ^g)  0.25    0.3     0.75    0.3     0.7     0.5     0.6
p(y[j] = 3 | x[j]; θ^g)  0.25    0.1     0.05    0.6     0.1     0.1     0.2

Example: Prob. of each sample belonging to component 1

p(y[0] = 1|x[0]; θg), p(y[1] = 1|x[1]; θg) p(y[2] = 1|x[2]; θg), · · · · · ·

Average probability that a sample belongs to Comp.#1 is

c1^new = (1/N) ∑_{i=1}^{N} p(y[i] = 1 | x[i]; θ^g)

       = (0.5 + 0.6 + 0.2 + 0.1 + 0.2 + 0.4 + 0.2) / 7 = 2.2/7


Soft Assignment – Estimation of Means & Variances

Recall: Prob. of sample j belonging to component i

p(y[j] = i|x[j];θg)

Soft Assignment: Parameters estimated by taking weighted average !

µ1^new = ∑_{i=1}^{N} x[i] · p(y[i] = 1 | x[i]; θ^g) / ∑_{i=1}^{N} p(y[i] = 1 | x[i]; θ^g)

(σ1²)^new = ∑_{i=1}^{N} (x[i] − µ1)² · p(y[i] = 1 | x[i]; θ^g) / ∑_{i=1}^{N} p(y[i] = 1 | x[i]; θ^g)

These are updated parameters starting with initial guess θg


Maximum Likelihood Estimation of Parameters of GMM

1. Make an initial guess of the parameters: θ^g = {c1^g, c2^g, c3^g, µ1^g, µ2^g, µ3^g, σ1^g, σ2^g, σ3^g}

2. Knowing parameters θg, find Prob. of sample xi belonging to jth component.

p[y[i] = j | x[i]; θg] for i = 1, 2, . . . N ⇒ no. of observations

for j = 1, 2, . . . M ⇒ no. of components

3. c_j^new = (1/N) ∑_{i=1}^{N} p(y[i] = j | x[i]; θ^g)

4. µ_j^new = ∑_{i=1}^{N} x[i] · p(y[i] = j | x[i]; θ^g) / ∑_{i=1}^{N} p(y[i] = j | x[i]; θ^g)

5. (σ_j²)^new = ∑_{i=1}^{N} (x[i] − µ_j^new)² · p(y[i] = j | x[i]; θ^g) / ∑_{i=1}^{N} p(y[i] = j | x[i]; θ^g)

6. Go back to (2) and repeat until convergence
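A minimal sketch of steps 1-6 for a one-dimensional GMM (an illustrative implementation, not the tutorial's own code):

```python
# Minimal EM sketch for a 1-D GMM, following steps 1-6 above.
import numpy as np

def em_gmm_1d(x, M, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Step 1: initial guess of the parameters theta^g
    c   = np.full(M, 1.0 / M)
    mu  = rng.choice(x, size=M, replace=False)
    var = np.full(M, np.var(x))
    for _ in range(n_iter):
        # Step 2: p(y[i] = j | x[i]; theta^g) for every sample i and component j
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = c * dens
        resp /= resp.sum(axis=1, keepdims=True)          # N x M responsibilities
        Nj = resp.sum(axis=0)
        # Steps 3-5: update weights, means and variances
        c   = Nj / N
        mu  = (resp * x[:, None]).sum(axis=0) / Nj
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return c, mu, var

# Example: recover the three components of the urn experiment from x alone.
# c_hat, mu_hat, var_hat = em_gmm_1d(x, M=3)
```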


Live demonstration: http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html

Practical Issues

• EM can get stuck in local optima of the likelihood.

• EM is very sensitive to initial conditions; a good initial guess helps. The k-means algorithm is often used prior to applying the EM algorithm.


Size of a GMM

The Bayesian Information Criterion (BIC) value of a GMM can be defined as follows:

BIC(G | X) = log p(X | G) − (d/2) log N

where

G represents the GMM with the ML parameter configuration

d represents the number of parameters in G

N is the size of the dataset

the first term is the log-likelihood term;

the second term is the model complexity penalty term.

BIC selects the best GMM corresponding to the largest BIC value by trading off

these two terms.
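A sketch of BIC-based size selection, assuming scikit-learn's GaussianMixture is available; sklearn's built-in bic() uses the −2·(log-likelihood) convention (lower is better), so the definition above (larger is better) is computed explicitly. The synthetic data and the parameter count d for full covariances are assumptions of this sketch.

```python
# Sketch: choose the number of mixture components by the BIC defined above.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, 300), rng.normal(0, 1, 300),
                    rng.normal(4, 1, 300)]).reshape(-1, 1)   # synthetic 3-component data
N, D = X.shape

best_k, best_bic = None, -np.inf
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, covariance_type='full', random_state=0).fit(X)
    log_lik = gm.score(X) * N                          # log p(X | G) at the ML fit
    d = k * D + k * D * (D + 1) // 2 + (k - 1)         # free parameters: means + covariances + weights
    bic = log_lik - 0.5 * d * np.log(N)                # BIC as defined above (larger is better)
    if bic > best_bic:
        best_k, best_bic = k, bic
print(best_k)   # expected to recover 3 components for this data
```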


source: Boosting GMM and Its Two Applications, F.Wang, C.Zhang and N.Lu in N.C.Oza et al. (Eds.) LNCS 3541, pp. 12-21, 2005

The BIC criterion can discover the true GMM size effectively as shown in the figure.


Maximum A Posteriori (MAP)

• Sometimes it is difficult to get a sufficient number of examples for robust estimation of parameters.

• However, one may have access to a large number of similar examples which can be utilised.

• Adapt the target distribution from such a distribution. For example, adapt a speaker-independent model to a new speaker using a small amount of adaptation data.


MAP Adaptation

ML Estimation

θ_MLE = arg max_θ p(X | θ)

MAP Estimation

θ_MAP = arg max_θ p(θ | X)
      = arg max_θ p(X | θ) p(θ) / p(X)
      = arg max_θ p(X | θ) p(θ)

p(θ) is the a priori distribution of parameters θ.

A conjugate prior is chosen such that the corresponding posterior belongs to the

same functional family as the prior.


Simple Implementation of MAP-GMMs

Source: Statistical Machine Learning from Data: GMM; Samy Bengio


Simple Implementation

Train a prior model p with a large amount of available data (say, from multiple speakers). Adapt the parameters to a new speaker using some adaptation data (X).

Let α ∈ [0, 1] be a parameter that describes the faith in the prior model.

Adapted weight of the j-th mixture:

w_j = γ [ α w_j^p + (1 − α) ∑_i p(j | x_i) ]

Here γ is a normalisation factor such that ∑_j w_j = 1.


Simple Implementation (contd.)

means:

µ_j = α µ_j^p + (1 − α) ∑_i p(j | x_i) x_i / ∑_i p(j | x_i)   ⇒ weighted average of the prior mean and the data mean

variances:

Σ_j = α ( Σ_j^p + µ_j^p µ_j^pᵀ ) + (1 − α) ∑_i p(j | x_i) x_i x_iᵀ / ∑_i p(j | x_i) − µ_j µ_jᵀ
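A sketch of these adaptation equations for a one-dimensional GMM; the prior model (w_p, mu_p, var_p), the adaptation data and α are placeholders, and the responsibilities p(j | x_i) are computed under the prior model. No numerical safeguards are included.

```python
# Sketch of MAP adaptation of a 1-D GMM following the equations above.
import numpy as np

def map_adapt_1d(x, w_p, mu_p, var_p, alpha=0.5):
    x = np.asarray(x, dtype=float)
    # responsibilities p(j | x_i) under the prior model
    dens = np.exp(-0.5 * (x[:, None] - mu_p) ** 2 / var_p) / np.sqrt(2 * np.pi * var_p)
    post = w_p * dens
    post /= post.sum(axis=1, keepdims=True)
    Nj = post.sum(axis=0)                      # sum_i p(j | x_i)

    # adapted weights; the final normalisation plays the role of gamma
    w = alpha * w_p + (1 - alpha) * Nj
    w /= w.sum()
    # adapted means: prior mean blended with the data mean
    mu = alpha * mu_p + (1 - alpha) * (post * x[:, None]).sum(axis=0) / Nj
    # adapted variances via second moments, as in the equation above
    second = alpha * (var_p + mu_p ** 2) + (1 - alpha) * (post * x[:, None] ** 2).sum(axis=0) / Nj
    var = second - mu ** 2
    return w, mu, var
```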


HMM

• The primary role of the speech signal is to carry a message; a sequence of sounds (phonemes) encodes a sequence of words.

• The acoustic manifestation of a phoneme is mostly determined by:

– Configuration of articulators (jaw, tongue, lip)

– physiology and emotional state of speaker

– Phonetic context

• HMM models sequential patterns; speech is a sequential pattern

• Most text dependent speaker recognition systems use HMMs

• Text verification involves verification/recognition of phonemes


Phoneme recognition

Consider two phonemes classes /aa/ and /iy/.

Problem: Determine to which class a given sound belongs.

Processing of speech signal results in a sequence of feature (observation) vectors:

o1, . . . , oT (say MFCC vectors)

We say the speech is /aa/ if: p(aa|O) > p(iy|O)

Using Bayes Rule

p(O|aa) p(aa) / p(O)   vs.   p(O|iy) p(iy) / p(O)

where p(O|·) is the acoustic model and p(·) is the prior probability.

Given p(O|aa), p(aa), p(O|iy) and p(iy) ⇒ which is more probable ?


Parameter Estimation of Acoustic Model

How do we find the density functions p_aa(·) and p_iy(·)?

We assume a parametric model: p_aa(·) parameterised by θ_aa, and p_iy(·) parameterised by θ_iy.

Training Phase: Collect many examples of /aa/ being said

⇒ Compute corresponding observations o_1, . . . , o_{T_aa}

Use the Maximum Likelihood Principle

θ_aa = arg max_{θ_aa} p(O; θ_aa)

Recall: if the pdf is modelled as a Gaussian Mixture Model,
⇒ then we use the EM algorithm.


Modelling of Phoneme

To enunciate /aa/ in a word ⇒ our articulators move from the configuration of the previous phoneme to that of /aa/, and then proceed to the configuration of the next phoneme.

Can think of 3 distinct time periods:

⇒ Transition from previous phoneme

⇒ Steady state

⇒ Transition to next phoneme

Features for the 3 time-intervals are quite different

⇒ Use different density functions to model the three time intervals

⇒ model as p_aa1(·; θ_aa1), p_aa2(·; θ_aa2), p_aa3(·; θ_aa3)

Also need to model the time durations of these time-intervals – transition probs.


Stochastic Model (HMM)

[Figure: 3-state HMM; each state has an output density p(f) over frequency f (Hz), with self-transition probability a11 and forward-transition probability a12 between states 1, 2 and 3]


HMM Model of Phoneme

• Use the term "state" for each of the three time periods.

• Prob. of o_t from the j-th state, i.e. p_aa,j(o_t; θ_aa,j) ⇒ denoted as b_j(o_t)

[Figure: left-to-right HMM with states 1, 2, 3 and state densities p(·; θ_aa1), p(·; θ_aa2), p(·; θ_aa3) generating observations o1, o2, o3, . . . , o10]

• Observation, ot, is generated by which state density?

– Only observations are seen, the state-sequence is “hidden”

– Recall: in a GMM, the mixture component is "hidden"


Probability of Observation

Recall: To classify, we evaluate Pr(O|Λ) – where Λ are parameters of models

In /aa/ Vs /iy/ calculation: ⇒ Pr(O|Λaa) Vs Pr(O|Λiy)

Example: 2-state HMM model and 3 observations o1 o2 o3

Model parameters are assumed known:

Transition Prob. ⇒ a11, a12, a21, and a22 – model time durations

State density ⇒ b1(ot) and b2(ot).

b_j(o_t) is usually modelled as a single Gaussian with parameters µ_j, σ_j², or by a GMM.


Probability of Observation through one Path

T = 3 observations and N = 2 nodes ⇒ 2³ = 8 paths through the 2 nodes for the 3 observations

Example: Path P1 through states 1, 1, 1.

Pr{O|P1,Λ} = b1(o1) · b1(o2) · b1(o3)

Prob. of Path P1 = Pr{P1|Λ} = a01 · a11 · a11

Pr{O, P1|Λ} = Pr{O|P1,Λ} · Pr{P1|Λ} = a01b1(o1).a11b1(o2).a11b1(o3)


Probability of Observation

Path   o1  o2  o3   p(O, Pi | Λ)
P1     1   1   1    a01 b1(o1) · a11 b1(o2) · a11 b1(o3)
P2     1   1   2    a01 b1(o1) · a11 b1(o2) · a12 b2(o3)
P3     1   2   1    a01 b1(o1) · a12 b2(o2) · a21 b1(o3)
P4     1   2   2    a01 b1(o1) · a12 b2(o2) · a22 b2(o3)
P5     2   1   1    a02 b2(o1) · a21 b1(o2) · a11 b1(o3)
P6     2   1   2    a02 b2(o1) · a21 b1(o2) · a12 b2(o3)
P7     2   2   1    a02 b2(o1) · a22 b2(o2) · a21 b1(o3)
P8     2   2   2    a02 b2(o1) · a22 b2(o2) · a22 b2(o3)

p(O|Λ) = ∑_{Pi} P{O, Pi | Λ} = ∑_{Pi} P{O | Pi, Λ} · P{Pi | Λ}

Forward Algorithm ⇒ avoid repeated calculations:

a01 b1(o1) a11 b1(o2) · a11 b1(o3) + a02 b2(o1) a21 b1(o2) · a11 b1(o3)     (two multiplications by a11 b1(o3))
= [ a01 b1(o1) a11 b1(o2) + a02 b2(o1) a21 b1(o2) ] · a11 b1(o3)             (one multiplication)


Forward Algorithm – Recursion

Let α1(t = 1) = a01 b1(o1)

Let α2(t = 1) = a02 b2(o1)

Recursion: α1(t = 2) = [ a01 b1(o1) · a11 + a02 b2(o1) · a21 ] · b1(o2)
                     = [ α1(t = 1) · a11 + α2(t = 1) · a21 ] · b1(o2)


General Recursion in Forward Algorithm

α_j(t) = [ ∑_i α_i(t − 1) a_ij ] · b_j(o_t) = P{o1, o2, . . . , o_t, s_t = j | Λ}

Note: α_j(t) is the sum of probabilities of all paths ending at node j at time t with partial observation sequence o1, o2, . . . , o_t.

The probability of the entire observation (o1,o2, . . . ,oT ), therefore, is

p(O|Λ) = ∑_{j=1}^{N} P{o1, o2, . . . , o_T, s_T = j | Λ} = ∑_{j=1}^{N} α_j(T)

where N = number of nodes (states)
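A compact sketch of this recursion; the array conventions (a0 for the entry probabilities a_{0j}, A[i, j] = a_ij, B[j, t] = b_j(o_t)) are choices of this sketch, not notation from the slides.

```python
# Sketch of the forward recursion for p(O | Lambda).
import numpy as np

def forward(a0, A, B):
    N, T = B.shape
    alpha = np.zeros((N, T))
    alpha[:, 0] = a0 * B[:, 0]                          # alpha_j(1) = a_{0j} b_j(o_1)
    for t in range(1, T):
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, t]   # [sum_i alpha_i(t-1) a_ij] * b_j(o_t)
    return alpha, alpha[:, -1].sum()                    # p(O | Lambda) = sum_j alpha_j(T)

# Example with a hypothetical 2-state model and 3 observations:
# a0 = np.array([0.5, 0.5]); A = np.array([[0.6, 0.4], [0.3, 0.7]])
# B = np.array([[0.5, 0.4, 0.7], [0.1, 0.6, 0.2]])      # b_j(o_t) for t = 1..3
# alpha, p_O = forward(a0, A, B)
```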


Backward Algorithm

• analogous to Forward, but coming from the last time instant T

Example: a01 b1(o1) a11 b1(o2) · a11 b1(o3) + a01 b1(o1) a11 b1(o2) · a12 b2(o3) + . . .
= [ a01 b1(o1) a11 b1(o2) ] · ( a11 b1(o3) + a12 b2(o3) )

β1(t = 2) = p{o3 | s_{t=2} = 1, Λ}
          = p{o3, s_{t=3} = 1 | s_{t=2} = 1; Λ} + p{o3, s_{t=3} = 2 | s_{t=2} = 1, Λ}
          = p{o3 | s_{t=3} = 1, s_{t=2} = 1, Λ} · p{s_{t=3} = 1 | s_{t=2} = 1, Λ}
            + p{o3 | s_{t=3} = 2, s_{t=2} = 1, Λ} · p{s_{t=3} = 2 | s_{t=2} = 1, Λ}
          = b1(o3) · a11 + b2(o3) · a12


General Recursion in Backward Algorithm

β_j(t): given that we are at node j at time t,
⇒ the sum of probabilities of all paths such that the partial sequence o_{t+1}, . . . , o_T is observed.

β_i(t) = ∑_{j=1}^{N} [ a_ij b_j(o_{t+1}) ] β_j(t + 1) = p{o_{t+1}, . . . , o_T | s_t = i, Λ}

(a_ij b_j(o_{t+1}): going from node i to each node j; β_j(t + 1): probability of observing o_{t+2}, . . . , o_T given that we are in node j at t + 1)
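A matching sketch of the backward recursion, using the same array conventions as the forward sketch above (A[i, j] = a_ij, B[j, t] = b_j(o_t)):

```python
# Sketch of the backward recursion.
import numpy as np

def backward(A, B):
    N, T = B.shape
    beta = np.zeros((N, T))
    beta[:, -1] = 1.0                                        # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        beta[:, t] = A @ (B[:, t + 1] * beta[:, t + 1])      # sum_j a_ij b_j(o_{t+1}) beta_j(t+1)
    return beta
```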


Estimation of Parameters of HMM Model

• Given known Model parameters, Λ:

– Evaluate p(O|Λ) ⇒ useful for classification

– Efficient Implementation: Use Forward or Backward Algo.

• Given set of observation vectors, ot how do we estimate parameters of HMM?

– Do not know which states ot come from

∗ Analogous to GMM – do not know which component

– Use a special case of EM – Baum-Welch Algorithm

– Use following relations from Forward/Backward

p(O|Λ) = α_N(T) = β_1(T) = ∑_{j=1}^{N} α_j(t) β_j(t)


Parameter Estimation for Known State Sequence

Assume each state is modelled as a single Gaussian:

µj = Sample mean of observations assigned to state j.

σ2j = Variance of the observations assigned to state j.

and

Trans. prob. from state i to j = (no. of times a transition was made from i to j) / (total number of times a transition was made from i)

In practice since we do not know which state generated the observation

⇒ So we will do probabilistic assignment.


Review of GMM Parameter Estimation

Do not know which component of the GMM generated output observation.

Given initial model parameters Λg, and observation sequence x1, . . . , xT .

Find probability xi comes from component j ⇒ Soft Assignment

p[component = 1 | x_i; Λ^g] = p[component = 1, x_i | Λ^g] / p[x_i | Λ^g]

So, the re-estimation equations are:

c_j^new = (1/T) ∑_{i=1}^{T} p(comp = j | x_i; Λ^g)

µ_j^new = ∑_{i=1}^{T} x_i p(comp = j | x_i; Λ^g) / ∑_{i=1}^{T} p(comp = j | x_i; Λ^g)

(σ_j²)^new = ∑_{i=1}^{T} (x_i − µ_j^new)² p(comp = j | x_i; Λ^g) / ∑_{i=1}^{T} p(comp = j | x_i; Λ^g)

A similar analogy holds for hidden Markov models


Baum-Welch Algorithm

Here: We do not know which observation ot comes from which state si

Again, as with the GMM, we assume an initial guess of the parameters, Λ^g.

Then prob. of being in “state=i at time=t” and “state=j at time=t+1” is

τ_t(i, j) = p{q_t = i, q_{t+1} = j | O, Λ^g} = p{q_t = i, q_{t+1} = j, O | Λ^g} / p{O | Λ^g}

where p{O | Λ^g} = α_N(T) = ∑_i α_i(T)


From ideas of Forward-Backward Algorithm, numerator is

p{q_t = i, q_{t+1} = j, O | Λ^g} = α_i(t) · a_ij b_j(o_{t+1}) · β_j(t + 1)

So τ_t(i, j) = α_i(t) · a_ij b_j(o_{t+1}) · β_j(t + 1) / α_N(T)


Estimating Transition Probability

Trans. prob. from state i to j = (no. of times a transition was made from i to j) / (total number of times a transition was made from i)

τt(i, j) ⇒ prob. of being in “state=i at time=t” and “state=j at time=t+1”

If we sum τ_t(i, j) over all time instants, we get the expected number of times the system was in state i and made a transition to state j. So, a revised estimate of the transition probability is

a_ij^new = ∑_{t=1}^{T−1} τ_t(i, j) / ∑_{t=1}^{T−1} ∑_{j=1}^{N} τ_t(i, j)

(the denominator counts all transitions out of state i at time t)


Estimating State-Density Parameters

Analogous to the GMM case (where we did not know which observation belonged to which component), the new estimates of the state pdf parameters are (assuming a single Gaussian per state):

µ_i = ∑_{t=1}^{T} γ_i(t) o_t / ∑_{t=1}^{T} γ_i(t)

Σ_i = ∑_{t=1}^{T} γ_i(t) (o_t − µ_i)(o_t − µ_i)ᵀ / ∑_{t=1}^{T} γ_i(t)

where γ_i(t) = p{s_t = i | O, Λ^g} = ∑_j τ_t(i, j) is the probability of being in state i at time t.

These are weighted averages ⇒ weighted by the probability of being in state i at time t

– Given observation ⇒ HMM model parameters estimated iteratively

– p(O|Λ) ⇒ evaluated efficiently by Forward/Backward algorithm
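A sketch of one Baum-Welch re-estimation pass for an HMM with single-Gaussian state densities, combining the τ, transition and state-density updates above; the array conventions and the absence of scaling or other numerical safeguards are simplifications of this sketch.

```python
# Sketch: one Baum-Welch pass. a0, A, mu, var are the current guess Lambda^g;
# obs is a 1-D observation sequence. Purely illustrative.
import numpy as np

def baum_welch_step(obs, a0, A, mu, var):
    obs = np.asarray(obs, dtype=float)
    N, T = len(mu), len(obs)
    # state densities b_j(o_t) as single Gaussians
    B = np.exp(-0.5 * (obs - mu[:, None]) ** 2 / var[:, None]) / np.sqrt(2 * np.pi * var[:, None])

    # forward and backward passes
    alpha = np.zeros((N, T)); beta = np.zeros((N, T))
    alpha[:, 0] = a0 * B[:, 0]
    for t in range(1, T):
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, t]
    beta[:, -1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[:, t] = A @ (B[:, t + 1] * beta[:, t + 1])
    p_O = alpha[:, -1].sum()

    gamma = alpha * beta / p_O                                   # gamma_i(t) = P(s_t = i | O)
    # tau[t, i, j] = alpha_i(t) a_ij b_j(o_{t+1}) beta_j(t+1) / p(O)
    tau = (alpha[:, :T - 1].T[:, :, None] * A[None, :, :]
           * (B[:, 1:] * beta[:, 1:]).T[:, None, :]) / p_O

    # re-estimation, as in the slides
    A_new = tau.sum(axis=0) / gamma[:, :T - 1].sum(axis=1)[:, None]
    mu_new = (gamma * obs).sum(axis=1) / gamma.sum(axis=1)
    var_new = (gamma * (obs - mu_new[:, None]) ** 2).sum(axis=1) / gamma.sum(axis=1)
    a0_new = gamma[:, 0]
    return a0_new, A_new, mu_new, var_new
```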


Viterbi Algorithm

Given the observation sequence,

• the goal is to find corresponding state-sequence that generated it

• there are many possible combinations (N^T) of state sequences ⇒ many paths.

One possible criterion: choose the state sequence corresponding to the path with maximum probability,

max_i P{O, P_i | Λ}

Word: represented as a sequence of phones

Phone: represented as a sequence of states

Optimal state-sequence ⇒ Optimal phone-sequence ⇒ Word sequence


Viterbi Algorithm and Forward Algorithm

Recall the Forward Algorithm: we found the probability of each path and summed over all possible paths:

∑_{i=1}^{N^T} p{O, P_i | Λ}

Viterbi is just a special case of the Forward algorithm: at each node, instead of taking the sum of the probabilities of all paths, choose the path with maximum probability.

In Practice: p(O|Λ) approximated by Viterbi (instead of Forward Algo.)
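A sketch of the Viterbi recursion with the same conventions as the forward-algorithm sketch above; the sum over predecessor states is replaced by a max, and back-pointers recover the best state sequence.

```python
# Sketch of the Viterbi recursion.
import numpy as np

def viterbi(a0, A, B):
    N, T = B.shape
    delta = np.zeros((N, T))
    psi = np.zeros((N, T), dtype=int)
    delta[:, 0] = a0 * B[:, 0]
    for t in range(1, T):
        trans = delta[:, t - 1, None] * A              # trans[i, j] = delta_i(t-1) a_ij
        psi[:, t] = trans.argmax(axis=0)               # best predecessor for each state j
        delta[:, t] = trans.max(axis=0) * B[:, t]
    # backtrack the most probable state sequence
    path = np.zeros(T, dtype=int)
    path[-1] = delta[:, -1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[path[t + 1], t + 1]
    return path, delta[:, -1].max()                    # best path and its probability
```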


Decoding

• Recall: the desired transcription W is obtained by maximising W = arg max_W p(W | O)

• Search over all possible W – astronomically large!

• Viterbi Search – find most likely path through a HMM

– Sequence of phones (states) which is most probable

– Mostly: the most probable sequence of phones corresponds to the most probable sequence of words


Training of a Speech Recognition System

– HMM parameters are estimated using large databases (~100 hours of speech)

∗ Parameters are estimated using the Maximum Likelihood criterion


Speaker Recognition

Spectra (formants) of a given sound are different for different speakers.

Spectra of 2 speakers for one “frame” of /iy/

Derive a speaker-dependent model of a new speaker by MAP adaptation of a Speaker-Independent (SI) model using a small amount of adaptation data; use it for speaker recognition.


References

• Pattern Classification, R.O.Duda, P.E.Hart and D.G.Stork, John Wiley, 2001.

• Introduction to Statistical Pattern Recognition, K.Fukunaga, Academic Press,

1990.

• The EM Algorithm and Extensions, Geoffrey J. McLachlan and Thriyambakam

Krishnan, Wiley-Interscience; 2nd edition, 2008. ISBN-10: 0471201707

• Fundamentals of Speech Recognition, Lawrence Rabiner & Biing-Hwang Juang,

Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing Series), c1993, ISBN

0-13-015157-2


• Hidden Markov models for speech recognition, X.D. Huang, Y. Ariki, M.A. Jack.

Edinburgh: Edinburgh University Press, c1990.

• Statistical methods for speech recognition, F.Jelinek, The MIT Press, Cambridge,

MA., 1998.

• Maximum Likelihood from Incomplete Data via the EM Algorithm, A.P. Dempster, N.M. Laird and D.B. Rubin, Journal of the Royal Statistical Society, Series B, 39(1), pp. 1-38, 1977.

• Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, J.-L. Gauvain and C.-H. Lee, IEEE Trans. Speech and Audio Processing, 2(2), pp. 291-298, 1994.

• A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, J.A. Bilmes, ICSI, TR-97-021.

• Boosting GMM and Its Two Applications, F.Wang, C.Zhang and N.Lu in

N.C.Oza et al. (Eds.) LNCS 3541, pp. 12-21, 2005.
