Parametric estimation of the multivariate probability density function. EM-algorithm. RBF network
A.Tyulpin
Measurement Systems and Digital Signal Processing laboratory, Northern (Arctic) Federal University
September, 2014
Plan

1. Examples of data classification problems
2. Classification using Bayes decision theory
3. Parametric estimation of probability density function
4. Mixture model and EM algorithm
5. EM-algorithm for Gaussian mixture model
6. Data generation using GMM
7. Radial Basis Functions
Classification problems

[Figures: examples of data classification problems]
Bayes decision rule
Binary classification problem

$X \subset \mathbb{R}^n$ – feature space; $\Omega = \{1, \dots, K\}$ – set of labels.
$X \times \Omega$ – statistical population.
$\omega_1, \dots, \omega_K$ are denoted as classes.
$X^m = \{(x_j, \omega_j) \in X \times \Omega,\; j = 1, \dots, m\}$ – data sample.
$p(x|\omega_j)$ – probability density function (pdf) of $x$ in class $\omega_j$.
$p(x, \omega)$ – joint pdf on $X \times \Omega$.
$P(\omega_i|x)$ – posterior probability of $x \in \omega_i$, $i = 1, 2$.

Bayes decision rule (two classes): assign $x$ to $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$.

Bayes decision rule ($K$ classes): assign $x$ to $\omega_l$, where $\omega_l = \arg\max_{\omega \in \Omega} P(\omega|x)$.
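Since $p(x)$ is the same for every class, comparing $p(x|\omega_j)P(\omega_j)$ is enough to find the argmax of the posterior. A minimal sketch (the classifier and its inputs below are illustrative, not from the slides):

```python
import numpy as np

def bayes_classify(x, class_pdfs, priors):
    """Assign x to the class with the largest posterior P(w_j | x).

    class_pdfs: callables p(x | w_j); priors: prior probabilities P(w_j).
    The evidence p(x) cancels out of the comparison, so the unnormalized
    scores p(x | w_j) * P(w_j) suffice.
    """
    scores = [pdf(x) * prior for pdf, prior in zip(class_pdfs, priors)]
    return int(np.argmax(scores))
```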
How to find $P(\omega_j|x)$?

Bayes rule:

$$P(\omega_j|x) = \frac{p(x|\omega_j)\, P(\omega_j)}{p(x)}$$

If we have a data sample $X^m$, it is possible to estimate $P(\omega_j)$, but the pdfs $p(x|\omega_j)$ and $p(x)$ are still unknown.

Two approaches to pdf estimation:
- Nonparametric, e.g. the Parzen window (kernel density estimation); see the sketch below.
- Parametric, e.g. maximum likelihood parameter estimation and the EM-algorithm.

There is also another approach – modelling $P(\omega_j|x)$ directly.
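As a sketch of the nonparametric route, kernel density estimation with scipy (the sample here is synthetic, purely for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Draw a two-mode synthetic sample and estimate its pdf with Gaussian kernels.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])

kde = gaussian_kde(sample)        # bandwidth chosen by Scott's rule
grid = np.linspace(-6, 6, 200)
density = kde(grid)               # estimated p(x) on the grid
```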
Maximum likelihood estimation

Let $x_1, \dots, x_m$ be random samples drawn from the pdf $p(x; \theta)$. Let $X = \{x_1, \dots, x_m\}$ and let $p(X; \theta) \equiv p(x_1, \dots, x_m; \theta)$ be the joint pdf. Assuming statistical independence between different samples, we have the likelihood function

$$p(x_1, \dots, x_m; \theta) = \prod_{k=1}^{m} p(x_k; \theta). \tag{1}$$

Maximum likelihood
The maximum likelihood (ML) method estimates $\theta$ so that the likelihood function takes its maximum value:

$$\theta_{ML} = \arg\max_{\theta} \prod_{k=1}^{m} p(x_k; \theta). \tag{2}$$
Maximum likelihood estimation

Since the logarithm is monotonically increasing,

$$\arg\max_{\theta} \prod_{k=1}^{m} p(x_k; \theta) = \arg\max_{\theta} \ln \prod_{k=1}^{m} p(x_k; \theta). \tag{3}$$

Denote the log-likelihood function

$$L(\theta) = \ln \prod_{k=1}^{m} p(x_k; \theta) = \sum_{k=1}^{m} \ln p(x_k; \theta). \tag{4}$$

It takes its maximum value where

$$\frac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{m} \frac{1}{p(x_k; \theta)} \frac{\partial p(x_k; \theta)}{\partial \theta} = 0. \tag{5}$$
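For a concrete instance, setting (5) to zero for a univariate Gaussian gives the well-known closed-form estimates; a minimal sketch on synthetic data:

```python
import numpy as np

# ML estimation for N(mu, sigma^2): solving dL/dmu = 0 and dL/dsigma^2 = 0
# yields the sample mean and the (biased) sample variance.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_ml = x.mean()                       # sample mean
sigma2_ml = ((x - mu_ml) ** 2).mean()  # ML (biased) variance estimate
```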
Normal distribution

[Figure: scatter plot of a 2-D sample drawn from a normal distribution]
Complex distribution

[Figure: scatter plot of a 2-D sample with a complex, multi-modal distribution]
Mixture model

Let the pdf of $x \in X$ be a mixture of $K$ distributions:

$$p(x) = \sum_{j=1}^{K} \pi_j p_j(x), \qquad \sum_{j=1}^{K} \pi_j = 1, \quad \pi_j \geq 0, \tag{6}$$

where $p_j(x)$ and $\pi_j \equiv P_j$ are the pdf and the prior probability of the $j$-th component of the mixture, respectively. $\varphi(x; \theta)$ is a parametric family of pdfs: $p_j(x) \equiv \varphi(x; \theta_j)$.

Separation of a mixture
Let $K$ and $X^m$ be given. We need to estimate the vector of parameters $\Theta = [\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K]$. Naive ML estimation of $p(x)$ is a very hard problem.
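Eq. (6) in code, as a small sketch (the weights and component parameters are arbitrary examples):

```python
import numpy as np
from scipy.stats import norm

# Evaluate p(x) = sum_j pi_j p_j(x) for a 1-D Gaussian mixture.
weights = np.array([0.5, 0.3, 0.2])   # pi_j, non-negative, sum to 1
means = np.array([-3.0, 0.0, 4.0])
stds = np.array([1.0, 0.5, 2.0])

def mixture_pdf(x):
    return sum(w * norm.pdf(x, mu, s) for w, mu, s in zip(weights, means, stds))
```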
EM-algorithm

EM (expectation-maximization): an algorithm for separating a mixture of distributions.

The general idea of the EM-algorithm is to repeat the following steps while $\Theta$ and $G$ are not stable:
1. $G = E(\Theta)$ (E-step)
2. $\Theta = M(\Theta, G)$ (M-step)

In the EM-algorithm we use $G = (g_{ij})_{m \times K}$ – the matrix of latent variables, where $g_{ij} \equiv P(\theta_j|x_i)$ and $\sum_{j=1}^{K} g_{ij} = 1$ for all $i = 1, \dots, m$. $G$ is very useful for computing the MLE of $p(x)$.
EM-algorithm

We can use the Bayes rule:

$$g_{ij} = \frac{\pi_j p_j(x_i)}{\sum_{s=1}^{K} \pi_s p_s(x_i)}, \tag{7}$$

so if we have $\Theta$, we can calculate $G$. This is the goal of the E-step.

Now let's look at the MLE of $p(x)$. This is an optimization problem

$$Q(\Theta) = \sum_{i=1}^{m} \ln \sum_{j=1}^{K} \pi_j p_j(x_i) \to \max, \tag{8}$$

with constraints of equality and inequality type:

$$\sum_{j=1}^{K} \pi_j = 1, \qquad \pi_j \geq 0.$$
EM-algorithm

Let's «forget» about the constraints $\pi_j \geq 0$, $j = 1, \dots, K$ for a while and use Lagrange multipliers:

$$L(\Theta; X^m) = \sum_{i=1}^{m} \ln \sum_{j=1}^{K} \pi_j p_j(x_i) - \lambda \left( \sum_{j=1}^{K} \pi_j - 1 \right). \tag{9}$$

$$\frac{\partial L}{\partial \pi_j} = \sum_{i=1}^{m} \frac{p_j(x_i)}{\sum_{s=1}^{K} \pi_s p_s(x_i)} - \lambda = 0. \tag{10}$$

Multiplying eq. (10) by $\pi_j$, summing all $K$ such equations and changing the order of summation gives

$$\sum_{i=1}^{m} \underbrace{\sum_{j=1}^{K} \frac{\pi_j p_j(x_i)}{\sum_{s=1}^{K} \pi_s p_s(x_i)}}_{1} = \lambda \underbrace{\sum_{j=1}^{K} \pi_j}_{1} \quad \Rightarrow \quad \lambda = m. \tag{11}$$
EM-algorithm

Now multiply eq. (10) by $\pi_j$ once more, substituting $\lambda = m$: for all $j = 1, \dots, K$

$$\sum_{i=1}^{m} \frac{\pi_j p_j(x_i)}{\sum_{s=1}^{K} \pi_s p_s(x_i)} = m \pi_j. \tag{12}$$

Recognizing $g_{ij}$ from eq. (7) on the left-hand side, it follows that

$$\pi_j = \frac{1}{m} \sum_{i=1}^{m} g_{ij}. \tag{13}$$

We can also see that the constraints $\pi_j \geq 0$ are satisfied if they were satisfied at the beginning.
EM-algorithm

Recall that $p_j(x) \equiv \varphi(x; \theta_j)$. Now let's find the partial derivative $\frac{\partial L}{\partial \theta_j}$:

$$\frac{\partial L}{\partial \theta_j} = \sum_{i=1}^{m} \frac{\pi_j}{\sum_{s=1}^{K} \pi_s p_s(x_i)} \frac{\partial p_j(x_i)}{\partial \theta_j} = \sum_{i=1}^{m} \underbrace{\frac{\pi_j p_j(x_i)}{\sum_{s=1}^{K} \pi_s p_s(x_i)}}_{g_{ij}} \frac{\partial}{\partial \theta_j} \ln p_j(x_i) = \frac{\partial}{\partial \theta_j} \sum_{i=1}^{m} g_{ij} \ln p_j(x_i) = 0. \tag{14}$$

The resulting problem is called weighted MLE:

$$\theta_j = \arg\max_{\theta} \sum_{i=1}^{m} g_{ij} \ln \varphi(x_i; \theta). \tag{15}$$

The goal of the M-step is to find new values of $\pi_j$ and to solve $K$ independent weighted MLE problems for $\theta_j$.
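For a Gaussian family the weighted MLE (15) has a closed form (see the next slide); a minimal numpy sketch for one component, with hypothetical responsibilities:

```python
import numpy as np

# Weighted MLE for one Gaussian component j with diagonal covariance,
# given responsibilities g[:, j] (random here, just for illustration).
rng = np.random.default_rng(2)
x = rng.normal(size=(100, 2))     # m samples, n features
g_j = rng.random(100)             # responsibilities g_ij for component j

w = g_j / g_j.sum()               # normalized weights (g_j.sum() = m * pi_j)
mu_j = w @ x                      # weighted mean
var_j = w @ (x - mu_j) ** 2       # weighted per-feature variance
```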
Last words about the EM-algorithm in general

- The EM-algorithm converges.
- $Q(\Theta)$ can have many extrema, so the algorithm can get stuck at a local extremum. A stochastic EM-algorithm might solve such problems.
- To choose $K$, we can run the EM-algorithm while sequentially increasing it.
- It is very beneficial to use well-known pdfs: Gaussian, Bernoulli, etc.

On the next slides an example of using a Gaussian Mixture Model is given.
Gaussian Mixture Model

If $\varphi(x; \theta_j) = \mathcal{N}(x; \mu_j, \Sigma_j)$, then for all $j = 1, \dots, K$:

1st case: $\Sigma_j$ is a non-diagonal matrix (not beneficial to use):

$$\mu_j = \frac{1}{m \pi_j} \sum_{i=1}^{m} g_{ij} x_i \tag{16}$$

$$\Sigma_j = \frac{1}{m \pi_j} \sum_{i=1}^{m} g_{ij} (x_i - \mu_j)(x_i - \mu_j)^T \tag{17}$$

2nd case: $\Sigma_j$ is a diagonal matrix; for all $l = 1, \dots, n$:

$$\mu_{jl} = \frac{1}{m \pi_j} \sum_{i=1}^{m} g_{ij} x_{il} \tag{18}$$

$$\sigma^2_{jl} = \frac{1}{m \pi_j} \sum_{i=1}^{m} g_{ij} (x_{il} - \mu_{jl})^2 \tag{19}$$

There are other cases, but these two are the important ones.
EM-algorithm for GMM with diagonal Σ

Data: $X^m$, $K$, $[\pi_1, \dots, \pi_K]$, $[\mu_1, \dots, \mu_K]$, $[\Sigma_1, \dots, \Sigma_K]$, $\varepsilon$
Result: $[\mu_1, \dots, \mu_K]$, $[\Sigma_1, \dots, \Sigma_K]$

G := (0)_{m×K}
repeat
    for i = 1, …, m, j = 1, …, K do
        g⁰_ij := g_ij
        g_ij := π_j N(x_i; μ_j, Σ_j) / Σ_{s=1}^{K} π_s N(x_i; μ_s, Σ_s)
    for j = 1, …, K do
        π_j := (1/m) Σ_{i=1}^{m} g_ij
        for l = 1, …, n do
            μ_jl := (1/(m π_j)) Σ_{i=1}^{m} g_ij x_il
            σ²_jl := (1/(m π_j)) Σ_{i=1}^{m} g_ij (x_il − μ_jl)²
until max_{i,j} |g_ij − g⁰_ij| < ε
return [μ_1, …, μ_K], [Σ_1, …, Σ_K]
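A numpy/scipy sketch of this algorithm, assuming diagonal covariances; the function name and interface are mine, and initialization is left to the caller:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_diag(X, pi, mu, sigma2, eps=1e-6, max_iter=200):
    """EM for a GMM with diagonal covariances.

    X: (m, n) data; pi: (K,) weights; mu, sigma2: (K, n) per-feature
    means and variances. Returns the updated parameters.
    """
    m, n = X.shape
    K = len(pi)
    G = np.zeros((m, K))
    for _ in range(max_iter):
        G_old = G.copy()
        # E-step: responsibilities via Bayes rule, eq. (7); with a diagonal
        # Sigma the component density factorizes over the n features.
        logp = np.stack([norm.logpdf(X, mu[j], np.sqrt(sigma2[j])).sum(axis=1)
                         for j in range(K)], axis=1)        # (m, K)
        w = np.log(pi) + logp
        w -= w.max(axis=1, keepdims=True)                   # stabilize exp
        G = np.exp(w)
        G /= G.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates, eqs. (13), (18), (19)
        Nj = G.sum(axis=0)                                  # = m * pi_j
        pi = Nj / m
        mu = (G.T @ X) / Nj[:, None]
        diff2 = (X[:, None, :] - mu[None, :, :]) ** 2       # (m, K, n)
        sigma2 = np.einsum('mk,mkn->kn', G, diff2) / Nj[:, None]
        if np.abs(G - G_old).max() < eps:
            break
    return pi, mu, sigma2
```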
Example of usage of GMM and EM

[Figure: the complex 2-D sample from above, fitted with a GMM, 5 components]
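In practice one would typically use a library implementation; a sketch with scikit-learn's GaussianMixture (`data` stands in for the 2-D sample in the figure and is synthetic here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in (-5.0, 0.0, 5.0)])

# Fit a 5-component GMM with diagonal covariances, as in the figure above.
gmm = GaussianMixture(n_components=5, covariance_type='diag').fit(data)
labels = gmm.predict(data)        # hard component assignments
resp = gmm.predict_proba(data)    # responsibilities g_ij
```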
Examples of data generation

Handwritten digits – 1797 images of size 8×8 pixels. 100 of them:

[Figure: grid of 100 handwritten digit images]
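The dataset matches scikit-learn's load_digits; a sketch of fitting a GMM to it and sampling new digit-like images (whether the slides used exactly this pipeline is an assumption):

```python
from sklearn.datasets import load_digits
from sklearn.mixture import GaussianMixture

X = load_digits().data                     # (1797, 64): 8x8 images, flattened

# Fit a GMM and draw new samples from the learned density.
gmm = GaussianMixture(n_components=10, covariance_type='diag').fit(X)
samples, _ = gmm.sample(100)               # 100 generated "digits"
images = samples.reshape(-1, 8, 8)
```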
[Figures: digits generated from GMMs with 3, 5, 10, 20 and 30 components]
Radial Basis Function network

Recall the Bayes decision rule: assign $x$ to $\omega_l$, where $\omega_l = \arg\max_{\omega \in \Omega} P(\omega|x)$.

Schema for the $M$-classes problem:

[Figure: RBF network schema]
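A minimal sketch of such a network: Gaussian basis functions (whose centers and widths could come from the EM-fitted GMM) followed by a linear output layer trained by least squares. All names and design choices here are illustrative:

```python
import numpy as np

def rbf_features(X, centers, widths):
    # Phi[i, j] = exp(-||x_i - c_j||^2 / (2 * w_j^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * widths ** 2))

def train_rbf(X, Y, centers, widths):
    # Least-squares fit of the output weights; Y is one-hot, shape (m, M).
    Phi = rbf_features(X, centers, widths)
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return W

def rbf_predict(X, centers, widths, W):
    # The decision mirrors the Bayes rule: pick the class with the largest output.
    return (rbf_features(X, centers, widths) @ W).argmax(axis=1)
```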