LECTURE 04: MAXIMUM LIKELIHOOD ESTIMATION

Transcript
Page 1: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8443 – Pattern Recognition
ECE 8527 – Introduction to Machine Learning and Pattern Recognition

LECTURE 04: MAXIMUM LIKELIHOOD ESTIMATION

• Objectives: Discrete Features, Maximum Likelihood, Bias in ML Estimates, Bayesian Estimation, Example

• Resources: D.H.S: Chapter 3 (Part 1), D.H.S.: Chapter 3 (Part 2), J.O.S.: Tutorial, Nebula: Links, BGSU: Example, A.W.M.: Tutorial, A.W.M.: Links, S.P.: Primer, CSRN: Unbiased, A.W.M.: Bias, Wiki: ML, M.Y.: ML Tutorial, J.O.S.: Bayesian Est., J.H.: Euro Coin

Page 2: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 2

• For problems where features are discrete:

$$\int p(\mathbf{x}|\omega_j)\,d\mathbf{x} \rightarrow \sum_{\mathbf{x}} P(\mathbf{x}|\omega_j)$$

• Bayes formula involves probabilities (not densities):

$$P(\omega_j|\mathbf{x}) = \frac{P(\mathbf{x}|\omega_j)\,P(\omega_j)}{P(\mathbf{x})}$$

where

$$P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x}|\omega_j)\,P(\omega_j)$$

• Bayes rule remains the same:

$$\alpha^* = \arg\min_i R(\alpha_i|\mathbf{x})$$

• The maximum entropy distribution is a uniform distribution:

$$P(x = x_i) = \frac{1}{N}$$

Discrete Features
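
As a concrete illustration of the discrete-feature form of Bayes formula above, here is a minimal sketch in Python/NumPy; the probability tables, priors, and observed value are made-up numbers for illustration only:

```python
import numpy as np

# Hypothetical discrete feature with N = 4 possible values and c = 2 classes.
# Each row is P(x|w_j) over the feature values and sums to 1.
P_x_given_w = np.array([[0.1, 0.2, 0.3, 0.4],   # P(x|w_1)
                        [0.4, 0.3, 0.2, 0.1]])  # P(x|w_2)
P_w = np.array([0.6, 0.4])                      # priors P(w_j)

x = 2  # observed feature value (index into the tables)

# Bayes formula with probabilities instead of densities:
# P(w_j|x) = P(x|w_j) P(w_j) / sum_j P(x|w_j) P(w_j)
joint = P_x_given_w[:, x] * P_w
posterior = joint / joint.sum()
print(posterior)                 # -> [0.692..., 0.307...]
decision = np.argmax(posterior)  # minimum-error rule (zero-one loss)
```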

Page 3: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 3

• Consider independent binary features: $\mathbf{x} = (x_1, \ldots, x_d)^t$, where $x_i \in \{0,1\}$, with $p_i = P(x_i = 1|\omega_1)$ and $q_i = P(x_i = 1|\omega_2)$.

• Assuming conditional independence:

$$P(\mathbf{x}|\omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1 - p_i)^{1 - x_i}$$

$$P(\mathbf{x}|\omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1 - q_i)^{1 - x_i}$$

• The likelihood ratio is:

$$\frac{P(\mathbf{x}|\omega_1)}{P(\mathbf{x}|\omega_2)} = \prod_{i=1}^{d}\left(\frac{p_i}{q_i}\right)^{x_i}\left(\frac{1 - p_i}{1 - q_i}\right)^{1 - x_i}$$

• The discriminant function is:

$$g(\mathbf{x}) = \sum_{i=1}^{d}\left[x_i \ln\frac{p_i}{q_i} + (1 - x_i)\ln\frac{1 - p_i}{1 - q_i}\right] + \ln\frac{P(\omega_1)}{P(\omega_2)}$$

Discriminant Functions For Discrete Features
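
A minimal sketch of the log likelihood-ratio discriminant above, assuming made-up values for p_i, q_i, and the priors:

```python
import numpy as np

# Hypothetical parameters for d = 3 independent binary features.
p = np.array([0.8, 0.6, 0.7])   # p_i = P(x_i = 1 | w_1)
q = np.array([0.3, 0.5, 0.2])   # q_i = P(x_i = 1 | w_2)
P_w1, P_w2 = 0.5, 0.5           # priors

def g(x):
    """Log likelihood-ratio discriminant; decide w_1 if g(x) > 0."""
    x = np.asarray(x, dtype=float)
    return np.sum(x * np.log(p / q) + (1 - x) * np.log((1 - p) / (1 - q))) \
           + np.log(P_w1 / P_w2)

print(g([1, 1, 0]))   # positive -> decide w_1
print(g([0, 0, 1]))   # negative -> decide w_2
```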

Page 4: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 4

• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and class-conditional densities, p(x|ωi).

• What can we do if we do not have this information?

• What limitations do we face?

• There are two common approaches to parameter estimation: maximum likelihood and Bayesian estimation.

• Maximum Likelihood: treat the parameters as quantities whose values are fixed but unknown.

• Bayes: treat the parameters as random variables having some known prior distribution. Observation of samples converts this to a posterior.

• Bayesian Learning: sharpen the a posteriori density, causing it to peak near the true value.

Introduction to Maximum Likelihood Estimation

Page 5: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 5

• I.I.D.: c data sets, D1,...,Dc, where Dj is drawn independently according to p(x|ωj).

• Assume p(x|ωj) has a known parametric form and is completely determined by the parameter vector θj (e.g., p(x|ωj) ~ N(μj, Σj), where θj consists of the components of μj and Σj).

• p(x|ωj) has an explicit dependence on θj: p(x|ωj, θj)

• Use training samples to estimate θ1, θ2,..., θc

• Functional independence: assume Di gives no useful information about θj for i ≠ j.

• Simplifies notation to a set D of training samples (x1,... xn) drawn independently from p(x|ω) to estimate ω.

• Because the samples were drawn independently:

$$p(D|\theta) = \prod_{k=1}^{n} p(\mathbf{x}_k|\theta)$$

General Principle
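
Because the likelihood factors over independent samples, it is usually computed as a sum of log terms. A short sketch, assuming a univariate Gaussian model and hypothetical data:

```python
import numpy as np

def log_likelihood(D, mu, sigma2):
    """l(theta) = sum_k ln p(x_k | theta) for a univariate Gaussian N(mu, sigma2)."""
    D = np.asarray(D, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (D - mu) ** 2 / (2 * sigma2))

D = np.array([1.2, 0.7, 1.9, 1.1, 0.4])   # hypothetical training samples
print(log_likelihood(D, mu=1.0, sigma2=0.5))
```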

Page 6: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 6

• p(D|θ) is called the likelihood of θ with respect to the data.

• Given several training points.

• Top: candidate source distributions are shown. Which distribution is the ML estimate?

• Middle: an estimate of the likelihood of the data as a function of θ (the mean).

• Bottom: log likelihood.

• The value of θ that maximizes this likelihood, denoted $\hat{\theta}$, is the maximum likelihood estimate (ML) of θ.

Example of ML Estimation
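
The figure described above can be reproduced numerically by scanning candidate values of θ (the mean) and evaluating the log likelihood of the data at each one; the data, the assumed known variance, and the grid are arbitrary choices:

```python
import numpy as np

D = np.array([1.2, 0.7, 1.9, 1.1, 0.4])     # hypothetical training points
sigma2 = 0.25                                # assume the variance is known
thetas = np.linspace(-1.0, 3.0, 401)         # candidate means

# Log likelihood of the data for each candidate theta (the "bottom" curve).
ll = np.array([np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                      - (D - t) ** 2 / (2 * sigma2)) for t in thetas])

theta_hat = thetas[np.argmax(ll)]
print(theta_hat, D.mean())   # the grid maximizer is close to the sample mean
```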

Page 7: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 7

• Let $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^t$ and let $\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$.

• Define the log likelihood:

$$l(\theta) \equiv \ln p(D|\theta) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k|\theta), \qquad \hat{\theta} = \arg\max_\theta\, l(\theta)$$

• The ML estimate is found by solving this equation:

$$\nabla_\theta\, l = \sum_{k=1}^{n} \nabla_\theta \ln p(\mathbf{x}_k|\theta) = 0$$

• The solution to this equation can be a global maximum, a local maximum, or even an inflection point.

• Under what conditions is it a global maximum?

General Mathematics
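
When ∇θ l = 0 has no convenient closed-form solution, the maximization can be carried out numerically. A sketch using scipy.optimize.minimize on the negative log likelihood, assuming hypothetical univariate Gaussian data (for the Gaussian, the result should match the sample mean and the biased sample variance):

```python
import numpy as np
from scipy.optimize import minimize

D = np.array([1.2, 0.7, 1.9, 1.1, 0.4])   # hypothetical data

def neg_log_likelihood(theta):
    mu, log_sigma2 = theta                  # optimize log(sigma^2) so it stays positive
    sigma2 = np.exp(log_sigma2)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (D - mu) ** 2 / (2 * sigma2))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma2_hat)                   # ~ sample mean and biased sample variance
```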

Page 8: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 8

• A class of estimators – maximum a posteriori (MAP) – maximize $l(\theta)\,p(\theta)$, where $p(\theta)$ describes the prior probability of different parameter values.

• An ML estimator is a MAP estimator for uniform priors.

• A MAP estimator finds the peak, or mode, of a posterior density.

• MAP estimators are not transformation invariant (if we perform a nonlinear transformation of the input data, the estimator is no longer optimum in the new space). This observation will be useful later in the course.

Maximum A Posteriori Estimation
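
A short sketch of the relationship between ML and MAP described above: with a uniform prior the two coincide, while an informative prior pulls the MAP estimate (the mode of the posterior) toward the prior mean. The data, the known variance, and the Gaussian prior are hypothetical choices:

```python
import numpy as np

D = np.array([1.2, 0.7, 1.9, 1.1, 0.4])     # hypothetical data
sigma2 = 0.25                                 # assumed known variance
mu0, sigma0_2 = 0.0, 1.0                      # hypothetical Gaussian prior on the mean

thetas = np.linspace(-1.0, 3.0, 4001)
log_lik = np.array([np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                           - (D - t) ** 2 / (2 * sigma2)) for t in thetas])
log_prior = -0.5 * np.log(2 * np.pi * sigma0_2) - (thetas - mu0) ** 2 / (2 * sigma0_2)

theta_ml  = thetas[np.argmax(log_lik)]               # uniform prior -> ML estimate
theta_map = thetas[np.argmax(log_lik + log_prior)]   # peak (mode) of the posterior
print(theta_ml, theta_map)    # the MAP estimate is pulled toward the prior mean
```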

Page 9: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 9

• Consider the case where only the mean, θ = μ, is unknown:

$$\sum_{k=1}^{n} \nabla_{\mu} \ln p(\mathbf{x}_k|\mu) = 0$$

which implies:

$$\nabla_{\mu} \ln p(\mathbf{x}_k|\mu) = \Sigma^{-1}(\mathbf{x}_k - \mu)$$

because:

$$\ln p(\mathbf{x}_k|\mu) = \ln\left[\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}_k - \mu)^t\Sigma^{-1}(\mathbf{x}_k - \mu)\right)\right] = -\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k - \mu)^t\Sigma^{-1}(\mathbf{x}_k - \mu)$$

and

$$\nabla_{\mu}\left[-\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k - \mu)^t\Sigma^{-1}(\mathbf{x}_k - \mu)\right] = \Sigma^{-1}(\mathbf{x}_k - \mu)$$

Gaussian Case: Unknown Mean

Page 10: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 10

• Substituting into the expression for the total likelihood:

$$\nabla_{\mu}\, l = \sum_{k=1}^{n} \Sigma^{-1}(\mathbf{x}_k - \hat{\mu}) = 0$$

• Rearranging terms:

$$\sum_{k=1}^{n}(\mathbf{x}_k - \hat{\mu}) = 0 \quad\Rightarrow\quad \sum_{k=1}^{n}\mathbf{x}_k = n\hat{\mu} \quad\Rightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$

• Significance???

Gaussian Case: Unknown Mean
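
The significance of the result above is that the ML estimate of the mean is simply the sample mean of the training data. A quick check in NumPy with made-up 2-D samples:

```python
import numpy as np

# Hypothetical 2-D training samples drawn from some Gaussian class.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [0.7, 2.4],
              [1.2, 2.1]])

mu_hat = X.mean(axis=0)   # mu_hat = (1/n) * sum_k x_k
print(mu_hat)
```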

Page 11: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 11

• Let θ = [μ, σ²]. The log likelihood of a SINGLE point is:

$$\ln p(x_k|\theta) = -\frac{1}{2}\ln(2\pi\theta_2) - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$

$$\nabla_\theta\, l = \nabla_\theta \ln p(x_k|\theta) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k - \theta_1) \\[8pt] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} \end{bmatrix}$$

• The full likelihood leads to:

$$\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k - \hat{\theta}_1) = 0, \qquad -\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k - \hat{\theta}_1)^2}{\hat{\theta}_2^2} = 0$$

Gaussian Case: Unknown Mean and Variance

Page 12: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 12

• This leads to these equations:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$$

• In the multivariate case:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k - \hat{\mu})(\mathbf{x}_k - \hat{\mu})^t$$

• The true covariance is the expected value of the matrix $(\mathbf{x}_k - \hat{\mu})(\mathbf{x}_k - \hat{\mu})^t$, which is a familiar result.

Gaussian Case: Unknown Mean and Variance
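
A small sketch of the multivariate estimates above with made-up data; note the 1/n normalization, which is what makes the ML covariance estimate biased, as shown on the following slides:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [0.7, 2.4],
              [1.2, 2.1]])               # hypothetical samples, one per row
n = X.shape[0]

mu_hat = X.mean(axis=0)
diff = X - mu_hat
Sigma_hat = (diff.T @ diff) / n          # ML estimate: divides by n
print(Sigma_hat)
print(np.cov(X, rowvar=False, bias=True))  # same result via NumPy
```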

Page 13: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 13

• Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let’s start with a few simple results we will need later.

• Expected value of the ML estimate of the mean:

• Expected value of the ML estimate of the mean:

$$E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \mu$$

• The variance of the ML estimate of the mean:

$$\mathrm{var}[\hat{\mu}] = E[\hat{\mu}^2] - (E[\hat{\mu}])^2 = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i\,\frac{1}{n}\sum_{j=1}^{n} x_j\right] - \mu^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E[x_i x_j] - \mu^2$$

Convergence of the Mean

Page 14: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 14

• The expected value of x_i x_j, E[x_i x_j], will be μ² for i ≠ j, since the two random variables are independent, and μ² + σ² otherwise.

• The expected value of x_i² will be μ² + σ².

• Hence, in the summation above, we have n² − n terms with expected value μ² and n terms with expected value μ² + σ².

• Thus:

$$\mathrm{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2 - n)\mu^2 + n(\mu^2 + \sigma^2)\right] - \mu^2 = \frac{\sigma^2}{n}$$

which implies:

$$E[\hat{\mu}^2] = \mathrm{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \frac{\sigma^2}{n} + \mu^2$$

• We see that the variance of the estimate goes to zero as n goes to infinity, and our estimate converges to the true mean (the error goes to zero).

Variance of the ML Estimate of the Mean
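
A quick Monte Carlo check of the results above (E[μ̂] = μ and var[μ̂] = σ²/n); the values of μ, σ², n, and the number of trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 3.0, 4.0, 25, 100000

# Draw many data sets of size n and compute the ML estimate of the mean for each.
mu_hats = rng.normal(mu, np.sqrt(sigma2), size=(trials, n)).mean(axis=1)

print(mu_hats.mean())   # ~ mu        (the estimate is unbiased)
print(mu_hats.var())    # ~ sigma2/n  (here ~ 4/25 = 0.16)
```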

Page 15: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 15

• Now we can combine these results. Recall our expression for the ML estimate of the variance:

$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2\right]$$

• We will need one more result:

$$E[(x_i - \mu)^2] = E[x_i^2] - 2E[x_i]\,\mu + \mu^2 = \sigma^2$$

Note that this implies:

$$E[x_i^2] = \sigma^2 + \mu^2$$

Variance Relationships

Page 16: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 16

• Expand the covariance and simplify:

$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2\right] = \frac{1}{n}\sum_{i=1}^{n}E\left[x_i^2 - 2\hat{\mu}x_i + \hat{\mu}^2\right] = \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[\hat{\mu}x_i] + E[\hat{\mu}^2]\right) = \frac{1}{n}\sum_{i=1}^{n}\left((\sigma^2 + \mu^2) - 2E[\hat{\mu}x_i] + \left(\mu^2 + \frac{\sigma^2}{n}\right)\right)$$

• One more intermediate term to derive:

$$E[\hat{\mu}x_i] = E\left[\frac{1}{n}\sum_{j=1}^{n}x_j\,x_i\right] = \frac{1}{n}\sum_{j=1}^{n}E[x_j x_i] = \frac{1}{n}\left((n-1)\mu^2 + (\mu^2 + \sigma^2)\right) = \mu^2 + \frac{\sigma^2}{n}$$

Covariance Expansion

Page 17: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 17

• Substitute our previously derived expression for the second term:

$$E[\hat{\sigma}^2] = \frac{1}{n}\sum_{i=1}^{n}\left((\sigma^2 + \mu^2) - 2\left(\mu^2 + \frac{\sigma^2}{n}\right) + \left(\mu^2 + \frac{\sigma^2}{n}\right)\right) = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma^2 - \frac{\sigma^2}{n}\right) = \frac{1}{n}\sum_{i=1}^{n}\sigma^2\left(1 - \frac{1}{n}\right) = \frac{n-1}{n}\sigma^2$$

Biased Variance Estimate

Page 18: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 18

• Therefore, the ML estimate is biased:

$$E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2\right] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$$

• An unbiased estimator is:

$$\mathbf{C} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \hat{\mu})(\mathbf{x}_i - \hat{\mu})^t$$

• These are related by:

$$\hat{\Sigma} = \frac{n-1}{n}\,\mathbf{C}$$

which is asymptotically unbiased. See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.

• However, the ML estimate converges (in the mean-squared-error sense).

Expectation Simplification
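
The bias result above is easy to verify numerically. In NumPy the two estimators correspond to ddof=0 (the ML estimate, which divides by n) and ddof=1 (the unbiased estimate, which divides by n − 1); the simulation settings are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 10, 200000

data = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

ml_est   = data.var(axis=1, ddof=0)   # (1/n)     * sum (x_i - mu_hat)^2
unbiased = data.var(axis=1, ddof=1)   # (1/(n-1)) * sum (x_i - mu_hat)^2

print(ml_est.mean())     # ~ (n-1)/n * sigma2 = 3.6 (biased low)
print(unbiased.mean())   # ~ sigma2 = 4.0
```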

Page 19: LECTURE  04:  MAXIMUM LIKELIHOOD ESTIMATION

ECE 8527: Lecture 04, Slide 19

Summary

• Discriminant functions for discrete features are completely analogous to the continuous case (end of Chapter 2).

• To develop an optimal classifier, we need reliable estimates of the statistics of the features.

• In Maximum Likelihood (ML) estimation, we treat the parameters as having unknown but fixed values.

• Justified many well-known results for estimating parameters (e.g., computing the mean by summing the observations).

• Biased and unbiased estimators.

• Convergence of the mean and variance estimates.

