ECE 8443 – Pattern Recognition / ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 04: MAXIMUM LIKELIHOOD ESTIMATION
• Objectives: Discrete Features, Maximum Likelihood, Bias in ML Estimates, Bayesian Estimation, Example
• Resources: D.H.S.: Chapter 3 (Part 1); D.H.S.: Chapter 3 (Part 2); J.O.S.: Tutorial; Nebula: Links; BGSU: Example; A.W.M.: Tutorial; A.W.M.: Links; S.P.: Primer; CSRN: Unbiased; A.W.M.: Bias; Wiki: ML; M.Y.: ML Tutorial; J.O.S.: Bayesian Est.; J.H.: Euro Coin
Discrete Features
• For problems where features are discrete, densities are replaced by probabilities:
$$p(\mathbf{x}|\omega_j)\,d\mathbf{x} \rightarrow P(\mathbf{x}|\omega_j)$$
• Bayes formula involves probabilities (not densities):
$$P(\omega_j|\mathbf{x}) = \frac{P(\mathbf{x}|\omega_j)P(\omega_j)}{P(\mathbf{x})}$$
where
$$P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x}|\omega_j)P(\omega_j)$$
• Bayes rule remains the same:
$$\alpha^* = \arg\min_i R(\alpha_i|\mathbf{x})$$
• The maximum entropy distribution is a uniform distribution:
$$P(\mathbf{x}=\mathbf{x}_i) = \frac{1}{N}$$
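To make the discrete form of Bayes formula concrete, here is a minimal Python sketch (the priors and the conditional probability table are made-up illustrative numbers, not values from the lecture):

```python
import numpy as np

# Hypothetical two-class problem with a single discrete feature x taking values 0..3.
priors = np.array([0.6, 0.4])                     # P(w1), P(w2)
cond = np.array([[0.1, 0.2, 0.3, 0.4],            # P(x | w1)
                 [0.4, 0.3, 0.2, 0.1]])           # P(x | w2)

x = 2                                             # observed feature value
joint = cond[:, x] * priors                       # P(x | wj) P(wj)
evidence = joint.sum()                            # P(x) = sum_j P(x | wj) P(wj)
posteriors = joint / evidence                     # P(wj | x) via Bayes formula

print(posteriors)           # [0.6923..., 0.3077...]
print(posteriors.argmax())  # index of the most probable class
```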
Discriminant Functions For Discrete Features
• Consider independent binary features: $\mathbf{x} = (x_1, \ldots, x_d)^t$, where each $x_i \in \{0,1\}$.
• Assuming conditional independence, and writing $p_i = P(x_i = 1|\omega_1)$ and $q_i = P(x_i = 1|\omega_2)$:
$$P(\mathbf{x}|\omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1-p_i)^{1-x_i}$$
$$P(\mathbf{x}|\omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1-q_i)^{1-x_i}$$
• The likelihood ratio is:
$$\frac{P(\mathbf{x}|\omega_1)}{P(\mathbf{x}|\omega_2)} = \prod_{i=1}^{d} \left(\frac{p_i}{q_i}\right)^{x_i}\left(\frac{1-p_i}{1-q_i}\right)^{1-x_i}$$
• The discriminant function is linear in the features:
$$g(\mathbf{x}) = \ln\frac{P(\mathbf{x}|\omega_1)}{P(\mathbf{x}|\omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)} = \sum_{i=1}^{d} w_i x_i + w_0$$
where
$$w_i = \ln\frac{p_i(1-q_i)}{q_i(1-p_i)} \qquad\text{and}\qquad w_0 = \sum_{i=1}^{d}\ln\frac{1-p_i}{1-q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
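As an illustration of the linear form of this discriminant, the following sketch (with hypothetical values for $p_i$, $q_i$, and the priors) evaluates $g(\mathbf{x})$ both directly from the likelihood ratio and from the weights $w_i$ and $w_0$; the two agree:

```python
import numpy as np

p = np.array([0.8, 0.6, 0.7])   # hypothetical P(x_i = 1 | w1)
q = np.array([0.3, 0.5, 0.2])   # hypothetical P(x_i = 1 | w2)
P1, P2 = 0.5, 0.5               # priors

x = np.array([1, 0, 1])         # a binary feature vector

# Discriminant from the likelihood ratio: g(x) = ln P(x|w1)/P(x|w2) + ln P(w1)/P(w2)
lik1 = np.prod(p**x * (1 - p)**(1 - x))
lik2 = np.prod(q**x * (1 - q)**(1 - x))
g_direct = np.log(lik1 / lik2) + np.log(P1 / P2)

# Equivalent linear form: g(x) = sum_i w_i x_i + w_0
w = np.log(p * (1 - q) / (q * (1 - p)))
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)
g_linear = w @ x + w0

print(g_direct, g_linear)   # identical values; decide w1 if g(x) > 0
```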
Introduction to Maximum Likelihood Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and the class-conditional densities, p(x|ωi).
• What can we do if we do not have this information?
• What limitations do we face?
• There are two common approaches to parameter estimation: maximum likelihood and Bayesian estimation.
• Maximum Likelihood: treat the parameters as quantities whose values are fixed but unknown.
• Bayes: treat the parameters as random variables having some known prior distribution. Observation of the samples converts this to a posterior density.
• Bayesian Learning: sharpen the a posteriori density, causing it to peak near the true value.
General Principle
• I.I.D.: $c$ data sets, $D_1, \ldots, D_c$, where $D_j$ is drawn independently according to $p(\mathbf{x}|\omega_j)$.
• Assume $p(\mathbf{x}|\omega_j)$ has a known parametric form and is completely determined by the parameter vector $\boldsymbol{\theta}_j$ (e.g., $p(\mathbf{x}|\omega_j) \sim N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$, where $\boldsymbol{\theta}_j = [\mu_1, \ldots, \mu_d, \sigma_{11}, \sigma_{12}, \ldots, \sigma_{dd}]$ collects the components of $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$).
• $p(\mathbf{x}|\omega_j)$ has an explicit dependence on $\boldsymbol{\theta}_j$: $p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$.
• Use the training samples to estimate $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_c$.
• Functional independence: assume $D_i$ gives no useful information about $\boldsymbol{\theta}_j$ for $i \neq j$.
• This simplifies notation to a single set $D$ of training samples $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ drawn independently from $p(\mathbf{x}|\boldsymbol{\theta})$, used to estimate $\boldsymbol{\theta}$.
• Because the samples were drawn independently:
$$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n}p(\mathbf{x}_k|\boldsymbol{\theta})$$
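A minimal sketch of this factorization, assuming a univariate Gaussian with known $\sigma$ and made-up sample values: the likelihood is the product of per-sample densities, which in practice is computed as a sum of log densities.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical i.i.d. training samples drawn from p(x|theta); sigma assumed known.
D = np.array([1.8, 2.3, 2.9, 2.1, 2.6])
sigma = 1.0

def likelihood(mu):
    # p(D|theta) = prod_k p(x_k|theta)
    return np.prod(norm.pdf(D, loc=mu, scale=sigma))

def log_likelihood(mu):
    # ln p(D|theta) = sum_k ln p(x_k|theta)  (numerically safer than the raw product)
    return np.sum(norm.logpdf(D, loc=mu, scale=sigma))

print(likelihood(2.0), np.exp(log_likelihood(2.0)))   # same value
```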
Example of ML Estimation
• $p(D|\boldsymbol{\theta})$ is called the likelihood of $\boldsymbol{\theta}$ with respect to the data.
• Given several training points (the accompanying figure has three panels):
  Top: candidate source distributions are shown. Which distribution is the ML estimate?
  Middle: an estimate of the likelihood of the data as a function of $\theta$ (the mean).
  Bottom: the log likelihood.
• The value of $\boldsymbol{\theta}$ that maximizes this likelihood, denoted $\hat{\boldsymbol{\theta}}$, is the maximum likelihood (ML) estimate of $\boldsymbol{\theta}$.
General Mathematics
• Let $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_p)^t$.
• Let $p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n}p(\mathbf{x}_k|\boldsymbol{\theta})$.
• Define the log likelihood and the ML estimate:
$$l(\boldsymbol{\theta}) \equiv \ln p(D|\boldsymbol{\theta}) = \sum_{k=1}^{n}\ln p(\mathbf{x}_k|\boldsymbol{\theta}) \qquad\qquad \hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta})$$
• The ML estimate is found by solving this equation:
$$\nabla_{\boldsymbol{\theta}}\, l = \sum_{k=1}^{n}\nabla_{\boldsymbol{\theta}}\ln p(\mathbf{x}_k|\boldsymbol{\theta}) = 0$$
• The solution to this equation can be a global maximum, a local maximum, or even an inflection point.
• Under what conditions is it a global maximum?
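A small numerical check of this definition (hypothetical data; a brute-force grid search stands in for solving $\nabla_{\boldsymbol{\theta}}\, l = 0$ analytically):

```python
import numpy as np

# Univariate Gaussian with known sigma and unknown mean theta (same made-up data as above).
D = np.array([1.8, 2.3, 2.9, 2.1, 2.6])
sigma = 1.0

def log_likelihood(theta):
    # l(theta) = sum_k ln p(x_k|theta) for a Gaussian density
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * ((D - theta) / sigma) ** 2)

# theta_hat = argmax_theta l(theta), approximated on a dense grid of candidates
grid = np.linspace(0.0, 5.0, 5001)
theta_hat = grid[np.argmax([log_likelihood(t) for t in grid])]

print(theta_hat, D.mean())   # both print 2.34: the grid argmax matches the sample mean
```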
Maximum A Posteriori Estimation
• A class of estimators – maximum a posteriori (MAP) – maximize
$$l(\boldsymbol{\theta}) + \ln p(\boldsymbol{\theta})$$
where $p(\boldsymbol{\theta})$ describes the prior probability of different parameter values.
• An ML estimator is a MAP estimator for uniform priors.
• A MAP estimator finds the peak, or mode, of a posterior density.
• MAP estimators are not transformation invariant (if we perform a nonlinear transformation of the input data, the estimator is no longer optimum in the new space). This observation will be useful later in the course.
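A sketch of the MAP idea under assumptions not stated on the slide (a univariate Gaussian likelihood for the mean and a Gaussian prior on the mean): maximize $l(\theta) + \ln p(\theta)$ over a grid, and note that a very broad, nearly uniform prior would recover the ML answer.

```python
import numpy as np
from scipy.stats import norm

D = np.array([1.8, 2.3, 2.9, 2.1, 2.6])
sigma = 1.0                      # known likelihood standard deviation
mu0, tau = 0.0, 0.5              # hypothetical Gaussian prior on the mean: N(mu0, tau^2)

grid = np.linspace(-1.0, 5.0, 6001)

def l(theta):                    # log likelihood l(theta)
    return np.sum(norm.logpdf(D, loc=theta, scale=sigma))

# MAP objective: l(theta) + ln p(theta)
log_post = np.array([l(t) + norm.logpdf(t, loc=mu0, scale=tau) for t in grid])
theta_map = grid[np.argmax(log_post)]

theta_ml = D.mean()
print(theta_map, theta_ml)       # MAP is pulled toward mu0; as tau grows it approaches the ML value
```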
Gaussian Case: Unknown Mean
• Consider the case where only the mean, $\boldsymbol{\theta} = \boldsymbol{\mu}$, is unknown. For a single point:
$$\ln p(\mathbf{x}_k|\boldsymbol{\mu}) = \ln\left[\frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right)\right] = -\frac{1}{2}\ln\left[(2\pi)^d|\boldsymbol{\Sigma}|\right] - \frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})$$
which implies:
$$\nabla_{\boldsymbol{\mu}}\ln p(\mathbf{x}_k|\boldsymbol{\mu}) = \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})$$
because:
$$\nabla_{\boldsymbol{\mu}}\left[-\frac{1}{2}\ln\left[(2\pi)^d|\boldsymbol{\Sigma}|\right] - \frac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right] = \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})$$
• The ML estimate must satisfy:
$$\sum_{k=1}^{n}\nabla_{\boldsymbol{\mu}}\ln p(\mathbf{x}_k|\boldsymbol{\mu}) = 0$$
Gaussian Case: Unknown Mean (continued)
• Substituting into the expression for the total likelihood:
$$\nabla_{\boldsymbol{\mu}}\, l = \sum_{k=1}^{n}\nabla_{\boldsymbol{\mu}}\ln p(\mathbf{x}_k|\boldsymbol{\mu}) = \sum_{k=1}^{n}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}) = 0$$
• Rearranging terms:
$$\sum_{k=1}^{n}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}\mathbf{x}_k - n\hat{\boldsymbol{\mu}} = 0 \;\Rightarrow\; \hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$
• Significance??? The ML estimate of the mean is simply the sample mean of the training data.
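A quick numeric sanity check of this result with synthetic 2-D data and an assumed known covariance: the gradient $\sum_k \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k - \hat{\boldsymbol{\mu}})$ vanishes exactly at the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                                # assumed known covariance
X = rng.multivariate_normal([1.0, -2.0], Sigma, size=100)     # training samples x_k

mu_hat = X.mean(axis=0)                        # ML estimate: the sample mean

Sigma_inv = np.linalg.inv(Sigma)
grad = Sigma_inv @ (X - mu_hat).sum(axis=0)    # sum_k Sigma^{-1}(x_k - mu_hat)
print(mu_hat)
print(grad)                                    # ~[0, 0]: the likelihood gradient vanishes at mu_hat
```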
Gaussian Case: Unknown Mean and Variance
• Let $\boldsymbol{\theta} = [\mu, \sigma^2]^t = [\theta_1, \theta_2]^t$. The log likelihood of a SINGLE point is:
$$\ln p(x_k|\boldsymbol{\theta}) = -\frac{1}{2}\ln(2\pi\theta_2) - \frac{1}{2\theta_2}(x_k-\theta_1)^2$$
and its gradient is:
$$\nabla_{\boldsymbol{\theta}}\, l = \nabla_{\boldsymbol{\theta}}\ln p(x_k|\boldsymbol{\theta}) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k-\theta_1) \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k-\theta_1)^2}{2\theta_2^2} \end{bmatrix}$$
• The full likelihood leads to:
$$\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k-\hat{\theta}_1) = 0 \qquad\text{and}\qquad -\sum_{k=1}^{n}\frac{1}{2\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\theta}_1)^2}{2\hat{\theta}_2^2} = 0$$
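A quick check with made-up data that the closed-form ML estimates (the sample mean and the $1/n$ sample variance, derived next) satisfy both of these likelihood equations:

```python
import numpy as np

x = np.array([2.1, 3.4, 1.9, 2.8, 3.0, 2.4])    # hypothetical training samples

theta1 = x.mean()                                # mu_hat
theta2 = ((x - theta1) ** 2).mean()              # sigma_hat^2 (ML, 1/n normalization)

# Both likelihood equations evaluate to ~0 at (theta1, theta2):
eq1 = np.sum((x - theta1) / theta2)
eq2 = np.sum(-1.0 / (2 * theta2) + (x - theta1) ** 2 / (2 * theta2 ** 2))
print(eq1, eq2)   # both ~0 up to floating-point error
```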
Gaussian Case: Unknown Mean and Variance (continued)
• This leads to these equations:
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k \qquad\qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2$$
• In the multivariate case:
$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k \qquad\qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$$
• The true covariance is the expected value of the matrix $(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$, which is a familiar result.
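A short sketch of these multivariate ML estimates on synthetic data; note the $1/n$ normalization, which is what numpy's bias=True option to np.cov computes:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 3.0],
                            cov=[[1.0, 0.3], [0.3, 2.0]], size=500)   # rows are samples x_k

n = X.shape[0]
mu_hat = X.mean(axis=0)                        # (1/n) sum_k x_k
diff = X - mu_hat
Sigma_hat = (diff.T @ diff) / n                # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t

# Same estimate via numpy (bias=True selects the 1/n ML normalization)
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))     # True
print(mu_hat, Sigma_hat, sep="\n")
```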
Convergence of the Mean
• Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let's start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
$$E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i] = \mu$$
• The variance of the ML estimate of the mean:
$$\mathrm{var}[\hat{\mu}] = E[\hat{\mu}^2] - (E[\hat{\mu}])^2 = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i\cdot\frac{1}{n}\sum_{j=1}^{n}x_j\right] - \mu^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E[x_i x_j] - \mu^2$$
Variance of the ML Estimate of the Mean
• The expected value of $x_i x_j$, $E[x_i x_j]$, will be $\mu^2$ for $i \neq j$, since the two random variables are independent, and $\mu^2 + \sigma^2$ otherwise.
• The expected value of $x_i^2$ will be $\mu^2 + \sigma^2$.
• Hence, in the summation above, we have $n^2 - n$ terms with expected value $\mu^2$ and $n$ terms with expected value $\mu^2 + \sigma^2$.
• Thus,
$$\mathrm{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2-n)\mu^2 + n(\mu^2+\sigma^2)\right] - \mu^2 = \frac{\sigma^2}{n}$$
which implies:
$$E[\hat{\mu}^2] = \mathrm{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \frac{\sigma^2}{n} + \mu^2$$
• We see that the variance of the estimate goes to zero as $n$ goes to infinity, and our estimate converges to the true mean (the error goes to zero).
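A Monte Carlo sketch of the $\sigma^2/n$ result (the parameter values and trial counts are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 0.0, 2.0
trials = 10000

for n in (10, 100, 1000):
    # Draw many independent data sets of size n and compute the sample mean of each.
    means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    print(n, means.var(), sigma**2 / n)   # empirical var[mu_hat] vs. theoretical sigma^2/n
```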
Variance Relationships
• Now we can combine these results. Recall our expression for the ML estimate of the variance:
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2 \qquad\Rightarrow\qquad E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right]$$
• Expanding a single squared term:
$$E[(x_i-\hat{\mu})^2] = E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]$$
• We will need one more result:
$$E[x_i^2] = \mathrm{var}[x_i] + (E[x_i])^2 = \sigma^2 + \mu^2$$
Note that this implies:
$$\frac{1}{n}\sum_{i=1}^{n}E[x_i^2] = \sigma^2 + \mu^2$$
Covariance Expansion
• Substituting the expansion into the expectation of the ML variance estimate:
$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right) = \frac{1}{n}\sum_{i=1}^{n}\left((\sigma^2+\mu^2) - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right)$$
• One more intermediate term to derive: $E[x_i\hat{\mu}]$.
• Expand the covariance and simplify:
$$E[x_i\hat{\mu}] = E\left[x_i\,\frac{1}{n}\sum_{j=1}^{n}x_j\right] = \frac{1}{n}\sum_{j=1}^{n}E[x_i x_j] = \frac{1}{n}\left(\sum_{j\neq i}E[x_i x_j] + E[x_i^2]\right) = \frac{1}{n}\left((n-1)\mu^2 + (\mu^2+\sigma^2)\right) = \mu^2 + \frac{\sigma^2}{n}$$
Biased Variance Estimate
• Substitute our previously derived expressions for the second and third terms:
$$E[\hat{\sigma}^2] = \frac{1}{n}\sum_{i=1}^{n}\left((\sigma^2+\mu^2) - 2\left(\mu^2+\frac{\sigma^2}{n}\right) + \left(\mu^2+\frac{\sigma^2}{n}\right)\right) = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma^2 - \frac{\sigma^2}{n}\right) = \frac{1}{n}\sum_{i=1}^{n}\sigma^2\left(1-\frac{1}{n}\right) = \left(\frac{n-1}{n}\right)\sigma^2$$
Expectation Simplification
• Therefore, the ML estimate of the variance is biased:
$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$$
However, the ML estimate converges (its mean squared error goes to zero as $n \rightarrow \infty$).
• An unbiased estimator is:
$$\mathbf{C} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i-\hat{\boldsymbol{\mu}})(\mathbf{x}_i-\hat{\boldsymbol{\mu}})^t$$
• These are related by:
$$\hat{\boldsymbol{\Sigma}} = \frac{n-1}{n}\mathbf{C}$$
and the ML estimate $\hat{\boldsymbol{\Sigma}}$ is asymptotically unbiased. See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.
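A Monte Carlo sketch of the bias (arbitrary parameter choices): averaged over many data sets, the $1/n$ ML variance estimate comes out near $((n-1)/n)\sigma^2$, while the $1/(n-1)$ estimator comes out near $\sigma^2$. In numpy, ddof=0 gives the ML normalization and ddof=1 the unbiased one.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, trials = 0.0, 2.0, 5, 200000

data = rng.normal(mu, sigma, size=(trials, n))
var_ml = data.var(axis=1, ddof=0)        # (1/n)     sum (x_i - mu_hat)^2
var_unbiased = data.var(axis=1, ddof=1)  # (1/(n-1)) sum (x_i - mu_hat)^2

print(var_ml.mean(), (n - 1) / n * sigma**2)   # ~3.2 = ((n-1)/n) sigma^2  -> biased
print(var_unbiased.mean(), sigma**2)           # ~4.0 = sigma^2            -> unbiased
```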
Summary
• Discriminant functions for discrete features are completely analogous to the continuous case (end of Chapter 2).
• To develop an optimal classifier, we need reliable estimates of the statistics of the features.
• In Maximum Likelihood (ML) estimation, we treat the parameters as having unknown but fixed values.
• We justified many well-known results for estimating parameters (e.g., computing the mean by summing the observations and dividing by n).
• Biased and unbiased estimators.
• Convergence of the mean and variance estimates.