Plan for Today
Note: Room change for 2/16: E&T A129
Solution to Midterm
Solution to Homework 3
Parametric methods
  Data comes from a distribution
  Bernoulli, Gaussian, and their parameters
  How good is a parameter estimate? (bias, variance)
Bayes estimation
  ML: use the data
  MAP: use the prior and the data
  Bayes estimator: integrated estimate (weighted)
Parametric classification
  Maximize the posterior probability
Review from Lecture 5
Probability Axioms
Bayesian Learning
Classification
Bayes's Rule
Bayesian Networks
Naïve Bayes Classifier
Association Rules
Parametric Methods
Chapter 4
Parametric Learning
Assume: data x comes from a distribution p(x)
Model this distribution by selecting parameters θ, e.g., N(μ, σ²) where θ = {μ, σ²}
Maximum Likelihood Model
(Last time: Max Likelihood Classification)
Likelihood of θ given the sample X:
  l(θ|X) = p(X|θ) = ∏t p(xt|θ)
Log likelihood:
  L(θ|X) = log l(θ|X) = ∑t log p(xt|θ)
Maximum likelihood estimator (MLE):
  θ* = argmaxθ L(θ|X)
[Alpaydin 2004 © The MIT Press]
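A minimal sketch (not from the slides) of the argmax idea above: evaluate the log likelihood L(θ|X) on a grid of candidate parameter values and pick the maximizer, here for the mean of a Gaussian with known variance. The sample and the grid are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=50)   # made-up sample from N(2, 1)

def log_likelihood(mu, X, sigma=1.0):
    # L(mu|X) = sum_t log p(x_t | mu) for a Gaussian with known sigma
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (X - mu) ** 2 / (2 * sigma ** 2))

# theta* = argmax_theta L(theta|X): evaluate L over a grid and take the max
candidates = np.linspace(0.0, 4.0, 401)
mu_mle = max(candidates, key=lambda mu: log_likelihood(mu, X))

print("grid-search MLE:", mu_mle)
print("closed-form MLE (sample mean):", X.mean())
```

The two printed values agree (up to the grid spacing), since for a Gaussian the log likelihood is maximized at the sample mean.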
Example: do you wear glasses?
Bernoulli: two states, x ∈ {0, 1}
  P(x) = p0^x (1 − p0)^(1−x)
  L(p0|X) = log ∏t p0^(xt) (1 − p0)^(1−xt)
MLE: p0 = ∑t xt / N
[Alpaydin 2004 © The MIT Press]
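A quick sketch of the Bernoulli MLE with made-up 0/1 "wears glasses" responses; the estimate is just the closed-form sample proportion from the slide.

```python
import numpy as np

# Binary "wears glasses" responses: 1 = yes, 0 = no (made-up sample for illustration)
X = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

# For a Bernoulli, the MLE is the sample proportion of 1s: p0 = sum_t x_t / N
p0_mle = X.sum() / len(X)
print("MLE estimate of p0:", p0_mle)   # 0.7 for this sample
```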
Gaussian (Normal) Distribution
p(x) = N(μ, σ²):
  p(x) = (1 / (√(2π) σ)) exp[ −(x − μ)² / (2σ²) ]
MLE for μ and σ²:
  m = ∑t xt / N
  s² = ∑t (xt − m)² / N
[Alpaydin 2004 © The MIT Press]
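A small sketch of the Gaussian MLE formulas above on simulated data; note the variance estimate divides by N, not N − 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=1000)   # simulated sample from N(5, 4)

# MLE estimates: sample mean and the biased variance (divide by N, not N - 1)
m = X.sum() / len(X)
s2 = ((X - m) ** 2).sum() / len(X)

print("m  =", m)    # close to 5
print("s2 =", s2)   # close to 4
```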
How good is that estimate?
Let d be the estimate of θ
  It is also a random variable
  Technically, it's d(X) since it depends on the sample X
  E[d] is its expected value over samples (it does not depend on any particular X)
Bias: bθ(d) = E[d] − θ
  How far off the correct value is it?
Variance: Var(d) = E[(d − E[d])²]
  How much does it change with different X?
Mean square error: r(d, θ) = E[(d − θ)²]
  = (E[d] − θ)² + E[(d − E[d])²]
  = Bias² + Variance
[Alpaydin 2004 © The MIT Press]
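A Monte Carlo check (not from the slides) that the decomposition r(d, θ) = Bias² + Variance holds, using the biased variance estimator s² as the estimator d; all numbers are simulated.

```python
import numpy as np

# Estimator d(X) = biased sample variance s^2 with N = 10 points per sample,
# true parameter theta = sigma^2 = 4. Repeat over many samples X.
rng = np.random.default_rng(2)
N, sigma2, trials = 10, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
d = samples.var(axis=1)            # s^2 with the 1/N convention (the MLE)

bias = d.mean() - sigma2           # E[d] - theta; about -0.4 since E[s^2] = (N-1)/N * sigma^2
variance = d.var()                 # E[(d - E[d])^2]
mse = ((d - sigma2) ** 2).mean()   # E[(d - theta)^2]

print("bias^2 + variance =", bias ** 2 + variance)
print("MSE               =", mse)  # agrees with the line above, up to Monte Carlo noise
```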
Bayes Estimator: Using what we already know
Prior information: p(θ)
Bayes's rule (get posterior): p(θ|X) = p(X|θ) p(θ) / p(X)
Maximum a Posteriori (MAP): θMAP = argmaxθ p(θ|X)
Maximum Likelihood (ML): θML = argmaxθ p(X|θ)
Bayes estimator: θBayes = E[θ|X] = ∫ θ p(θ|X) dθ
  In a sense, it is the "weighted mean" for θ
  For our purposes, θMAP = θBayes
[Alpaydin 2004 © The MIT Press]
Bayes Estimator: Continuous Example
Assume xt ~ N(θ, σ₀²) and θ ~ N(μ, σ²)
θML = m (sample mean)
θMAP = θBayes = E[θ|X]
  = [ (N/σ₀²) / (N/σ₀² + 1/σ²) ] m  +  [ (1/σ²) / (N/σ₀² + 1/σ²) ] μ
Estimated mean = weighted average of sample mean m and prior mean μ
Weights indicate how much you trust the sample
[Alpaydin 2004 © The MIT Press]
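A short sketch of the weighted-average estimate above, with made-up values for σ₀², μ, σ² and simulated data; the variable names are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma0_sq = 4.0          # known variance of each observation: x_t ~ N(theta, sigma0^2)
mu, sigma_sq = 0.0, 1.0  # prior on the unknown mean: theta ~ N(mu, sigma^2)

X = rng.normal(3.0, np.sqrt(sigma0_sq), size=20)   # data actually generated with theta = 3
N, m = len(X), X.mean()

# Weight on the sample mean: (N/sigma0^2) / (N/sigma0^2 + 1/sigma^2)
w_data = (N / sigma0_sq) / (N / sigma0_sq + 1.0 / sigma_sq)
theta_bayes = w_data * m + (1.0 - w_data) * mu

print("sample mean m (theta_ML):", m)
print("posterior mean (theta_MAP = theta_Bayes):", theta_bayes)  # pulled toward the prior mean mu
```

With more data (larger N) the weight on m approaches 1 and the prior matters less, which is exactly the "how much you trust the sample" intuition.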
Example: Coin flipping
[Figure sequence: the estimate of the coin's heads probability after 0, 1, 5, 10, 20, 50, and 100 total flips. With 0 flips the value is completely unknown ("?"); the estimate sharpens as flips accumulate.]
[Copyright Terran Lane]
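One way to reproduce the behavior in these figures (an assumption here, since the slides do not state the prior) is a Beta-Bernoulli update with a uniform Beta(1, 1) prior; the coin bias and the flip outcomes below are simulated.

```python
import numpy as np

rng = np.random.default_rng(4)
true_p = 0.7
flips = rng.random(100) < true_p            # 100 simulated flips of a biased coin

a, b = 1.0, 1.0                             # Beta(1, 1): completely uncertain, like the "?" slide
for n in [0, 1, 5, 10, 20, 50, 100]:
    heads = int(flips[:n].sum())
    a_n, b_n = a + heads, b + (n - heads)   # posterior after n flips: Beta(1 + heads, 1 + tails)
    post_mean = a_n / (a_n + b_n)
    post_sd = np.sqrt(a_n * b_n / ((a_n + b_n) ** 2 * (a_n + b_n + 1)))
    print(f"{n:3d} flips: posterior mean = {post_mean:.2f}, sd = {post_sd:.2f}")
```

The posterior standard deviation shrinks as flips accumulate, mirroring the sequence of figures.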
Parametric Classification
Remember Naïve Bayes?
Maximum likelihood estimator (MLE):
  Cpredict = argmaxc P(X1 = u1, …, Xm = um | C = c)
Maximum a-posteriori (MAP) classifier:
  Cpredict = argmaxc P(C = c | X1 = u1, …, Xm = um)
[Copyright Andrew Moore]
Parametric Classification
Discriminant (take the max over i):
  gi(x) = p(x|Ci) P(Ci)
  or equivalently
  gi(x) = log p(x|Ci) + log P(Ci)
If we assume p(x|Ci) are Gaussian:
  p(x|Ci) = (1 / (√(2π) σi)) exp[ −(x − μi)² / (2σi²) ]
  gi(x) = −(1/2) log 2π − log σi − (x − μi)² / (2σi²) + log P(Ci)
[Alpaydin 2004 © The MIT Press]
Given the 1-D labeled sample X = { xt, rt }, t = 1, …, N
Class indicators: ri^t = 1 if xt ∈ Ci, 0 if xt ∈ Cj (j ≠ i)
ML estimates of the priors are the observed frequencies:
  P̂(Ci) = ( ∑t ri^t ) / N
Observed mean and observed variance of each class:
  mi = ( ∑t ri^t xt ) / ( ∑t ri^t )
  si² = ( ∑t ri^t (xt − mi)² ) / ( ∑t ri^t )
Discriminant becomes:
  gi(x) = −(1/2) log 2π − log si − (x − mi)² / (2si²) + log P̂(Ci)
[Alpaydin 2004 © The MIT Press]
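A minimal sketch of the whole procedure on made-up 1-D data: estimate each class's prior, mean, and variance by ML, then classify a new point by the largest log discriminant gi(x).

```python
import numpy as np

# Labeled 1-D sample: feature values x and class labels y (made-up data)
x = np.array([1.1, 0.8, 1.4, 0.9, 3.2, 3.6, 2.9, 3.4, 3.1])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1,   1  ])

classes = np.unique(y)
priors = {c: np.mean(y == c)   for c in classes}   # P_hat(C_i) = observed frequency
means  = {c: x[y == c].mean()  for c in classes}   # m_i = observed class mean
vars_  = {c: x[y == c].var()   for c in classes}   # s_i^2 = observed class variance (1/N)

def g(x_new, c):
    # g_i(x) = -1/2 log 2*pi - log s_i - (x - m_i)^2 / (2 s_i^2) + log P_hat(C_i)
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(vars_[c])
            - (x_new - means[c]) ** 2 / (2 * vars_[c]) + np.log(priors[c]))

x_new = 2.0
prediction = max(classes, key=lambda c: g(x_new, c))
print("discriminants:", {int(c): float(g(x_new, c)) for c in classes})
print("predicted class:", int(prediction))
```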
Example: 2 classes (same variance)
[Alpaydin 2004 © The MIT Press]
Posterior = discriminant g(x) normalized by P(x)
Likelihood = Gaussian with mean, std dev
Assume both classes have the same prior
Example: 2 classes (diff variance)
[Alpaydin 2004 © The MIT Press]
Posterior = discriminant g(x) normalized by P(x)
Likelihood = Gaussian with mean, std dev
Assume both classes have the same prior
Example: Predicting student’s major
From HW 3:
  Use "glasses" feature: yes or no
  Discrete distribution
Likelihood (of data): P(glasses | major)
Example: Predicting student’s major
Posterior (of class): P(major | glasses)
Priors: P(CS) = 0.4, P(Physics) = 0.3, P(EE) = 0.3
Probability of data: P(yes) = 0.6, P(no) = 0.4
Bayes: P(major|glasses) = P(glasses|major) P(major) / P(glasses)
The MAP prediction is the major with the largest posterior, computed separately for glasses = yes and glasses = no.
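The HW 3 likelihood table is not reproduced on this slide, so the values of P(glasses = yes | major) below are hypothetical placeholders (chosen only to stay consistent with P(yes) = 0.6); the priors and Bayes's rule are as given above.

```python
# Priors and P(glasses) are from the slide; the likelihoods are HYPOTHETICAL.
priors = {"CS": 0.4, "Physics": 0.3, "EE": 0.3}
p_yes_given_major = {"CS": 0.9, "Physics": 0.6, "EE": 0.2}   # hypothetical placeholders

for glasses in ("yes", "no"):
    # P(glasses | major): the yes-probabilities or their complements
    lik = {m: (p if glasses == "yes" else 1 - p) for m, p in p_yes_given_major.items()}
    evidence = sum(lik[m] * priors[m] for m in priors)        # P(glasses): 0.6 / 0.4 here
    posterior = {m: lik[m] * priors[m] / evidence for m in priors}
    map_major = max(posterior, key=posterior.get)
    print(glasses, {m: round(p, 3) for m, p in posterior.items()}, "-> MAP:", map_major)
```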
Summary: Key Points for Today
Parametric methods
  Data comes from a distribution
  Bernoulli, Gaussian, and their parameters
  How good is a parameter estimate? (bias, variance)
Bayes estimation
  ML: use the data
  MAP: use the prior and the data
  Bayes estimator: integrated estimate (weighted)
Parametric classification
  Maximize the posterior probability
Next Time
Clustering! (read Ch. 7.1-7.4; 7.8 optional)
No reading questions