Plan for Today
Note: Room change for 2/16: E&T A129
Solution to Midterm
Solution to Homework 3
Parametric methods
  Data comes from a distribution
  Bernoulli, Gaussian, and their parameters
  How good is a parameter estimate? (bias, variance)
Bayes estimation
  ML: use the data
  MAP: use the prior and the data
  Bayes estimator: integrated estimate (weighted)
Parametric classification
  Maximize the posterior probability
Review from Lecture 5
Probability Axioms
Bayesian Learning
Classification
Bayes's Rule
Bayesian Networks
Naïve Bayes Classifier
Association Rules
Parametric Methods
Chapter 4
Parametric Learning
Assume: data x comes from a distribution p(x)
Model this distribution by selecting parameters θ, e.g., N(μ, σ²) where θ = {μ, σ²}
Maximum Likelihood Model
(Last time: Max Likelihood Classification)
Likelihood of θ given the sample X:
  l(θ|X) = p(X|θ) = ∏t p(xt|θ)
Log likelihood:
  L(θ|X) = log l(θ|X) = ∑t log p(xt|θ)
Maximum likelihood estimator (MLE):
  θ* = argmaxθ L(θ|X)
[Alpaydin 2004 © The MIT Press]
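A minimal sketch (not from the slides) of the argmax idea above: evaluate the log likelihood L(θ|X) on a grid of candidate parameter values and pick the maximizer, here for the mean of a Gaussian with known variance. The sample and the grid are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=50)   # made-up sample from N(2, 1)

def log_likelihood(mu, X, sigma=1.0):
    # L(mu|X) = sum_t log p(x_t | mu) for a Gaussian with known sigma
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (X - mu) ** 2 / (2 * sigma ** 2))

# theta* = argmax_theta L(theta|X): evaluate L over a grid and take the max
candidates = np.linspace(0.0, 4.0, 401)
mu_mle = max(candidates, key=lambda mu: log_likelihood(mu, X))

print("grid-search MLE:", mu_mle)
print("closed-form MLE (sample mean):", X.mean())
```

The two printed values agree (up to the grid spacing), since for a Gaussian the log likelihood is maximized at the sample mean.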
Example: do you wear glasses?
Bernoulli: two states, x ∈ {0, 1}
  P(x) = p0^x (1 − p0)^(1−x)
  L(p0|X) = log ∏t p0^(xt) (1 − p0)^(1−xt)
MLE: p0 = ∑t xt / N
[Alpaydin 2004 © The MIT Press]
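A quick sketch of the Bernoulli MLE with made-up 0/1 "wears glasses" responses; the estimate is just the closed-form sample proportion from the slide.

```python
import numpy as np

# Binary "wears glasses" responses: 1 = yes, 0 = no (made-up sample for illustration)
X = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

# For a Bernoulli, the MLE is the sample proportion of 1s: p0 = sum_t x_t / N
p0_mle = X.sum() / len(X)
print("MLE estimate of p0:", p0_mle)   # 0.7 for this sample
```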
Gaussian (Normal) Distribution
p(x) = N(μ, σ²):
  p(x) = (1 / (√(2π) σ)) exp[ −(x − μ)² / (2σ²) ]
MLE for μ and σ²:
  m = ∑t xt / N
  s² = ∑t (xt − m)² / N
[Alpaydin 2004 © The MIT Press]
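A small sketch of the Gaussian MLE formulas above on simulated data; note the variance estimate divides by N, not N − 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=1000)   # simulated sample from N(5, 4)

# MLE estimates: sample mean and the biased variance (divide by N, not N - 1)
m = X.sum() / len(X)
s2 = ((X - m) ** 2).sum() / len(X)

print("m  =", m)    # close to 5
print("s2 =", s2)   # close to 4
```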
How good is that estimate?
Let d be the estimate of θ
  It is also a random variable
  Technically, it's d(X) since it depends on the sample X
  E[d] is its expected value over samples (it does not depend on any particular X)
Bias: bθ(d) = E[d] − θ
  How far off the correct value is it?
Variance: Var(d) = E[(d − E[d])²]
  How much does it change with different X?
Mean square error: r(d, θ) = E[(d − θ)²]
  = (E[d] − θ)² + E[(d − E[d])²]
  = Bias² + Variance
[Alpaydin 2004 © The MIT Press]
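A Monte Carlo check (not from the slides) that the decomposition r(d, θ) = Bias² + Variance holds, using the biased variance estimator s² as the estimator d; all numbers are simulated.

```python
import numpy as np

# Estimator d(X) = biased sample variance s^2 with N = 10 points per sample,
# true parameter theta = sigma^2 = 4. Repeat over many samples X.
rng = np.random.default_rng(2)
N, sigma2, trials = 10, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
d = samples.var(axis=1)            # s^2 with the 1/N convention (the MLE)

bias = d.mean() - sigma2           # E[d] - theta; about -0.4 since E[s^2] = (N-1)/N * sigma^2
variance = d.var()                 # E[(d - E[d])^2]
mse = ((d - sigma2) ** 2).mean()   # E[(d - theta)^2]

print("bias^2 + variance =", bias ** 2 + variance)
print("MSE               =", mse)  # agrees with the line above, up to Monte Carlo noise
```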
Bayes Estimator: Using what we already know
Prior information: p(θ)
Bayes's rule (get posterior): p(θ|X) = p(X|θ) p(θ) / p(X)
Maximum a Posteriori (MAP): θMAP = argmaxθ p(θ|X)
Maximum Likelihood (ML): θML = argmaxθ p(X|θ)
Bayes estimator: θBayes = E[θ|X] = ∫ θ p(θ|X) dθ
  In a sense, it is the "weighted mean" for θ
  For our purposes, θMAP = θBayes
[Alpaydin 2004 © The MIT Press]
Bayes Estimator: Continuous Example
Assume xt ~ N(θ, σ₀²) and θ ~ N(μ, σ²)
θML = m (sample mean)
θMAP = θBayes = E[θ|X]
  = [ (N/σ₀²) / (N/σ₀² + 1/σ²) ] m  +  [ (1/σ²) / (N/σ₀² + 1/σ²) ] μ
Estimated mean = weighted average of sample mean m and prior mean μ
Weights indicate how much you trust the sample
[Alpaydin 2004 © The MIT Press]
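A short sketch of the weighted-average estimate above, with made-up values for σ₀², μ, σ² and simulated data; the variable names are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma0_sq = 4.0          # known variance of each observation: x_t ~ N(theta, sigma0^2)
mu, sigma_sq = 0.0, 1.0  # prior on the unknown mean: theta ~ N(mu, sigma^2)

X = rng.normal(3.0, np.sqrt(sigma0_sq), size=20)   # data actually generated with theta = 3
N, m = len(X), X.mean()

# Weight on the sample mean: (N/sigma0^2) / (N/sigma0^2 + 1/sigma^2)
w_data = (N / sigma0_sq) / (N / sigma0_sq + 1.0 / sigma_sq)
theta_bayes = w_data * m + (1.0 - w_data) * mu

print("sample mean m (theta_ML):", m)
print("posterior mean (theta_MAP = theta_Bayes):", theta_bayes)  # pulled toward the prior mean mu
```

With more data (larger N) the weight on m approaches 1 and the prior matters less, which is exactly the "how much you trust the sample" intuition.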
Example: Coin flipping
[Figure sequence: the estimate of the coin's heads probability after 0, 1, 5, 10, 20, 50, and 100 total flips. With 0 flips the value is completely unknown ("?"); the estimate sharpens as flips accumulate.]
[Copyright Terran Lane]
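One way to reproduce the behavior in these figures (an assumption here, since the slides do not state the prior) is a Beta-Bernoulli update with a uniform Beta(1, 1) prior; the coin bias and the flip outcomes below are simulated.

```python
import numpy as np

rng = np.random.default_rng(4)
true_p = 0.7
flips = rng.random(100) < true_p            # 100 simulated flips of a biased coin

a, b = 1.0, 1.0                             # Beta(1, 1): completely uncertain, like the "?" slide
for n in [0, 1, 5, 10, 20, 50, 100]:
    heads = int(flips[:n].sum())
    a_n, b_n = a + heads, b + (n - heads)   # posterior after n flips: Beta(1 + heads, 1 + tails)
    post_mean = a_n / (a_n + b_n)
    post_sd = np.sqrt(a_n * b_n / ((a_n + b_n) ** 2 * (a_n + b_n + 1)))
    print(f"{n:3d} flips: posterior mean = {post_mean:.2f}, sd = {post_sd:.2f}")
```

The posterior standard deviation shrinks as flips accumulate, mirroring the sequence of figures.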
Parametric Classification
Remember Naïve Bayes?
Maximum likelihood estimator (MLE):
  Cpredict = argmaxc P(X1 = u1, …, Xm = um | C = c)
Maximum a-posteriori (MAP) classifier:
  Cpredict = argmaxc P(C = c | X1 = u1, …, Xm = um)
[Copyright Andrew Moore]
Parametric Classification
Discriminant (take the max over i):
  gi(x) = p(x|Ci) P(Ci)
  or equivalently
  gi(x) = log p(x|Ci) + log P(Ci)
If we assume p(x|Ci) are Gaussian:
  p(x|Ci) = (1 / (√(2π) σi)) exp[ −(x − μi)² / (2σi²) ]
  gi(x) = −(1/2) log 2π − log σi − (x − μi)² / (2σi²) + log P(Ci)
[Alpaydin 2004 © The MIT Press]
Given the 1-D labeled sample X = { xt, rt }, t = 1, …, N
Class indicators: ri^t = 1 if xt ∈ Ci, 0 if xt ∈ Cj (j ≠ i)
ML estimates of the priors are the observed frequencies:
  P̂(Ci) = ( ∑t ri^t ) / N
Observed mean and observed variance of each class:
  mi = ( ∑t ri^t xt ) / ( ∑t ri^t )
  si² = ( ∑t ri^t (xt − mi)² ) / ( ∑t ri^t )
Discriminant becomes:
  gi(x) = −(1/2) log 2π − log si − (x − mi)² / (2si²) + log P̂(Ci)
[Alpaydin 2004 © The MIT Press]
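A minimal sketch of the whole procedure on made-up 1-D data: estimate each class's prior, mean, and variance by ML, then classify a new point by the largest log discriminant gi(x).

```python
import numpy as np

# Labeled 1-D sample: feature values x and class labels y (made-up data)
x = np.array([1.1, 0.8, 1.4, 0.9, 3.2, 3.6, 2.9, 3.4, 3.1])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1,   1  ])

classes = np.unique(y)
priors = {c: np.mean(y == c)   for c in classes}   # P_hat(C_i) = observed frequency
means  = {c: x[y == c].mean()  for c in classes}   # m_i = observed class mean
vars_  = {c: x[y == c].var()   for c in classes}   # s_i^2 = observed class variance (1/N)

def g(x_new, c):
    # g_i(x) = -1/2 log 2*pi - log s_i - (x - m_i)^2 / (2 s_i^2) + log P_hat(C_i)
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(vars_[c])
            - (x_new - means[c]) ** 2 / (2 * vars_[c]) + np.log(priors[c]))

x_new = 2.0
prediction = max(classes, key=lambda c: g(x_new, c))
print("discriminants:", {int(c): float(g(x_new, c)) for c in classes})
print("predicted class:", int(prediction))
```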
Example: 2 classes (same variance)
[Alpaydin 2004 © The MIT Press]
Posterior = discriminant g(x) normalized by P(x)
Likelihood = Gaussian with mean, std dev
Assume both classes have the same prior
Example: 2 classes (diff variance)
[Alpaydin 2004 © The MIT Press]
Posterior = discriminant g(x) normalized by P(x)
Likelihood = Gaussian with mean, std dev
Assume both classes have the same prior
Example: Predicting student’s major
From HW 3:
  Use "glasses" feature: yes or no
  Discrete distribution
Likelihood (of data): P(glasses | major)
Example: Predicting student’s major
Posterior (of class): P(major | glasses)
Priors: P(CS) = 0.4, P(Physics) = 0.3, P(EE) = 0.3
Probability of data: P(yes) = 0.6, P(no) = 0.4
Bayes: P(major|glasses) = P(glasses|major) P(major) / P(glasses)
The MAP prediction is the major with the largest posterior, computed separately for glasses = yes and glasses = no.
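The HW 3 likelihood table is not reproduced on this slide, so the values of P(glasses = yes | major) below are hypothetical placeholders (chosen only to stay consistent with P(yes) = 0.6); the priors and Bayes's rule are as given above.

```python
# Priors and P(glasses) are from the slide; the likelihoods are HYPOTHETICAL.
priors = {"CS": 0.4, "Physics": 0.3, "EE": 0.3}
p_yes_given_major = {"CS": 0.9, "Physics": 0.6, "EE": 0.2}   # hypothetical placeholders

for glasses in ("yes", "no"):
    # P(glasses | major): the yes-probabilities or their complements
    lik = {m: (p if glasses == "yes" else 1 - p) for m, p in p_yes_given_major.items()}
    evidence = sum(lik[m] * priors[m] for m in priors)        # P(glasses): 0.6 / 0.4 here
    posterior = {m: lik[m] * priors[m] / evidence for m in priors}
    map_major = max(posterior, key=posterior.get)
    print(glasses, {m: round(p, 3) for m, p in posterior.items()}, "-> MAP:", map_major)
```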
Summary: Key Points for Today
Parametric methods
  Data comes from a distribution
  Bernoulli, Gaussian, and their parameters
  How good is a parameter estimate? (bias, variance)
Bayes estimation
  ML: use the data
  MAP: use the prior and the data
  Bayes estimator: integrated estimate (weighted)
Parametric classification
  Maximize the posterior probability
Next Time
Clustering! (read Ch. 7.1-7.4; 7.8 optional)
No reading questions