
20: Maximum Likelihood Estimation
Lisa Yan
May 20, 2020

Lisa Yan, CS109, 2020

Quick slide reference

3 Intro to parameter estimation (20a_intro)
14 Maximum Likelihood Estimator (20b_mle)
21 argmax and log-likelihood (20c_argmax)
30 MLE: Bernoulli (20d_mle_bernoulli)
42 MLE exercises: Poisson, Uniform, Gaussian (LIVE)

Intro to parameter estimation (20a_intro)


Story so far

At this point: if you are given a model with all the necessary probabilities, you can make predictions.

Examples: Y ~ Poi(5); X_1, …, X_n i.i.d. with X_i ~ Ber(0.2), X = Σ_{i=1}^n X_i

But what if you want to learn the probabilities in the model? What if you want to learn the structure of the model, too? Machine learning! (I wish… another day.)


ML: Rooted in probability theory

AI and Machine Learning: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning


TensorFlow

Alright, so Deep Learning now?

Not so fast…


Once upon a time… there was parameter estimation.

Recall some estimators

X_1, X_2, …, X_n are n i.i.d. random variables, where each X_i is drawn from a distribution F with E[X_i] = μ and Var(X_i) = σ².

Sample mean: X̄ = (1/n) Σ_{i=1}^n X_i   (unbiased estimate of μ)

Sample variance: S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²   (unbiased estimate of σ²)
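These two estimators can be sketched directly in code (the sample values are assumed for illustration; results are checked against Python's statistics module):

```python
import statistics

def sample_mean(xs):
    """Unbiased estimate of mu: the average of the observations."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased estimate of sigma^2: note the division by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # assumed sample
print(sample_mean(xs))      # → 5.0
print(sample_variance(xs))  # 32/7, matches statistics.variance

assert sample_mean(xs) == statistics.mean(xs)
assert abs(sample_variance(xs) - statistics.variance(xs)) < 1e-12
```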


What are parameters?

def: Many random variables we have learned so far are parametric models:
Distribution = model + parameter θ

ex: The distribution Ber(0.2) = Bernoulli model, parameter θ = 0.2.

For each of the distributions below, what is the parameter θ? 🤔
1. Ber(p):      θ = p
2. Poi(λ):      θ = λ
3. Uni(α, β):   θ = (α, β)
4. 𝒩(μ, σ²):   θ = (μ, σ²)
5. Y = mX + b:  θ = (m, b)

θ is the parameter of a distribution. θ can be a vector of parameters!


Why do we care?

In the real world, we don't know the "true" parameters. But we do get to observe data (# times a coin comes up heads, lifetimes of disk drives produced, # visitors to a website per day, etc.).

def: An estimator θ̂ is a random variable estimating the parameter θ from data.

In parameter estimation, we use the point estimate of a parameter (best single value) for:
• Better understanding of the process producing the data
• Future predictions based on the model
• Simulation of future processes

Maximum Likelihood Estimator (20b_mle)


Defining the likelihood of data: Bernoulli

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n, where each X_i was drawn from the distribution F = Ber(θ) with unknown parameter θ.

Observed data (n = 10): 0, 0, 1, 1, 1, 1, 1, 1, 1, 1

How likely was the observed data if θ = 0.4?

P(sample | θ = 0.4) = (0.4)^8 (0.6)^2 ≈ 0.000236

This is the likelihood of the data given parameter θ = 0.4. Is there a better parameter θ?
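The likelihood computation above is easy to reproduce directly (a minimal sketch; the sample and θ = 0.4 are from the slide):

```python
import math

def bernoulli_likelihood(sample, theta):
    """Probability of an i.i.d. Bernoulli sample given parameter theta."""
    return math.prod(theta if x == 1 else 1 - theta for x in sample)

sample = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # n = 10, eight ones and two zeros
L = bernoulli_likelihood(sample, 0.4)     # 0.4**8 * 0.6**2
print(round(L, 6))                        # → 0.000236
```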


Defining the likelihood of data

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n, where each X_i was drawn from a distribution with density (or mass) function f(X_i | θ).

Likelihood question: how likely is the observed data X_1, X_2, …, X_n given parameter θ?

Likelihood function:
L(θ) = f(X_1, X_2, …, X_n | θ) = ∏_{i=1}^n f(X_i | θ)

This is just a product, since the X_i are i.i.d.

Lisa Yan, CS109, 2020

Defining the likelihood of data

L(θ) = ∏_{i=1}^n f(X_i | θ)


Maximum Likelihood Estimator

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n, drawn from a distribution f(X_i | θ).

def: The Maximum Likelihood Estimator (MLE) of θ is the value of θ that maximizes L(θ):

θ̂_MLE = argmax_θ L(θ)

where L(θ) = ∏_{i=1}^n f(X_i | θ) is the likelihood of your sample. For continuous X_i, f(X_i | θ) is a PDF; for discrete X_i, f(X_i | θ) is a PMF.

argmax_θ means "the argument θ that maximizes L(θ)". Stay tuned!

argmax (20c_argmax)


New function: argmax

argmax_x f(x) is the argument x that maximizes the function f(x).

Let f(x) = −x² + 4, where −2 < x < 2. (Plot of f(x) vs x: a downward parabola peaking at x = 0.)

1. max_x f(x) = ? 🤔   Answer: 4
2. argmax_x f(x) = ? 🤔   Answer: 0


Argmax and log

argmax_x f(x) is the argument x that maximizes the function f(x).

Let f(x) = −x² + 4, where −2 < x < 2. (Plots of f(x) and log f(x) vs x: both peak at x = 0.)

argmax_x f(x) = 0 = argmax_x log f(x)


Logs all around

(Plot of log x.)

• Log is monotonic: x ≤ y ⟺ log x ≤ log y
• Log of a product = sum of logs: log(ab) = log a + log b
• Natural logs: in this class, log x means ln x


Argmax properties

argmax_x f(x)  (the argument x that maximizes the function f(x))
  = argmax_x log f(x)       (log is monotonic: x ≤ y ⟺ log x ≤ log y)
  = argmax_x c · log f(x)   for any positive constant c  (x ≤ y ⟺ c log x ≤ c log y)

How do we compute an argmax?
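Both invariances are easy to check numerically; the sketch below grid-searches f(x) = −x² + 4 on the open interval (−2, 2), where f > 0 so the log is defined:

```python
import math

def argmax(xs, g):
    """Return the x in xs that maximizes g(x)."""
    return max(xs, key=g)

f = lambda x: -x * x + 4
xs = [i / 1000 for i in range(-1999, 2000)]  # grid over (-2, 2)

# The argmax is unchanged by taking log, or by a positive constant factor.
assert argmax(xs, f) == argmax(xs, lambda x: math.log(f(x)))
assert argmax(xs, f) == argmax(xs, lambda x: 3 * math.log(f(x)))
print(argmax(xs, f))  # → 0.0
```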


Finding the argmax with calculus

x̂ = argmax_x f(x). Let f(x) = −x² + 4, where −2 < x < 2.

Differentiate w.r.t. argmax's argument:  d/dx f(x) = d/dx (−x² + 4) = −2x

Set to 0 and solve:  −2x = 0 ⇒ x̂ = 0

Make sure x̂ is a maximum:
• Check f(x̂ ± ε) < f(x̂)
• Often ignored in expository derivations
• We'll ignore it here too (and won't require it in class)

MLE: Bernoulli (20d_mle_bernoulli)


Computing the MLE

General approach for finding θ̂_MLE, the MLE of θ:   θ̂_MLE = argmax_θ LL(θ)

1. Determine a formula for the log-likelihood:  LL(θ) = Σ_{i=1}^n log f(X_i | θ).
   (LL(θ) is often easier to differentiate than L(θ).)
2. Differentiate LL(θ) w.r.t. (each) θ: ∂LL(θ)/∂θ. To maximize, set ∂LL(θ)/∂θ = 0.
3. Solve the resulting (simultaneous) equations (algebra or computer).
4. Make sure the derived θ̂_MLE is a maximum:
   • Check LL(θ̂_MLE ± ε) < LL(θ̂_MLE)
   • Often ignored in expository derivations
   • We'll ignore it here too (and won't require it in class)


Maximum Likelihood with Bernoulli

Consider a sample of n i.i.d. RVs X_1, X_2, …, X_n. Let X_i ~ Ber(p). What is θ̂_MLE = p̂_MLE?

1. Determine a formula for LL(θ).

The PMF by cases: f(X_i | p) = p if X_i = 1, and 1 − p if X_i = 0.

Equivalently: f(X_i | p) = p^{X_i} (1 − p)^{1 − X_i}, where X_i ∈ {0, 1}.
• Is differentiable with respect to p ✅
• Valid PMF over a discrete domain ✅

LL(θ) = Σ_{i=1}^n log f(X_i | p)
      = Σ_{i=1}^n log [ p^{X_i} (1 − p)^{1 − X_i} ]
      = Σ_{i=1}^n [ X_i log p + (1 − X_i) log(1 − p) ]
      = Y log p + (n − Y) log(1 − p),  where Y = Σ_{i=1}^n X_i

2. Differentiate LL(θ) w.r.t. p, set to 0:

∂LL(θ)/∂p = Y (1/p) + (n − Y) (−1/(1 − p)) = 0

3. Solve the resulting equation:

p̂_MLE = (1/n) Y = (1/n) Σ_{i=1}^n X_i

The MLE of the Bernoulli parameter, p̂_MLE, is the unbiased estimate of the mean, X̄ (the sample mean).
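The closed form can be checked against the log-likelihood directly (a small sketch; the sample here is assumed for illustration):

```python
import math

def bernoulli_ll(data, p):
    """LL(p) = Y log p + (n - Y) log(1 - p), with Y = sum of the X_i."""
    n, y = len(data), sum(data)
    return y * math.log(p) + (n - y) * math.log(1 - p)

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # assumed sample: Y = 7, n = 10
p_mle = sum(data) / len(data)           # the sample mean
print(p_mle)                            # → 0.7

# Step 4's sanity check: LL at the MLE beats LL at nearby parameters.
eps = 1e-3
assert bernoulli_ll(data, p_mle) > bernoulli_ll(data, p_mle - eps)
assert bernoulli_ll(data, p_mle) > bernoulli_ll(data, p_mle + eps)
```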


MLE of Bernoulli is the sample mean

Bernoulli: f(X_i | p) = p^{X_i} (1 − p)^{1 − X_i}, where X_i ∈ {0, 1}

LL(θ) = Σ_{i=1}^n log f(X_i | p)  ⇒  p̂_MLE = X̄ = (1/n) Σ_{i=1}^n X_i


Quick check

• You draw n i.i.d. random variables X_1, X_2, …, X_n from the distribution F, yielding the following sample (n = 10): 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
• Suppose the distribution F = Ber(p) with unknown parameter p.

1. What is p̂_MLE, the MLE of the parameter p? 🤔
   A. 1.0   B. 0.5   C. 0.8   D. 0.2   E. None/other

   Answer: C.  p̂_MLE = X̄ = (1/n) Σ_{i=1}^n X_i = 8/10 = 0.8

2. What is the likelihood L(θ) of this particular sample?

   L(θ) = ∏_{i=1}^n f(X_i | p), where f(X_i | p) = p^{X_i} (1 − p)^{1 − X_i} for X_i ∈ {0, 1}
        = p^8 (1 − p)^2,  where θ = p

(live) 20: Maximum Likelihood Estimation
Lisa Yan
May 20, 2020


Computing the MLE (Review)

General approach for finding θ̂_MLE, the MLE of θ:   θ̂_MLE = argmax_θ LL(θ)

1. Determine a formula for LL(θ) = Σ_{i=1}^n log f(X_i | θ). (LL(θ) is often easier to differentiate than L(θ).)
2. Differentiate LL(θ) w.r.t. (each) θ; to maximize, set ∂LL(θ)/∂θ = 0.
3. Solve the resulting (simultaneous) equations (algebra or computer).
4. Make sure the derived θ̂_MLE is a maximum: check LL(θ̂_MLE ± ε) < LL(θ̂_MLE). (Often ignored in expository derivations; we'll ignore it here too and won't require it in class.)


Maximum Likelihood with Poisson

Consider a sample of n i.i.d. RVs X_1, X_2, …, X_n. Let X_i ~ Poi(λ). What is θ̂_MLE = λ̂_MLE?

PMF: f(X_i | λ) = e^{−λ} λ^{X_i} / X_i!

1. Determine a formula for LL(θ):

LL(θ) = Σ_{i=1}^n log [ e^{−λ} λ^{X_i} / X_i! ]
      = Σ_{i=1}^n [ −λ log e + X_i log λ − log X_i! ]   (using natural log, ln e = 1)
      = −nλ + log λ · Σ_{i=1}^n X_i − Σ_{i=1}^n log X_i!

2. Differentiate LL(θ) w.r.t. λ, set to 0. What is ∂LL(θ)/∂λ? 🤔

A. −n + (1/λ) Σ_{i=1}^n X_i + n log λ − Σ_{i=1}^n (1/X_i!) · ∂X_i!/∂X_i
B. −n + (1/λ) Σ_{i=1}^n X_i
C. None/other/don't know

Answer: B.  ∂LL(θ)/∂λ = −n + (1/λ) Σ_{i=1}^n X_i = 0

3. Solve the resulting equation:

λ̂_MLE = (1/n) Σ_{i=1}^n X_i

The MLE of the Poisson parameter, λ̂_MLE, is the unbiased estimate of the mean, X̄ (the sample mean).
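A quick numeric check that λ̂_MLE = X̄ maximizes the Poisson log-likelihood (a sketch; the counts below are assumed for illustration):

```python
import math

def poisson_ll(data, lam):
    """LL(lambda) = -n*lambda + log(lambda)*sum(X_i) - sum(log X_i!)."""
    n = len(data)
    return (-n * lam + math.log(lam) * sum(data)
            - sum(math.lgamma(x + 1) for x in data))  # lgamma(x+1) = log(x!)

data = [3, 7, 4, 6, 5]          # assumed counts
lam_mle = sum(data) / len(data)  # the sample mean
print(lam_mle)                   # → 5.0

# The sample mean beats nearby values of lambda in log-likelihood.
assert poisson_ll(data, lam_mle) > poisson_ll(data, lam_mle - 0.01)
assert poisson_ll(data, lam_mle) > poisson_ll(data, lam_mle + 0.01)
```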


Quick check

1. A particular experiment can be modeled as a Poisson RV with parameter λ, in events/minute. Collect data: observe 53 events over the next 10 minutes. What is λ̂_MLE? 🤔
   λ̂_MLE = (1/n) Σ_{i=1}^n X_i = 53/10 = 5.3 events/minute
2. Is the Bernoulli MLE an unbiased estimator of the Bernoulli parameter p? ✅
3. Is the Poisson MLE an unbiased estimator of the Poisson variance? ✅ (a Poisson's variance equals λ)
4. What does unbiased mean?
   E[estimator] = true_thing. Unbiased: if you could repeat your experiment, on average you would get what you are looking for.

Interlude for jokes/announcements

Announcements

Problem Set 5: Only do problems on the official Pset handout.

Problem Set 6: Released today! Due Wed. August 12 (no late days or on-time bonus).

Regrade Requests: Pset 1-5 and midterm regrade requests are due by August 11 via Gradescope. Please submit Pset 6 regrades only in extreme cases (e.g. we didn't see your answers because of mislabeled pages) via email.

Completely Optional Project: You may be able to replace an early Pset grade that you're unhappy with by completing a CS109-related project. Details here: https://us.edstem.org/courses/667/discussion/98951


Interesting probability news

"Bernoulli's trials can tell you how many job applications to send"
https://swizec.com/blog/bernoullis-trials-can-tell-many-job-applications-send/swizec/7677

Are these trials independent? Are the probabilities consistent across jobs?


Maximum Likelihood with Uniform

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ Uni(α, β).

f(X_i | α, β) = 1/(β − α) if α ≤ x_i ≤ β, and 0 otherwise

1. Determine a formula for L(θ):

L(θ) = (1/(β − α))^n if α ≤ x_1, x_2, …, x_n ≤ β, and 0 otherwise

2. Differentiate LL(θ) w.r.t. (each) θ, set to 0? 🤔
A. Great, let's do it
B. Differentiation is hard
C. The constraint α ≤ x_1, x_2, …, x_n ≤ β makes differentiation hard


Example sample from a Uniform

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ Uni(α, β).

L(θ) = (1/(β − α))^n if α ≤ x_1, x_2, …, x_n ≤ β, and 0 otherwise

Suppose X_i ~ Uni(0, 1). You observe data: 0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75

Which parameters would give you maximum L(θ)? 🤔
A. Uni(α = 0, β = 1):        L(θ) = (1/1)^7 = 1
B. Uni(α = 0.15, β = 0.75):  L(θ) = (1/0.60)^7 ≈ 35.7
C. Uni(α = 0.15, β = 0.70):  L(θ) = 0 (the observation 0.75 falls outside the interval)

Answer: B. ⚠ The original parameters may not yield maximum likelihood.


Maximum Likelihood with Uniform

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ Uni(α, β).

L(θ) = (1/(β − α))^n if α ≤ x_1, x_2, …, x_n ≤ β, and 0 otherwise

θ̂_MLE:  α̂_MLE = min(x_1, x_2, …, x_n),  β̂_MLE = max(x_1, x_2, …, x_n)

Intuition:
• Want the interval size β − α to be as small as possible, to maximize the likelihood function per datapoint
• Need to make sure all observed data is in the interval (if not, then L(θ) = 0)
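The min/max estimator in code, using the observed sample from the example above (note the shrunken, biased interval):

```python
def uniform_mle(data):
    """MLE for Uni(alpha, beta): the tightest interval containing all data."""
    return min(data), max(data)

def uniform_likelihood(data, alpha, beta):
    """(1/(beta - alpha))^n if all data lie in [alpha, beta], else 0."""
    if not all(alpha <= x <= beta for x in data):
        return 0.0
    return (1 / (beta - alpha)) ** len(data)

data = [0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75]  # drawn from Uni(0, 1)
a, b = uniform_mle(data)
print(a, b)  # → 0.15 0.75

# The MLE interval beats the true parameters in likelihood...
assert uniform_likelihood(data, a, b) > uniform_likelihood(data, 0, 1)
# ...and any interval that misses a datapoint has likelihood 0.
assert uniform_likelihood(data, 0.15, 0.70) == 0.0
```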


Small samples = problems with MLE

The Maximum Likelihood Estimator θ̂_MLE = argmax_θ L(θ):
• Best explains the data we have seen
• Does not attempt to generalize to unseen data

In many cases, the sample mean μ̂_MLE = (1/n) Σ_{i=1}^n X_i (the MLE for Bernoulli p, Poisson λ, Normal μ) is:
• Unbiased ✅ (E[μ̂_MLE] = μ regardless of the sample size n)

For some cases, like Uniform (α̂_MLE ≥ α, β̂_MLE ≤ β):
• Biased ⚠. Problematic for small sample sizes.
• Example: if n = 1, then α̂ = β̂, yielding an invalid distribution.


Properties of MLE

The Maximum Likelihood Estimator θ̂_MLE = argmax_θ L(θ):
• Best explains the data we have seen
• Does not attempt to generalize to unseen data
• Often used when the sample size n is large relative to the parameter space
• Potentially biased (though asymptotically less so, as n → ∞)
• Consistent: as n → ∞ (i.e., more data), the probability that θ̂ significantly differs from θ goes to zero:

lim_{n→∞} P(|θ̂ − θ| < ε) = 1, for any ε > 0


Maximum Likelihood with Normal

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ 𝒩(μ, σ²). What is θ̂_MLE = (μ̂_MLE, σ̂²_MLE)?

f(X_i | μ, σ²) = (1/(√(2π) σ)) e^{−(X_i − μ)² / (2σ²)}

1. Determine a formula for LL(θ):

LL(θ) = Σ_{i=1}^n log [ (1/(√(2π) σ)) e^{−(X_i − μ)² / (2σ²)} ]
      = Σ_{i=1}^n [ −log(√(2π) σ) − (X_i − μ)² / (2σ²) ]   (using natural log)
      = −Σ_{i=1}^n log(√(2π) σ) − Σ_{i=1}^n (X_i − μ)² / (2σ²)


Maximum Likelihood with Normal

LL(θ) = −Σ_{i=1}^n log(√(2π) σ) − Σ_{i=1}^n (X_i − μ)² / (2σ²)

2. Differentiate LL(θ) w.r.t. each parameter, set to 0:

With respect to μ:
∂LL(θ)/∂μ = Σ_{i=1}^n 2(X_i − μ) / (2σ²) = (1/σ²) Σ_{i=1}^n (X_i − μ) = 0

With respect to σ:
∂LL(θ)/∂σ = −Σ_{i=1}^n (1/σ) + Σ_{i=1}^n 2(X_i − μ)² / (2σ³) = −n/σ + (1/σ³) Σ_{i=1}^n (X_i − μ)² = 0


Maximum Likelihood with Normal

3. Solve the resulting equations. Two equations, two unknowns:

(1/σ²) Σ_{i=1}^n (X_i − μ) = 0   and   −n/σ + (1/σ³) Σ_{i=1}^n (X_i − μ)² = 0

First, solve for μ̂_MLE:
(1/σ²) Σ_{i=1}^n X_i − (1/σ²) Σ_{i=1}^n μ = 0  ⇒  Σ_{i=1}^n X_i = nμ  ⇒  μ̂_MLE = (1/n) Σ_{i=1}^n X_i   (unbiased)

Next, solve for σ̂_MLE:
(1/σ³) Σ_{i=1}^n (X_i − μ)² = n/σ  ⇒  Σ_{i=1}^n (X_i − μ)² = nσ²  ⇒  σ̂²_MLE = (1/n) Σ_{i=1}^n (X_i − μ̂_MLE)²   (biased)
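Both Normal estimates in code (a sketch with an assumed sample; note the biased 1/n divisor, versus the unbiased 1/(n − 1) sample variance from earlier in the lecture):

```python
def normal_mle(data):
    """MLE for N(mu, sigma^2): sample mean and the biased (1/n) variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n   # divides by n, not n - 1
    return mu, var

data = [2.0, 4.0, 4.0, 6.0]   # assumed sample
mu, var = normal_mle(data)
print(mu, var)  # → 4.0 2.0

# The unbiased (n - 1) sample variance is strictly larger.
s2 = sum((x - mu) ** 2 for x in data) / (len(data) - 1)
assert var < s2   # the MLE variance is biased low
```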


Estimating a Bernoulli parameter

Consider n i.i.d. random variables X_1, X_2, …, X_n. Suppose the distribution F = Ber(θ) with unknown parameter θ. Say you have three coins: θ_1 = 0.5, θ_2 = 0.8, or θ_3 = 1.

Which coin is most likely to give you the following sample (n = 10)? 🤔
0, 0, 1, 1, 1, 1, 1, 1, 1, 1

P(sample | θ = 0.5) = (0.5)^8 (0.5)^2 ≈ 0.00097
P(sample | θ = 0.8) = (0.8)^8 (0.2)^2 ≈ 0.00671   ← most likely, so choose this coin
P(sample | θ = 1.0) = (1.0)^8 (0)^2 = 0

How do we write this process mathematically?

θ̂ = argmax_{θ ∈ {0.5, 0.8, 1}} θ^8 (1 − θ)^2 = 0.8
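The argmax over the three candidate coins, in code (a minimal sketch of the computation above):

```python
def likelihood(theta, heads=8, tails=2):
    """P(sample | theta) for a sample with eight ones and two zeros."""
    return theta ** heads * (1 - theta) ** tails

coins = [0.5, 0.8, 1.0]
best = max(coins, key=likelihood)  # the argmax over the candidate thetas
print(best)                        # → 0.8
print(round(likelihood(0.5), 5))   # → 0.00098
print(round(likelihood(0.8), 5))   # → 0.00671
```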


Maximum Likelihood with Bernoulli

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ Ber(p). What is θ̂_MLE = p̂_MLE?

1. Determine a formula for LL(θ) = Σ_{i=1}^n log f(X_i | p).

What is the PMF f(X_i | p)? 🤔
A. p
B. 1 − p
C. p if X_i = 1, and 1 − p if X_i = 0
D. p^{X_i} (1 − p)^{1 − X_i}, where X_i ∈ {0, 1}

Answer: D, which:
• Is differentiable
• Is a valid PMF over a discrete domain

2. Differentiate LL(θ) w.r.t. (each) θ, set to 0.
3. Solve the resulting equations.