
20: Maximum Likelihood Estimation
Lisa Yan
May 20, 2020

Lisa Yan, CS109, 2020

Quick slide reference

3 Intro to parameter estimation (20a_intro)
14 Maximum Likelihood Estimator (20b_mle)
21 argmax and log-likelihood (20c_argmax)
30 MLE: Bernoulli (20d_mle_bernoulli)
42 MLE exercises: Poisson, Uniform, Gaussian (LIVE)

Intro to parameter estimation (20a_intro)


Story so far

At this point: if you are given a model with all the necessary probabilities, you can make predictions.

Examples: Y ~ Poi(5); X_1, …, X_n i.i.d. with X_i ~ Ber(0.2), X = Σ_{i=1}^n X_i

But what if you want to learn the probabilities in the model? What if you want to learn the structure of the model, too? Machine learning! (I wish… another day.)


ML: Rooted in probability theory

AI and Machine Learning: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning


TensorFlow

Alright, so Deep Learning now?

Not so fast…


Once upon a time… there was parameter estimation.

Recall some estimators

X_1, X_2, …, X_n are n i.i.d. random variables, where each X_i is drawn from a distribution F with E[X_i] = μ and Var(X_i) = σ².

Sample mean: X̄ = (1/n) Σ_{i=1}^n X_i   (unbiased estimate of μ)

Sample variance: S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²   (unbiased estimate of σ²)
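These two estimators can be sketched directly in code (the sample values are assumed for illustration; results are checked against Python's statistics module):

```python
import statistics

def sample_mean(xs):
    """Unbiased estimate of mu: the average of the observations."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased estimate of sigma^2: note the division by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # assumed sample
print(sample_mean(xs))      # → 5.0
print(sample_variance(xs))  # 32/7, matches statistics.variance

assert sample_mean(xs) == statistics.mean(xs)
assert abs(sample_variance(xs) - statistics.variance(xs)) < 1e-12
```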


What are parameters?

def: Many random variables we have learned so far are parametric models:
Distribution = model + parameter θ

ex: The distribution Ber(0.2) = Bernoulli model, parameter θ = 0.2.

For each of the distributions below, what is the parameter θ? 🤔
1. Ber(p):      θ = p
2. Poi(λ):      θ = λ
3. Uni(α, β):   θ = (α, β)
4. 𝒩(μ, σ²):   θ = (μ, σ²)
5. Y = mX + b:  θ = (m, b)

θ is the parameter of a distribution. θ can be a vector of parameters!


Why do we care?

In the real world, we don't know the "true" parameters. But we do get to observe data (# times a coin comes up heads, lifetimes of disk drives produced, # visitors to a website per day, etc.).

def: An estimator θ̂ is a random variable estimating the parameter θ from data.

In parameter estimation, we use the point estimate of a parameter (best single value) for:
• Better understanding of the process producing the data
• Future predictions based on the model
• Simulation of future processes

Maximum Likelihood Estimator (20b_mle)


Defining the likelihood of data: Bernoulli

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n, where each X_i was drawn from the distribution F = Ber(θ) with unknown parameter θ.

Observed data (n = 10): 0, 0, 1, 1, 1, 1, 1, 1, 1, 1

How likely was the observed data if θ = 0.4?

P(sample | θ = 0.4) = (0.4)^8 (0.6)^2 ≈ 0.000236

This is the likelihood of the data given parameter θ = 0.4. Is there a better parameter θ?
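The likelihood computation above is easy to reproduce directly (a minimal sketch; the sample and θ = 0.4 are from the slide):

```python
import math

def bernoulli_likelihood(sample, theta):
    """Probability of an i.i.d. Bernoulli sample given parameter theta."""
    return math.prod(theta if x == 1 else 1 - theta for x in sample)

sample = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # n = 10, eight ones and two zeros
L = bernoulli_likelihood(sample, 0.4)     # 0.4**8 * 0.6**2
print(round(L, 6))                        # → 0.000236
```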


Defining the likelihood of data

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n, where each X_i was drawn from a distribution with density (or mass) function f(X_i | θ).

Likelihood question: how likely is the observed data X_1, X_2, …, X_n given parameter θ?

Likelihood function:
L(θ) = f(X_1, X_2, …, X_n | θ) = ∏_{i=1}^n f(X_i | θ)

This is just a product, since the X_i are i.i.d.

Lisa Yan, CS109, 2020

Defining the likelihood of data

L(θ) = ∏_{i=1}^n f(X_i | θ)


Maximum Likelihood Estimator

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n, drawn from a distribution f(X_i | θ).

def: The Maximum Likelihood Estimator (MLE) of θ is the value of θ that maximizes L(θ):

θ̂_MLE = argmax_θ L(θ)

where L(θ) = ∏_{i=1}^n f(X_i | θ) is the likelihood of your sample. For continuous X_i, f(X_i | θ) is a PDF; for discrete X_i, f(X_i | θ) is a PMF.

argmax_θ means "the argument θ that maximizes L(θ)". Stay tuned!

argmax (20c_argmax)


New function: argmax

argmax_x f(x) is the argument x that maximizes the function f(x).

Let f(x) = −x² + 4, where −2 < x < 2. (Plot of f(x) vs x: a downward parabola peaking at x = 0.)

1. max_x f(x) = ? 🤔   Answer: 4
2. argmax_x f(x) = ? 🤔   Answer: 0


Argmax and log

argmax_x f(x) is the argument x that maximizes the function f(x).

Let f(x) = −x² + 4, where −2 < x < 2. (Plots of f(x) and log f(x) vs x: both peak at x = 0.)

argmax_x f(x) = 0 = argmax_x log f(x)


Logs all around

(Plot of log x.)

• Log is monotonic: x ≤ y ⟺ log x ≤ log y
• Log of a product = sum of logs: log(ab) = log a + log b
• Natural logs: in this class, log x means ln x


Argmax properties

argmax_x f(x)  (the argument x that maximizes the function f(x))
  = argmax_x log f(x)       (log is monotonic: x ≤ y ⟺ log x ≤ log y)
  = argmax_x c · log f(x)   for any positive constant c  (x ≤ y ⟺ c log x ≤ c log y)

How do we compute an argmax?
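Both invariances are easy to check numerically; the sketch below grid-searches f(x) = −x² + 4 on the open interval (−2, 2), where f > 0 so the log is defined:

```python
import math

def argmax(xs, g):
    """Return the x in xs that maximizes g(x)."""
    return max(xs, key=g)

f = lambda x: -x * x + 4
xs = [i / 1000 for i in range(-1999, 2000)]  # grid over (-2, 2)

# The argmax is unchanged by taking log, or by a positive constant factor.
assert argmax(xs, f) == argmax(xs, lambda x: math.log(f(x)))
assert argmax(xs, f) == argmax(xs, lambda x: 3 * math.log(f(x)))
print(argmax(xs, f))  # → 0.0
```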


Finding the argmax with calculus

x̂ = argmax_x f(x). Let f(x) = −x² + 4, where −2 < x < 2.

Differentiate w.r.t. argmax's argument:  d/dx f(x) = d/dx (−x² + 4) = −2x

Set to 0 and solve:  −2x = 0 ⇒ x̂ = 0

Make sure x̂ is a maximum:
• Check f(x̂ ± ε) < f(x̂)
• Often ignored in expository derivations
• We'll ignore it here too (and won't require it in class)

MLE: Bernoulli (20d_mle_bernoulli)


Computing the MLE

General approach for finding θ̂_MLE, the MLE of θ:   θ̂_MLE = argmax_θ LL(θ)

1. Determine a formula for the log-likelihood:  LL(θ) = Σ_{i=1}^n log f(X_i | θ).
   (LL(θ) is often easier to differentiate than L(θ).)
2. Differentiate LL(θ) w.r.t. (each) θ: ∂LL(θ)/∂θ. To maximize, set ∂LL(θ)/∂θ = 0.
3. Solve the resulting (simultaneous) equations (algebra or computer).
4. Make sure the derived θ̂_MLE is a maximum:
   • Check LL(θ̂_MLE ± ε) < LL(θ̂_MLE)
   • Often ignored in expository derivations
   • We'll ignore it here too (and won't require it in class)


Maximum Likelihood with Bernoulli

Consider a sample of n i.i.d. RVs X_1, X_2, …, X_n. Let X_i ~ Ber(p). What is θ̂_MLE = p̂_MLE?

1. Determine a formula for LL(θ).

The PMF by cases: f(X_i | p) = p if X_i = 1, and 1 − p if X_i = 0.

Equivalently: f(X_i | p) = p^{X_i} (1 − p)^{1 − X_i}, where X_i ∈ {0, 1}.
• Is differentiable with respect to p ✅
• Valid PMF over a discrete domain ✅

LL(θ) = Σ_{i=1}^n log f(X_i | p)
      = Σ_{i=1}^n log [ p^{X_i} (1 − p)^{1 − X_i} ]
      = Σ_{i=1}^n [ X_i log p + (1 − X_i) log(1 − p) ]
      = Y log p + (n − Y) log(1 − p),  where Y = Σ_{i=1}^n X_i

2. Differentiate LL(θ) w.r.t. p, set to 0:

∂LL(θ)/∂p = Y (1/p) + (n − Y) (−1/(1 − p)) = 0

3. Solve the resulting equation:

p̂_MLE = (1/n) Y = (1/n) Σ_{i=1}^n X_i

The MLE of the Bernoulli parameter, p̂_MLE, is the unbiased estimate of the mean, X̄ (the sample mean).
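The closed form can be checked against the log-likelihood directly (a small sketch; the sample here is assumed for illustration):

```python
import math

def bernoulli_ll(data, p):
    """LL(p) = Y log p + (n - Y) log(1 - p), with Y = sum of the X_i."""
    n, y = len(data), sum(data)
    return y * math.log(p) + (n - y) * math.log(1 - p)

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # assumed sample: Y = 7, n = 10
p_mle = sum(data) / len(data)           # the sample mean
print(p_mle)                            # → 0.7

# Step 4's sanity check: LL at the MLE beats LL at nearby parameters.
eps = 1e-3
assert bernoulli_ll(data, p_mle) > bernoulli_ll(data, p_mle - eps)
assert bernoulli_ll(data, p_mle) > bernoulli_ll(data, p_mle + eps)
```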


MLE of Bernoulli is the sample mean

Bernoulli: f(X_i | p) = p^{X_i} (1 − p)^{1 − X_i}, where X_i ∈ {0, 1}

LL(θ) = Σ_{i=1}^n log f(X_i | p)  ⇒  p̂_MLE = X̄ = (1/n) Σ_{i=1}^n X_i


Quick check

• You draw n i.i.d. random variables X_1, X_2, …, X_n from the distribution F, yielding the following sample (n = 10): 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
• Suppose the distribution F = Ber(p) with unknown parameter p.

1. What is p̂_MLE, the MLE of the parameter p? 🤔
   A. 1.0   B. 0.5   C. 0.8   D. 0.2   E. None/other

   Answer: C.  p̂_MLE = X̄ = (1/n) Σ_{i=1}^n X_i = 8/10 = 0.8

2. What is the likelihood L(θ) of this particular sample?

   L(θ) = ∏_{i=1}^n f(X_i | p), where f(X_i | p) = p^{X_i} (1 − p)^{1 − X_i} for X_i ∈ {0, 1}
        = p^8 (1 − p)^2,  where θ = p

(live) 20: Maximum Likelihood Estimation
Lisa Yan
May 20, 2020


Computing the MLE (Review)

General approach for finding θ̂_MLE, the MLE of θ:   θ̂_MLE = argmax_θ LL(θ)

1. Determine a formula for LL(θ) = Σ_{i=1}^n log f(X_i | θ). (LL(θ) is often easier to differentiate than L(θ).)
2. Differentiate LL(θ) w.r.t. (each) θ; to maximize, set ∂LL(θ)/∂θ = 0.
3. Solve the resulting (simultaneous) equations (algebra or computer).
4. Make sure the derived θ̂_MLE is a maximum: check LL(θ̂_MLE ± ε) < LL(θ̂_MLE). (Often ignored in expository derivations; we'll ignore it here too and won't require it in class.)


Maximum Likelihood with Poisson

Consider a sample of n i.i.d. RVs X_1, X_2, …, X_n. Let X_i ~ Poi(λ). What is θ̂_MLE = λ̂_MLE?

PMF: f(X_i | λ) = e^{−λ} λ^{X_i} / X_i!

1. Determine a formula for LL(θ):

LL(θ) = Σ_{i=1}^n log [ e^{−λ} λ^{X_i} / X_i! ]
      = Σ_{i=1}^n [ −λ log e + X_i log λ − log X_i! ]   (using natural log, ln e = 1)
      = −nλ + log λ · Σ_{i=1}^n X_i − Σ_{i=1}^n log X_i!

2. Differentiate LL(θ) w.r.t. λ, set to 0. What is ∂LL(θ)/∂λ? 🤔

A. −n + (1/λ) Σ_{i=1}^n X_i + n log λ − Σ_{i=1}^n (1/X_i!) · ∂X_i!/∂X_i
B. −n + (1/λ) Σ_{i=1}^n X_i
C. None/other/don't know

Answer: B.  ∂LL(θ)/∂λ = −n + (1/λ) Σ_{i=1}^n X_i = 0

3. Solve the resulting equation:

λ̂_MLE = (1/n) Σ_{i=1}^n X_i

The MLE of the Poisson parameter, λ̂_MLE, is the unbiased estimate of the mean, X̄ (the sample mean).
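A quick numeric check that λ̂_MLE = X̄ maximizes the Poisson log-likelihood (a sketch; the counts below are assumed for illustration):

```python
import math

def poisson_ll(data, lam):
    """LL(lambda) = -n*lambda + log(lambda)*sum(X_i) - sum(log X_i!)."""
    n = len(data)
    return (-n * lam + math.log(lam) * sum(data)
            - sum(math.lgamma(x + 1) for x in data))  # lgamma(x+1) = log(x!)

data = [3, 7, 4, 6, 5]          # assumed counts
lam_mle = sum(data) / len(data)  # the sample mean
print(lam_mle)                   # → 5.0

# The sample mean beats nearby values of lambda in log-likelihood.
assert poisson_ll(data, lam_mle) > poisson_ll(data, lam_mle - 0.01)
assert poisson_ll(data, lam_mle) > poisson_ll(data, lam_mle + 0.01)
```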


Quick check

1. A particular experiment can be modeled as a Poisson RV with parameter λ, in events/minute. Collect data: observe 53 events over the next 10 minutes. What is λ̂_MLE? 🤔
   λ̂_MLE = (1/n) Σ_{i=1}^n X_i = 53/10 = 5.3 events/minute
2. Is the Bernoulli MLE an unbiased estimator of the Bernoulli parameter p? ✅
3. Is the Poisson MLE an unbiased estimator of the Poisson variance? ✅ (a Poisson's variance equals λ)
4. What does unbiased mean?
   E[estimator] = true_thing. Unbiased: if you could repeat your experiment, on average you would get what you are looking for.

Interlude for jokes/announcements

Announcements

Problem Set 5: Only do problems on the official Pset handout.

Problem Set 6: Released today! Due Wed. August 12 (no late days or on-time bonus).

Regrade Requests: Pset 1-5 and midterm regrade requests are due by August 11 via Gradescope. Please submit Pset 6 regrades only in extreme cases (e.g. we didn't see your answers because of mislabeled pages) via email.

Completely Optional Project: You may be able to replace an early Pset grade that you're unhappy with by completing a CS109-related project. Details here: https://us.edstem.org/courses/667/discussion/98951


Interesting probability news

"Bernoulli's trials can tell you how many job applications to send"
https://swizec.com/blog/bernoullis-trials-can-tell-many-job-applications-send/swizec/7677

Are these trials independent? Are the probabilities consistent across jobs?


Maximum Likelihood with Uniform

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ Uni(α, β).

f(X_i | α, β) = 1/(β − α) if α ≤ x_i ≤ β, and 0 otherwise

1. Determine a formula for L(θ):

L(θ) = (1/(β − α))^n if α ≤ x_1, x_2, …, x_n ≤ β, and 0 otherwise

2. Differentiate LL(θ) w.r.t. (each) θ, set to 0? 🤔
A. Great, let's do it
B. Differentiation is hard
C. The constraint α ≤ x_1, x_2, …, x_n ≤ β makes differentiation hard


Example sample from a Uniform

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ Uni(α, β).

L(θ) = (1/(β − α))^n if α ≤ x_1, x_2, …, x_n ≤ β, and 0 otherwise

Suppose X_i ~ Uni(0, 1). You observe data: 0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75

Which parameters would give you maximum L(θ)? 🤔
A. Uni(α = 0, β = 1):        L(θ) = (1/1)^7 = 1
B. Uni(α = 0.15, β = 0.75):  L(θ) = (1/0.60)^7 ≈ 35.7
C. Uni(α = 0.15, β = 0.70):  L(θ) = 0 (the observation 0.75 falls outside the interval)

Answer: B. ⚠ The original parameters may not yield maximum likelihood.


Maximum Likelihood with Uniform

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ Uni(α, β).

L(θ) = (1/(β − α))^n if α ≤ x_1, x_2, …, x_n ≤ β, and 0 otherwise

θ̂_MLE:  α̂_MLE = min(x_1, x_2, …, x_n),  β̂_MLE = max(x_1, x_2, …, x_n)

Intuition:
• Want the interval size β − α to be as small as possible, to maximize the likelihood function per datapoint
• Need to make sure all observed data is in the interval (if not, then L(θ) = 0)
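The min/max estimator in code, using the observed sample from the example above (note the shrunken, biased interval):

```python
def uniform_mle(data):
    """MLE for Uni(alpha, beta): the tightest interval containing all data."""
    return min(data), max(data)

def uniform_likelihood(data, alpha, beta):
    """(1/(beta - alpha))^n if all data lie in [alpha, beta], else 0."""
    if not all(alpha <= x <= beta for x in data):
        return 0.0
    return (1 / (beta - alpha)) ** len(data)

data = [0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75]  # drawn from Uni(0, 1)
a, b = uniform_mle(data)
print(a, b)  # → 0.15 0.75

# The MLE interval beats the true parameters in likelihood...
assert uniform_likelihood(data, a, b) > uniform_likelihood(data, 0, 1)
# ...and any interval that misses a datapoint has likelihood 0.
assert uniform_likelihood(data, 0.15, 0.70) == 0.0
```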


Small samples = problems with MLE

The Maximum Likelihood Estimator θ̂_MLE = argmax_θ L(θ):
• Best explains the data we have seen
• Does not attempt to generalize to unseen data

In many cases, the sample mean μ̂_MLE = (1/n) Σ_{i=1}^n X_i (the MLE for Bernoulli p, Poisson λ, Normal μ) is:
• Unbiased ✅ (E[μ̂_MLE] = μ regardless of the sample size n)

For some cases, like Uniform (α̂_MLE ≥ α, β̂_MLE ≤ β):
• Biased ⚠. Problematic for small sample sizes.
• Example: if n = 1, then α̂ = β̂, yielding an invalid distribution.


Properties of MLE

The Maximum Likelihood Estimator θ̂_MLE = argmax_θ L(θ):
• Best explains the data we have seen
• Does not attempt to generalize to unseen data
• Often used when the sample size n is large relative to the parameter space
• Potentially biased (though asymptotically less so, as n → ∞)
• Consistent: as n → ∞ (i.e., more data), the probability that θ̂ significantly differs from θ goes to zero:

lim_{n→∞} P(|θ̂ − θ| < ε) = 1, for any ε > 0


Maximum Likelihood with Normal

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ 𝒩(μ, σ²). What is θ̂_MLE = (μ̂_MLE, σ̂²_MLE)?

f(X_i | μ, σ²) = (1/(√(2π) σ)) e^{−(X_i − μ)² / (2σ²)}

1. Determine a formula for LL(θ):

LL(θ) = Σ_{i=1}^n log [ (1/(√(2π) σ)) e^{−(X_i − μ)² / (2σ²)} ]
      = Σ_{i=1}^n [ −log(√(2π) σ) − (X_i − μ)² / (2σ²) ]   (using natural log)
      = −Σ_{i=1}^n log(√(2π) σ) − Σ_{i=1}^n (X_i − μ)² / (2σ²)


Maximum Likelihood with Normal

LL(θ) = −Σ_{i=1}^n log(√(2π) σ) − Σ_{i=1}^n (X_i − μ)² / (2σ²)

2. Differentiate LL(θ) w.r.t. each parameter, set to 0:

With respect to μ:
∂LL(θ)/∂μ = Σ_{i=1}^n 2(X_i − μ) / (2σ²) = (1/σ²) Σ_{i=1}^n (X_i − μ) = 0

With respect to σ:
∂LL(θ)/∂σ = −Σ_{i=1}^n (1/σ) + Σ_{i=1}^n 2(X_i − μ)² / (2σ³) = −n/σ + (1/σ³) Σ_{i=1}^n (X_i − μ)² = 0


Maximum Likelihood with Normal

3. Solve the resulting equations. Two equations, two unknowns:

(1/σ²) Σ_{i=1}^n (X_i − μ) = 0   and   −n/σ + (1/σ³) Σ_{i=1}^n (X_i − μ)² = 0

First, solve for μ̂_MLE:
(1/σ²) Σ_{i=1}^n X_i − (1/σ²) Σ_{i=1}^n μ = 0  ⇒  Σ_{i=1}^n X_i = nμ  ⇒  μ̂_MLE = (1/n) Σ_{i=1}^n X_i   (unbiased)

Next, solve for σ̂_MLE:
(1/σ³) Σ_{i=1}^n (X_i − μ)² = n/σ  ⇒  Σ_{i=1}^n (X_i − μ)² = nσ²  ⇒  σ̂²_MLE = (1/n) Σ_{i=1}^n (X_i − μ̂_MLE)²   (biased)
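Both Normal estimates in code (a sketch with an assumed sample; note the biased 1/n divisor, versus the unbiased 1/(n − 1) sample variance from earlier in the lecture):

```python
def normal_mle(data):
    """MLE for N(mu, sigma^2): sample mean and the biased (1/n) variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n   # divides by n, not n - 1
    return mu, var

data = [2.0, 4.0, 4.0, 6.0]   # assumed sample
mu, var = normal_mle(data)
print(mu, var)  # → 4.0 2.0

# The unbiased (n - 1) sample variance is strictly larger.
s2 = sum((x - mu) ** 2 for x in data) / (len(data) - 1)
assert var < s2   # the MLE variance is biased low
```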


Estimating a Bernoulli parameter

Consider n i.i.d. random variables X_1, X_2, …, X_n. Suppose the distribution F = Ber(θ) with unknown parameter θ. Say you have three coins: θ_1 = 0.5, θ_2 = 0.8, or θ_3 = 1.

Which coin is most likely to give you the following sample (n = 10)? 🤔
0, 0, 1, 1, 1, 1, 1, 1, 1, 1

P(sample | θ = 0.5) = (0.5)^8 (0.5)^2 ≈ 0.00097
P(sample | θ = 0.8) = (0.8)^8 (0.2)^2 ≈ 0.00671   ← most likely, so choose this coin
P(sample | θ = 1.0) = (1.0)^8 (0)^2 = 0

How do we write this process mathematically?

θ̂ = argmax_{θ ∈ {0.5, 0.8, 1}} θ^8 (1 − θ)^2 = 0.8
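The argmax over the three candidate coins, in code (a minimal sketch of the computation above):

```python
def likelihood(theta, heads=8, tails=2):
    """P(sample | theta) for a sample with eight ones and two zeros."""
    return theta ** heads * (1 - theta) ** tails

coins = [0.5, 0.8, 1.0]
best = max(coins, key=likelihood)  # the argmax over the candidate thetas
print(best)                        # → 0.8
print(round(likelihood(0.5), 5))   # → 0.00098
print(round(likelihood(0.8), 5))   # → 0.00671
```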


Maximum Likelihood with Bernoulli

Consider a sample of n i.i.d. random variables X_1, X_2, …, X_n. Let X_i ~ Ber(p). What is θ̂_MLE = p̂_MLE?

1. Determine a formula for LL(θ) = Σ_{i=1}^n log f(X_i | p).

What is the PMF f(X_i | p)? 🤔
A. p
B. 1 − p
C. p if X_i = 1, and 1 − p if X_i = 0
D. p^{X_i} (1 − p)^{1 − X_i}, where X_i ∈ {0, 1}

Answer: D, which:
• Is differentiable
• Is a valid PMF over a discrete domain

2. Differentiate LL(θ) w.r.t. (each) θ, set to 0.
3. Solve the resulting equations.