
18-661 Introduction to Machine Learning

Linear Regression – I

Spring 2020

ECE – Carnegie Mellon University

Outline

1. Recap of MLE/MAP

2. Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

1

Recap of MLE/MAP

Dogecoin

• Scenario: You find a coin on the ground.

• You ask yourself: Is this a fair or biased coin? What is the probability that I will flip heads?

2

• You flip the coin 10 times . . .

• It comes up as 'H' 8 times and 'T' 2 times

• Can we learn from this data?

3

Machine Learning Pipeline

[Pipeline figure: data → feature extraction → model & parameters → optimization → evaluation → intelligence]

Two approaches that we discussed:

• Maximum Likelihood Estimation (MLE)

• Maximum A Posteriori Estimation (MAP)

4

Maximum Likelihood Estimation (MLE)

• Data: Observed set D of nH heads and nT tails

• Model: Each flip follows a Bernoulli distribution

  P(H) = θ,  P(T) = 1 − θ,  θ ∈ [0, 1]

  Thus, the likelihood of observing sequence D is

  P(D | θ) = θ^nH (1 − θ)^nT

• Question: Given this model and the data we've observed, can we calculate an estimate of θ?

• MLE: Choose θ that maximizes the likelihood of the observed data

  θMLE = arg maxθ P(D | θ) = arg maxθ log P(D | θ) = nH / (nH + nT)

5

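As a quick check on this closed form, here is a minimal NumPy sketch (not part of the slides) that evaluates the Bernoulli log-likelihood on a grid for the 8-heads, 2-tails data and compares the numerical maximizer with nH / (nH + nT):

```python
import numpy as np

# Coin data from the slides: 8 heads, 2 tails
n_H, n_T = 8, 2

# Bernoulli log-likelihood: log P(D | theta) = n_H log(theta) + n_T log(1 - theta)
thetas = np.linspace(0.001, 0.999, 999)
log_lik = n_H * np.log(thetas) + n_T * np.log(1 - thetas)

theta_grid = thetas[np.argmax(log_lik)]   # numerical maximizer on the grid
theta_mle = n_H / (n_H + n_T)             # closed-form MLE

print(theta_grid, theta_mle)              # both approximately 0.8
```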

MAP for Dogecoin

  θMAP = arg maxθ P(θ | D) = arg maxθ P(D | θ) P(θ)

• Recall that P(D | θ) = θ^nH (1 − θ)^nT

• How should we set the prior, P(θ)?

• Common choice for a binomial likelihood is to use the Beta distribution, θ ∼ Beta(α, β):

  P(θ) = (1 / B(α, β)) θ^(α−1) (1 − θ)^(β−1)

• Interpretation: α = number of expected heads, β = number of expected tails. Larger value of α + β denotes more confidence (and smaller variance).

6


Putting it all together

  θMLE = nH / (nH + nT)

  θMAP = (α + nH − 1) / (α + β + nH + nT − 2)

• Suppose θ* := 0.5 and we observe: D = {H, H, T, T, T, T}

• Scenario 1: We assume θ ∼ Beta(4, 4). Which is more accurate – θMLE or θMAP?

  • θMAP = 5/12, θMLE = 1/3

• Scenario 2: We assume θ ∼ Beta(1, 7). Which is more accurate – θMLE or θMAP?

  • θMAP = 1/6, θMLE = 1/3

7

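A small sketch (not from the slides; it just plugs into the two closed-form estimators above) reproducing both scenarios for D = {H, H, T, T, T, T}:

```python
# MAP vs. MLE for the coin example: D = {H, H, T, T, T, T}, true theta* = 0.5
n_H, n_T = 2, 4

def theta_mle(n_H, n_T):
    return n_H / (n_H + n_T)

def theta_map(n_H, n_T, alpha, beta):
    # Mode of the Beta(alpha + n_H, beta + n_T) posterior
    return (alpha + n_H - 1) / (alpha + beta + n_H + n_T - 2)

print(theta_mle(n_H, n_T))           # 1/3
print(theta_map(n_H, n_T, 4, 4))     # 5/12: a reasonable prior pulls us toward 0.5
print(theta_map(n_H, n_T, 1, 7))     # 1/6: a badly mis-specified prior hurts
```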

Linear Regression

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

8

Task 1: Regression

How much should you sell your house for?

[Figure: scatter of house size vs. price ($) with an unknown point to predict]

input: houses & features  →  learn: x → y relationship  →  predict: y (continuous)

Course Covers: Linear/Ridge Regression, Loss Function, SGD, Feature Scaling, Regularization, Cross Validation

9

Supervised Learning

Supervised learning

In a supervised learning problem, you have access to input variables (X )

and outputs (Y ), and the goal is to predict an output given an input

• Examples:

• Housing prices (Regression): predict the price of a house based on

features (size, location, etc)

• Cat vs. Dog (Classification): predict whether a picture is of a cat

or a dog

10


Regression

Predicting a continuous outcome variable:

• Predicting a company’s future stock price using its profit and other

financial info

• Predicting annual rainfall based on local flora and fauna

• Predicting distance from a traffic light using LIDAR measurements

Magnitude of the error matters:

• We can measure ’closeness’ of prediction and labels, leading to

different ways to evaluate prediction errors.

• Predicting stock price: better to be off by $1 than by $20

• Predicting distance from a traffic light: better to be off by 1 m than by 10 m

• We should choose learning models and algorithms accordingly.

11


Ex: predicting the sale price of a house

Retrieve historical sales records

(This will be our training data)

12

Features used to predict

13

Correlation between square footage and sale price

14

Roughly linear relationship

Sale price ≈ price per sqft × square footage + fixed expense

15

Data Can be Compactly Represented by Matrices

[Figure: house size vs. price ($) scatter with a fitted orange line]

• Learn parameters (w0, w1) of the orange line y = w1 x + w0

  House 1 (1000 sq. ft.): 1000 w1 + w0 = 200,000
  House 2 (2000 sq. ft.): 2000 w1 + w0 = 350,000

• Can represent compactly in matrix notation:

  [ 1000  1 ] [ w1 ]   [ 200,000 ]
  [ 2000  1 ] [ w0 ] = [ 350,000 ]

16


Some Concepts That You Should Know

• Invertibility of Matrices and Computing Inverses

• Vector Norms – L2, Frobenius etc., Inner Products

• Eigenvalues and Eigenvectors

• Singular Value Decomposition

• Covariance Matrices and Positive Semi-definiteness

Excellent Resources:

• Essence of Linear Algebra YouTube Series

• Prof. Gilbert Strang’s course at MIT

17

Matrix Inverse

• Let us solve the house-price prediction problem

  [ 1000  1 ] [ w1 ]   [ 200,000 ]
  [ 2000  1 ] [ w0 ] = [ 350,000 ]                                  (1)

  [ w1 ]   [ 1000  1 ]⁻¹ [ 200,000 ]
  [ w0 ] = [ 2000  1 ]   [ 350,000 ]                                (2)

         = (1 / −1000) [    1      −1   ] [ 200,000 ]
                       [ −2000   1000   ] [ 350,000 ]               (3)

         = (1 / −1000) [ −150,000 ]
                       [ −5 × 10⁷ ]                                 (4)

  [ w1 ]   [   150  ]
  [ w0 ] = [ 50,000 ]                                               (5)

18
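Rather than inverting the 2×2 matrix by hand, the same system can be solved numerically; a minimal NumPy sketch (not from the slides):

```python
import numpy as np

# House-price system from the slides: [sqft 1] [w1 w0]^T = price
A = np.array([[1000.0, 1.0],
              [2000.0, 1.0]])
b = np.array([200_000.0, 350_000.0])

w1, w0 = np.linalg.solve(A, b)   # solve() is preferred over forming an explicit inverse
print(w1, w0)                    # 150.0, 50000.0
```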

You could have data from many houses

• Sale price =

price per sqft× square footage + fixed expense + unexplainable stuff

• Want to learn the price per sqft and fixed expense

• Training data: past sales record.

sqft sale price

2000 800K

2100 907K

1100 312K

5500 2,600K

· · · · · ·

Problem: there isn't a w = [w1, w0]ᵀ that satisfies all equations

19

Want to predict the best price per sqft and fixed expense

• Sale price =

price per sqft× square footage + fixed expense + unexplainable stuff

• Want to learn the price per sqft and fixed expense

• Training data: past sales record.

sqft sale price prediction

2000 810K 720K

2100 907K 800K

1100 312K 350K

5500 2,600K 2,600K

· · · · · · · · ·

20

Reduce prediction error

How to measure errors?

• absolute difference: |prediction − sale price|
• squared difference: (prediction − sale price)² [differentiable!]

sqft  sale price  prediction  abs error  squared error
2000  810K        720K        90K        8100
2100  907K        800K        107K       11449
1100  312K        350K        38K        1444
5500  2,600K      2,600K      0          0
· · ·

21


Geometric Illustration: Each house corresponds to one line

[Figure: each equation cnᵀw = yn (n = 1, . . . , 4) is a line in the (w0, w1) plane, and the residual vector is

  r = [ c1ᵀw − y1, c2ᵀw − y2, . . . , c4ᵀw − y4 ]ᵀ = Aw − y ]

• Want to find w that minimizes the difference between Xw and y

• But since this is a vector, we need an operator that can map the residual vector r(w) = y − Xw to a scalar

22

Norms and Loss Functions

• A vector norm is any function f : Rⁿ → R with

  • f(x) ≥ 0 and f(x) = 0 ⟺ x = 0
  • f(ax) = |a| f(x) for a ∈ R
  • triangle inequality: f(x + y) ≤ f(x) + f(y)

• e.g., ℓ2 norm: ‖x‖₂ = √(xᵀx) = √( Σᵢ xᵢ² )

• e.g., ℓ1 norm: ‖x‖₁ = Σᵢ |xᵢ|

• e.g., ℓ∞ norm: ‖x‖∞ = maxᵢ |xᵢ|

[Figure: from inside to outside, the ℓ1, ℓ2, ℓ∞ norm balls]

23

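A quick NumPy check of the three example norms (a sketch, not from the slides; it relies on the ord argument of np.linalg.norm):

```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 2))        # l2 norm: sqrt(9 + 16) = 5.0
print(np.linalg.norm(x, 1))        # l1 norm: |3| + |-4| = 7.0
print(np.linalg.norm(x, np.inf))   # l-infinity norm: max(|3|, |4|) = 4.0
```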

Minimize squared errors

Our model:

Sale price =

price per sqft× square footage + fixed expense + unexplainable stuff

Training data:

sqft  sale price  prediction  abs error  squared error
2000  810K        720K        90K        8100
2100  907K        800K        107K       11449
1100  312K        350K        38K        1444
5500  2,600K      2,600K      0          0
· · ·
Total: 8100 + 11449 + 1444 + 0 + · · ·

Aim:

Adjust price per sqft and fixed expense such that the sum of the squared

error is minimized — i.e., the unexplainable stuff is minimized.

24

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

25

Linear regression

Setup:

• Input: x ∈ R^D (covariates, predictors, features, etc.)

• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)

• Model: f : x → y, with f(x) = w0 + Σ_d wd xd = w0 + wᵀx

  • w = [w1 w2 · · · wD]ᵀ: weights, parameters, or parameter vector
  • w0 is called the bias
  • Sometimes, we also call w = [w0 w1 w2 · · · wD]ᵀ the parameters

• Training data: D = {(xn, yn), n = 1, 2, . . . , N}

Minimize the Residual sum of squares:

  RSS(w) = Σₙ [yn − f(xn)]² = Σₙ [yn − (w0 + Σ_d wd xnd)]²

26
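A minimal sketch of this model and the RSS objective (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def predict(X, w0, w):
    """f(x) = w0 + w^T x, applied to each row of X (shape N x D)."""
    return w0 + X @ w

def rss(X, y, w0, w):
    """Residual sum of squares over the training data."""
    residuals = y - predict(X, w0, w)
    return np.sum(residuals ** 2)
```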

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

27

A simple case: x is just one-dimensional (D=1)

Residual sum of squares:

  RSS(w) = Σₙ [yn − f(xn)]² = Σₙ [yn − (w0 + w1 xn)]²

What kind of function is this? CONVEX (it has a unique global minimum)

28


A simple case: x is just one-dimensional (D=1)

Residual sum of squares:

  RSS(w) = Σₙ [yn − f(xn)]² = Σₙ [yn − (w0 + w1 xn)]²

Stationary points:

Take the derivative with respect to each parameter and set it to zero:

  ∂RSS(w)/∂w0 = 0 ⇒ −2 Σₙ [yn − (w0 + w1 xn)] = 0,

  ∂RSS(w)/∂w1 = 0 ⇒ −2 Σₙ [yn − (w0 + w1 xn)] xn = 0.

29


A simple case: x is just one-dimensional (D=1)

  ∂RSS(w)/∂w0 = 0 ⇒ −2 Σₙ [yn − (w0 + w1 xn)] = 0

  ∂RSS(w)/∂w1 = 0 ⇒ −2 Σₙ [yn − (w0 + w1 xn)] xn = 0

Simplify these expressions to get the "Normal Equations":

  Σ yn = N w0 + w1 Σ xn

  Σ xn yn = w0 Σ xn + w1 Σ xn²

Solving the system, we obtain the least squares coefficient estimates:

  w1 = Σ (xn − x̄)(yn − ȳ) / Σ (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σₙ xn and ȳ = (1/N) Σₙ yn.

30


Example

sqft (1000's)  sale price (100k)
1              2
2              3.5
1.5            3
2.5            4.5

Residual sum of squares:

  RSS(w) = Σₙ [yn − f(xn)]² = Σₙ [yn − (w0 + w1 xn)]²

The w1 and w0 that minimize this are given by:

  w1 = Σ (xn − x̄)(yn − ȳ) / Σ (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σₙ xn and ȳ = (1/N) Σₙ yn.

31

Example

Plugging in the four data points: x̄ = 1.75, ȳ = 3.25, Σ (xn − x̄)(yn − ȳ) = 2, and Σ (xn − x̄)² = 1.25, so

  w1 ≈ 1.6 and w0 ≈ 0.45,

i.e., the fitted line is price ≈ 1.6 × sqft + 0.45 (in the units of the table).

32
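A short sketch (not from the slides) that implements the closed-form w1 and w0 above and checks them on the four-house example:

```python
import numpy as np

# Example data: sqft (1000's) and sale price (100k)
x = np.array([1.0, 2.0, 1.5, 2.5])
y = np.array([2.0, 3.5, 3.0, 4.5])

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

print(w1, w0)   # 1.6, 0.45
```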

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

33

Least Mean Squares when x is D-dimensional

sqft (1000’s) bedrooms bathrooms sale price (100k)

1 2 1 2

2 2 2 3.5

1.5 3 2 3

2.5 4 2.5 4.5

RSS(w) in matrix form:

  RSS(w) = Σₙ [yn − (w0 + Σ_d wd xnd)]² = Σₙ [yn − wᵀxn]²,

where we have redefined some variables (by augmenting)

  x ← [1 x1 x2 . . . xD]ᵀ,   w ← [w0 w1 w2 . . . wD]ᵀ

34

Least Mean Squares when x is D-dimensional

RSS(w) in matrix form:

  RSS(w) = Σₙ [yn − (w0 + Σ_d wd xnd)]² = Σₙ [yn − wᵀxn]²,

where we have redefined some variables (by augmenting)

  x ← [1 x1 x2 . . . xD]ᵀ,   w ← [w0 w1 w2 . . . wD]ᵀ

which leads to

  RSS(w) = Σₙ (yn − wᵀxn)(yn − xnᵀw)

         = Σₙ { wᵀ xn xnᵀ w − 2 yn xnᵀ w } + const.

         = { wᵀ ( Σₙ xn xnᵀ ) w − 2 ( Σₙ yn xnᵀ ) w } + const.

35


RSS(w) in new notations

From previous slide:

  RSS(w) = { wᵀ ( Σₙ xn xnᵀ ) w − 2 ( Σₙ yn xnᵀ ) w } + const.

Design matrix and target vector:

  X = [ x1ᵀ
        x2ᵀ
        ...
        xNᵀ ] ∈ R^{N×(D+1)},   y = [ y1, y2, . . . , yN ]ᵀ ∈ R^N

Compact expression:

  RSS(w) = ‖Xw − y‖₂² = { wᵀXᵀXw − 2 (Xᵀy)ᵀ w } + const

36



Example: RSS(w) in compact form

sqft (1000's)  bedrooms  bathrooms  sale price (100k)
1              2         1          2
2              2         2          3.5
1.5            3         2          3
2.5            4         2.5        4.5

Design matrix and target vector:

  X = [ 1  1    2  1
        1  2    2  2
        1  1.5  3  2
        1  2.5  4  2.5 ] ,   y = [ 2, 3.5, 3, 4.5 ]ᵀ

Compact expression:

  RSS(w) = ‖Xw − y‖₂² = { wᵀXᵀXw − 2 (Xᵀy)ᵀ w } + const

38

Solution in matrix form

Compact expression:

  RSS(w) = ‖Xw − y‖₂² = { wᵀXᵀXw − 2 (Xᵀy)ᵀ w } + const

Gradients of Linear and Quadratic Functions

• ∇x (bᵀx) = b

• ∇x (xᵀAx) = 2Ax (for symmetric A)

Normal equation:

  ∇w RSS(w) = 2XᵀXw − 2Xᵀy = 0

This leads to the least-mean-squares (LMS) solution

  wLMS = (XᵀX)⁻¹ Xᵀy

39


Example: RSS(w) in compact form

sqft (1000's)  bedrooms  bathrooms  sale price (100k)
1              2         1          2
2              2         2          3.5
1.5            3         2          3
2.5            4         2.5        4.5

Write the least-mean-squares (LMS) solution

  wLMS = (XᵀX)⁻¹ Xᵀy

Can use solvers in Matlab, Python, etc., to compute this for any given X and y.

40

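Following the slide's suggestion to use a solver, a minimal NumPy sketch for this table (np.linalg.lstsq solves the same least-squares problem without forming an explicit inverse; the variable names are illustrative):

```python
import numpy as np

# Augmented design matrix [1, sqft, bedrooms, bathrooms] and sale prices (100k)
X = np.array([[1.0, 1.0, 2.0, 1.0],
              [1.0, 2.0, 2.0, 2.0],
              [1.0, 1.5, 3.0, 2.0],
              [1.0, 2.5, 4.0, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

# Same minimizer as (X^T X)^{-1} X^T y, computed more stably
w_lms, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_lms)   # [w0, w1, w2, w3]
```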

Exercise: RSS(w) in compact form

Using the general least-mean-squares (LMS) solution

  wLMS = (XᵀX)⁻¹ Xᵀy

recover the univariate solution that we computed earlier:

  w1 = Σ (xn − x̄)(yn − ȳ) / Σ (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σₙ xn and ȳ = (1/N) Σₙ yn.

41

Exercise: RSS(w) in compact form

For the 1-D case, each row of X is [1 xn], so the least-mean-squares solution is

  wLMS = (XᵀX)⁻¹ Xᵀy = ( [ N     N x̄    ] )⁻¹ [ Σₙ yn    ]
                         [ N x̄   Σₙ xn² ]     [ Σₙ xn yn ]

  [ w0 ]                       [ ȳ Σ (xn − x̄)² − x̄ Σ (xn − x̄)(yn − ȳ) ]
  [ w1 ] = (1 / Σ (xn − x̄)²) · [ Σ (xn − x̄)(yn − ȳ)                    ]

where x̄ = (1/N) Σₙ xn and ȳ = (1/N) Σₙ yn.

42


Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

43

Why is minimizing RSS sensible?

[Figure: lines c1ᵀw = y1, . . . , c4ᵀw = y4, with the residual vector r = [ c1ᵀw − y1, . . . , c4ᵀw − y4 ]ᵀ = Aw − y]

• Want to find w that minimizes the difference between Xw and y

• But since this is a vector, we need an operator that can map the residual vector r(w) = y − Xw to a scalar

• We take the sum of the squares of the elements of r(w)

44

Why is minimizing RSS sensible?

Probabilistic interpretation

• Noisy observation model:

  Y = w0 + w1 X + η,   where η ∼ N(0, σ²) is a Gaussian random variable

• Conditional likelihood of one training sample:

  p(yn | xn) = N(w0 + w1 xn, σ²) = (1 / (√(2π) σ)) exp( −[yn − (w0 + w1 xn)]² / (2σ²) )

45

Probabilistic interpretation (cont'd)

Log-likelihood of the training data D (assuming i.i.d.):

  log P(D) = log ∏ₙ p(yn | xn) = Σₙ log p(yn | xn)

           = Σₙ { −[yn − (w0 + w1 xn)]² / (2σ²) − log(√(2π) σ) }

           = −(1 / (2σ²)) Σₙ [yn − (w0 + w1 xn)]² − (N/2) log σ² − N log √(2π)

           = −(1/2) { (1/σ²) Σₙ [yn − (w0 + w1 xn)]² + N log σ² } + const

What is the relationship between minimizing RSS and maximizing the log-likelihood?

46

Maximum likelihood estimation

Estimating σ, w0, and w1 can be done in two steps:

• Maximize over w0 and w1:

  max log P(D) ⇔ min Σₙ [yn − (w0 + w1 xn)]²   ← This is RSS(w)!

• Maximize over s = σ²:

  ∂ log P(D) / ∂s = −(1/2) { −(1/s²) Σₙ [yn − (w0 + w1 xn)]² + N (1/s) } = 0

  ⇒ σ*² = s* = (1/N) Σₙ [yn − (w0 + w1 xn)]²

47
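A small sketch of the two-step MLE on synthetic 1-D data (the data-generating numbers are illustrative, not from the slides): fit w0, w1 by least squares, then estimate the noise variance from the residuals:

```python
import numpy as np

# Synthetic data: y = 2 + 3x + Gaussian noise with sigma = 1.5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.5, size=200)

# Step 1: minimize RSS (closed-form univariate solution from earlier slides)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# Step 2: the MLE of sigma^2 is the average squared residual
residuals = y - (w0 + w1 * x)
sigma2_hat = np.mean(residuals ** 2)
print(w0, w1, sigma2_hat)   # roughly 2, 3, and 1.5**2
```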

How does this probabilistic interpretation help us?

• It gives a solid footing to our intuition: minimizing RSS(w) is a

sensible thing based on reasonable modeling assumptions.

• Estimating σ∗ tells us how much noise there is in our predictions.

For example, it allows us to place confidence intervals around our

predictions.

48

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

49

Computational complexity of the Least Squares Solution

Bottleneck of computing the solution?

  w = (XᵀX)⁻¹ Xᵀy

• Matrix multiplication to form XᵀX ∈ R^{(D+1)×(D+1)}
• Inverting the matrix XᵀX

How many operations do we need?

• O(ND²) for the matrix multiplication
• O(D³) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent theoretical advances) for matrix inversion
• Impractical for very large D or N

50

Alternative method: Batch Gradient Descent

(Batch) Gradient descent

• Initialize w to w(0) (e.g., randomly); set t = 0; choose η > 0

• Loop until convergence

  1. Compute the gradient: ∇RSS(w) = Xᵀ(Xw(t) − y)

  2. Update the parameters: w(t+1) = w(t) − η ∇RSS(w)

  3. t ← t + 1

What is the complexity of each iteration? O(ND)

51
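A minimal batch gradient-descent sketch matching the pseudocode above (the step size and stopping rule are illustrative choices, not from the slides):

```python
import numpy as np

def batch_gd(X, y, eta=0.01, max_iters=10_000, tol=1e-8):
    """Minimize RSS(w) = ||Xw - y||^2 with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        grad = X.T @ (X @ w - y)               # gradient of RSS at the current w
        w_new = w - eta * grad
        if np.linalg.norm(w_new - w) < tol:    # simple convergence check
            return w_new
        w = w_new
    return w
```

In practice η has to be small relative to the scale of XᵀX for the iterates to converge, which is one motivation for the feature scaling covered later in the course.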

Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion.

This is because RSS(w) is a convex function of its parameters w.

Hessian of RSS:

  RSS(w) = wᵀXᵀXw − 2(Xᵀy)ᵀw + const   ⇒   ∂²RSS(w) / (∂w ∂wᵀ) = 2XᵀX

XᵀX is positive semidefinite, because for any v

  vᵀXᵀXv = ‖Xv‖₂² ≥ 0

52


Stochastic gradient descent (SGD)

Widrow-Hoff rule: update parameters using one example at a time

• Initialize w to some w(0); set t = 0; choose η > 0

• Loop until convergence

  1. Randomly choose a training sample xt

  2. Compute its contribution to the gradient: gt = (xtᵀw(t) − yt) xt

  3. Update the parameters: w(t+1) = w(t) − η gt

  4. t ← t + 1

How does the complexity per iteration compare with gradient descent?

• O(ND) for gradient descent versus O(D) for SGD

54
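A corresponding SGD sketch following the Widrow-Hoff update above (again, η and the iteration budget are illustrative):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_iters=100_000, seed=0):
    """Widrow-Hoff / LMS updates: one randomly chosen sample per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        t = rng.integers(len(y))          # pick one training sample at random
        g = (X[t] @ w - y[t]) * X[t]      # its contribution to the gradient
        w = w - eta * g                   # O(D) update
    return w
```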

SGD versus Batch GD

• SGD reduces per-iteration complexity from O(ND) to O(D)

• But it is noisier and can take longer to converge

55

How to Choose Learning Rate η in practice?

• Try 0.0001, 0.001, 0.01, 0.1 etc. on a validation dataset (more on

this later) and choose the one that gives fastest, stable convergence

• Reduce η by a constant factor (e.g., 10) when learning saturates so that we can get closer to the true minimum.

• More advanced learning rate schedules such as AdaGrad, Adam,

AdaDelta are used in practice.

56

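One simple way to implement the "reduce η by a constant factor" advice is a step-decay schedule; a sketch with illustrative numbers (not from the slides):

```python
def step_decay(eta0=0.1, factor=10.0, every=1000):
    """Divide the learning rate by `factor` every `every` iterations."""
    def eta(t):
        return eta0 / (factor ** (t // every))
    return eta

schedule = step_decay()
print(schedule(0), schedule(1000), schedule(2000))   # 0.1, 0.01, 0.001
```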

Mini-Summary

• Linear regression is a linear combination of features:

  f : x → y, with f(x) = w0 + Σ_d wd xd = w0 + wᵀx

• If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters

• Probabilistic interpretation: maximum likelihood, if we assume the residuals are Gaussian distributed

• Gradient descent and mini-batch SGD can overcome computational issues

57