
18-661 Introduction to Machine Learning

Linear Regression – I

Spring 2020

ECE – Carnegie Mellon University

Outline

1. Recap of MLE/MAP

2. Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

1

Recap of MLE/MAP

Dogecoin

• Scenario: You find a coin on the ground.

• You ask yourself: Is this a fair or biased coin? What is the probability that I will flip heads?

2

• You flip the coin 10 times . . .

• It comes up as 'H' 8 times and 'T' 2 times

• Can we learn from this data?

3

Machine Learning Pipeline

[Pipeline figure: data → feature extraction → model & parameters → optimization → evaluation → intelligence]

Two approaches that we discussed:

• Maximum Likelihood Estimation (MLE)

• Maximum A Posteriori Estimation (MAP)

4

Maximum Likelihood Estimation (MLE)

• Data: Observed set D of nH heads and nT tails

• Model: Each flip follows a Bernoulli distribution

  P(H) = θ,  P(T) = 1 − θ,  θ ∈ [0, 1]

  Thus, the likelihood of observing sequence D is

  P(D | θ) = θ^nH (1 − θ)^nT

• Question: Given this model and the data we've observed, can we calculate an estimate of θ?

• MLE: Choose θ that maximizes the likelihood of the observed data

  θMLE = arg maxθ P(D | θ) = arg maxθ log P(D | θ) = nH / (nH + nT)

5

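As a quick check on this closed form, here is a minimal NumPy sketch (not part of the slides) that evaluates the Bernoulli log-likelihood on a grid for the 8-heads, 2-tails data and compares the numerical maximizer with nH / (nH + nT):

```python
import numpy as np

# Coin data from the slides: 8 heads, 2 tails
n_H, n_T = 8, 2

# Bernoulli log-likelihood: log P(D | theta) = n_H log(theta) + n_T log(1 - theta)
thetas = np.linspace(0.001, 0.999, 999)
log_lik = n_H * np.log(thetas) + n_T * np.log(1 - thetas)

theta_grid = thetas[np.argmax(log_lik)]   # numerical maximizer on the grid
theta_mle = n_H / (n_H + n_T)             # closed-form MLE

print(theta_grid, theta_mle)              # both approximately 0.8
```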

MAP for Dogecoin

  θMAP = arg maxθ P(θ | D) = arg maxθ P(D | θ) P(θ)

• Recall that P(D | θ) = θ^nH (1 − θ)^nT

• How should we set the prior, P(θ)?

• Common choice for a binomial likelihood is to use the Beta distribution, θ ∼ Beta(α, β):

  P(θ) = (1 / B(α, β)) θ^(α−1) (1 − θ)^(β−1)

• Interpretation: α = number of expected heads, β = number of expected tails. Larger value of α + β denotes more confidence (and smaller variance).

6


Putting it all together

  θMLE = nH / (nH + nT)

  θMAP = (α + nH − 1) / (α + β + nH + nT − 2)

• Suppose θ* := 0.5 and we observe: D = {H, H, T, T, T, T}

• Scenario 1: We assume θ ∼ Beta(4, 4). Which is more accurate – θMLE or θMAP?

  • θMAP = 5/12, θMLE = 1/3

• Scenario 2: We assume θ ∼ Beta(1, 7). Which is more accurate – θMLE or θMAP?

  • θMAP = 1/6, θMLE = 1/3

7

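A small sketch (not from the slides; it just plugs into the two closed-form estimators above) reproducing both scenarios for D = {H, H, T, T, T, T}:

```python
# MAP vs. MLE for the coin example: D = {H, H, T, T, T, T}, true theta* = 0.5
n_H, n_T = 2, 4

def theta_mle(n_H, n_T):
    return n_H / (n_H + n_T)

def theta_map(n_H, n_T, alpha, beta):
    # Mode of the Beta(alpha + n_H, beta + n_T) posterior
    return (alpha + n_H - 1) / (alpha + beta + n_H + n_T - 2)

print(theta_mle(n_H, n_T))           # 1/3
print(theta_map(n_H, n_T, 4, 4))     # 5/12: a reasonable prior pulls us toward 0.5
print(theta_map(n_H, n_T, 1, 7))     # 1/6: a badly mis-specified prior hurts
```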

Linear Regression

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

8

Task 1: Regression

How much should you sell your house for?

[Figure: scatter of house size vs. price ($) with an unknown point to predict]

input: houses & features  →  learn: x → y relationship  →  predict: y (continuous)

Course Covers: Linear/Ridge Regression, Loss Function, SGD, Feature Scaling, Regularization, Cross Validation

9

Supervised Learning

Supervised learning

In a supervised learning problem, you have access to input variables (X )

and outputs (Y ), and the goal is to predict an output given an input

• Examples:

• Housing prices (Regression): predict the price of a house based on

features (size, location, etc)

• Cat vs. Dog (Classification): predict whether a picture is of a cat

or a dog

10


Regression

Predicting a continuous outcome variable:

• Predicting a company’s future stock price using its profit and other

financial info

• Predicting annual rainfall based on local flora and fauna

• Predicting distance from a traffic light using LIDAR measurements

Magnitude of the error matters:

• We can measure ’closeness’ of prediction and labels, leading to

different ways to evaluate prediction errors.

• Predicting stock price: better to be off by $1 than by $20

• Predicting distance from a traffic light: better to be off by 1 m than by 10 m

• We should choose learning models and algorithms accordingly.

11


Ex: predicting the sale price of a house

Retrieve historical sales records

(This will be our training data)

12

Features used to predict

13

Correlation between square footage and sale price

14

Roughly linear relationship

Sale price ≈ price per sqft × square footage + fixed expense

15

Data Can be Compactly Represented by Matrices

[Figure: house size vs. price ($) scatter with a fitted orange line]

• Learn parameters (w0, w1) of the orange line y = w1 x + w0

  House 1 (1000 sq. ft.): 1000 w1 + w0 = 200,000
  House 2 (2000 sq. ft.): 2000 w1 + w0 = 350,000

• Can represent compactly in matrix notation:

  [ 1000  1 ] [ w1 ]   [ 200,000 ]
  [ 2000  1 ] [ w0 ] = [ 350,000 ]

16


Some Concepts That You Should Know

• Invertibility of Matrices and Computing Inverses

• Vector Norms – L2, Frobenius etc., Inner Products

• Eigenvalues and Eigenvectors

• Singular Value Decomposition

• Covariance Matrices and Positive Semi-definiteness

Excellent Resources:

• Essence of Linear Algebra YouTube Series

• Prof. Gilbert Strang’s course at MIT

17

Matrix Inverse

• Let us solve the house-price prediction problem

  [ 1000  1 ] [ w1 ]   [ 200,000 ]
  [ 2000  1 ] [ w0 ] = [ 350,000 ]                                  (1)

  [ w1 ]   [ 1000  1 ]⁻¹ [ 200,000 ]
  [ w0 ] = [ 2000  1 ]   [ 350,000 ]                                (2)

         = (1 / −1000) [    1      −1   ] [ 200,000 ]
                       [ −2000   1000   ] [ 350,000 ]               (3)

         = (1 / −1000) [ −150,000 ]
                       [ −5 × 10⁷ ]                                 (4)

  [ w1 ]   [   150  ]
  [ w0 ] = [ 50,000 ]                                               (5)

18
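Rather than inverting the 2×2 matrix by hand, the same system can be solved numerically; a minimal NumPy sketch (not from the slides):

```python
import numpy as np

# House-price system from the slides: [sqft 1] [w1 w0]^T = price
A = np.array([[1000.0, 1.0],
              [2000.0, 1.0]])
b = np.array([200_000.0, 350_000.0])

w1, w0 = np.linalg.solve(A, b)   # solve() is preferred over forming an explicit inverse
print(w1, w0)                    # 150.0, 50000.0
```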

You could have data from many houses

• Sale price =

price per sqft× square footage + fixed expense + unexplainable stuff

• Want to learn the price per sqft and fixed expense

• Training data: past sales record.

sqft sale price

2000 800K

2100 907K

1100 312K

5500 2,600K

· · · · · ·

Problem: there isn't a w = [w1, w0]ᵀ that satisfies all equations

19

Want to predict the best price per sqft and fixed expense

• Sale price =

price per sqft× square footage + fixed expense + unexplainable stuff

• Want to learn the price per sqft and fixed expense

• Training data: past sales record.

sqft sale price prediction

2000 810K 720K

2100 907K 800K

1100 312K 350K

5500 2,600K 2,600K

· · · · · · · · ·

20

Reduce prediction error

How to measure errors?

• absolute difference: |prediction − sale price|
• squared difference: (prediction − sale price)² [differentiable!]

sqft  sale price  prediction  abs error  squared error
2000  810K        720K        90K        8100
2100  907K        800K        107K       11449
1100  312K        350K        38K        1444
5500  2,600K      2,600K      0          0
· · ·

21


Geometric Illustration: Each house corresponds to one line

[Figure: each equation cnᵀw = yn (n = 1, . . . , 4) is a line in the (w0, w1) plane, and the residual vector is

  r = [ c1ᵀw − y1, c2ᵀw − y2, . . . , c4ᵀw − y4 ]ᵀ = Aw − y ]

• Want to find w that minimizes the difference between Xw and y

• But since this is a vector, we need an operator that can map the residual vector r(w) = y − Xw to a scalar

22

Norms and Loss Functions

• A vector norm is any function f : Rⁿ → R with

  • f(x) ≥ 0 and f(x) = 0 ⟺ x = 0
  • f(ax) = |a| f(x) for a ∈ R
  • triangle inequality: f(x + y) ≤ f(x) + f(y)

• e.g., ℓ2 norm: ‖x‖₂ = √(xᵀx) = √( Σᵢ xᵢ² )

• e.g., ℓ1 norm: ‖x‖₁ = Σᵢ |xᵢ|

• e.g., ℓ∞ norm: ‖x‖∞ = maxᵢ |xᵢ|

[Figure: from inside to outside, the ℓ1, ℓ2, ℓ∞ norm balls]

23

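A quick NumPy check of the three example norms (a sketch, not from the slides; it relies on the ord argument of np.linalg.norm):

```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 2))        # l2 norm: sqrt(9 + 16) = 5.0
print(np.linalg.norm(x, 1))        # l1 norm: |3| + |-4| = 7.0
print(np.linalg.norm(x, np.inf))   # l-infinity norm: max(|3|, |4|) = 4.0
```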

Minimize squared errors

Our model:

Sale price =

price per sqft× square footage + fixed expense + unexplainable stuff

Training data:

sqft  sale price  prediction  abs error  squared error
2000  810K        720K        90K        8100
2100  907K        800K        107K       11449
1100  312K        350K        38K        1444
5500  2,600K      2,600K      0          0
· · ·
Total: 8100 + 11449 + 1444 + 0 + · · ·

Aim:

Adjust price per sqft and fixed expense such that the sum of the squared

error is minimized — i.e., the unexplainable stuff is minimized.

24

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

25

Linear regression

Setup:

• Input: x ∈ R^D (covariates, predictors, features, etc.)

• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)

• Model: f : x → y, with f(x) = w0 + Σ_d wd xd = w0 + wᵀx

  • w = [w1 w2 · · · wD]ᵀ: weights, parameters, or parameter vector
  • w0 is called the bias
  • Sometimes, we also call w = [w0 w1 w2 · · · wD]ᵀ the parameters

• Training data: D = {(xn, yn), n = 1, 2, . . . , N}

Minimize the Residual sum of squares:

  RSS(w) = Σₙ [yn − f(xn)]² = Σₙ [yn − (w0 + Σ_d wd xnd)]²

26
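A minimal sketch of this model and the RSS objective (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def predict(X, w0, w):
    """f(x) = w0 + w^T x, applied to each row of X (shape N x D)."""
    return w0 + X @ w

def rss(X, y, w0, w):
    """Residual sum of squares over the training data."""
    residuals = y - predict(X, w0, w)
    return np.sum(residuals ** 2)
```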

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

27

A simple case: x is just one-dimensional (D=1)

Residual sum of squares:

  RSS(w) = Σₙ [yn − f(xn)]² = Σₙ [yn − (w0 + w1 xn)]²

What kind of function is this? CONVEX (it has a unique global minimum)

28


A simple case: x is just one-dimensional (D=1)

Residual sum of squares:

  RSS(w) = Σₙ [yn − f(xn)]² = Σₙ [yn − (w0 + w1 xn)]²

Stationary points:

Take the derivative with respect to each parameter and set it to zero:

  ∂RSS(w)/∂w0 = 0 ⇒ −2 Σₙ [yn − (w0 + w1 xn)] = 0,

  ∂RSS(w)/∂w1 = 0 ⇒ −2 Σₙ [yn − (w0 + w1 xn)] xn = 0.

29


A simple case: x is just one-dimensional (D=1)

  ∂RSS(w)/∂w0 = 0 ⇒ −2 Σₙ [yn − (w0 + w1 xn)] = 0

  ∂RSS(w)/∂w1 = 0 ⇒ −2 Σₙ [yn − (w0 + w1 xn)] xn = 0

Simplify these expressions to get the "Normal Equations":

  Σ yn = N w0 + w1 Σ xn

  Σ xn yn = w0 Σ xn + w1 Σ xn²

Solving the system, we obtain the least squares coefficient estimates:

  w1 = Σ (xn − x̄)(yn − ȳ) / Σ (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σₙ xn and ȳ = (1/N) Σₙ yn.

30


Example

sqft (1000's)  sale price (100k)
1              2
2              3.5
1.5            3
2.5            4.5

Residual sum of squares:

  RSS(w) = Σₙ [yn − f(xn)]² = Σₙ [yn − (w0 + w1 xn)]²

The w1 and w0 that minimize this are given by:

  w1 = Σ (xn − x̄)(yn − ȳ) / Σ (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σₙ xn and ȳ = (1/N) Σₙ yn.

31

Example

Plugging in the four data points: x̄ = 1.75, ȳ = 3.25, Σ (xn − x̄)(yn − ȳ) = 2, and Σ (xn − x̄)² = 1.25, so

  w1 ≈ 1.6 and w0 ≈ 0.45,

i.e., the fitted line is price ≈ 1.6 × sqft + 0.45 (in the units of the table).

32
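A short sketch (not from the slides) that implements the closed-form w1 and w0 above and checks them on the four-house example:

```python
import numpy as np

# Example data: sqft (1000's) and sale price (100k)
x = np.array([1.0, 2.0, 1.5, 2.5])
y = np.array([2.0, 3.5, 3.0, 4.5])

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

print(w1, w0)   # 1.6, 0.45
```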

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

33

Least Mean Squares when x is D-dimensional

sqft (1000’s) bedrooms bathrooms sale price (100k)

1 2 1 2

2 2 2 3.5

1.5 3 2 3

2.5 4 2.5 4.5

RSS(w) in matrix form:

  RSS(w) = Σₙ [yn − (w0 + Σ_d wd xnd)]² = Σₙ [yn − wᵀxn]²,

where we have redefined some variables (by augmenting)

  x ← [1 x1 x2 . . . xD]ᵀ,   w ← [w0 w1 w2 . . . wD]ᵀ

34

Least Mean Squares when x is D-dimensional

RSS(w) in matrix form:

  RSS(w) = Σₙ [yn − (w0 + Σ_d wd xnd)]² = Σₙ [yn − wᵀxn]²,

where we have redefined some variables (by augmenting)

  x ← [1 x1 x2 . . . xD]ᵀ,   w ← [w0 w1 w2 . . . wD]ᵀ

which leads to

  RSS(w) = Σₙ (yn − wᵀxn)(yn − xnᵀw)

         = Σₙ { wᵀ xn xnᵀ w − 2 yn xnᵀ w } + const.

         = { wᵀ ( Σₙ xn xnᵀ ) w − 2 ( Σₙ yn xnᵀ ) w } + const.

35


RSS(w) in new notations

From previous slide:

  RSS(w) = { wᵀ ( Σₙ xn xnᵀ ) w − 2 ( Σₙ yn xnᵀ ) w } + const.

Design matrix and target vector:

  X = [ x1ᵀ
        x2ᵀ
        ...
        xNᵀ ] ∈ R^{N×(D+1)},   y = [ y1, y2, . . . , yN ]ᵀ ∈ R^N

Compact expression:

  RSS(w) = ‖Xw − y‖₂² = { wᵀXᵀXw − 2 (Xᵀy)ᵀ w } + const

36



Example: RSS(w) in compact form

sqft (1000's)  bedrooms  bathrooms  sale price (100k)
1              2         1          2
2              2         2          3.5
1.5            3         2          3
2.5            4         2.5        4.5

Design matrix and target vector:

  X = [ 1  1    2  1
        1  2    2  2
        1  1.5  3  2
        1  2.5  4  2.5 ] ,   y = [ 2, 3.5, 3, 4.5 ]ᵀ

Compact expression:

  RSS(w) = ‖Xw − y‖₂² = { wᵀXᵀXw − 2 (Xᵀy)ᵀ w } + const

38

Solution in matrix form

Compact expression:

  RSS(w) = ‖Xw − y‖₂² = { wᵀXᵀXw − 2 (Xᵀy)ᵀ w } + const

Gradients of Linear and Quadratic Functions

• ∇x (bᵀx) = b

• ∇x (xᵀAx) = 2Ax (for symmetric A)

Normal equation:

  ∇w RSS(w) = 2XᵀXw − 2Xᵀy = 0

This leads to the least-mean-squares (LMS) solution

  wLMS = (XᵀX)⁻¹ Xᵀy

39


Example: RSS(w) in compact form

sqft (1000's)  bedrooms  bathrooms  sale price (100k)
1              2         1          2
2              2         2          3.5
1.5            3         2          3
2.5            4         2.5        4.5

Write the least-mean-squares (LMS) solution

  wLMS = (XᵀX)⁻¹ Xᵀy

Can use solvers in Matlab, Python, etc., to compute this for any given X and y.

40

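Following the slide's suggestion to use a solver, a minimal NumPy sketch for this table (np.linalg.lstsq solves the same least-squares problem without forming an explicit inverse; the variable names are illustrative):

```python
import numpy as np

# Augmented design matrix [1, sqft, bedrooms, bathrooms] and sale prices (100k)
X = np.array([[1.0, 1.0, 2.0, 1.0],
              [1.0, 2.0, 2.0, 2.0],
              [1.0, 1.5, 3.0, 2.0],
              [1.0, 2.5, 4.0, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

# Same minimizer as (X^T X)^{-1} X^T y, computed more stably
w_lms, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_lms)   # [w0, w1, w2, w3]
```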

Exercise: RSS(w) in compact form

Using the general least-mean-squares (LMS) solution

  wLMS = (XᵀX)⁻¹ Xᵀy

recover the univariate solution that we computed earlier:

  w1 = Σ (xn − x̄)(yn − ȳ) / Σ (xn − x̄)²   and   w0 = ȳ − w1 x̄

where x̄ = (1/N) Σₙ xn and ȳ = (1/N) Σₙ yn.

41

Exercise: RSS(w) in compact form

For the 1-D case, each row of X is [1 xn], so the least-mean-squares solution is

  wLMS = (XᵀX)⁻¹ Xᵀy = ( [ N     N x̄    ] )⁻¹ [ Σₙ yn    ]
                         [ N x̄   Σₙ xn² ]     [ Σₙ xn yn ]

  [ w0 ]                       [ ȳ Σ (xn − x̄)² − x̄ Σ (xn − x̄)(yn − ȳ) ]
  [ w1 ] = (1 / Σ (xn − x̄)²) · [ Σ (xn − x̄)(yn − ȳ)                    ]

where x̄ = (1/N) Σₙ xn and ȳ = (1/N) Σₙ yn.

42


Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

43

Why is minimizing RSS sensible?

[Figure: lines c1ᵀw = y1, . . . , c4ᵀw = y4, with the residual vector r = [ c1ᵀw − y1, . . . , c4ᵀw − y4 ]ᵀ = Aw − y]

• Want to find w that minimizes the difference between Xw and y

• But since this is a vector, we need an operator that can map the residual vector r(w) = y − Xw to a scalar

• We take the sum of the squares of the elements of r(w)

44

Why is minimizing RSS sensible?

Probabilistic interpretation

• Noisy observation model:

  Y = w0 + w1 X + η,   where η ∼ N(0, σ²) is a Gaussian random variable

• Conditional likelihood of one training sample:

  p(yn | xn) = N(w0 + w1 xn, σ²) = (1 / (√(2π) σ)) exp( −[yn − (w0 + w1 xn)]² / (2σ²) )

45

Probabilistic interpretation (cont'd)

Log-likelihood of the training data D (assuming i.i.d.):

  log P(D) = log ∏ₙ p(yn | xn) = Σₙ log p(yn | xn)

           = Σₙ { −[yn − (w0 + w1 xn)]² / (2σ²) − log(√(2π) σ) }

           = −(1 / (2σ²)) Σₙ [yn − (w0 + w1 xn)]² − (N/2) log σ² − N log √(2π)

           = −(1/2) { (1/σ²) Σₙ [yn − (w0 + w1 xn)]² + N log σ² } + const

What is the relationship between minimizing RSS and maximizing the log-likelihood?

46

Maximum likelihood estimation

Estimating σ, w0, and w1 can be done in two steps:

• Maximize over w0 and w1:

  max log P(D) ⇔ min Σₙ [yn − (w0 + w1 xn)]²   ← This is RSS(w)!

• Maximize over s = σ²:

  ∂ log P(D) / ∂s = −(1/2) { −(1/s²) Σₙ [yn − (w0 + w1 xn)]² + N (1/s) } = 0

  ⇒ σ*² = s* = (1/N) Σₙ [yn − (w0 + w1 xn)]²

47
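A small sketch of the two-step MLE on synthetic 1-D data (the data-generating numbers are illustrative, not from the slides): fit w0, w1 by least squares, then estimate the noise variance from the residuals:

```python
import numpy as np

# Synthetic data: y = 2 + 3x + Gaussian noise with sigma = 1.5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.5, size=200)

# Step 1: minimize RSS (closed-form univariate solution from earlier slides)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# Step 2: the MLE of sigma^2 is the average squared residual
residuals = y - (w0 + w1 * x)
sigma2_hat = np.mean(residuals ** 2)
print(w0, w1, sigma2_hat)   # roughly 2, 3, and 1.5**2
```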

How does this probabilistic interpretation help us?

• It gives a solid footing to our intuition: minimizing RSS(w) is a

sensible thing based on reasonable modeling assumptions.

• Estimating σ∗ tells us how much noise there is in our predictions.

For example, it allows us to place confidence intervals around our

predictions.

48

Recap of MLE/MAP

Linear Regression

Motivation

Algorithm

Univariate solution

Multivariate Solution

Probabilistic interpretation

Computational and numerical optimization

49

Computational complexity of the Least Squares Solution

Bottleneck of computing the solution?

  w = (XᵀX)⁻¹ Xᵀy

• Matrix multiplication to form XᵀX ∈ R^{(D+1)×(D+1)}
• Inverting the matrix XᵀX

How many operations do we need?

• O(ND²) for the matrix multiplication
• O(D³) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent theoretical advances) for matrix inversion
• Impractical for very large D or N

50

Alternative method: Batch Gradient Descent

(Batch) Gradient descent

• Initialize w to w(0) (e.g., randomly); set t = 0; choose η > 0

• Loop until convergence

  1. Compute the gradient: ∇RSS(w) = Xᵀ(Xw(t) − y)

  2. Update the parameters: w(t+1) = w(t) − η ∇RSS(w)

  3. t ← t + 1

What is the complexity of each iteration? O(ND)

51
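A minimal batch gradient-descent sketch matching the pseudocode above (the step size and stopping rule are illustrative choices, not from the slides):

```python
import numpy as np

def batch_gd(X, y, eta=0.01, max_iters=10_000, tol=1e-8):
    """Minimize RSS(w) = ||Xw - y||^2 with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        grad = X.T @ (X @ w - y)               # gradient of RSS at the current w
        w_new = w - eta * grad
        if np.linalg.norm(w_new - w) < tol:    # simple convergence check
            return w_new
        w = w_new
    return w
```

In practice η has to be small relative to the scale of XᵀX for the iterates to converge, which is one motivation for the feature scaling covered later in the course.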

Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion.

This is because RSS(w) is a convex function of its parameters w.

Hessian of RSS:

  RSS(w) = wᵀXᵀXw − 2(Xᵀy)ᵀw + const   ⇒   ∂²RSS(w) / (∂w ∂wᵀ) = 2XᵀX

XᵀX is positive semidefinite, because for any v

  vᵀXᵀXv = ‖Xv‖₂² ≥ 0

52


Stochastic gradient descent (SGD)

Widrow-Hoff rule: update parameters using one example at a time

• Initialize w to some w(0); set t = 0; choose η > 0

• Loop until convergence

  1. Randomly choose a training sample xt

  2. Compute its contribution to the gradient: gt = (xtᵀw(t) − yt) xt

  3. Update the parameters: w(t+1) = w(t) − η gt

  4. t ← t + 1

How does the complexity per iteration compare with gradient descent?

• O(ND) for gradient descent versus O(D) for SGD

54
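A corresponding SGD sketch following the Widrow-Hoff update above (again, η and the iteration budget are illustrative):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_iters=100_000, seed=0):
    """Widrow-Hoff / LMS updates: one randomly chosen sample per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        t = rng.integers(len(y))          # pick one training sample at random
        g = (X[t] @ w - y[t]) * X[t]      # its contribution to the gradient
        w = w - eta * g                   # O(D) update
    return w
```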

SGD versus Batch GD

• SGD reduces per-iteration complexity from O(ND) to O(D)

• But it is noisier and can take longer to converge

55

How to Choose Learning Rate η in practice?

• Try 0.0001, 0.001, 0.01, 0.1 etc. on a validation dataset (more on

this later) and choose the one that gives fastest, stable convergence

• Reduce η by a constant factor (e.g., 10) when learning saturates so that we can get closer to the true minimum.

• More advanced learning rate schedules such as AdaGrad, Adam,

AdaDelta are used in practice.

56

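One simple way to implement the "reduce η by a constant factor" advice is a step-decay schedule; a sketch with illustrative numbers (not from the slides):

```python
def step_decay(eta0=0.1, factor=10.0, every=1000):
    """Divide the learning rate by `factor` every `every` iterations."""
    def eta(t):
        return eta0 / (factor ** (t // every))
    return eta

schedule = step_decay()
print(schedule(0), schedule(1000), schedule(2000))   # 0.1, 0.01, 0.001
```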

Mini-Summary

• Linear regression is a linear combination of features:

  f : x → y, with f(x) = w0 + Σ_d wd xd = w0 + wᵀx

• If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters

• Probabilistic interpretation: maximum likelihood, if we assume the residuals are Gaussian distributed

• Gradient descent and mini-batch SGD can overcome computational issues

57