
Regularized Regression - University of Washington (CSE 446: Machine Learning)

Transcript

CSE 446: Machine Learning. Emily Fox, University of Washington. January 20, 2017.


Regularized Regression: Geometric intuition of solution
Plus: Cross validation


Coordinate descent for lasso (for normalized features)


Coordinate descent for least squares regression

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^{N} hj(xi) (yi – ŷi(ŵ-j))
    set: ŵj = ρj

Here ŷi(ŵ-j) is the prediction without feature j, so (yi – ŷi(ŵ-j)) is the residual without feature j.
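To make this step concrete, here is a minimal NumPy sketch (mine, not from the slides): H is the feature matrix whose columns are the features hj, w is the coefficient vector, and the features are assumed to be normalized.

import numpy as np

def compute_rho(j, H, y, w):
    # rho_j = sum_i h_j(x_i) * (y_i - yhat_i(w_-j)),
    # where yhat_i(w_-j) is the prediction made without feature j.
    y_hat_without_j = H @ w - H[:, j] * w[j]          # prediction without feature j
    return H[:, j] @ (y - y_hat_without_j)            # correlate feature j with that residual

def coordinate_descent_least_squares(H, y, n_sweeps=100):
    # Least squares via coordinate descent, assuming normalized features
    # (sum_i h_j(x_i)^2 = 1 for every j).
    w = np.zeros(H.shape[1])                          # initialize w-hat = 0
    for _ in range(n_sweeps):                         # "while not converged", simplified
        for j in range(H.shape[1]):                   # for j = 0, 1, ..., D
            w[j] = compute_rho(j, H, y, w)            # set w-hat_j = rho_j
    return w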


Coordinate descent for lasso

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^{N} hj(xi) (yi – ŷi(ŵ-j))
    set: ŵj = ρj + λ/2   if ρj < -λ/2
              0          if ρj in [-λ/2, λ/2]
              ρj – λ/2   if ρj > λ/2


Soft thresholding

[Figure: the soft-thresholding operator, ŵj plotted as a function of ρj]

ŵj = ρj + λ/2   if ρj < -λ/2
     0          if ρj in [-λ/2, λ/2]
     ρj – λ/2   if ρj > λ/2
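A one-line version of this soft-thresholding operator, as a sketch (the function name is mine):

def soft_threshold(rho, lam):
    # Shrink rho toward zero by lam/2, and snap to exactly 0 inside [-lam/2, lam/2].
    if rho < -lam / 2.0:
        return rho + lam / 2.0
    elif rho > lam / 2.0:
        return rho - lam / 2.0
    else:
        return 0.0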


How to assess convergence?

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^{N} hj(xi) (yi – ŷi(ŵ-j))
    set: ŵj = ρj + λ/2   if ρj < -λ/2
              0          if ρj in [-λ/2, λ/2]
              ρj – λ/2   if ρj > λ/2


When to stop?

Convergence criteria: for convex problems, the algorithm will start to take smaller and smaller steps as it nears the optimum. Measure the size of the steps taken in a full loop over all features, and stop when the maximum step < ε.
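Putting the update and this stopping rule together, a hedged sketch that reuses the compute_rho and soft_threshold helpers sketched above (the tolerance name eps is mine):

import numpy as np

def lasso_coordinate_descent(H, y, lam, eps=1e-6, max_sweeps=1000):
    # Lasso via coordinate descent for normalized features.
    # Stops when the largest coefficient change in a full sweep is below eps.
    w = np.zeros(H.shape[1])                          # initialize w-hat = 0
    for _ in range(max_sweeps):                       # "while not converged"
        max_step = 0.0
        for j in range(H.shape[1]):                   # for j = 0, 1, ..., D
            rho_j = compute_rho(j, H, y, w)
            w_new = soft_threshold(rho_j, lam)        # set w-hat_j
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
        if max_step < eps:                            # stop when max step < eps
            break
    return w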


Other lasso solvers


Classically: Least angle regression (LARS) [Efron et al. ‘04]

Then: Coordinate descent algorithm [Fu ‘98, Friedman, Hastie, & Tibshirani ’08]

Now:

•  Parallel CD (e.g., Shotgun, [Bradley et al. ‘11])

•  Other parallel learning approaches for linear models
   -  Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])
   -  Parallel independent solutions then averaging [Zhang et al. ‘12]

•  Alternating directions method of multipliers (ADMM) [Boyd et al. ’11]


Coordinate descent for lasso (for unnormalized features)


Coordinate descent for lasso with unnormalized features

Precompute: zj = Σ_{i=1}^{N} hj(xi)²

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^{N} hj(xi) (yi – ŷi(ŵ-j))
    set: ŵj = (ρj + λ/2)/zj   if ρj < -λ/2
              0               if ρj in [-λ/2, λ/2]
              (ρj – λ/2)/zj   if ρj > λ/2
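The only changes relative to the normalized case are the precomputed zj and the division by zj in the update; a sketch under the same assumptions and names as before:

import numpy as np

def lasso_cd_unnormalized(H, y, lam, eps=1e-6, max_sweeps=1000):
    # Lasso coordinate descent without normalizing the features first.
    z = (H ** 2).sum(axis=0)                          # precompute z_j = sum_i h_j(x_i)^2
    w = np.zeros(H.shape[1])
    for _ in range(max_sweeps):
        max_step = 0.0
        for j in range(H.shape[1]):
            rho_j = compute_rho(j, H, y, w)
            if rho_j < -lam / 2.0:                    # piecewise update, divided by z_j
                w_new = (rho_j + lam / 2.0) / z[j]
            elif rho_j > lam / 2.0:
                w_new = (rho_j - lam / 2.0) / z[j]
            else:
                w_new = 0.0
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
        if max_step < eps:
            break
    return w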


Geometric intuition for sparsity of lasso solution


Geometric intuition for ridge regression


Visualizing the ridge cost in 2D

[Figure: contour plot of the RSS cost term over (w0, w1)]

RSS(w) + λ||w||₂² = Σ_{i=1}^{N} (yi – w0 h0(xi) – w1 h1(xi))² + λ (w0² + w1²)


Visualizing the ridge cost in 2D

[Figure: contour plot of the L2 penalty term over (w0, w1)]

RSS(w) + λ||w||₂² = Σ_{i=1}^{N} (yi – w0 h0(xi) – w1 h1(xi))² + λ (w0² + w1²)


Visualizing the ridge cost in 2D

RSS(w) + λ||w||₂² = Σ_{i=1}^{N} (yi – w0 h0(xi) – w1 h1(xi))² + λ (w0² + w1²)


Visualizing the ridge solution

[Figure: contours of the RSS term and the L2 penalty over (w0, w1); the ridge solution lies where the level sets intersect]

RSS(w) + λ||w||₂² = Σ_{i=1}^{N} (yi – w0 h0(xi) – w1 h1(xi))² + λ (w0² + w1²)


Geometric intuition for lasso


Visualizing the lasso cost in 2D

[Figure: contour plot of the RSS cost term over (w0, w1)]

RSS(w) + λ||w||₁ = Σ_{i=1}^{N} (yi – w0 h0(xi) – w1 h1(xi))² + λ (|w0| + |w1|)


Visualizing the lasso cost in 2D

[Figure: contour plot of the L1 penalty term over (w0, w1)]

RSS(w) + λ||w||₁ = Σ_{i=1}^{N} (yi – w0 h0(xi) – w1 h1(xi))² + λ (|w0| + |w1|)


Visualizing the lasso cost in 2D

RSS(w) + λ||w||₁ = Σ_{i=1}^{N} (yi – w0 h0(xi) – w1 h1(xi))² + λ (|w0| + |w1|)


Visualizing the lasso solution

[Figure: contours of the RSS term and the L1 penalty over (w0, w1); the lasso solution lies where the level sets intersect]

RSS(w) + λ||w||₁ = Σ_{i=1}^{N} (yi – w0 h0(xi) – w1 h1(xi))² + λ (|w0| + |w1|)


Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using lasso regression?

Will consider a few settings of λ …


How to choose λ: Cross validation


If sufficient amount of data…


Split the data: Training set | Validation set | Test set
-  Training set: fit ŵλ
-  Validation set: test performance of ŵλ to select λ*
-  Test set: assess generalization error of ŵλ*
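As a sketch of this recipe (the 60/20/20 split fractions and the use of the lasso solver sketched earlier are my own choices, not from the slides):

import numpy as np

def choose_lambda_with_validation_set(H, y, lambdas, seed=0):
    # Split into training / validation / test, fit w-hat_lambda on the training set,
    # pick lambda* on the validation set, then report test error for w-hat_lambda*.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train, n_valid = int(0.6 * len(y)), int(0.2 * len(y))
    train, valid, test = np.split(idx, [n_train, n_train + n_valid])

    best_lam, best_err, best_w = None, np.inf, None
    for lam in lambdas:
        w = lasso_coordinate_descent(H[train], y[train], lam)     # fit w-hat_lambda
        err = np.mean((y[valid] - H[valid] @ w) ** 2)             # validation error
        if err < best_err:
            best_lam, best_err, best_w = lam, err, w
    test_err = np.mean((y[test] - H[test] @ best_w) ** 2)         # generalization estimate
    return best_lam, test_err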


Start with smallish dataset


All data


Still form test set and hold out


Rest of data | Test set


How do we use the other data?


Rest of data

use for both training and validation, but not so naively


Recall naïve approach

Is validation set enough to compare performance of ŵλ across λ values?

[Diagram: Training set with a small Valid. set held out]

No.


Choosing the validation set

Didn’t have to use the last data points tabulated to form validation set

Can use any data subset



Choosing the validation set


Which subset should I use?

ALL! average performance over all choices


K-fold cross validation

Preprocessing: Randomly assign data to K groups (use the same split of data for all other steps)

[Diagram: the rest of the data divided into K blocks, each of size N/K]


For k = 1, …, K:
  1.  Estimate ŵλ(k) on the training blocks
  2.  Compute error on the validation block: errork(λ)

[Diagram: block 1 held out as the validation set, giving ŵλ(1) and error1(λ)]

(The same two steps are repeated with each remaining block held out in turn as the validation set, giving ŵλ(2) and error2(λ), ŵλ(3) and error3(λ), ŵλ(4) and error4(λ), and so on.)

For k = 1, …, K:
  1.  Estimate ŵλ(k) on the training blocks
  2.  Compute error on the validation block: errork(λ)

Compute the average error: CV(λ) = (1/K) Σ_{k=1}^{K} errork(λ)

[Diagram: the last block (k = 5 in the illustration) held out as the validation set, giving ŵλ(5) and error5(λ)]


Repeat the procedure for each choice of λ.
Choose λ* to minimize CV(λ).
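A minimal sketch of this procedure, reusing the lasso solver sketched earlier (function and variable names are mine):

import numpy as np

def cv_error(H, y, lam, K=5, seed=0):
    # CV(lambda) = (1/K) * sum_k error_k(lambda)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)            # randomly assign data to K groups
    errors = []
    for k in range(K):
        valid = folds[k]                                          # validation block k
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        w_k = lasso_coordinate_descent(H[train], y[train], lam)   # estimate w-hat_lambda^(k)
        errors.append(np.mean((y[valid] - H[valid] @ w_k) ** 2))  # error_k(lambda)
    return np.mean(errors)

def choose_lambda_by_cv(H, y, lambdas, K=5):
    # Repeat for each choice of lambda and pick lambda* minimizing CV(lambda).
    return min(lambdas, key=lambda lam: cv_error(H, y, lam, K))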


What value of K?

Formally, the best approximation occurs for validation sets of size 1 (K = N), i.e., leave-one-out cross validation.
But this is computationally intensive:
-  requires computing N fits of the model per λ
Typically, K = 5 or 10 is used (5-fold or 10-fold CV).


Choosing λ via cross validation for lasso

Cross validation chooses the λ that provides the best predictive accuracy.

This tends to favor less sparse solutions, and thus a smaller λ, than the optimal choice for feature selection.

cf. “Machine Learning: A Probabilistic Perspective”, Murphy, 2012, for further discussion.


Practical concerns with lasso


Issues with standard lasso objective
1.  With a group of highly correlated features, lasso tends to select amongst them arbitrarily
    -  Often we would prefer to select them all together
2.  Often, empirically, ridge has better predictive performance than lasso, but lasso leads to a sparser solution

Elastic net aims to address these issues (a usage sketch follows below)
-  hybrid between lasso and ridge regression
-  uses both L1 and L2 penalties

See Zou & Hastie ‘05 for further discussion
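One off-the-shelf option (not covered in the slides) is scikit-learn's ElasticNet, which implements this L1 + L2 hybrid; a hedged usage sketch, noting that sklearn's alpha/l1_ratio parameterization differs from the λ used above:

import numpy as np
from sklearn.linear_model import ElasticNet

# Tiny synthetic example with two highly correlated features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)         # nearly identical to x1
X = np.column_stack([x1, x2, rng.normal(size=200)])
y = 3.0 * x1 + 3.0 * x2 + rng.normal(size=200)

# l1_ratio blends the penalties: 1.0 is pure lasso, 0.0 is pure ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_)                            # correlated features tend to receive similar weights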


Summary for feature selection and lasso regression


Impact of feature selection and lasso

Lasso has changed machine learning, statistics, & electrical engineering

But, for feature selection in general, be careful about interpreting selected features

-  selection only considers features included

-  sensitive to correlations between features

-  result depends on algorithm used

-  there are theoretical guarantees for lasso under certain conditions


What you can do now…
•  Describe “all subsets” and greedy variants for feature selection

•  Analyze computational costs of these algorithms

•  Formulate lasso objective

•  Describe what happens to estimated lasso coefficients as tuning parameter λ is varied

•  Interpret lasso coefficient path plot

•  Contrast ridge and lasso regression

•  Estimate lasso regression parameters using an iterative coordinate descent algorithm

•  Implement K-fold cross validation to select lasso tuning parameter λ


CSE 446: Machine Learning. Emily Fox, University of Washington. January 20, 2017.


Linear classifiers


Linear classifier: Intuition


Classifier


Input x: a sentence from a review, e.g., “Sushi was awesome, the food was awesome, but the service was awful.”
Classifier MODEL
Output y: predicted class, ŷ = +1 or ŷ = -1


Simple linear classifier

Input x: a sentence from a review
Score(x) = weighted sum of the features of the sentence (each feature has a coefficient in a Feature | Coefficient table)

If Score(x) > 0:
  ŷ = +1
Else:
  ŷ = -1


A simple example: Word counts

Feature                         Coefficient
good                            1.0
great                           1.2
awesome                         1.7
bad                             -1.0
terrible                        -2.1
awful                           -3.3
restaurant, the, we, where, …   0.0
…                               …

Input xi: Sushi was great, the food was awesome, but the service was terrible.

Called a linear classifier, because the score is a weighted sum of the features.
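A quick worked version of this example, using the coefficient table above (words not in the table, and the intercept, are assumed to contribute 0):

coefficients = {"good": 1.0, "great": 1.2, "awesome": 1.7,
                "bad": -1.0, "terrible": -2.1, "awful": -3.3}

sentence = "Sushi was great, the food was awesome, but the service was terrible."
words = sentence.lower().replace(",", "").replace(".", "").split()

# Score(x) = weighted sum of word-count features; unlisted words get coefficient 0.0
score = sum(coefficients.get(w, 0.0) for w in words)
y_hat = +1 if score > 0 else -1
print(score, y_hat)    # 1.2 + 1.7 - 2.1 = 0.8 > 0, so y-hat = +1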


More generically…


feature 1 = h0(x) … e.g., 1
feature 2 = h1(x) … e.g., x[1] = #awesome
feature 3 = h2(x) … e.g., x[2] = #awful, or log(x[7]) x[2] = log(#bad) x #awful, or tf-idf(“awful”)
…
feature D+1 = hD(x) … some other function of x[1], …, x[d]

Model: ŷi = sign(Score(xi))
Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = Σ_{j=0}^{D} wj hj(xi) = wᵀh(xi)
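A generic sketch of this model: a list of feature functions hj, a weight vector w, and the sign of their weighted sum (the particular feature functions and weights below are illustrative, not from the slides):

import numpy as np

def h(x):
    # Feature vector h(x) built from a few example feature functions.
    return np.array([
        1.0,                                        # h0(x) = 1
        x["n_awesome"],                             # h1(x) = #awesome
        x["n_awful"],                               # h2(x) = #awful
        np.log(x["n_bad"] + 1.0) * x["n_awful"],    # h3(x): e.g., log(#bad) * #awful (+1 avoids log(0))
    ])

def predict(w, x):
    # Model: y-hat_i = sign(Score(x_i)), with Score(x_i) = w^T h(x_i).
    return +1 if w @ h(x) > 0 else -1

w = np.array([0.0, 1.2, -1.8, -0.3])                # illustrative weights
print(predict(w, {"n_awesome": 3, "n_awful": 1, "n_bad": 0}))   # Score = 1.8 > 0, so +1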


Decision boundaries


Suppose only two words had non-zero coefficient

[Plot: #awful vs. #awesome, both axes 0, 1, 2, 3, 4, …]

“Sushi was awesome, the food was awesome, but the service was awful.”

Score(x) = 1.0 #awesome – 1.5 #awful

Input      Coefficient   Value
           w0            0.0
#awesome   w1            1.0
#awful     w2            -1.5
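The example sentence contains two occurrences of “awesome” and one of “awful”; plugging these counts into the score (helper name mine):

def score(n_awesome, n_awful):
    # Score(x) = 0.0 + 1.0 * #awesome - 1.5 * #awful, using the coefficients in the table above.
    return 0.0 + 1.0 * n_awesome - 1.5 * n_awful

s = score(n_awesome=2, n_awful=1)
print(s, +1 if s > 0 else -1)    # 2.0 - 1.5 = 0.5 > 0, so y-hat = +1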


Decision boundary example

[Plot: #awful vs. #awesome; the line 1.0 #awesome – 1.5 #awful = 0 is the decision boundary, with Score(x) > 0 on one side and Score(x) < 0 on the other]

Score(x) = 1.0 #awesome – 1.5 #awful

Input      Coefficient   Value
           w0            0.0
#awesome   w1            1.0
#awful     w2            -1.5

The decision boundary separates the + and – predictions.


For more inputs (linear features)…

[Plot: axes x[1] = #awesome, x[2] = #awful, x[3] = #great]

Score(x) = w0 + w1 #awesome + w2 #awful + w3 #great


For general features…

For more general classifiers (not just linear features) → more complicated shapes
