©2005-2013 Carlos Guestrin 1

Regularization, Ridge Regression

Machine Learning – CSEP546 Carlos Guestrin University of Washington

January 13, 2014

The regression problem

- Instances: <x_j, t_j>
- Learn: mapping from x to t(x)
- Hypothesis space:
  - Given: basis functions
  - Find: coefficients w = {w_1,…,w_k}
  - Why is this called linear regression? Because the model is linear in the parameters
- Precisely, minimize the residual squared error:

w_{LS} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \sum_{i=1}^k w_i h_i(x_j) \right)^2


The regression problem in matrix notation

In matrix form, t = Hw: H is the N×K matrix with entries H_{ji} = h_i(x_j) (N data points, K basis functions), t is the N×1 vector of observations, and w is the K×1 vector of weights.


Regression solution = simple matrix operations

w_{LS} = (H^T H)^{-1} H^T t

where H^T H is the k×k matrix for the k basis functions and H^T t is a k×1 vector.

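As a concrete sketch of these matrix operations (my own illustration, not from the slides; the basis 1, x, x² is an assumed choice):

```python
import numpy as np

# Illustrative basis functions: a constant, h_1(x) = x, and h_2(x) = x^2.
def design_matrix(x):
    return np.column_stack([np.ones_like(x), x, x ** 2])

def least_squares(H, t):
    # w = (H^T H)^{-1} H^T t, computed by solving the normal equations.
    return np.linalg.solve(H.T @ H, H.T @ t)

x = np.array([0.0, 1.0, 2.0, 3.0])
t = 1.0 + 2.0 * x - 0.5 * x ** 2   # noiseless quadratic target
w = least_squares(design_matrix(x), t)  # recovers [1.0, 2.0, -0.5]
```

Because the target lies exactly in the span of the basis, the solution recovers the true coefficients exactly.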

Bias-Variance Tradeoff

- Choice of hypothesis class introduces learning bias
  - More complex class → less bias
  - More complex class → more variance


Test set error as a function of model complexity


Overfitting

- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w′ such that w has lower training error but higher true error: error_train(w) < error_train(w′) and error_true(w) > error_true(w′)


Regularization in Linear Regression

- Overfitting usually leads to very large parameter choices, e.g.:

  −2.2 + 3.1 X − 0.30 X²   vs.   −1.1 + 4,700,910.7 X − 8,585,638.4 X² + …

- Regularized or penalized regression aims to impose a "complexity" penalty by penalizing large weights
  - A "shrinkage" method


Quadratic Penalty (regularization)

- What we thought we wanted to minimize:

\sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2

- But the weights got too big, so penalize large weights:

\sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2 + \lambda \sum_{i=1}^k w_i^2


Ridge Regression


- Ameliorating issues with overfitting:
- New objective:

w_{ridge} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2 + \lambda \sum_{i=1}^k w_i^2


Ridge Regression in Matrix Notation

As before, t = Hw, but H is now the N×(K+1) matrix with columns h_0,…,h_k (h_0 ≡ 1 for the intercept; N data points, K+1 basis functions), t is the N×1 vector of observations, and w is the (K+1)×1 vector of weights.

w_{ridge} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2 + \lambda\, w^T I_{0+k}\, w

where I_{0+k} is the (k+1)×(k+1) identity matrix with its (0,0) entry set to zero, so that the intercept w_0 is not penalized and w^T I_{0+k} w = \sum_{i=1}^k w_i^2.

Minimizing the Ridge Regression Objective

w_{ridge} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2 + \lambda \sum_{i=1}^k w_i^2

= (Hw - t)^T (Hw - t) + \lambda\, w^T I_{0+k}\, w

The first (sum-of-squares) term is the MLE / least-squares objective; the second is the ridge penalty.

Shrinkage Properties

- Ridge solution:

w_{ridge} = (H^T H + \lambda I_{0+k})^{-1} H^T t

- If the features/basis are orthonormal, H^T H = I, and each penalized coefficient is simply the least-squares coefficient shrunk by a constant factor:

w_{ridge,i} = \frac{w_{LS,i}}{1 + \lambda}
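A small sketch of the closed-form ridge solution (my illustration, not the slides' code; the I_{0+k} construction leaves the intercept unpenalized as above):

```python
import numpy as np

def ridge(H, t, lam):
    # w_ridge = (H^T H + lam * I_{0+k})^{-1} H^T t, where I_{0+k} is the
    # identity with its (0,0) entry zeroed so the intercept is not penalized.
    I0k = np.eye(H.shape[1])
    I0k[0, 0] = 0.0
    return np.linalg.solve(H.T @ H + lam * I0k, H.T @ t)

rng = np.random.default_rng(0)
x = rng.normal(size=20)
H = np.column_stack([np.ones_like(x), x, x ** 2])
t = 1.0 + 2.0 * x - 0.5 * x ** 2 + 0.1 * rng.normal(size=20)

w_ls = ridge(H, t, 0.0)     # lam = 0 recovers ordinary least squares
w_reg = ridge(H, t, 10.0)   # larger lam shrinks the penalized weights
```

Increasing lam can only shrink the norm of the penalized part of the solution, which is the monotone behavior in λ described on the following slide.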

Ridge Regression: Effect of Regularization


- The solution is indexed by the regularization parameter λ:

w_{ridge} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2 + \lambda \sum_{i=1}^k w_i^2

- Larger λ: stronger penalty, smaller weights
- Smaller λ: weaker penalty, closer to the least-squares fit
- As λ → 0: w_{ridge} → the least-squares solution
- As λ → ∞: the penalized weights w_1,…,w_k → 0

Ridge Coefficient Path

- Typical approach: select λ using cross validation; more on this later in the quarter


From Kevin Murphy textbook


Error as a function of regularization parameter for a fixed model complexity

[Plot: test error as a function of λ, with the horizontal axis running from λ=∞ to λ=0]

What you need to know…

- Regularization
  - Penalizes complex models
- Ridge regression
  - L2-penalized least-squares regression
  - The regularization parameter trades off model complexity with training error


Cross-Validation


Test set error as a function of model complexity


How… How… How???????

- How do we pick the regularization constant λ…
  - And all the other constants in ML, 'cause one thing ML doesn't lack is constants to tune…
- We could use the test data, but…


(LOO) Leave-one-out cross validation

- Consider a validation set with 1 example:
  - D: training data
  - D\j: training data with the j-th data point moved to the validation set
- Learn classifier h_{D\j} with the D\j dataset
- Estimate the true error as the squared error on predicting t(x_j):
  - Unbiased estimate of error_true(h_{D\j})!
  - Seems like a really bad estimator, but wait!
- LOO cross validation: average over all data points j:
  - For each data point you leave out, learn a new classifier h_{D\j}
  - Estimate the error as:

error_{LOO} = \frac{1}{N} \sum_{j=1}^N \left( t(x_j) - h_{D\setminus j}(x_j) \right)^2

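A direct brute-force sketch of the LOO estimate (my illustration; the ridge learner standing in for h_{D\j} is an assumed choice):

```python
import numpy as np

def ridge(H, t, lam):
    # Ridge learner used as the classifier h; intercept unpenalized.
    I0k = np.eye(H.shape[1])
    I0k[0, 0] = 0.0
    return np.linalg.solve(H.T @ H + lam * I0k, H.T @ t)

def loo_error(H, t, lam):
    # error_LOO = (1/N) * sum_j ( t(x_j) - h_{D\j}(x_j) )^2
    N = len(t)
    total = 0.0
    for j in range(N):
        keep = np.arange(N) != j           # D\j: leave point j out
        w = ridge(H[keep], t[keep], lam)   # learn h_{D\j}
        total += (t[j] - H[j] @ w) ** 2    # squared error on the held-out point
    return total / N

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=30)
H = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.1 * rng.normal(size=30)
err = loo_error(H, t, lam=0.1)
```

Note the cost: N full training runs for one error estimate, which is the computational issue discussed a few slides below.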

LOO cross validation is (almost) unbiased estimate of true error of hD!

- When computing the LOOCV error, we only use N−1 data points
  - So it is not an estimate of the true error of learning with N data points!
  - Usually pessimistic, though: learning with less data typically gives a worse answer
- LOO is almost unbiased!
- Great news!
  - Use the LOO error for model selection!!!
  - E.g., picking λ


Using LOO to Pick λ

[Plot: LOO error as a function of λ, with the horizontal axis running from λ=∞ to λ=0]

error_{LOO} = \frac{1}{N} \sum_{j=1}^N \left( t(x_j) - h_{D\setminus j}(x_j) \right)^2

Using LOO error for model selection

error_{LOO} = \frac{1}{N} \sum_{j=1}^N \left( t(x_j) - h_{D\setminus j}(x_j) \right)^2


Computational cost of LOO

- Suppose you have 100,000 data points
- You implemented a great version of your learning algorithm: it learns in only 1 second
- Computing LOO will take about 1 day!!!
  - If you have to do it for each choice of basis functions, it will take fooooooreeeever!!!
- Solution 1: preferred, but not usually possible
  - Find a cool trick to compute LOO (e.g., see homework)


Solution 2 to complexity of computing LOO: (More typical) Use k-fold cross validation

- Randomly divide the training data into k equal parts: D_1,…,D_k
- For each i:
  - Learn classifier h_{D\D_i} using the data points not in D_i
  - Estimate the error of h_{D\D_i} on validation set D_i:
- The k-fold cross validation error is the average over the data splits:
- k-fold cross validation properties:
  - Much faster to compute than LOO
  - More (pessimistically) biased: each model uses much less data, only N(k−1)/k points
  - Usually, k = 10

error_{D_i} = \frac{k}{N} \sum_{x_j \in D_i} \left( t(x_j) - h_{D \setminus D_i}(x_j) \right)^2

error_{k\text{-fold}} = \frac{1}{k} \sum_{i=1}^k error_{D_i}

(For equal-size parts, |D_i| = N/k, so error_{D_i} is the mean squared error on the i-th validation set.)

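The k-fold procedure above can be sketched as follows (my illustration; the ridge learner is again an assumed stand-in for h):

```python
import numpy as np

def ridge(H, t, lam):
    # Ridge learner; intercept (column 0) unpenalized.
    I0k = np.eye(H.shape[1])
    I0k[0, 0] = 0.0
    return np.linalg.solve(H.T @ H + lam * I0k, H.T @ t)

def kfold_error(H, t, lam, k=10, seed=0):
    # Randomly divide the data into k (nearly) equal parts D_1,...,D_k;
    # for equal parts, (k/N) * sum over D_i equals the mean error on D_i.
    N = len(t)
    idx = np.random.default_rng(seed).permutation(N)
    total = 0.0
    for Di in np.array_split(idx, k):
        train = np.setdiff1d(idx, Di)       # points not in D_i
        w = ridge(H[train], t[train], lam)  # learn h_{D\D_i}
        total += np.mean((t[Di] - H[Di] @ w) ** 2)
    return total / k                        # average over the k splits

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=50)
H = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.1 * rng.normal(size=50)
err = kfold_error(H, t, lam=0.1, k=10)
```

Only k training runs are needed instead of N, which is why this is the more typical choice in practice.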

ML Pipeline

[Figure: ML pipeline, starting from data]

What you need to know…

- Never ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever train on the test data
- Use cross-validation to choose magic parameters such as λ
- Leave-one-out is the best you can do, but sometimes too slow
  - In that case, use k-fold cross-validation


Variable Selection, LASSO: Sparse Regression

Sparsity

- Vector w is sparse if many entries are zero
- Very useful for many tasks, e.g.:
  - Efficiency: if size(w) = 100B, each prediction is expensive
    - If part of an online system, too slow
    - If w is sparse, the prediction computation only depends on the number of non-zeros
  - Interpretability: what are the relevant dimensions for making a prediction?
    - E.g., what are the parts of the brain associated with particular words?
- But it is computationally intractable to perform "all subsets" regression
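A tiny illustration of the efficiency point (the dict-of-nonzeros representation is a hypothetical choice, not from the slides): prediction with a sparse w only touches the non-zero entries.

```python
# Dense prediction touches every one of the d coefficients.
def predict_dense(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

# Sparse prediction only touches the non-zero coefficients,
# stored as {feature index: weight}.
def predict_sparse(w_nonzero, h):
    return sum(wi * h[i] for i, wi in w_nonzero.items())

d = 1000
w = [0.0] * d
w[3], w[42] = 2.0, -1.0            # only two non-zero weights
w_nonzero = {3: 2.0, 42: -1.0}     # sparse view: 2 entries instead of 1000
h = [1.0] * d
```

Both predictors agree, but the sparse one does 2 multiply-adds instead of 1000.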

[Figure from Tom Mitchell: brain-image signatures for the words "Eat", "Push", and "Run"; mean of independently learned signatures over all nine participants, and participant P1; regions shown include the pars opercularis (z=24mm), postcentral gyrus (z=30mm), and superior temporal sulcus (posterior) (z=12mm)]

Simple greedy model selection algorithm

- Pick a dictionary of features
  - e.g., polynomials for linear regression
- Greedy heuristic:
  - Start from an empty (or simple) set of features F_0 = ∅
  - Run the learning algorithm for the current set of features F_t, obtaining h_t
  - Select the next best feature X_{i*}
    - e.g., the X_j that results in the lowest training error when learning with F_t + {X_j}
  - F_{t+1} ← F_t + {X_{i*}}
  - Recurse
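The greedy loop above can be sketched for linear regression as follows (my illustration; the least-squares learner and training-error criterion follow the slide's example rule):

```python
import numpy as np

def train_error(H, t, cols):
    # Least-squares training error using an intercept plus the chosen columns.
    X = np.column_stack([np.ones(len(t))] + [H[:, c] for c in cols])
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return np.mean((t - X @ w) ** 2)

def greedy_select(H, t, n_features):
    selected = []                      # F_0 = empty set
    for _ in range(n_features):
        rest = [c for c in range(H.shape[1]) if c not in selected]
        # Next best feature: the X_j with lowest training error for F_t + {X_j}
        best = min(rest, key=lambda c: train_error(H, t, selected + [c]))
        selected.append(best)          # F_{t+1} = F_t + {X_i*}
    return selected

rng = np.random.default_rng(2)
H = rng.normal(size=(50, 5))
t = 3.0 * H[:, 1] + 0.01 * rng.normal(size=50)  # only feature 1 matters
chosen = greedy_select(H, t, 2)
```

On this synthetic data the truly relevant feature is picked first, but as the next slide notes, the procedure is only a heuristic in general.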

Greedy model selection

- Applicable in many settings:
  - Linear regression: selecting basis functions
  - Naïve Bayes: selecting (independent) features P(X_i|Y)
  - Logistic regression: selecting features (basis functions)
  - Decision trees: selecting leaves to expand
- Only a heuristic!
  - But sometimes you can prove something cool about it
    - e.g., [Krause & Guestrin '05]: near-optimal in some settings that include Naïve Bayes
- There are many more elaborate methods out there

When do we stop???

- Greedy heuristic:
  - …
  - Select the next best feature X_{i*}
    - e.g., the X_j that results in the lowest training error when learning with F_t + {X_j}
  - F_{t+1} ← F_t + {X_{i*}}
  - Recurse
- When do you stop???
  - When the training error is low enough?
  - When the test set error is low enough?
  - …


Variable Selection by Regularization

- Ridge regression: penalizes large weights

w_{ridge} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2 + \lambda \sum_{i=1}^k w_i^2

- What if we want to perform "feature selection"?
  - E.g., which regions of the brain are important for word prediction?
  - Can't simply choose the features with the largest coefficients in the ridge solution
- Try a new penalty: penalize non-zero weights
  - Regularization penalty: \sum_{i=1}^k |w_i|
  - Leads to sparse solutions
  - Just like ridge regression, the solution is indexed by a continuous parameter λ
  - This simple approach has changed statistics, machine learning & electrical engineering

LASSO Regression

- LASSO: least absolute shrinkage and selection operator
- New objective:

w_{LASSO} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2 + \lambda \sum_{i=1}^k |w_i|

Geometric Intuition for Sparsity

[Figure from Rob Tibshirani slides: picture of Lasso and Ridge regression. Contours of the squared-error objective around the unregularized solution w_MLE in the (w_1, w_2) plane, with the Lasso constraint region (a diamond, |w_1| + |w_2| ≤ s) and the Ridge constraint region (a disk, w_1² + w_2² ≤ s). The diamond's corners lie on the axes, which is why the Lasso solution tends to be sparse.]

Optimizing the LASSO Objective

- LASSO solution:

w_{LASSO} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2 + \lambda \sum_{i=1}^k |w_i|

Coordinate Descent

- Given a function F, we want to find its minimum
- Often it is hard to find the minimum over all coordinates at once, but easy for one coordinate
- Coordinate descent: repeatedly pick one coordinate and minimize F over it, holding all other coordinates fixed
- How do we pick the next coordinate? (e.g., at random, or by cycling through them)
- Super useful approach for *many* problems
  - Converges to the optimum in some cases, such as LASSO

How do we find the minimum over each coordinate?

- Key step in coordinate descent: find the minimum over each coordinate
- Standard approach: set the partial derivative with respect to that coordinate to zero and solve

Illustration from Wikipedia


Optimizing LASSO Objective One Coordinate at a Time

- Taking the derivative of the LASSO objective

\sum_{j=1}^N \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)^2 + \lambda \sum_{i=1}^k |w_i|

  - Residual sum of squares (RSS) term:

\frac{\partial}{\partial w_\ell} RSS(w) = -2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \left(w_0 + \sum_{i=1}^k w_i h_i(x_j)\right) \right)

  - Penalty term: \lambda \sum_{i=1}^k |w_i|, which is not differentiable at w_i = 0

Coordinate Descent for LASSO (aka Shooting Algorithm)

- Repeat until convergence:
  - Pick a coordinate ℓ (at random or sequentially)
  - Set:

w_\ell = \begin{cases} (c_\ell + \lambda)/a_\ell & c_\ell < -\lambda \\ 0 & c_\ell \in [-\lambda, \lambda] \\ (c_\ell - \lambda)/a_\ell & c_\ell > \lambda \end{cases}

  - Where:

a_\ell = 2 \sum_{j=1}^N (h_\ell(x_j))^2

c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \left(w_0 + \sum_{i \neq \ell} w_i h_i(x_j)\right) \right)

- For convergence rates, see Shalev-Shwartz and Tewari 2009
- Other common technique: LARS (least angle regression and shrinkage), Efron et al. 2004
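A compact sketch of the shooting algorithm (my illustration; for simplicity every coordinate is penalized, i.e., there is no separate unpenalized intercept, which amounts to assuming centered data):

```python
import numpy as np

def shooting(H, t, lam, n_iters=100):
    # Coordinate descent for LASSO with the update from the slide:
    #   w_l = (c_l + lam)/a_l  if c_l < -lam
    #         0                if -lam <= c_l <= lam
    #         (c_l - lam)/a_l  if c_l > lam
    N, k = H.shape
    w = np.zeros(k)
    a = 2.0 * np.sum(H ** 2, axis=0)        # a_l = 2 * sum_j h_l(x_j)^2
    for _ in range(n_iters):
        for l in range(k):                  # sweep coordinates sequentially
            r = t - H @ w + H[:, l] * w[l]  # residual with coordinate l removed
            c = 2.0 * (H[:, l] @ r)         # c_l
            if c < -lam:
                w[l] = (c + lam) / a[l]
            elif c > lam:
                w[l] = (c - lam) / a[l]
            else:
                w[l] = 0.0
    return w

H = np.array([[1.0], [2.0]])
t = np.array([2.0, 4.0])   # t = 2x, so a = 10 and c = 20 at the solution
w_ls = shooting(H, t, lam=0.0)    # -> [2.0], the least-squares fit
w_l2 = shooting(H, t, lam=2.0)    # -> [1.8], shrunk by lam/a
w_l25 = shooting(H, t, lam=25.0)  # -> [0.0], thresholded to exact zero
```

The single-feature example shows all three branches of the update: no penalty recovers least squares, a moderate λ shrinks the weight, and a large λ sets it exactly to zero.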


Soft Thresholding


From Kevin Murphy textbook

w_\ell = \begin{cases} (c_\ell + \lambda)/a_\ell & c_\ell < -\lambda \\ 0 & c_\ell \in [-\lambda, \lambda] \\ (c_\ell - \lambda)/a_\ell & c_\ell > \lambda \end{cases}

Plotted as a function of c_\ell, this update is the soft-thresholding operator: identically zero on [−λ, λ] and linear outside that interval.

Recall: Ridge Coefficient Path

- Typical approach: select λ using cross validation


From Kevin Murphy textbook


Now: LASSO Coefficient Path


From Kevin Murphy textbook


LASSO Example


Estimated coefficients

Term        Least Squares   Ridge    Lasso
Intercept       2.465       2.452    2.468
lcavol          0.680       0.420    0.533
lweight         0.263       0.238    0.169
age            −0.141      −0.046
lbph            0.210       0.162    0.002
svi             0.305       0.227    0.094
lcp            −0.288       0.000
gleason        −0.021       0.040
pgg45           0.267       0.133

(Blank Lasso entries are coefficients that the Lasso sets exactly to zero.)

From Rob Tibshirani slides


Debiasing


From Kevin Murphy textbook

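Debiasing can be sketched as a two-step procedure (my illustration, assuming a shooting-style LASSO solver as above): use the LASSO only to pick the support, then refit those coordinates by unpenalized least squares, removing the shrinkage bias on the selected weights.

```python
import numpy as np

def shooting(H, t, lam, n_iters=100):
    # Minimal LASSO solver (all coordinates penalized; assumes centered data).
    w = np.zeros(H.shape[1])
    a = 2.0 * np.sum(H ** 2, axis=0)
    for _ in range(n_iters):
        for l in range(H.shape[1]):
            c = 2.0 * (H[:, l] @ (t - H @ w + H[:, l] * w[l]))
            w[l] = 0.0 if abs(c) <= lam else (c - np.sign(c) * lam) / a[l]
    return w

def debias(H, t, w_lasso):
    # Keep the LASSO support, re-estimate those weights by least squares.
    support = np.flatnonzero(w_lasso)
    w = np.zeros_like(w_lasso)
    if support.size:
        w[support], *_ = np.linalg.lstsq(H[:, support], t, rcond=None)
    return w

H = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
t = np.array([2.0, 4.0, 0.0])        # depends only on the first feature
w_l = shooting(H, t, lam=2.0)        # selects feature 0, but shrinks it
w_d = debias(H, t, w_l)              # same support, shrinkage removed
```

Here the LASSO correctly zeroes the irrelevant feature but shrinks the relevant one; the refit restores the unbiased least-squares value on the selected support.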

What you need to know

- Variable selection: find a sparse solution to the learning problem
- L1 regularization is one way to do variable selection
  - Applies beyond regression
  - Hundreds of other approaches out there
- The LASSO objective is non-differentiable, but convex → use subgradients
- No closed-form solution for the minimum → use coordinate descent
- The shooting algorithm is a very simple approach for solving LASSO
