
ESL Chap3 — Linear Methods for Regression Trevor Hastie

Linear Methods for Regression

Outline

• The simple linear regression model

• Multiple linear regression

• Model selection and shrinkage—the state of the art

Preliminaries

Data: (x1, y1), . . . , (xN, yN).

xi is the predictor (regressor, covariate, feature, independent variable)

yi is the response (dependent variable, outcome)

We denote the regression function by

η(x) = E (Y |x)

This is the conditional expectation of Y given x.

The linear regression model assumes a specific linear form for η:

η(x) = α + βx

which is usually thought of as an approximation to the truth.


Fitting by least squares

Minimize:

(β̂0, β̂) = argmin_{β0, β} Σ_{i=1}^N (y_i − β0 − β x_i)²

Solutions are

β̂ = Σ_{i=1}^N (x_i − x̄) y_i / Σ_{i=1}^N (x_i − x̄)²

β̂0 = ȳ − β̂ x̄

ŷ_i = β̂0 + β̂ x_i are called the fitted or predicted values

r_i = y_i − β̂0 − β̂ x_i are called the residuals
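
These formulas translate directly into code. Below is a minimal numpy sketch on simulated data (my own illustration, not part of the slides; the variable names and the simulated example are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=N)   # simulated data: alpha = 2, beta = 3

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * y) / np.sum((x - xbar) ** 2)   # slope estimate
beta0_hat = ybar - beta_hat * xbar                            # intercept estimate

y_fit = beta0_hat + beta_hat * x    # fitted values
r = y - y_fit                       # residuals
```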


Figure 3.1: Linear least squares fitting with X ∈ IR². We seek the linear function of X that minimizes the sum of squared residuals from Y. (A view of linear regression in IR^(p+1).)


Standard errors & confidence intervals

We often assume further that

y_i = β0 + β x_i + ε_i

where E(ε_i) = 0 and Var(ε_i) = σ². Then

se(β̂) = [ σ² / Σ_i (x_i − x̄)² ]^(1/2)

Estimate σ² by σ̂² = Σ_i (y_i − ŷ_i)² / (N − 2).

Under the additional assumption of normality for the ε_i, a 95% confidence

interval for β is: β̂ ± 1.96 · se(β̂)



Fitted Line and Standard Errors

η̂(x) = β̂0 + β̂ x = ȳ + β̂ (x − x̄)

se[η̂(x)] = [ var(ȳ) + var(β̂)(x − x̄)² ]^(1/2)
         = [ σ²/N + σ²(x − x̄)² / Σ_i (x_i − x̄)² ]^(1/2)
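
A small self-contained sketch (again my own, on invented simulated data) computing σ̂², se(β̂), the 95% interval, and the pointwise band η̂(x) ± 2·se[η̂(x)] from the formulas above, with σ̂² plugged in for σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=N)

xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
beta_hat = np.sum((x - xbar) * y) / sxx
beta0_hat = y.mean() - beta_hat * xbar
resid = y - (beta0_hat + beta_hat * x)

sigma2_hat = np.sum(resid ** 2) / (N - 2)            # sigma^2 estimated on N - 2 df
se_beta = np.sqrt(sigma2_hat / sxx)                  # se(beta_hat)
ci = (beta_hat - 1.96 * se_beta, beta_hat + 1.96 * se_beta)   # approximate 95% CI

xgrid = np.linspace(x.min(), x.max(), 50)
eta_hat = y.mean() + beta_hat * (xgrid - xbar)                 # fitted line
se_eta = np.sqrt(sigma2_hat / N + sigma2_hat * (xgrid - xbar) ** 2 / sxx)  # pointwise se
band_lo, band_hi = eta_hat - 2 * se_eta, eta_hat + 2 * se_eta  # +/- 2 se band, as in the next figure
```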


[Figure: fitted regression line of Y on X with pointwise standard errors: η̂(x) ± 2·se[η̂(x)].]


Multiple linear regression

Model is

f(x_i) = β0 + Σ_{j=1}^p x_ij β_j

Equivalently in matrix notation:

f = Xβ

f is the N-vector of predicted values

X is the N × (p+1) matrix of regressors, with ones in the first column

β is a (p+1)-vector of parameters (including the intercept)


Estimation by least squares

β̂ = argmin_β Σ_i ( y_i − β0 − Σ_{j=1}^p x_ij β_j )²
  = argmin_β (y − Xβ)^T (y − Xβ)

Figure 3.2 shows the N-dimensional geometry.

Solution is

β̂ = (X^T X)^{−1} X^T y

ŷ = X β̂

Also Var(β̂) = (X^T X)^{−1} σ²
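
As an illustration (mine, not from the slides), the least squares solution and Var(β̂) can be computed as follows; lstsq is used instead of forming the inverse explicitly, which is the numerically safer route:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept column of ones first
beta_true = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# beta_hat = (X^T X)^{-1} X^T y; lstsq solves the same least squares problem
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

sigma2_hat = np.sum((y - y_hat) ** 2) / (N - X.shape[1])     # residual variance estimate
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)               # estimate of Var(beta_hat)
se = np.sqrt(np.diag(cov_beta))
```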


Here are some additional notes (linear.pdf) on multiple linear regression, with an emphasis on computations.


The Bias-variance tradeoff

A good measure of the quality of an estimator f(x) is the mean squared

error. Let f0(x) be the true value of f(x) at the point x. Then

MSE[f(x)] = E[f(x) − f0(x)]²

This can be written as

MSE[f(x)] = Var[f(x)] + [E f(x) − f0(x)]²

This is variance plus squared bias.

Typically, when bias is low, variance is high and vice-versa. Choosing

estimators often involves a tradeoff between bias and variance.
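
A toy Monte Carlo (my own, with invented numbers) makes the decomposition concrete: a deliberately shrunken estimator of a mean picks up a little bias but loses enough variance to have a smaller MSE:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 2.0, 4.0, 10, 20000
samples = rng.normal(mu, sigma, size=(reps, n))

est_mean = samples.mean(axis=1)      # unbiased sample mean
est_shrunk = 0.8 * est_mean          # shrunken (biased) estimator of mu

for name, est in [("sample mean", est_mean), ("shrunken mean", est_shrunk)]:
    mse = np.mean((est - mu) ** 2)
    var = est.var()
    bias2 = (est.mean() - mu) ** 2
    print(f"{name}: MSE={mse:.3f}  Var={var:.3f}  Bias^2={bias2:.3f}")   # MSE ~ Var + Bias^2
```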


• If the linear model is correct for a given problem, then the least

squares prediction f̂ is unbiased, and has the lowest variance among

all unbiased estimators that are linear functions of y

• But there can be (and often exist) biased estimators with smaller

MSE.

• Generally, by regularizing (shrinking, dampening, controlling) the

estimator in some way, its variance will be reduced; if the

corresponding increase in bias is small, this will be worthwhile.

• Examples of regularization: subset selection (forward, backward, all

subsets); ridge regression, the lasso.

• In reality models are almost never correct, so there is an additional

model bias between the closest member of the linear model class and

the truth.


Model Selection

Often we prefer a restricted estimate because of its reduced estimation variance.

Figure 7.2: Schematic of the behavior of bias and variance.

The model space is the set of all possible predictions from the

model, with the “closest fit” labeled with a black dot. The

model bias from the truth is shown, along with the variance,

indicated by the large yellow circle centered at the black dot

labelled “closest fit in population”. A shrunken or regularized fit is also shown, having additional estimation bias, but

smaller prediction error due to its decreased variance.


Analysis of time series data

Two approaches: the frequency domain (Fourier); see the discussion of wavelet

smoothing.

Time domain: the main tool is the autoregressive (AR) model of order k:

y_t = β1 y_{t−1} + β2 y_{t−2} + · · · + βk y_{t−k} + ε_t

Fit by linear least squares regression on lagged data

y_t = β1 y_{t−1} + β2 y_{t−2} + · · · + βk y_{t−k}

y_{t−1} = β1 y_{t−2} + β2 y_{t−3} + · · · + βk y_{t−k−1}

⋮

y_{k+1} = β1 y_k + β2 y_{k−1} + · · · + βk y_1
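
One way to set up this lagged regression in code (a sketch under my own conventions: no intercept, matching the display above; all names are invented):

```python
import numpy as np

def fit_ar(y, k):
    """Least squares fit of an AR(k) model: regress y_t on y_{t-1}, ..., y_{t-k}."""
    N = len(y)
    X = np.column_stack([y[k - j - 1 : N - j - 1] for j in range(k)])  # lag-1 ... lag-k columns
    beta, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    return beta

rng = np.random.default_rng(4)
e = rng.normal(size=500)
y = np.zeros(500)
for t in range(2, 500):                    # simulate an AR(2) series
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + e[t]

print(fit_ar(y, 2))                        # estimates should be near 0.6 and -0.3
```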


Example: NYSE data

Time series of 6200 daily measurements, 1962-1987

volume — log(trading volume) — the outcome

volume.Lj — log(trading volume) lagged j days, j = 1, 2, 3

retd.Lj — Δ log(Dow Jones) lagged j days, j = 1, 2, 3

aretd.Lj — |Δ log(Dow Jones)| lagged j days, j = 1, 2, 3

vola.Lj — volatility lagged j days, j = 1, 2, 3

Source: Weigend and LeBaron (1994)

We randomly selected a training set of size 50 and a test set of size 500, from the

first 600 observations.

NYSE data

[Figure: scatterplot matrix of the NYSE variables: volume, volume.L1–L3, retd.L1–L3, aretd.L1–L3, vola.L1–L3.]


OLS Fit

Results of ordinary least squares analysis of NYSE data

Term Coefficient Std. Error t-Statistic

Intercept -0.02 0.04 -0.64

volume.L1 0.09 0.05 1.80

volume.L2 0.06 0.05 1.19

volume.L3 0.04 0.05 0.81

retd.L1 0.00 0.04 0.11

retd.L2 -0.02 0.05 -0.46

retd.L3 -0.03 0.04 -0.65

aretd.L1 0.08 0.07 1.12

aretd.L2 -0.02 0.05 -0.45

aretd.L3 0.03 0.04 0.77

vola.L1 0.20 0.30 0.66

vola.L2 -0.50 0.40 -1.25

vola.L3 0.27 0.34 0.78


Variable subset selection

We retain only a subset of the coefficients and set the rest to zero.

There are different strategies:

• All subsets regression finds, for each s ∈ {0, 1, 2, . . . , p}, the subset of

size s that gives the smallest residual sum of squares. The question of

how to choose s involves the tradeoff between bias and variance: one can

use cross-validation (see below)

• Rather than search through all possible subsets, we can seek a good

path through them. Forward stepwise selection starts with the

intercept and then sequentially adds into the model the variable that

most improves the fit. The improvement in fit is usually based on the


F ratio

F = [RSS(β_old) − RSS(β_new)] / [RSS(β_new) / (N − s)]

• Backward stepwise selection starts with the full OLS model, and

sequentially deletes variables. (A code sketch of the forward variant appears after this list.)

• There are also hybrid stepwise selection strategies which add in the

best variable and delete the least important variable, in a sequential

manner.

• Each procedure has one or more tuning parameters:

– subset size

– P-values for adding or dropping terms
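
Here is a rough sketch of forward stepwise selection (my own simplification, not the slides' exact procedure: it greedily adds the variable that most reduces the residual sum of squares rather than computing the F ratio explicitly):

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the least squares fit on an intercept plus the given columns."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

def forward_stepwise(X, y, max_vars):
    """Start with the intercept; at each step add the variable giving the biggest drop in RSS."""
    active, remaining, path = [], list(range(X.shape[1])), []
    for _ in range(max_vars):
        best = min(remaining, key=lambda j: rss(X, y, active + [j]))
        active.append(best)
        remaining.remove(best)
        path.append((list(active), rss(X, y, active)))
    return path

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=60)
for active, r in forward_stepwise(X, y, 3):
    print(active, round(r, 2))
```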


Model Assessment

Objectives:

1. Choose a value of a tuning parameter for a technique

2. Estimate the prediction performance of a given model

For both of these purposes, the best approach is to run the procedure on

an independent test set, if one is available

If possible one should use different test data for (1) and (2) above: a

validation set for (1) and a test set for (2)

Often there is insufficient data to create a separate validation or test set. In

this instance cross-validation is useful.


K-Fold Cross-Validation

Primary method for estimating a tuning parameter λ (such as subset size).

Divide the data into K roughly equal parts (typically K = 5 or 10).

[Diagram: the data divided into K = 5 parts numbered 1–5; each part in turn serves as the Test fold while the remaining K − 1 parts are used to Train.]

• For each k = 1, 2, . . . , K, fit the model with parameter λ to the other K − 1

parts, giving β̂^{−k}(λ), and compute its error in predicting the kth part:

E_k(λ) = Σ_{i ∈ kth part} (y_i − x_i β̂^{−k}(λ))².

This gives the cross-validation error

CV(λ) = (1/K) Σ_{k=1}^K E_k(λ)

• Do this for many values of λ and choose the value of λ that makes CV(λ)

smallest.
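
The procedure is easy to code. The sketch below (mine, with simulated data and invented helper names) cross-validates the subset size for all-subsets regression, matching the variable-subsets example discussed on the following slides:

```python
import numpy as np
from itertools import combinations

def design(X, cols):
    # intercept column plus the selected predictors
    return np.column_stack([np.ones(X.shape[0])] + [X[:, j] for j in cols])

def fit(X, y, cols):
    beta, *_ = np.linalg.lstsq(design(X, cols), y, rcond=None)
    return beta

def cv_subset_size(X, y, K=5, seed=0):
    """CV(lambda) with lambda = subset size; the best subset is re-found inside each training fold."""
    N, p = X.shape
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)
    cv = {}
    for lam in range(1, p + 1):
        errs = []
        for k in range(K):
            te = folds[k]
            tr = np.setdiff1d(np.arange(N), te)
            # best subset of size lam on the training part (all subsets; p is small here)
            cols = min(combinations(range(p), lam),
                       key=lambda c: np.sum((y[tr] - design(X[tr], c) @ fit(X[tr], y[tr], c)) ** 2))
            beta = fit(X[tr], y[tr], cols)
            errs.append(np.mean((y[te] - design(X[te], cols) @ beta) ** 2))
        cv[lam] = np.mean(errs)
    return cv

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.7, size=80)
cv = cv_subset_size(X, y)
print(min(cv, key=cv.get), cv)   # subset size with the smallest CV error, and the full CV curve
```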


• In our variable subsets example, λ is the subset size

• β̂^{−k}(λ) are the coefficients for the best subset of size λ, found from the

training set that leaves out the kth part of the data

• E_k(λ) is the estimated test error for this best subset.

• From the K cross-validation training sets, the K test error estimates are

averaged to give

CV(λ) = (1/K) Σ_{k=1}^K E_k(λ).

• Note that different subsets of size λ will (probably) be found from each of

the K cross-validation training sets. This doesn't matter: the focus is on subset size,

not the actual subset.


[Figure: CV error versus subset size (all subsets); the CV curve for the NYSE data.]

• The focus is on subset size, not which variables are in the model.

• Variance increases slowly, typically σ²/N per variable.


Figure 3.5: All possible subset models for the prostate cancer example. At each subset size is shown the residual sum-of-squares for each model of that size.


The Bootstrap approach

• The bootstrap works by sampling N times with replacement from the training set to

form a "bootstrap" data set. The model is then estimated on the bootstrap data set,

and predictions are made for the original training set.

• This process is repeated many times and the results are averaged.

• The bootstrap is most useful for estimating standard errors of predictions.

• One can also use modified versions of the bootstrap to estimate prediction error.

These sometimes produce better estimates than cross-validation (a topic of current

research).
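
A minimal sketch of the basic idea (my own; the data and model are invented), estimating bootstrap standard errors for the predictions at the original training points:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 60
X = np.column_stack([np.ones(N), rng.normal(size=(N, 4))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.8, size=N)

B = 500
preds = np.empty((B, N))
for b in range(B):
    idx = rng.integers(0, N, size=N)                    # sample N rows with replacement
    beta_b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    preds[b] = X @ beta_b                               # predict the original training points

se_pred = preds.std(axis=0)                             # bootstrap standard error of each prediction
```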


NYSE example continued

The table shows the coefficients from a number of different selection and shrinkage

methods applied to the NYSE data (VSS = variable subset selection, PCR = principal

components regression, PLS = partial least squares).

Term OLS VSS Ridge Lasso PCR PLS

Intercept -0.02 0.00 -0.01 -0.02 -0.02 -0.04

volume.L1 0.09 0.16 0.06 0.09 0.05 0.06

volume.L2 0.06 0.00 0.04 0.02 0.06 0.06

volume.L3 0.04 0.00 0.04 0.03 0.04 0.05

retd.L1 0.00 0.00 0.01 0.01 0.02 0.01

retd.L2 -0.02 0.00 -0.01 0.00 -0.01 -0.02

retd.L3 -0.03 0.00 -0.01 0.00 -0.02 0.00

aretd.L1 0.08 0.00 0.03 0.02 -0.02 0.00

aretd.L2 -0.02 -0.05 -0.03 -0.03 -0.01 -0.01

aretd.L3 0.03 0.00 0.01 0.00 0.02 0.01

vola.L1 0.20 0.00 0.00 0.00 -0.01 -0.01

vola.L2 -0.50 0.00 -0.01 0.00 -0.01 -0.01

vola.L3 0.27 0.00 -0.01 0.00 -0.01 -0.01

Test err 0.050 0.041 0.042 0.039 0.045 0.044

SE 0.007 0.005 0.005 0.005 0.006 0.006

CV was used on the 50 training observations (except for OLS). Test error for

the constant model: 0.061.


Estimated prediction error

curves for the various selection

and shrinkage methods. The

arrow indicates the estimated

minimizing value of the

complexity parameter. Training

sample size = 50.

[Figure: CV error curves for all subsets (vs subset size), ridge regression (vs degrees of freedom), the lasso (vs s), principal components regression (vs number of directions), and partial least squares (vs number of directions).]


Figure 3.6: Estimated prediction error curves and their standard errors for the various selection and shrinkage methods, found by 10-fold cross-validation. (Panels: All Subsets, Ridge Regression, Lasso, Principal Components Regression, Partial Least Squares.)


Shrinkage methods

Ridge regression

The ridge estimator is defined by

β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) + λ β^T β

Equivalently,

β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) subject to Σ_j β_j² ≤ s.

The parameter λ > 0 penalizes β_j in proportion to its size β_j². The solution is

β̂_λ = (X^T X + λI)^{−1} X^T y

where I is the identity matrix. This is a biased estimator that for some value of

λ > 0 may have smaller mean squared error than the least squares estimator.

Note λ = 0 gives the least squares estimator; if λ → ∞, then β̂ → 0.
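
A small sketch of the closed-form ridge solution (mine, not from the slides); I standardize the inputs and center y so the intercept is left unpenalized, a common convention that the slide does not spell out:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge slopes (X^T X + lam I)^{-1} X^T y on standardized inputs; intercept left unpenalized."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize so the penalty treats inputs equally
    yc = y - y.mean()
    p = Xs.shape[1]
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    return y.mean(), beta                        # intercept (for centered data) and slopes

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 6))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0]) + rng.normal(size=50)

for lam in [0.0, 1.0, 10.0, 100.0]:
    _, b = ridge(X, y, lam)
    print(lam, np.round(b, 2))                   # coefficients shrink toward zero as lam grows
```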


Figure 3.7: Profiles of ridge coefficients for the prostate cancer example, as tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 4.16, the value chosen by cross-validation.


The Lasso

The lasso is a shrinkage method like ridge, but acts in a nonlinear manner on the

outcome y.

The lasso is defined by

β̂_lasso = argmin_β (y − Xβ)^T (y − Xβ) subject to Σ_j |β_j| ≤ t

• Notice that the ridge penalty Σ_j β_j² is replaced by Σ_j |β_j|.

• This makes the solutions nonlinear in y, and a quadratic programming

algorithm is used to compute them.

• Because of the nature of the constraint, if t is chosen small enough then the

lasso will set some coefficients exactly to zero. Thus the lasso does a kind of

continuous model selection.


• The parameter t should be adaptively chosen to minimize an estimate of

expected prediction error, using, say, cross-validation.

• Ridge vs lasso: if the inputs are orthogonal, ridge multiplies the least squares

coefficients by a constant < 1, while the lasso translates them towards zero by a

constant, truncating at zero (see the sketch below).

[Figure: transformed coefficient versus OLS coefficient for ridge and lasso.]
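
To make the orthogonal-input comparison concrete, here is a small sketch (mine) of the two coefficient maps: ridge scales every OLS coefficient by the same factor, while the lasso soft-thresholds, setting small coefficients exactly to zero:

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    # with orthonormal inputs, ridge scales each OLS coefficient by 1 / (1 + lam)
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, gamma):
    # with orthonormal inputs, the lasso soft-thresholds: move toward 0 by gamma, truncate at 0
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - gamma, 0.0)

beta_ols = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(ridge_shrink(beta_ols, lam=1.0))     # every coefficient halved
print(lasso_shrink(beta_ols, gamma=1.0))   # small coefficients set exactly to zero
```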


Lasso in Action

Profiles of coefficients for NYSE data as lasso shrinkage is varied.

[Figure: lasso coefficient profiles for the 12 NYSE predictors, plotted against the shrinkage factor s.]

s = t/t0 ∈ [0, 1], where t0 = Σ_j |β̂_j^{OLS}|.


Shrinkage Factor s

Coe

ffici

ents

0.0 0.2 0.4 0.6 0.8 1.0

-0.2

0.0

0.2

0.4

0.6

••

•• • • • • • • • • • • • • • • • lcavol

• • • • ••

••

•• • • • • • • • • • • • • • • • lweight

• • • • • • • • • • • • • ••

• • • • • • • • • •age

• • • • • • • • • ••

••

•• • • • • • • • • • • lbph

• • • • • • ••

••

••

•• • • • • • • • • • • •svi

• • • • • • • • • • • • • • ••

••

••

••

••

• lcp

• • • • • • • • • • • • • • • • • • • • • • • • •gleason• • • • • • • • • •

••

•• • • • • • • • • •

••pgg45

Figure 3.9: Profiles of lasso coefficients, as tuning

parameter t is varied. Coefficients are plotted versus

s = t/∑p

1 |βj |. A vertical line is drawn at s = 0.5, the

value chosen by cross-validation. Compare Figure 3.7

on page 7; the lasso profiles hit zero, while those for

ridge do not.


Figure 3.12: Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.


A family of shrinkage estimators

Consider the criterion

β̂ = argmin_β Σ_{i=1}^N (y_i − x_i^T β)² subject to Σ_j |β_j|^q ≤ s

for q ≥ 0. The contours of constant value of Σ_j |β_j|^q are shown for the case of

two inputs.

Figure 3.13: Contours of constant value of Σ_j |β_j|^q for q = 4, 2, 1, 0.5, 0.1.

Thinking of |β_j|^q as the log-prior density for β_j, these are also the equi-contours

of the prior.

36

Page 39: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Use of derived input directions

Principal components regression

We choose a set of linear combinations of thexjs, and then regress the outcome

on these linear combinations.

The particular combinations used are the sequence of principal components of the

inputs. These are uncorrelated and ordered by decreasing variance.

If S is the sample covariance matrix of x1, . . . , xp, then the eigenvector equations

S qℓ = dℓ² qℓ

define the principal components of S.

37

Page 40: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Elements of Statistical Learning ©Hastie, Tibshirani & Friedman 2001 Chapter 3

[Figure: scatter of input data in the (X1, X2) plane, with the largest and smallest
principal component directions marked.]

Figure 3.8: Principal components of some input data points. The largest principal
component is the direction that maximizes the variance of the projected data, and
the smallest principal component minimizes that variance. Ridge regression projects
y onto these components, and then shrinks the coefficients of the low-variance
components more than the high-variance components.

38

Page 41: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Digression: some notes on Principal Components and the SVD (PCA.pdf)

39

Page 42: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

PCA regression continued

• Write q(j) for the ordered principal components, ordered from largest to
smallest value of dj².

• Then principal components regression computes the derived input columns
zj = Xq(j) and then regresses y on z1, z2, . . . , zJ for some J ≤ p.

• Since the zjs are orthogonal, this regression is just a sum of univariate
regressions:

ŷ^pcr = ȳ + ∑_{j=1}^J γj zj

where γj is the univariate regression coefficient of y on zj.
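A minimal numpy sketch of principal components regression as described above (my own illustration, not from the slides; it assumes X and y have been centered):

```python
import numpy as np

def pcr_fit(X, y, J):
    S = X.T @ X / X.shape[0]            # sample covariance of the (centered) inputs
    d2, Q = np.linalg.eigh(S)           # eigenvalues d_j^2 and eigenvectors q_j
    order = np.argsort(d2)[::-1][:J]    # keep the J largest-variance components
    Z = X @ Q[:, order]                 # derived inputs z_j = X q_(j)
    gamma = (Z.T @ y) / np.sum(Z**2, axis=0)   # univariate coefficients (z_j orthogonal)
    return Q[:, order] @ gamma          # implied coefficient vector on the original scale

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)); X = X - X.mean(axis=0)
y = X[:, 0] + rng.normal(size=100); y = y - y.mean()
beta_pcr = pcr_fit(X, y, J=3)
yhat = X @ beta_pcr                     # equals ybar + sum_j gamma_j z_j (ybar = 0 here)
```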

40

Page 43: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

• Principal components regression is very similar to ridge regression: both

operate on the principal components of the input matrix.

• Ridge regression shrinks the coefficients of the principal components, with

relatively more shrinkage applied to the smaller components than the larger;

principal components regression discards the p − J smallest eigenvalue
components.

Elements of Statistical Learning ©Hastie, Tibshirani & Friedman 2001 Chapter 3

[Figure: shrinkage factor versus principal component index for ridge and for
principal components regression.]

Figure 3.10: Ridge regression shrinks the regression coefficients of the principal
components, using shrinkage factors dj²/(dj² + λ) as in (3.47). Principal component
regression truncates them. Shown are the shrinkage and truncation patterns
corresponding to Figure 3.6, as a function of the principal component index.

41

Page 44: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Partial least squares

This technique also constructs a set of linear combinations of the xjs for
regression, but unlike principal components regression, it uses y (in addition to
X) for this construction.

• We assume that y is centered and begin by computing the univariate
regression coefficient γj of y on each xj

• From this we construct the derived input z1 = ∑j γj xj, which is the first
partial least squares direction.

• The outcome y is regressed on z1, giving coefficient β1, and then we
orthogonalize y, x1, . . . , xp with respect to z1: r1 = y − β1 z1, and
xℓ* = xℓ − θℓ z1

• We continue this process, until J directions have been obtained.
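A minimal numpy sketch of the construction just described (my own illustration, not from the slides; X and y are assumed centered, and with orthogonal zj's, using y or the current residual in the inner products is equivalent):

```python
import numpy as np

def pls_fit(X, y, J):
    Xk = X.copy()
    yhat = np.zeros_like(y)
    for _ in range(J):
        gamma = Xk.T @ y                   # univariate strength of each input on y
        z = Xk @ gamma                     # derived direction z = sum_j gamma_j x_j
        beta = (z @ y) / (z @ z)           # regress y on z
        yhat = yhat + beta * z
        theta = (Xk.T @ z) / (z @ z)       # orthogonalize each x_j with respect to z
        Xk = Xk - np.outer(z, theta)
    return yhat

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6)); X = X - X.mean(axis=0)
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=100); y = y - y.mean()
fitted = pls_fit(X, y, J=2)
```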

42

Page 45: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

• In this manner, partial least squares produces a sequence of derived inputs or
directions z1, z2, . . . , zJ.

• As with principal components regression, if we continue on to construct
J = p new directions we get back the ordinary least squares estimates; use of
J < p directions produces a reduced regression.

• Notice that in the construction of each zj, the inputs are weighted by the
strength of their univariate effect on y.

• It can also be shown that the sequence z1, z2, . . . , zp represents the conjugate
gradient sequence for computing the ordinary least squares solutions.

43

Page 46: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Ridge vs PCR vs PLS vs Lasso

Recent studies have shown that ridge and PCR outperform PLS in prediction, and

they are simpler to understand.

Lasso outperforms ridge when there are a moderate number of sizable effects,

rather than many small effects. It also produces more interpretable models.

These are still topics for ongoing research.

44

Page 47: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 1

Regularized Optimization, Boosting,

and Some Connections between

Them

Saharon Rosset (IBM Research)
Collaborators: Ji Zhu (Michigan), Trevor Hastie (Stanford)

Page 48: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 2

Predictive modeling

Given n data samples (xi, yi), i = 1, . . . , n, with xiᵀ ∈ Rp,

generated independently from a data distribution:

y = f(x) + ε(x)

(f — fixed; ε — random)

We want to find a "good" model f(x) to describe the deterministic part.

Definition of "good" is typically in terms of E_X L(y, f(x)), where L depends on the
problem.

Page 49: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 3

Corporate Data Bases

Many tables, relational database.

Page 50: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 4

Motivation

Modern data (Data Mining, Machine Learning etc.) is:

• High dimensional

– By nature: micro-arrays, scientific data, customer databases

– Computational tool: data often projected into high dimensional space:

kernel methods, wavelets, boosting’s weak hypotheses, etc.

• Noisy and dirty (e.g. customer databases)

• Contains many irrelevant predictors (e.g. customer databases, micro-arrays)

Fitting models without controlling complexity results in:

• Badly over-fitted models

• Useless for prediction or interpretation

Page 51: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 5

Illustrative example

100 data points, 80 dimensional space. True model:

yi = xi1 + εi

εi ∼ N(0, 1), i.i.d.

We are fitting a linear regression model of the form:

f(x) = x · β

Page 52: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 6

Unregularized model projected to x1

Unregularized model: β = arg minβ ‖yi − xiβ‖2

[Figure: y versus x_1 with the unregularized model's fit.]

Page 53: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 7

Appropriately regularized model

We impose an l1 constraint on the model:

β = arg min‖β‖1≤1

‖yi − xiβ‖2

[Figure: y versus x_1, showing the non-regularized and the l1-regularized fits.]

Page 54: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 8

Prediction problems

• Training data (x1, y1), . . . , (xn, yn)

• Input xi ∈ Rp

• Output yi

– Regression: yi ∈ R

– Two class classification: yi ∈ 1,−1

• Wish to find a prediction model for future data

f : x ∈ Rp → R

Regression: predict f(x)

Classification: predict sign of f(x)

• Generally take f(x) = xβ (linear model)

– Can be linear in a basis expansion (kernel/wavelets etc.)

Page 55: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 9

The regularized optimization problem

β(λ) = arg min_β ∑i C(yi, xiβ) + λJ(β)

Where:

• C is a convex loss, describing the “goodness of fit” of our model to training

data

– Regression: C(y, f) = C(y − f) function of residual

– Classification: C(y, f) = C(yf) function of margin

• J(β) is a model complexity penalty.

Typically J(β) = ‖β‖qq i.e. penalize lq norm of model, q ≥ 1.

• λ ≥ 0 is a regularization parameter

– As λ→ 0, we approach non-regularized model

– As λ→∞, we get that β(λ)→ 0
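As a concrete instance of this formulation, here is a minimal sketch (my own illustration, not from the talk) that solves the squared error loss with an l1 penalty by proximal-gradient (ISTA) iterations; the 1/2 factor on the loss is only a scaling convention for λ:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_regularized_ls(X, y, lam, iters=500):
    # minimize 0.5 * ||y - X beta||^2 + lam * ||beta||_1
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1/L, L = largest eigenvalue of X'X
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ beta - y)             # gradient of the smooth loss term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 80))
y = X[:, 0] + rng.normal(size=100)
beta_hat = l1_regularized_ls(X, y, lam=20.0)    # for large enough lam, most entries are 0
```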

Page 56: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 10

Examples

• Regularized linear regression:

Squared error loss: C(y, f) = (y − f)2

– Ridge regression uses l2 penalty J(β) = ‖β‖22

– The Lasso (Tibshirani 96) uses l1 penalty J(β) = ‖β‖1

• Support Vector Machines:

Hinge loss: C(y, f) = (1− yf)+

– Standard (2-norm) SVM uses l2 penalty ‖β‖22

– 1-norm SVM uses l1 penalty ‖β‖1

Page 57: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 11

Considerations in selecting loss

β(λ) = arg min_β ∑i C(yi, xiβ) + λ‖β‖q^q

“Classical” view: loss should correspond to data log-likelihood

• Squared error loss corresponds to Gaussian errors

• Logistic regression uses binomial likelihood

Pragmatic view: need to do well on data

• Robustness considerations: sensitivity to incorrect error model

• Computational considerations: can we solve the problem efficiently

Page 58: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 12

Some loss functions for regression and

classification

[Figure: regression losses (squared error, Huber's loss) as a function of the
residual, and classification losses (exponential, logistic, hinge) as a function of
the margin.]

Page 59: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 13

Considerations in selecting penalty

β(λ) = arg min_β ∑i C(yi, xiβ) + λ‖β‖q^q

Two perspectives on penalty:

• Bayesian: prior over the model space

– reg. optimization solution is maximum posterior likelihood

• Limit model space to avoid over-fitting

Considerations in selecting penalty:

• Adequacy of penalty (implied prior)

– Sparsity considerations (l1 penalty encourages sparsity)

• Computational considerations

Page 60: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 14

l1, l2 and l∞ penalties in R2

[Figure: unit balls of the l1, l2 and l∞ penalties in R2.]

Page 61: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 15

Regularization parameter: balancing loss and

penalty

β(λ) = arg min_β ∑i C(yi, xiβ) + λ‖β‖q^q

Theoretical approaches to selecting λ:

• Bayesian: λ is “strength of prior”

• Frequentist: use loss + complexity penalty (Cp, AIC etc.)

Practical approach:

1. Solve for many (or all) values of λ.

2. Select based on cross-validation error
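A minimal sketch of this practical approach for ridge regression (my own illustration, not from the talk), where each λ has a closed-form solution and K-fold cross-validation picks the best value:

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_select_lambda(X, y, lambdas, K=5, seed=0):
    folds = np.random.default_rng(seed).permutation(len(y)) % K
    cv_err = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            tr, te = folds != k, folds == k
            beta = ridge_fit(X[tr], y[tr], lam)
            errs.append(np.mean((y[te] - X[te] @ beta) ** 2))
        cv_err.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_err))]

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))
y = X[:, 0] + rng.normal(size=80)
best_lam = cv_select_lambda(X, y, np.logspace(-3, 3, 25))
```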

Page 62: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 16

Equivalent constrained formulation

β(S) = arg min_β ∑i C(yi, xiβ)   s.t.   ‖β‖q^q ≤ S

Both formulations are equivalent when loss and penalty are convex, with the
following property:

{β(λ) : λ ∈ R} ⊆ {β(S) : S ∈ R}

Under most conditions we will consider, the two sets are actually equal.

We use both formulations interchangeably.

Page 63: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 17

Illustration: Lasso and Huberized lasso

• n = 100, p = 80.

• All xij are i.i.d N(0, 1) and the true model is:

yi = 10 · xi1 + εi

εi ∼ 0.9 · N(0, 1) + 0.1 · N(0, 100), i.i.d.

• Sparsity implies l1 penalty is appropriate

• Compare l1-regularized paths using Huber’s loss and squared error loss

Page 64: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 18

[Figure: coefficient paths for the Huberized lasso (left) and the lasso (right),
plotted against ‖β(λ)‖1.]

Squared error curves for the two solution paths

[Figure: squared error versus ‖β(λ)‖1 for the lasso and the Huberized lasso.]

Page 65: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 19

Boosting: warmup

• Introduced in the machine learning community by Freund and Schapire

(1996).

• Extremely successful in practice

• Main idea:

Iteratively build prediction model by fitting re-weighted versions of the data

– Weights emphasize badly fitted data points

– Each iteration builds a “weak” learner to model current weighted data

• Boosting can be interpreted as “coordinate descent” in high dimensional

predictor space (Mason et al 99, Friedman 2001)

Page 66: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 20

Schematic of boosting

Training sample

Weighted sample

Weighted sample

G1(x)

G2(x)

GM (x)

Final prediction model: sign(∑i αi Gi(x))

Page 67: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 21

Boosting analysis: outline

• AdaBoost and its interpretations

– Boosting as gradient descent

– Margins view of boosting

• Relation of boosting to `1-constrained optimization

• Convergence of `p-constrained optimization of classification loss functions to

an “ `p-margin” maximizing separator

• Conclusions:

– Boosting approximately corresponds to `1-constrained optimization

– Classification boosting (AdaBoost and LogitBoost) "converge" to

`1-optimal separator, compared to `2-optimal for SVM

Page 68: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 22

Schematic of Talk Structure

[Diagram: connections between Boosting, Constrained Optimization, Margins, and SVM.]

Page 69: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 23

Boosting basics

Given:

• Data {(xi, yi)}, i = 1, . . . , n, with xi ∈ Rp and yi ∈ {−1, +1}

• Convex loss criterion L(y, f)

• Dictionary H of "weak classifiers", i.e. ∀h ∈ H, h : Rp → {−1, +1}

– Example: all decision trees with up to k splits

Page 70: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 24

Boosting basics (ctd)

We want to find a "good" linear combination:

F(x) = ∑_{hj∈H} βj hj(x)

such that ∑i L(yi, F(xi)) is small.

In boosting this is done incrementally, i.e. at step T our model is:

F_T(x) = ∑_{t≤T} αt ht(x)

Page 71: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 25

AdaBoost algorithm (Freund and Schapire 1995)

1. Initialize: wi ≡ 1

2. While (improvement on test set)

(a) Look for ht = arg min_{h∈H} ∑i wi I{yi ≠ h(xi)} (minimizes weighted
misclassification error)

(b) errt = ∑i wi I{yi ≠ ht(xi)} / ∑i wi

(c) Set αt = log((1 − errt)/errt)

(d) wi ← wi · exp(αt I{yi ≠ ht(xi)})

3. Output model F(x) = ∑t αt ht(x) and classifier: sign(F(x))
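A minimal sketch of the algorithm above with decision stumps as the dictionary H (my own illustration; the stopping rule here is simply a fixed number of iterations rather than test-set improvement):

```python
import numpy as np

def stump_predict(X, j, thresh, s):
    # weak classifier h(x) = s if x_j > thresh, else -s, with values in {-1, +1}
    return np.where(X[:, j] > thresh, s, -s)

def fit_stump(X, y, w):
    best = None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for s in (-1.0, 1.0):
                err = np.sum(w * (stump_predict(X, j, thresh, s) != y))
                if best is None or err < best[0]:
                    best = (err, j, thresh, s)
    return best

def adaboost(X, y, M=20):
    w = np.ones(len(y))                                  # step 1: w_i = 1
    model = []
    for _ in range(M):
        err, j, thresh, s = fit_stump(X, y, w)           # step 2(a)
        err = np.clip(err / w.sum(), 1e-10, 1 - 1e-10)   # step 2(b)
        alpha = np.log((1 - err) / err)                  # step 2(c)
        w = w * np.exp(alpha * (stump_predict(X, j, thresh, s) != y))  # step 2(d)
        model.append((alpha, j, thresh, s))
    return model

def predict(model, X):
    F = sum(a * stump_predict(X, j, t, s) for a, j, t, s in model)
    return np.sign(F)                                    # step 3: classifier sign(F(x))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
clf = adaboost(X, y, M=30)
train_err = np.mean(predict(clf, X) != y)
```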

Page 72: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 26

AdaBoost as Gradient Descent

It has been shown that AdaBoost is “coordinate descent” with exponential loss:

L(y, Ft(x)) = exp(−yFt(x))

The criterion for selecting the next ht is to minimize

∂/∂βj ∑i L(yi, Ft(xi)),   i.e. to maximize 〈−∇L(Ft(x)), hj(x)〉

ht is the best "canonical" improvement direction, to first order

The AdaBoost αt is chosen via a line search

• We will consider αt ≡ ε — which is “stronger”, empirically better and

theoretically more tractable

Page 73: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 27

Practical importance of boosting approaches

• Computationally friendly when |H| is large:

– Does not require second derivatives and matrix inversion.

– Greedy search algorithms allow finding best direction “approximately”

– Mainly in situations where there is no explicit β at all, rather a dictionaryH

from which a “best” member is chosen every time using heuristics (e.g.

decision trees using greedy methods).

• Empirically shown to do very well

– AdaBoost (Freund and Schapire 95) and other boosting algorithms are

best “off the shelf” classifiers according to Breiman

Page 74: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 28

Other gradient-based boosting algorithms

This methodology can be applied to any function estimation problem

• Friedman, Hastie and Tibshirani (2000) use binomial log-likelihood loss:

L(y, Ft(x)) = log(1 + e−yFt(x))

• Friedman (2001) applies it to regression problems with various losses

• Rosset and Segal (NIPS 2002) apply it to density estimation with

log-likelihood criterion : L(Ft(x)) = −log(Ft(x)).

Page 75: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 29

Margin Basics

• Margin of separating hyper-plane ∑_{hj∈H} βj hj(x) = 0 is the Euclidean
distance of the closest point:

min_i  yi β′h(xi) / ‖β‖2

• Non-regularized SVM solution maximizes minimal margin

• SVM literature: large margins⇒ “small” prediction error

[Figure: separable two-class data with a separating hyper-plane.]

Page 76: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 30

Margins in Boosting

• Boosting margin of model F(x) = ∑t αt ht(x) is defined as:

min_i  yi F(xi) / ∑t |αt|  ∈ [−1, +1]

• Basis representation for finite |H|: ∑t αt ht = ∑_{hj∈H} βj hj

• ‖β‖1 = ∑j |βj| ≤ ∑t |αt|, with equality e.g. if αt ≥ 0 ∀t (monotonicity)

[Figure: separable two-class data with a separating hyper-plane.]

Page 77: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 31

The two margin definitions

Euclidean distance (SVM margin) between a data point and the "hyper-plane"
∑_{hj∈H} βj hj(x) = 0:

yi β′h(xi) / ‖β‖2

Normalized Boosting margin:

yi β′h(xi) / ‖α‖1  =  (yi β′h(xi) / ‖β‖2) · (‖β‖2 / ‖β‖1) · (‖β‖1 / ‖α‖1)

Differences:

• `1 vs `2 norm - encourages ”sparse” representations

• ‖β‖1 ≤ ‖α‖1 - sign consistency (“monotonicity”) assures equality
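A minimal sketch (my own illustration) computing the two quantities for a given coefficient vector, where H is assumed to be the n × |H| matrix of weak-learner outputs hj(xi):

```python
import numpy as np

def svm_margin(H, y, beta):
    # Euclidean (l2-normalized) margin of the closest point
    return np.min(y * (H @ beta)) / np.linalg.norm(beta, 2)

def boosting_margin(H, y, beta):
    # l1-normalized margin; lies in [-1, 1] when |h_j(x)| <= 1
    return np.min(y * (H @ beta)) / np.linalg.norm(beta, 1)

rng = np.random.default_rng(5)
H = np.sign(rng.normal(size=(50, 10)))      # +/-1 weak-learner outputs
beta = rng.exponential(size=10)             # nonnegative coefficients ("monotone" case)
y = np.sign(H @ beta)                       # separable by construction
print(svm_margin(H, y, beta), boosting_margin(H, y, beta))
```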

Page 78: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 32

Boosting as a margin-maximizing process

Boosting the Margin - (Schapire et al. 1998, Annals):

• Prove that “weak learnability” (=separability) increases margins

• Experimentally show boosting increases margins

• Discuss geometric interpretation

• Generalization error bounds for finite basis, infinite basis, as function of

margin distribution, e.g.: with probability ≥ 1 − δ,

P_Te(yF ≤ 0) ≤ P_Tr(yF ≤ θ) + O(n^{−.5} (log|H|)^{.5} θ^{−1} log(δ)^{−.5})

Plenty of other papers about boosting and margins

Page 79: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 33

Advantages(?) of margins view

• Explains behavior of Adaboost in separable case:

– Seeks to maximize minimal margin, consequently finds a “good”

separating hyper-plane - similar to SVM

– Loss criterion view does not give such intuitions:

any separating hyper-plane, scaled up, drives exponential loss to 0.

• Generalization error bounds as function of minimal margin:

– Breiman (97) directly maximized margins, attained bad generalization

performance

– That’s not surprising, since margin maximization is clearly overfitting

Page 80: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 34

What we have learned so far

[Diagram: connections between Boosting, Constrained Optimization, Margins, and SVM.]

Page 81: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 35

Next steps

[Diagram: connections between Boosting, Constrained Optimization, Margins, and SVM.]

Page 82: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 36

Constrained (regularized) optimization

We want to find β(c) which achieves:

min_{‖β‖1≤c} ∑i L(yi, β′h(xi))

i.e. the optimal solution with `1 norm c.

What is the relation of this solution to the ε-boosting solution with `1 norm c (i.e.

after c/ε iterations)?

Page 83: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 37

Relation of boosting to regularized optimization

Consider the local "monotone" optimization problem:

min L(β)
s.t. ‖β‖1 − ‖β0‖1 ≤ ε
     |β| ≥ |β0| component-wise

It's easy to see:

lim_{ε→0} |(β − β0)k| / ε > 0  ⇒  k = arg max_j |∇L(β0)j| = arg max 〈−∇L(Ft(x)), h(x)〉

k is unique "almost everywhere" in our space, so we are choosing the direction of
the best monotone path.

We may conjecture that if this "monotonicity" holds on the optimal path, then
ε-boosting converges to the optimal regularized path.

Page 84: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 38

ε-Boosting and `1 constrained fitting

For squared error loss regression (from Efron et al. 2002):

Lasso: β(c) = arg min_{‖β‖1≤c} ‖y − Xβ‖2²

"Stagewise": the ε-boosting coefficients

[Figure: Lasso coefficient paths (left) and Stagewise (ε-boosting) coefficient paths
(right), plotted against t = ∑j |β̂j|.]
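A minimal sketch of the ε-boosting ("incremental forward stagewise") procedure for squared error loss (my own illustration, not the LARS code of Efron et al.):

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, steps=5000):
    beta = np.zeros(X.shape[1])
    path = [beta.copy()]
    for _ in range(steps):
        r = y - X @ beta
        corr = X.T @ r                       # negative gradient coordinates
        j = np.argmax(np.abs(corr))          # best "canonical" direction
        beta[j] += eps * np.sign(corr[j])    # tiny step in that coordinate
        path.append(beta.copy())
    return np.array(path)                    # rows trace out the stagewise path

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10)); X = X - X.mean(axis=0)
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100); y = y - y.mean()
path = forward_stagewise(X, y)               # compare with a lasso path over ||beta||_1
```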

Page 85: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 39

What about other loss functions?

For classification with binomial log-likelihood loss:

`1 constrained solutions (left), ε-boosting path (right)

[Figure: exact `1-constrained solution paths (left) and ε-stagewise paths (right);
coefficient values plotted against ||β||1.]

Page 86: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 40

Partial theoretical results

Denote:

β(c) = arg min_{‖β‖1≤c} ∑i L(yi, β′h(xi))

β(ε)(c) is the ε-boosting coefficient vector for `1 norm c.

Theorem 1: If β(c) is strongly monotone in all coordinates ∀c < c0, then
lim_{ε→0} β(ε)(c0) = β(c0)

• Much stronger condition on derivatives along the optimal path

We also have a "local" result:

Theorem 2: Under monotonicity only, if we denote by γ(ε) the ε-stagewise
"direction" starting from β(c0), then:

lim_{ε→0} γ(ε) = dβ(c)/dc |_{c=c0}

• (Efron et al 02) proved for squared error loss, we generalized to any convex

loss

Page 87: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 41

`p constrained classification losses

Consider the constrained optimization problem:

β(p)(c) = arg min_{‖β‖p≤c} ∑i L(yi, β′h(xi))

With the loss being either the exponential or the log-likelihood:

Le(y, β′h(x)) = exp(−y β′h(x))

Ll(y, β′h(x)) = log(1 + exp(−y β′h(x)))

Page 88: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 42

Convergence to “ `p- optimal” separating hyper-plane

Define:

β(p) = lim_{c→∞} β(p)(c) / c

Theorem 3: If the data is separable, then with either Le or Ll,

β(p) = arg max_{‖β‖p=1} min_i yi β′h(xi)

Interpretation: the normalized constrained optimizer “converges” to an “`p-margin

maximizing” separating hyper-plane

Page 89: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 43

Boosting interpretation

We can conclude that ε-boosting tends to converge to the `1-margin maximizing

separating hyper-plane

[Figure: minimal margin (left) and test error (right) as a function of ||β||1, for
exponential loss, logistic loss, and AdaBoost.]

Page 90: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 44

Boosting and support vector machines

In the separable case:

• SVM non-regularized solution is β(2)

• Boosting non-regularized solution is β(1)

• Differences:

– Boosting margin vs. SVM margin (Euclidean distance)

– Different loss functions⇒ different regularized paths

• “`2 ε-boosting” follows a different regularized path to “SVM” solution

– Choose coefficient to change according to maxh−∇L(Ft(X))′h(X)

βt,h

In non-separable case even the non-regularized solutions would be different

Page 91: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 45

Simple data example

Same example as before with additional large mass (20 observations) at “far”

point

[Figure: the experiment data, the two-class example with an additional mass of 20
observations at a distant point.]

Page 92: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 46

Convergence of `1 and `2 boosting paths to optimal

separator

[Figure: normalized L1-boosting coefficients versus ||β||1 (left) and normalized
L2-boosting coefficients versus ||β||2 (right), compared with the optimal (and SVM)
coefficient values for the two variables.]

Page 93: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 47

More interesting example: Boosting vs. `2 boosting

Boosting `2 boosting

[Figure: decision boundaries for boosting after 10^5 and 3·10^6 iterations (left)
and for `2 boosting after 5·10^6 and 10^8 iterations (right), compared with the
optimal separator.]

Page 94: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 48

Summary

• Boosting related to `1-constrained fitting

– Can define `p boosting algorithms to correspond to `p constraints

• `p constrained classification loss solutions converge to “`p-margin”

maximizers in separable case

– Has implication on understanding of logistic regression

• A common thread for boosting and SVM:

Computational trick for regularized fitting in high dimensional predictor

spaces

– SVM: kernel trick (`2 regularization)

– Boosting: coordinate descent (approximate `1 regularization)

Page 95: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 1

`1 regularization: properties and

computations

Saharon Rosset (IBM Research)
Collaborators: Ji Zhu (Michigan), Trevor Hastie, Rob Tibshirani (Stanford), Nathan
Srebro (TTI), Grzegorz Swirszcz (IBM Research)

Page 96: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 2

Results on `1 regularization

• Sparsity

• Piecewise linearity

• Applicability in very high or infinite dimensional embedding

spaces

Page 97: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 3

The regularized solution path

Fixing the loss, penalty and data, and varying the regularization parameter we get
the "path of solutions"

β(λ) , 0 ≤ λ <∞

This is a 1-dim curve through Rp.

• Interesting statistically, as the set of solutions to problems of

interest (Bayesian interpretation: changing prior variance)

• Often interesting computationally, as it has properties which

allow efficient “tracking” of this path

Page 98: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 4

Example: Lasso solution path in R10

[Figure: lasso coefficient paths in R10, coefficient values plotted against
t = ∑j |β̂j|.]

(from Efron et al. (2004). Least Angle Regression. Annals of Statistics)

Page 99: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 5

Sparseness propert(ies) of `1

regularized path

Page 100: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 6

`1, `2 and `∞ penalties in R2

[Figure: unit balls of the `1, `2 and `∞ penalties in R2.]

Page 101: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 7

Sparseness of `1 penalty: n > p

Shape of the `1 penalty implies sparseness. For large values of λ only few non-zero
coefficients.

[Figure: lasso coefficient paths; for large λ (small t = ∑j |β̂j|) only a few
coefficients are non-zero.]

Page 102: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 8

Sparseness: p > n

For any convex loss, assuming only "non-redundancy":

Theorem (e.g., Rosset et al. 2004)
Any `1 regularized solution has at most n non-zero components

Proof: Simple application of Caratheodory's Convex Hull Theorem.

Corollary
The limiting interpolating (or margin maximizing) solution also has at most n
non-zero components

Page 103: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 9

Some implications of sparseness

• Variable selection (obviously)

• `1-regularized problems are “easier” than, say, `2-regularized

ones

– Can give good solutions in p >> n situations

See:

Friedman, Hastie, Rosset, Tibshirani, Zhu (2004). Discussion

of three boosting papers. Annals of Statistics

Ng (2004). Feature selection, `1 vs `2 regularization and

rotational invariance. ICML-04

Page 104: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 10

Piecewise Linear Solution Paths

Page 105: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 11

Piecewise linear property

We want to identify situations where the path of solutions β(λ) , 0 ≤ λ <∞

is easy to generate.

One such situation is when β(λ) is piecewise linear in λ.

[Figure: a piecewise linear coefficient path; coefficient values plotted against ‖β‖1.]

Page 106: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 12

Primary example: the lasso

(Efron et al 03), (Osborne et al 00) show that for the lasso:

β(λ) = arg min_β ∑i (yi − xiβ)² + λ‖β‖1

β(λ) is piecewise linear in λ.

• Yields efficient algorithm for finding β(λ) , 0 ≤ λ <∞

– Cost is “approximately” one least-squares calculation

Page 107: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 13

Some properties of the Lasso regularized path

1. Sparsity: if p > n, any regularized solution β(λ) has at most n non-0

coefficients (property of `1 penalty)

2. High correlation:

β(λ)j ≠ 0  ⇒  |∂C(β)/∂βj |_{β=β(λ)}| = |xjᵀ(y − Xβ(λ))| = λ

3. Compactness: Number of “pieces” in the path is approximately min(n, p).

[Figure: a piecewise linear lasso coefficient path; coefficient values plotted against ‖β‖1.]

Page 108: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 14

Our key questions:

• What is the fundamental property of (loss, penalty) pairs which

yields piecewise linearity?

• Are there efficient algorithms to generate these regularized

paths?

• Are there statistically interesting members in these families?

Page 109: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 15

What makes paths piecewise linear?

Assume loss and penalty are both twice differentiable everywhere.

With some algebra we get:

∂β(λ)/∂λ = −(∇²C(β(λ)) + λ∇²J(β(λ)))⁻¹ ∇J(β(λ))

We want this derivative to be constant, thus:

A sufficient condition for piecewise linearity is that:

• The loss C is piecewise quadratic

• The penalty J is piecewise linear

Page 110: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 16

Building blocks for PWL regularized optimization

problems

Piecewise quadratic loss:

• Squared error loss: regression: (y − f)², classification: (1 − yf)²

• Huberized squared error loss (robust; see the sketch after this list):

  C(y, xβ) = (y − xβ)²                     if |y − xβ| ≤ m
             m² + 2m(|y − xβ| − m)         otherwise

• Piecewise linear loss: regression: |y − f|, classification: (1 − yf)+

Piecewise linear penalty:

• `1 penalty: J(β) = ∑j |βj| (gives sparse solutions)

• `∞ penalty: J(β) = maxj |βj |
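A minimal sketch of the Huberized squared error loss and the `1 penalty building blocks (my own illustration):

```python
import numpy as np

def huberized_squared_error(y, f, m=1.0):
    # quadratic for |y - f| <= m, linear beyond: piecewise quadratic, once differentiable
    r = np.abs(y - f)
    return np.where(r <= m, r**2, m**2 + 2 * m * (r - m))

def l1_penalty(beta):
    # piecewise linear penalty that gives sparse solutions
    return np.sum(np.abs(beta))

residuals = np.linspace(-3, 3, 7)
print(huberized_squared_error(residuals, 0.0, m=1.0))
```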

Page 111: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 17

Some Interesting Examples

Page 112: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 18

Regression: the Huberized lasso vs. the lasso

0 20 40 60 80

−5

05

10

0 50 100 150 200 250

−5

05

10

‖β(λ)‖1‖β(λ)‖1

ββ

Page 113: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 19

Squared error loss with `∞ penalty

0 100 200 300 400 500 600 700 800−800

−600

−400

−200

0

200

400

600

800

||β||∞

β

Page 114: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 20

Classification: 1-norm and 2-norm Support Vector

Machines

0.0 0.4 0.8 1.2

0.0

0.2

0.4

0.6

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

ββ

‖β‖1 ‖β‖22

1-norm SVM 2-norm SVM

Page 115: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 21

Multiple penalty problem: Protein Mass

Spectroscopy

(Tibshirani et al, in preparation)

• Predictors are "expression levels" along a spectrum of masses for proteins.

• Want to constrain model while keeping coefficients “smooth”.

• Solution: `1 penalty on coefficients, `1 penalty on successive differences:

β(λ1, λ2) = arg min_β ∑i (yi − xiβ)² + λ1‖β‖1 + λ2 ∑j |βj − βj−1|

• Solution path is piecewise affine in (λ1, λ2)

Page 116: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 22

Almost quadratic loss with `1

penalty

Page 117: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 23

Almost quadratic loss

We define almost quadratic loss as:

C(r) = a2(r) · r2 + a1(r) · r + a0(r)

Where:

• a2, a1, a0 : R → R are piecewise constant functions

• C(r) is (once) differentiable everywhere

• r = (y − xβ) the residual for regression

• r = yxβ the margin for classification

Page 118: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 24

Motivation for this family

• Piecewise linear solution paths

• `1 penalty⇒ sparse solutions

• Allows efficient, relatively simple algorithm

• Includes robust loss functions for regression and classification

Page 119: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 25

Algorithm

• Initialize: β = 0, A = arg maxj |(∇L(β))j|, γ = −sgn(∇L(β))A

• While (max |∇L(β)| > 0)

– d1 = min{d > 0 : |∇L(β + dγ)j| = |∇L(β + dγ)A| for some j ∉ A}

– d2 = min{d > 0 : (β + dγ)j = 0 for some j ∈ A} (hit 0)

– d3 = min{d > 0 : r(yi, xiβ + dγ) hits a "knot" for some i}

– set d = min(d1, d2, d3)

– If d = d1 then add the variable attaining equality at d to A.

– If d = d2 then remove the variable attaining 0 at d from A.

– β ← β + dγ

– B = ∑i a2(r(yi, xiβ)) xiAᵀ xiA

– γ = B⁻¹(−sgn(∇L(β))A)

Page 120: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 26

Loss functions of interest: robust, differentiable

Linear for outliers, squared around “corners”:

• Regression: Huberized squared error loss

• Classification: Huberized squared hinge loss:

[Figure: hinge loss (SVM) and Huberized squared hinge loss (almost quadratic) as
functions of the margin.]

Page 121: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 27

Computational complexity

Calculations in each step of our algorithm:

• Step size: find the length of current “piece”

– O(np) calculations (for each observation, figure when it hits a “knot”)

• Direction calculation: calculate the direction of the next “piece”

– O(min(n, p)2), using Sherman-Morrison-Woodbury updating formula

Number of steps of the algorithm:

• Difficult to bound in “worst case”

• Under mild assumptions it’s O(n).

Page 122: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 28

Computational complexity (ctd.)

Overall complexity is thus O(n2p) for both n > p and n < p

Compare to least squares calculation:

• O(np2) when n > p.

• O(n3) when n < p.

Page 123: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 29

Example: "Dexter" dataset (NIPS 03 challenge)

• n = 800 observations

• p = 1152 variables

• Use Huberized squared hinge loss

• Path has 452 “pieces”

• Inefficient R implementation takes about 3 minutes to generate

path on laptop.

Page 124: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 30

Validation error and number of non-0 coefficients

[Figure: validation error (left) and number of non-zero coefficients (right) along
the regularized path, plotted against ‖β‖1.]

Page 125: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 31

Summary• Regularization is fundamental in modern data modeling

• Considerations in selecting specific formulation:

– Statistical: robustness (loss), sparsity (penalty)

– Computational: efficient computation

• Piecewise linear solution path offer solutions that are:

– Robust: select appropriate loss function

– Adaptable: select regularization parameter adaptively

– Efficient: generate whole regularized path efficiently

Page 126: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 32

`1 regularization in infinite

dimensional feature spaces

Page 127: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 33

Outline• Regularized embeddings: kernels, boosting and all that

• Generalizing `1 regularization to non-countable dimension as

measure constraint

• Properties of `1 regularized solutions in infinite dimensions:

– Existence

– Sparsity: existence of finite-support optimal solution

– Optimality criteria

• Practical, exact `1 regularization in very high dimension via path

following

• Example: additive quadratic splines

Page 128: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 34

Regularized fitting

Generic supervised learning problem, given:

• x1, ..., xn ∈ Rp (or simply X_{n×p})

• y ∈ Rn for regression, y ∈ {±1}n for classification

Find model y ≈ f(x)

Linear models set f(x) = xᵀβ and often use regularized fitting:

β = arg min_{β∈Rp} L(y, Xβ) + λJ(β)   (or, min L s.t. J ≤ C)

Where L (loss) and J (penalty) are typically convex

J(β) = ‖β‖q is a typical choice, usually q ∈ {1, 2}

E.g.: Ridge regression, LASSO, Linear SVM,...

Page 129: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 35

Data embedding

We can increase the representation power of a linear model by embedding the data
into a high dimensional space, and fitting linear models there:

x → φ(x) ∈ R^Ω (typically |Ω| >> p)

f(x) = φ(x)ᵀβ

where Ω is the index set of the features in the high dimensional space

Simple example: p = 2 (+ intercept/bias), Ω is the set of degree-2 polynomials

x = (1, x1, x2)

φ(x) = (1, x1, x2, x1², x2², x1x2)
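A minimal sketch of this degree-2 embedding (my own illustration):

```python
import numpy as np

def phi(x):
    # x = (1, x1, x2)  ->  (1, x1, x2, x1^2, x2^2, x1*x2)
    _, x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

x = np.array([1.0, 0.5, -2.0])
beta = np.ones(6)
f_x = phi(x) @ beta        # linear model in the embedded space: f(x) = phi(x)' beta
```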

Page 130: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 36

Examples of embedding-based methods

• Kernel methods: φ often not explicitly defined but implicitly, through the inner
product kernel: K(x, y) = 〈φ(x), φ(y)〉.
Ω usually infinite.

• Wavelets: φ(x) is the wavelet basis values at x.

• Boosted trees: φ(x) is the set of all trees of a certain size, evaluated at x.
Ω can be made finite.

• Spline dictionary: with x ∈ [0, 1], Ω = [0, 1] and
φ(x) = {(x − a)^k_+ : a ∈ Ω}. Infinite (non-countable) dictionary.

Page 131: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 37

Embedding+regularization: kernel methods, boosting

Some of the most successful "modern" methods seem to rely on the right combination
of embedding and regularization:

• Kernel methods: implicit embedding into RKHS + exact `2

regularization + representer theorem

⇒ computational and statistical success

• Boosting: embedding into space of trees + (very) approximate `1

regularization + incremental implementation

⇒ computational and statistical success

What about exact `1 regularization in embeddings?

Page 132: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 38

`1 or `2 regularization?

Good question! Detailed discussion is outside our scope...

Easy answer (as always): be Bayesian

One important aspect is the sparsity property of `1 regularization:

Sparsity property

If |Ω| > n finite, then any `1 regularized problem has a

solution β containing at most n non-zero entries.

Does this still hold when Ω is infinite?

Page 133: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 39

Generalizing `1 regularization

We start from:

min_β ∑i L(yi, φ(xi)ᵀβ)   s.t.   ‖β‖1 ≤ C

By doubling the number of variables, βj = βj,+ − βj,−, and adding positivity
constraints, we can replace the norm by a sum:

min_β ∑i L(yi, φ(xi)ᵀ(β+ − β−))   s.t.   ∑j (βj,+ + βj,−) ≤ C,   β+, β− ≥ 0

Now we replace the sum by a positive measure:

min_{P∈P} ∑i L(yi, ∫_Ω φω(xi) dP(ω))   s.t.   P(Ω) ≤ C

Page 134: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 40

Understanding our generalization

A probability measure requires a probability space, hence a σ-algebra Σ over Ω.

We require {{ω} : ω ∈ Ω} ⊂ Σ

• If Ω is finite or countable this implies Σ = 2^Ω and hence
P(Ω) = ‖β‖1 as required

• In the non-countable case this still works!

Page 135: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 41

When does an optimal solution exist?

Theorem
If the set {φω(X) : ω ∈ Ω} ⊂ Rn is compact, then our problem has an optimal
solution

Corollary

If the set Ω is compact and the mapping φ.(X) : Ω→ Rn is

continuous, then our problem has an optimal solution.

Bottom line: under mild conditions, an optimal solution is

guaranteed to exist.

Page 136: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 42

The sparsity property in infinite dimension

Theorem:

Assume an optimal solution exists, then there exists an optimal

solution P (C) supported on at most n + 1 features in Ω.

Main idea of proof:

- Consider A = {φω(X) : ω ∈ Ω} ⊂ Rn

- Show that any z ∈ co(A) (convex hull) can be represented as

convex combination of n + 1 points

(for finite Ω this is just Caratheodory’s convex hull theorem)

⇒ any infinite-support measure can be approximated by one

supported on n + 1 features

Page 137: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 43

Optimality criterion

Suppose we are presented with a finite-support solution P(C).

How can we verify it is optimal?

Answer: we only need to verify it is optimal in any finite feature set

containing its support

Theorem

If an optimal solution to the regularized problem exists, and we are

presented with a finite-support candidate solution P supported on

A = ω1, ..., ωk with k ≤ n + 1 then:

P is optimal solution⇔ ∀B ⊂ Ω s.t. A ⊆ B, |B| <∞, P is

optimal solution for the problem in PB


Summary of mathematical/statistical properties we prove

• Under a boundedness + continuity condition, an optimal solution exists
• There is always a sparse optimal solution with at most n + 1 features
• Given a finitely supported solution, we can test its optimality by considering only finite problems on supersets of its support

Now, can we actually find the solution?


Path following algorithms

Some regularized problems can take advantage of viewing the solution set {β(λ) : λ ∈ ℝ} as a path in ℝ^{|Ω|} and following it efficiently:

• Lasso (quadratic loss + ℓ1 penalty): LARS-Lasso of Efron et al. (2004) (also earlier work from Osborne et al.)
• SVM by Hastie et al. (2004), LP-SVM by Zhu et al. (2004)
• etc.


Lasso and LARS

Lasso:

    β(λ) = argmin_β ‖y − Xβ‖²_2 + λ‖β‖_1,   with X ∈ ℝ^{n×p}, y ∈ ℝ^n, β ∈ ℝ^p.

LARS-Lasso (Efron et al. 2004) is a homotopy algorithm that generates the path β(λ) for all λ efficiently. Algebraically, we can derive LARS-Lasso from the KKT conditions:

    β(λ)_j ≠ 0 ⇒ |X_j^T (y − Xβ(λ))| = λ    (1)
    β(λ)_j = 0 ⇐ |X_j^T (y − Xβ(λ))| < λ    (2)
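As a concrete check of (1)–(2), here is a small numpy sketch (not the LARS-Lasso algorithm itself, and not from the talk) that fits the lasso by coordinate descent and then verifies the KKT conditions; it assumes the (1/2)·squared-loss scaling so that the conditions hold with λ exactly, and the helper names are purely illustrative:

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_cd(X, y, lam, n_sweeps=500):
        """Coordinate descent for (1/2)*||y - X b||^2 + lam*||b||_1."""
        n, p = X.shape
        beta = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0)
        for _ in range(n_sweeps):
            for j in range(p):
                r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual
                beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
        return beta

    def kkt_ok(X, y, beta, lam, tol=1e-4):
        """Check |X_j^T (y - X beta)| = lam on the active set, <= lam off it."""
        g = X.T @ (y - X @ beta)
        active = np.abs(beta) > 1e-8
        return bool(np.all(np.abs(np.abs(g[active]) - lam) < tol) and
                    np.all(np.abs(g[~active]) <= lam + tol))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(50)
    beta_hat = lasso_cd(X, y, lam=5.0)
    print(beta_hat.round(3), kkt_ok(X, y, beta_hat, lam=5.0))

The same KKT check is what the optimality criterion above reduces to on a finite support set.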


Schematic of LARS-Lasso

1. Preliminaries
2. Loop:
   (a) Find the next variable to add to the active set A:
       d_add, the step size at which a variable not in A attains equality in (1)
   (b) Find the next variable to remove from the active set:
       d_rem, the step size at which a coefficient in the active set hits 0
   (c) Make a step of min(d_add, d_rem) and modify the active set accordingly
   (d) Calculate the new LARS direction:
       γ = −(X_A^T X_A)^{−1} sgn(X_A^T (y − Xβ(λ)))


Can we do LARS-Lasso in infinite dimensional embeddings?

Going back to the schematic of LARS-Lasso: only the computation of d_add requires considering the high (here infinite) dimension.

Therefore, if:

1. We have sparsity ✓
2. We can search over Ω for the next feature efficiently

⇒ we can apply LARS-Lasso and find the full path (optimality guaranteed by our criterion)


Search problem for LASSO

Formally:

    d_add = min { d > 0 : ∃ ω ∉ A,  −φ_ω(X)^T (y − φ_A(X)β(λ_0) − d·φ_A(X)γ_A) = λ_0 − d }

We can re-write it as d_add = min_{ω ∈ Ω−A} d(ω), where d(ω) is the value attaining equality for the dictionary function indexed by ω. Specifically we get:

    d(ω) = ( φ_ω(X)^T r + λ_0 ) / ( φ_ω(X)^T φ_A(X) γ_A + 1 )
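The formula is easy to turn into a search routine once a candidate set is fixed. The sketch below is illustrative only: the residual r, the current λ_0, and the vector φ_A(X)γ_A are placeholders standing in for the current LARS-Lasso state, and a grid search over knots stands in for the exact minimization discussed later. It evaluates d(ω) for piecewise-quadratic candidate features φ_ω(x) = (x − ω)²_+ and returns the smallest positive step:

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0.0, 1.0, 30))   # data points in [0, 1]
    r = 0.5 * rng.standard_normal(30)        # placeholder: current residual y - phi_A(X) beta(lambda_0)
    u = 0.1 * rng.standard_normal(30)        # placeholder: phi_A(X) @ gamma_A
    lam0 = 1.0                               # placeholder: current value of lambda

    def phi(omega):
        """Candidate feature phi_omega(X) for a knot omega (k = 3 truncated power)."""
        return np.maximum(x - omega, 0.0) ** 2

    def d_of_omega(omega):
        """Step size at which candidate omega attains equality in the KKT condition."""
        f = phi(omega)
        return (f @ r + lam0) / (f @ u + 1.0)

    candidates = np.linspace(0.01, 1.0, 200)
    d_vals = np.array([d_of_omega(w) for w in candidates])
    pos = d_vals > 0                         # only positive step sizes are admissible
    w_add = candidates[pos][np.argmin(d_vals[pos])]
    print("next knot:", round(w_add, 3), "d_add:", round(float(d_vals[pos].min()), 3))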


Spline bases

Assume our data points x_i are in [0, 1].

A polynomial spline of order k is a piecewise polynomial of degree k − 1 with k − 2 continuous derivatives. E.g., a second order spline is a continuous piecewise linear function.

Dictionary for a kth order spline:

    Φ_k = { 1, x, ..., x^{k−2}, x^{k−1} } ∪ { (x − a)^{k−1}_+ : a ∈ (0, 1] }
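A minimal numpy sketch of this dictionary evaluated at the data points (the function name and the fixed knot grid are illustrative; in the path algorithm the knot set is not fixed in advance):

    import numpy as np

    def spline_dictionary(x, k, knots):
        """Order-k truncated power dictionary at the points x:
        monomials 1, x, ..., x^(k-1), plus (x - a)_+^(k-1) for each knot a."""
        monomials = [x ** m for m in range(k)]
        truncated = [np.maximum(x - a, 0.0) ** (k - 1) for a in knots]
        return np.column_stack(monomials + truncated)

    x = np.linspace(0.0, 1.0, 11)
    Phi = spline_dictionary(x, k=3, knots=[0.25, 0.5, 0.75])   # piecewise quadratic
    print(Phi.shape)                                           # (11, 6)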


Total-variation penalties and regularized splines

Start from the general nonparametric problem with x ∈ ℝ:

    f̂ = argmin_{f ∈ C^{(k−1)}}  Σ_i (y_i − f(x_i))² + λ · TV(f^{(k−1)})

Most general result:

Theorem (e.g. Mammen & van de Geer 97)
The optimal solution f̂ can be represented as a k-th order spline with at most n knots.

Since roughly TV(f^{(k−1)}) = (k − 1)! · P(Ω), our results prove this theorem in one line!
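A quick numeric illustration of the identity TV(f^{(k−1)}) = (k − 1)! · P(Ω) for k = 3, with made-up coefficients; here P(Ω) corresponds to Σ_j |b_j|, the total mass placed on the truncated-power features:

    import math
    import numpy as np

    knots = np.array([0.2, 0.5, 0.8])
    b = np.array([1.5, -2.0, 0.7])        # weights on the truncated quadratics (x - a)_+^2
    c = np.array([0.3, -1.0, 0.5])        # coefficients of 1, x, x^2 (unpenalized part)

    x = np.linspace(0.0, 1.0, 100001)
    # Second derivative from the basis: d^2/dx^2 (x - a)_+^2 = 2 * 1[x > a]
    f2 = 2.0 * c[2] + 2.0 * (b * (x[:, None] > knots)).sum(axis=1)

    tv = np.abs(np.diff(f2)).sum()        # total variation of f'' on the grid
    print(tv, math.factorial(3 - 1) * np.abs(b).sum())   # both equal 2 * sum |b_j|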


What do we know about TV-penalized spline solutions?

• For k < 3 one can show (Mammen and van de Geer 97) that this spline has knots at the data points: an ℓ1 “representer” theorem!
• They propose efficient algorithms for solving the k ∈ {1, 2} cases; these can be rephrased as versions of LARS-Lasso with n variables (constant/linear spline basis)
• For k ≥ 3 they only offer a LARS-like approximate algorithm with knots at the data points

But if we can solve the next-feature search problem, we can apply our algorithm and get the exact solution path.


Feature search problem for the k = 3 case (piecewise quadratic)

We want to minimize over Ω:

    d(ω) = ( φ_ω(X)^T r + λ_0 ) / ( φ_ω(X)^T φ_A(X) γ_A + 1 )

This is a piecewise rational function of ω, with quadratics in the numerator and denominator
⇒ it can be solved analytically
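To illustrate why this is analytically solvable: between two consecutive data points, both the numerator and the denominator of d(ω) are fixed quadratics in ω, and the stationary points of their ratio are roots of a polynomial of degree at most two. The sketch below (with made-up quadratics, not derived from an actual LARS state) minimizes such a ratio on one interval:

    import numpy as np

    def min_ratio_on_interval(p1, p2, lo, hi):
        """Minimize q1(w)/q2(w) on [lo, hi] for quadratics given by coefficient
        arrays p1, p2 (highest degree first), by solving q1' q2 - q1 q2' = 0."""
        q1, q2 = np.poly1d(p1), np.poly1d(p2)
        stat = np.polysub(np.polymul(np.polyder(p1), p2), np.polymul(p1, np.polyder(p2)))
        cand = [lo, hi] + [r.real for r in np.atleast_1d(np.roots(stat))
                           if abs(r.imag) < 1e-12 and lo < r.real < hi]
        vals = [q1(w) / q2(w) for w in cand]
        i = int(np.argmin(vals))
        return cand[i], vals[i]

    # Made-up example: minimize (w^2 - w + 1) / (w^2 + 1) on [0, 1]
    w_star, v_star = min_ratio_on_interval([1.0, -1.0, 1.0], [1.0, 0.0, 1.0], 0.0, 1.0)
    print(w_star, v_star)

Repeating this on each inter-data interval and taking the overall minimizer gives the exact next knot, which is what makes the full-path algorithm practical for k = 3.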


2-dimensional additive spline example (k = 3)

[Figure: the target surface over (x1, x2) and the fitted additive spline surface after 15, 40, and 65 path steps.]


Boston and California housing (k = 3)

[Figure: prediction MSE versus number of path iterations for linear, quadratic, and spline (k = 3) fits, one panel each for the Boston and California housing data.]


Summary

• ℓ1 regularization generalizes elegantly to infinite dimensional embeddings through the generalization of the norm to a measure
• Statistical/mathematical properties:
  – Existence
  – Sparsity
  – Testability
• We can design and implement a path following algorithm
  – Practical applicability hinges on the feature search problem
• We can practically implement it in spline bases
  – This optimally solves a total-variation penalized non-parametric regression problem


Critical open issues

• What can we say about learning performance? Which embeddings are good?
• Characterize, in general feature spaces, where we can solve the feature search problem

