ESL Chap3 — Linear Methods for Regression Trevor Hastie

Linear Methods for Regression

Outline

• The simple linear regression model

• Multiple linear regression

• Model selection and shrinkage—the state of the art

Preliminaries

Data: $(x_1, y_1), \ldots, (x_N, y_N)$.

$x_i$ is the predictor (regressor, covariate, feature, independent variable)

$y_i$ is the response (dependent variable, outcome)

We denote the regression function by

$$\eta(x) = E(Y \mid x)$$

This is the conditional expectation of $Y$ given $x$.

The linear regression model assumes a specific linear form for $\eta$

$$\eta(x) = \alpha + \beta x$$

which is usually thought of as an approximation to the truth.

Fitting by least squares

Minimize:

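$$(\hat\beta_0, \hat\beta) = \arg\min_{\beta_0, \beta} \sum_{i=1}^{N} (y_i - \beta_0 - \beta x_i)^2$$

Solutions are

$$\hat\beta = \frac{\sum_{i=1}^{N} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$$

$$\hat\beta_0 = \bar{y} - \hat\beta \bar{x}$$

$\hat{y}_i = \hat\beta_0 + \hat\beta x_i$ are called the fitted or predicted values

$r_i = y_i - \hat\beta_0 - \hat\beta x_i$ are called the residuals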
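The closed-form solutions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `fit_simple_ls` and the toy data are made up, and only the standard library is used so that the sketch runs without extra dependencies.

```python
def fit_simple_ls(x, y):
    """Closed-form simple least squares; returns (beta0_hat, beta_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum (x_i - x_bar) * y_i divided by sum (x_i - x_bar)^2.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * yi for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    # Intercept: y_bar - beta_hat * x_bar.
    beta0_hat = y_bar - beta_hat * x_bar
    return beta0_hat, beta_hat


# Toy data, made up for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

beta0_hat, beta_hat = fit_simple_ls(x, y)
fitted = [beta0_hat + beta_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print(beta0_hat, beta_hat)
```

With an intercept in the model, the residuals sum to zero and are orthogonal to the predictor, which is a convenient sanity check on any implementation.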

Elements of Statistical Learning © Hastie, Tibshirani & Friedman 2001, Chapter 3

[Figure 3.1: Linear least squares fitting with $X \in \mathbb{R}^2$. We seek the linear function of $X$ that minimizes the sum of squared residuals from $Y$.]

Figure 3.1: view of linear regression in $\mathbb{R}^{p+1}$.

Standard errors & confidence intervals

We often assume further that

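$$y_i = \beta_0 + \beta x_i + \varepsilon_i$$

where $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Then

$$\mathrm{se}(\hat\beta) = \left[ \frac{\sigma^2}{\sum (x_i - \bar{x})^2} \right]^{1/2}$$

Estimate $\sigma^2$ by $\hat\sigma^2 = \sum (y_i - \hat{y}_i)^2 / (N - 2)$.

Under the additional assumption of normality for the $\varepsilon_i$s, a 95% confidence interval for $\beta$ is $\hat\beta \pm 1.96\, \widehat{\mathrm{se}}(\hat\beta)$, where

$$\widehat{\mathrm{se}}(\hat\beta) = \left[ \frac{\hat\sigma^2}{\sum (x_i - \bar{x})^2} \right]^{1/2}$$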
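As a rough sketch of how the standard error and the interval might be computed, the Python below estimates the error variance with the N − 2 divisor and forms the 1.96 interval from the slide. The function name `slope_inference` and the toy data are illustrative assumptions. For small N a t quantile would be more exact than 1.96; the sketch keeps the slide's constant.

```python
import math


def slope_inference(x, y, z=1.96):
    """Return (beta_hat, sigma2_hat, se_hat, ci) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx
    beta0_hat = y_bar - beta_hat * x_bar
    # Residual sum of squares, divided by N - 2 to estimate sigma^2.
    rss = sum((yi - beta0_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)
    # Estimated standard error of the slope.
    se_hat = math.sqrt(sigma2_hat / sxx)
    ci = (beta_hat - z * se_hat, beta_hat + z * se_hat)
    return beta_hat, sigma2_hat, se_hat, ci


# Toy data, made up for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

beta_hat, sigma2_hat, se_hat, ci = slope_inference(x, y)
print(beta_hat, se_hat, ci)
```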

Fitted Line and Standard Errors

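$$\hat\eta(x) = \hat\beta_0 + \hat\beta x = \bar{y} + \hat\beta (x - \bar{x})$$

$$\mathrm{se}[\hat\eta(x)] = \left[ \mathrm{var}(\bar{y}) + \mathrm{var}(\hat\beta)(x - \bar{x})^2 \right]^{1/2} = \left[ \frac{\sigma^2}{N} + \frac{\sigma^2 (x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]^{1/2}$$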

[Figure: fitted regression line with pointwise standard errors, $\hat\eta(x) \pm 2\,\mathrm{se}[\hat\eta(x)]$, for $x$ from -1.0 to 1.0.]
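The pointwise band can be sketched as follows. This is an illustrative sketch under stated assumptions: the unknown error variance is replaced by its estimate with the N − 2 divisor, and the name `pointwise_band` and the toy data are made up. The band is narrowest at the mean of the predictor and widens as x moves away from it.

```python
import math


def pointwise_band(x, y, x0, width=2.0):
    """Return (eta_hat, se, lower, upper) for the fitted line at the point x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx
    beta0_hat = y_bar - beta_hat * x_bar
    rss = sum((yi - beta0_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)  # plug-in estimate of sigma^2 (an assumption)
    # Fitted value written in centered form: y_bar + beta_hat * (x0 - x_bar).
    eta_hat = y_bar + beta_hat * (x0 - x_bar)
    # var(y_bar) + var(beta_hat) * (x0 - x_bar)^2, then the square root.
    se = math.sqrt(sigma2_hat / n + sigma2_hat * (x0 - x_bar) ** 2 / sxx)
    return eta_hat, se, eta_hat - width * se, eta_hat + width * se


# Toy data, made up for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

mid = pointwise_band(x, y, 2.0)   # at the mean of x
edge = pointwise_band(x, y, 4.0)  # at the edge of the data
print(mid, edge)
```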

Multiple linear regression

Model is

$$f(x_i) = \beta_0 + \sum_{j=1}^{p-1} x_{ij} \beta_j$$

Equivalently in matrix notation:

$$f = X\beta$$

$f$ is the $N$-vector of predicted values

$X$ is the $N \times p$ matrix of regressors, with ones in the first column

$\beta$ is a $p$-vector of parameters

Estimation by least squares

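$$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p-1} x_{ij} \beta_j \Big)^2 = \arg\min_{\beta}\, (y - X\beta)^T (y - X\beta)$$

Figure 3.2 shows the $N$-dimensional geometry

Solution is

$$\hat\beta = (X^T X)^{-1} X^T y$$

$$\hat{y} = X \hat\beta$$

Also $\mathrm{Var}(\hat\beta) = (X^T X)^{-1} \sigma^2$

Here are some additional notes (linear.pdf) on multiple linear regression, with an emphasis on computations.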
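To make the matrix solution concrete, here is a small sketch that forms the normal equations and solves them by Gaussian elimination. It is written with the standard library only, and the helper names (`solve_linear_system`, `fit_multiple_ls`) and the data are illustrative assumptions. It also assumes that X has full column rank. In practice a QR- or SVD-based routine such as `numpy.linalg.lstsq` would usually be preferred for numerical stability.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def fit_multiple_ls(rows, y):
    """Least squares with an intercept; returns beta_hat, intercept first."""
    X = [[1.0] + list(r) for r in rows]  # ones in the first column
    p = len(X[0])
    xtx = [[sum(xr[i] * xr[j] for xr in X) for j in range(p)] for i in range(p)]
    xty = [sum(xr[i] * yi for xr, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)  # solves (X^T X) beta = X^T y


# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2 (no noise).
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0, 3.0, -2.0, 0.0, 2.0]

beta_hat = fit_multiple_ls(rows, y)
fitted = [beta_hat[0] + beta_hat[1] * r[0] + beta_hat[2] * r[1] for r in rows]
print(beta_hat)
```

Because the toy response is an exact linear function of the predictors, the fit should recover the generating coefficients up to rounding error.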

The Bias-variance tradeoff

A good measure of the quality of an estimator $\hat{f}(x)$ is the mean squared error. Let $f_0(x)$ be the true value of $f(x)$ at the point $x$. Then

$$\mathrm{Mse}[\hat{f}(x)] = E[\hat{f}(x) - f_0(x)]^2$$

This can be written as

$$\mathrm{Mse}[\hat{f}(x)] = \mathrm{Var}[\hat{f}(x)] + [E \hat{f}(x) - f_0(x)]^2$$

This is variance plus squared bias.
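For completeness, the decomposition follows in one step by adding and subtracting the mean of the estimator. The shorthand $\mu(x) = E \hat{f}(x)$ is introduced here only for this derivation.

```latex
\begin{aligned}
E[\hat{f}(x) - f_0(x)]^2
  &= E\big[(\hat{f}(x) - \mu(x)) + (\mu(x) - f_0(x))\big]^2 \\
  &= E[\hat{f}(x) - \mu(x)]^2
     + 2\,(\mu(x) - f_0(x))\,E[\hat{f}(x) - \mu(x)]
     + (\mu(x) - f_0(x))^2 \\
  &= \mathrm{Var}[\hat{f}(x)] + [E \hat{f}(x) - f_0(x)]^2 .
\end{aligned}
```

The cross term vanishes because $E[\hat{f}(x) - \mu(x)] = 0$ and $\mu(x) - f_0(x)$ is not random.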

Typically, when bias is low, variance is high and vice-versa. Choosing

estimators often involves a tradeoff between bias and variance.

• If the linear model is correct for a given problem, then the least squares prediction $\hat{f}$ is unbiased, and has the lowest variance among all unbiased estimators that are linear functions of $y$.

• But there can be (and often exist) biased estimators with smaller mean squared error.

• Generally, by regularizing (shrinking, dampening, controlling) the estimator in some way, its variance will be reduced; if the corresponding increase in bias is small, this will be worthwhile.

• Examples of regularization: subset selection (forward, backward, all subsets); ridge regression, the lasso.

• In reality models are almost never correct, so there is an additional model bias between the closest member of the linear model class and the truth.
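A one-parameter example may help show why a biased estimator can win. Suppose an estimator is unbiased for a coefficient with some variance, and consider the shrunken version obtained by multiplying it by a constant c. Its variance is c² times the original variance and its squared bias is (c − 1)² times the squared coefficient. The sketch below evaluates this; the function name and the numbers are made up for illustration, and in practice the best c is unknown because it depends on the true coefficient.

```python
def shrinkage_mse(c, beta, var):
    """MSE of c * beta_hat, where beta_hat is unbiased for beta with variance var."""
    variance = c ** 2 * var
    bias_sq = ((c - 1.0) * beta) ** 2
    return variance + bias_sq


# Illustrative values only: true coefficient 1.0, estimator variance 0.5.
beta, var = 1.0, 0.5

# Minimizing c^2 * var + (c - 1)^2 * beta^2 over c gives this factor.
c_opt = beta ** 2 / (beta ** 2 + var)

mse_unbiased = shrinkage_mse(1.0, beta, var)  # no shrinkage: pure variance
mse_shrunk = shrinkage_mse(c_opt, beta, var)  # some bias, less variance
print(c_opt, mse_unbiased, mse_shrunk)
```

Here the shrunken estimator trades a small squared bias for a larger drop in variance, so its mean squared error is lower than that of the unbiased estimator.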

Model Selection

Often we prefer a restricted estimate because of its reduced estimation variance.

Elements of Statistical Learning © Hastie, Tibshirani & Friedman 2001, Chapter 7

[Figure 7.2: Schematic of the behavior of bias and variance. Labels in the figure: Realization, Closest fit in population, Closest fit, Model bias, Estimation Bias, Estimation Variance, Shrunken fit, MODEL SPACE, RESTRICTED MODEL SPACE.]

Figure 7.2: Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the “closest fit” labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labelled “closest fit in population”. A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.

Analysis of time series data

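Two approaches: frequency domain (Fourier); see the discussion of wavelet smoothing.

Time domain. Main tool is the auto-regressive (AR) model of order $k$:

$$y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t$$

Fit by linear least squares regression on lagged data

$$\begin{aligned}
y_t &= \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} \\
y_{t-1} &= \beta_1 y_{t-2} + \beta_2 y_{t-3} + \cdots + \beta_k y_{t-k-1} \\
&\;\;\vdots \\
y_{k+1} &= \beta_1 y_k + \beta_2 y_{k-1} + \cdots + \beta_k y_1
\end{aligned}$$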
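The lagged regression can be sketched as below: each row of the design matrix holds the k previous values of the series, the target is the current value, and the coefficients come from least squares without an intercept, matching the AR model as written. The helper names and the synthetic series are assumptions made for illustration. The series is generated without noise so that the fit can be checked against the generating coefficients; real data would include the error term.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def lagged_design(series, k):
    """Rows (y_{t-1}, ..., y_{t-k}) and targets y_t for every t with k lags available."""
    rows, targets = [], []
    for t in range(k, len(series)):
        rows.append([series[t - j] for j in range(1, k + 1)])
        targets.append(series[t])
    return rows, targets


def fit_ar(series, k):
    """Least squares AR(k) coefficients (no intercept), via the normal equations."""
    rows, targets = lagged_design(series, k)
    ltl = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    lty = [sum(r[i] * yt for r, yt in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(ltl, lty)


# Synthetic noise-free AR(2) series, made up for illustration only.
true_beta = [0.5, -0.3]
series = [1.0, 2.0]
for _ in range(10):
    series.append(true_beta[0] * series[-1] + true_beta[1] * series[-2])

rows, targets = lagged_design(series, 2)
beta_hat = fit_ar(series, 2)
print(beta_hat)
```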

Example: NYSE data

Time series of 6200 daily measurements, 1962-1987

volume: log(trading volume), the outcome

volume.Lj: log(trading volume) at day − j, j = 1, 2, 3

ret.Lj: Δ log(Dow Jones) at day − j, j = 1, 2, 3

aret.Lj: |Δ log(Dow Jones)| at day − j, j = 1, 2, 3

vola.Lj: volatility at day − j, j = 1, 2, 3

Source—Weigend and LeBaron (1994)

We randomly selected a training set of size 50 and a test set of size 500, from the

first 600 observations.
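One way such a split might be drawn is sketched below. The seed, the variable names and the choice to make the two sets disjoint are assumptions; the source does not say how the sampling was done.

```python
import random

# Fixed seed so the split is reproducible (an assumption; the source gives no seed).
rng = random.Random(0)

# Indices of the first 600 observations, shuffled once.
candidates = list(range(600))
rng.shuffle(candidates)

# Disjoint training and test index sets of sizes 50 and 500.
train_idx = sorted(candidates[:50])
test_idx = sorted(candidates[50:550])
print(len(train_idx), len(test_idx))
```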

Trevor Hastie Stats315a January 15, 2003 Chap3: 16

NYSE data

[Figure: scatterplot matrix of the NYSE variables. Panel titles recovered from the plot: volume, volume.L1, volume.L2, volume.L3, retd.L1, retd.L2, retd.L3, aretd.L1, aretd.L2, aretd.L3, vola.L1, vola.L2.]

•• ••••

••••

•• •••

•• ••

••

•• • •••

•••

••

••••

••

•••••

•••

••

•••••

•• ••••• ••

••

•••

•••• ••

••••

••

••• •

• ••

••

•••

•••• ••• ••

••

•••

•••• •

••••••••

••

•• ••••

••••

•••••

•• ••

••

••• •••

•••

••

• •••

••

•••• •

•••

••

••••••••• •

• •••• •••••

•••

••••• •

•••••

••

•••••••

••

•••

••

• • ••• •

••

• ••

•• •••

••••••••

••

•• ••••

••••

•••••

••••

•••

•••• ••••

••

•••

••

•••

••

•• ••

•••

••••••••• •

•••• •••

•••

••• •••

••••

••

•••••••

••

•••••••••

••

•••

•••••

• ••

••• ••

•••••••

•••

•••••

••••

••

••• •••••

••

•••

••

•••••

•••

••

••••••••••••••• •••

••

•••

-2 0 2

••• •••

••••

••

••• •

•••

••

••••

•••••

••

• ••

•••••

•••

••• ••

•••••••

•••

•••••

••••

••

••• •••••

••

•••

••

•• •••

•••

••

•••••

•••••••• •••

••

•••

vola.L3

OLS Fit

Results of ordinary least squares analysis of NYSE data

Term Coefficient Std. Error t-Statistic

Intercept -0.02 0.04 -0.64

volume.L1 0.09 0.05 1.80

volume.L2 0.06 0.05 1.19

volume.L3 0.04 0.05 0.81

retd.L1 0.00 0.04 0.11

retd.L2 -0.02 0.05 -0.46

retd.L3 -0.03 0.04 -0.65

aretd.L1 0.08 0.07 1.12

aretd.L2 -0.02 0.05 -0.45

aretd.L3 0.03 0.04 0.77

vola.L1 0.20 0.30 0.66

vola.L2 -0.50 0.40 -1.25

vola.L3 0.27 0.34 0.78

Variable subset selection

We retain only a subset of the coefficients and set the coefficients of the rest to zero.

There are different strategies:

• All subsets regression finds, for each s ∈ {0, 1, 2, . . . , p}, the subset of size s that gives the smallest residual sum of squares. The question of how to choose s involves the tradeoff between bias and variance; cross-validation can be used (see below).

• Rather than search through all possible subsets, we can seek a good path through them. Forward stepwise selection starts with the intercept and then sequentially adds into the model the variable that most improves the fit. The improvement in fit is usually based on the F ratio

$$F = \frac{RSS(\hat\beta^{old}) - RSS(\hat\beta^{new})}{RSS(\hat\beta^{new})/(N - s)}$$

• Backward stepwise selection starts with the full OLS model and sequentially deletes variables.

• There are also hybrid stepwise selection strategies, which add the best variable and delete the least important variable in a sequential manner.

• Each procedure has one or more tuning parameters:

– subset size

– P-values for adding or dropping terms
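The forward stepwise strategy and its F ratio can be sketched as follows. This is a minimal illustration, not the procedure used for the NYSE analysis: the function names and the simulated data are hypothetical, and s in N − s is assumed to count the fitted parameters of the new model (intercept plus selected variables).

```python
import numpy as np

def rss(A, y):
    """Residual sum of squares of the least squares fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, max_size):
    """Greedy forward selection; returns the chosen columns and the F ratio of each step."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    chosen, f_ratios = [], []
    rss_old = rss(intercept, y)                  # start from the intercept-only model
    for _ in range(max_size):
        best = None
        for j in range(p):
            if j in chosen:
                continue
            rss_new = rss(np.hstack([intercept, X[:, chosen + [j]]]), y)
            if best is None or rss_new < best[1]:
                best = (j, rss_new)              # variable that most improves the fit
        j, rss_new = best
        s = len(chosen) + 2                      # assumption: parameters in the new model
        f_ratios.append((rss_old - rss_new) / (rss_new / (n - s)))
        chosen.append(j)
        rss_old = rss_new
    return chosen, f_ratios

# Simulated data (hypothetical): only columns 2 and 4 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + rng.normal(size=100)
chosen, f_ratios = forward_stepwise(X, y, max_size=3)
```

In this sketch the F ratio drops sharply once the informative variables are in the model, which is what a P-value threshold for adding terms would pick up on.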

Model Assessment

Objectives:

1. Choose a value of a tuning parameter for a technique

2. Estimate the prediction performance of a given model

For both of these purposes, the best approach is to run the procedure on an independent test set, if one is available.

If possible, one should use different test data for (1) and (2) above: a validation set for (1) and a test set for (2).

Often there is insufficient data to create a separate validation or test set. In this instance, cross-validation is useful.

K-Fold Cross-Validation

Primary method for estimating a tuning parameter λ (such as subset size).

Divide the data into K roughly equal parts (typically K = 5 or 10).

[Figure: schematic of the data split into numbered parts, one marked Test and the others marked Train.]

• For each k = 1, 2, . . . , K, fit the model with parameter λ to the other K − 1 parts, giving $\hat\beta^{-k}(\lambda)$, and compute its error in predicting the kth part:

$$E_k(\lambda) = \sum_{i \in k\text{th part}} (y_i - x_i \hat\beta^{-k}(\lambda))^2.$$

This gives the cross-validation error

$$CV(\lambda) = \frac{1}{K} \sum_{k=1}^{K} E_k(\lambda)$$

• Do this for many values of λ and choose the value of λ that makes CV(λ) smallest.

• In our variable subsets example, λ is the subset size.

• $\hat\beta^{-k}(\lambda)$ are the coefficients for the best subset of size λ, found from the training set that leaves out the kth part of the data.

• $E_k(\lambda)$ is the estimated test error for this best subset.

• From the K cross-validation training sets, the K test error estimates are averaged to give

$$CV(\lambda) = \frac{1}{K} \sum_{k=1}^{K} E_k(\lambda).$$

• Note that different subsets of size λ will (probably) be found from each of the K cross-validation training sets. This doesn't matter: the focus is on subset size, not the actual subset.
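A minimal sketch of K-fold cross-validation with λ taken to be the subset size, as in the variable subsets example. The helper names and the simulated data are hypothetical, and the exhaustive subset search is only practical here because p is small.

```python
import itertools
import numpy as np

def fit_ls(X, y, subset):
    """Least squares with intercept on the given columns; returns (intercept, coefficients)."""
    A = np.hstack([np.ones((len(y), 1)), X[:, list(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]

def best_subset(X, y, size):
    """Subset of the given size with the smallest training residual sum of squares."""
    best, best_rss = (), np.inf
    for subset in itertools.combinations(range(X.shape[1]), size):
        b0, b = fit_ls(X, y, subset)
        r = y - b0 - X[:, list(subset)] @ b
        if r @ r < best_rss:
            best, best_rss = subset, r @ r
    return best

def cv_curve(X, y, sizes, K=5, seed=0):
    """CV(lambda) = (1/K) * sum_k E_k(lambda), with lambda the subset size."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    cv = []
    for size in sizes:
        errs = []
        for test_idx in folds:
            train = np.setdiff1d(np.arange(n), test_idx)
            subset = best_subset(X[train], y[train], size)   # may differ from fold to fold
            b0, b = fit_ls(X[train], y[train], subset)
            resid = y[test_idx] - b0 - X[test_idx][:, list(subset)] @ b
            errs.append(resid @ resid)                       # E_k(lambda)
        cv.append(np.mean(errs))
    return np.array(cv)

# Simulated data (hypothetical): two of five inputs carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=60)
sizes = list(range(6))
cv = cv_curve(X, y, sizes)
best_size = sizes[int(np.argmin(cv))]
```

Note that the best subset is re-selected inside every fold; selecting it once on the full data and then cross-validating would understate the error.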

[Figure: CV curve for NYSE data, all subsets; the horizontal axis is subset size (2 to 12).]

• The focus is on subset size, not which variables are in the model.

• Variance increases slowly, typically σ²/N per variable.

[Figure 3.5: All possible subset models for the prostate cancer example. At each subset size k (0 to 8) is shown the residual sum-of-squares for each model of that size.]

The Bootstrap approach

• The bootstrap works by sampling N times with replacement from the training set to form a “bootstrap” data set. The model is then estimated on the bootstrap data set, and predictions are made for the original training set.

• This process is repeated many times and the results are averaged.

• The bootstrap is most useful for estimating standard errors of predictions.

• Modified versions of the bootstrap can also be used to estimate prediction error. These sometimes produce better estimates than cross-validation (a topic of current research).
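The resampling loop can be sketched as below for a linear model. This is an illustrative sketch under assumptions: the function name, the number of bootstrap replicates, and the simulated data are hypothetical choices.

```python
import numpy as np

def bootstrap_prediction_se(X, y, n_boot=500, seed=0):
    """Bootstrap mean and standard error of linear-model predictions at the training points."""
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.hstack([np.ones((n, 1)), X])            # design matrix with intercept
    preds = np.empty((n_boot, n))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)           # sample N times with replacement
        beta, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        preds[b] = A @ beta                        # predict the original training set
    return preds.mean(axis=0), preds.std(axis=0, ddof=1)

# Simulated simple regression (hypothetical data).
rng = np.random.default_rng(1)
x = rng.normal(size=80)
y = 1.0 + 2.0 * x + rng.normal(size=80)
mean_pred, se_pred = bootstrap_prediction_se(x[:, None], y)
far = np.argmax(np.abs(x - x.mean()))    # training point farthest from the mean of x
near = np.argmin(np.abs(x - x.mean()))   # training point closest to the mean of x
```

As with the pointwise standard errors of a fitted line, the bootstrap standard error in this sketch is larger for points far from the mean of x than for points near it.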

NYSE example continued

Table shows the coefficients from a number of different selection and shrinkage

methods, applied to the NYSE data.

Term OLS VSS Ridge Lasso PCR PLS

Intercept -0.02 0.00 -0.01 -0.02 -0.02 -0.04

volume.L1 0.09 0.16 0.06 0.09 0.05 0.06

volume.L2 0.06 0.00 0.04 0.02 0.06 0.06

volume.L3 0.04 0.00 0.04 0.03 0.04 0.05

retd.L1 0.00 0.00 0.01 0.01 0.02 0.01

retd.L2 -0.02 0.00 -0.01 0.00 -0.01 -0.02

retd.L3 -0.03 0.00 -0.01 0.00 -0.02 0.00

aretd.L1 0.08 0.00 0.03 0.02 -0.02 0.00

aretd.L2 -0.02 -0.05 -0.03 -0.03 -0.01 -0.01

aretd.L3 0.03 0.00 0.01 0.00 0.02 0.01

vola.L1 0.20 0.00 0.00 0.00 -0.01 -0.01

vola.L2 -0.50 0.00 -0.01 0.00 -0.01 -0.01

vola.L3 0.27 0.00 -0.01 0.00 -0.01 -0.01

Test err 0.050 0.041 0.042 0.039 0.045 0.044

SE 0.007 0.005 0.005 0.005 0.006 0.006

CV was used on the 50 training observations (except for OLS). Test error for

constant: 0.061.

[Figure: Estimated prediction error curves for the various selection and shrinkage methods. The arrow indicates the estimated minimizing value of the complexity parameter. Training sample size = 50. Panels: all subsets (subset size), ridge regression (degrees of freedom), a panel on a 0.0 to 1.0 axis whose title was not recovered, PC regression (# directions), partial least squares (# directions).]

[Figure 3.6: Estimated prediction error curves and their standard errors for the various selection and shrinkage methods, found by 10-fold cross-validation. Panels: All Subsets (Subset Size), Ridge Regression (Degrees of Freedom), a panel against Shrinkage Factor s whose title was not recovered, Principal Components Regression (Number of Directions), Partial Least Squares (Number of Directions).]

Shrinkage methods

Ridge regression

The ridge estimator is defined by

$$\hat\beta^{ridge} = \arg\min_\beta \; (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$$

Equivalently,

$$\hat\beta^{ridge} = \arg\min_\beta \; (y - X\beta)^T (y - X\beta) \quad \text{subject to} \quad \sum_j \beta_j^2 \le s.$$

The parameter λ > 0 penalizes each βj in proportion to its squared size βj². The solution is

$$\hat\beta_\lambda = (X^T X + \lambda I)^{-1} X^T y$$

where I is the identity matrix. This is a biased estimator that, for some value of λ > 0, may have smaller mean squared error than the least squares estimator.

Note that λ = 0 gives the least squares estimator; if λ → ∞, then $\hat\beta \to 0$.
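The closed-form solution and its two limiting cases can be checked numerically. The sketch below assumes centered inputs and outcome with no intercept term; the function name and the simulated data are hypothetical.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge coefficients (X^T X + lam * I)^{-1} X^T y, for centered X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated, centered data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(size=50)
y = y - y.mean()

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares reference
lams = (0.0, 1.0, 10.0, 100.0, 1e8)
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in lams]
```

Here `ridge(X, y, 0.0)` reproduces the least squares coefficients, and the coefficient norm shrinks monotonically toward zero as λ grows.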

[Figure 3.7: Profiles of ridge coefficients for the prostate cancer example, as tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 4.16, the value chosen by cross-validation. Recovered curve labels: lcavol, lweight, gleason.]

The Lasso

The lasso is a shrinkage method like ridge, but acts in a nonlinear manner on the outcome y.

The lasso is defined by

$$\hat\beta^{lasso} = \arg\min_\beta \; (y - X\beta)^T (y - X\beta) \quad \text{subject to} \quad \sum_j |\beta_j| \le t$$

• Notice that the ridge penalty $\sum_j \beta_j^2$ is replaced by $\sum_j |\beta_j|$.

• This makes the solutions nonlinear in y, and a quadratic programming algorithm is used to compute them.

• Because of the nature of the constraint, if t is chosen small enough then the lasso will set some coefficients exactly to zero. Thus the lasso does a kind of continuous model selection.

• The parameter t should be adaptively chosen to minimize an estimate of expected prediction error, using, say, cross-validation.

• Ridge vs lasso: if the inputs are orthogonal, ridge multiplies the least squares coefficients by a constant < 1, while the lasso translates them towards zero by a constant, truncating at zero.
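The ridge vs lasso comparison for orthogonal inputs can be made concrete with a small sketch. It assumes orthonormal input columns; the function names are hypothetical, and the threshold `gamma` is an illustrative stand-in for the amount of lasso shrinkage implied by t, not a value taken from the slides.

```python
import numpy as np

def ridge_orthonormal(beta_ls, lam):
    """Ridge with orthonormal inputs: every coefficient is multiplied by 1 / (1 + lam)."""
    return beta_ls / (1.0 + lam)

def lasso_orthonormal(beta_ls, gamma):
    """Lasso with orthonormal inputs: translate towards zero by gamma, truncating at zero."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - gamma, 0.0)

# Hypothetical least squares coefficients.
beta_ls = np.array([3.0, -1.5, 0.4, -0.2])
ridge_coefs = ridge_orthonormal(beta_ls, lam=1.0)    # all halved, none exactly zero
lasso_coefs = lasso_orthonormal(beta_ls, gamma=0.5)  # small coefficients become exactly zero
```

The two small coefficients survive ridge shrinkage at reduced size but are set exactly to zero by the lasso translation, which is the "continuous model selection" behaviour described above.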

[Figure: plot with axis labels “Coefficient” and “Transformed”.]

Lasso in Action

Profiles of coefficients for NYSE data as lasso shrinkage is varied.

[Figure: coefficient profiles plotted against the shrinkage factor s.]

$s = t/t_0 \in [0, 1]$, where $t_0 = \sum_j |\hat\beta_j^{OLS}|$.

[Figure 3.9: Profiles of lasso coefficients, as tuning parameter t is varied. Coefficients are plotted versus $s = t/\sum_1^p |\hat\beta_j|$. A vertical line is drawn at s = 0.5, the value chosen by cross-validation. Compare Figure 3.7 on page 7; the lasso profiles hit zero, while those for ridge do not. Curve labels: lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45.]

[Figure 3.12: Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$, respectively, while the red ellipses are the contours of the least squares error function.]

A family of shrinkage estimators

Consider the criterion

$$\hat\beta = \arg\min_\beta \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 \quad \text{subject to} \quad \sum_j |\beta_j|^q \le s$$

for q ≥ 0. The contours of constant value of $\sum_j |\beta_j|^q$ are shown for the case of two inputs.

[Figure 3.13: Contours of constant value of $\sum_j |\beta_j|^q$ for given values of q (q = 4, 2, 1, 0.5, 0.1).]

Thinking of $|\beta_j|^q$ as the log-prior density for βj, these are also the equi-contours of the prior.

Use of derived input directions

Principal components regression

We choose a set of linear combinations of the xj's, and then regress the outcome on these linear combinations.

The particular combinations used are the sequence of principal components of the inputs. These are uncorrelated and ordered by decreasing variance.

If S is the sample covariance matrix of x1, . . . , xp, then the eigenvector equations

$$S q_j = d_j^2 q_j$$

define the principal components of S.

[Figure 3.8: Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.]

Digression: some notes on Principal Components and the SVD (PCA.pdf)

PCA regression continued

• Write q(j) for the ordered principal components, ordered from largest to smallest value of $d_j^2$.

• Then principal components regression computes the derived input columns zj = Xq(j) and then regresses y on z1, z2, . . . , zJ for some J ≤ p.

• Since the zj's are orthogonal, this regression is just a sum of univariate regressions:

$$\hat y^{pcr} = \bar y + \sum_{j=1}^{J} \hat\gamma_j z_j$$

where $\hat\gamma_j$ is the univariate regression coefficient of y on zj.

• Principal components regression is very similar to ridge regression: both operate on the principal components of the input matrix.

• Ridge regression shrinks the coefficients of the principal components, with relatively more shrinkage applied to the smaller components than the larger; principal components regression with J components discards the p − J smallest eigenvalue components.
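The sum-of-univariate-regressions form of principal components regression can be sketched as follows. This is an illustration under assumptions: the inputs are centered inside the function, the principal directions are taken from the SVD of the centered inputs, and the function name and data are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, J):
    """Fitted values of principal components regression using the first J components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the eigenvectors q_(j) of the sample covariance matrix,
    # ordered from largest to smallest d_j^2.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    fit = np.full(len(y), y.mean())
    for j in range(J):
        z = Xc @ Vt[j]                 # derived input z_j = X q_(j)
        gamma = (z @ y) / (z @ z)      # univariate regression coefficient of y on z_j
        fit = fit + gamma * z
    return fit

# Simulated data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, 0.5, -1.0, 0.0]) + rng.normal(size=60)

A = np.hstack([np.ones((60, 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
ols_fit = A @ coef                                             # OLS fit with intercept
rss = [np.sum((y - pcr_fit(X, y, J)) ** 2) for J in range(5)]  # J = 0, ..., p
```

With J = p the sketch reproduces the ordinary least squares fit, and the training residual sum of squares never increases as components are added.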

[Figure 3.10: Ridge regression shrinks the regression coefficients of the principal components, using shrinkage factors $d_j^2/(d_j^2 + \lambda)$ as in (3.47). Principal component regression truncates them. Shown are the shrinkage and truncation patterns corresponding to Figure 3.6, as a function of the principal component index. Legend: ridge, pcr.]

Partial least squares

This technique also constructs a set of linear combinations of the xj's for regression, but unlike principal components regression, it uses y (in addition to X) for this construction.

• We assume that y is centered and begin by computing the univariate regression coefficient $\hat\gamma_j$ of y on each xj.

• From this we construct the derived input $z_1 = \sum_j \hat\gamma_j x_j$, which is the first partial least squares direction.

• The outcome y is regressed on z1, giving coefficient $\hat\beta_1$, and then we orthogonalize y, x1, . . . , xp with respect to z1: $r_1 = y - \hat\beta_1 z_1$, and $x_\ell^* = x_\ell - \hat\theta_\ell z_1$.

• We continue this process until J directions have been obtained.

• In this manner, partial least squares produces a sequence of derived inputs or directions z1, z2, . . . , zJ.

• As with principal components regression, if we continue on to construct J = p new directions we get back the ordinary least squares estimates; use of J < p directions produces a reduced regression.

• Notice that in the construction of each zj, the inputs are weighted by the strength of their univariate effect on y.

• It can also be shown that the sequence z1, z2, . . . , zp represents the conjugate gradient sequence for computing the ordinary least squares solutions.
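The construction of the partial least squares directions can be sketched as below. This is a minimal sketch under assumptions: the inputs are standardized inside the function, the function name and data are hypothetical, and no claim is made that this matches any particular library implementation.

```python
import numpy as np

def pls_fit(X, y, J):
    """Fitted values after J partial least squares directions."""
    Xm = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized inputs
    r = y - y.mean()                                 # centered outcome
    fit = np.full(len(y), y.mean())
    for _ in range(J):
        gamma = (Xm.T @ r) / np.sum(Xm ** 2, axis=0) # univariate coefficients on each x_j
        z = Xm @ gamma                               # next derived direction
        beta = (z @ r) / (z @ z)                     # regress the outcome on z
        fit = fit + beta * z
        r = r - beta * z                             # orthogonalize the outcome w.r.t. z
        theta = (Xm.T @ z) / (z @ z)
        Xm = Xm - np.outer(z, theta)                 # orthogonalize each input w.r.t. z
    return fit

# Simulated data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=50)

A = np.hstack([np.ones((50, 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
ols_fit = A @ coef
rss = [np.sum((y - pls_fit(X, y, J)) ** 2) for J in range(5)]
```

Because the derived directions are mutually orthogonal and lie in the column space of the centered inputs, taking J = p in this sketch gives back the ordinary least squares fit.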

Ridge vs PCR vs PLS vs Lasso

Recent study has shown that ridge and PCR outperform PLS in prediction, and

they are simpler to understand.

Lasso outperforms ridge when there are a moderate number of sizable effects,

rather than many small effects. It also produces more interpretable models.

These are still topics for ongoing research.

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 1

Regularized Optimization, Boosting,

and Some Connections between

Saharon Rosset (IBM Research)

Collaborators: Ji Zhu (Michigan), Trevor Hastie (Stanford)

Predictive modeling

Given n data samples $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i^T \in R^p$

Generated independently from a data distribution:

y = f(x) + ε(x)

(f: fixed; ε: random)

We want to find a “good” model $\hat f(x)$ to describe the deterministic part.

The definition of “good” is typically in terms of $E_X L(y, \hat f(x))$, where L depends on the problem.

Corporate Data Bases

Many tables, relational database.

Motivation

Modern data (Data Mining, Machine Learning etc.) is:

• High dimensional

– By nature: micro-arrays, scientific data, customer databases

– Computational tool: data often projected into high dimensional space:

kernel methods, wavelets, boosting’s weak hypotheses, etc.

• Noisy and dirty (e.g. customer databases)

• Contains many irrelevant predictors (e.g. customer databases, micro-arrays)

Fitting models without controlling complexity results in:

• Badly over-fitted models

• Useless for prediction or interpretation

Illustrative example

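100 data points, 80 dimensional space. True model:

$$y_i = x_{i1} + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, 1)$$

We are fitting a linear regression model of the form:

$$\hat f(x) = x \cdot \hat\beta$$

Unregularized model projected to x1

Unregularized model: $\hat\beta = \arg\min_\beta \sum_i \|y_i - x_i\beta\|^2$

[Figure: unregularized fit projected onto x1.]

Appropriately regularized model

We impose an l1 constraint on the model:

$$\hat\beta = \arg\min_{\|\beta\|_1 \le 1} \sum_i \|y_i - x_i\beta\|^2$$

[Figure: non-regularized and l1 regularized fits projected onto x1.]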
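The example (100 points, 80 dimensions, only x1 matters) can be reproduced in simulation. The sketch below is one possible way to do it, not the method behind the slides: the l1-constrained fit is computed by projected gradient descent with a sort-based projection onto the l1 ball, and all function names, the iteration count, and the test-set size are assumptions.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the set {b : ||b||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def l1_constrained_ls(X, y, radius=1.0, n_iter=2000):
    """Projected gradient descent for min ||y - X b||^2 subject to ||b||_1 <= radius."""
    step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = project_l1_ball(b - step * (X.T @ (X @ b - y)), radius)
    return b

rng = np.random.default_rng(0)
n, p = 100, 80
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)                 # true model: y = x1 + noise
X_test = rng.normal(size=(2000, p))
y_test = X_test[:, 0] + rng.normal(size=2000)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None) # unregularized fit
beta_l1 = l1_constrained_ls(X, y)                # fit with ||beta||_1 <= 1

mse_ols = np.mean((y_test - X_test @ beta_ols) ** 2)
mse_l1 = np.mean((y_test - X_test @ beta_l1) ** 2)
```

With only 100 observations for 80 coefficients, the unregularized fit spreads weight over the irrelevant inputs, so its test error in this simulation is well above that of the constrained fit.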

Prediction problems

• Training data (x1, y1), . . . , (xn, yn)

• Input $x_i \in R^p$

• Output $y_i$

– Regression: $y_i \in R$

– Two class classification: $y_i \in \{1, -1\}$

• Wish to find a prediction model for future data

$f : x \in R^p \to R$

Regression: predict f(x)

Classification: predict sign of f(x)

• Generally take f(x) = xβ (linear model)

– Can be linear in a basis expansion (kernel/wavelets etc.)

The regularized optimization problem

$$\hat\beta(\lambda) = \arg\min_\beta \sum_i C(y_i, x_i\beta) + \lambda J(\beta)$$

Where:

• C is a convex loss, describing the “goodness of fit” of our model to the training data

– Regression: C(y, f) = C(y − f), a function of the residual

– Classification: C(y, f) = C(yf), a function of the margin

• J(β) is a model complexity penalty.

Typically $J(\beta) = \|\beta\|_q^q$, i.e. penalize the lq norm of the model, q ≥ 1.

• λ ≥ 0 is a regularization parameter

– As λ → 0, we approach the non-regularized model

– As λ → ∞, we get that $\hat\beta(\lambda) \to 0$

Examples

• Regularized linear regression:

Squared error loss: $C(y, f) = (y - f)^2$

– Ridge regression uses l2 penalty $J(\beta) = \|\beta\|_2^2$

– The Lasso (Tibshirani 96) uses l1 penalty $J(\beta) = \|\beta\|_1$

• Support Vector Machines:

Hinge loss: $C(y, f) = (1 - yf)_+$

– Standard (2-norm) SVM uses l2 penalty $\|\beta\|_2^2$

– 1-norm SVM uses l1 penalty $\|\beta\|_1$

Considerations in selecting loss

$$\hat\beta(\lambda) = \arg\min_\beta \sum_i C(y_i, x_i\beta) + \lambda \|\beta\|_q^q$$

“Classical” view: loss should correspond to data log-likelihood

• Squared error loss corresponds to Gaussian errors

• Logistic regression uses binomial likelihood

Pragmatic view: need to do well on data

• Robustness considerations: sensitivity to incorrect error model

• Computational considerations: can we solve the problem efficiently

Some loss functions for regression and classification

[Figure: two panels. Regression: squared loss and Huber's loss as functions of the residual. Classification: exponential, logistic, and hinge losses as functions of the margin.]
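The losses named here can be written down directly as functions of the residual r = y − f or the margin m = yf. The sketch below is illustrative: the Huber parameterization (quadratic up to a threshold `delta`, linear beyond it, scaled to match the squared loss r²) is one common choice and is an assumption, as are the function names.

```python
import numpy as np

def squared_loss(r):
    """Squared error loss as a function of the residual."""
    return r ** 2

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond; delta is an assumed threshold."""
    a = np.abs(r)
    return np.where(a <= delta, r ** 2, 2 * delta * a - delta ** 2)

def exponential_loss(m):
    """Exponential loss as a function of the margin m = y * f."""
    return np.exp(-m)

def logistic_loss(m):
    """Binomial log-likelihood (logistic) loss as a function of the margin."""
    return np.log1p(np.exp(-m))

def hinge_loss(m):
    """Hinge loss (1 - m)_+ as a function of the margin."""
    return np.maximum(0.0, 1.0 - m)

residuals = np.array([0.5, 3.0])
margins = np.array([-1.0, 0.5, 2.0])
huber_values = huber_loss(residuals)
hinge_values = hinge_loss(margins)
```

For a large residual the Huber loss grows linearly rather than quadratically, which is the robustness consideration raised above; for large positive margins the hinge loss is exactly zero while the exponential and logistic losses only approach zero.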

Considerations in selecting penalty

$$\hat\beta(\lambda) = \arg\min_\beta \sum_i C(y_i, x_i\beta) + \lambda \|\beta\|_q^q$$

Two perspectives on penalty:

• Bayesian: prior over the model space

– the regularized optimization solution is the maximum posterior likelihood

• Limit model space to avoid over-fitting

Considerations in selecting penalty:

• Adequacy of penalty (implied prior)

– Sparsity considerations (l1 penalty encourages sparsity)

• Computational considerations

l1, l2 and l∞ penalties in R²

[Figure: contours of the l1 penalty, l2 penalty, and l∞ penalty in R².]

Regularization parameter: balancing loss and

penalty

$$\hat\beta(\lambda) = \arg\min_\beta \sum_i C(y_i, x_i\beta) + \lambda \|\beta\|_q^q$$

Theoretical approaches to selecting λ:

• Bayesian: λ is “strength of prior”

• Frequentist: use loss + complexity penalty (Cp, AIC etc.)

Practical approach:

1. Solve for many (or all) values of λ.

2. Select based on cross-validation error

Equivalent constrained formulation

$$\hat\beta(S) = \arg\min_\beta \sum_i C(y_i, x_i\beta) \quad \text{s.t.} \quad \|\beta\|_q^q \le S$$

Both formulations are equivalent when loss and penalty are convex, with the following property:

$$\{\hat\beta(\lambda) : \lambda \in R\} \subset \{\hat\beta(S) : S \in R\}$$

Under most conditions we will consider, the two sets are actually equal.

We use both formulations interchangeably.

Illustration: Lasso and Huberized lasso

• n = 100, p = 80.

• All xij are i.i.d. N(0, 1) and the true model is:

$$y_i = 10 \cdot x_{i1} + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} 0.9 \cdot N(0, 1) + 0.1 \cdot N(0, 100)$$

• Sparsity implies l1 penalty is appropriate

• Compare l1-regularized paths using Huber’s loss and squared error loss

[Figure: coefficient paths for the Huberized lasso and for the lasso, each plotted against $\|\hat\beta(\lambda)\|_1$.]

Squared error curves for the two solution paths

[Figure: squared error of the LASSO and Huberized paths, plotted against $\|\hat\beta(\lambda)\|_1$.]

Boosting: warmup

• Introduced in the machine learning community by Freund and Schapire

(1996).

• Extremely successful in practice

• Main idea:

Iteratively build prediction model by fitting re-weighted versions of the data

– Weights emphasize badly fitted data points

– Each iteration builds a “weak” learner to model current weighted data

• Boosting can be interpreted as “coordinate descent” in high dimensional

predictor space (Mason et al 99, Friedman 2001)

Schematic of boosting

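[Figure: schematic of boosting. Starting from the training sample, classifiers up to $G_M(x)$ are fit to successive weighted samples and combined into the final prediction model $\sum_i \alpha_i G_i(x)$.]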

Boosting analysis: outline

• AdaBoost and its interpretations

– Boosting as gradient descent

– Margins view of boosting

• Relation of boosting to ℓ1-constrained optimization

• Convergence of ℓp-constrained optimization of classification loss functions to an “ℓp-margin” maximizing separator

• Conclusions:

– Boosting approximately corresponds to ℓ1-constrained optimization

– Classification boosting (AdaBoost and LogitBoost) “converges” to the ℓ1-optimal separator, compared to ℓ2-optimal for SVM

Schematic of Talk Structure

[Figure: diagram linking Boosting, Constrained Optimization, and Margins.]

Boosting basics

Given:

• Data $\{x_i, y_i\}_{i=1}^{n}$ with $x_i \in R^p$ and $y_i \in \{-1, +1\}$

• Convex loss criterion L(y, f)

• Dictionary H of “weak classifiers”, i.e. $\forall h \in H$, $h : R^p \to \{-1, +1\}$

– Example: all decision trees with up to k splits

Boosting basics (ctd)

We want to find a “good” linear combination:

$$F(x) = \sum_{h_j \in H} \beta_j h_j(x)$$

such that $\sum_i L(y_i, F(x_i))$ is small.

In boosting this is done incrementally, i.e. at step T our model is:

$$F_T(x) = \sum_{t \le T} \alpha_t h_t(x)$$

AdaBoost algorithm (Freund and Schapire 1995)

1. Initialize: $w_i \equiv 1$

2. While (improvement on test set)

(a) Look for $h_t = \arg\min_{h \in H} \sum_i w_i I\{y_i \ne h(x_i)\}$ (minimizes weighted misclassification error)

(b) Set $err_t = \frac{\sum_i w_i I\{y_i \ne h_t(x_i)\}}{\sum_i w_i}$

(c) Set $\alpha_t = \log\frac{1 - err_t}{err_t}$

(d) $w_i \leftarrow w_i \cdot \exp(\alpha_t I\{y_i \ne h_t(x_i)\})$

3. Output model $F(x) = \sum_t \alpha_t h_t(x)$ and classifier: sign(F(x))
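The algorithm can be sketched with decision stumps as the dictionary of weak classifiers. This is a minimal illustration under assumptions: the stopping rule is replaced by a fixed number of rounds, the weight on each round is taken as log((1 − err)/err), the stump search is exhaustive, and all names and the toy data are hypothetical.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Weak classifier: polarity * (+1 if x[feature] > threshold else -1)."""
    return polarity * np.where(X[:, feature] > threshold, 1, -1)

def best_stump(X, y, w):
    """Stump minimizing the weighted misclassification error."""
    best, best_err = None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = w[pred != y].sum() / w.sum()
                if err < best_err:
                    best, best_err = (feature, threshold, polarity), err
    return best, best_err

def adaboost(X, y, n_rounds):
    """Returns a list of (alpha_t, stump_t) pairs."""
    w = np.ones(len(y))                            # step 1: w_i = 1
    model = []
    for _ in range(n_rounds):                      # assumption: fixed number of rounds
        stump, err = best_stump(X, y, w)           # step (a)
        err = np.clip(err, 1e-12, 1 - 1e-12)       # guard against log(0)
        alpha = np.log((1 - err) / err)            # step (c)
        miss = stump_predict(X, *stump) != y
        w = w * np.exp(alpha * miss)               # step (d): upweight the mistakes
        model.append((alpha, stump))
    return model

def predict(model, X):
    """Classifier sign(F(x)) with F(x) = sum_t alpha_t h_t(x)."""
    F = sum(alpha * stump_predict(X, *stump) for alpha, stump in model)
    return np.sign(F)

# Toy data (hypothetical): an interval, which no single stump can classify.
X = np.linspace(-2, 2, 40).reshape(-1, 1)
y = np.where(np.abs(X[:, 0]) < 1, 1, -1)
_, single_stump_err = best_stump(X, y, np.ones(len(y)))
model = adaboost(X, y, n_rounds=100)
train_acc = np.mean(predict(model, X) == y)
```

On this toy problem the best single stump misclassifies a quarter of the points, while the weighted combination of stumps separates the interval from its complement.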

AdaBoost as Gradient Descent

It has been shown that AdaBoost is “coordinate descent” with exponential loss:

$$L(y, F_t(x)) = \exp(-y F_t(x))$$

The criterion for selecting the next ht is to minimize:

$$\frac{\partial \sum_i L(y_i, F_t(x_i))}{\partial \beta_j} = \langle \nabla L(F_t(x)), h_j(x) \rangle$$

or equivalently to maximize $\langle -\nabla L(F_t(x)), h_j(x) \rangle$.

ht is the best “canonical” improvement direction, to first order

The AdaBoost αt is chosen via a line search

• We will consider αt ≡ ε, which is “stronger”, empirically better and theoretically more tractable

Practical importance of boosting approaches

• Computationally friendly when |H| is large:

– Does not require second derivatives and matrix inversion.

– Greedy search algorithms allow finding best direction “approximately”

– Mainly in situations where there is no explicit β at all, rather a dictionary H from which a “best” member is chosen every time using heuristics (e.g. decision trees using greedy methods).

• Empirically shown to do very well

– AdaBoost (Freund and Schapire 95) and other boosting algorithms are

best “off the shelf” classifiers according to Breiman

Other gradient-based boosting algorithms

This methodology can be applied to any function estimation problem

• Friedman, Hastie and Tibshirani (2000) use binomial log-likelihood loss:

$$L(y, F_t(x)) = \log(1 + e^{-y F_t(x)})$$

• Friedman (2001) applies it to regression problems with various losses

• Rosset and Segal (NIPS 2002) apply it to density estimation with log-likelihood criterion: $L(F_t(x)) = -\log(F_t(x))$.

Margin Basics

• Margin of separating hyper-plane $\sum_{h_j \in H} \beta_j h_j(x) = 0$ is the Euclidean distance of the closest point:

$$\min_i \frac{y_i \beta' h(x_i)}{\|\beta\|_2}$$

• Non-regularized SVM solution maximizes minimal margin

• SVM literature: large margins ⇒ “small” prediction error

[Figure: illustration of the margin.]

Margins in Boosting

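• Boosting margin of model $F(x) = \sum_t \alpha_t h_t(x)$ is defined as:

$$\frac{y_i F(x_i)}{\sum_t |\alpha_t|} \in [-1, +1]$$

• Basis representation for finite |H|: $\sum_t \alpha_t h_t = \sum_{h_j \in H} \beta_j h_j$

• $\|\beta\|_1 = \sum_j |\beta_j| \le \sum_t |\alpha_t|$, with equality e.g. if $\alpha_t \ge 0 \; \forall t$ (monotonicity)

[Figure: illustration of the boosting margin.]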

The two margin definitions

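Euclidean distance (SVM margin) between data point and “hyper-plane” $\sum_{h_j \in H} \beta_j h_j(x) = 0$:

$$\frac{y_i \beta' h(x_i)}{\|\beta\|_2}$$

Normalized boosting margin:

$$\frac{y_i \beta' h(x_i)}{\|\alpha\|_1} = \frac{y_i \beta' h(x_i)}{\|\beta\|_2} \cdot \frac{\|\beta\|_2}{\|\beta\|_1} \cdot \frac{\|\beta\|_1}{\|\alpha\|_1}$$

Differences:

• ℓ1 vs ℓ2 norm: encourages “sparse” representations

• $\|\beta\|_1 \le \|\alpha\|_1$: sign consistency (“monotonicity”) assures equality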

Boosting as a margin-maximizing process

Boosting the Margin - (Schapire et al. 1998, Annals):

• Prove that “weak learnability” (=separability) increases margins

• Experimentally show boosting increases margins

• Discuss geometric interpretation

• Generalization error bounds for finite basis, infinite basis, as function of

margin distribution e.g.: with probability≥ 1− δ

PTe(yF ≤ 0) ≤ PTr(yF ≤ θ) + O(n−.5(log|H|).5θ−1log(δ)−.5)

Plenty of other papers about boosting and margins

Advantages(?) of margins view

• Explains behavior of Adaboost in separable case:

– Seeks to maximize minimal margin, consequently finds a “good”

separating hyper-plane - similar to SVM

– Loss criterion view does not give such intuitions:

any separating hyper-plane, scaled up, drives exponential loss to 0.

• Generalization error bounds as function of minimal margin:

– Breiman (97) directly maximized margins, attained bad generalization

performance

– That’s not surprising, since margin maximization is clearly overfitting

What we have learned so far

Next steps

Constrained (regularized) optimization

We want to find $\hat\beta(c)$ which achieves:

$$\min_{\|\beta\|_1 \le c} \sum_i L(y_i, \beta' h(x_i))$$

i.e. the optimal solution with ℓ1 norm c.

What is the relation of this solution to the ε-boosting solution with ℓ1 norm c (i.e. after c/ε iterations)?

Relation of boosting to regularized optimization

Consider the local “monotone” optimization problem:

$$\min L(\beta) \quad \text{s.t.} \quad \|\beta\|_1 - \|\beta_0\|_1 \le \varepsilon, \quad |\beta| \ge |\beta_0| \text{ component-wise}$$

It’s easy to see:

$$\lim_{\varepsilon \to 0} \frac{|(\beta - \beta_0)_k|}{\varepsilon} > 0 \;\Rightarrow\; k = \arg\max_j |\nabla L(\beta_0)_j| = \arg\max \langle -\nabla L(F_t(x)), h(x) \rangle$$

k is unique “almost everywhere” in our space, so we are choosing the direction of the best monotone path.

We may conjecture that if this “monotonicity” holds on the optimal path then ε-boosting converges to the optimal regularized path.

ε-Boosting and `1 constrained fitting

For squared error loss regression (from Efron et al. 2002):

Lasso: $\hat\beta(c) = \arg\min_{\|\beta\|_1 \le c} \|y - X\beta\|_2^2$

“Stagewise”: the ε-boosting coefficients

[Figure: two panels of coefficient paths for 10 variables, the second labeled Stagewise, plotted against $t = \sum_j |\hat\beta_j|$.]
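The “Stagewise” (ε-boosting) iteration for squared error loss can be sketched as follows, with the columns of X playing the role of the dictionary. The step size, number of steps, function name, and simulated data are all hypothetical choices for illustration.

```python
import numpy as np

def eps_stagewise(X, y, eps=0.01, n_steps=2000):
    """epsilon-boosting for squared error loss; returns the coefficient path."""
    beta = np.zeros(X.shape[1])
    path = [beta.copy()]
    for _ in range(n_steps):
        neg_grad = X.T @ (y - X @ beta)        # minus the gradient of 0.5 * ||y - X beta||^2
        j = np.argmax(np.abs(neg_grad))        # best coordinate direction, to first order
        beta[j] += eps * np.sign(neg_grad[j])  # small fixed step instead of a line search
        path.append(beta.copy())
    return np.array(path)

# Simulated, standardized data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0, 0.0]) + 0.5 * rng.normal(size=100)
y = y - y.mean()

eps, n_steps = 0.01, 2000
path = eps_stagewise(X, y, eps=eps, n_steps=n_steps)
l1_norms = np.abs(path).sum(axis=1)            # ||beta||_1 along the path
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```

After k steps the ℓ1 norm of the coefficients is at most kε, which is why the path can be indexed by ℓ1 norm; run long enough, the iterates in this sketch settle within roughly ε of the least squares solution.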

What about other loss functions?

For classification with binomial log-likelihood loss:

ℓ1 constrained solutions (left), ε-boosting path (right)

[Figure: exact constrained solution (left) and ε-stagewise path (right), each plotted against $\|\beta\|_1$.]

Partial theoretical results

Denote:

$$\hat\beta(c) = \arg\min_{\|\beta\|_1 \le c} \sum_i L(y_i, \beta' h(x_i))$$

$\hat\beta^{(\varepsilon)}(c)$ is the ε-boosting coefficient vector for ℓ1 norm c.

Theorem 1 If $\hat\beta(c)$ is strongly monotone in all coordinates ∀c < c0, then

$$\lim_{\varepsilon \to 0} \hat\beta^{(\varepsilon)}(c_0) = \hat\beta(c_0)$$

• Much stronger condition on derivatives along the optimal path

We also have a “local” result:

Theorem 2 Under monotonicity only, if we denote by γ(ε) the ε-stagewise “direction” starting from $\hat\beta(c_0)$ then:

$$\lim_{\varepsilon \to 0} \gamma(\varepsilon) = \frac{d\hat\beta(c)}{dc}\Big|_{c = c_0}$$

• (Efron et al 02) proved for squared error loss, we generalized to any convex loss

ℓp constrained classification losses

Consider the constrained optimization problem:

$$\hat\beta^{(p)}(c) = \arg\min_{\|\beta\|_p \le c} \sum_i L(y_i, \beta' h(x_i))$$

With the loss being either the exponential or log-likelihood:

$$L_e(y, \beta' h(x)) = \exp(-y \beta' h(x))$$

$$L_l(y, \beta' h(x)) = \log(1 + \exp(-y \beta' h(x)))$$

Convergence to “ℓp-optimal” separating hyper-plane

Define:

$$\hat\beta^{(p)} = \lim_{c \to \infty} \hat\beta^{(p)}(c)$$

Theorem 3 If the data is separable, then with either Le or Ll,

$$\hat\beta^{(p)} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta' h(x_i)$$

Interpretation: the normalized constrained optimizer “converges” to an “ℓp-margin maximizing” separating hyper-plane

Boosting interpretation

We can conclude that ε-boosting tends to converge to the ℓ1-margin maximizing separating hyper-plane

[Figure: minimal margins (left) and test error (right) plotted against $\|\beta\|_1$, for exponential loss, logistic loss, and AdaBoost.]

Boosting and support vector machines

In the separable case:

• SVM non-regularized solution is $\hat\beta^{(2)}$

• Boosting non-regularized solution is $\hat\beta^{(1)}$

• Differences:

– Boosting margin vs. SVM margin (Euclidean distance)

– Different loss functions ⇒ different regularized paths

• “ℓ2 ε-boosting” follows a different regularized path to the “SVM” solution

– Choose coefficient to change according to $\max_h -\nabla L(F_t(X))' h(X)$

In the non-separable case even the non-regularized solutions would be different

Simple data example

Same example as before with additional large mass (20 observations) at “far”

[Figure: experiment data.]

Convergence of ℓ1 and ℓ2 boosting paths to optimal separator

[Figure: normalized L1-boosting coefficients plotted against $\|\beta\|_1$ (legend: boost var1, boost var2, opt var1, opt var2), and normalized L2-boosting coefficients plotted against $\|\beta\|_2$ (legend: boost var1, boost var2, svm var1, svm var2, opt var1, opt var2).]

More interesting example: Boosting vs. ℓ2 boosting

[Figure: two panels, Boosting and ℓ2 boosting. Legend of the first panel: optimal, boost 10^5 iter, boost 3*10^6 iter. Legend of the second panel: optimal, boost 5*10^6 iter, boost 10^8 iter.]

Summary

• Boosting related to ℓ1-constrained fitting

– Can define ℓp boosting algorithms to correspond to ℓp constraints

• ℓp constrained classification loss solutions converge to “ℓp-margin” maximizers in separable case

– Has implication on understanding of logistic regression

• A common thread for boosting and SVM:

Computational trick for regularized fitting in high dimensional predictor spaces

– SVM: kernel trick (ℓ2 regularization)

– Boosting: coordinate descent (approximate ℓ1 regularization)

ℓ1 regularization: CIMAT, Jan. 2007 Saharon Rosset 1

ℓ1 regularization: properties and computations

Saharon Rosset (IBM Research)

Collaborators: Ji Zhu (Michigan), Trevor Hastie, Rob Tibshirani (Stanford), Nathan Srebro (TTI), Grzegorz Swirszcz (IBM Research)

Results on ℓ1 regularization

• Sparsity

• Piecewise linearity

• Applicability in very high or infinite dimensional embedding spaces

The regularized solution path

Fixing the loss, penalty and data, and varying the regularization parameter we get the “path of solutions”

$$\hat\beta(\lambda), \quad 0 \le \lambda < \infty$$

This is a 1-dim curve through $R^p$.

• Interesting statistically, as the set of solutions to problems of interest (Bayesian interpretation: changing prior variance)

• Often interesting computationally, as it has properties which allow efficient “tracking” of this path

Example: Lasso solution path in R^10

[Figure: lasso coefficient paths for 10 variables.]

(from Efron et al. (2004). Least Angle Regression. Annals of Statistics)

Sparseness propert(ies) of ℓ1 regularized path

ℓ1, ℓ2 and ℓ∞ penalties in R^2

[Figure: the ℓ1, ℓ2 and ℓ∞ penalties in R^2.]

Sparseness of ℓ1 penalty: n > p

The shape of the ℓ1 penalty implies sparseness. For large values of λ only a few coefficients are non-zero.

[Figure: lasso coefficient paths for 10 variables.]

Sparseness: p > n

For any convex loss, assuming only “non-redundancy”:

Theorem (e.g., Rosset et al. 2004)

Any ℓ1 regularized solution has at most n non-zero components

Proof: Simple application of Caratheodory’s Convex Hull Theorem.

Corollary

The limiting interpolating (or margin maximizing) solution also has at most n non-zero components

Some implications of sparseness

• Variable selection (obviously)

• ℓ1-regularized problems are “easier” than, say, ℓ2-regularized

– Can give good solutions in p >> n situations

Friedman, Hastie, Rosset, Tibshirani, Zhu (2004). Discussion of three boosting papers. Annals of Statistics

Ng (2004). Feature selection, ℓ1 vs ℓ2 regularization and rotational invariance. ICML-04

Piecewise Linear Solution Paths

Piecewise linear property

We want to identify situations where the path of solutions β(λ), 0 ≤ λ < ∞ is easy to generate.

One such situation is when β(λ) is piecewise linear in λ.

[Figure: coefficient paths against ‖β‖1.]

Primary example: the lasso

(Efron et al 03), (Osborne et al 00) show that for the lasso:

$$\beta(\lambda) = \arg\min_\beta \sum_i (y_i - x_i\beta)^2 + \lambda\|\beta\|_1$$

β(λ) is piecewise linear in λ.

• Yields an efficient algorithm for finding β(λ), 0 ≤ λ < ∞

– Cost is “approximately” one least-squares calculation
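A minimal sketch of the piecewise linearity claim, under a simplifying assumption that is not made in the slides: X has orthonormal columns. In that special case the lasso objective separates by coordinate and the solution is the soft-threshold sign(z_j)·max(|z_j| − λ/2, 0) with z = X^T y, so each coefficient is linear in λ until it reaches zero. The name `lasso_orthonormal` and the random data are illustrative only.

```python
import numpy as np

def lasso_orthonormal(X, y, lam):
    """Lasso solution for sum((y - X b)^2) + lam * ||b||_1, assuming X.T @ X = I."""
    z = X.T @ y
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

rng = np.random.default_rng(0)
# Orthonormal columns via QR of a random matrix (illustrative data only).
X, _ = np.linalg.qr(rng.normal(size=(20, 4)))
y = rng.normal(size=20)

# Knots of the path: coefficient j reaches zero at lam = 2 * |z_j|.
knots = np.sort(2.0 * np.abs(X.T @ y))

# Two values of lam on the same linear piece (both below the first knot).
lam_a, lam_b = 0.2 * knots[0], 0.8 * knots[0]
mid = lasso_orthonormal(X, y, 0.5 * (lam_a + lam_b))
avg = 0.5 * (lasso_orthonormal(X, y, lam_a) + lasso_orthonormal(X, y, lam_b))
```

On a single linear piece the solution at the midpoint of two λ values equals the average of the two solutions, which is what `mid` and `avg` compare. For general X the pieces are no longer available in closed form, which is where the path algorithms come in.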

Some properties of the Lasso regularized path

1. Sparsity: if p > n, any regularized solution β(λ) has at most n non-0 coefficients (property of ℓ1 penalty)

2. High correlation:

$$\beta(\lambda)_j \neq 0 \;\Rightarrow\; \Big|\underbrace{x_j^T (y - X\beta(\lambda))}_{\partial C(\beta)/\partial\beta_j \,|_{\beta=\beta(\lambda)}}\Big| = \lambda$$

3. Compactness: the number of “pieces” in the path is approximately min(n, p).

[Figure: coefficient paths against ‖β‖1.]

Our key questions:

• What is the fundamental property of (loss, penalty) pairs which yields piecewise linearity?

• Are there efficient algorithms to generate these regularized paths?

• Are there statistically interesting members in these families?

What makes paths piecewise linear?

Assume loss and penalty are both twice differentiable everywhere.

With some algebra we get:

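$$\frac{\partial\beta(\lambda)}{\partial\lambda} = -\left(\nabla^2 C(\beta(\lambda)) + \lambda\nabla^2 J(\beta(\lambda))\right)^{-1}\nabla J(\beta(\lambda))$$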

We want this derivative to be constant, thus:

A sufficient condition for piecewise linearity is that:

• The loss C is piecewise quadratic

• The penalty J is piecewise linear
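The derivative formula for the path can be checked numerically on a case where loss and penalty really are twice differentiable everywhere. The sketch below assumes squared error loss with a ridge penalty J(β) = ‖β‖², chosen only because it is smooth (the ℓ1 penalty is not differentiable at 0, so there the formula applies piece by piece). The data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
G = X.T @ X

def beta(lam):
    # Minimizer of ||y - X b||^2 + lam * ||b||^2 (both terms twice differentiable).
    return np.linalg.solve(G + lam * np.eye(5), X.T @ y)

lam = 0.7
hess_C = 2.0 * G                 # Hessian of the loss C
hess_J = 2.0 * np.eye(5)         # Hessian of the penalty J
grad_J = 2.0 * beta(lam)         # gradient of J at beta(lam)

# d beta / d lam = -(hess_C + lam * hess_J)^{-1} grad_J
formula = -np.linalg.solve(hess_C + lam * hess_J, grad_J)

# Central finite difference of the path for comparison.
h = 1e-6
finite_diff = (beta(lam + h) - beta(lam - h)) / (2 * h)
```

Here the derivative still depends on λ through β(λ) and the λ∇²J term, so the ridge path is curved. With a piecewise quadratic loss and a piecewise linear penalty, ∇²C and ∇J are locally constant and ∇²J vanishes, which is why the derivative is constant on each piece.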

Building blocks for PWL regularized optimization problems

Piecewise quadratic loss:

• Squared error loss: regression: (y − f)^2, classification: (1 − yf)^2

• Huberized squared error loss (robust):

$$C(y, x\beta) = \begin{cases} (y - x\beta)^2 & \text{if } |y - x\beta| \le m \\ m^2 + 2m(|y - x\beta| - m) & \text{otherwise} \end{cases}$$

• Piecewise linear loss: regression: |y − f|, classification: (1 − yf)_+

Piecewise linear penalty:

• ℓ1 penalty: J(β) = Σ_j |β_j| (gives sparse solutions)

• ℓ∞ penalty: J(β) = max_j |β_j|
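As a small illustration of the Huberized squared error loss listed above, the sketch below evaluates it and checks numerically that value and slope match at the knot |y − xβ| = m, which is what makes it piecewise quadratic yet differentiable. The function name and the choice m = 1.5 are assumptions made for the example.

```python
import numpy as np

def huberized_squared_error(y, fitted, m):
    """Huberized squared error: quadratic for |residual| <= m, linear beyond."""
    r = np.abs(np.asarray(y, dtype=float) - np.asarray(fitted, dtype=float))
    return np.where(r <= m, r ** 2, m ** 2 + 2.0 * m * (r - m))

m = 1.5
eps = 1e-6
at_knot = huberized_squared_error(m, 0.0, m)

# Value and one-sided slopes just inside and just outside the knot.
inside = huberized_squared_error(m - eps, 0.0, m)
outside = huberized_squared_error(m + eps, 0.0, m)
slope_inside = (at_knot - inside) / eps
slope_outside = (outside - at_knot) / eps
```

Both one-sided slopes are close to 2m, so large residuals contribute linearly rather than quadratically, which is the source of the robustness.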

Some Interesting Examples

Regression: the Huberized lasso vs. the lasso

[Figure, two panels: coefficient paths against ‖β(λ)‖1 for the Huberized lasso and the lasso.]

Squared error loss with ℓ∞ penalty

[Figure: coefficient paths against ||β||∞.]

Classification: 1-norm and 2-norm Support Vector Machines

[Figure, two panels: 1-norm SVM coefficient paths against ‖β‖1; 2-norm SVM coefficient paths against ‖β‖2^2.]

Multiple penalty problem: Protein Mass Spectroscopy

(Tibshirani et al, in preparation)

• Predictors are “expression levels” along a spectrum of masses for proteins.

• Want to constrain the model while keeping coefficients “smooth”.

• Solution: ℓ1 penalty on coefficients, ℓ1 penalty on successive differences:

$$\beta(\lambda_1, \lambda_2) = \arg\min_\beta \sum_i (y_i - x_i\beta)^2 + \lambda_1\|\beta\|_1 + \lambda_2 \sum_j |\beta_j - \beta_{j-1}|$$

• Solution path is piecewise affine in (λ1, λ2)

Almost quadratic loss with ℓ1 penalty

Almost quadratic loss

We define almost quadratic loss as:

C(r) = a2(r) · r^2 + a1(r) · r + a0(r)

where:

• a2, a1, a0 : R → R are piecewise constant functions

• C(r) is (once) differentiable everywhere

• r = (y − xβ) the residual for regression

• r = yxβ the margin for classification
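To make the definition concrete, here is a sketch that writes the Huberized squared error loss in the almost quadratic form, with piecewise constant coefficients worked out by expanding m² + 2m(|r| − m) = 2m|r| − m². The knot m and all function names are assumptions made for the illustration; the direct piecewise definition is included only as a cross-check.

```python
import numpy as np

def huber_coefficients(r, m):
    """Piecewise constant (a2, a1, a0) so that C(r) = a2*r^2 + a1*r + a0."""
    r = np.asarray(r, dtype=float)
    inner = np.abs(r) <= m
    a2 = np.where(inner, 1.0, 0.0)
    a1 = np.where(inner, 0.0, 2.0 * m * np.sign(r))
    a0 = np.where(inner, 0.0, -m ** 2)
    return a2, a1, a0

def almost_quadratic(r, m):
    r = np.asarray(r, dtype=float)
    a2, a1, a0 = huber_coefficients(r, m)
    return a2 * r ** 2 + a1 * r + a0

def huber_direct(r, m):
    # Direct piecewise definition, used here only as a cross-check.
    a = np.abs(np.asarray(r, dtype=float))
    return np.where(a <= m, a ** 2, m ** 2 + 2.0 * m * (a - m))

r_grid = np.linspace(-4.0, 4.0, 81)
values = almost_quadratic(r_grid, 1.0)
```

The coefficients take only three distinct values each, one per region (r < −m, |r| ≤ m, r > m), which matches the requirement that a2, a1, a0 be piecewise constant.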

Motivation for this family

• Piecewise linear solution paths

• ℓ1 penalty ⇒ sparse solutions

• Allows efficient, relatively simple algorithm

• Includes robust loss functions for regression and classification

Algorithm

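• Initialize: $\beta = 0$, $A = \arg\max_j |(\nabla L(\beta))_j|$, $\gamma = -\mathrm{sgn}(\nabla L(\beta))_A$

• While $\max|\nabla L(\beta)| > 0$:

– $d_1 = \arg\min_{d>0} \min_{j \notin A} |\nabla L(\beta + d\gamma)_j| = |\nabla L(\beta + d\gamma)_A|$ (a variable outside A attains equality)

– $d_2 = \arg\min_{d>0} \min_{j \in A} (\beta + d\gamma)_j = 0$ (hit 0)

– $d_3 = \arg\min_{d>0} \min_i\, r(y_i, x_i(\beta + d\gamma))$ hits a “knot”

– set $d = \min(d_1, d_2, d_3)$

– If $d = d_1$ then add the variable attaining equality at d to A.

– If $d = d_2$ then remove the variable attaining 0 at d from A.

– $\beta \leftarrow \beta + d\gamma$

– $B = \sum_i a(f(y_i, x_i\beta))\, x_{A,i}' x_{A,i}$

– $\gamma = B^{-1}(-\mathrm{sgn}(\nabla L(\beta))_A)$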

Loss functions of interest: robust, differentiable

Linear for outliers, squared around “corners”:

• Regression: Huberized squared error loss

• Classification: Huberized squared hinge loss:

[Figure: hinge loss (svm) vs. Huberized squared hinge loss (almost quadratic).]

Computational complexity

Calculations in each step of our algorithm:

• Step size: find the length of current “piece”

– O(np) calculations (for each observation, figure when it hits a “knot”)

• Direction calculation: calculate the direction of the next “piece”

– O(min(n, p)^2), using the Sherman-Morrison-Woodbury updating formula

Number of steps of the algorithm:

• Difficult to bound in “worst case”

• Under mild assumptions it’s O(n).

Computational complexity (ctd.)

Overall complexity is thus O(n^2 p) for both n > p and n < p

Compare to least squares calculation:

• O(np^2) when n > p.

• O(n^3) when n < p.

Example: “Dexter” dataset (NIPS 03 challenge)

• n = 800 observations

• p = 1152 variables

• Use Huberized squared hinge loss

• Path has 452 “pieces”

• Inefficient R implementation takes about 3 minutes to generate the path on a laptop.
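For a rough sense of scale, the orders of growth quoted for the path algorithm and for least squares can be evaluated at the Dexter dimensions. This is a back-of-the-envelope sketch only: it ignores all constant factors, so the numbers are orders of magnitude and not operation counts of any real implementation.

```python
n, p = 800, 1152              # Dexter dimensions quoted above

path_cost = n ** 2 * p        # O(n^2 p): order of the full path computation
ls_cost = n ** 3              # O(n^3): order of least squares when n < p
ratio = path_cost / ls_cost   # simplifies to p / n
```

With these dimensions the two orders differ only by the factor p/n, which fits the earlier description of the path cost as “approximately” one least-squares calculation.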

Validation error and number of non-0 coefficients

[Figure: validation error and number of non-0 coefficients against ‖β‖1.]

Summary

• Regularization is fundamental in modern data modeling

• Considerations in selecting a specific formulation:

– Statistical: robustness (loss), sparsity (penalty)

– Computational: efficient computation

• Piecewise linear solution paths offer solutions that are:

– Robust: select appropriate loss function

– Adaptable: select regularization parameter adaptively

– Efficient: generate whole regularized path efficiently

ℓ1 regularization in infinite dimensional feature spaces

Outline

• Regularized embeddings: kernels, boosting and all that

• Generalizing ℓ1 regularization to non-countable dimension as a measure constraint

• Properties of ℓ1 regularized solutions in infinite dimensions:

– Existence

– Sparsity: existence of finite-support optimal solution

– Optimality criteria

• Practical, exact ℓ1 regularization in very high dimension via path following

• Example: additive quadratic splines

Regularized fitting

Generic supervised learning problem, given:

• x1, ..., xn ∈ R^p (or simply X of size n × p)

• y ∈ R^n for regression, y ∈ {±1}^n for classification

Find model y ≈ f(x)

Linear models set f(x) = x^T β and often use regularized fitting:

$$\beta = \arg\min_{\beta\in\mathbb{R}^p} L(y, X\beta) + \lambda J(\beta) \quad (\text{or } \min L \text{ s.t. } J \le C)$$

where L (loss) and J (penalty) are typically convex

J(β) = ‖β‖_q is a typical choice, usually q ∈ {1, 2}

E.g.: Ridge regression, LASSO, Linear SVM,...

Data embedding

We can increase the representation power of a linear model by embedding the data into a high dimensional space and fitting linear models there:

x → φ(x) ∈ R^Ω (typically |Ω| >> p)

f(x) = φ(x)^T β

where Ω is the index set of the features in the high dimensional space

Simple example: p = 2 (+ intercept/bias), Ω is the set of degree-2 polynomials

x = (1, x1, x2)

φ(x) = (1, x1, x2, x1^2, x2^2, x1x2)
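The degree-2 embedding above is small enough to write out directly. The sketch below implements φ for x = (1, x1, x2) and evaluates the linear model f(x) = φ(x)^T β in the embedded space; the coefficient vector is hypothetical and chosen only to give a concrete number.

```python
import numpy as np

def phi(x):
    """Map x = (1, x1, x2) to (1, x1, x2, x1^2, x2^2, x1*x2)."""
    _, x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

def f(x, beta):
    # Linear model in the embedded space: f(x) = phi(x)^T beta.
    return phi(x) @ beta

beta = np.array([0.5, 1.0, -1.0, 0.0, 2.0, 0.0])   # hypothetical coefficients
embedded = phi((1.0, 2.0, 3.0))
value = f((1.0, 2.0, 3.0), beta)
```

The model stays linear in β even though it is quadratic in (x1, x2), so any linear fitting procedure can be reused unchanged on the embedded data.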

Examples of embedding-based methods

• Kernel methods: φ is often not explicitly defined but given implicitly through an inner product kernel: K(x, y) = <φ(x), φ(y)>. Ω is usually infinite.

• Wavelets: φ(x) is the wavelet basis values at x.

• Boosted trees: φ(x) is the set of all trees of a certain size, evaluated at x. Ω can be made finite.

• Spline dictionary: with x ∈ [0, 1], Ω = [0, 1] and φ(x) = {(x − a)_+^k : a ∈ Ω}. Infinite (non-countable) dictionary.

Embedding+regularization: kernel methods, boosting

Some of the most successful “modern” methods seem to rely on the right combination of embedding and regularization:

• Kernel methods: implicit embedding into RKHS + exact ℓ2 regularization + representer theorem

⇒ computational and statistical success

• Boosting: embedding into space of trees + (very) approximate ℓ1 regularization + incremental implementation

⇒ computational and statistical success

What about exact ℓ1 regularization in embeddings?

ℓ1 or ℓ2 regularization?

Good question! Detailed discussion is outside our scope...

Easy answer (as always): be Bayesian

One important aspect is the sparsity property of ℓ1 regularization:

Sparsity property

If |Ω| > n is finite, then any ℓ1 regularized problem has a solution β containing at most n non-zero entries.

Does this still hold when Ω is infinite?

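Generalizing ℓ1 regularization

We start from:

$$\min_\beta \sum_i L(y_i, \phi(x_i)^T\beta) \quad \text{s.t. } \|\beta\|_1 \le C$$

By doubling the number of variables, β_j = β_{j,+} − β_{j,−}, and adding positivity constraints, we can replace the norm by a sum:

$$\min_{\beta_+,\beta_-} \sum_i L\big(y_i, \phi(x_i)^T(\beta_+ - \beta_-)\big) \quad \text{s.t. } \sum_j (\beta_{j,+} + \beta_{j,-}) \le C,\; \beta_+, \beta_- \succeq 0$$

Now we replace the sum by a positive measure:

$$\min_{P\in\mathcal{P}} \sum_i L\Big(y_i, \int_\Omega \phi_\omega(x_i)\,dP(\omega)\Big) \quad \text{s.t. } P(\Omega) \le C$$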
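The variable-doubling step can be illustrated on a single coefficient vector. The sketch below uses the minimal split, where β+ holds the positive parts and β− the negative parts; for that split the sum of the doubled variables equals ‖β‖1, and any other non-negative split with the same difference has a larger sum, which is why the sum constraint can stand in for the norm constraint. The helper name `split_signed` and the example vector are illustrative.

```python
import numpy as np

def split_signed(beta):
    """Split beta into non-negative parts with beta = beta_plus - beta_minus."""
    beta = np.asarray(beta, dtype=float)
    return np.maximum(beta, 0.0), np.maximum(-beta, 0.0)

beta = np.array([1.5, -2.0, 0.0, 0.25])
beta_plus, beta_minus = split_signed(beta)

# Total mass of the doubled variables; this is the finite analogue of P(Omega).
total_mass = np.sum(beta_plus + beta_minus)
```

In the measure formulation the pair (β+, β−) becomes a positive measure over the (doubled) feature index set, and `total_mass` is its total mass.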

Understanding our generalization

A probability measure requires a probability space, hence a σ-algebra Σ over Ω.

We require {{ω} : ω ∈ Ω} ⊂ Σ

• If Ω is finite or countable this implies Σ = 2^Ω and hence P(Ω) = ‖β‖1 as required

• In the non-countable case this still works!

When does an optimal solution exist?

Theorem

If the set {φ_ω(X) : ω ∈ Ω} ⊂ R^n is compact, then our problem has an optimal solution

Corollary

If the set Ω is compact and the mapping φ_·(X) : Ω → R^n is continuous, then our problem has an optimal solution.

Bottom line: under mild conditions, an optimal solution is guaranteed to exist.

The sparsity property in infinite dimension

Theorem:

Assume an optimal solution exists; then there exists an optimal solution P(C) supported on at most n + 1 features in Ω.

Main idea of proof:

- Consider A = {φ_ω(X) : ω ∈ Ω} ⊂ R^n

- Show that any z ∈ co(A) (convex hull) can be represented as a convex combination of n + 1 points (for finite Ω this is just Caratheodory’s convex hull theorem)

⇒ any infinite-support measure can be approximated by one supported on n + 1 features
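For the finite case, the Caratheodory step in the proof idea can be carried out explicitly. The sketch below is one standard constructive version, not necessarily the argument used in the talk: while more than n + 1 points carry weight, it finds a vector μ with Σ μ_i z_i = 0 and Σ μ_i = 0 and shifts the weights along μ until one of them reaches zero. The function name, tolerance and random data are assumptions made for the example.

```python
import numpy as np

def caratheodory_reduce(points, weights, tol=1e-12):
    """Rewrite a convex combination in R^n using at most n + 1 of the points."""
    pts = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = pts.shape[1]
    keep = w > tol
    pts, w = pts[keep], w[keep]
    while len(w) > n + 1:
        # Find mu != 0 with sum_i mu_i * pts_i = 0 and sum_i mu_i = 0.
        A = np.vstack([pts.T, np.ones(len(w))])
        mu = np.linalg.svd(A)[2][-1]       # a null-space vector of A
        if not np.any(mu > tol):
            mu = -mu
        pos = mu > tol
        t = np.min(w[pos] / mu[pos])       # largest step keeping weights >= 0
        w = w - t * mu                     # combination and total weight unchanged
        keep = w > tol                     # drop the point(s) whose weight hit 0
        pts, w = pts[keep], w[keep]
    return pts, w

rng = np.random.default_rng(2)
points = rng.normal(size=(10, 2))          # 10 illustrative points in R^2
weights = rng.dirichlet(np.ones(10))       # a convex combination of them
z = weights @ points
sub_points, sub_weights = caratheodory_reduce(points, weights)
```

Each pass removes at least one point while leaving the combined point z and the total weight unchanged, so the loop ends with at most n + 1 points (here 3, since the points live in R^2).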

Optimality criterion

Suppose we are presented with a finite-support solution P(C). How can we verify it is optimal?

Answer: we only need to verify it is optimal in any finite feature set containing its support

Theorem

If an optimal solution to the regularized problem exists, and we are presented with a finite-support candidate solution P supported on A = {ω1, ..., ωk} with k ≤ n + 1, then:

P is an optimal solution ⇔ ∀B ⊂ Ω s.t. A ⊆ B, |B| < ∞, P is an optimal solution for the problem in P_B

Summary of mathematical/statistical properties we prove

• Under a boundedness + continuity condition an optimal solution exists

• There is always a sparse optimal solution with at most n + 1 features

• Given a finitely supported solution, we can test its optimality by considering only finite problems on supersets of its support

Now, can we actually find the solution?

Path following algorithms

Some regularized problems can take advantage of looking at the solution set {β(λ) : λ ∈ R} as a path in R^|Ω| and following it efficiently:

• Lasso (quadratic loss + ℓ1 penalty): LARS-Lasso of Efron et al. (2004) (also earlier work from Osborne et al.)

• SVM by Hastie et al. (2004), LP-SVM by Zhu et al. (2004)

• etc.

Lasso and LARS

Lasso:

$$\beta(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1,$$

with X ∈ R^{n×p}, y ∈ R^n, β ∈ R^p.

LARS-Lasso (Efron et al 2004) is a homotopy algorithm to generate the path β(λ) for all λ efficiently. Algebraically, we can derive LARS-Lasso from the KKT conditions:

$$\beta(\lambda)_j \neq 0 \;\Rightarrow\; |X_j^T(y - X\beta(\lambda))| = \lambda \quad (1)$$

$$\beta(\lambda)_j = 0 \;\Leftarrow\; |X_j^T(y - X\beta(\lambda))| < \lambda \quad (2)$$
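Conditions (1) and (2) can be checked numerically on a fitted lasso solution. The sketch below uses plain coordinate descent, not LARS-Lasso, and assumes the objective is scaled as 0.5·‖y − Xβ‖² + λ‖β‖1 so that the threshold on the correlations is exactly λ (with the unscaled squared loss it would be λ/2). The function names, the simulated data and the choice of λ are all illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, sweeps=500):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    beta = np.zeros(X.shape[1])
    resid = y.astype(float).copy()
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            resid += X[:, j] * beta[j]            # residual without coordinate j
            beta[j] = soft_threshold(X[:, j] @ resid, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]            # residual with updated coordinate
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 8))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)
lam = 0.3 * np.max(np.abs(X.T @ y))               # a value inside the path

beta = lasso_cd(X, y, lam)
corr = np.abs(X.T @ (y - X @ beta))               # |X_j^T (y - X beta)| per variable
active = beta != 0
```

At convergence the entries of `corr` for active variables sit at λ and the remaining entries do not exceed λ, which is the pattern LARS-Lasso maintains along the whole path.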

Schematic of LARS-Lasso

1. Preliminaries

2. Loop:

(a) Find next variable to add to active set A: d_add, the step size such that a variable not in A attains equality in (1)

(b) Find next variable to remove from active set: d_rem, the step size such that a coefficient from the active set hits 0

(c) Make step min(d_add, d_rem), modify active set accordingly

(d) Calculate new LARS direction:

$$\gamma = -(X_A^T X_A)^{-1}\,\mathrm{sgn}(X_A^T(y - X\beta(\lambda)))$$

Can we do LARS-Lasso in infinite dimensional embeddings?

Going back to the schematic of LARS-Lasso:

Only finding d_add requires considering high dimension

Therefore, if:

1. We have sparsity ✓

2. We can search over Ω for the next feature efficiently

⇒ we can apply LARS-Lasso and find the full path (optimality guaranteed by our criterion)

Search problem for LASSO

Formally

$$d_{add} = \min\{d > 0 : \exists\,\omega \notin A,\; -\phi_\omega(X)^T\big(y - \phi_A(X)\beta(\lambda_0) - d\,\phi_A(X)\gamma_A\big) = \lambda_0 - d\}$$

We can re-write it as d_add = min_{ω∈Ω−A} d(ω), where d(ω) is the value attaining equality for the dictionary function indexed by ω. Specifically we get:

$$d(\omega) = \frac{\phi_\omega(X)^T r + \lambda(\beta)}{\phi_\omega(X)^T\phi_A(X)\gamma_A + 1}$$
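The expression for d(ω) follows from solving the equality condition for d, and this can be confirmed on a small example. The sketch below reads r as the current residual y − φ_A(X)β(λ0) and λ(β) as the current level λ0; both readings are assumptions, since the slide does not define them. All numbers are hand-made for illustration.

```python
import numpy as np

def step_to_equality(phi_w, r, lam, Phi_A, gamma_A):
    """d(omega): step at which the candidate feature phi_w attains equality."""
    return (phi_w @ r + lam) / (phi_w @ (Phi_A @ gamma_A) + 1.0)

# Small hand-made numbers, chosen only for illustration.
phi_w = np.array([1.0, 0.0, 2.0])        # candidate feature evaluated on the data
r = np.array([0.5, 1.0, -1.0])           # current residual (assumed meaning of r)
lam = 2.0                                # current regularization level
Phi_A = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])           # active features evaluated on the data
gamma_A = np.array([0.5, -0.25])         # current direction on the active set

d = step_to_equality(phi_w, r, lam, Phi_A, gamma_A)

# Left-hand side of the equality condition after a step of size d.
lhs = -phi_w @ (r - d * (Phi_A @ gamma_A))
```

In a search over a finite candidate set one would evaluate `step_to_equality` for every candidate and keep the smallest positive value; over an infinite Ω that minimization is the feature search problem.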

Spline bases

Assume our data points xi are in [0, 1].

A polynomial spline of order k is a piecewise polynomial of degree k − 1 with k − 2 continuous derivatives.

E.g. a second order spline is a piecewise linear continuous function.

Dictionary for kth order spline:

{1, x, ..., x^{k−2}, x^{k−2}, (x − a)_+^{k−1} : a ∈ (0, 1]}
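The truncated power functions in the dictionary are easy to evaluate for any finite set of knots. The sketch below computes (x − a)_+^{k−1} on a grid of points for a hypothetical pair of knots; it assumes k ≥ 2 and covers only the truncated power part of the dictionary, not the polynomial terms.

```python
import numpy as np

def truncated_power(x, knots, k):
    """Evaluate (x - a)_+^(k-1) for each point x and each knot a (assumes k >= 2)."""
    x = np.asarray(x, dtype=float)[:, None]
    a = np.asarray(knots, dtype=float)[None, :]
    return np.maximum(x - a, 0.0) ** (k - 1)

x = np.array([0.0, 0.5, 1.0])
knots = np.array([0.25, 0.75])           # a finite, hypothetical subset of (0, 1]
basis = truncated_power(x, knots, k=3)   # piecewise quadratic case
```

Each column of `basis` is one dictionary function evaluated at the data points, which is the role played by φ_ω(X) in the feature search.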

Total-variation penalties and regularized splines

Start from the general nonparametric problem with x ∈ R:

$$f = \arg\min_{f\in C^{(k-1)}} \sum_i (y_i - f(x_i))^2 + \lambda\,TV(f^{(k-1)})$$

Most general result:

Theorem (e.g. Mammen & van de Geer 97)

The optimal solution f can be represented as a k-th order spline with at most n knots

Since roughly TV(f^{(k−1)}) = (k − 1)! · P(Ω), our results prove this theorem in one line!

What do we know about TV-penalized spline solutions?

• For k < 3 can show (Mammen and VDG 97) that this spline has knots at the data points: an ℓ1 “representer” theorem!

• They propose efficient algorithms for solving with k ∈ {1, 2}; these can be rephrased as versions of LARS-Lasso with n variables (constant/linear spline basis)

• For k ≥ 3 they only offer a LARS-like approximate algorithm with knots at data points

But if we can solve the next feature search problem, we can apply our algorithm and get the exact solution path

Feature search problem for the k = 3 case (piecewise quadratic)

We want to minimize over Ω:

$$d(\omega) = \frac{\phi_\omega(X)^T r + \lambda(\beta)}{\phi_\omega(X)^T\phi_A(X)\gamma_A + 1}$$

This is a piecewise rational function of ω with quadratics in numerator and denominator

⇒ can solve analytically

2-dimensional additive spline example (k = 3)

[Figure, four panels: Surface; 15 steps; 40 steps; 65 steps.]

Boston and California housing (k = 3)

[Figure, two panels against Iterations; curves: linear, quadratic, spline.]

Summary

• ℓ1 regularization generalizes elegantly to infinite dimensional embeddings through generalization of norm to measure

• Statistical/mathematical properties:

– Existence

– Sparsity

– Testability

• We can design and implement a path following algorithm

– Practical applicability hinges on feature search problem

• We can practically implement in spline bases

– Optimally solves a total-variation penalized non-parametric regression problem

Critical open issues

• What can we say about learning performance? Which embeddings are good?

• Characterize in general feature spaces where we can solve the feature search problem
