
ESL Chap3 — Linear Methods for Regression Trevor Hastie

Linear Methods for Regression

Outline

• The simple linear regression model

• Multiple linear regression

• Model selection and shrinkage—the state of the art

Preliminaries

Data: (x1, y1), . . . , (xN, yN).

xi is the predictor (regressor, covariate, feature, independent variable)

yi is the response (dependent variable, outcome)

We denote the regression function by

η(x) = E (Y |x)

This is the conditional expectation of Y given x.

The linear regression model assumes a specific linear form for η:

η(x) = α + βx

which is usually thought of as an approximation to the truth.


Fitting by least squares

Minimize:

(β̂0, β̂) = argmin_{β0, β} Σ_{i=1}^N (y_i − β0 − β x_i)²

Solutions are

β̂ = Σ_{i=1}^N (x_i − x̄) y_i / Σ_{i=1}^N (x_i − x̄)²

β̂0 = ȳ − β̂ x̄

ŷ_i = β̂0 + β̂ x_i are called the fitted or predicted values

r_i = y_i − β̂0 − β̂ x_i are called the residuals
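
These formulas translate directly into code. Below is a minimal numpy sketch on simulated data (my own illustration, not part of the slides; the variable names and the simulated example are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=N)   # simulated data: alpha = 2, beta = 3

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * y) / np.sum((x - xbar) ** 2)   # slope estimate
beta0_hat = ybar - beta_hat * xbar                            # intercept estimate

y_fit = beta0_hat + beta_hat * x    # fitted values
r = y - y_fit                       # residuals
```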


Figure 3.1: Linear least squares fitting with X ∈ IR². We seek the linear function of X that minimizes the sum of squared residuals from Y. (A view of linear regression in IR^(p+1).)


Standard errors & confidence intervals

We often assume further that

y_i = β0 + β x_i + ε_i

where E(ε_i) = 0 and Var(ε_i) = σ². Then

se(β̂) = [ σ² / Σ_i (x_i − x̄)² ]^(1/2)

Estimate σ² by σ̂² = Σ_i (y_i − ŷ_i)² / (N − 2).

Under the additional assumption of normality for the ε_i, a 95% confidence

interval for β is: β̂ ± 1.96 · se(β̂)



Fitted Line and Standard Errors

η̂(x) = β̂0 + β̂ x = ȳ + β̂ (x − x̄)

se[η̂(x)] = [ var(ȳ) + var(β̂)(x − x̄)² ]^(1/2)
         = [ σ²/N + σ²(x − x̄)² / Σ_i (x_i − x̄)² ]^(1/2)
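
A small self-contained sketch (again my own, on invented simulated data) computing σ̂², se(β̂), the 95% interval, and the pointwise band η̂(x) ± 2·se[η̂(x)] from the formulas above, with σ̂² plugged in for σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=N)

xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
beta_hat = np.sum((x - xbar) * y) / sxx
beta0_hat = y.mean() - beta_hat * xbar
resid = y - (beta0_hat + beta_hat * x)

sigma2_hat = np.sum(resid ** 2) / (N - 2)            # sigma^2 estimated on N - 2 df
se_beta = np.sqrt(sigma2_hat / sxx)                  # se(beta_hat)
ci = (beta_hat - 1.96 * se_beta, beta_hat + 1.96 * se_beta)   # approximate 95% CI

xgrid = np.linspace(x.min(), x.max(), 50)
eta_hat = y.mean() + beta_hat * (xgrid - xbar)                 # fitted line
se_eta = np.sqrt(sigma2_hat / N + sigma2_hat * (xgrid - xbar) ** 2 / sxx)  # pointwise se
band_lo, band_hi = eta_hat - 2 * se_eta, eta_hat + 2 * se_eta  # +/- 2 se band, as in the next figure
```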


[Figure: fitted regression line of Y on X with pointwise standard errors: η̂(x) ± 2·se[η̂(x)].]


Multiple linear regression

Model is

f(x_i) = β0 + Σ_{j=1}^p x_ij β_j

Equivalently in matrix notation:

f = Xβ

f is the N-vector of predicted values

X is the N × (p+1) matrix of regressors, with ones in the first column

β is a (p+1)-vector of parameters (including the intercept)


Estimation by least squares

β̂ = argmin_β Σ_i ( y_i − β0 − Σ_{j=1}^p x_ij β_j )²
  = argmin_β (y − Xβ)^T (y − Xβ)

Figure 3.2 shows the N-dimensional geometry.

Solution is

β̂ = (X^T X)^{−1} X^T y

ŷ = X β̂

Also Var(β̂) = (X^T X)^{−1} σ²
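
As an illustration (mine, not from the slides), the least squares solution and Var(β̂) can be computed as follows; lstsq is used instead of forming the inverse explicitly, which is the numerically safer route:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept column of ones first
beta_true = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# beta_hat = (X^T X)^{-1} X^T y; lstsq solves the same least squares problem
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

sigma2_hat = np.sum((y - y_hat) ** 2) / (N - X.shape[1])     # residual variance estimate
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)               # estimate of Var(beta_hat)
se = np.sqrt(np.diag(cov_beta))
```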


Here are some additional notes (linear.pdf) on multiple linear regression, with an emphasis on computations.


The Bias-variance tradeoff

A good measure of the quality of an estimator f(x) is the mean squared

error. Let f0(x) be the true value of f(x) at the point x. Then

MSE[f(x)] = E[f(x) − f0(x)]²

This can be written as

MSE[f(x)] = Var[f(x)] + [E f(x) − f0(x)]²

This is variance plus squared bias.

Typically, when bias is low, variance is high and vice-versa. Choosing

estimators often involves a tradeoff between bias and variance.
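
A toy Monte Carlo (my own, with invented numbers) makes the decomposition concrete: a deliberately shrunken estimator of a mean picks up a little bias but loses enough variance to have a smaller MSE:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 2.0, 4.0, 10, 20000
samples = rng.normal(mu, sigma, size=(reps, n))

est_mean = samples.mean(axis=1)      # unbiased sample mean
est_shrunk = 0.8 * est_mean          # shrunken (biased) estimator of mu

for name, est in [("sample mean", est_mean), ("shrunken mean", est_shrunk)]:
    mse = np.mean((est - mu) ** 2)
    var = est.var()
    bias2 = (est.mean() - mu) ** 2
    print(f"{name}: MSE={mse:.3f}  Var={var:.3f}  Bias^2={bias2:.3f}")   # MSE ~ Var + Bias^2
```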


• If the linear model is correct for a given problem, then the least

squares prediction f̂ is unbiased, and has the lowest variance among

all unbiased estimators that are linear functions of y

• But there can be (and often exist) biased estimators with smaller

MSE.

• Generally, by regularizing (shrinking, dampening, controlling) the

estimator in some way, its variance will be reduced; if the

corresponding increase in bias is small, this will be worthwhile.

• Examples of regularization: subset selection (forward, backward, all

subsets); ridge regression, the lasso.

• In reality models are almost never correct, so there is an additional

model bias between the closest member of the linear model class and

the truth.


Model Selection

Often we prefer a restricted estimate because of its reduced estimation variance.

Figure 7.2: Schematic of the behavior of bias and variance.

The model space is the set of all possible predictions from the

model, with the “closest fit” labeled with a black dot. The

model bias from the truth is shown, along with the variance,

indicated by the large yellow circle centered at the black dot

labelled “closest fit in population”. A shrunken or regularized fit is also shown, having additional estimation bias, but

smaller prediction error due to its decreased variance.


Analysis of time series data

Two approaches: the frequency domain (Fourier); see the discussion of wavelet

smoothing.

Time domain: the main tool is the autoregressive (AR) model of order k:

y_t = β1 y_{t−1} + β2 y_{t−2} + · · · + βk y_{t−k} + ε_t

Fit by linear least squares regression on lagged data

y_t = β1 y_{t−1} + β2 y_{t−2} + · · · + βk y_{t−k}

y_{t−1} = β1 y_{t−2} + β2 y_{t−3} + · · · + βk y_{t−k−1}

⋮

y_{k+1} = β1 y_k + β2 y_{k−1} + · · · + βk y_1
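
One way to set up this lagged regression in code (a sketch under my own conventions: no intercept, matching the display above; all names are invented):

```python
import numpy as np

def fit_ar(y, k):
    """Least squares fit of an AR(k) model: regress y_t on y_{t-1}, ..., y_{t-k}."""
    N = len(y)
    X = np.column_stack([y[k - j - 1 : N - j - 1] for j in range(k)])  # lag-1 ... lag-k columns
    beta, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    return beta

rng = np.random.default_rng(4)
e = rng.normal(size=500)
y = np.zeros(500)
for t in range(2, 500):                    # simulate an AR(2) series
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + e[t]

print(fit_ar(y, 2))                        # estimates should be near 0.6 and -0.3
```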


Example: NYSE data

Time series of 6200 daily measurements, 1962-1987

volume — log(trading volume) — the outcome

volume.Lj — log(trading volume) lagged j days, j = 1, 2, 3

retd.Lj — Δ log(Dow Jones) lagged j days, j = 1, 2, 3

aretd.Lj — |Δ log(Dow Jones)| lagged j days, j = 1, 2, 3

vola.Lj — volatility lagged j days, j = 1, 2, 3

Source: Weigend and LeBaron (1994)

We randomly selected a training set of size 50 and a test set of size 500, from the

first 600 observations.

NYSE data

[Figure: scatterplot matrix of the NYSE variables: volume, volume.L1–L3, retd.L1–L3, aretd.L1–L3, vola.L1–L3.]


OLS Fit

Results of ordinary least squares analysis of NYSE data

Term Coefficient Std. Error t-Statistic

Intercept -0.02 0.04 -0.64

volume.L1 0.09 0.05 1.80

volume.L2 0.06 0.05 1.19

volume.L3 0.04 0.05 0.81

retd.L1 0.00 0.04 0.11

retd.L2 -0.02 0.05 -0.46

retd.L3 -0.03 0.04 -0.65

aretd.L1 0.08 0.07 1.12

aretd.L2 -0.02 0.05 -0.45

aretd.L3 0.03 0.04 0.77

vola.L1 0.20 0.30 0.66

vola.L2 -0.50 0.40 -1.25

vola.L3 0.27 0.34 0.78


Variable subset selection

We retain only a subset of the coefficients and set the rest to zero.

There are different strategies:

• All subsets regression finds, for each s ∈ {0, 1, 2, . . . , p}, the subset of

size s that gives the smallest residual sum of squares. The question of

how to choose s involves the tradeoff between bias and variance: one can

use cross-validation (see below)

• Rather than search through all possible subsets, we can seek a good

path through them. Forward stepwise selection starts with the

intercept and then sequentially adds into the model the variable that

most improves the fit. The improvement in fit is usually based on the


F ratio

F = [RSS(β_old) − RSS(β_new)] / [RSS(β_new) / (N − s)]

• Backward stepwise selection starts with the full OLS model, and

sequentially deletes variables. (A code sketch of the forward variant appears after this list.)

• There are also hybrid stepwise selection strategies which add in the

best variable and delete the least important variable, in a sequential

manner.

• Each procedure has one or more tuning parameters:

– subset size

– P-values for adding or dropping terms
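
Here is a rough sketch of forward stepwise selection (my own simplification, not the slides' exact procedure: it greedily adds the variable that most reduces the residual sum of squares rather than computing the F ratio explicitly):

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the least squares fit on an intercept plus the given columns."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

def forward_stepwise(X, y, max_vars):
    """Start with the intercept; at each step add the variable giving the biggest drop in RSS."""
    active, remaining, path = [], list(range(X.shape[1])), []
    for _ in range(max_vars):
        best = min(remaining, key=lambda j: rss(X, y, active + [j]))
        active.append(best)
        remaining.remove(best)
        path.append((list(active), rss(X, y, active)))
    return path

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=60)
for active, r in forward_stepwise(X, y, 3):
    print(active, round(r, 2))
```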


Model Assessment

Objectives:

1. Choose a value of a tuning parameter for a technique

2. Estimate the prediction performance of a given model

For both of these purposes, the best approach is to run the procedure on

an independent test set, if one is available

If possible one should use different test data for (1) and (2) above: a

validation set for (1) and a test set for (2)

Often there is insufficient data to create a separate validation or test set. In

this instance cross-validation is useful.


K-Fold Cross-Validation

Primary method for estimating a tuning parameter λ (such as subset size).

Divide the data into K roughly equal parts (typically K = 5 or 10).

[Diagram: the data divided into K = 5 parts numbered 1–5; each part in turn serves as the Test fold while the remaining K − 1 parts are used to Train.]

• For each k = 1, 2, . . . , K, fit the model with parameter λ to the other K − 1

parts, giving β̂^{−k}(λ), and compute its error in predicting the kth part:

E_k(λ) = Σ_{i ∈ kth part} (y_i − x_i β̂^{−k}(λ))².

This gives the cross-validation error

CV(λ) = (1/K) Σ_{k=1}^K E_k(λ)

• Do this for many values of λ and choose the value of λ that makes CV(λ)

smallest.
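
The procedure is easy to code. The sketch below (mine, with simulated data and invented helper names) cross-validates the subset size for all-subsets regression, matching the variable-subsets example discussed on the following slides:

```python
import numpy as np
from itertools import combinations

def design(X, cols):
    # intercept column plus the selected predictors
    return np.column_stack([np.ones(X.shape[0])] + [X[:, j] for j in cols])

def fit(X, y, cols):
    beta, *_ = np.linalg.lstsq(design(X, cols), y, rcond=None)
    return beta

def cv_subset_size(X, y, K=5, seed=0):
    """CV(lambda) with lambda = subset size; the best subset is re-found inside each training fold."""
    N, p = X.shape
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)
    cv = {}
    for lam in range(1, p + 1):
        errs = []
        for k in range(K):
            te = folds[k]
            tr = np.setdiff1d(np.arange(N), te)
            # best subset of size lam on the training part (all subsets; p is small here)
            cols = min(combinations(range(p), lam),
                       key=lambda c: np.sum((y[tr] - design(X[tr], c) @ fit(X[tr], y[tr], c)) ** 2))
            beta = fit(X[tr], y[tr], cols)
            errs.append(np.mean((y[te] - design(X[te], cols) @ beta) ** 2))
        cv[lam] = np.mean(errs)
    return cv

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.7, size=80)
cv = cv_subset_size(X, y)
print(min(cv, key=cv.get), cv)   # subset size with the smallest CV error, and the full CV curve
```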


• In our variable subsets example, λ is the subset size

• β̂^{−k}(λ) are the coefficients for the best subset of size λ, found from the

training set that leaves out the kth part of the data

• E_k(λ) is the estimated test error for this best subset.

• From the K cross-validation training sets, the K test error estimates are

averaged to give

CV(λ) = (1/K) Σ_{k=1}^K E_k(λ).

• Note that different subsets of size λ will (probably) be found from each of

the K cross-validation training sets. This doesn't matter: the focus is on subset size,

not the actual subset.


[Figure: CV error versus subset size (all subsets); the CV curve for the NYSE data.]

• The focus is on subset size, not which variables are in the model.

• Variance increases slowly, typically σ²/N per variable.


Figure 3.5: All possible subset models for the prostate cancer example. At each subset size is shown the residual sum-of-squares for each model of that size.


The Bootstrap approach

• The bootstrap works by sampling N times with replacement from the training set to

form a "bootstrap" data set. The model is then estimated on the bootstrap data set,

and predictions are made for the original training set.

• This process is repeated many times and the results are averaged.

• The bootstrap is most useful for estimating standard errors of predictions.

• One can also use modified versions of the bootstrap to estimate prediction error.

These sometimes produce better estimates than cross-validation (a topic of current

research).
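
A minimal sketch of the basic idea (my own; the data and model are invented), estimating bootstrap standard errors for the predictions at the original training points:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 60
X = np.column_stack([np.ones(N), rng.normal(size=(N, 4))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.8, size=N)

B = 500
preds = np.empty((B, N))
for b in range(B):
    idx = rng.integers(0, N, size=N)                    # sample N rows with replacement
    beta_b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    preds[b] = X @ beta_b                               # predict the original training points

se_pred = preds.std(axis=0)                             # bootstrap standard error of each prediction
```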


NYSE example continued

The table shows the coefficients from a number of different selection and shrinkage

methods applied to the NYSE data (VSS = variable subset selection, PCR = principal

components regression, PLS = partial least squares).

Term OLS VSS Ridge Lasso PCR PLS

Intercept -0.02 0.00 -0.01 -0.02 -0.02 -0.04

volume.L1 0.09 0.16 0.06 0.09 0.05 0.06

volume.L2 0.06 0.00 0.04 0.02 0.06 0.06

volume.L3 0.04 0.00 0.04 0.03 0.04 0.05

retd.L1 0.00 0.00 0.01 0.01 0.02 0.01

retd.L2 -0.02 0.00 -0.01 0.00 -0.01 -0.02

retd.L3 -0.03 0.00 -0.01 0.00 -0.02 0.00

aretd.L1 0.08 0.00 0.03 0.02 -0.02 0.00

aretd.L2 -0.02 -0.05 -0.03 -0.03 -0.01 -0.01

aretd.L3 0.03 0.00 0.01 0.00 0.02 0.01

vola.L1 0.20 0.00 0.00 0.00 -0.01 -0.01

vola.L2 -0.50 0.00 -0.01 0.00 -0.01 -0.01

vola.L3 0.27 0.00 -0.01 0.00 -0.01 -0.01

Test err 0.050 0.041 0.042 0.039 0.045 0.044

SE 0.007 0.005 0.005 0.005 0.006 0.006

CV was used on the 50 training observations (except for OLS). Test error for

the constant model: 0.061.


Estimated prediction error

curves for the various selection

and shrinkage methods. The

arrow indicates the estimated

minimizing value of the

complexity parameter. Training

sample size = 50.

[Figure: CV error curves for all subsets (vs subset size), ridge regression (vs degrees of freedom), the lasso (vs s), principal components regression (vs number of directions), and partial least squares (vs number of directions).]


Figure 3.6: Estimated prediction error curves and their standard errors for the various selection and shrinkage methods, found by 10-fold cross-validation. (Panels: All Subsets, Ridge Regression, Lasso, Principal Components Regression, Partial Least Squares.)


Shrinkage methods

Ridge regression

The ridge estimator is defined by

β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) + λ β^T β

Equivalently,

β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) subject to Σ_j β_j² ≤ s.

The parameter λ > 0 penalizes β_j in proportion to its size β_j². The solution is

β̂_λ = (X^T X + λI)^{−1} X^T y

where I is the identity matrix. This is a biased estimator that for some value of

λ > 0 may have smaller mean squared error than the least squares estimator.

Note λ = 0 gives the least squares estimator; if λ → ∞, then β̂ → 0.
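
A small sketch of the closed-form ridge solution (mine, not from the slides); I standardize the inputs and center y so the intercept is left unpenalized, a common convention that the slide does not spell out:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge slopes (X^T X + lam I)^{-1} X^T y on standardized inputs; intercept left unpenalized."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize so the penalty treats inputs equally
    yc = y - y.mean()
    p = Xs.shape[1]
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    return y.mean(), beta                        # intercept (for centered data) and slopes

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 6))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0]) + rng.normal(size=50)

for lam in [0.0, 1.0, 10.0, 100.0]:
    _, b = ridge(X, y, lam)
    print(lam, np.round(b, 2))                   # coefficients shrink toward zero as lam grows
```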


Figure 3.7: Profiles of ridge coefficients for the prostate cancer example, as tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 4.16, the value chosen by cross-validation.


The Lasso

The lasso is a shrinkage method like ridge, but acts in a nonlinear manner on the

outcome y.

The lasso is defined by

β̂_lasso = argmin_β (y − Xβ)^T (y − Xβ) subject to Σ_j |β_j| ≤ t

• Notice that the ridge penalty Σ_j β_j² is replaced by Σ_j |β_j|.

• This makes the solutions nonlinear in y, and a quadratic programming

algorithm is used to compute them.

• Because of the nature of the constraint, if t is chosen small enough then the

lasso will set some coefficients exactly to zero. Thus the lasso does a kind of

continuous model selection.


• The parameter t should be adaptively chosen to minimize an estimate of

expected prediction error, using, say, cross-validation.

• Ridge vs lasso: if the inputs are orthogonal, ridge multiplies the least squares

coefficients by a constant < 1, while the lasso translates them towards zero by a

constant, truncating at zero (see the sketch below).

[Figure: transformed coefficient versus OLS coefficient for ridge and lasso.]
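
To make the orthogonal-input comparison concrete, here is a small sketch (mine) of the two coefficient maps: ridge scales every OLS coefficient by the same factor, while the lasso soft-thresholds, setting small coefficients exactly to zero:

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    # with orthonormal inputs, ridge scales each OLS coefficient by 1 / (1 + lam)
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, gamma):
    # with orthonormal inputs, the lasso soft-thresholds: move toward 0 by gamma, truncate at 0
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - gamma, 0.0)

beta_ols = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(ridge_shrink(beta_ols, lam=1.0))     # every coefficient halved
print(lasso_shrink(beta_ols, gamma=1.0))   # small coefficients set exactly to zero
```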


Lasso in Action

Profiles of coefficients for NYSE data as lasso shrinkage is varied.

[Figure: lasso coefficient profiles for the 12 NYSE predictors, plotted against the shrinkage factor s.]

s = t/t0 ∈ [0, 1], where t0 = Σ_j |β̂_j^{OLS}|.


Shrinkage Factor s

Coe

ffici

ents

0.0 0.2 0.4 0.6 0.8 1.0

-0.2

0.0

0.2

0.4

0.6

••

•• • • • • • • • • • • • • • • • lcavol

• • • • ••

••

•• • • • • • • • • • • • • • • • lweight

• • • • • • • • • • • • • ••

• • • • • • • • • •age

• • • • • • • • • ••

••

•• • • • • • • • • • • lbph

• • • • • • ••

••

••

•• • • • • • • • • • • •svi

• • • • • • • • • • • • • • ••

••

••

••

••

• lcp

• • • • • • • • • • • • • • • • • • • • • • • • •gleason• • • • • • • • • •

••

•• • • • • • • • • •

••pgg45

Figure 3.9: Profiles of lasso coefficients, as tuning

parameter t is varied. Coefficients are plotted versus

s = t/∑p

1 |βj |. A vertical line is drawn at s = 0.5, the

value chosen by cross-validation. Compare Figure 3.7

on page 7; the lasso profiles hit zero, while those for

ridge do not.


Figure 3.12: Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.


A family of shrinkage estimators

Consider the criterion

β̂ = argmin_β Σ_{i=1}^N (y_i − x_i^T β)² subject to Σ_j |β_j|^q ≤ s

for q ≥ 0. The contours of constant value of Σ_j |β_j|^q are shown for the case of

two inputs.

Figure 3.13: Contours of constant value of Σ_j |β_j|^q for q = 4, 2, 1, 0.5, 0.1.

Thinking of |β_j|^q as the log-prior density for β_j, these are also the equi-contours

of the prior.

36

Page 39: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Use of derived input directions

Principal components regression

We choose a set of linear combinations of thexjs, and then regress the outcome

on these linear combinations.

The particular combinations used are the sequence of principal components of the

inputs. These are uncorrelated and ordered by decreasing variance.

If S is the sample covariance matrix of x1, . . . , xp, then the eigenvector equations

S qℓ = dℓ² qℓ

define the principal components of S.

37

Page 40: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Elements of Statistical Learning ©Hastie, Tibshirani & Friedman 2001 Chapter 3

[Figure: scatter of input data in the (X1, X2) plane, with the largest and smallest
principal component directions marked.]

Figure 3.8: Principal components of some input data points. The largest principal
component is the direction that maximizes the variance of the projected data, and
the smallest principal component minimizes that variance. Ridge regression projects
y onto these components, and then shrinks the coefficients of the low-variance
components more than the high-variance components.

38

Page 41: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Digression: some notes on Principal Components and the SVD (PCA.pdf)

39

Page 42: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

PCA regression continued

• Write q(j) for the ordered principal components, ordered from largest to
smallest value of dj².

• Then principal components regression computes the derived input columns
zj = Xq(j) and then regresses y on z1, z2, . . . , zJ for some J ≤ p.

• Since the zjs are orthogonal, this regression is just a sum of univariate
regressions:

ŷ^pcr = ȳ + ∑_{j=1}^J γj zj

where γj is the univariate regression coefficient of y on zj.
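A minimal numpy sketch of principal components regression as described above (my own illustration, not from the slides; it assumes X and y have been centered):

```python
import numpy as np

def pcr_fit(X, y, J):
    S = X.T @ X / X.shape[0]            # sample covariance of the (centered) inputs
    d2, Q = np.linalg.eigh(S)           # eigenvalues d_j^2 and eigenvectors q_j
    order = np.argsort(d2)[::-1][:J]    # keep the J largest-variance components
    Z = X @ Q[:, order]                 # derived inputs z_j = X q_(j)
    gamma = (Z.T @ y) / np.sum(Z**2, axis=0)   # univariate coefficients (z_j orthogonal)
    return Q[:, order] @ gamma          # implied coefficient vector on the original scale

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)); X = X - X.mean(axis=0)
y = X[:, 0] + rng.normal(size=100); y = y - y.mean()
beta_pcr = pcr_fit(X, y, J=3)
yhat = X @ beta_pcr                     # equals ybar + sum_j gamma_j z_j (ybar = 0 here)
```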

40

Page 43: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

• Principal components regression is very similar to ridge regression: both

operate on the principal components of the input matrix.

• Ridge regression shrinks the coefficients of the principal components, with

relatively more shrinkage applied to the smaller components than the larger;

principal components regression discards the p − J smallest eigenvalue
components.

Elements of Statistical Learning ©Hastie, Tibshirani & Friedman 2001 Chapter 3

[Figure: shrinkage factor versus principal component index for ridge and for
principal components regression.]

Figure 3.10: Ridge regression shrinks the regression coefficients of the principal
components, using shrinkage factors dj²/(dj² + λ) as in (3.47). Principal component
regression truncates them. Shown are the shrinkage and truncation patterns
corresponding to Figure 3.6, as a function of the principal component index.

41

Page 44: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Partial least squares

This technique also constructs a set of linear combinations of the xjs for
regression, but unlike principal components regression, it uses y (in addition to
X) for this construction.

• We assume that y is centered and begin by computing the univariate
regression coefficient γj of y on each xj

• From this we construct the derived input z1 = ∑j γj xj, which is the first
partial least squares direction.

• The outcome y is regressed on z1, giving coefficient β1, and then we
orthogonalize y, x1, . . . , xp with respect to z1: r1 = y − β1 z1, and
xℓ* = xℓ − θℓ z1

• We continue this process, until J directions have been obtained.
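A minimal numpy sketch of the construction just described (my own illustration, not from the slides; X and y are assumed centered, and with orthogonal zj's, using y or the current residual in the inner products is equivalent):

```python
import numpy as np

def pls_fit(X, y, J):
    Xk = X.copy()
    yhat = np.zeros_like(y)
    for _ in range(J):
        gamma = Xk.T @ y                   # univariate strength of each input on y
        z = Xk @ gamma                     # derived direction z = sum_j gamma_j x_j
        beta = (z @ y) / (z @ z)           # regress y on z
        yhat = yhat + beta * z
        theta = (Xk.T @ z) / (z @ z)       # orthogonalize each x_j with respect to z
        Xk = Xk - np.outer(z, theta)
    return yhat

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6)); X = X - X.mean(axis=0)
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=100); y = y - y.mean()
fitted = pls_fit(X, y, J=2)
```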

42

Page 45: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

• In this manner, partial least squares produces a sequence of derived inputs or
directions z1, z2, . . . , zJ.

• As with principal components regression, if we continue on to construct
J = p new directions we get back the ordinary least squares estimates; use of
J < p directions produces a reduced regression.

• Notice that in the construction of each zj, the inputs are weighted by the
strength of their univariate effect on y.

• It can also be shown that the sequence z1, z2, . . . , zp represents the conjugate
gradient sequence for computing the ordinary least squares solutions.

43

Page 46: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

ESL Chap3 — Linear Methods for Regression Trevor Hastie

Ridge vs PCR vs PLS vs Lasso

Recent studies have shown that ridge and PCR outperform PLS in prediction, and

they are simpler to understand.

Lasso outperforms ridge when there are a moderate number of sizable effects,

rather than many small effects. It also produces more interpretable models.

These are still topics for ongoing research.

44

Page 47: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 1

Regularized Optimization, Boosting,

and Some Connections between

Them

Saharon Rosset (IBM Research)
Collaborators: Ji Zhu (Michigan), Trevor Hastie (Stanford)

Page 48: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 2

Predictive modeling

Given n data samples (xi, yi), i = 1, . . . , n, with xiᵀ ∈ Rp,

generated independently from a data distribution:

y = f(x) + ε(x)

(f — fixed; ε — random)

We want to find a "good" model f(x) to describe the deterministic part.

Definition of "good" is typically in terms of E_X L(y, f(x)), where L depends on the
problem.

Page 49: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 3

Corporate Data Bases

Many tables, relational database.

Page 50: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 4

Motivation

Modern data (Data Mining, Machine Learning etc.) is:

• High dimensional

– By nature: micro-arrays, scientific data, customer databases

– Computational tool: data often projected into high dimensional space:

kernel methods, wavelets, boosting’s weak hypotheses, etc.

• Noisy and dirty (e.g. customer databases)

• Contains many irrelevant predictors (e.g. customer databases, micro-arrays)

Fitting models without controlling complexity results in:

• Badly over-fitted models

• Useless for prediction or interpretation

Page 51: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 5

Illustrative example

100 data points, 80 dimensional space. True model:

yi = xi1 + εi

εi ∼ N(0, 1), i.i.d.

We are fitting a linear regression model of the form:

f(x) = x · β

Page 52: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 6

Unregularized model projected to x1

Unregularized model: β = arg minβ ‖yi − xiβ‖2

[Figure: y versus x_1 with the unregularized model's fit.]

Page 53: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 7

Appropriately regularized model

We impose an l1 constraint on the model:

β = arg min‖β‖1≤1

‖yi − xiβ‖2

[Figure: y versus x_1, showing the non-regularized and the l1-regularized fits.]

Page 54: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 8

Prediction problems

• Training data (x1, y1), . . . , (xn, yn)

• Input xi ∈ Rp

• Output yi

– Regression: yi ∈ R

– Two class classification: yi ∈ 1,−1

• Wish to find a prediction model for future data

f : x ∈ Rp → R

Regression: predict f(x)

Classification: predict sign of f(x)

• Generally take f(x) = xβ (linear model)

– Can be linear in a basis expansion (kernel/wavelets etc.)

Page 55: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 9

The regularized optimization problem

β(λ) = arg min_β ∑i C(yi, xiβ) + λJ(β)

Where:

• C is a convex loss, describing the “goodness of fit” of our model to training

data

– Regression: C(y, f) = C(y − f) function of residual

– Classification: C(y, f) = C(yf) function of margin

• J(β) is a model complexity penalty.

Typically J(β) = ‖β‖qq i.e. penalize lq norm of model, q ≥ 1.

• λ ≥ 0 is a regularization parameter

– As λ→ 0, we approach non-regularized model

– As λ→∞, we get that β(λ)→ 0
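As a concrete instance of this formulation, here is a minimal sketch (my own illustration, not from the talk) that solves the squared error loss with an l1 penalty by proximal-gradient (ISTA) iterations; the 1/2 factor on the loss is only a scaling convention for λ:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_regularized_ls(X, y, lam, iters=500):
    # minimize 0.5 * ||y - X beta||^2 + lam * ||beta||_1
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1/L, L = largest eigenvalue of X'X
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ beta - y)             # gradient of the smooth loss term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 80))
y = X[:, 0] + rng.normal(size=100)
beta_hat = l1_regularized_ls(X, y, lam=20.0)    # for large enough lam, most entries are 0
```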

Page 56: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 10

Examples

• Regularized linear regression:

Squared error loss: C(y, f) = (y − f)2

– Ridge regression uses l2 penalty J(β) = ‖β‖22

– The Lasso (Tibshirani 96) uses l1 penalty J(β) = ‖β‖1

• Support Vector Machines:

Hinge loss: C(y, f) = (1− yf)+

– Standard (2-norm) SVM uses l2 penalty ‖β‖22

– 1-norm SVM uses l1 penalty ‖β‖1

Page 57: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 11

Considerations in selecting loss

β(λ) = arg min_β ∑i C(yi, xiβ) + λ‖β‖q^q

“Classical” view: loss should correspond to data log-likelihood

• Squared error loss corresponds to Gaussian errors

• Logistic regression uses binomial likelihood

Pragmatic view: need to do well on data

• Robustness considerations: sensitivity to incorrect error model

• Computational considerations: can we solve the problem efficiently

Page 58: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 12

Some loss functions for regression and

classification

[Figure: regression losses (squared error, Huber's loss) as a function of the
residual, and classification losses (exponential, logistic, hinge) as a function of
the margin.]

Page 59: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 13

Considerations in selecting penalty

β(λ) = arg min_β ∑i C(yi, xiβ) + λ‖β‖q^q

Two perspectives on penalty:

• Bayesian: prior over the model space

– reg. optimization solution is maximum posterior likelihood

• Limit model space to avoid over-fitting

Considerations in selecting penalty:

• Adequacy of penalty (implied prior)

– Sparsity considerations (l1 penalty encourages sparsity)

• Computational considerations

Page 60: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 14

l1, l2 and l∞ penalties in R2

[Figure: unit balls of the l1, l2 and l∞ penalties in R2.]

Page 61: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 15

Regularization parameter: balancing loss and

penalty

β(λ) = arg min_β ∑i C(yi, xiβ) + λ‖β‖q^q

Theoretical approaches to selecting λ:

• Bayesian: λ is “strength of prior”

• Frequentist: use loss + complexity penalty (Cp, AIC etc.)

Practical approach:

1. Solve for many (or all) values of λ.

2. Select based on cross-validation error
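A minimal sketch of this practical approach for ridge regression (my own illustration, not from the talk), where each λ has a closed-form solution and K-fold cross-validation picks the best value:

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_select_lambda(X, y, lambdas, K=5, seed=0):
    folds = np.random.default_rng(seed).permutation(len(y)) % K
    cv_err = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            tr, te = folds != k, folds == k
            beta = ridge_fit(X[tr], y[tr], lam)
            errs.append(np.mean((y[te] - X[te] @ beta) ** 2))
        cv_err.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_err))]

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))
y = X[:, 0] + rng.normal(size=80)
best_lam = cv_select_lambda(X, y, np.logspace(-3, 3, 25))
```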

Page 62: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 16

Equivalent constrained formulation

β(S) = arg min_β ∑i C(yi, xiβ)   s.t.   ‖β‖q^q ≤ S

Both formulations are equivalent when loss and penalty are convex, with the
following property:

{β(λ) : λ ∈ R} ⊆ {β(S) : S ∈ R}

Under most conditions we will consider, the two sets are actually equal.

We use both formulations interchangeably.

Page 63: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 17

Illustration: Lasso and Huberized lasso

• n = 100, p = 80.

• All xij are i.i.d N(0, 1) and the true model is:

yi = 10 · xi1 + εi

εi ∼ 0.9 · N(0, 1) + 0.1 · N(0, 100), i.i.d.

• Sparsity implies l1 penalty is appropriate

• Compare l1-regularized paths using Huber’s loss and squared error loss

Page 64: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 18

[Figure: coefficient paths for the Huberized lasso (left) and the lasso (right),
plotted against ‖β(λ)‖1.]

Squared error curves for the two solution paths

[Figure: squared error versus ‖β(λ)‖1 for the lasso and the Huberized lasso.]

Page 65: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 19

Boosting: warmup

• Introduced in the machine learning community by Freund and Schapire

(1996).

• Extremely successful in practice

• Main idea:

Iteratively build prediction model by fitting re-weighted versions of the data

– Weights emphasize badly fitted data points

– Each iteration builds a “weak” learner to model current weighted data

• Boosting can be interpreted as “coordinate descent” in high dimensional

predictor space (Mason et al 99, Friedman 2001)

Page 66: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 20

Schematic of boosting

Training sample

Weighted sample

Weighted sample

G1(x)

G2(x)

GM (x)

Final prediction model: sign(∑i αi Gi(x))

Page 67: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 21

Boosting analysis: outline

• AdaBoost and its interpretations

– Boosting as gradient descent

– Margins view of boosting

• Relation of boosting to `1-constrained optimization

• Convergence of `p-constrained optimization of classification loss functions to

an “ `p-margin” maximizing separator

• Conclusions:

– Boosting approximately corresponds to `1-constrained optimization

– Classification boosting (AdaBoost and LogitBoost) "converge" to

`1-optimal separator, compared to `2-optimal for SVM

Page 68: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 22

Schematic of Talk Structure

[Diagram: connections between Boosting, Constrained Optimization, Margins, and SVM.]

Page 69: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 23

Boosting basics

Given:

• Data {(xi, yi)}, i = 1, . . . , n, with xi ∈ Rp and yi ∈ {−1, +1}

• Convex loss criterion L(y, f)

• Dictionary H of "weak classifiers", i.e. ∀h ∈ H, h : Rp → {−1, +1}

– Example: all decision trees with up to k splits

Page 70: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 24

Boosting basics (ctd)

We want to find a "good" linear combination:

F(x) = ∑_{hj∈H} βj hj(x)

such that ∑i L(yi, F(xi)) is small.

In boosting this is done incrementally, i.e. at step T our model is:

F_T(x) = ∑_{t≤T} αt ht(x)

Page 71: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 25

AdaBoost algorithm (Freund and Schapire 1995)

1. Initialize: wi ≡ 1

2. While (improvement on test set)

(a) Look for ht = arg min_{h∈H} ∑i wi I{yi ≠ h(xi)} (minimizes weighted
misclassification error)

(b) errt = ∑i wi I{yi ≠ ht(xi)} / ∑i wi

(c) Set αt = log((1 − errt)/errt)

(d) wi ← wi · exp(αt I{yi ≠ ht(xi)})

3. Output model F(x) = ∑t αt ht(x) and classifier: sign(F(x))
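A minimal sketch of the algorithm above with decision stumps as the dictionary H (my own illustration; the stopping rule here is simply a fixed number of iterations rather than test-set improvement):

```python
import numpy as np

def stump_predict(X, j, thresh, s):
    # weak classifier h(x) = s if x_j > thresh, else -s, with values in {-1, +1}
    return np.where(X[:, j] > thresh, s, -s)

def fit_stump(X, y, w):
    best = None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for s in (-1.0, 1.0):
                err = np.sum(w * (stump_predict(X, j, thresh, s) != y))
                if best is None or err < best[0]:
                    best = (err, j, thresh, s)
    return best

def adaboost(X, y, M=20):
    w = np.ones(len(y))                                  # step 1: w_i = 1
    model = []
    for _ in range(M):
        err, j, thresh, s = fit_stump(X, y, w)           # step 2(a)
        err = np.clip(err / w.sum(), 1e-10, 1 - 1e-10)   # step 2(b)
        alpha = np.log((1 - err) / err)                  # step 2(c)
        w = w * np.exp(alpha * (stump_predict(X, j, thresh, s) != y))  # step 2(d)
        model.append((alpha, j, thresh, s))
    return model

def predict(model, X):
    F = sum(a * stump_predict(X, j, t, s) for a, j, t, s in model)
    return np.sign(F)                                    # step 3: classifier sign(F(x))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
clf = adaboost(X, y, M=30)
train_err = np.mean(predict(clf, X) != y)
```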

Page 72: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 26

AdaBoost as Gradient Descent

It has been shown that AdaBoost is “coordinate descent” with exponential loss:

L(y, Ft(x)) = exp(−yFt(x))

The criterion for selecting the next ht is to minimize

∂/∂βj ∑i L(yi, Ft(xi)),   i.e. to maximize 〈−∇L(Ft(x)), hj(x)〉

ht is the best "canonical" improvement direction, to first order

The AdaBoost αt is chosen via a line search

• We will consider αt ≡ ε — which is “stronger”, empirically better and

theoretically more tractable

Page 73: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 27

Practical importance of boosting approaches

• Computationally friendly when |H| is large:

– Does not require second derivatives and matrix inversion.

– Greedy search algorithms allow finding best direction “approximately”

– Mainly in situations where there is no explicit β at all, rather a dictionaryH

from which a “best” member is chosen every time using heuristics (e.g.

decision trees using greedy methods).

• Empirically shown to do very well

– AdaBoost (Freund and Schapire 95) and other boosting algorithms are

best “off the shelf” classifiers according to Breiman

Page 74: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 28

Other gradient-based boosting algorithms

This methodology can be applied to any function estimation problem

• Friedman, Hastie and Tibshirani (2000) use binomial log-likelihood loss:

L(y, Ft(x)) = log(1 + e−yFt(x))

• Friedman (2001) applies it to regression problems with various losses

• Rosset and Segal (NIPS 2002) apply it to density estimation with

log-likelihood criterion : L(Ft(x)) = −log(Ft(x)).

Page 75: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 29

Margin Basics

• Margin of separating hyper-plane ∑_{hj∈H} βj hj(x) = 0 is the Euclidean
distance of the closest point:

min_i  yi β′h(xi) / ‖β‖2

• Non-regularized SVM solution maximizes minimal margin

• SVM literature: large margins⇒ “small” prediction error

[Figure: separable two-class data with a separating hyper-plane.]

Page 76: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 30

Margins in Boosting

• Boosting margin of model F(x) = ∑t αt ht(x) is defined as:

min_i  yi F(xi) / ∑t |αt|  ∈ [−1, +1]

• Basis representation for finite |H|: ∑t αt ht = ∑_{hj∈H} βj hj

• ‖β‖1 = ∑j |βj| ≤ ∑t |αt|, with equality e.g. if αt ≥ 0 ∀t (monotonicity)

[Figure: separable two-class data with a separating hyper-plane.]

Page 77: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 31

The two margin definitions

Euclidean distance (SVM margin) between a data point and the "hyper-plane"
∑_{hj∈H} βj hj(x) = 0:

yi β′h(xi) / ‖β‖2

Normalized Boosting margin:

yi β′h(xi) / ‖α‖1  =  (yi β′h(xi) / ‖β‖2) · (‖β‖2 / ‖β‖1) · (‖β‖1 / ‖α‖1)

Differences:

• `1 vs `2 norm - encourages ”sparse” representations

• ‖β‖1 ≤ ‖α‖1 - sign consistency (“monotonicity”) assures equality
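A minimal sketch (my own illustration) computing the two quantities for a given coefficient vector, where H is assumed to be the n × |H| matrix of weak-learner outputs hj(xi):

```python
import numpy as np

def svm_margin(H, y, beta):
    # Euclidean (l2-normalized) margin of the closest point
    return np.min(y * (H @ beta)) / np.linalg.norm(beta, 2)

def boosting_margin(H, y, beta):
    # l1-normalized margin; lies in [-1, 1] when |h_j(x)| <= 1
    return np.min(y * (H @ beta)) / np.linalg.norm(beta, 1)

rng = np.random.default_rng(5)
H = np.sign(rng.normal(size=(50, 10)))      # +/-1 weak-learner outputs
beta = rng.exponential(size=10)             # nonnegative coefficients ("monotone" case)
y = np.sign(H @ beta)                       # separable by construction
print(svm_margin(H, y, beta), boosting_margin(H, y, beta))
```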

Page 78: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 32

Boosting as a margin-maximizing process

Boosting the Margin - (Schapire et al. 1998, Annals):

• Prove that “weak learnability” (=separability) increases margins

• Experimentally show boosting increases margins

• Discuss geometric interpretation

• Generalization error bounds for finite basis, infinite basis, as function of

margin distribution, e.g.: with probability ≥ 1 − δ,

P_Te(yF ≤ 0) ≤ P_Tr(yF ≤ θ) + O(n^{−.5} (log|H|)^{.5} θ^{−1} log(δ)^{−.5})

Plenty of other papers about boosting and margins

Page 79: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 33

Advantages(?) of margins view

• Explains behavior of Adaboost in separable case:

– Seeks to maximize minimal margin, consequently finds a “good”

separating hyper-plane - similar to SVM

– Loss criterion view does not give such intuitions:

any separating hyper-plane, scaled up, drives exponential loss to 0.

• Generalization error bounds as function of minimal margin:

– Breiman (97) directly maximized margins, attained bad generalization

performance

– That’s not surprising, since margin maximization is clearly overfitting

Page 80: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 34

What we have learned so far

[Diagram: connections between Boosting, Constrained Optimization, Margins, and SVM.]

Page 81: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 35

Next steps

[Diagram: connections between Boosting, Constrained Optimization, Margins, and SVM.]

Page 82: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 36

Constrained (regularized) optimization

We want to find β(c) which achieves:

min_{‖β‖1≤c} ∑i L(yi, β′h(xi))

i.e. the optimal solution with `1 norm c.

What is the relation of this solution to the ε-boosting solution with `1 norm c (i.e.

after c/ε iterations)?

Page 83: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 37

Relation of boosting to regularized optimization

Consider the local "monotone" optimization problem:

min L(β)
s.t. ‖β‖1 − ‖β0‖1 ≤ ε
     |β| ≥ |β0| component-wise

It's easy to see:

lim_{ε→0} |(β − β0)k| / ε > 0  ⇒  k = arg max_j |∇L(β0)j| = arg max 〈−∇L(Ft(x)), h(x)〉

k is unique "almost everywhere" in our space, so we are choosing the direction of
the best monotone path.

We may conjecture that if this "monotonicity" holds on the optimal path, then
ε-boosting converges to the optimal regularized path.

Page 84: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 38

ε-Boosting and `1 constrained fitting

For squared error loss regression (from Efron et al. 2002):

Lasso: β(c) = arg min_{‖β‖1≤c} ‖y − Xβ‖2²

"Stagewise": the ε-boosting coefficients

[Figure: Lasso coefficient paths (left) and Stagewise (ε-boosting) coefficient paths
(right), plotted against t = ∑j |β̂j|.]
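A minimal sketch of the ε-boosting ("incremental forward stagewise") procedure for squared error loss (my own illustration, not the LARS code of Efron et al.):

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, steps=5000):
    beta = np.zeros(X.shape[1])
    path = [beta.copy()]
    for _ in range(steps):
        r = y - X @ beta
        corr = X.T @ r                       # negative gradient coordinates
        j = np.argmax(np.abs(corr))          # best "canonical" direction
        beta[j] += eps * np.sign(corr[j])    # tiny step in that coordinate
        path.append(beta.copy())
    return np.array(path)                    # rows trace out the stagewise path

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10)); X = X - X.mean(axis=0)
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100); y = y - y.mean()
path = forward_stagewise(X, y)               # compare with a lasso path over ||beta||_1
```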

Page 85: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 39

What about other loss functions?

For classification with binomial log-likelihood loss:

`1 constrained solutions (left), ε-boosting path (right)

[Figure: exact `1-constrained solution paths (left) and ε-stagewise paths (right);
coefficient values plotted against ||β||1.]

Page 86: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 40

Partial theoretical results

Denote:

β(c) = arg min_{‖β‖1≤c} ∑i L(yi, β′h(xi))

β(ε)(c) is the ε-boosting coefficient vector for `1 norm c.

Theorem 1: If β(c) is strongly monotone in all coordinates ∀c < c0, then
lim_{ε→0} β(ε)(c0) = β(c0)

• Much stronger condition on derivatives along the optimal path

We also have a "local" result:

Theorem 2: Under monotonicity only, if we denote by γ(ε) the ε-stagewise
"direction" starting from β(c0), then:

lim_{ε→0} γ(ε) = dβ(c)/dc |_{c=c0}

• (Efron et al 02) proved for squared error loss, we generalized to any convex

loss

Page 87: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 41

`p constrained classification losses

Consider the constrained optimization problem:

β(p)(c) = arg min_{‖β‖p≤c} ∑i L(yi, β′h(xi))

With the loss being either the exponential or the log-likelihood:

Le(y, β′h(x)) = exp(−y β′h(x))

Ll(y, β′h(x)) = log(1 + exp(−y β′h(x)))

Page 88: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 42

Convergence to “ `p- optimal” separating hyper-plane

Define:

β(p) = lim_{c→∞} β(p)(c) / c

Theorem 3: If the data is separable, then with either Le or Ll,

β(p) = arg max_{‖β‖p=1} min_i yi β′h(xi)

Interpretation: the normalized constrained optimizer “converges” to an “`p-margin

maximizing” separating hyper-plane

Page 89: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 43

Boosting interpretation

We can conclude that ε-boosting tends to converge to the `1-margin maximizing

separating hyper-plane

[Figure: minimal margin (left) and test error (right) as a function of ||β||1, for
exponential loss, logistic loss, and AdaBoost.]

Page 90: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 44

Boosting and support vector machines

In the separable case:

• SVM non-regularized solution is β(2)

• Boosting non-regularized solution is β(1)

• Differences:

– Boosting margin vs. SVM margin (Euclidean distance)

– Different loss functions⇒ different regularized paths

• “`2 ε-boosting” follows a different regularized path to “SVM” solution

– Choose coefficient to change according to maxh−∇L(Ft(X))′h(X)

βt,h

In non-separable case even the non-regularized solutions would be different

Page 91: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 45

Simple data example

Same example as before with additional large mass (20 observations) at “far”

point

[Figure: the experiment data, the two-class example with an additional mass of 20
observations at a distant point.]

Page 92: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 46

Convergence of `1 and `2 boosting paths to optimal

separator

[Figure: normalized L1-boosting coefficients versus ||β||1 (left) and normalized
L2-boosting coefficients versus ||β||2 (right), compared with the optimal (and SVM)
coefficient values for the two variables.]

Page 93: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 47

More interesting example: Boosting vs. `2 boosting

Boosting `2 boosting

[Figure: decision boundaries for boosting after 10^5 and 3·10^6 iterations (left)
and for `2 boosting after 5·10^6 and 10^8 iterations (right), compared with the
optimal separator.]

Page 94: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

Regularization and Boosting: CIMAT, Jan. 2007 Saharon Rosset 48

Summary

• Boosting related to `1-constrained fitting

– Can define `p boosting algorithms to correspond to `p constraints

• `p constrained classification loss solutions converge to “`p-margin”

maximizers in separable case

– Has implication on understanding of logistic regression

• A common thread for boosting and SVM:

Computational trick for regularized fitting in high dimensional predictor

spaces

– SVM: kernel trick (`2 regularization)

– Boosting: coordinate descent (approximate `1 regularization)

Page 95: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 1

`1 regularization: properties and

computations

Saharon Rosset (IBM Research)
Collaborators: Ji Zhu (Michigan), Trevor Hastie, Rob Tibshirani (Stanford), Nathan
Srebro (TTI), Grzegorz Swirszcz (IBM Research)

Page 96: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 2

Results on `1 regularization

• Sparsity

• Piecewise linearity

• Applicability in very high or infinite dimensional embedding

spaces

Page 97: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 3

The regularized solution path

Fixing the loss, penalty and data, and varying the regularization parameter we get
the "path of solutions"

β(λ) , 0 ≤ λ <∞

This is a 1-dim curve through Rp.

• Interesting statistically, as the set of solutions to problems of

interest (Bayesian interpretation: changing prior variance)

• Often interesting computationally, as it has properties which

allow efficient “tracking” of this path

Page 98: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 4

Example: Lasso solution path in R10

[Figure: lasso coefficient paths in R10, coefficient values plotted against
t = ∑j |β̂j|.]

(from Efron et al. (2004). Least Angle Regression. Annals of Statistics)

Page 99: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 5

Sparseness propert(ies) of `1

regularized path

Page 100: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 6

`1, `2 and `∞ penalties in R2

[Figure: unit balls of the `1, `2 and `∞ penalties in R2.]

Page 101: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 7

Sparseness of `1 penalty: n > p

Shape of the `1 penalty implies sparseness. For large values of λ only few non-zero
coefficients.

[Figure: lasso coefficient paths; for large λ (small t = ∑j |β̂j|) only a few
coefficients are non-zero.]

Page 102: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 8

Sparseness: p > n

For any convex loss, assuming only "non-redundancy":

Theorem (e.g., Rosset et al. 2004)
Any `1 regularized solution has at most n non-zero components

Proof: Simple application of Caratheodory's Convex Hull Theorem.

Corollary
The limiting interpolating (or margin maximizing) solution also has at most n
non-zero components

Page 103: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 9

Some implications of sparseness

• Variable selection (obviously)

• `1-regularized problems are “easier” than, say, `2-regularized

ones

– Can give good solutions in p >> n situations

See:

Friedman, Hastie, Rosset, Tibshirani, Zhu (2004). Discussion

of three boosting papers. Annals of Statistics

Ng (2004). Feature selection, `1 vs `2 regularization and

rotational invariance. ICML-04

Page 104: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 10

Piecewise Linear Solution Paths

Page 105: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 11

Piecewise linear property

We want to identify situations where the path of solutions β(λ) , 0 ≤ λ <∞

is easy to generate.

One such situation is when β(λ) is piecewise linear in λ.

[Figure: a piecewise linear coefficient path; coefficient values plotted against ‖β‖1.]

Page 106: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 12

Primary example: the lasso

(Efron et al 03), (Osborne et al 00) show that for the lasso:

β(λ) = arg min_β ∑i (yi − xiβ)² + λ‖β‖1

β(λ) is piecewise linear in λ.

• Yields efficient algorithm for finding β(λ) , 0 ≤ λ <∞

– Cost is “approximately” one least-squares calculation

Page 107: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 13

Some properties of the Lasso regularized path

1. Sparsity: if p > n, any regularized solution β(λ) has at most n non-0

coefficients (property of `1 penalty)

2. High correlation:

β(λ)j ≠ 0  ⇒  |∂C(β)/∂βj |_{β=β(λ)}| = |xjᵀ(y − Xβ(λ))| = λ

3. Compactness: Number of “pieces” in the path is approximately min(n, p).

[Figure: a piecewise linear lasso coefficient path; coefficient values plotted against ‖β‖1.]

Page 108: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 14

Our key questions:

• What is the fundamental property of (loss, penalty) pairs which

yields piecewise linearity?

• Are there efficient algorithms to generate these regularized

paths?

• Are there statistically interesting members in these families?

Page 109: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 15

What makes paths piecewise linear?

Assume loss and penalty are both twice differentiable everywhere.

With some algebra we get:

∂β(λ)/∂λ = −(∇²C(β(λ)) + λ∇²J(β(λ)))⁻¹ ∇J(β(λ))

We want this derivative to be constant, thus:

A sufficient condition for piecewise linearity is that:

• The loss C is piecewise quadratic

• The penalty J is piecewise linear

Page 110: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 16

Building blocks for PWL regularized optimization

problems

Piecewise quadratic loss:

• Squared error loss: regression: (y − f)², classification: (1 − yf)²

• Huberized squared error loss (robust; see the sketch after this list):

  C(y, xβ) = (y − xβ)²                     if |y − xβ| ≤ m
             m² + 2m(|y − xβ| − m)         otherwise

• Piecewise linear loss: regression: |y − f|, classification: (1 − yf)+

Piecewise linear penalty:

• `1 penalty: J(β) = ∑j |βj| (gives sparse solutions)

• `∞ penalty: J(β) = maxj |βj |
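A minimal sketch of the Huberized squared error loss and the `1 penalty building blocks (my own illustration):

```python
import numpy as np

def huberized_squared_error(y, f, m=1.0):
    # quadratic for |y - f| <= m, linear beyond: piecewise quadratic, once differentiable
    r = np.abs(y - f)
    return np.where(r <= m, r**2, m**2 + 2 * m * (r - m))

def l1_penalty(beta):
    # piecewise linear penalty that gives sparse solutions
    return np.sum(np.abs(beta))

residuals = np.linspace(-3, 3, 7)
print(huberized_squared_error(residuals, 0.0, m=1.0))
```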

Page 111: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 17

Some Interesting Examples

Page 112: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 18

Regression: the Huberized lasso vs. the lasso

0 20 40 60 80

−5

05

10

0 50 100 150 200 250

−5

05

10

‖β(λ)‖1‖β(λ)‖1

ββ

Page 113: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 19

Squared error loss with `∞ penalty

0 100 200 300 400 500 600 700 800−800

−600

−400

−200

0

200

400

600

800

||β||∞

β

Page 114: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 20

Classification: 1-norm and 2-norm Support Vector

Machines

0.0 0.4 0.8 1.2

0.0

0.2

0.4

0.6

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

ββ

‖β‖1 ‖β‖22

1-norm SVM 2-norm SVM

Page 115: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 21

Multiple penalty problem: Protein Mass

Spectroscopy

(Tibshirani et al, in preparation)

• Predictors are "expression levels" along a spectrum of masses for proteins.

• Want to constrain model while keeping coefficients “smooth”.

• Solution: `1 penalty on coefficients, `1 penalty on successive differences:

β(λ1, λ2) = arg min_β ∑i (yi − xiβ)² + λ1‖β‖1 + λ2 ∑j |βj − βj−1|

• Solution path is piecewise affine in (λ1, λ2)

Page 116: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 22

Almost quadratic loss with `1

penalty

Page 117: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 23

Almost quadratic loss

We define almost quadratic loss as:

C(r) = a2(r) · r2 + a1(r) · r + a0(r)

Where:

• a2, a1, a0 : R → R are piecewise constant functions

• C(r) is (once) differentiable everywhere

• r = (y − xβ) the residual for regression

• r = yxβ the margin for classification

Page 118: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 24

Motivation for this family

• Piecewise linear solution paths

• `1 penalty⇒ sparse solutions

• Allows efficient, relatively simple algorithm

• Includes robust loss functions for regression and classification

Page 119: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 25

Algorithm

• Initialize: β = 0, A = arg maxj |(∇L(β))j|, γ = −sgn(∇L(β))A

• While (max |∇L(β)| > 0)

– d1 = min{d > 0 : |∇L(β + dγ)j| = |∇L(β + dγ)A| for some j ∉ A}

– d2 = min{d > 0 : (β + dγ)j = 0 for some j ∈ A} (hit 0)

– d3 = min{d > 0 : r(yi, xiβ + dγ) hits a "knot" for some i}

– set d = min(d1, d2, d3)

– If d = d1 then add the variable attaining equality at d to A.

– If d = d2 then remove the variable attaining 0 at d from A.

– β ← β + dγ

– B = ∑i a2(r(yi, xiβ)) xiAᵀ xiA

– γ = B⁻¹(−sgn(∇L(β))A)

Page 120: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 26

Loss functions of interest: robust, differentiable

Linear for outliers, squared around “corners”:

• Regression: Huberized squared error loss

• Classification: Huberized squared hinge loss:

[Figure: hinge loss (SVM) and Huberized squared hinge loss (almost quadratic) as
functions of the margin.]

Page 121: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 27

Computational complexity

Calculations in each step of our algorithm:

• Step size: find the length of current “piece”

– O(np) calculations (for each observation, figure when it hits a “knot”)

• Direction calculation: calculate the direction of the next “piece”

– O(min(n, p)2), using Sherman-Morrison-Woodbury updating formula

Number of steps of the algorithm:

• Difficult to bound in “worst case”

• Under mild assumptions it’s O(n).

Page 122: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 28

Computational complexity (ctd.)

Overall complexity is thus O(n2p) for both n > p and n < p

Compare to least squares calculation:

• O(np2) when n > p.

• O(n3) when n < p.

Page 123: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 29

Example: "Dexter" dataset (NIPS 03 challenge)

• n = 800 observations

• p = 1152 variables

• Use Huberized squared hinge loss

• Path has 452 “pieces”

• Inefficient R implementation takes about 3 minutes to generate

path on laptop.

Page 124: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 30

Validation error and number of non-0 coefficients

[Figure: validation error (left) and number of non-zero coefficients (right) along
the regularized path, plotted against ‖β‖1.]

Page 125: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 31

Summary• Regularization is fundamental in modern data modeling

• Considerations in selecting specific formulation:

– Statistical: robustness (loss), sparsity (penalty)

– Computational: efficient computation

• Piecewise linear solution path offer solutions that are:

– Robust: select appropriate loss function

– Adaptable: select regularization parameter adaptively

– Efficient: generate whole regularized path efficiently

Page 126: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 32

`1 regularization in infinite

dimensional feature spaces

Page 127: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 33

Outline• Regularized embeddings: kernels, boosting and all that

• Generalizing `1 regularization to non-countable dimension as

measure constraint

• Properties of `1 regularized solutions in infinite dimensions:

– Existence

– Sparsity: existence of finite-support optimal solution

– Optimality criteria

• Practical, exact `1 regularization in very high dimension via path

following

• Example: additive quadratic splines

Page 128: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 34

Regularized fitting

Generic supervised learning problem, given:

• x1, ..., xn ∈ Rp (or simply X_{n×p})

• y ∈ Rn for regression, y ∈ {±1}n for classification

Find model y ≈ f(x)

Linear models set f(x) = xᵀβ and often use regularized fitting:

β = arg min_{β∈Rp} L(y, Xβ) + λJ(β)   (or, min L s.t. J ≤ C)

Where L (loss) and J (penalty) are typically convex

J(β) = ‖β‖q is a typical choice, usually q ∈ {1, 2}

E.g.: Ridge regression, LASSO, Linear SVM,...

Page 129: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 35

Data embedding

We can increase the representation power of a linear model by embedding the data
into a high dimensional space, and fitting linear models there:

x → φ(x) ∈ R^Ω (typically |Ω| >> p)

f(x) = φ(x)ᵀβ

where Ω is the index set of the features in the high dimensional space

Simple example: p = 2 (+ intercept/bias), Ω is the set of degree-2 polynomials

x = (1, x1, x2)

φ(x) = (1, x1, x2, x1², x2², x1x2)
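A minimal sketch of this degree-2 embedding (my own illustration):

```python
import numpy as np

def phi(x):
    # x = (1, x1, x2)  ->  (1, x1, x2, x1^2, x2^2, x1*x2)
    _, x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

x = np.array([1.0, 0.5, -2.0])
beta = np.ones(6)
f_x = phi(x) @ beta        # linear model in the embedded space: f(x) = phi(x)' beta
```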

Page 130: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 36

Examples of embedding-based methods

• Kernel methods: φ often not explicitly defined but implicitly, through the inner
product kernel: K(x, y) = 〈φ(x), φ(y)〉.
Ω usually infinite.

• Wavelets: φ(x) is the wavelet basis values at x.

• Boosted trees: φ(x) is the set of all trees of a certain size, evaluated at x.
Ω can be made finite.

• Spline dictionary: with x ∈ [0, 1], Ω = [0, 1] and
φ(x) = {(x − a)^k_+ : a ∈ Ω}. Infinite (non-countable) dictionary.

Page 131: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 37

Embedding+regularization: kernel methods, boosting

Some of the most successful "modern" methods seem to rely on the right combination
of embedding and regularization:

• Kernel methods: implicit embedding into RKHS + exact `2

regularization + representer theorem

⇒ computational and statistical success

• Boosting: embedding into space of trees + (very) approximate `1

regularization + incremental implementation

⇒ computational and statistical success

What about exact `1 regularization in embeddings?

Page 132: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 38

`1 or `2 regularization?

Good question! Detailed discussion is outside our scope...

Easy answer (as always): be Bayesian

One important aspect is the sparsity property of `1 regularization:

Sparsity property

If |Ω| > n finite, then any `1 regularized problem has a

solution β containing at most n non-zero entries.

Does this still hold when Ω is infinite?

Page 133: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 39

Generalizing `1 regularization

We start from:

min_β ∑i L(yi, φ(xi)ᵀβ)   s.t.   ‖β‖1 ≤ C

By doubling the number of variables, βj = βj,+ − βj,−, and adding positivity
constraints, we can replace the norm by a sum:

min_β ∑i L(yi, φ(xi)ᵀ(β+ − β−))   s.t.   ∑j (βj,+ + βj,−) ≤ C,   β+, β− ≥ 0

Now we replace the sum by a positive measure:

min_{P∈P} ∑i L(yi, ∫_Ω φω(xi) dP(ω))   s.t.   P(Ω) ≤ C

Page 134: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 40

Understanding our generalization

A probability measure requires a probability space, hence a σ-algebra Σ over Ω.

We require {{ω} : ω ∈ Ω} ⊂ Σ

• If Ω is finite or countable this implies Σ = 2^Ω and hence
P(Ω) = ‖β‖1 as required

• In the non-countable case this still works!

Page 135: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 41

When does an optimal solution exist?

Theorem
If the set {φω(X) : ω ∈ Ω} ⊂ Rn is compact, then our problem has an optimal
solution

Corollary

If the set Ω is compact and the mapping φ.(X) : Ω→ Rn is

continuous, then our problem has an optimal solution.

Bottom line: under mild conditions, an optimal solution is

guaranteed to exist.

Page 136: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 42

The sparsity property in infinite dimension

Theorem:

Assume an optimal solution exists, then there exists an optimal

solution P (C) supported on at most n + 1 features in Ω.

Main idea of proof:

- Consider A = {φω(X) : ω ∈ Ω} ⊂ Rn

- Show that any z ∈ co(A) (convex hull) can be represented as

convex combination of n + 1 points

(for finite Ω this is just Caratheodory’s convex hull theorem)

⇒ any infinite-support measure can be approximated by one

supported on n + 1 features

Page 137: ESL Chap3 — Linear Methods for Regression Trevor Hastiehorebeek/epe/rosset2.pdf · ESL Chap3 — Linear Methods for Regression Trevor Hastie • If the linear model is correct for

`1 regularization: CIMAT, Jan. 2007 Saharon Rosset 43

Optimality criterion

Suppose we are presented with a finite-support solution P(C).

How can we verify it is optimal?

Answer: we only need to verify it is optimal in any finite feature set

containing its support

Theorem

If an optimal solution to the regularized problem exists, and we are

presented with a finite-support candidate solution P supported on

A = ω1, ..., ωk with k ≤ n + 1 then:

P is optimal solution⇔ ∀B ⊂ Ω s.t. A ⊆ B, |B| <∞, P is

optimal solution for the problem in PB


Summary of mathematical/statistical properties we prove

• Under a boundedness + continuity condition, an optimal solution exists
• There is always a sparse optimal solution with at most n + 1 features
• Given a finitely supported solution, we can test its optimality by considering only finite problems on supersets of its support

Now, can we actually find the solution?


Path following algorithms

Some regularized problems can take advantage of viewing the solution set {β(λ) : λ ∈ ℝ} as a path in ℝ^{|Ω|} and following it efficiently:

• Lasso (quadratic loss + ℓ1 penalty): LARS-Lasso of Efron et al. (2004) (also earlier work from Osborne et al.)
• SVM by Hastie et al. (2004), LP-SVM by Zhu et al. (2004)
• etc.


Lasso and LARS

Lasso:

    β(λ) = argmin_β ‖y − Xβ‖²_2 + λ‖β‖_1,   with X ∈ ℝ^{n×p}, y ∈ ℝ^n, β ∈ ℝ^p.

LARS-Lasso (Efron et al. 2004) is a homotopy algorithm that generates the path β(λ) for all λ efficiently. Algebraically, we can derive LARS-Lasso from the KKT conditions:

    β(λ)_j ≠ 0 ⇒ |X_j^T (y − Xβ(λ))| = λ    (1)
    β(λ)_j = 0 ⇐ |X_j^T (y − Xβ(λ))| < λ    (2)
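As a concrete check of (1)–(2), here is a small numpy sketch (not the LARS-Lasso algorithm itself, and not from the talk) that fits the lasso by coordinate descent and then verifies the KKT conditions; it assumes the (1/2)·squared-loss scaling so that the conditions hold with λ exactly, and the helper names are purely illustrative:

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_cd(X, y, lam, n_sweeps=500):
        """Coordinate descent for (1/2)*||y - X b||^2 + lam*||b||_1."""
        n, p = X.shape
        beta = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0)
        for _ in range(n_sweeps):
            for j in range(p):
                r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual
                beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
        return beta

    def kkt_ok(X, y, beta, lam, tol=1e-4):
        """Check |X_j^T (y - X beta)| = lam on the active set, <= lam off it."""
        g = X.T @ (y - X @ beta)
        active = np.abs(beta) > 1e-8
        return bool(np.all(np.abs(np.abs(g[active]) - lam) < tol) and
                    np.all(np.abs(g[~active]) <= lam + tol))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(50)
    beta_hat = lasso_cd(X, y, lam=5.0)
    print(beta_hat.round(3), kkt_ok(X, y, beta_hat, lam=5.0))

The same KKT check is what the optimality criterion above reduces to on a finite support set.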


Schematic of LARS-Lasso

1. Preliminaries
2. Loop:
   (a) Find the next variable to add to the active set A:
       d_add, the step size at which a variable not in A attains equality in (1)
   (b) Find the next variable to remove from the active set:
       d_rem, the step size at which a coefficient in the active set hits 0
   (c) Make a step of min(d_add, d_rem) and modify the active set accordingly
   (d) Calculate the new LARS direction:
       γ = −(X_A^T X_A)^{−1} sgn(X_A^T (y − Xβ(λ)))


Can we do LARS-Lasso in infinite dimensional embeddings?

Going back to the schematic of LARS-Lasso: only the computation of d_add requires considering the high (here infinite) dimension.

Therefore, if:

1. We have sparsity ✓
2. We can search over Ω for the next feature efficiently

⇒ we can apply LARS-Lasso and find the full path (optimality guaranteed by our criterion)


Search problem for LASSO

Formally:

    d_add = min { d > 0 : ∃ ω ∉ A,  −φ_ω(X)^T (y − φ_A(X)β(λ_0) − d·φ_A(X)γ_A) = λ_0 − d }

We can re-write it as d_add = min_{ω ∈ Ω−A} d(ω), where d(ω) is the value attaining equality for the dictionary function indexed by ω. Specifically we get:

    d(ω) = ( φ_ω(X)^T r + λ_0 ) / ( φ_ω(X)^T φ_A(X) γ_A + 1 )
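The formula is easy to turn into a search routine once a candidate set is fixed. The sketch below is illustrative only: the residual r, the current λ_0, and the vector φ_A(X)γ_A are placeholders standing in for the current LARS-Lasso state, and a grid search over knots stands in for the exact minimization discussed later. It evaluates d(ω) for piecewise-quadratic candidate features φ_ω(x) = (x − ω)²_+ and returns the smallest positive step:

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0.0, 1.0, 30))   # data points in [0, 1]
    r = 0.5 * rng.standard_normal(30)        # placeholder: current residual y - phi_A(X) beta(lambda_0)
    u = 0.1 * rng.standard_normal(30)        # placeholder: phi_A(X) @ gamma_A
    lam0 = 1.0                               # placeholder: current value of lambda

    def phi(omega):
        """Candidate feature phi_omega(X) for a knot omega (k = 3 truncated power)."""
        return np.maximum(x - omega, 0.0) ** 2

    def d_of_omega(omega):
        """Step size at which candidate omega attains equality in the KKT condition."""
        f = phi(omega)
        return (f @ r + lam0) / (f @ u + 1.0)

    candidates = np.linspace(0.01, 1.0, 200)
    d_vals = np.array([d_of_omega(w) for w in candidates])
    pos = d_vals > 0                         # only positive step sizes are admissible
    w_add = candidates[pos][np.argmin(d_vals[pos])]
    print("next knot:", round(w_add, 3), "d_add:", round(float(d_vals[pos].min()), 3))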


Spline bases

Assume our data points x_i are in [0, 1].

A polynomial spline of order k is a piecewise polynomial of degree k − 1 with k − 2 continuous derivatives. E.g., a second order spline is a continuous piecewise linear function.

Dictionary for a kth order spline:

    Φ_k = { 1, x, ..., x^{k−2}, x^{k−1} } ∪ { (x − a)^{k−1}_+ : a ∈ (0, 1] }
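A minimal numpy sketch of this dictionary evaluated at the data points (the function name and the fixed knot grid are illustrative; in the path algorithm the knot set is not fixed in advance):

    import numpy as np

    def spline_dictionary(x, k, knots):
        """Order-k truncated power dictionary at the points x:
        monomials 1, x, ..., x^(k-1), plus (x - a)_+^(k-1) for each knot a."""
        monomials = [x ** m for m in range(k)]
        truncated = [np.maximum(x - a, 0.0) ** (k - 1) for a in knots]
        return np.column_stack(monomials + truncated)

    x = np.linspace(0.0, 1.0, 11)
    Phi = spline_dictionary(x, k=3, knots=[0.25, 0.5, 0.75])   # piecewise quadratic
    print(Phi.shape)                                           # (11, 6)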


Total-variation penalties and regularized splines

Start from the general nonparametric problem with x ∈ ℝ:

    f̂ = argmin_{f ∈ C^{(k−1)}}  Σ_i (y_i − f(x_i))² + λ · TV(f^{(k−1)})

Most general result:

Theorem (e.g. Mammen & van de Geer 97)
The optimal solution f̂ can be represented as a k-th order spline with at most n knots.

Since roughly TV(f^{(k−1)}) = (k − 1)! · P(Ω), our results prove this theorem in one line!
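A quick numeric illustration of the identity TV(f^{(k−1)}) = (k − 1)! · P(Ω) for k = 3, with made-up coefficients; here P(Ω) corresponds to Σ_j |b_j|, the total mass placed on the truncated-power features:

    import math
    import numpy as np

    knots = np.array([0.2, 0.5, 0.8])
    b = np.array([1.5, -2.0, 0.7])        # weights on the truncated quadratics (x - a)_+^2
    c = np.array([0.3, -1.0, 0.5])        # coefficients of 1, x, x^2 (unpenalized part)

    x = np.linspace(0.0, 1.0, 100001)
    # Second derivative from the basis: d^2/dx^2 (x - a)_+^2 = 2 * 1[x > a]
    f2 = 2.0 * c[2] + 2.0 * (b * (x[:, None] > knots)).sum(axis=1)

    tv = np.abs(np.diff(f2)).sum()        # total variation of f'' on the grid
    print(tv, math.factorial(3 - 1) * np.abs(b).sum())   # both equal 2 * sum |b_j|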


What do we know about TV-penalized spline solutions?

• For k < 3 one can show (Mammen and van de Geer 97) that this spline has knots at the data points: an ℓ1 “representer” theorem!
• They propose efficient algorithms for solving the k ∈ {1, 2} cases; these can be rephrased as versions of LARS-Lasso with n variables (constant/linear spline basis)
• For k ≥ 3 they only offer a LARS-like approximate algorithm with knots at the data points

But if we can solve the next-feature search problem, we can apply our algorithm and get the exact solution path.


Feature search problem for the k = 3 case (piecewise quadratic)

We want to minimize over Ω:

    d(ω) = ( φ_ω(X)^T r + λ_0 ) / ( φ_ω(X)^T φ_A(X) γ_A + 1 )

This is a piecewise rational function of ω, with quadratics in the numerator and denominator
⇒ it can be solved analytically
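To illustrate why this is analytically solvable: between two consecutive data points, both the numerator and the denominator of d(ω) are fixed quadratics in ω, and the stationary points of their ratio are roots of a polynomial of degree at most two. The sketch below (with made-up quadratics, not derived from an actual LARS state) minimizes such a ratio on one interval:

    import numpy as np

    def min_ratio_on_interval(p1, p2, lo, hi):
        """Minimize q1(w)/q2(w) on [lo, hi] for quadratics given by coefficient
        arrays p1, p2 (highest degree first), by solving q1' q2 - q1 q2' = 0."""
        q1, q2 = np.poly1d(p1), np.poly1d(p2)
        stat = np.polysub(np.polymul(np.polyder(p1), p2), np.polymul(p1, np.polyder(p2)))
        cand = [lo, hi] + [r.real for r in np.atleast_1d(np.roots(stat))
                           if abs(r.imag) < 1e-12 and lo < r.real < hi]
        vals = [q1(w) / q2(w) for w in cand]
        i = int(np.argmin(vals))
        return cand[i], vals[i]

    # Made-up example: minimize (w^2 - w + 1) / (w^2 + 1) on [0, 1]
    w_star, v_star = min_ratio_on_interval([1.0, -1.0, 1.0], [1.0, 0.0, 1.0], 0.0, 1.0)
    print(w_star, v_star)

Repeating this on each inter-data interval and taking the overall minimizer gives the exact next knot, which is what makes the full-path algorithm practical for k = 3.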


2-dimensional additive spline example (k = 3)

[Figure: the target surface over (x1, x2) and the fitted additive spline surface after 15, 40, and 65 path steps.]


Boston and California housing (k = 3)

[Figure: prediction MSE versus number of path iterations for linear, quadratic, and spline (k = 3) fits, one panel each for the Boston and California housing data.]


Summary

• ℓ1 regularization generalizes elegantly to infinite dimensional embeddings through the generalization of the norm to a measure
• Statistical/mathematical properties:
  – Existence
  – Sparsity
  – Testability
• We can design and implement a path following algorithm
  – Practical applicability hinges on the feature search problem
• We can practically implement it in spline bases
  – This optimally solves a total-variation penalized non-parametric regression problem


Critical open issues

• What can we say about learning performance? Which embeddings are good?
• Characterize, in general feature spaces, where we can solve the feature search problem

