Regularization Paths
Trevor Hastie
Stanford University
drawing on collaborations with Brad Efron, Mee-Young Park, Saharon
Rosset, Rob Tibshirani, Hui Zou and Ji Zhu.
Theme
• Boosting fits a regularization path toward a max-margin classifier. Svmpath does as well.
• In neither case is this endpoint always of interest — somewhere along the path is often better.
• Having efficient algorithms for computing entire paths facilitates this selection.
• A mini industry has emerged for generating regularization paths covering a broad spectrum of statistical problems.
Adaboost Stumps for Classification
[Figure: test misclassification error against boosting iterations (0 to 1000), for an Adaboost stump and an Adaboost stump with shrinkage 0.1.]
Boosting Stumps for Regression
[Figure: MSE (squared error loss) against the number of trees (1 to 1000), for a GBM stump and a GBM stump with shrinkage 0.1.]
Least Squares Boosting
Friedman, Hastie & Tibshirani — see Elements of Statistical Learning (chapter 10)
Supervised learning: response y, predictors x = (x1, x2, . . . , xp).
1. Start with the function F(x) = 0 and residual r = y.
2. Fit a CART regression tree to r, giving f(x).
3. Set F(x) ← F(x) + εf(x), r ← r − εf(x), and repeat steps 2 and 3 many times.
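A minimal sketch of this loop in R (not the original code for these slides), using rpart stumps as the base learner; the function and data names are placeholders:

```r
library(rpart)

# Least squares boosting with stumps: a minimal sketch.
# x: data frame of predictors, y: numeric response (hypothetical names).
ls_boost <- function(x, y, n_steps = 1000, eps = 0.1) {
  r <- y                              # current residual
  trees <- vector("list", n_steps)
  for (m in seq_len(n_steps)) {
    # Step 2: fit a one-split regression tree (stump) to the residual
    fit <- rpart(r ~ ., data = data.frame(x, r = r),
                 control = rpart.control(maxdepth = 1, cp = 0))
    # Step 3: take a small step and update the residual
    r <- r - eps * predict(fit, newdata = x)
    trees[[m]] <- fit
  }
  trees                               # F(x) is eps times the sum of tree predictions
}
```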
Linear Regression
Here is a version of least squares boosting for multiple linear regression (assume the predictors are standardized):
(Incremental) Forward Stagewise
1. Start with r = y, β1 = β2 = · · · = βp = 0.
2. Find the predictor xj most correlated with r.
3. Update βj ← βj + δj, where δj = ε · sign⟨r, xj⟩.
4. Set r ← r − δj · xj and repeat steps 2 and 3 many times.
δj = ⟨r, xj⟩ gives the usual forward stagewise; different from forward stepwise.
Analogous to least squares boosting, with trees=predictors
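In R this procedure is only a few lines; a sketch assuming a standardized predictor matrix X and centered response y (hypothetical names):

```r
# Incremental forward stagewise for linear regression: a minimal sketch.
# X: standardized n x p matrix, y: centered response.
forward_stagewise <- function(X, y, n_steps = 5000, eps = 0.01) {
  p <- ncol(X)
  beta <- numeric(p)
  r <- y
  path <- matrix(0, n_steps, p)    # coefficient path, one row per step
  for (m in seq_len(n_steps)) {
    cors <- drop(crossprod(X, r))  # inner products <r, x_j>
    j <- which.max(abs(cors))      # most correlated predictor
    delta <- eps * sign(cors[j])   # tiny step in its direction
    beta[j] <- beta[j] + delta
    r <- r - delta * X[, j]
    path[m, ] <- beta
  }
  path
}
```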
Example: Prostate Cancer Data
[Figure: coefficient profiles for the prostate cancer data. Left: Lasso coefficients against t = ∑j|βj|. Right: Forward Stagewise coefficients against iteration number. Variables: lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45. The two sets of profiles are strikingly similar.]
Linear regression via the Lasso (Tibshirani, 1996)
• Assume ȳ = 0, x̄j = 0, Var(xj) = 1 for all j.
• Minimize ∑i (yi − ∑j xij βj)² subject to ||β||1 ≤ t.
• Similar to ridge regression, which has constraint ||β||2 ≤ t
• Lasso does variable selection and shrinkage, while ridge only shrinks.
[Figure: the lasso (diamond) and ridge (disc) constraint regions in the (β1, β2) plane, with the elliptical contours of the squared-error loss around β̂; the contours touch the lasso diamond at a corner, giving a zero coefficient.]
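The entire lasso path for this criterion can be computed with the lars package in R; a small sketch, assuming a numeric predictor matrix x and response y:

```r
library(lars)

# Hypothetical standardized predictors x (n x p matrix) and response y.
fit <- lars(x, y, type = "lasso")      # entire lasso path
plot(fit)                              # coefficients against the L1 norm
# Coefficients at t equal to half its maximum ("fraction" mode indexes by t/t_max):
coef(fit, s = 0.5, mode = "fraction")
```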
Diabetes Data
[Figure: Lasso (left) and Stagewise (right) coefficient paths βj for the 10 diabetes predictors, plotted against t = ∑|βj|; the two panels are virtually identical.]
Why are Forward Stagewise and Lasso so similar?
• Are they identical?
• In the orthogonal-predictor case: yes.
• In the hard-to-verify case of monotone coefficient paths: yes.
• In general, almost!
• Least angle regression (LAR) provides answers to these questions, and an efficient way to compute the complete Lasso sequence of solutions.
Least Angle Regression — LAR
Like a “more democratic” version of forward stepwise regression.
1. Start with r = y, β1 = β2 = · · · = βp = 0. Assume the xj are standardized.
2. Find the predictor xj most correlated with r.
3. Increase βj in the direction of sign(corr(r, xj)) until some other competitor xk has as much correlation with the current residual as does xj.
4. Move (βj, βk) in the joint least squares direction for (xj, xk) until some other competitor xℓ has as much correlation with the current residual.
5. Continue in this way until all predictors have been entered. Stop when corr(r, xj) = 0 ∀ j, i.e. the OLS solution.
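The lars package (which also ships the diabetes data used on the earlier slide) computes this path directly; a brief sketch:

```r
library(lars)

# LAR path on the diabetes data included with the lars package.
data(diabetes)
fit <- lars(diabetes$x, diabetes$y, type = "lar")
plot(fit)     # piecewise-linear coefficient paths, one knot per step
fit$df        # df after each step (see the df slides below)
```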
[Figure: the geometry of LAR with two predictors x1 and x2, showing the fits µ̂0 and µ̂1 and the projections ȳ1, ȳ2.]
The LAR direction u2 at step 2 makes an equal angle with x1 and x2.
Relationship between the 3 algorithms
• Lasso and forward stagewise can be thought of as restricted versions of LAR.
• Lasso: start with LAR. If a coefficient crosses zero, stop; drop that predictor, recompute the best direction and continue. This gives the Lasso path.
Proof: use the KKT conditions for the appropriate Lagrangian. Informally:

      ∂/∂βj [ ½ ||y − Xβ||² + λ ∑j |βj| ] = 0
      ⇔  ⟨xj, r⟩ = λ · sign(βj)  if βj ≠ 0 (active)
• Forward Stagewise: compute the LAR direction, but constrain the signs of the coefficients to match the correlations corr(r, xj).
• The incremental forward stagewise procedure approximates these steps, one predictor at a time. As the step size ε → 0, it can be shown to coincide with this modified version of LAR.
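The stationarity condition above can be verified numerically on a lars fit; a sketch with the packaged diabetes data (at each knot of the path, the inner products ⟨xj, r⟩ for the active set share a common absolute value λ):

```r
library(lars)

# Numerical check of the KKT condition at a knot of the lasso path:
# for active variables, <x_j, r> = lambda * sign(beta_j).
data(diabetes)                   # predictors are already standardized
x <- unclass(diabetes$x)
y <- diabetes$y - mean(diabetes$y)
fit <- lars(x, y, type = "lasso")
b <- coef(fit)[5, ]              # coefficients at the 5th knot
r <- y - x %*% b
drop(crossprod(x, r))[b != 0]    # equal absolute values (= lambda), signs matching beta
```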
lars package
• The LARS algorithm computes the entire Lasso/FS/LAR path in the same order of computation as one full least squares fit.
• When p ≫ N, the solution has at most N non-zero coefficients. It works efficiently for micro-array data (p in the thousands).
• Cross-validation is quick and easy.
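For example (a sketch with the packaged diabetes data; argument names as in recent versions of lars):

```r
library(lars)

# 10-fold cross-validation along the lasso path.
data(diabetes)
cvfit <- cv.lars(diabetes$x, diabetes$y, K = 10, type = "lasso", mode = "fraction")
# cvfit$index is the fraction s = |beta|/max|beta|; pick the CV minimizer:
s_best <- cvfit$index[which.min(cvfit$cv)]
```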
Cross-Validation Error Curve
[Figure: 10-fold cross-validation error against the tuning parameter s, with a standard-error band.]
• 10-fold CV error curve using the lasso on some diabetes data (64 inputs, 442 samples).
• The thick curve is the CV error curve.
• The shaded region indicates the standard error of the CV estimate.
• The curve shows the effect of overfitting — errors start to increase above s = 0.2.
• This shows a trade-off between bias and variance.
Forward Stagewise and the Monotone Lasso
[Figure: coefficient paths for the Lasso (left) and Forward Stagewise (right) in the expanded variable set, with positive coefficients (top) and negative coefficients (bottom), plotted against the standardized L1 norm.]
• Expand the variable set to include the negative versions −xj.
• The original lasso corresponds to a positive lasso in this enlarged space.
• Forward stagewise corresponds to a monotone lasso. The L1 norm ||β||1 in this enlarged space is arc-length.
• Forward stagewise produces the maximum decrease in loss per unit arc-length in coefficients.
Degrees of Freedom of Lasso
• The df, or effective number of parameters, give us an indication of how much fitting we have done.
• Stein’s Lemma: If the yi are independent N(µi, σ²), then

      df(µ̂) := ∑_{i=1}^n cov(µ̂i, yi)/σ² = E[ ∑_{i=1}^n ∂µ̂i/∂yi ]

• Degrees of freedom formula for LAR: after k steps, df(µ̂k) = k exactly (amazing! with some regularity conditions).
• Degrees of freedom formula for the lasso: let d̂f(µ̂λ) be the number of non-zero elements in β̂λ. Then E[d̂f(µ̂λ)] = df(µ̂λ).
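The LAR result can be checked by simulation, using Stein's covariance formula directly; a hedged sketch on simulated data (the slope part of the fit, excluding the intercept, should show df close to k):

```r
library(lars)
set.seed(1)

# Monte Carlo check of df for LAR via df = sum_i cov(muhat_i, y_i) / sigma^2.
n <- 100; p <- 10; sigma <- 1
X <- scale(matrix(rnorm(n * p), n, p))   # standardized predictors
mu <- X %*% rnorm(p)                     # true mean
nrep <- 500; k <- 5                      # look at the fit after k = 5 LAR steps
Y <- matrix(mu, n, nrep) + sigma * matrix(rnorm(n * nrep), n, nrep)
muhat <- apply(Y, 2, function(y) {
  fit <- lars(X, y, type = "lar")
  drop(X %*% coef(fit)[k + 1, ])         # fitted slope part after k steps
})
# Sum the per-observation covariances between fit and response:
df_est <- sum(sapply(1:n, function(i) cov(muhat[i, ], Y[i, ]))) / sigma^2
df_est                                   # should be close to k
```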
[Figure: LAR coefficient paths against |beta|/max|beta|, with the df for LAR labeled along the top of the plot.]
• The df are labeled at the top of the figure.
• At the point a competitor enters the active set, the df are incremented by 1.
• This is not true, for example, for stepwise regression.
Back to Boosting
• Work with Rosset and Zhu (JMLR 2004) extends the connections between Forward Stagewise and L1-penalized fitting to other loss functions. In particular, the exponential loss of Adaboost, and the binomial loss of Logitboost.
• In the separable case, L1-regularized fitting with these losses converges to an L1-margin-maximizing solution (defined by β*) as the penalty disappears, i.e. if

      β(t) = arg min L(y, f) s.t. |β| ≤ t,

  then

      lim_{t↑∞} β(t)/|β(t)| → β*

• Then min_i yi F*(xi) = min_i yi xiᵀβ*, the L1 margin, is maximized.
• When the monotone lasso is used in the expanded feature space, the connection with boosting (with shrinkage) is more precise.
• This ties in very nicely with the L1-margin explanation of boosting (Schapire, Freund, Bartlett and Lee, 1998).
• It makes connections between SVMs and boosting, and makes explicit the margin-maximizing properties of boosting.
• Experience from statistics suggests that some β(t) along the path might perform better — a.k.a. stopping early.
• Zhao and Yu (2004) incorporate backward corrections with forward stagewise, and produce a boosting algorithm that mimics the lasso.
Maximum Margin and Overfitting
Mixture data from ESL. Boosting with 4-node trees, gbm package in R, shrinkage = 0.02, Adaboost loss.
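A hedged sketch of such a fit with the gbm package; the data frame names (train, test) are placeholders, and interaction.depth = 3 gives trees with roughly 4 terminal nodes:

```r
library(gbm)

# Boosting with Adaboost loss and heavy shrinkage: a sketch.
# train/test are hypothetical data frames; y must be coded 0/1 for "adaboost".
fit <- gbm(y ~ ., data = train, distribution = "adaboost",
           n.trees = 10000, shrinkage = 0.02,
           interaction.depth = 3)   # 3 splits, about 4 terminal nodes
# Predictions along the path, every 100 trees:
pred <- predict(fit, newdata = test, n.trees = seq(100, 10000, by = 100))
```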
[Figure: minimal margin (left) and test error (right) against the number of trees, from 0 to 10K.]
Lasso or Forward Stagewise?
• Micro-array example (Golub data): N = 38, p = 7129, binary response, ALL vs. AML.
• The Lasso behaves chaotically near the end of the path, while Forward Stagewise is smooth and stable.
[Figure: standardized coefficient paths against |beta|/max|beta| for the Golub data: LASSO (left) and Forward Stagewise (right). The lasso paths oscillate violently near the right end of the path; the stagewise paths remain smooth.]
Other Path Algorithms
• Elasticnet (Zou and Hastie, 2005): a compromise between lasso and ridge. Minimize ∑i (yi − ∑j xij βj)² subject to α||β||1 + (1 − α)||β||2² ≤ t. Useful for situations where variables operate in correlated groups (genes in pathways).
• Glmpath (Park and Hastie, 2005): approximates the L1 regularization path for generalized linear models, e.g. logistic regression and Poisson regression.
• Friedman and Popescu (2004) created Pathseeker. It uses an efficient incremental forward-stagewise algorithm with a variety of loss functions. A generalization adjusts the leading k coefficients at each step; k = 1 corresponds to forward stagewise, k = p to gradient descent.
• Bach and Jordan (2004) have path algorithms for kernel estimation, and for efficient ROC curve estimation. The latter is a useful generalization of the Svmpath algorithm discussed later.
• Rosset and Zhu (2004) discuss the conditions needed to obtain piecewise-linear paths. A combination of a piecewise quadratic/linear loss function and an L1 penalty is sufficient.
• Mee-Young Park is finishing a Cosso path algorithm. Cosso (Lin and Zhang, 2002) fits models of the form

      min_β ℓ(β) + ∑_{k=1}^K λk ||βk||2

  where || · ||2 is the L2 norm (not squared), and βk represents a subset of the coefficients.
elasticnet package (Hui Zou)
• Minimize ∑i (yi − ∑j xij βj)² s.t. α · ||β||2² + (1 − α) · ||β||1 ≤ t.
• The mixed penalty selects correlated sets of variables in groups.
• For fixed α, the LARS algorithm, along with a standard ridge regression trick, lets us compute the entire regularization path.
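A sketch with the enet() function from the elasticnet package (x and y are hypothetical, as before):

```r
library(elasticnet)

# Entire elastic net path for a fixed quadratic (ridge) penalty weight,
# computed by the LARS-style enet() algorithm. x: numeric matrix, y: response.
fit <- enet(x, y, lambda = 0.5)   # lambda is the ridge-penalty weight
plot(fit)                         # coefficient paths against the L1 fraction
```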
[Figure: standardized coefficient paths for six variables against s = |beta|/max|beta|: Lasso (left) and Elastic Net with lambda = 0.5 (right).]
glmpath package
• max ℓ(β) s.t. ||β||1 ≤ t.
• Predictor-corrector methods from convex optimization are used.
• Computes the exact path at a sequence of index points t.
• Can approximate the junctions (in t) where the active set changes.
• coxpath is included in the package.
[Figure: coefficient paths for five variables (x1–x5) against lambda, computed at a sparse (left) and a dense (right) sequence of index points.]
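A sketch of the basic glmpath call (x a numeric matrix, y a binary response; names hypothetical):

```r
library(glmpath)

# L1 regularization path for logistic regression: a sketch.
fit <- glmpath(x, y, family = binomial)
plot(fit)     # coefficient paths along the path of index points
```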
Path algorithms for the SVM
• The two-class SVM classifier f(X) = α0 + ∑_{i=1}^N αi K(X, xi) yi can be seen to have a quadratic penalty and piecewise-linear loss. As the cost parameter C is varied, the Lagrange multipliers αi change piecewise-linearly.
• This allows the entire regularization path to be traced exactly. The active set is determined by the points exactly on the margin.
[Figure: three snapshots of SVM solutions along the path.
12 points, 6 per class, separated (Step 17, Error 0, Elbow size 2, Loss 0).
Mixture data, radial kernel, gamma = 1.0 (Step 623, Error 13, Elbow size 54, Loss 30.46).
Mixture data, radial kernel, gamma = 5 (Step 483, Error 1, Elbow size 90, Loss 1.01).]
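A sketch using the svmpath package; the exact argument names here are assumptions from my reading of its interface, so treat them accordingly:

```r
library(svmpath)

# Trace the whole SVM cost path for a radial kernel: a hedged sketch.
# x: numeric matrix, y: labels coded -1/+1, xtest: hypothetical test inputs.
fit <- svmpath(x, y, kernel.function = radial.kernel, param.kernel = 1)
# The Lagrange multipliers alpha_i are piecewise-linear in the cost parameter;
# evaluate f(x) at one value of the regularization parameter:
pred <- predict(fit, newx = xtest, lambda = 0.1)
```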
SVM as a regularization method
[Figure: the SVM hinge loss and the binomial log-likelihood loss, plotted against the margin yf(x).]
With f(x) = xᵀβ + β0 and yi ∈ {−1, 1}, consider

      min_{β0, β} ∑_{i=1}^N [1 − yi f(xi)]₊ + (λ/2) ||β||²

This hinge-loss criterion is equivalent to the SVM, with λ monotone in C. Compare with

      min_{β0, β} ∑_{i=1}^N log[1 + e^{−yi f(xi)}] + (λ/2) ||β||²

This is the binomial deviance loss, and the solution is “ridged” linear logistic regression.
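The two losses are easy to compare directly; a small base-R sketch:

```r
# Compare the hinge loss and the binomial deviance as functions of the margin.
m <- seq(-3, 3, length.out = 200)
hinge <- pmax(1 - m, 0)         # SVM hinge loss [1 - yf]+
binom <- log(1 + exp(-m))       # binomial deviance (log loss)
plot(m, hinge, type = "l", xlab = "yf(x) (margin)", ylab = "Loss")
lines(m, binom, lty = 2)
legend("topright", c("Hinge (SVM)", "Binomial deviance"), lty = 1:2)
```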
The Need for Regularization
Test Error Curves — SVM with Radial Kernel
[Figure: test error against C = 1/λ for γ = 5, 1, 0.5, and 0.1.]
• γ is a kernel parameter: K(x, z) = exp(−γ||x − z||²).
• λ (or C) is a regularization parameter, which has to be determined using some means like cross-validation.
• Using logistic regression with binomial loss, or Adaboost with exponential loss, and the same quadratic penalty as the SVM, we get the same limiting margin as the SVM (Rosset, Zhu and Hastie, JMLR 2004).
• Alternatively, using the hinge loss of SVMs and an L1 penalty (rather than quadratic), we get a Lasso version of the SVM (with at most N variables in the solution for any value of the penalty).
Concluding Comments
• Boosting fits a monotone L1 regularization path toward a maximum-margin classifier.
• Many modern function estimation techniques create a path of solutions via regularization.
• In many cases these paths can be computed efficiently and entirely.
• This facilitates the important step of model selection — selecting a desirable position along the path — using a test sample or by CV.