Linear Regression with Sparsity Constraints
April 22, 2013
MATH 287D Spring 2013: Statistical Learning, University of California, San Diego. Instructor: Jelena Bradic
http://math.ucsd.edu/~jbradic
Restricted Null Space and l1 minimization
A Random Matrix Theory Result
Lasso, Ridge, Scad and all that fun stuff
Noiseless linear model and basis pursuit
y = Xθ∗
I under-determined system of linear equations: unidentifiable without constraints
I assume θ∗ ∈ Rp is sparse: supported on S ⊂ {1, 2, · · · , p} with |S| = s.
l0 optimization:
θ̂ = arg min_{θ∈Rp} ‖θ‖0  s.t.  Xθ = y
Computationally intractable (NP-hard)
l1 relaxation:
θ̂ = arg min_{θ∈Rp} ‖θ‖1  s.t.  Xθ = y
A linear program (easy to solve): the basis pursuit relaxation
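As an illustration (not part of the original slides), basis pursuit can be written as a linear program by splitting θ into its positive and negative parts; below is a minimal sketch using scipy.optimize.linprog, with all variable names our own.

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||theta||_1 subject to X theta = y as an LP in (theta+, theta-)."""
    n, p = X.shape
    c = np.ones(2 * p)                     # objective: sum(theta+) + sum(theta-)
    A_eq = np.hstack([X, -X])              # constraint: X (theta+ - theta-) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]           # theta = theta+ - theta-

# toy check: n = 10 equations, p = 30 unknowns, 2-sparse theta*
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))
theta_star = np.zeros(30)
theta_star[[3, 7]] = [1.5, -2.0]
theta_hat = basis_pursuit(X, X @ theta_star)   # recovers theta* when RN(S) holds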
Restricted nullspace: necessary and sufficient
Definition: For a fixed S ⊂ {1, · · · , p}, the matrix X ∈ Rn×p satisfies the restricted nullspace property with respect to S (RN(S) for short) if
N(X) ∩ C(S) = {0}
{∆ ∈ Rp : X∆ = 0} ∩ {∆ ∈ Rp : ‖∆Sc ‖1 ≤ ‖∆S‖1} = {0}
(Donoho & Huo, 2001; Feuer & Nemirovski, 2003; Cohen et al, 2009)
Theorem: Basis pursuit is exact for all S-sparse vectors (provided the sparsity is not too large) ⇔ every such S-sparse vector is the unique solution of the l1 relaxation problem ⇔ the matrix X satisfies RN(S)
Restricted nullspace: necessary and sufficient
Proof (sufficiency):
I The error vector ∆ = θ̂ − θ∗ satisfies X∆ = 0 and hence ∆ ∈ N(X)
I We need to show that ∆ ∈ C(S)
I Optimality of θ̂: ‖θ̂‖1 ≤ ‖θ∗‖1 = ‖θ∗S‖1
I Sparsity of θ∗: ‖θ̂‖1 = ‖θ∗ + ∆‖1 = ‖θ∗S + ∆S‖1 + ‖∆Sc‖1
I Triangle inequality: ‖θ∗S + ∆S‖1 + ‖∆Sc‖1 ≥ ‖θ∗S‖1 − ‖∆S‖1 + ‖∆Sc‖1
I Combining the three displays: ‖θ∗S‖1 ≥ ‖θ∗S‖1 − ‖∆S‖1 + ‖∆Sc‖1, so ‖∆Sc‖1 ≤ ‖∆S‖1, i.e. ∆ ∈ C(S)
Hence ∆ ∈ N(X) ∩ C(S) ⇒ ∆ = 0 ⇒ θ̂ = θ∗
Restricted nullspace: necessary and sufficient
I Suppose θ∗ = (0, 0, θ∗3), so S = {3}
I Then ∆ = θ̂ − θ∗ belongs to the set
C(S) = {(∆1, ∆2, ∆3) : |∆1| + |∆2| ≤ |∆3|}
Some sufficient conditions
How can the RN condition be verified?
Donoho & Huo (2001): elementwise incoherence condition
max_{j,k=1,··· ,p} | ⟨xj, xk⟩/n − 1{j = k} | ≤ δ1/s
Matrices with i.i.d. sub-Gaussian entries: holds with n ≥ s² log p.
Candes & Tao (2005): restricted isometry (submatrix incoherence)
max_{|U|≤2s} ‖ XUᵀXU/n − I|U|×|U| ‖2 ≤ δ2s
Matrices with i.i.d. sub-Gaussian entries: holds with n ≥ s log(p/s).
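As a quick numerical illustration (ours, not from the slides; all sizes are arbitrary), both quantities can be estimated for an i.i.d. Gaussian design. Note that checking the RIP exactly would require a maximum over all size-2s supports; the sketch below only samples one.

import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 500, 5
X = rng.standard_normal((n, p))                 # i.i.d. (sub-)Gaussian design

G = X.T @ X / n                                 # Gram matrix X'X / n
elementwise = np.max(np.abs(G - np.eye(p)))     # elementwise incoherence max_{j,k} |G_jk - 1{j=k}|

U = rng.choice(p, size=2 * s, replace=False)    # one random support of size 2s
rip_dev = np.linalg.norm(G[np.ix_(U, U)] - np.eye(2 * s), 2)   # spectral deviation on that submatrix

print(elementwise, rip_dev)                     # compare with delta1/s and delta_2s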
Incoherence conditions imply RN, but they are far from necessary: they are easy to violate.
Example: Let Xi ∼ N(0, Σ) be i.i.d. and let
X = [x1, · · · , xp] ∈ Rn×p be the matrix with rows X1ᵀ, · · · , Xnᵀ, where
Σ = (1 − µ) Ip×p + µ 1 1ᵀ
I Elementwise incoherence violated for any j ≠ k:
P[ ⟨xj, xk⟩/n ≥ µ − ε ] ≥ 1 − c1 exp{−c2 n ε²}
I RIP constants tend to infinity as n, s increase:
P[ ‖XSᵀXS/n − Is×s‖2 ≥ µ(s − 1) − 1 − ε ] ≥ 1 − c1 exp{−c2 n ε²}
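This failure is easy to see in simulation; the sketch below is ours (µ and the dimensions are arbitrary) and shows the off-diagonal Gram entries concentrating near µ rather than 0, and the submatrix deviation growing with s.

import numpy as np

rng = np.random.default_rng(1)
n, p, mu, s = 500, 50, 0.5, 10
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))      # constant mu-correlation
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

G = X.T @ X / n
print(G[~np.eye(p, dtype=bool)].mean())                  # off-diagonal entries concentrate near mu

S = np.arange(s)
print(np.linalg.norm(G[np.ix_(S, S)] - np.eye(s), 2))    # grows roughly like mu * (s - 1)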
Hoeffding’s inequality
Lemma: If Z is a random variable with mean zero and a ≤ Z ≤ b, then
E[exp(sZ)] ≤ exp(s²(b − a)²/8)
Theorem: Let Y1, · · · , Yn be bounded independent random variables such that ai ≤ Yi ≤ bi with probability 1. Let Sn = ∑_{i=1}^n Yi. Then, for any t > 0,
P(|Sn − E(Sn)| > t) ≤ 2 exp{−2t² / ∑_{i=1}^n (bi − ai)²}
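A small Monte Carlo check of the bound (our illustration; the distribution, n, and t are arbitrary choices):

import numpy as np

rng = np.random.default_rng(2)
n, t, reps = 100, 10.0, 20000
Y = rng.uniform(0, 1, size=(reps, n))            # a_i = 0, b_i = 1, so sum (b_i - a_i)^2 = n
S = Y.sum(axis=1)
empirical = np.mean(np.abs(S - n * 0.5) > t)     # estimate of P(|S_n - E S_n| > t)
bound = 2 * np.exp(-2 * t**2 / n)                # Hoeffding bound
print(empirical, bound)                          # the empirical frequency sits below the bound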
Direct Result for restricted nullspace/eigenvalues
Theorem (Raskutti & Wainwright & Yu, 2009): Consider a random design X ∈ Rn×p with each row Xi ∼ N(0, Σ) i.i.d., and define κ(Σ) = maxj Σjj. Then, for universal constants c1, c2,
‖Xθ‖2/√n ≥ (1/2) ‖Σ^{1/2}θ‖2 − 9 κ(Σ) √(log p / n) ‖θ‖1
for all θ ∈ Rp, with probability greater than 1 − c1 exp{−c2 n}.
I Much less restrictive than incoherence/RIP
I Many matrix families are covered:
  I Toeplitz dependence
  I Constant µ-correlation
  I The covariance matrix Σ can be degenerate
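The lower bound can be probed empirically; the sketch below is ours (the constant-correlation Σ and all sizes are arbitrary) and evaluates both sides of the inequality for a few random sparse directions θ.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(3)
n, p, mu = 200, 400, 0.5
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))
Sigma_half = np.real(sqrtm(Sigma))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
kappa = Sigma.diagonal().max()                   # kappa(Sigma) = max_j Sigma_jj

for _ in range(5):
    theta = np.zeros(p)
    support = rng.choice(p, size=5, replace=False)
    theta[support] = rng.standard_normal(5)      # a random sparse direction
    lhs = np.linalg.norm(X @ theta) / np.sqrt(n)
    rhs = 0.5 * np.linalg.norm(Sigma_half @ theta) \
          - 9 * kappa * np.sqrt(np.log(p) / n) * np.abs(theta).sum()
    print(lhs >= rhs)                            # the bound holds with high probability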
Shrinkage Methods
I Impose a penalty on the size of the coefficients:
min_β RSS(β) + λ Q(β)
I This is equivalent to
min_β RSS(β) subject to Q(β) ≤ s
for any given λ ∈ [0, ∞) there exists an s > 0 such that the two problems have the same solution, and vice versa.
I The tuning parameter λ (or s) is chosen to minimize (an estimate of) predictionerror.
I Often, the predictors are normalized to have mean 0 and the same ‘size’; the response is centered and β0 is set to 0.
I For best subset selection, Q(β) = |β|0 = ∑_j 1{βj ≠ 0}.
Ridge Regression
I Ridge regression employs Q(β) = |β|2² = ∑_j βj²:
β̂ = arg min_β RSS(β) + λ|β|2²
  = arg min_β RSS(β) subject to |β|2² ≤ s
I Explicitly, β̂ = (XᵀX + λI)^{−1} Xᵀy
I In analogy with least squares, the degrees of freedom are defined as
df(λ) = tr(Hλ), where Hλ = X(XᵀX + λI)^{−1}Xᵀ
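A minimal numpy sketch of the closed form and the corresponding degrees of freedom (our illustration; the data and λ are arbitrary):

import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate and its effective degrees of freedom."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    beta = np.linalg.solve(A, X.T @ y)           # (X'X + lam I)^{-1} X'y
    H = X @ np.linalg.solve(A, X.T)              # hat matrix H_lam = X (X'X + lam I)^{-1} X'
    return beta, np.trace(H)                     # df(lam) = tr(H_lam)

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(50)
beta_hat, df = ridge(X, y, lam=5.0)              # df shrinks from 10 toward 0 as lam grows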
Lasso
I Lasso employs Q(β) = |β|1 = ∑_j |βj|:
β̂ = arg min_β RSS(β) + λ|β|1
  = arg min_β RSS(β) subject to |β|1 ≤ s
I In general, no explicit form is available, but the optimization problem is convex.
Figure: Left: contour lines of the residual sum of squares and the l1-ball corresponding to the Lasso problem. Right: analogous picture with the l2-ball corresponding to Ridge regression.
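A short sketch using scikit-learn's Lasso solver (one standard implementation; the data and penalty level are arbitrary). Note that sklearn minimizes (1/(2n))·RSS(β) + α|β|1, so its α is a rescaled version of λ above.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 0.5]                 # sparse truth
y = X @ beta_true + 0.1 * rng.standard_normal(100)

fit = Lasso(alpha=0.05, fit_intercept=False).fit(X, y)
print(np.nonzero(fit.coef_)[0])                  # indices of the selected (nonzero) coefficients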
Scad
I Scad employs the penalty Q defined through its derivative
Q′(|β|) = λ 1{|β| ≤ λ} + [(aλ − |β|)+ / ((a − 1)λ)] 1{|β| > λ},  a > 2:
β̂ = arg min_β RSS(β) + ∑_j Q(|βj|)
  = arg min_β RSS(β) subject to ∑_j Q(|βj|) ≤ s
I In general, no explicit form is available, and the optimization problem is non-convex.
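For concreteness, a small function evaluating the Scad penalty derivative above (our sketch; a = 3.7 is the value commonly recommended by Fan & Li):

import numpy as np

def scad_derivative(beta, lam, a=3.7):
    """Derivative Q'(|beta|) of the Scad penalty, evaluated elementwise."""
    b = np.abs(beta)
    return lam * ((b <= lam) + np.maximum(a * lam - b, 0) / ((a - 1) * lam) * (b > lam))

# constant (lasso-like) for small |beta|, decaying in between, and zero for |beta| >= a*lam,
# so large coefficients are left unpenalized
print(scad_derivative(np.array([0.1, 1.0, 5.0]), lam=0.5))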
Orthogonal Predictors
I Suppose that X has orthonormal column vectors.
I Let β̂ be the ordinary least squares estimator.
Method                 Formula for the jth coefficient
Best subset (size q)   β̂j · 1{|β̂j| > |β̂|(p−q)}
Ridge                  β̂j / (1 + λ)
Lasso                  sign(β̂j)(|β̂j| − λ)+   (soft thresholding; Donoho and Johnstone, 1994)
I We see that ridge regression does not set coefficients to zero, while lasso does.
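All three rules are one-liners; a sketch comparing them on the same vector of OLS coefficients (our illustration, arbitrary numbers):

import numpy as np

beta_ols = np.array([3.0, -1.2, 0.4, -0.1])
lam, q = 0.5, 2

best_subset = beta_ols * (np.abs(beta_ols) >= np.sort(np.abs(beta_ols))[-q])   # keep the q largest
ridge = beta_ols / (1 + lam)                                                   # proportional shrinkage
lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0)              # soft thresholding

print(best_subset, ridge, lasso, sep="\n")   # lasso and best subset set coefficients to zero; ridge does not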
Lasso and orthogonal predictors
Remember that Lasso solves the following optimization problem
β̂lasso = arg min_β RSS(β) + λ|β|1
which is equivalent to
β̂lasso = arg min_β −2yᵀXβ + βᵀβ + λ ∑_{j=1}^p |βj|
(because XᵀX = I, and writing β̂ = Xᵀy for the OLS estimator)
  = arg min_β −2β̂ᵀβ + βᵀβ + λ ∑_{j=1}^p |βj|
  = arg min_β ∑_{j=1}^p ( −2β̂j βj + βj² + λ|βj| )
Hence, the optimization can be solved for each index j separately:
min { min_{β>0} ( −2β̂jβ + β² + λβ ),  min_{β<0} ( −2β̂jβ + β² − λβ ) },
whose minimizer is the soft-thresholding rule from the table above: β̂j is shrunk toward 0 and set exactly to 0 once |β̂j| falls below the threshold.
Implementation Perspective
The Lars algorithm (Efron, Hastie, Johnstone and Tibshirani, 2004)
I Builds on forward stagewise regression: an iterative procedure in which successive estimates are built via a series of small steps (see the sketch below)
I Let η = Xβ. Set the initial estimate η0 = 0.
I Let η be the current estimate.
I The next step is taken in the direction of greatest correlation between a covariate xj and the current residual:
r = Xᵀ(y − η),  j = arg max_j |rj|
I The estimate is then changed at that single coordinate j by the update
η ← η + ε sign(rj) xj
where ε > 0 is some small constant; smaller ε yields a less greedy algorithm.
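A compact sketch of forward stagewise as described above (our illustration; step size and number of steps are arbitrary):

import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Forward stagewise regression: many small moves along the most correlated covariate."""
    n, p = X.shape
    beta = np.zeros(p)
    eta = np.zeros(n)                            # current fit eta = X beta, started at 0
    for _ in range(n_steps):
        r = X.T @ (y - eta)                      # correlations with the current residual
        j = np.argmax(np.abs(r))
        beta[j] += eps * np.sign(r[j])           # small step on coordinate j only
        eta = X @ beta
    return beta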
Lars
I The algorithm begins at η0 = 0
I Suppose η is the current estimate and write
r = Xᵀ(y − η) for the vector of current correlations
I Define the active set A as the set of indices corresponding to the covariates with the largest absolute correlations:
R = max_j |rj|,  A = {j : |rj| = R}
I Define the active matrix corresponding to A as XA = (sj xj)j∈A, sj = sign(rj)
I The next step of the Lars estimate gives the update
η ← η + γ uA
I where γ is the smallest positive number such that one and only one new index joins the active set A
I and uA is the unit equiangular vector with the columns of the active matrix XA:
uA = XA ( 1Aᵀ GA^{−1} 1A )^{−1/2} GA^{−1} 1A,  where GA = XAᵀXA and 1A is the all-ones vector of length |A|
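The equiangular direction is a few lines of linear algebra; the sketch below (ours) computes only uA for a given active set, not the full Lars path.

import numpy as np

def equiangular_direction(X, active, signs):
    """Unit vector making equal angles with the sign-adjusted active columns of X."""
    XA = X[:, active] * signs                    # X_A = (s_j x_j) for j in the active set
    GA = XA.T @ XA                               # G_A = X_A' X_A
    one = np.ones(len(active))
    GA_inv_one = np.linalg.solve(GA, one)
    AA = 1.0 / np.sqrt(one @ GA_inv_one)         # A_A = (1' G_A^{-1} 1)^{-1/2}
    uA = XA @ (AA * GA_inv_one)                  # u_A = A_A X_A G_A^{-1} 1
    return uA                                    # satisfies X_A' u_A = A_A 1 and ||u_A||_2 = 1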
Lars cont’
The Lasso, forward stagewise, and Lars all build a sequence of candidate models, from which the final model is chosen.
I In the Lasso, the sequence is controlled by s
I In forward stagewise, it is controlled by the number of steps
I Lars builds (p + 1) models, with the number of variables ranging from 0 to p
There is a close relationship among these procedures in that they give almost identical solution paths. That is, if the candidate models are connected in each of these procedures, the resulting graphs are very similar. In the special case of an orthogonal design matrix, the solution paths of the procedures are identical.
Goal of Variable Selection
I In many practical situations, some covariates are superfluous.
I That is, conditional on a subset of the covariates, the response does not depend on the other covariates.
I In other words, only a proper subset of the regression coefficients are nonzero.
I The problem of variable selection is to identify this set of important covariates.
Lasso model selection: a toy lemma
Lemma: When the true model is β∗ = (β∗1, 0, 0, · · · , 0) ∈ Rp with p − 1 zero components and XᵀX = Ip, the Lasso estimator tuned for prediction accuracy selects the right model if and only if δ = β̂OLS − β∗ ∈ R, where
R = {δ ∈ Rp : δ1β∗1 > 0, |δ1| > max{|δ2|, · · · , |δp|}}.
The probability of the right model being selected is 1/(2p).
Necessity: Recall that the Lasso solution in the orthonormal case is
β̂Lasso,j = sign(β̂j)(|β̂j| − λ)+,  ∀j ∈ {1, · · · , p}
If the correct model is selected then β̂Lasso = β∗, and we need
|β̂1| − max{|β̂2|, · · · , |β̂p|} ≥ |β∗1|
W.l.o.g. assume that β∗1 > 0 and |δ2| = max{|δ2|, · · · , |δp|}. First note that, if β̂1 < 0, then β̂1 ≤ 0 < β∗1 and the Lasso never selects the true model. Then, when β̂1 ≥ 0, to have β̂Lasso = β∗ we need
β̂1 − |β̂2| ≥ β∗1 ⇔ δ1 > |δ2|
Sufficiency: The prediction error is minimized at the desired point of the Lasso estimator.
PE(γ) = (β̂Lasso − β∗)ᵀ XᵀX (β̂Lasso − β∗)  (writing γ for the Lasso threshold)
  = (β∗1)²  if γ ≥ |β̂1|
  = (δ1 − γ)²  if |β̂2| ≤ γ < β̂1
  = (δ1 − γ)² + ∑_{j=2}^p ((|β̂j| − γ)+)²  if γ < |β̂2|
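A Monte Carlo sanity check of the 1/(2p) claim (our illustration; it simulates the event δ ∈ R directly under the assumption δ ∼ N(0, I), rather than running the Lasso):

import numpy as np

rng = np.random.default_rng(6)
p, reps = 10, 200000
delta = rng.standard_normal((reps, p))           # delta = beta_OLS - beta* ~ N(0, sigma^2 I) when X'X = I
beta1_star = 1.0                                 # only the sign of the true nonzero coefficient matters

in_R = (delta[:, 0] * beta1_star > 0) & \
       (np.abs(delta[:, 0]) > np.abs(delta[:, 1:]).max(axis=1))
print(in_R.mean(), 1 / (2 * p))                  # empirical frequency vs. the stated 1/(2p)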
I In practice prediction accuracy is the gold standard, and the Lasso can improve greatly over the ordinary least squares estimate in terms of accuracy:
E_{Xnew,Ynew}[(Ynew − Xnewβ̂)²] = σ² + (β̂ − β∗)ᵀ Σ (β̂ − β∗)
I For random designs, (β̂ − β∗)ᵀΣ(β̂ − β∗) = E_Xnew[(Xnew(β̂ − β∗))²]
I For fixed designs, with Σ = XᵀX/n, (β̂ − β∗)ᵀΣ(β̂ − β∗) = ‖X(β̂ − β∗)‖2²/n
I Model selection properties depend heavily on the way the tuning parameter λ is chosen