Linear Regression with Sparsity Constraints
April 22, 2013
MATH 287D Spring 2013: Statistical Learning, University of California, San Diego. Instructor: Jelena Bradic
http://math.ucsd.edu/~jbradic
Restricted Null Space and l1 minimization
A Random Matrix Theory Result
Lasso, Ridge, Scad and all that fun stuff
Noiseless linear model and basis pursuit
y = Xθ∗
I under-determined system of linear equations: unidentifiable without constraints
I assume θ∗ ∈ Rp is sparse: supported on S ⊂ {1, 2, · · · , p} with |S| = s.
l0 optimization:
θ̂ = arg min_{θ∈Rp} ‖θ‖0  s.t.  Xθ = y
Computationally intractable (NP-hard)
l1 relaxation:
θ̂ = arg min_{θ∈Rp} ‖θ‖1  s.t.  Xθ = y
A linear program (easy to solve): the basis pursuit relaxation
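As an illustration (not part of the original slides), basis pursuit can be written as a linear program by splitting θ into its positive and negative parts; below is a minimal sketch using scipy.optimize.linprog, with all variable names our own.

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||theta||_1 subject to X theta = y as an LP in (theta+, theta-)."""
    n, p = X.shape
    c = np.ones(2 * p)                     # objective: sum(theta+) + sum(theta-)
    A_eq = np.hstack([X, -X])              # constraint: X (theta+ - theta-) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]           # theta = theta+ - theta-

# toy check: n = 10 equations, p = 30 unknowns, 2-sparse theta*
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))
theta_star = np.zeros(30)
theta_star[[3, 7]] = [1.5, -2.0]
theta_hat = basis_pursuit(X, X @ theta_star)   # recovers theta* when RN(S) holds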
Restricted nullspace: necessary and sufficient
Definition: For a fixed S ⊂ {1, · · · , p}, the matrix X ∈ Rn×p satisfies the restricted nullspace property with respect to S (RN(S) for short) if
N(X) ∩ C(S) = {0}
{∆ ∈ Rp : X∆ = 0} ∩ {∆ ∈ Rp : ‖∆Sc ‖1 ≤ ‖∆S‖1} = {0}
(Donoho & Huo, 2001; Feuer & Nemirovski, 2003; Cohen et al, 2009)
Theorem: Basis pursuit is exact for all S-sparse vectors (provided the sparsity is not too large) ⇔ every such S-sparse vector is the unique solution of the l1 relaxation problem ⇔ the matrix X satisfies RN(S)
Restricted nullspace: necessary and sufficient
Proof (sufficiency):
I The error vector ∆ = θ̂ − θ∗ satisfies X∆ = 0 and hence ∆ ∈ N(X)
I We need to show that ∆ ∈ C(S)
I Optimality of θ̂: ‖θ̂‖1 ≤ ‖θ∗‖1 = ‖θ∗S‖1
I Sparsity of θ∗: ‖θ̂‖1 = ‖θ∗ + ∆‖1 = ‖θ∗S + ∆S‖1 + ‖∆Sc‖1
I Triangle inequality: ‖θ∗S + ∆S‖1 + ‖∆Sc‖1 ≥ ‖θ∗S‖1 − ‖∆S‖1 + ‖∆Sc‖1
I Combining the three displays: ‖θ∗S‖1 ≥ ‖θ∗S‖1 − ‖∆S‖1 + ‖∆Sc‖1, so ‖∆Sc‖1 ≤ ‖∆S‖1, i.e. ∆ ∈ C(S)
Hence ∆ ∈ N(X) ∩ C(S) ⇒ ∆ = 0 ⇒ θ̂ = θ∗
Restricted nullspace: necessary and sufficient
I Suppose θ∗ = (0, 0, θ∗3), so S = {3}
I Then ∆ = θ̂ − θ∗ belongs to the set
C(S) = {(∆1, ∆2, ∆3) : |∆1| + |∆2| ≤ |∆3|}
Some sufficient conditions
How can the RN condition be verified?
Donoho & Huo (2001): elementwise incoherence condition
max_{j,k=1,··· ,p} | ⟨xj, xk⟩/n − 1{j = k} | ≤ δ1/s
Matrices with i.i.d. sub-Gaussian entries: holds with n ≥ s² log p.
Candes & Tao (2005): restricted isometry (submatrix incoherence)
max_{|U|≤2s} ‖ XUᵀXU/n − I|U|×|U| ‖2 ≤ δ2s
Matrices with i.i.d. sub-Gaussian entries: holds with n ≥ s log(p/s).
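As a quick numerical illustration (ours, not from the slides; all sizes are arbitrary), both quantities can be estimated for an i.i.d. Gaussian design. Note that checking the RIP exactly would require a maximum over all size-2s supports; the sketch below only samples one.

import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 500, 5
X = rng.standard_normal((n, p))                 # i.i.d. (sub-)Gaussian design

G = X.T @ X / n                                 # Gram matrix X'X / n
elementwise = np.max(np.abs(G - np.eye(p)))     # elementwise incoherence max_{j,k} |G_jk - 1{j=k}|

U = rng.choice(p, size=2 * s, replace=False)    # one random support of size 2s
rip_dev = np.linalg.norm(G[np.ix_(U, U)] - np.eye(2 * s), 2)   # spectral deviation on that submatrix

print(elementwise, rip_dev)                     # compare with delta1/s and delta_2s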
Incoherence conditions imply RN, but they are far from necessary: they are easy to violate.
Example: Let Xi ∼ N(0, Σ) be i.i.d. and let
X = [x1, · · · , xp] ∈ Rn×p be the matrix with rows X1ᵀ, · · · , Xnᵀ, where
Σ = (1 − µ) Ip×p + µ 1 1ᵀ
I Elementwise incoherence violated for any j ≠ k:
P[ ⟨xj, xk⟩/n ≥ µ − ε ] ≥ 1 − c1 exp{−c2 n ε²}
I RIP constants tend to infinity as n, s increase:
P[ ‖XSᵀXS/n − Is×s‖2 ≥ µ(s − 1) − 1 − ε ] ≥ 1 − c1 exp{−c2 n ε²}
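This failure is easy to see in simulation; the sketch below is ours (µ and the dimensions are arbitrary) and shows the off-diagonal Gram entries concentrating near µ rather than 0, and the submatrix deviation growing with s.

import numpy as np

rng = np.random.default_rng(1)
n, p, mu, s = 500, 50, 0.5, 10
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))      # constant mu-correlation
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

G = X.T @ X / n
print(G[~np.eye(p, dtype=bool)].mean())                  # off-diagonal entries concentrate near mu

S = np.arange(s)
print(np.linalg.norm(G[np.ix_(S, S)] - np.eye(s), 2))    # grows roughly like mu * (s - 1)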
Hoeffding’s inequality
Lemma: If Z is a random variable with mean zero and a ≤ Z ≤ b, then
E[exp(sZ)] ≤ exp(s²(b − a)²/8)
Theorem: Let Y1, · · · , Yn be bounded independent random variables such that ai ≤ Yi ≤ bi with probability 1. Let Sn = ∑_{i=1}^n Yi. Then, for any t > 0,
P(|Sn − E(Sn)| > t) ≤ 2 exp{−2t² / ∑_{i=1}^n (bi − ai)²}
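A small Monte Carlo check of the bound (our illustration; the distribution, n, and t are arbitrary choices):

import numpy as np

rng = np.random.default_rng(2)
n, t, reps = 100, 10.0, 20000
Y = rng.uniform(0, 1, size=(reps, n))            # a_i = 0, b_i = 1, so sum (b_i - a_i)^2 = n
S = Y.sum(axis=1)
empirical = np.mean(np.abs(S - n * 0.5) > t)     # estimate of P(|S_n - E S_n| > t)
bound = 2 * np.exp(-2 * t**2 / n)                # Hoeffding bound
print(empirical, bound)                          # the empirical frequency sits below the bound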
Direct Result for restricted nullspace/eigenvalues
Theorem (Raskutti & Wainwright & Yu, 2009): Consider a random design X ∈ Rn×p with each row Xi ∼ N(0, Σ) i.i.d., and define κ(Σ) = maxj Σjj. Then, for universal constants c1, c2,
‖Xθ‖2/√n ≥ (1/2) ‖Σ^{1/2}θ‖2 − 9 κ(Σ) √(log p / n) ‖θ‖1
for all θ ∈ Rp, with probability greater than 1 − c1 exp{−c2 n}.
I Much less restrictive than incoherence/RIP
I Many matrix families are covered:
  I Toeplitz dependence
  I Constant µ-correlation
  I The covariance matrix Σ can be degenerate
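The lower bound can be probed empirically; the sketch below is ours (the constant-correlation Σ and all sizes are arbitrary) and evaluates both sides of the inequality for a few random sparse directions θ.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(3)
n, p, mu = 200, 400, 0.5
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))
Sigma_half = np.real(sqrtm(Sigma))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
kappa = Sigma.diagonal().max()                   # kappa(Sigma) = max_j Sigma_jj

for _ in range(5):
    theta = np.zeros(p)
    support = rng.choice(p, size=5, replace=False)
    theta[support] = rng.standard_normal(5)      # a random sparse direction
    lhs = np.linalg.norm(X @ theta) / np.sqrt(n)
    rhs = 0.5 * np.linalg.norm(Sigma_half @ theta) \
          - 9 * kappa * np.sqrt(np.log(p) / n) * np.abs(theta).sum()
    print(lhs >= rhs)                            # the bound holds with high probability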
Shrinkage Methods
I Impose a penalty on the size of the coefficients:
min_β RSS(β) + λ Q(β)
I This is equivalent to
min_β RSS(β) subject to Q(β) ≤ s
for any given λ ∈ [0, ∞) there exists an s > 0 such that the two problems have the same solution, and vice versa.
I The tuning parameter λ (or s) is chosen to minimize (an estimate of) predictionerror.
I Often, the predictors are normalized to have mean 0 and the same ‘size’; the response is centered and β0 is set to 0.
I For best subset selection, Q(β) = |β|0 = ∑_j 1{βj ≠ 0}.
Ridge Regression
I Ridge regression employs Q(β) = |β|2² = ∑_j βj²:
β̂ = arg min_β RSS(β) + λ|β|2²
  = arg min_β RSS(β) subject to |β|2² ≤ s
I Explicitly, β̂ = (XᵀX + λI)^{−1} Xᵀy
I In analogy with least squares, the degrees of freedom are defined as
df(λ) = tr(Hλ), where Hλ = X(XᵀX + λI)^{−1}Xᵀ
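A minimal numpy sketch of the closed form and the corresponding degrees of freedom (our illustration; the data and λ are arbitrary):

import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate and its effective degrees of freedom."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    beta = np.linalg.solve(A, X.T @ y)           # (X'X + lam I)^{-1} X'y
    H = X @ np.linalg.solve(A, X.T)              # hat matrix H_lam = X (X'X + lam I)^{-1} X'
    return beta, np.trace(H)                     # df(lam) = tr(H_lam)

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(50)
beta_hat, df = ridge(X, y, lam=5.0)              # df shrinks from 10 toward 0 as lam grows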
Lasso
I Lasso employs Q(β) = |β|1 = ∑_j |βj|:
β̂ = arg min_β RSS(β) + λ|β|1
  = arg min_β RSS(β) subject to |β|1 ≤ s
I In general, no explicit form is available, but the optimization problem is convex.
Figure: Left: contour lines of the residual sum of squares and the l1-ball corresponding to the Lasso problem. Right: analogous picture with the l2-ball corresponding to Ridge regression.
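A short sketch using scikit-learn's Lasso solver (one standard implementation; the data and penalty level are arbitrary). Note that sklearn minimizes (1/(2n))·RSS(β) + α|β|1, so its α is a rescaled version of λ above.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 0.5]                 # sparse truth
y = X @ beta_true + 0.1 * rng.standard_normal(100)

fit = Lasso(alpha=0.05, fit_intercept=False).fit(X, y)
print(np.nonzero(fit.coef_)[0])                  # indices of the selected (nonzero) coefficients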
Scad
I Scad employs the penalty Q defined through its derivative
Q′(|β|) = λ 1{|β| ≤ λ} + [(aλ − |β|)+ / ((a − 1)λ)] 1{|β| > λ},  a > 2:
β̂ = arg min_β RSS(β) + ∑_j Q(|βj|)
  = arg min_β RSS(β) subject to ∑_j Q(|βj|) ≤ s
I In general, no explicit form is available, and the optimization problem is non-convex.
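For concreteness, a small function evaluating the Scad penalty derivative above (our sketch; a = 3.7 is the value commonly recommended by Fan & Li):

import numpy as np

def scad_derivative(beta, lam, a=3.7):
    """Derivative Q'(|beta|) of the Scad penalty, evaluated elementwise."""
    b = np.abs(beta)
    return lam * ((b <= lam) + np.maximum(a * lam - b, 0) / ((a - 1) * lam) * (b > lam))

# constant (lasso-like) for small |beta|, decaying in between, and zero for |beta| >= a*lam,
# so large coefficients are left unpenalized
print(scad_derivative(np.array([0.1, 1.0, 5.0]), lam=0.5))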
Orthogonal Predictors
I Suppose that X has orthonormal column vectors.
I Let β̂ be the ordinary least squares estimator.
Method                 Formula for the jth coefficient
Best subset (size q)   β̂j · 1{|β̂j| > |β̂|(p−q)}
Ridge                  β̂j / (1 + λ)
Lasso                  sign(β̂j)(|β̂j| − λ)+   (soft thresholding; Donoho and Johnstone, 1994)
I We see that ridge regression does not set coefficients to zero, while lasso does.
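All three rules are one-liners; a sketch comparing them on the same vector of OLS coefficients (our illustration, arbitrary numbers):

import numpy as np

beta_ols = np.array([3.0, -1.2, 0.4, -0.1])
lam, q = 0.5, 2

best_subset = beta_ols * (np.abs(beta_ols) >= np.sort(np.abs(beta_ols))[-q])   # keep the q largest
ridge = beta_ols / (1 + lam)                                                   # proportional shrinkage
lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0)              # soft thresholding

print(best_subset, ridge, lasso, sep="\n")   # lasso and best subset set coefficients to zero; ridge does not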
Lasso and orthogonal predictors
Remember that Lasso solves the following optimization problem
β̂lasso = arg min_β RSS(β) + λ|β|1
which is equivalent to
β̂lasso = arg min_β −2yᵀXβ + βᵀβ + λ ∑_{j=1}^p |βj|
(because XᵀX = I, and writing β̂ = Xᵀy for the OLS estimator)
  = arg min_β −2β̂ᵀβ + βᵀβ + λ ∑_{j=1}^p |βj|
  = arg min_β ∑_{j=1}^p ( −2β̂j βj + βj² + λ|βj| )
Hence, the optimization can be solved for each index j separately:
min { min_{β>0} ( −2β̂jβ + β² + λβ ),  min_{β<0} ( −2β̂jβ + β² − λβ ) },
whose minimizer is the soft-thresholding rule from the table above: β̂j is shrunk toward 0 and set exactly to 0 once |β̂j| falls below the threshold.
Implementation Perspective
The Lars algorithm (Efron, Hastie, Johnstone and Tibshirani, 2004)
I Builds on forward stagewise regression: an iterative procedure in which successive estimates are built via a series of small steps (see the sketch below)
I Let η = Xβ. Set the initial estimate η0 = 0.
I Let η be the current estimate.
I The next step is taken in the direction of greatest correlation between a covariate xj and the current residual:
r = Xᵀ(y − η),  j = arg max_j |rj|
I The estimate is then changed at that single coordinate j by the update
η ← η + ε sign(rj) xj
where ε > 0 is some small constant; smaller ε yields a less greedy algorithm.
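A compact sketch of forward stagewise as described above (our illustration; step size and number of steps are arbitrary):

import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Forward stagewise regression: many small moves along the most correlated covariate."""
    n, p = X.shape
    beta = np.zeros(p)
    eta = np.zeros(n)                            # current fit eta = X beta, started at 0
    for _ in range(n_steps):
        r = X.T @ (y - eta)                      # correlations with the current residual
        j = np.argmax(np.abs(r))
        beta[j] += eps * np.sign(r[j])           # small step on coordinate j only
        eta = X @ beta
    return beta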
Lars
I The algorithm begins at η0 = 0
I Suppose η is the current estimate and write
r = Xᵀ(y − η) for the vector of current correlations
I Define the active set A as the set of indices corresponding to the covariates with the largest absolute correlations:
R = max_j |rj|,  A = {j : |rj| = R}
I Define the active matrix corresponding to A as XA = (sj xj)j∈A, sj = sign(rj)
I The next step of the Lars estimate gives the update
η ← η + γ uA
I where γ is the smallest positive number such that one and only one new index joins the active set A
I and uA is the unit equiangular vector with the columns of the active matrix XA:
uA = XA ( 1Aᵀ GA^{−1} 1A )^{−1/2} GA^{−1} 1A,  where GA = XAᵀXA and 1A is the all-ones vector of length |A|
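The equiangular direction is a few lines of linear algebra; the sketch below (ours) computes only uA for a given active set, not the full Lars path.

import numpy as np

def equiangular_direction(X, active, signs):
    """Unit vector making equal angles with the sign-adjusted active columns of X."""
    XA = X[:, active] * signs                    # X_A = (s_j x_j) for j in the active set
    GA = XA.T @ XA                               # G_A = X_A' X_A
    one = np.ones(len(active))
    GA_inv_one = np.linalg.solve(GA, one)
    AA = 1.0 / np.sqrt(one @ GA_inv_one)         # A_A = (1' G_A^{-1} 1)^{-1/2}
    uA = XA @ (AA * GA_inv_one)                  # u_A = A_A X_A G_A^{-1} 1
    return uA                                    # satisfies X_A' u_A = A_A 1 and ||u_A||_2 = 1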
Lars cont’
The Lasso, forward stagewise, and Lars all build a sequence of candidate models, from which the final model is chosen.
I In the Lasso, the sequence is controlled by s
I In forward stagewise, it is controlled by the number of steps
I Lars builds (p + 1) models, with the number of variables ranging from 0 to p
There is a close relationship among these procedures in that they give almost identical solution paths. That is, if the candidate models are connected in each of these procedures, the resulting graphs are very similar. In the special case of an orthogonal design matrix, the solution paths of the procedures are identical.
Goal of Variable Selection
I In many practical situations, some covariates are superfluous.
I That is, conditional on a subset of the covariates, the response does not depend on the other covariates.
I In other words, only a proper subset of the regression coefficients are nonzero.
I The problem of variable selection is to identify this set of important covariates.
Lasso model selection: a toy lemma
Lemma: When the true model is β∗ = (β∗1, 0, 0, · · · , 0) ∈ Rp with p − 1 zero components and XᵀX = Ip, the Lasso estimator tuned for prediction accuracy selects the right model if and only if δ = β̂OLS − β∗ ∈ R, where
R = {δ ∈ Rp : δ1β∗1 > 0, |δ1| > max{|δ2|, · · · , |δp|}}.
The probability of the right model being selected is 1/(2p).
Necessity: Recall that the Lasso solution in the orthonormal case is
β̂Lasso,j = sign(β̂j)(|β̂j| − λ)+,  ∀j ∈ {1, · · · , p}
If the correct model is selected then β̂Lasso = β∗, and we need
|β̂1| − max{|β̂2|, · · · , |β̂p|} ≥ |β∗1|
W.l.o.g. assume that β∗1 > 0 and |δ2| = max{|δ2|, · · · , |δp|}. First note that, if β̂1 < 0, then β̂1 ≤ 0 < β∗1 and the Lasso never selects the true model. Then, when β̂1 ≥ 0, to have β̂Lasso = β∗ we need
β̂1 − |β̂2| ≥ β∗1 ⇔ δ1 > |δ2|
Sufficiency: The prediction error is minimized at the desired point of the Lasso estimator.
PE(γ) = (β̂Lasso − β∗)ᵀ XᵀX (β̂Lasso − β∗)  (writing γ for the Lasso threshold)
  = (β∗1)²  if γ ≥ |β̂1|
  = (δ1 − γ)²  if |β̂2| ≤ γ < β̂1
  = (δ1 − γ)² + ∑_{j=2}^p ((|β̂j| − γ)+)²  if γ < |β̂2|
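A Monte Carlo sanity check of the 1/(2p) claim (our illustration; it simulates the event δ ∈ R directly under the assumption δ ∼ N(0, I), rather than running the Lasso):

import numpy as np

rng = np.random.default_rng(6)
p, reps = 10, 200000
delta = rng.standard_normal((reps, p))           # delta = beta_OLS - beta* ~ N(0, sigma^2 I) when X'X = I
beta1_star = 1.0                                 # only the sign of the true nonzero coefficient matters

in_R = (delta[:, 0] * beta1_star > 0) & \
       (np.abs(delta[:, 0]) > np.abs(delta[:, 1:]).max(axis=1))
print(in_R.mean(), 1 / (2 * p))                  # empirical frequency vs. the stated 1/(2p)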
I In practice prediction accuracy is the gold standard, and the Lasso can improve greatly over the ordinary least squares estimate in terms of accuracy:
E_{Xnew,Ynew}[(Ynew − Xnewβ̂)²] = σ² + (β̂ − β∗)ᵀ Σ (β̂ − β∗)
I For random designs, (β̂ − β∗)ᵀΣ(β̂ − β∗) = E_Xnew[(Xnew(β̂ − β∗))²]
I For fixed designs, with Σ = XᵀX/n, (β̂ − β∗)ᵀΣ(β̂ − β∗) = ‖X(β̂ − β∗)‖2²/n
I Model selection properties depend heavily on the way the tuning parameter λ is chosen