Page 1: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Julien Chiquet¹, Yves Grandvalet², Camille Charbonnier¹

¹ Statistique et Génome, CNRS & Université d'Évry Val d'Essonne

² Heudiasyc, CNRS & Université de Technologie de Compiègne

SSB – 29 March 2011

arXiv preprint: http://arxiv.org/abs/1103.2697

R package scoop: http://stat.genopole.cnrs.fr/logiciels/scoop

Page 2: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Notations

Let
- Y be the output random variable,
- X = (X1, ..., Xp) be the input random variables, where Xj is the jth predictor.

The data. Given a sample (yi, xi), i = 1, ..., n, of i.i.d. realizations of (Y, X), denote
- y = (y1, ..., yn)ᵀ the response vector,
- xj = (xj1, ..., xjn)ᵀ the vector of data for the jth predictor,
- X the n × p design matrix whose jth column is xj,
- D = {i : (yi, xi) ∈ training set},
- T = {i : (yi, xi) ∈ test set}.

Page 3: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Generalized linear models

Suppose Y depends linearly on X through a function g:

E(Y) = g(Xβ*).

We predict a response yi by ŷi = g(xiβ̂) for any i ∈ T, where

β̂ = arg max_β ℓ_D(β) = arg min_β Σ_{i∈D} Lg(yi, xiβ),

and Lg is a loss function depending on the function g. Typically,

- if Y is Gaussian and g = Id (OLS),

  Lg(y, xβ) = (y − xβ)²,

- if Y is binary and g : t ↦ (1 + e^(−t))^(−1) (logistic regression),

  Lg(y, xβ) = −(y · xβ − log(1 + e^(xβ))),

or any negative log-likelihood ℓ of an exponential family distribution.
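The two losses above are easy to check numerically. Below is a minimal Python sketch (the function names are ours, not from the scoop package); eta stands for the linear predictor xβ.

```python
import math

def squared_loss(y, eta):
    # Gaussian case, g = Id: Lg(y, xb) = (y - xb)^2
    return (y - eta) ** 2

def logistic_loss(y, eta):
    # Binary case, g(t) = 1 / (1 + exp(-t)):
    # Lg(y, xb) = -(y * xb - log(1 + exp(xb)))
    return -(y * eta - math.log(1.0 + math.exp(eta)))
```

For instance, with y = 1 and a null linear predictor, the logistic loss is log 2, the deviance of a coin flip.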


Page 5: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Estimation and selection at the group level

1. Structure: the index set I = {1, ..., p} splits into a known partition

   I = ⋃_{k=1}^K Gk, with Gk ∩ Gℓ = ∅ for k ≠ ℓ.

2. Sparsity: the support S of β* has few entries:

   S = {i : β*_i ≠ 0}, with |S| ≪ p.

The group-Lasso estimator (Grandvalet and Canu '98, Bakin '99, Yuan and Lin '06)

β̂group = arg min_{β ∈ Rp} −ℓ_D(β) + λ Σ_{k=1}^K wk ‖β_Gk‖,

where
- λ ≥ 0 controls the overall amount of penalty,
- wk > 0 adapts the penalty between groups (dropped hereafter).
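The group penalty above is straightforward to compute. Here is a minimal Python sketch (helper names are ours), with groups given as lists of 0-based indices and ‖·‖ the Euclidean norm:

```python
import math

def group_lasso_penalty(beta, groups, weights=None):
    """Sum over groups of w_k * ||beta_Gk||, the weighted group-Lasso penalty."""
    if weights is None:
        weights = [1.0] * len(groups)  # w_k dropped, as on the slide
    total = 0.0
    for w, g in zip(weights, groups):
        total += w * math.sqrt(sum(beta[j] ** 2 for j in g))
    return total
```

With beta = (3, 4, 0, 0) and groups {1, 2} and {3, 4}, the penalty is 5: the second group is entirely null and contributes nothing.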


Page 7: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Toy example: the prostate dataset

Examines the correlation between the prostate-specific antigen and 8 clinical measures for 97 patients.

Figure: Lasso regularization path (coefficients vs. λ, log scale).

Variables:
- lcavol: log(cancer volume)
- lweight: log(prostate weight)
- age: age
- lbph: log(benign prostatic hyperplasia amount)
- svi: seminal vesicle invasion
- lcp: log(capsular penetration)
- gleason: Gleason score
- pgg45: percentage of Gleason scores 4 or 5

Page 8: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Toy example: the prostate dataset

Examines the correlation between the prostate-specific antigen and 8 clinical measures for 97 patients.

Figure: hierarchical clustering of the 8 clinical variables (dendrogram); variable abbreviations as above.

Page 9: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Toy example: the prostate dataset

Examines the correlation between the prostate-specific antigen and 8 clinical measures for 97 patients.

Figure: group-Lasso regularization path (coefficients vs. λ, log scale); variable abbreviations as above.


Page 11: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Application to splice site detection

Predict splice site status (0/1) from a sequence of 7 bases and their interactions.

Figure: information content per position of the sequence.

- order 0: 7 factors with 4 levels,
- order 1: C(7,2) factors with 4² levels,
- order 2: C(7,3) factors with 4³ levels,
- using dummy coding for the factors, we form groups.

L. Meier, S. van de Geer, P. Bühlmann, 2008. The group Lasso for logistic regression, JRSS Series B.
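The group counts above can be checked in a couple of lines of Python (a side computation of ours, not part of the original slides; we count one dummy column per level, as the slide's group sizes suggest):

```python
from math import comb

# order 0: 7 positions, each a factor with 4 levels (one per base)
# order 1: C(7,2) pairwise interactions, 4^2 levels each
# order 2: C(7,3) triple interactions, 4^3 levels each
n_groups = 7 + comb(7, 2) + comb(7, 3)
n_dummies = 7 * 4 + comb(7, 2) * 4 ** 2 + comb(7, 3) * 4 ** 3
```

This gives 7 + 21 + 35 = 63 groups, so the grouped penalty operates on a few dozen groups rather than a few thousand individual dummy variables.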

Page 12: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Application to splice site detection

Predict splice site status (0/1) from a sequence of 7 bases and their interactions.

Figure: regularization paths vs. λ (log scale), with selected groups labeled (g18, g5, g4, g44, g54, g42, g49, g45, g61) and colored by interaction order (order 0, 1, 2); grouping as above.

Page 13: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Group-Lasso limitations

1. Not a single zero should belong to a group with non-zeros.
   - Strong group sparsity (Huang and Zhang, '10, arXiv) establishes the conditions under which the group-Lasso outperforms the Lasso, and conversely.

2. No sign-coherence within groups.
   - Required if groups gather consonant variables, e.g., groups defined by clusters of positively correlated variables.

The cooperative-Lasso

A penalty which assumes a sign-coherent group structure, that is to say, groups which gather either
- non-positive,
- non-negative,
- or null parameters.


Page 15: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Motivation: multiple network inference

Figure: three experiments, each followed by an inference step yielding one network per experiment.

A group is a set of corresponding edges across tasks (e.g., the red or blue ones): sign-coherence matters!

J. Chiquet, Y. Grandvalet, C. Ambroise, 2010. Inferring multiple graphical structures, Statistics and Computing.

Page 16: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Motivation: joint segmentation of aCGH profiles

Figure: log-ratio (CNVs) vs. position on the chromosome, for one profile.

For a single profile, solve

minimize_{β ∈ Rp} ‖β − y‖², s.t. Σ_{i=2}^p |β_i − β_{i−1}| < s,

where
- y is a vector in Rp (one observed profile),
- β is a vector in Rp (its piecewise-constant fit).

Sign-coherence may avoid inconsistent variations across profiles.

K. Bleakley and J.-P. Vert, 2010. Joint segmentation of many aCGH profiles using fast group LARS, NIPS.

Page 17: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Motivation: joint segmentation of aCGH profiles

Figure: log-ratio (CNVs) vs. position on the chromosome, for n profiles jointly.

minimize_{β ∈ R^{n×p}} ‖β − Y‖², s.t. Σ_{i=2}^p ‖β_i − β_{i−1}‖ < s,

where
- Y is an n × p matrix of n profiles of size p,
- β_i is a size-n vector with the ith probes of the n profiles,
- a group gathers every position i across profiles.

Sign-coherence may avoid inconsistent variations across profiles.

K. Bleakley and J.-P. Vert, 2010. Joint segmentation of many aCGH profiles using fast group LARS, NIPS.
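The constraint above sums the Euclidean norms of successive differences across profiles. A minimal Python sketch of that penalty (our naming), with the n profiles stored row-wise:

```python
import math

def joint_tv_penalty(B):
    """Sum over positions i >= 2 of ||B[:, i] - B[:, i-1]||, where B is an
    n x p list of lists (n profiles of length p); each column difference
    is one group, so jumps shared across profiles are counted once."""
    n, p = len(B), len(B[0])
    total = 0.0
    for i in range(1, p):
        total += math.sqrt(sum((B[r][i] - B[r][i - 1]) ** 2 for r in range(n)))
    return total
```

With two profiles jumping by 3 and 4 at the same position, the penalty is a single group norm of 5, instead of 3 + 4 = 7 for separate per-profile penalties.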


Page 23: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Outline

Definition

Resolution

Consistency

Model selection

Simulation studies

Sibling probe sets and gene selection



Page 25: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

The cooperative-Lasso estimator

Definition

β̂coop = arg min_{β ∈ Rp} J(β), with J(β) = −ℓ_D(β) + λ ‖β‖coop,

where, for any v ∈ Rp,

‖v‖coop = ‖v⁺‖group + ‖v⁻‖group = Σ_{k=1}^K ‖v⁺_Gk‖ + ‖v⁻_Gk‖,

and
- v⁺ = (v⁺_1, ..., v⁺_p), with v⁺_j = max(0, vj),
- v⁻ = (v⁻_1, ..., v⁻_p), with v⁻_j = max(0, −vj).
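A minimal Python sketch of the coop-norm above (function name is ours). A sign-coherent group contributes exactly its group-Lasso norm, while a group mixing signs pays for both its positive and negative parts:

```python
import math

def coop_norm(v, groups):
    """||v||_coop = sum over groups of ||v+_Gk|| + ||v-_Gk||, with
    v+_j = max(0, v_j), v-_j = max(0, -v_j), ||.|| the Euclidean norm."""
    total = 0.0
    for g in groups:
        pos = math.sqrt(sum(max(0.0, v[j]) ** 2 for j in g))
        neg = math.sqrt(sum(max(0.0, -v[j]) ** 2 for j in g))
        total += pos + neg
    return total
```

For the single group v = (1, −1), the coop-norm is 1 + 1 = 2, whereas the group-Lasso norm is only √2: mixed signs are penalized more.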

Page 26: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

A geometric view of sparsity

Figure: the likelihood ℓ(β1, β2) over the (β1, β2) plane.

minimize_{β1,β2} −ℓ(β1, β2) + λ Ω(β1, β2)

⇔

maximize_{β1,β2} ℓ(β1, β2) s.t. Ω(β1, β2) ≤ c


Page 28: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Ball crafting: group-Lasso

Admissible set
- β = (β1, β2, β3, β4)ᵀ,
- G1 = {1, 2}, G2 = {3, 4}.

Unit ball: ‖β‖group ≤ 1.

Figure: cross-sections of the unit ball in the (β1, β3) plane, for β2 ∈ {0, 0.3} and β4 ∈ {0, 0.3}.


Page 32: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Ball crafting: cooperative-Lasso

Admissible set
- β = (β1, β2, β3, β4)ᵀ,
- G1 = {1, 2}, G2 = {3, 4}.

Unit ball: ‖β‖coop ≤ 1.

Figure: cross-sections of the unit ball in the (β1, β3) plane, for β2 ∈ {0, 0.3} and β4 ∈ {0, 0.3}.




Page 41: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Convex analysis: supporting hyperplane

A hyperplane supports a set iff
- the set is contained in one half-space,
- the set has at least one point on the hyperplane.

Figure: a convex set in the (β1, β2) plane with supporting hyperplanes at several boundary points.

There are supporting hyperplanes at all points of convex sets: they generalize tangents.

Page 42: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Convex analysis: dual cone and subgradient

Generalizes normals.

Figure: a convex function over (β1, β2) with supporting hyperplanes at several points.

g is a subgradient at x
⇔
the vector (g, −1) is normal to the supporting hyperplane at this point.

The subdifferential at x is the set of all subgradients at x.


Page 46: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Optimality conditions

Theorem. A necessary and sufficient condition for the optimality of β̂ is that the null vector 0 belongs to the subdifferential of the convex function J:

0 ∈ ∂J(β̂) = {v ∈ Rp : v = −∇ℓ(β̂) + λθ},

where θ ∈ Rp belongs to the subdifferential of the coop-norm. Define

ϕ_j(v) = (sign(v_j) v)⁺;

then θ is such that

∀k ∈ {1, ..., K}, ∀j ∈ S_k(β̂),  θ_j = β̂_j / ‖ϕ_j(β̂_Gk)‖,

∀k ∈ {1, ..., K}, ∀j ∈ Sᶜ_k(β̂),  ‖ϕ_j(θ_Gk)‖ ≤ 1.

We derive a subset algorithm to solve this problem (which you can enjoy in the paper and the package).


Page 49: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Linear regression with orthonormal design

Consider

β̂ = arg min_β { (1/2) ‖y − Xβ‖² + λ Ω(β) },

with XᵀX = I. Hence (x_j)ᵀ(Xβ − y) = β_j − β̂ols_j, and, up to a constant,

β̂ = arg min_β { (1/2) ‖β − β̂ols‖² + λ Ω(β) }.

We may find a closed form of β̂ for, e.g.,

1. Ω(β) = ‖β‖lasso (the ℓ1 norm),
2. Ω(β) = ‖β‖group,
3. Ω(β) = ‖β‖coop.


Page 51: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Linear regression with orthonormal design

∀j ∈ {1, ..., p},

β̂lasso_j = (1 − λ/|β̂ols_j|)₊ β̂ols_j,  so that  |β̂lasso_j| = (|β̂ols_j| − λ)₊.

Fig.: Lasso as a function of the OLS coefficients.

Page 52: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Linear regression with orthonormal design

∀k ∈ {1, ..., K}, ∀j ∈ Gk,

β̂group_j = (1 − λ/‖β̂ols_Gk‖)₊ β̂ols_j,  so that  ‖β̂group_Gk‖ = (‖β̂ols_Gk‖ − λ)₊.

Fig.: Group-Lasso as a function of the OLS coefficients.

Page 53: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Linear regression with orthonormal design

∀k ∈ {1, ..., K}, ∀j ∈ Gk,

β̂coop_j = (1 − λ/‖ϕ_j(β̂ols_Gk)‖)₊ β̂ols_j,  so that  ‖ϕ_j(β̂coop_Gk)‖ = (‖ϕ_j(β̂ols_Gk)‖ − λ)₊.

Fig.: Coop-Lasso as a function of the OLS coefficients.
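The coop-Lasso closed form above can be sketched in a few lines of Python (our naming; a sketch under orthonormal design, not the scoop implementation). Each coefficient is shrunk by its own sign-dependent group norm ‖ϕ_j(β̂ols_Gk)‖:

```python
import math

def _norm(x):
    return math.sqrt(sum(t * t for t in x))

def phi(v, j):
    # phi_j(v) = (sign(v_j) * v)+, applied componentwise; sign(0) taken as +1
    s = 1.0 if v[j] >= 0 else -1.0
    return [max(0.0, s * t) for t in v]

def coop_threshold(b_ols, groups, lam):
    """Coop-Lasso under orthonormal design:
    b_j = (1 - lam / ||phi_j(b_ols_Gk)||)+ * b_ols_j."""
    b = [0.0] * len(b_ols)
    for g in groups:
        sub = [b_ols[j] for j in g]
        for pos, j in enumerate(g):
            nrm = _norm(phi(sub, pos))
            shrink = max(0.0, 1.0 - lam / nrm) if nrm > 0 else 0.0
            b[j] = shrink * b_ols[j]
    return b
```

With b_ols = (3, 4, −1) in one group and λ = 1, the two positive coefficients are shrunk by their shared norm 5, while the lone negative coefficient sees only its own norm 1 and is set to zero.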


Page 55: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Linear regression setup: technical assumptions

(A1) X and Y have finite fourth-order moments:

E‖X‖⁴ < ∞, E|Y|⁴ < ∞,

(A2) the covariance matrix Ψ = E[XXᵀ] ∈ R^{p×p} is invertible,

(A3) for every k = 1, ..., K: if ‖(β*_Gk)⁺‖ > 0 and ‖(β*_Gk)⁻‖ > 0, then β*_j ≠ 0 for every j ∈ Gk
(all sign-coherent groups are either included in or excluded from the true support).

Page 56: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Irrepresentability condition

Define S_k = S ∩ Gk, the support within group k, and, for j ∈ Gk,

[D(β)]_jj = ‖[sign(β_j) β_Gk]⁺‖⁻¹.

Assume there exists η > 0 such that:

(A4) for every group Gk including at least one null coefficient,

max( ‖(Ψ_{Sᶜk,S} Ψ_{SS}⁻¹ D(β*_S) β*_S)⁺‖, ‖(Ψ_{Sᶜk,S} Ψ_{SS}⁻¹ D(β*_S) β*_S)⁻‖ ) ≤ 1 − η,

(A5) for every group Gk intersecting the support and including either positive or negative coefficients, let ν_k be the sign of these coefficients (ν_k = 1 if ‖(β*_Gk)⁺‖ > 0 and ν_k = −1 if ‖(β*_Gk)⁻‖ > 0):

ν_k Ψ_{Sᶜk,S} Ψ_{SS}⁻¹ D(β*_S) β*_S ⪯ 0,

where ⪯ denotes componentwise inequality.

Page 57: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Consistency results

Theorem. If assumptions (A1–5) are satisfied for some η > 0, then for every sequence λ_n = λ_0 n^(−γ) with γ ∈ (0, 1/2),

β̂coop → β* in probability, and P(S(β̂coop) = S) → 1.

Asymptotically, the cooperative-Lasso is unbiased and enjoys exact support recovery (even when there are irrelevant variables within a group).

Page 58: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Sketch of the proof

1. Construct an artificial estimator β̃_S restricted to the true support S and extend it with 0 coefficients on Sᶜ.
2. Consider the event E_n on which β̃ satisfies the original optimality conditions. On E_n, β̃_S = β̂coop_S and β̂coop_{Sᶜ} = 0, by uniqueness.
3. We need to prove that lim_{n→∞} P(E_n) = 1.
4. Derive the asymptotic distribution of the derivative of the loss function, Xᵀ(y − Xβ), from
   - the CLT on second-order moments,
   - the optimality conditions on β̃_S,
   - the right choice of λ_n, which provides convergence in probability.
5. Assumptions (A4–5) state that the limits in probability satisfy the optimality constraints with strict inequalities.
6. As a result, the optimality conditions are satisfied (with non-strict inequalities) with probability tending to 1.



Page 63: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Illustration

Generate data y = Xβ* + σε, with
- β* = (1, 1, −1, −1, 0, 0, 0, 0),
- groups G1 = {1, 2, 3, 4}, G2 = {5, 6, 7, 8},
- σ = 0.1, R² ≈ 0.99, n = 20,
- the irrepresentability condition holds for the coop-Lasso but does not hold for the group-Lasso,
- results averaged over 100 simulations.

Fig.: group-Lasso, 50% coverage intervals (upper/lower quartiles) of the coefficient paths vs. log10(λ).

Page 64: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Illustration

Same simulation setting as above.

Fig.: coop-Lasso, 50% coverage intervals (upper/lower quartiles) of the coefficient paths vs. log10(λ).


Page 66: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Optimism of the training error

I The training error:
\[ \overline{\mathrm{err}} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} L(y_i, \mathbf{x}_i \hat\beta). \]

I The test error ("extra-sample" error):
\[ \mathrm{Err}_{\mathrm{ex}} = \mathbb{E}_{X,Y} \left[ L(Y, X\hat\beta) \mid \mathcal{D} \right]. \]

I The "in-sample" error:
\[ \mathrm{Err}_{\mathrm{in}} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \mathbb{E}_{Y} \left[ L(Y_i, \mathbf{x}_i \hat\beta) \mid \mathcal{D} \right]. \]

Definition (Optimism)
\[ \mathrm{Err}_{\mathrm{in}} = \overline{\mathrm{err}} + \text{"optimism"}. \]

cooperative-Lasso 28


Page 68: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Cp statistics

For squared-error loss (and some other losses),
\[ \mathrm{Err}_{\mathrm{in}} = \overline{\mathrm{err}} + \frac{2}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \mathrm{cov}(\hat y_i, y_i). \]

The amount by which err underestimates the true error depends on how strongly y_i affects its own prediction. The harder we fit the data, the greater the covariance will be, thereby increasing the optimism (ESLII, 5th printing).

Mallows' Cp Statistic

For a linear regression fit ŷ_i with p inputs, ∑_{i∈D} cov(ŷ_i, y_i) = pσ², hence:
\[ C_p = \overline{\mathrm{err}} + 2 \, \frac{\mathrm{df}}{|\mathcal{D}|} \, \hat\sigma^2, \quad \text{with } \mathrm{df} = p. \]

cooperative-Lasso 29
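The OLS case of the slide can be checked numerically: since ŷ = Hy with H the hat matrix, the covariance term equals σ² tr(H) = pσ². A minimal sketch (simulated data, σ assumed known):

```python
import numpy as np

# Sketch of the Cp ingredients for OLS: y_hat = H y with H the hat matrix,
# so sum_i cov(y_hat_i, y_i) = sigma^2 * tr(H) = p * sigma^2, i.e. df = p.
rng = np.random.default_rng(1)
n, p, sigma = 50, 5, 1.0

X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + sigma * rng.standard_normal(n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix of the OLS fit
df = np.trace(H)                        # equals p up to rounding error

err = np.mean((y - H @ y) ** 2)         # training error
Cp = err + 2.0 * df / n * sigma**2      # Mallows' Cp with known sigma^2
print(round(df), Cp > err)
```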


Page 70: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Generalized degrees of freedom

Let ŷ(λ) = Xβ̂(λ) be the predicted values for a penalized estimator.

Proposition (Efron ('04) + Stein's Lemma ('81))
\[ \mathrm{df}(\lambda) \doteq \frac{1}{\sigma^2} \sum_{i \in \mathcal{D}} \mathrm{cov}(\hat y_i(\lambda), y_i) = \mathbb{E}_{\mathbf{y}} \left[ \mathrm{tr} \, \frac{\partial \hat{\mathbf{y}}_\lambda}{\partial \mathbf{y}} \right]. \]

For the Lasso, Zou et al. ('07) show that
\[ \widehat{\mathrm{df}}^{\,\mathrm{lasso}}(\lambda) = \left\| \hat\beta^{\mathrm{lasso}}(\lambda) \right\|_0. \]

Assuming XᵀX = I, Yuan and Lin ('06) show for the group-Lasso that the trace term equals
\[ \widehat{\mathrm{df}}^{\,\mathrm{group}}(\lambda) = \sum_{k=1}^{K} \mathbf{1}\!\left( \left\| \hat\beta^{\mathrm{group}}_{\mathcal{G}_k}(\lambda) \right\| > 0 \right) \left( 1 + \frac{\left\| \hat\beta^{\mathrm{group}}_{\mathcal{G}_k}(\lambda) \right\|}{\left\| \hat\beta^{\mathrm{ols}}_{\mathcal{G}_k} \right\|} \, (p_k - 1) \right). \]

cooperative-Lasso 30
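Under the orthonormal-design assumption of the slide, both formulas are easy to evaluate, since the lasso is then a coordinatewise soft-thresholding of the OLS fit and the group-Lasso a groupwise one. A hedged sketch (the group sizes and λ below are illustrative choices, not from the slide):

```python
import numpy as np

# Sketch under an orthonormal design (X^T X = I): df_lasso counts non-zeros
# (Zou et al. '07) and df_group follows Yuan & Lin's expression above.
rng = np.random.default_rng(2)
n, p, lam = 40, 8, 0.5

X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
beta = np.array([2.0, 1.5, 0.1, 0.0, -1.0, -0.8, 0.05, 0.0])
y = X @ beta + 0.3 * rng.standard_normal(n)

b_ols = X.T @ y                                    # OLS when X^T X = I
b_lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)
df_lasso = np.count_nonzero(b_lasso)

# group-Lasso: groupwise soft-thresholding of the OLS group norm
df_group = 0.0
for g in (slice(0, 4), slice(4, 8)):               # two groups of size 4
    norm_ols = np.linalg.norm(b_ols[g])
    norm_grp = max(norm_ols - lam, 0.0)            # shrunken group norm
    if norm_grp > 0.0:
        df_group += 1 + (norm_grp / norm_ols) * (4 - 1)

print(df_lasso, round(df_group, 2))
```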


Page 73: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Approximated degrees of freedom for the coop-Lasso

Proposition

Assuming that the data are generated according to a linear regression model and that X is orthonormal, the following expression of df^coop(λ) is an unbiased estimate of df(λ):
\[ \widehat{\mathrm{df}}^{\,\mathrm{coop}}(\lambda) = \sum_{k=1}^{K} \left[ \mathbf{1}_{\left\| (\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))_+ \right\| > 0} \left( 1 + (p_{k+} - 1) \frac{\left\| (\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))_+ \right\|}{\left\| (\hat\beta^{\mathrm{ols}}_{\mathcal{G}_k})_+ \right\|} \right) + \mathbf{1}_{\left\| (\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))_- \right\| > 0} \left( 1 + (p_{k-} - 1) \frac{\left\| (\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))_- \right\|}{\left\| (\hat\beta^{\mathrm{ols}}_{\mathcal{G}_k})_- \right\|} \right) \right], \]

where p_{k+} and p_{k-} are respectively the numbers of positive and negative entries in β̂^ols_{Gk}.

cooperative-Lasso 31

Page 74: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Approximated degrees of freedom for the coop-Lasso

Proposition

Assuming that the data are generated according to a linear regression model and that X is orthonormal, the following expression of df^coop(λ) is an unbiased estimate of df(λ):
\[ \widehat{\mathrm{df}}^{\,\mathrm{coop}}(\lambda) = \sum_{k=1}^{K} \left[ \mathbf{1}_{\left\| (\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))_+ \right\| > 0} \left( 1 + \frac{p_{k+} - 1}{1 + \gamma} \, \frac{\left\| (\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))_+ \right\|}{\left\| (\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k}(\gamma))_+ \right\|} \right) + \mathbf{1}_{\left\| (\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))_- \right\| > 0} \left( 1 + \frac{p_{k-} - 1}{1 + \gamma} \, \frac{\left\| (\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))_- \right\|}{\left\| (\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k}(\gamma))_- \right\|} \right) \right], \]

where p_{k+} and p_{k-} are respectively the numbers of positive and negative entries in β̂^ridge_{Gk}(γ).

cooperative-Lasso 31
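A hedged sketch evaluating the OLS-normalised version of this df expression (the coop-Lasso and OLS group estimates below are illustrative inputs chosen by hand, not a fitted model; it also assumes the coop estimate keeps the OLS sign pattern, as shrinkage does):

```python
import numpy as np

# Evaluates, per group and per sign, the OLS-normalised df expression:
# an indicator term plus (p_k(+/-) - 1) times a ratio of part norms.
def df_coop(beta_coop, beta_ols, groups):
    df = 0.0
    for g in groups:
        b, b0 = beta_coop[g], beta_ols[g]
        for s in (1.0, -1.0):                      # positive, then negative part
            part = np.maximum(s * b, 0.0)
            part0 = np.maximum(s * b0, 0.0)
            pk_part = np.count_nonzero(part0)      # p_k+ or p_k- (from OLS)
            if np.linalg.norm(part) > 0.0:
                df += 1 + (pk_part - 1) * np.linalg.norm(part) / np.linalg.norm(part0)
    return df

groups = [np.arange(0, 4), np.arange(4, 8)]
beta_ols = np.array([1.2, 0.8, -0.5, -0.9, 0.3, 0.1, -0.2, 0.0])
beta_coop = np.array([0.9, 0.5, -0.2, -0.6, 0.0, 0.0, 0.0, 0.0])
print(round(df_coop(beta_coop, beta_ols, groups), 3))
```

Here the second group is fully shrunk to zero, so it contributes no degrees of freedom at all, while the first group contributes through both its positive and its negative part.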

Page 75: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Approximated information criteria

Following Zou et al., we extend the Cp statistic to an "approximate" AIC:
\[ \mathrm{AIC}(\lambda) = \frac{\| \mathbf{y} - \hat{\mathbf{y}}(\lambda) \|^2}{\hat\sigma^2} + 2 \, \widehat{\mathrm{df}}(\lambda), \]

and from the AIC, it is a small step to the BIC:
\[ \mathrm{BIC}(\lambda) = \frac{\| \mathbf{y} - \hat{\mathbf{y}}(\lambda) \|^2}{\hat\sigma^2} + \log(n) \, \widehat{\mathrm{df}}(\lambda). \]

I K-fold cross-validation works well but is computationally intensive.

I It is required when we do not meet the linear regression setup. . .

cooperative-Lasso 32
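A minimal sketch of BIC-based tuning along a lasso path, under an orthonormal design so that the path is a simple soft-thresholding of the OLS fit and df counts non-zeros (the λ grid and the data are illustrative assumptions):

```python
import numpy as np

# BIC(lambda) = RSS/sigma^2 + log(n) * df(lambda), minimised over a grid.
rng = np.random.default_rng(3)
n, p, sigma = 60, 10, 1.0

X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal design
beta = np.concatenate([np.full(3, 3.0), np.zeros(p - 3)])
y = X @ beta + sigma * rng.standard_normal(n)
b_ols = X.T @ y

lambdas = np.linspace(0.01, 4.0, 50)
bic = np.empty_like(lambdas)
for i, lam in enumerate(lambdas):
    b = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)
    rss = np.sum((y - X @ b) ** 2)
    bic[i] = rss / sigma**2 + np.log(n) * np.count_nonzero(b)

lam_bic = lambdas[np.argmin(bic)]
print(round(lam_bic, 2))
```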

Page 76: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Outline

Definition

Resolution

Consistency

Model selection

Simulation studies

Sibling probe sets and gene selection

cooperative-Lasso 33

Page 77: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Revisiting Elastic-Net experiments (1)

[Figure: boxplots of test MSE for lasso, enet, group, coop]

Generate data y = Xβ⋆ + σε,

I \( \beta^\star = (\underbrace{0,\dots,0}_{10}, \underbrace{2,\dots,2}_{10}, \underbrace{0,\dots,0}_{10}, \underbrace{2,\dots,2}_{10}) \)
I G1 = {1, …, 10}, G2 = {11, …, 20}, G3 = {21, …, 30}, G4 = {31, …, 40},
I σ = 15, corr(xi, xj) = 0.5,
I training/validation/test = 100/100/400,
I average over 100 simulations.

cooperative-Lasso 34

Page 78: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Revisiting Elastic-Net experiments (2)

[Figure: boxplots of test MSE for lasso, enet, group, coop]

Generate data y = Xβ⋆ + σε,

I \( \beta^\star = (\underbrace{3,\dots,3}_{15}, \underbrace{0,\dots,0}_{25}) \)
I σ = 15,
I G1 = {1, …, 5}, G2 = {6, …, 10}, G3 = {11, …, 15}, G4 = {16, …, 40},
I xj = Z1 + ε, Z1 ∼ N(0, 1), ∀j ∈ G1
I xj = Z2 + ε, Z2 ∼ N(0, 1), ∀j ∈ G2
I xj = Z3 + ε, Z3 ∼ N(0, 1), ∀j ∈ G3
I xj ∼ N(0, 1), ∀j ∈ G4,
I training/validation/test = 50/50/400,
I average over 100 simulations.

cooperative-Lasso 35
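This latent-factor design can be sketched as follows (the within-group noise level 0.1 is an assumption, the slide does not give the variance of ε):

```python
import numpy as np

# Sketch of this grouped design: groups G1-G3 each share a latent factor
# Z_k, so their five predictors are strongly correlated; G4 is pure noise.
rng = np.random.default_rng(4)
n = 50

Z = rng.standard_normal((n, 3))                     # Z_1, Z_2, Z_3
X_grouped = [Z[:, [k]] + 0.1 * rng.standard_normal((n, 5)) for k in range(3)]
X = np.hstack(X_grouped + [rng.standard_normal((n, 25))])

beta = np.concatenate([np.full(15, 3.0), np.zeros(25)])
y = X @ beta + 15.0 * rng.standard_normal(n)

corr = np.corrcoef(X[:, :5], rowvar=False)          # within-G1 correlations
print(X.shape, round(corr[0, 1], 2))
```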

Page 79: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupSimulations setting

A wave-like vector of parameters β?

I p = 90 variables partitioned into K = 10 groups of size pk = 9,

I 3 (partially) active groups, 6 groups of zeros,

I in active groups, β?j ∝ (h− |5− j|) with h = 1, . . . , 5.

0 20 40 60 80

Figure: β? with h = 1, |Sk| = 1 non-zero coefficients in each active group.

cooperative-Lasso 36
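The wave-shaped coefficient vector can be sketched as follows (which three of the ten groups are active is not specified on the slide, so the first three are an assumption, and the overall magnitude rescaling used to reach the target R² is omitted):

```python
import numpy as np

# beta*_j proportional to (h - |5 - j|)_+ at within-group positions
# j = 1..9, so h = 1..5 yields 1, 3, 5, 7, 9 non-zeros per active group.
def wave_beta(h, n_groups=10, group_size=9, n_active=3):
    j = np.arange(1, group_size + 1)
    wave = np.maximum(h - np.abs(5 - j), 0).astype(float)
    beta = np.zeros(n_groups * group_size)
    for k in range(n_active):                 # assume the first 3 groups are active
        beta[k * group_size:(k + 1) * group_size] = wave
    return beta

for h in range(1, 6):
    print(h, np.count_nonzero(wave_beta(h)[:9]))   # 1->1, 2->3, 3->5, 4->7, 5->9
```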

Page 80: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupSimulations setting

A wave-like vector of parameters β?

I p = 90 variables partitioned into K = 10 groups of size pk = 9,

I 3 (partially) active groups, 6 groups of zeros,

I in active groups, β?j ∝ (h− |5− j|) with h = 1, . . . , 5.

0 20 40 60 80

Figure: β? with h = 2, |Sk| = 3 non-zero coefficients in each active group.

cooperative-Lasso 36

Page 81: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupSimulations setting

A wave-like vector of parameters β?

I p = 90 variables partitioned into K = 10 groups of size pk = 9,

I 3 (partially) active groups, 6 groups of zeros,

I in active groups, β?j ∝ (h− |5− j|) with h = 1, . . . , 5.

0 20 40 60 80

Figure: β? with h = 3, |Sk| = 5 non-zero coefficients in each active group.

cooperative-Lasso 36

Page 82: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupSimulations setting

A wave-like vector of parameters β?

I p = 90 variables partitioned into K = 10 groups of size pk = 9,

I 3 (partially) active groups, 6 groups of zeros,

I in active groups, β?j ∝ (h− |5− j|) with h = 1, . . . , 5.

0 20 40 60 80

Figure: β? with h = 4, |Sk| = 7 non-zero coefficients in each active group.

cooperative-Lasso 36

Page 83: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupSimulations setting

A wave-like vector of parameters β?

I p = 90 variables partitioned into K = 10 groups of size pk = 9,

I 3 (partially) active groups, 6 groups of zeros,

I in active groups, β?j ∝ (h− |5− j|) with h = 1, . . . , 5.

0 20 40 60 80

Figure: β? with h = 5, |Sk| = 9 non-zero coefficients in each active group.

cooperative-Lasso 36

Page 84: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupExample of path of solution and signal recovery with BIC choice

The signal strength is generated so as

I y = Xβ? + σε, with σ = 1, n = 30 to 500,

I X ∼ N (0,Ψ) with Ψij = ρ|i−j| (ρ = 0.4 in the example),

I magnitude in β chosen so as R2 ≈ 0.75.

Remark

Covariance structure is purposely disconnected from the group structure.

None of the support recovery conditions are fulfilled.

cooperative-Lasso 37
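Sampling rows from N(0, Ψ) with the AR(1)-type covariance Ψij = ρ^|i−j| can be sketched with a Cholesky factor (n = 120 here only to mirror the one-shot example):

```python
import numpy as np

# Rows of X drawn from N(0, Psi) with Psi_ij = rho^|i-j|, via Cholesky.
rng = np.random.default_rng(5)
n, p, rho = 120, 90, 0.4

idx = np.arange(p)
Psi = rho ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz covariance
L = np.linalg.cholesky(Psi)                        # Psi = L L^T
X = rng.standard_normal((n, p)) @ L.T              # rows ~ N(0, Psi)

emp = np.corrcoef(X, rowvar=False)
print(round(emp[0, 1], 2))   # around rho = 0.4
```

The covariance is indexed by variable position, not by group, which is exactly the "disconnected from the group structure" remark above.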

Page 85: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupExample of path of solution and signal recovery with BIC choice

The signal strength is generated so as

I y = Xβ? + σε, with σ = 1, n = 30 to 500,

I X ∼ N (0,Ψ) with Ψij = ρ|i−j| (ρ = 0.4 in the example),

I magnitude in β chosen so as R2 ≈ 0.75.

One shot sample with n = 120

cooperative-Lasso 37

Page 86: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupExample of path of solution and signal recovery with BIC choice

The signal strength is generated so as

I y = Xβ? + σε, with σ = 1, n = 30 to 500,

I X ∼ N (0,Ψ) with Ψij = ρ|i−j| (ρ = 0.4 in the example),

I magnitude in β chosen so as R2 ≈ 0.75.

-0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0

-0.2

0.0

0.2

0.4

0.6

log10(λ)

βlasso

0 20 40 60 80

-0.1

0.0

0.1

0.2

0.3

0.4

0.5

i

βlasso

True signal

Estimated signal

Figure: Lassocooperative-Lasso 37

Page 87: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupExample of path of solution and signal recovery with BIC choice

The signal strength is generated so as

I y = Xβ? + σε, with σ = 1, n = 30 to 500,

I X ∼ N (0,Ψ) with Ψij = ρ|i−j| (ρ = 0.4 in the example),

I magnitude in β chosen so as R2 ≈ 0.75.

-0.4 -0.2 0.0 0.2 0.4 0.6 0.8

-0.1

0.0

0.1

0.2

0.3

0.4

0.5

log10(λ)

βgroup

0 20 40 60 80

-0.1

0.0

0.1

0.2

0.3

0.4

0.5

i

βgroup

True signal

Estimated signal

Figure: Group-Lassocooperative-Lasso 37

Page 88: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupExample of path of solution and signal recovery with BIC choice

The signal strength is generated so as

I y = Xβ? + σε, with σ = 1, n = 30 to 500,

I X ∼ N (0,Ψ) with Ψij = ρ|i−j| (ρ = 0.4 in the example),

I magnitude in β chosen so as R2 ≈ 0.75.

-0.4 -0.2 0.0 0.2 0.4 0.6 0.8

-0.1

0.0

0.1

0.2

0.3

0.4

0.5

log10(λ)

βco

op

0 20 40 60 80

-0.1

0.0

0.1

0.2

0.3

0.4

0.5

i

βco

op

True signal

Estimated signal

Figure: Coop-Lassocooperative-Lasso 37

Page 89: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupErrors as a function of the sample size n

pred

icti

on

erro

r

100 200 300 400 500

0.0

0.2

0.4

0.6

0.8

1.0

1.2

sig

ner

ror

100 200 300 400 500

0.00

0.05

0.10

0.15

0.20

0.25

0.30

n n

Figure: h = 3, |Sk| = 5 (favoring Lasso).

lasso group coop

cooperative-Lasso 38

Page 90: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupErrors as a function of the sample size n

pred

icti

on

erro

r

100 200 300 400 500

0.0

0.2

0.4

0.6

0.8

1.0

1.2

sig

ner

ror

100 200 300 400 500

0.00

0.05

0.10

0.15

0.20

0.25

0.30

n n

Figure: h = 4, |Sk| = 7 (intermediate).

lasso group coop

cooperative-Lasso 38

Page 91: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Breiman’s setupErrors as a function of the sample size n

pred

icti

on

erro

r

100 200 300 400 500

0.0

0.2

0.4

0.6

0.8

1.0

1.2

sig

ner

ror

100 200 300 400 500

0.00

0.05

0.10

0.15

0.20

0.25

0.30

n n

Figure: h = 5, |Sk| = 9 (favoring group-Lasso).

lasso group coop

cooperative-Lasso 38

Page 92: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Outline

Definition

Resolution

Consistency

Model selection

Simulation studies

Sibling probe sets and gene selection

cooperative-Lasso 39

Page 93: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Robust microarray gene selection

Affymetrix chips typically contain multiple probe sets per gene, known as sibling probe sets.

Reasons (Li, Zhu, Cook, BMC Genomics 2008)

1. lack of knowledge: genome annotation maps probe sets to the same gene after chip design.

2. instability: probe sets cross-hybridize in an unpredictable manner.

3. designed on purpose: probe sets specific to RNA variants (splicing).

at least two good reasons to put sibling probe sets in the same group

cooperative-Lasso 40


Page 95: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Application: Basal tumor

Methodology

1. select a restricted number of d probes from differential analysis,

2. determine the genes associated with these d probes, then retrieve all p probes related to these genes, regardless of their signal,

3. fit a model with group penalties where groups are defined by genes.

Breast cancer data set

I 22269 probes,

I n = 29 patients with basal tumor,

I predict response to chemotherapy: pCR / not-pCR.

cooperative-Lasso 41

Page 96: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Application: Basal tumor

Pretreatment

I order the p-values from a differential analysis (Jeanmougin et al. 2011),

I keep the d = 10 most differentiated probes,

I this corresponds to exactly 10 genes for a total of 27 probes.

Methods comparison

1. probes: logistic Lasso on the d = 10 most differentiated probes,

2. lasso: logistic Lasso on the p = 27 probes (with no group effect),

3. group: logistic group-Lasso on the p = 27 probes (with group effect),

4. coop: logistic coop-Lasso on the p = 27 probes (with signed group effect).

cooperative-Lasso 42

Page 97: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Results

Gk (gene symbol)   pk   probes   lasso   group   coop
frmd4b             3    0.38     0.62    0.68    0.75
rnps1              2    0        0       0       0
phlda3             1    1.82     1.93    4.12    7.32
tbc1d22a           3    0        0       0       0
ece1               2    0.89     0       0       1.87
lzts1              6    1.34     1.57    1.15    0
rpp38              1    0.95     0.90    1.92    3.66
gtse1              5    0.88     0.85    1.21    0
pak4               3    1.68     0.96    1.70    4.58
chst10             1    0.79     0.36    1.08    2.50

Table: Genes corresponding to the probes selected by differential analysis, size of the groups of probes, and ℓ2-norm of each group of parameters for each estimate.

cooperative-Lasso 43

Page 98: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Results

[Figure: estimated coefficients per probe, grouped by gene]

Figure: Lasso

Gk (gene symbol) pk

frmd4b 3

rnps1 2

phlda3 1

tbc1d22a 3

ece1 2

lzts1 6

rpp38 1

gtse1 5

pak4 3

chst10 1

cooperative-Lasso 44

Page 99: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Results

[Figure: estimated coefficients per probe, grouped by gene]

Figure: Group-Lasso

Gk (gene symbol) pk

frmd4b 3

rnps1 2

phlda3 1

tbc1d22a 3

ece1 2

lzts1 6

rpp38 1

gtse1 5

pak4 3

chst10 1

cooperative-Lasso 44

Page 100: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Results

[Figure: estimated coefficients per probe, grouped by gene]

Figure: Coop-Lasso

Gk (gene symbol) pk

frmd4b 3

rnps1 2

phlda3 1

tbc1d22a 3

ece1 2

lzts1 6

rpp38 1

gtse1 5

pak4 3

chst10 1

cooperative-Lasso 44

Page 101: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Results

[Figure: binomial deviance as a function of ‖β̂‖ along the regularization path, for probes, lasso, group and coop]

          CV(λ⋆)   CV⋆
probes    0.511    0.474
lasso     0.513    0.499
group     0.430    0.372
coop      0.263    0.194

Table: best average CV score CV(λ⋆) and averaged best CV score CV⋆.

cooperative-Lasso 45

Page 102: Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Conclusion

Summary

I A variant of the group-Lasso which assumes sign-coherent groups,possibly sparse.

I the coop-Lasso comes with the "usual" accompanying tools:
  I consistency theorem,
  I model selection criteria,
  I subset algorithm,
  I R-package scoop,

I very encouraging results on real genomic data.

Perspectives

I enhance algorithms/implementation for large-scale experiments,

I deeper analysis in the gene selection framework,

I other applications in genomics (aCGH segmentation?).

cooperative-Lasso 46

