
Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity

Seyoung Kim [email protected]

Eric P. Xing [email protected]

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

We consider the problem of learning a sparse multi-task regression, where the structure in the outputs can be represented as a tree with leaf nodes as outputs and internal nodes as clusters of the outputs at multiple granularity. Our goal is to recover the common set of relevant inputs for each output cluster. Assuming that the tree structure is available as prior knowledge, we formulate this problem as a new multi-task regularized regression called tree-guided group lasso. Our structured regularization is based on a group-lasso penalty, where groups are defined with respect to the tree structure. We describe a systematic weighting scheme for the groups in the penalty such that each output variable is penalized in a balanced manner even if the groups overlap. We present an efficient optimization method that can handle a large-scale problem. Using simulated and yeast datasets, we demonstrate that our method shows a superior performance in terms of both prediction errors and recovery of true sparsity patterns compared to other methods for multi-task learning.

1. Introduction

Many real-world problems in data mining and scientific discovery amount to finding a parsimonious and consistent mapping function from high-dimensional input factors to a structured output signal. For example, in a genetic problem known as expression quantitative trait loci (eQTL) mapping, one attempts to discover an association function from a small set of

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

causal variables known as single nucleotide polymorphisms (SNPs), out of a few million candidates, to a set of genes whose expression levels are interdependent in a complex manner. In computer vision, one tries to relate high-dimensional image features to a structured labeling of objects in the image. An effective approach to this kind of problem is to formulate it as a regression problem from inputs to outputs. In the simplest case where the output is a univariate continuous or discrete response (e.g., a gene expression measurement for a single gene), techniques such as lasso (Tibshirani, 1996) or L1-regularized logistic regression (Ng, 2004; Wainwright et al., 2006) have been developed to identify a parsimonious subset of covariates that determine the outputs. However, in the problem of multi-task regression, where the output is a multivariate vector with an internal sparsity structure, the estimation of the regression parameters can potentially benefit from taking into account this sparsity structure in the estimation process. This allows strongly related output variables to be mapped to the input factors in a synergistic way, which is not possible in the standard lasso.

In a univariate-output regression setting, sparse regression methods that extend lasso have been proposed to allow the recovered relevant inputs to reflect the underlying structural information among the inputs. For example, group lasso assumed that the groupings of the inputs are available as prior knowledge, and used groups of inputs instead of individual inputs as the unit of variable selection (Yuan & Lin, 2006). Group lasso achieved this by applying an L1 norm of the lasso penalty over groups of inputs, while using an L2 norm for the input variables within each group. This L1/L2 norm for group lasso has been extended to a more general setting to encode prior knowledge on various sparsity patterns, where the key idea is to allow the groups to overlap. The hierarchical selection method (Zhao et al., 2008) assumed that the input variables form a tree structure, and designed groups


Figure 1. Tree regularization for multiple-output regression. (a) An example of a multiple-output regression when the correlation structure in the output variables forms a tree, with inputs mapped to outputs (tasks). (b) Groups of regression coefficients associated with each node of the tree in (a) in tree-guided group lasso: G_{v_1} = {β^j_1}, G_{v_2} = {β^j_2}, G_{v_3} = {β^j_3}, G_{v_4} = {β^j_1, β^j_2}, and G_{v_5} = {β^j_1, β^j_2, β^j_3}.

so that a child node enters the set of relevant inputs only if its parent node does. Situations with arbitrary overlapping groups have been considered as well (Jacob et al., 2009; Jenatton et al., 2009).

Many of these ideas related to group lasso in a univariate-output regression may be directly applied to multi-task regression problems. The L1/L2 penalty of group lasso has been used to recover inputs that are jointly relevant to all of the outputs, or tasks, by applying the L2 norm to outputs instead of groups of inputs as in group lasso (Obozinski et al., 2008; 2009). Although the L1/L2 penalty has been shown to be effective for joint covariate selection in multi-task learning, it does not assume any structure among the outputs. The extensions of group lasso with overlapping groups (Zhao et al., 2008; Jacob et al., 2009; Jenatton et al., 2009) may be directly applicable in multi-task learning to incorporate prior knowledge on structures. However, the overlapping groups in their regularization methods can cause an imbalance among different outputs, because the regression coefficients for an output that appears in a large number of groups are more heavily penalized than those for other outputs with memberships in fewer groups. Although a weighting scheme that weights each group differently in the regularization function has been proposed to correct for this imbalance, this ad hoc approach can still lead to inconsistent estimates (Jenatton et al., 2009).

In this paper, we consider a particular case of a sparse multi-task regression problem in which the outputs can be grouped at multiple granularity. We assume that this multi-level grouping structure is encoded as a tree over the outputs, where each leaf node represents an individual output variable and each internal node indicates the cluster of the output variables that correspond to the leaf nodes of the subtree rooted at the given internal node. As illustrated in Figure 1(a), the outputs in each cluster are likely to be influenced by

a common set of inputs. In order to achieve this type of structured sparsity at multiple levels of the hierarchy among the outputs, we propose a novel regularized regression method called tree-guided group lasso that applies group lasso to groups of output variables defined in terms of a hierarchical clustering tree. We assume this tree is available as prior knowledge, and define groups at multiple granularity along the tree to encourage a joint covariate selection within each cluster of outputs. Our approach can handle any type of tree with arbitrary height.

In particular, we describe a novel weighting scheme that systematically weights each group in the tree-guided group-lasso penalty such that clusters of strongly correlated outputs are more encouraged to share common covariates than clusters of weakly correlated outputs. As was noted in Jenatton et al. (2009), an arbitrary assignment of values for the group weights can lead to an inconsistent estimate. Our approach is the first method for achieving structured sparsity that offers a systematic weighting scheme and penalizes the regression coefficients corresponding to each group in a balanced manner even when the groups overlap.

Our work is primarily motivated by the genetic association mapping problem, where the goal is to identify a small number of SNPs (inputs) out of millions of SNPs that influence phenotypes (outputs) such as gene expression measurements. Many previous studies have found that multiple genes in the same biological pathways are often co-expressed. Furthermore, evidence has been found that these genes within a module may share a common genetic basis for the variations in their expression levels (Zhu et al., 2008; Chen et al., 2008). Although the hierarchical agglomerative clustering algorithm has been a popular method for visualizing the clustering structure in the genes, statistical methods that can take advantage of this clustering structure to identify causal genetic variants associated with gene modules were unavailable. In our experiments, using both simulated and yeast datasets, we demonstrate that our proposed method can be successfully applied to select genetic variants affecting multiple genes.

2. Background on Sparse Regression and Multi-task Learning

Assume a sample of N instances, each represented by a J-dimensional input vector and a K-dimensional output vector. Let X denote the N × J input matrix, whose jth column contains the observations for the jth input, x_j = (x^1_j, . . . , x^N_j)^T. Let Y denote the N × K output matrix, whose kth column is the vector of observations for the kth output, y_k = (y^1_k, . . . , y^N_k)^T. For each of the K output variables, we assume a linear model:

y_k = Xβ_k + ε_k,  ∀ k = 1, . . . , K,   (1)

where β_k = (β^1_k, . . . , β^J_k)^T is the vector of J regression coefficients for the kth output, and ε_k is a vector of N independent error terms having mean 0 and constant variance. We center the y_k's and x_j's such that ∑_i y^i_k = 0 and ∑_i x^i_j = 0, and consider the model without an intercept.

When J is large and the number of inputs relevant to the output is small, lasso offers an effective feature-selection method for the model in Equation (1) (Tibshirani, 1996). Let B = (β_1, . . . , β_K) denote the J × K matrix of regression coefficients for all K outputs. Then, lasso obtains B̂^lasso by solving the following optimization problem:

B̂^lasso = argmin_B ∑_k (y_k − Xβ_k)^T (y_k − Xβ_k) + λ ∑_j ∑_k |β^j_k|,

where λ is a tuning parameter that controls the amount of sparsity in the solution. Setting λ to a large value leads to a smaller number of non-zero regression coefficients. Clearly, the standard lasso above offers no mechanism to explicitly couple the estimates of the regression coefficients for correlated output variables.
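As an illustration, here is a minimal NumPy sketch of this objective (hypothetical helper names; X, Y, and B are assumed to be centered as described above):

```python
import numpy as np

def lasso_objective(X, Y, B, lam):
    """Squared-error loss summed over the K outputs plus an L1 penalty on all
    entries of the J x K coefficient matrix B (the multi-output lasso objective)."""
    residual = Y - X @ B                       # N x K matrix of residuals
    return np.sum(residual ** 2) + lam * np.sum(np.abs(B))
```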

In multi-task learning, where the goal is to select input variables that are relevant to at least one task, an L1/L2 penalty has been used to take advantage of the relatedness of the outputs. An L2 norm is applied to the regression coefficients β^j (the jth row of B) for all outputs for each input j, and these J L2 norms are combined through an L1 norm to encourage sparsity across input variables. The L1/L2-penalized multi-task regression is defined as the following optimization problem:

B̂^{L1/L2} = argmin_B ∑_k (y_k − Xβ_k)^T (y_k − Xβ_k) + λ ∑_j ‖β^j‖_2.   (2)

The L1 part of the penalty plays the role of selecting inputs relevant to at least one task, and the L2 part combines information across tasks. Since the L2 penalty does not encourage sparsity within a group, if the jth input is selected as relevant, all of the elements of β^j take non-zero values. Thus, the estimate B̂^{L1/L2} is sparse only across inputs but not across outputs.
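For comparison, a corresponding sketch of the L1/L2-penalized objective in Equation (2) (same hypothetical setup as above):

```python
import numpy as np

def l1_l2_objective(X, Y, B, lam):
    """Squared-error loss plus an L2 norm over each input's K coefficients
    (the rows of B), combined across the J inputs by an L1 sum, as in Equation (2)."""
    residual = Y - X @ B
    return np.sum(residual ** 2) + lam * np.sum(np.linalg.norm(B, axis=1))
```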

3. Tree-Guided Group Lasso for Sparse Multiple-output Regression

The L1/L2-penalized regression assumes that all of the outputs in the problem share the common set of relevant input variables, and has been shown to be effective in this scenario (Obozinski et al., 2008; 2009). However, in many real-world applications, different outputs are related in a complex manner, as in gene expression data, where subsets of genes form functional modules. In this case, it is not realistic to assume that all of the tasks share the same set of relevant inputs as in the L1/L2-regularized regression. A subset of highly related outputs may share a common set of relevant inputs, whereas weakly related outputs are less likely to be affected by the same inputs.

We assume that the relationships among the outputs can be represented as a tree T with a set of vertices V of size |V|, as shown in Figure 1(a). In this tree T, each of the K leaf nodes is associated with an output variable, and the internal nodes of the tree represent groupings of the output variables located at the leaves of the subtree rooted at the given internal node. An internal node near the bottom of the tree indicates that the output variables in its subtree are highly correlated, whereas an internal node near the root represents relatively weaker correlations among the outputs in its subtree. This tree structure may be available as prior knowledge, or can be learned from data using methods such as a hierarchical agglomerative clustering algorithm. Furthermore, we assume that each node v ∈ V is associated with a weight w_v, representing the height of the subtree rooted at v.

Given this tree T over the outputs, we generalize the L1/L2 regularization in Equation (2) to a tree regularization as follows. We expand the L2 part of the L1/L2 penalty into a group-lasso penalty, where the groups are defined based on the tree T: each node v ∈ V of T is associated with a group G_v whose members consist of all of the output variables (or leaf nodes) in the subtree rooted at node v. For example, Figure 1(b) shows the groups associated with each node of the tree in Figure 1(a). Given these groups of outputs that arise from the tree T, tree-guided group lasso can be written as

B̂^{Tree} = argmin_B ∑_k (y_k − Xβ_k)^T (y_k − Xβ_k) + λ ∑_j ∑_{v∈V} w_v ‖β^j_{G_v}‖_2,   (3)

where β^j_{G_v} is the vector of regression coefficients {β^j_k : k ∈ G_v}. Each group of regression coefficients β^j_{G_v} is weighted by w_v, which reflects the strength of correlation within the group.

In order to define the weights w_v, we first associate each internal node v of the tree T with two quantities s_v and g_v that satisfy the condition s_v + g_v = 1, and then define the w_v's in Equation (3) in terms of the s_v's and g_v's as described below. The quantity s_v represents the weight for selecting the output variables associated with each of the children of node v separately, and g_v represents the weight for selecting them jointly. We first consider a simple case with two outputs (K = 2) and a tree of three nodes consisting of two leaf nodes (v_1 and v_2) and one root node (v_3), and then generalize this to an arbitrary tree. When K = 2, the penalty term in Equation (3) can be written as

∑_j ∑_{v∈V} w_v ‖β^j_{G_v}‖_2 = ∑_j [ s_3 ( |β^j_1| + |β^j_2| ) + g_3 √( (β^j_1)^2 + (β^j_2)^2 ) ],

where the weights are given as w_1 = s_3, w_2 = s_3, and w_3 = g_3. This is similar to the elastic-net penalty (Zou & Hastie, 2005), in that β^j_1 and β^j_2 can be selected either jointly or separately according to the weights s_3 and g_3.
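To make the interpolation concrete, here is a small sketch with illustrative values (not taken from the paper): with g_3 = 0 the penalty reduces to the lasso term, and with s_3 = 0 to the joint L2 term.

```python
import math

def two_output_penalty(beta1, beta2, s3, g3):
    """Per-input penalty for K = 2 outputs under root v3: s3 weights separate
    (lasso-like) selection and g3 weights joint (group) selection, with s3 + g3 = 1."""
    return s3 * (abs(beta1) + abs(beta2)) + g3 * math.hypot(beta1, beta2)

print(two_output_penalty(0.5, -0.3, 1.0, 0.0))   # pure lasso term: 0.8
print(two_output_penalty(0.5, -0.3, 0.0, 1.0))   # pure L1/L2 term: ~0.583
```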

Given an arbitrary tree T, we recursively apply a similar operation starting from the root node towards the leaf nodes:

λ ∑_j ∑_{v∈V} w_v ‖β^j_{G_v}‖_2 = λ ∑_j W_j(v_root),   (4)

where

W_j(v) = s_v · ∑_{c∈Children(v)} |W_j(c)| + g_v · ‖β^j_{G_v}‖_2   if v is an internal node,
W_j(v) = ∑_{m∈G_v} |β^j_m|   if v is a leaf node.

It can be shown that the following relationship holds between the w_v's and the (s_v, g_v)'s:

w_v = g_v · ∏_{m∈Ancestors(v)} s_m   if v is an internal node,
w_v = ∏_{m∈Ancestors(v)} s_m   if v is a leaf node.
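As a concrete reading of this relationship, the following sketch (a hypothetical dictionary-based tree encoding, not from the paper) computes the w_v's from the (s_v, g_v) pairs by walking up the ancestor chain of each node:

```python
def node_weights(parent, s, g, leaves):
    """w_v: the product of s_m over the ancestors m of v, times g_v if v is internal."""
    def ancestor_product(v):
        prod, a = 1.0, parent.get(v)
        while a is not None:               # walk up to the root, multiplying s_m
            prod *= s[a]
            a = parent.get(a)
        return prod

    nodes = set(parent) | set(parent.values())
    return {v: ancestor_product(v) if v in leaves else g[v] * ancestor_product(v)
            for v in nodes}
```

For the tree in Figure 1, `parent` would map v1 and v2 to v4, and v3 and v4 to the root v5, with leaves {v1, v2, v3}.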

The above weighting scheme extends the elastic-net-like penalty hierarchically. Thus, at each internal node v, a high value of s_v encourages a separate selection of inputs for the outputs associated with node v, whereas a high value of g_v encourages a joint covariate selection across those outputs. If s_v = 1 and g_v = 0 for all v ∈ V, only separate selections are performed, and the tree-guided group-lasso penalty reduces to the lasso penalty. On the other hand, if s_v = 0 and g_v = 1 for all v ∈ V, the penalty reduces to the L1/L2 penalty in Equation (2), which performs only a joint covariate selection for all outputs. The unit contour surfaces of various penalties for β^j_1, β^j_2, and β^j_3, with groups as defined in Figure 1, are shown in Figure 2.

Figure 2. Unit contour surfaces for {β^j_1, β^j_2, β^j_3} under various penalties, assuming the tree structure over the output variables in Figure 1. (a) Lasso, (b) L1/L2, (c) tree-guided group lasso with g1 = 0.5 and g2 = 0.5, (d) g1 = 0.7 and g2 = 0.7, (e) g1 = 0.2 and g2 = 0.7, and (f) g1 = 0.7 and g2 = 0.2.

Example 1. Given the tree in Figure 1, the tree-guided group-lasso penalty for the jth input in Equation (4) is given as follows:

W_j(v_root) = W_j(v_5)
  = g_{v_5} · ‖β^j_{G_{v_5}}‖_2 + s_{v_5} · ( |W_j(v_4)| + |W_j(v_3)| )
  = g_{v_5} · ‖β^j_{G_{v_5}}‖_2 + s_{v_5} · ( g_{v_4} ‖β^j_{G_{v_4}}‖_2 + s_{v_4} ( |W_j(v_1)| + |W_j(v_2)| ) ) + s_{v_5} |β^j_3|
  = g_{v_5} · ‖β^j_{G_{v_5}}‖_2 + s_{v_5} · g_{v_4} ‖β^j_{G_{v_4}}‖_2 + s_{v_5} · s_{v_4} ( |β^j_1| + |β^j_2| ) + s_{v_5} |β^j_3|.

Proposition 1. For each output k, the sum of the weights w_v over all nodes v ∈ V in T whose group G_v contains the kth output as a member equals one. In other words, the following holds:

∑_{v: k∈G_v} w_v = ∏_{m∈Ancestors(v_k)} s_m + ∑_{l∈Ancestors(v_k)} g_l ∏_{m∈Ancestors(v_l)} s_m = 1.

Proof. We assume an ordering of the nodes {v : k ∈ G_v} along the path from the leaf v_k to the root v_root, and represent the ordered nodes as v_1, . . . , v_M. Since s_v + g_v = 1 for all v ∈ V, we have

∑_{v: k∈G_v} w_v = ∏_{m=1}^{M} s_m + ∑_{l=1}^{M} g_l ∏_{m=l+1}^{M} s_m
  = s_1 ∏_{m=2}^{M} s_m + g_1 ∏_{m=2}^{M} s_m + ∑_{l=2}^{M} g_l ∏_{m=l+1}^{M} s_m
  = (s_1 + g_1) · ∏_{m=2}^{M} s_m + ∑_{l=2}^{M} g_l ∏_{m=l+1}^{M} s_m
  = ∏_{m=2}^{M} s_m + ∑_{l=2}^{M} g_l ∏_{m=l+1}^{M} s_m = · · · = 1.
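As a quick numeric sanity check of Proposition 1 on the Figure 1 tree (the (s_v, g_v) values below are arbitrary illustrative choices satisfying s_v + g_v = 1):

```python
# Figure 1 tree: leaves v1, v2, v3; internal node v4 over {v1, v2}; root v5.
s4, g4 = 0.3, 0.7
s5, g5 = 0.6, 0.4

w_v1 = s4 * s5             # leaf weight: product of s over its ancestors v4 and v5
w_v4 = g4 * s5             # internal node: g_v4 times s over its ancestor v5
w_v5 = g5                  # the root has no ancestors

# Output 1 belongs to groups G_v1, G_v4, and G_v5; their weights sum to one.
print(w_v1 + w_v4 + w_v5)  # 0.18 + 0.42 + 0.40 = 1.0
# Output 3 belongs only to G_v3 and G_v5.
print(s5 + g5)             # w_v3 + w_v5 = 0.6 + 0.4 = 1.0
```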

Even if each output k belongs to multiple groups associated with internal nodes {v : k ∈ G_v} and appears multiple times in the overall penalty in Equation (4), Proposition 1 states that the sum of weights over all of the groups that contain the given output variable is always one. Thus, the weighting scheme in Equation (4) guarantees that the regression coefficients for all of the outputs are penalized equally. In contrast, group lasso with overlapping groups in Jenatton et al. (2009) used arbitrarily defined weights, which was empirically shown to lead to an inconsistent estimate. Another main difference between our method and the work in Jenatton et al. (2009) is that we take advantage of groups that contain other groups along the tree structure, whereas they tried to remove such groups as redundant.

Our proposed penalty function differs from the tree-structured penalty in Zhao et al. (2008) in that the trees are defined differently and contain different information. In our work, leaf nodes of the tree represent variables (or tasks) and internal nodes correspond to clustering information. In Zhao et al. (2008), on the other hand, the variables themselves form a tree structure, where both leaf and internal nodes correspond to variables. Thus, the tree in Zhao et al. (2008) does not represent a clustering structure but instead prescribes which variables should enter the set of relevant variables before others.

4. Parameter Estimation

In order to estimate the regression coefficients in tree-guided group lasso, we use an alternative formulation of the problem in Equation (3) that was previously introduced for group lasso (Bach, 2008), given as

B̂^{Tree} = argmin_B ∑_k (y_k − Xβ_k)^T (y_k − Xβ_k) + λ ( ∑_j ∑_{v∈V} w_v ‖β^j_{G_v}‖_2 )^2.   (5)

Since the L1/L2 norm in the above equation is a non-smooth function, it is not trivial to optimize directly. We make use of the fact that a squared mixed-norm regularization admits a variational formulation as a weighted L2 regularization (Argyriou et al., 2008):

( ∑_j ∑_{v∈V} w_v ‖β^j_{G_v}‖_2 )^2 ≤ ∑_j ∑_{v∈V} w_v^2 ‖β^j_{G_v}‖_2^2 / d_{j,v},

where ∑_j ∑_v d_{j,v} = 1, d_{j,v} ≥ 0 for all j, v, and equality holds for

d_{j,v} = w_v ‖β^j_{G_v}‖_2 / ( ∑_j ∑_{v∈V} w_v ‖β^j_{G_v}‖_2 ).   (6)

Thus, we can re-write the problem in Equation (5) so that it contains only smooth functions:

B̂^{Tree} = argmin_B ∑_k (y_k − Xβ_k)^T (y_k − Xβ_k) + λ ∑_j ∑_{v∈V} w_v^2 ‖β^j_{G_v}‖_2^2 / d_{j,v}   (7)

subject to ∑_j ∑_v d_{j,v} = 1 and d_{j,v} ≥ 0 for all j, v, where we have introduced additional variables d_{j,v} that need to be estimated. We solve this problem by optimizing the β_k's and the d_{j,v}'s alternately over iterations until convergence. In each iteration, we first fix the values of the β_k's and update the d_{j,v}'s, where the update equations for the d_{j,v}'s are given by Equation (6). Then, we hold the d_{j,v}'s constant and optimize for the β_k's: differentiating the objective in Equation (7) with respect to β_k, setting it to zero, and solving gives the update equation

β_k = ( X^T X + λ D )^{-1} X^T y_k,

where D is a J × J diagonal matrix with ∑_{v∈V} w_v^2 / d_{j,v} as the jth element along the diagonal.

Finally, the regularization parameter λ can be selected using cross-validation.
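A compact sketch of this alternating scheme (hypothetical code, not the authors' implementation; `groups` lists, for each tree node v, the output indices in G_v, `w` holds the matching weights w_v, and the eps guard against division by zero is a practical detail not discussed in the paper):

```python
import numpy as np

def fit_tree_guided_group_lasso(X, Y, groups, w, lam, n_iter=100, eps=1e-8):
    """Alternate between the closed-form d-step of Equation (6) and the
    ridge-like beta-step derived from Equation (7)."""
    J = X.shape[1]
    w = np.asarray(w, dtype=float)
    B = np.linalg.lstsq(X, Y, rcond=None)[0]              # warm start: least squares
    XtX, XtY = X.T @ X, X.T @ Y
    for _ in range(n_iter):
        # d-step: d_{j,v} proportional to w_v * ||beta^j_{G_v}||_2
        norms = np.array([[w[v] * np.linalg.norm(B[j, groups[v]])
                           for v in range(len(groups))] for j in range(J)])
        d = norms / max(norms.sum(), eps)
        # beta-step: beta_k = (X^T X + lam * D)^{-1} X^T y_k for every output k,
        # where D is diagonal with D_jj = sum_v w_v^2 / d_{j,v}
        diag = (w ** 2 / np.maximum(d, eps)).sum(axis=1)
        B = np.linalg.solve(XtX + lam * np.diag(diag), XtY)
    return B
```

In practice one would also monitor the change in B across iterations to stop early, and pick λ by cross-validation as stated above.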


Figure 3. An example of regression coefficients estimated from a simulated dataset. (a) Tree structure of the output variables, (b) true regression coefficients, (c) lasso, (d) L1/L2, (e) tree-guided group lasso. The rows represent outputs and the columns inputs.

5. Experiments

We demonstrate the performance of our method on simulated datasets and a yeast dataset of genotypes and gene expressions, and compare the results with those from lasso and the L1/L2-regularized regression, which do not assume any structure among the outputs. We evaluate these methods based on two criteria: test error and sensitivity/specificity in detecting the true relevant covariates.

5.1. Simulation Study

We simulate data using the following scenario, analogous to genetic association mapping. We simulate (X, Y) with K = 60, J = 200, and N = 150 for the training set as follows. We first generate the inputs X by sampling each element of X from a uniform distribution over {0, 1, 2}, corresponding to the number of mutated alleles at each genetic locus. Then, we set the values of B by first selecting non-zero entries and filling these entries with a pre-defined value. We assume a hierarchical structure of height four over the outputs, and select the non-zero elements of B so that they correspond to the groupings in the sparsity structure given by this tree. The tree is shown in Figure 3(a), where we only draw the top three levels to avoid clutter. Figure 3(b) shows the selected non-zero elements as white pixels, with outputs as rows and inputs as columns. Given X and B, we generate Y with noise distributed as N(0, 1.0).
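A sketch of this generation scenario (the exact tree and the placement of the non-zero blocks in B below are simplified, illustrative assumptions rather than the paper's actual design):

```python
import numpy as np

rng = np.random.default_rng(0)
K, J, N, signal = 60, 200, 150, 0.4

# Inputs: each element is a count of mutated alleles in {0, 1, 2}.
X = rng.integers(0, 3, size=(N, J)).astype(float)

# Coefficients: clusters of outputs (subtrees) share the same small set of inputs.
B = np.zeros((J, K))
for inputs, outputs in [(slice(0, 5), slice(0, 15)),
                        (slice(5, 10), slice(15, 30)),
                        (slice(10, 15), slice(30, 60))]:
    B[inputs, outputs] = signal

# Outputs: linear model with unit-variance Gaussian noise.
Y = X @ B + rng.normal(0.0, 1.0, size=(N, K))
```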

We fit lasso, the L1/L2-regularized regression, and our method to the dataset simulated with the signal strength of the non-zero elements of B set to 0.4, and show the results in Figures 3(c)-(e), respectively. Since lasso does not have any mechanism to borrow strength across different tasks, false positives are distributed randomly across the matrix B̂^lasso in Figure 3(c). On the other hand, the L1/L2-regularization method blindly combines information across the outputs regardless of the sparsity structure. As a result, once an input is selected as relevant for one output, it gets selected for all of the other outputs, which tends to create vertical stripes of non-zero values, as shown in Figure 3(d). When the true hierarchical structure in Figure 3(a) is available as prior knowledge, it is visually clear from Figure 3(e) that our method is able to suppress false positives and recover the true underlying sparsity structure significantly better than the other methods.

Figure 4. ROC curves for the recovery of true non-zero regression coefficients, averaged over 50 simulated datasets. (a) β^j_k = 0.2, (b) β^j_k = 0.4, and (c) β^j_k = 0.6.

Figure 5. Prediction errors of various regression methods on simulated datasets, averaged over 50 simulated datasets. (a) β^j_k = 0.2, (b) β^j_k = 0.4, and (c) β^j_k = 0.6.

In order to systematically evaluate the performance of the different methods, we generate 50 simulated datasets, and show in Figure 4 receiver operating characteristic (ROC) curves for the recovery of the true sparsity pattern averaged over these datasets. Figures 4(a)-(c) represent results for signal strengths in B of 0.2, 0.4, and 0.6, respectively. Our method clearly outperforms lasso and the L1/L2 regularization method. Especially when the signal strength is weak, as in Figure 4(a), the advantage of incorporating the prior knowledge of the tree as a sparsity structure is significant.


We compare the performance of the different methods in terms of prediction errors, using an additional 50 samples as test data, and show the results in Figures 5(a)-(c) for signal strengths of 0.2, 0.4, and 0.6, respectively. We find that our method has a lower prediction error than the methods that do not take advantage of the structure in the outputs.

We also consider the scenario where the true tree structure in Figure 3(a) is not known a priori. In this case, we learn a tree by running a hierarchical agglomerative clustering on the K × K correlation matrix of the outputs, and use this tree and the weights h_v associated with each internal node in our method. The weight h_v of each internal node v returned by the hierarchical agglomerative clustering indicates the height of the subtree rooted at the node, or how tightly its members are correlated. After normalizing the weights (denoted as h′_v) of all of the internal nodes such that the root is at height one, we assign g_v = 1 − h′_v and s_v = h′_v. Since the tree obtained in this manner represents a noisy realization of the true underlying tree structure, we discard the nodes for weak correlations near the root of the tree by thresholding h′_v at ρ = 0.9 and 0.7, and show the prediction errors in Figure 5 as T0.9 and T0.7. Even when the true tree structure is not available, our method is able to benefit from taking into account the output structure, and gives lower prediction errors.
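A sketch of this construction (an assumed SciPy-based implementation; the average-linkage method and the correlation-to-distance conversion are illustrative choices not specified in the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def tree_and_weights(Y, rho=0.9):
    """Cluster the K outputs, normalize merge heights so the root sits at height 1,
    set g_v = 1 - h'_v and s_v = h'_v, and drop nodes with h'_v >= rho."""
    corr = np.corrcoef(Y, rowvar=False)             # K x K output correlation matrix
    dist = squareform(1.0 - corr, checks=False)     # condensed distance representation
    Z = linkage(dist, method='average')             # one row per internal (merge) node
    h = Z[:, 2] / Z[:, 2].max()                     # normalized node heights h'_v
    keep = h < rho                                  # discard weakly correlated merges
    return Z[keep], 1.0 - h[keep], h[keep]          # merges, g_v, s_v per kept node
```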

5.2. Analysis of Yeast Data

We analyze the genotype and gene expression data of 114 yeast strains (Zhu et al., 2008) using various sparse regression methods. We focus on chromosome 3, with 21 SNPs and 3684 genes. Although it is well established that genes form clusters in terms of expression levels that correspond to functional modules, the hierarchical clustering structure over correlated genes is not directly available as prior knowledge. Instead, we learn the tree structure and node weights from the gene expression data by running the hierarchical agglomerative clustering algorithm as described in the previous section. We use only the internal nodes with heights h′_v < 0.7 or 0.9 in our method. The goal of the analysis is to identify SNPs (inputs) whose variations induce significant variations in gene expression levels (outputs) across different strains. By applying our method, which incorporates information on gene modules at multiple granularity along the hierarchical clustering tree, we expect to identify SNPs that influence groups of co-expressed genes.

In Figure 6(a), we show the K × K correlation matrix of the gene expressions after reordering the rows and columns according to the results of the hierarchical agglomerative clustering algorithm. The estimated B is shown for lasso, the L1/L2-regularized regression, and our method with ρ = 0.9 and 0.7 in Figures 6(b)-(e), respectively, where the rows represent genes and the columns SNPs. The lasso estimates are extremely sparse and do not reveal any interesting structure in SNP-gene relationships. We believe that the association signals are very weak, as is typically the case in a genetic association study, and that lasso is unable to detect such weak signals since it does not borrow strength across genes. The estimates from the L1/L2-regularized regression in Figure 6(c) are not sparse across genes and tend to form vertical stripes of non-zero regression coefficients. Our method in Figures 6(d)-(e) reveals clear groupings in the patterns of associations between genes and SNPs. Our method also performs significantly better in terms of prediction errors, as can be seen in Figure 7.

Figure 6. Results for the yeast dataset. (a) Correlation matrix of the gene expression data, where rows and columns are reordered after applying agglomerative hierarchical clustering. Estimated regression coefficients are shown for (b) lasso, (c) L1/L2, (d) tree-guided group lasso with ρ = 0.9, and (e) with ρ = 0.7.

Figure 7. Prediction errors for the yeast dataset.

Given the estimates of B in Figure 6, we look for an enrichment of GO categories among the genes with non-zero regression coefficients for each SNP. A group of genes that form a module often participate in the same pathways, leading to an enrichment of a GO category among the members of the module. Since we are interested in identifying SNPs influencing gene modules and our method reflects this joint association through the hierarchical clustering tree, we hypothesize that our method would reveal a more significant GO enrichment in the estimated non-zero elements of B. Because the estimates of the L1/L2-regularized method are not sparse across genes, we threshold the absolute values of the estimated B at 0.005, 0.01, 0.03, and 0.05, and search for GO enrichment only among the genes with |β^j_k| above the threshold. For our method, on the other hand, we use all of the genes with non-zero elements in B for each SNP.

Figure 8. Enrichment of GO categories in the estimated regression coefficients for the yeast dataset. (a) Biological process, (b) molecular function, and (c) cellular component.

In Figure 8, we show the number of SNPs with significant enrichments at different p-value cutoffs for sub-categories within each of the three broad GO categories: biological processes, molecular functions, and cellular components. For example, within biological processes, SNPs were found to be enriched for GO terms such as mitochondrial translation, amino acid biosynthetic process, and carboxylic acid metabolic process. Regardless of the threshold for selecting significant associations in the L1/L2 estimates, our method generally finds more significant enrichment.

6. Conclusions

In this paper, we considered a feature selection problem in a multiple-output regression setting where the groupings of the outputs can be defined hierarchically using a tree. We proposed tree-guided group lasso, which finds a sparse estimate of regression coefficients while taking into account the structure among outputs given by a tree. We demonstrated our method using simulated and yeast datasets.

Acknowledgements

EPX is supported by grants ONR N000140910758, NSF DBI-0640543, NSF CCF-0523757, NIH 1R01GM087694, and an Alfred P. Sloan Research Fellowship.

References

Argyriou, A., Evgeniou, T., and Pontil, M. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

Bach, F. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.

Chen, Y., Zhu, J., Lum, P.K., Yang, X., Pinto, S., MacNeil, D.J., Zhang, C., Lamb, J., Edwards, S., Sieberts, S.K., et al. Variations in DNA elucidate molecular networks that cause disease. Nature, 452(27):429–35, 2008.

Jacob, L., Obozinski, G., and Vert, J. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.

Jenatton, R., Audibert, J., and Bach, F. Structured variable selection with sparsity-inducing norms. Technical report, INRIA, 2009.

Ng, A. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the 21st International Conference on Machine Learning, 2004.

Obozinski, G., Wainwright, M.J., and Jordan, M.I. High-dimensional union support recovery in multivariate regression. In Advances in Neural Information Processing Systems 21, 2008.

Obozinski, G., Taskar, B., and Jordan, M.I. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 2009.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

Wainwright, M.J., Ravikumar, P., and Lafferty, J. High-dimensional graphical model selection using L1-regularized logistic regression. In Advances in Neural Information Processing Systems 18, 2006.

Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

Zhao, P., Rocha, G., and Yu, B. Grouped and hierarchical model selection through composite absolute penalties. Technical Report 703, Department of Statistics, University of California, Berkeley, 2008.

Zhu, J., Zhang, B., Smith, E.N., Drees, B., Brem, R.B., Kruglyak, L., Bumgarner, R.E., and Schadt, E.E. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nature Genetics, 40:854–861, 2008.

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.

