Multiple task learning with flexible structure regularization

Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Contents lists available at ScienceDirect

Neurocomputing

http://dx.doi.org/10.1016/j.neucom.2015.11.029
0925-2312/© 2015 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail address: [email protected] (J. Pu).

journal homepage: www.elsevier.com/locate/neucom

Multiple task learning with flexible structure regularization

Jian Pu a,*, Jun Wang b,c, Yu-Gang Jiang d, Xiangyang Xue d

a Institute of Neuroscience, Chinese Academy of Sciences, Shanghai, China
b School of Computer Science and Software Engineering, East China Normal University, Shanghai, China
c Institute of Data Science and Technology, Alibaba Group, Seattle, USA
d School of Computer Science, Fudan University, Shanghai, China

Article info

Article history:
Received 16 August 2015
Received in revised form 17 October 2015
Accepted 6 November 2015

Communicated by Jinhui Tang

Keywords:
Generic multiple task learning
Flexible structure regularization
Joint ℓ1,1/ℓ2,1-norm regularization
Iteratively reweighted least square
Accelerated proximal gradient

Abstract

Due to the theoretical advances and empirical successes, Multi-task Learning (MTL) has become a popular design paradigm for training a set of tasks jointly. Through exploring the hidden relationships among multiple tasks, many MTL algorithms have been developed to enhance learning performance. In general, the complicated hidden relationships can be considered as a combination of two key structural elements: task grouping and task outliers. Based on such task relationships, here we propose a generic MTL framework with flexible structure regularization, which aims at relaxing any type of specific structure assumption. In particular, we directly impose a joint ℓ1,1/ℓ2,1-norm as the regularization term to reveal the underlying task relationship in a flexible way. Such a flexible structure regularization term accounts for any convex combination of grouping and outlier structural characteristics among the multiple tasks. In order to derive efficient solutions for the generic MTL framework, we develop two algorithms, i.e., the Iteratively Reweighted Least Square (IRLS) method and the Accelerated Proximal Gradient (APG) method, with different emphases and strengths. In addition, theoretical convergence and performance guarantees are analyzed for both algorithms. Finally, extensive experiments on both synthetic and real data, and comparisons with several state-of-the-art algorithms, demonstrate the superior performance of the proposed generic MTL method.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Realizing the existence of sparse training data and the task correlations, multiple task learning (MTL) is designed to train multiple models jointly and simultaneously, and often leads to better learnt models than those trained independently. The key idea of MTL is to explore the hidden relationships among multiple tasks to enhance learning performance. MTL has been shown particularly useful if there exist intrinsic relationships among multiple learning tasks and the training data is inadequate for each single task. Due to its empirical successes, MTL has been applied to various application domains, including social media categorization and search [12,54], fine-grained visual categorization [42], disease modeling and prediction [8,63], spam filtering [3], reinforcement learning [9] and even financial stock selection [19].

The key ingredient of MTL is to explore model commonality among the multiple learning tasks, and use such model commonality to improve the learning performance. Some earlier MTL works assume that there is a common structure or a common set of


parameters shared by all the learning tasks [51,52]. However, sharing a model commonality among all the learning tasks is a fairly strong assumption, which is often invalid in real applications. Therefore, two compromised yet more realistic scenarios, i.e., task grouping and task outlier, have been explored recently. For task grouping, one assumes that the commonality only exists among tasks within the same group. During the learning process, through identifying such task groups, the unrelated tasks from different groups will not influence each other [21,23,52,55]. In the task outlier scenario [13], a robust MTL algorithm was proposed to capture the commonality for a major group of tasks while detecting the outlier tasks. A popular way to tackle the robust MTL problem is to use a decomposition framework, which forms the learning objective with a structure term and an outlier penalty term. To efficiently solve the optimization problem, the target model can be further decomposed into two components, reflecting the major group structure and the outliers [22]. Representative decomposition schemes for MTL include the low-rank structure [13] and the group sparsity based approaches [20].

Note that the aforementioned assumptions of task grouping and task outlier were exclusively considered in most of the existing works. In other words, the task grouping based methods neglected the existence of outlier tasks, and many robust MTL frameworks



Fig. 1. The illustration of different target models W learned using various assumptions of task structures: (a) shared model commonality, (b) task grouping, (c) outlier tasks, and (d) generic multi-tasks. Each column of W corresponds to a single task and each row represents a feature dimension. For each element in W, white indicates zero-valued elements and gray indicates non-zero values, with the intensity indicating the magnitude of the values.


only assumed the case of one major task group peppered with a few outlier tasks. In this paper, we address MTL under a very general setting where multiple major task groups and outlier tasks could occur simultaneously. In particular, without decomposing the target model, we directly impose a flexible structure regularization term with a joint ℓ1,1/ℓ2,1-norm that reflects a mixture of structure and outlier penalties. The final objective is formulated as an unconstrained non-smooth convex problem, and two efficient algorithms, i.e., the Iteratively Reweighted Least Square (IRLS) method and the Accelerated Proximal Gradient (APG) method, are applied to derive optimal solutions with different strengths. Particularly, the IRLS method can handle the learning process for a large number of tasks efficiently, while the APG method provides robust performance when the active features are either sparse or dense. In addition, we provide theoretical analysis of both the convergence and the performance bound of the proposed MTL method. Finally, empirical studies on synthetic and real benchmark datasets corroborate that the proposed MTL method clearly outperforms several state-of-the-art MTL approaches.

The remainder of this paper is organized as follows. Section 2 briefly reviews several major MTL schemes in the existing works. Section 3 presents our proposed generic MTL framework and two efficient solutions. Sections 4 and 5 provide theoretical analysis of the proposed methods, including convergence properties and performance bounds. Section 6 gives experimental validations and comparative studies, and, finally, Section 7 concludes this paper.

2. Related work

Here, we first define the notations used in this paper. Then we briefly survey several major multi-task learning paradigms and summarize their strengths and weaknesses.

2.1. Notations

Assume the data is represented as a matrix X ∈ R^{d×n}, where the column vector x_i ∈ R^d is the i-th data point and d is the dimension. In addition, we denote x_{i·} as the i-th row of X, which corresponds to the i-th feature of the data. The (p,q)-norm of the matrix X is denoted as

‖X‖_{p,q} = (Σ_i ‖x_{i·}‖_p^q)^{1/q} = (Σ_i (Σ_j |x_{ij}|^p)^{q/p})^{1/q}.

For example, ‖X‖_{2,1} = Σ_i ‖x_{i·}‖_2 = Σ_i (Σ_j x_{ij}²)^{1/2}.
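The (p,q)-norm above is straightforward to check numerically. The following is a minimal NumPy sketch (the function name `matrix_norm` is ours, not from the paper):

```python
import numpy as np

def matrix_norm(X, p, q):
    # ||X||_{p,q} = (sum_i ||x_i.||_p^q)^(1/q): take the p-norm of each
    # row, then the q-norm of the vector of row norms.
    row_norms = np.sum(np.abs(X) ** p, axis=1) ** (1.0 / p)
    return float(np.sum(row_norms ** q) ** (1.0 / q))

X = np.array([[3.0, 4.0],
              [0.0, 5.0]])
# ||X||_{2,1} = ||(3,4)||_2 + ||(0,5)||_2 = 5 + 5 = 10
# ||X||_{1,1} = 3 + 4 + 0 + 5 = 12
print(matrix_norm(X, 2, 1), matrix_norm(X, 1, 1))
```

Note that ‖X‖_{2,1} sums the ℓ2 norms of the rows, which is what makes it a row-sparsity-inducing penalty later in the paper.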

In a typical setting of multiple task regression or classification, we are given L tasks associated with training data {(X_1, y_1), …, (X_L, y_L)}, where X_l ∈ R^{d×n_l} and y_l ∈ R^{n_l} are the input and response of the l-th task with a total of n_l samples. We want to employ MTL to


derive optimal prediction models for all the tasks simultaneously. In particular, for linear regression models, the prediction model for the l-th task is represented as f(w_l, X_l) = X_l^⊤ w_l. We then use a coefficient matrix W = [w_1, w_2, …, w_L] to represent all the regression tasks. The goal of MTL is to derive an optimal W* across all the learning tasks while satisfying desired structure characteristics.

2.2. Shared model commonality

One of the most straightforward ways to design an MTL algorithm is to assume that all the tasks share certain model commonality. Typically, such commonality can be represented as common structures or parameters shared by the learned models. For instance, structure commonality includes subspace sharing [28,35,40] and feature set sharing [2,24,31,32,34,56,61,18]. In terms of parameter commonality, it includes a wide range of options depending on the learning methods used, such as the hidden units in neural networks [10], kernels [17], the priors in hierarchical Bayesian models [4,49,57,58,60], the parameters in Gaussian process covariance [26,43,50], the feature mapping matrices [1], and the similarity metrics [39,59]. Fig. 1(a) demonstrates an example of feature sharing in the learned model W, where all the learning tasks select the same subset of features. Through exploring various types of model commonalities, either structures or parameters, simultaneously learned multiple tasks will benefit from the learning of each other. Hence, the MTL paradigm is expected to achieve better generalization performance than independently learning a prediction model for each task. However, real applications tend to have more complicated situations, and often there are no commonly shared structures or parameters among all the tasks [47,38].

2.3. MTL with task grouping

Note that in many real applications of learning multi-tasks, the tasks are gathered into several groups according to their relatedness. Intuitively, the tasks in the same group are more related than the tasks in different groups. As shown in Fig. 1(b), the learning tasks form two groups, where the tasks within the same group select the same subset of features and share no common features with the tasks from the other group.

To deal with this scenario, one of the representative methods isto use grouping matrices to model the task grouping effect

xible structure regularization, Neurocomputing (2015), http://dx.



explicitly, as suggested in [23]:

arg min_{W,Q}  V(f(w_l, X_l), Y_l) + γ Σ_g ‖W Q_g‖,

where w_l is the l-th column vector of W and Q_g is the group assignment matrix for the g-th group. For a standard regression problem, the empirical loss is formed quadratically:

V(f(W, X_l), y_l) = Σ_{l=1}^{L} ‖X_l^⊤ w_l − y_l‖².

The above objective function is minimized using an alternating optimization strategy that converges to a local minimum [23]. To deal with a more complex setting with overlapping groups, one can decompose the weight matrix W as [25]

W = M S,

where each column of the latent task matrix M represents one task group, and S is the coefficient matrix of the representation in the latent task space. The objective function is formulated as

arg min_{M,S}  V(f(M s_l, X_l), Y_l) + α‖S‖_{1,1} + β‖M‖_F².

The sparsity penalty on the matrix S enforces that each observed task is related to only a few latent tasks. The second penalty regularizes the ℓ2 norm of the latent matrix and avoids overfitting. An alternating optimization strategy is employed to solve the above problem [25]. More recently, task grouping and sparse structural analysis have been combined by Li et al. [29], solved via nonnegative spectral clustering.

2.4. MTL with outlier tasks

Another scenario in real applications of learning multi-tasks is that there exist a certain number of tasks which are independent of the other tasks. Hence, they are named outlier tasks. For instance, Fig. 1(c) shows a major task group peppered with two outlier tasks. Though this can be regarded as a special case of task grouping with one big group and several singleton groups, such an imbalanced grouping structure is problematic for many task grouping methods.

As mentioned earlier, an intuitive yet effective way to distinguish those outlier tasks is to employ the decomposition framework to design robust MTL methods. Robust MTL assumes that the target model W can be represented by the superposition of a block-structured component P with row sparsity and an outlier component Q with elementwise sparsity [20]:

W = P + Q,

where P reveals a set of tasks with a commonly shared structure and Q identifies the outlier tasks through enforcing elementwise sparsity.

The objective for robust MTL can then be formed as the following optimization problem with the two types of sparsity regularization:

arg min_{P,Q}  V(f(p_l, q_l, X_l), Y_l) + α‖P‖_{2,1} + β‖Q‖_{1,1},

where p_l and q_l are the l-th column vectors of P and Q, respectively. The linear prediction model is written as

f(w_l, X_l) = X_l^⊤ w_l = X_l^⊤ (p_l + q_l) = f(p_l, q_l, X_l).

The above decomposition-based objective can be efficiently optimized via various techniques, such as an accelerated gradient descent method [20]. However, this formulation does not consider the case with multiple groups of tasks, whose structures cannot be simply represented as a single block-structured component. In addition, although the early work provided the error bounds of


recovering the two decomposed parts P and Q [20], the error bound for recovering the true target model W remains unrevealed.

3. Generic MTL via flexible structure regularization

In this section, we first describe a generic MTL formulation with a flexible structure regularization, which imposes no specific assumptions on the tasks' structure. In order to train the target models and identify the task structure from the data simultaneously, we propose two efficient algorithms, i.e., the Iteratively Reweighted Least Square (IRLS) method and the Accelerated Proximal Gradient (APG) method. Finally, we analyze the relatedness of these two methods.

3.1. Structure regularization with joint ℓ1,1/ℓ2,1-norm

Here we consider using a linear regression model for learning L tasks simultaneously, where a single prediction model is represented as f(X_l) = X_l^⊤ w_l, l = 1, 2, …, L. Motivated by the existing works [22,20], we formulate a minimization problem with the cost function as a regularized quadratic loss:

W = arg min_W L(W) = arg min_W Σ_{l=1}^{L} ‖X_l^⊤ w_l − y_l‖₂² + λ J(W).   (1)

We use J(W) to denote a structure regularization term with λ as the coefficient. Without using the superposition assumption to decompose W into two structure terms, we use a combination of a structure-inducing norm and an outlier-detecting norm, namely the joint ℓ1,1/ℓ2,1-norm regularization:

J(W) = (1 − γ)‖W‖_{2,1} + γ‖W‖_{1,1},   (2)

where γ ∈ [0, 1] is a constant that balances the two norms. Though it shares a similar form with the elastic net regularization [64], the above formulation uses an ℓ2,1 norm instead of a squared ℓ2,1 norm. For a non-degenerate setting with γ ∈ (0, 1), we illustrate the 3D norm balls of the elastic net and the joint ℓ1,1/ℓ2,1 norm in Fig. 2. Compared to the elastic net ball, the joint ℓ1,1/ℓ2,1 norm ball has sharper corner regions, which may induce a sparser solution.

Replacing the regularization term in Eq. (1) by the definition in Eq. (2), we obtain the following regularized convex cost function:

L(W) = Σ_{l=1}^{L} ‖X_l^⊤ w_l − y_l‖₂² + λ1‖W‖_{2,1} + λ2‖W‖_{1,1},   (3)

where the constant coefficients are absorbed as λ1 = λ(1 − γ) and λ2 = λγ. Instead of decomposing the target model into a fixed combination of structure and outlier components, here we exploit a flexible structure penalty without imposing any specific assumptions. Such a formulation is more flexible in handling various types of tasks, including both grouped tasks and outlier tasks. A similar cost function with a joint regularization term has been used to detect eQTLs [27], to predict memory performance [53], and to analyze fMRI data [46]. However, the convergence property of [27] is not provided, and the solution of [53] relies on iteratively solving the inverse of a Gram matrix, which is computationally infeasible for high-dimensional data. Below, we present two efficient approaches to perform the training process and provide theoretical analysis of the convergence properties.
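For concreteness, the cost in Eq. (3) can be evaluated as in the following NumPy sketch; the names `mtl_cost`, `Xs`, and `ys` are illustrative, not the authors' code:

```python
import numpy as np

def mtl_cost(W, Xs, ys, lam1, lam2):
    # Eq. (3): sum_l ||X_l^T w_l - y_l||_2^2 + lam1*||W||_{2,1} + lam2*||W||_{1,1}
    loss = sum(np.sum((X.T @ W[:, l] - y) ** 2)
               for l, (X, y) in enumerate(zip(Xs, ys)))
    l21 = np.sum(np.sqrt(np.sum(W ** 2, axis=1)))  # sum of row 2-norms
    l11 = np.sum(np.abs(W))                        # sum of absolute entries
    return loss + lam1 * l21 + lam2 * l11

# Two tasks in d = 2 dimensions, two samples each.
Xs = [np.eye(2), np.eye(2)]
ys = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
W = np.zeros((2, 2))
print(mtl_cost(W, Xs, ys, lam1=0.1, lam2=0.1))  # residual only: 1 + 1 = 2
```

With W = 0 both penalties vanish and the cost is the pure quadratic loss; any nonzero W trades residual against the two sparsity terms, which is exactly the balance γ controls in Eq. (2).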

3.2. Optimization via iteratively reweighted least square method

We first apply an iteratively reweighted least square method to minimize the cost function in Eq. (3), as suggested in [41]. By




Fig. 2. The illustration of (a) the 3D elastic net ball, and (b) the joint ℓ1,1/ℓ2,1-norm ball, with γ = 0.2, 0.5, 0.8, respectively.


adopting the factored representation for the gradient vector of the regularizer [45], we can set the partial derivatives ∂L/∂w_l to zero to derive the optimal solution:

∂L/∂w_l = 0  ⇒  2 X_l X_l^⊤ w_l − 2 X_l y_l + λ Π_l w_l = 0,

where Π_l is a diagonal matrix with the element (Π_l)_{ii} defined as:

(Π_l)_{ii} = (1 − γ)‖w_{i·}‖₂^{−1} + γ|w_{il}|^{−1}.

Apparently, the element (Π_l)_{ii} consists of two components. The first component ‖w_{i·}‖₂^{−1} represents the group impact, since it imposes row sparsity on the i-th row of W. The second component |w_{il}|^{−1} represents the individual impact, measuring the impact of the i-th feature on the l-th task. The parameter γ balances the contributions of these two components to the diagonal matrix Π_l. After some derivation, we obtain the following equation:

(X_l X_l^⊤ + (λ/2) Π_l) w_l = X_l y_l.   (4)

The above equation actually indicates a solution for the following weighted least square problem [15]:

arg min_{w_l}  ‖X_l^⊤ w_l − y_l‖₂² + (λ/2)‖Π_l^{1/2} w_l‖₂².   (5)

Since solving the linear system in Eq. (4) requires computing the inverse of a d × d matrix, a standard algorithm has a complexity of O(d³). To derive an efficient solution for high-dimensional data, we first reformulate Eq. (4) as a preconditioned linear system using the Jacobi method, similar to the solution presented in Section 10.2 of [48]:

M_l^{−1} (X_l X_l^⊤ + (λ/2) Π_l) w_l = M_l^{−1} X_l y_l,   (6)

where M_l = diag(X_l X_l^⊤ + (λ/2) Π_l). Although Jacobi iteration can be directly employed to solve Eq. (6), a certain condition on the matrix M_l is required to guarantee convergence. Here, we use a preconditioned conjugate gradient (PCG) algorithm, which provides better asymptotic performance in solving the linear system [48].

Algorithm 1. MTL-IRLS.


Require: X_l: data matrix of the l-th task; y_l: response of the l-th task.
1: Initialize {Π_l^0}_{l=1}^{L} with the identity matrix;
2: while not converged do
3:   for l = 1 to L do
4:     Update the matrix M_l: M_l = diag(X_l X_l^⊤ + (λ/2) Π_l^k);
5:     Solve the preconditioned linear system using PCG: M_l^{−1}(X_l X_l^⊤ + (λ/2) Π_l^k) w_l = M_l^{−1} X_l y_l;
6:     Update the weight matrix: Π_l^{k+1} = diag((1 − γ)‖w_{i·}‖₂^{−1} + γ|w_{il}|^{−1});
7:   end for
8:   Update the iteration counter: k = k + 1;
9: end while
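The steps of Algorithm 1 can be sketched compactly as follows. For clarity this sketch uses a dense solver (`np.linalg.solve`) in place of PCG, and `eps` guards the inverse weights against division by zero, as the paper notes; all names (`mtl_irls`, `eps`) are ours, assuming small dense data:

```python
import numpy as np

def mtl_irls(Xs, ys, lam=1.0, gamma=0.5, n_iter=50, eps=1e-8):
    d, L = Xs[0].shape[0], len(Xs)
    W = np.zeros((d, L))
    pi = [np.ones(d) for _ in range(L)]  # Pi^0: identity weights
    for _ in range(n_iter):
        for l in range(L):
            # Weighted least-square step of Eq. (4):
            # (X_l X_l^T + (lam/2) Pi_l) w_l = X_l y_l
            A = Xs[l] @ Xs[l].T + 0.5 * lam * np.diag(pi[l])
            W[:, l] = np.linalg.solve(A, Xs[l] @ ys[l])
        # Reweighting: (Pi_l)_ii = (1-gamma)/||w_i.||_2 + gamma/|w_il|
        row_norms = np.sqrt(np.sum(W ** 2, axis=1))
        for l in range(L):
            pi[l] = (1 - gamma) / (row_norms + eps) + gamma / (np.abs(W[:, l]) + eps)
    return W

# Sanity check: with negligible lam the solution approaches least squares.
W = mtl_irls([np.eye(2)], [np.array([1.0, 2.0])], lam=1e-6)
print(W[:, 0])  # close to [1, 2]
```

Larger values of λ drive the joint row-wise and element-wise sparsity discussed in the text, with γ shifting the emphasis between the two.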

Note that the diagonal matrix Π_l can be interpreted as a weight matrix, since it essentially assigns different weights to different feature dimensions, i.e., to each element of w_l. In addition, the




calculation of the weight matrix Π_l depends on the current w_l, which suggests an iterative algorithm to derive the optimal w_l. Specifically, in each iteration, we first use the PCG algorithm to solve a preconditioned linear system (equivalent to the weighted least square problem in Eq. (5)), and then recalculate the weight matrix Π_l. Hence, such an optimization procedure is a typical iteratively reweighted least square (IRLS) method [15]. Algorithm 1 summarizes the multi-task learning procedure via the IRLS method, namely the MTL-IRLS method. The optimization process is formed of nested loops, where the outer loop (while-loop) pursues global convergence and the inner loop (for-loop) updates each single task. The weight matrices {Π_l^0}_{l=1}^{L} are initialized as identity matrices, which give equal weight to each dimension for all the tasks.

To further analyze how the MTL-IRLS algorithm groups major tasks and identifies outlier tasks simultaneously, we first simplify the representation by letting X ∈ R^{dL×Σ_l n_l} be a block-diagonal matrix with X_l ∈ R^{d×n_l} as the l-th block. Then we define a vectorization operator "vec" over an arbitrary matrix Z ∈ R^{d×L} as vec(Z) = [z_1^⊤, …, z_l^⊤, …, z_L^⊤]^⊤, where z_l is the l-th column vector of Z. Thus a single while-loop iteration of Algorithm 1 can be viewed as solving a weighted least square problem:

min_W ‖X^⊤ vec(W) − vec(Y)‖₂² + (λ/2)‖Π^{1/2} vec(W)‖₂²,   (7)

where Π ∈ R^{dL×dL} is a concatenated diagonal matrix with Π_l as the l-th block. Recalling the definition of Π_l in Eq. (4), the (ld+i)-th diagonal element of Π is computed as:

Π_{ld+i, ld+i} = (1 − γ)‖w_{i·}‖₂^{−1} + γ|w_{il}|^{−1},

which indicates the weight of the i-th feature for the l-th task. Note that a small positive number is added to the denominators to avoid numerical problems. We simply assume a balanced case with equal emphasis on row and element-wise sparsity. Note that if both ‖w_{i·}‖₂ and |w_{il}| are small, the value Π_{ld+i, ld+i} becomes large. Thus, it imposes a heavy penalty on the i-th feature of the l-th task. As a result, the value of w_{il} becomes even smaller after each iteration of the updates. This indeed helps maintain both group sparsity and element-wise sparsity, since the i-th feature is not chosen by either the grouped tasks or the outlier tasks. On the other hand, large values of both terms will make Π_{ld+i, ld+i} small and impose a slight penalty that encourages the increase of w_{il} after each iteration. It is clear that the iterative algorithm helps recover the group structure to which the l-th task belongs.

The other two more complicated cases are that the i-th feature is relevant only to the l-th task while irrelevant to all the other tasks, or the i-th feature is irrelevant only to the l-th task but relevant to all the others. In both cases, the l-th task will be identified as an outlier. If |w_{il}| is small and ‖w_{i·}‖₂ is large, the current task considers this feature irrelevant while the other tasks consider it important. Apparently, the penalty Π_{ld+i, ld+i} will become large and encourage the updated w_{il} to become smaller. Thus, it further helps identify outlier tasks. If the i-th feature is relevant to the l-th task while irrelevant to the others, the value of |w_{il}| is large. However, ‖w_{i·}‖₂ will not be very small, since it also counts the values of the other entries w_{ij} and satisfies ‖w_{i·}‖₂ = (Σ_{j=1}^{L} w_{ij}²)^{1/2} ≥ |w_{il}|. Hence, the value of w_{il} will remain fairly large after each iteration, which means that the element-wise sparsity is still preserved. In summary, the iterative reweighting scheme in Algorithm 1 also provides a unique power for identifying the outlier tasks.

3.3. Optimization via accelerated proximal gradient method

Now we propose to use the accelerated proximal gradient (APG) method [36] to minimize the above convex cost function with joint ℓ1,1/ℓ2,1-norm regularization. Due to the existence of the non-smooth


term, one tends to use a smoothing technique [37] to substitute the non-smooth part with its smooth approximation [14]. In this paper, we instead directly and efficiently solve the proximal operator of a convex combination of the ℓ2,1 norm and the ℓ1,1 norm. We first rewrite the cost function in Eq. (3) as the sum of two components:

L(W) = g(W) + h(W),   (8)

with the smooth part

g(W) = Σ_{l=1}^{L} ‖X_l^⊤ w_l − y_l‖₂²,

and the non-smooth part

h(W) = λ1‖W‖_{2,1} + λ2‖W‖_{1,1}.   (9)

For the smooth part g(W), we denote the gradient as ∇_W g(W) = [∇_{w_1} g(W), …, ∇_{w_L} g(W)], where ∇_{w_l} g(W) is the gradient with respect to the variable w_l:

∇_{w_l} g(W) = X_l (X_l^⊤ w_l − y_l).

Moreover, ∇_W g(W) is Lipschitz continuous with the Lipschitz constant Lip = max_l λ_max(X_l X_l^⊤), where λ_max(X_l X_l^⊤) is the largest eigenvalue of the matrix X_l X_l^⊤ [37]. Then we approximate the smooth function g(W) at the point W_0 using a Taylor series expansion:

g(W) ≈ g(W_0) + ⟨∇_{W_0} g(W_0), W − W_0⟩ + (Lip/2)‖W − W_0‖_F²,   (10)

where ‖·‖_F denotes the Frobenius norm of a matrix. The above approximation of g(W) consists of two terms: the first-order Taylor expansion of the smooth function g(·) at the point W_0, i.e., g(W_0) + ⟨∇_{W_0} g(W_0), W − W_0⟩, and the strongly convex regularization term (Lip/2)‖W − W_0‖_F². Hence, substituting the smooth part g(W) by Eq. (10), we optimize the approximation of the cost function L(W) as follows:

W = arg min_W L(W) = arg min_W g(W) + h(W)
  ≈ arg min_W g(W_0) + ⟨∇_{W_0} g(W_0), W − W_0⟩ + (Lip/2)‖W − W_0‖_F² + h(W)
  = arg min_W (Lip/2)‖W − (W_0 − (1/Lip)∇_{W_0} g(W_0))‖_F² + h(W),   (11)

where the last line of the above equation is obtained by incorporating the inner product into the squared term.

Given a function h(·) and a parameter Lip, the proximal operator Prox_{h,Lip}(·) is defined as [33]:

Prox_{h,Lip}(V) = arg min_W (Lip/2)‖W − V‖_F² + h(W).   (12)

Using the above definition of the proximal operator, the approximate solution of Eq. (11) can be written as:

W = Prox_{h,Lip}(W_0 − (1/Lip)∇_{W_0} g(W_0)).   (13)

Note that the proximal gradient method in Eq. (13) indeed suggests solving the minimization problem in Eq. (8) by the following iterates:

W_k = Prox_{h,Lip_k}(W_{k−1} − (1/Lip_k)∇_{W_{k−1}} g(W_{k−1})),   (14)

where Lip_k is a sequence of nonnegative scalars determined by line search. Hence, we obtain an iterative solution of the proposed minimization problem in Eq. (8) using the proximal operator.

Algorithm 2. MTL-APG Method.

Require: $X_l$: data matrix of the $l$th task; $y_l$: response of the $l$th task. Initialize $U^0$ as a zero matrix.
1: while not converged do
2:   $V^k = U^{k-1} - t_k \nabla g(U^{k-1})$;
3:   Determine the step length $t_k$ by line search.
4:   Evaluate the proximal operator associated with the convex combination of the matrix $(2,1)$ norm and $(1,1)$ norm in Eq. (15) to obtain $W^k$.
5:   $U^k = W^k + \frac{k-1}{k+2}\,(W^k - W^{k-1})$;
6:   Update the iteration counter: $k = k + 1$.
7: end while

Please cite this article as: J. Pu, et al., Multiple task learning with flexible structure regularization, Neurocomputing (2015), http://dx.doi.org/10.1016/j.neucom.2015.11.029

Next, we discuss how to evaluate the proximal operator of the nonsmooth term $h(\cdot)$ in Eq. (9). Let $V^k = W^{k-1} - \frac{1}{\mathrm{Lip}_k} \nabla_{W^{k-1}} g(W^{k-1})$; the evaluation of the proximal operator of $h(\cdot)$ is then equivalent to minimizing the following cost function:

$$W^k = \arg\min_W \; \frac{\mathrm{Lip}_k}{2} \| W - V^k \|_F^2 + \lambda_1 \|W\|_{2,1} + \lambda_2 \|W\|_{1,1}. \qquad (15)$$

The evaluation of such a proximal operator remains difficult since it contains two nonsmooth terms. Though the smoothing technique [37] can be applied to make one of the nonsmooth terms smooth, doing so degrades the convergence rate to a suboptimal one [7]. However, we notice that the evaluation of the proximal operator in Eq. (15) and its two degraded cases all have analytic solutions, which are summarized in the following theorems.

Theorem 1. The closed-form solution to the degraded case with $\lambda_1 = 0$ is obtained by the following operation:

$$w_{ij}^k = \Big[ |v_{ij}^k| - \frac{\lambda_2}{\mathrm{Lip}_k} \Big]_+ \mathrm{sign}(v_{ij}^k),$$

where $\mathrm{sign}(\cdot)$ denotes the sign of a scalar, and $[\cdot]_+$ is an operator extracting the positive part of a scalar, defined as follows:

$$[v_{ij}]_+ = \begin{cases} 0, & v_{ij} \le 0, \\ v_{ij}, & v_{ij} > 0. \end{cases}$$

Theorem 2. The closed-form solution to the degraded case with $\lambda_2 = 0$ is obtained by the following operation:

$$w_{i\cdot}^k = \Big[ 1 - \frac{\lambda_1}{\mathrm{Lip}_k \| v_{i\cdot}^k \|_2} \Big]_+ v_{i\cdot}^k.$$

Theorem 3. The closed-form solution to the proximal problem (15) is obtained by the following operation:

$$w_{i\cdot}^k = \Bigg[ \frac{\mathrm{Lip}_k \langle u_{i\cdot}^k, v_{i\cdot}^k \rangle - \lambda_1 \| u_{i\cdot}^k \|_2 - \lambda_2 \| u_{i\cdot}^k \|_1}{\mathrm{Lip}_k \| u_{i\cdot}^k \|_2^2} \Bigg]_+ u_{i\cdot}^k,$$

where $u_{ij}^k = \big[ |v_{ij}^k| - \lambda_2 / \mathrm{Lip}_k \big]_+ \mathrm{sign}(v_{ij}^k)$.

The above solution of the non-degraded case can be viewed as a combination of the two degraded cases. In particular, it first performs $\ell_1$ thresholding, and then conducts a combined shrinkage which uses the $\ell_1$ and $\ell_2$ norms and the inner product of the vectors before and after the $\ell_1$ thresholding. Obtaining the proximal operator associated with the $\ell_{1,1}$ norm requires going over the entire matrix $V$, while obtaining the solution with the $\ell_{2,1}$ norm requires scanning $V$ twice. The combination of both the $\ell_{2,1}$ and $\ell_{1,1}$ norms requires scanning the matrix $V$ multiple times to perform the $\ell_1$ thresholding and to compute the $\ell_1$ norm, the $\ell_2$ norm, and the inner product. However, the complexity of all these proximal operators remains $O(nL)$.
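The closed forms of Theorems 1-3 can be sketched as a single routine; the helper name `prox_joint` is ours, and the final sanity checks confirm that the non-degraded formula collapses to the two degraded cases:

```python
import numpy as np

def prox_joint(V, lip, lam1, lam2):
    """Row-wise closed-form prox of lam1*||W||_{2,1} + lam2*||W||_{1,1}
    at point V with parameter lip (Theorem 3)."""
    # step 1: elementwise soft thresholding (the l_{1,1} part, Theorem 1)
    U = np.sign(V) * np.maximum(np.abs(V) - lam2 / lip, 0.0)
    W = np.zeros_like(V)
    for i in range(V.shape[0]):
        u, v = U[i], V[i]
        nrm2 = float(u @ u)
        if nrm2 == 0.0:
            continue                      # the whole row is thresholded to zero
        # step 2: combined row shrinkage using <u,v>, ||u||_2 and ||u||_1
        scale = (lip * float(u @ v) - lam1 * np.sqrt(nrm2)
                 - lam2 * float(np.abs(u).sum())) / (lip * nrm2)
        W[i] = max(scale, 0.0) * u
    return W

# sanity checks against the two degraded cases
rng = np.random.default_rng(1)
V = rng.standard_normal((6, 4)); lip = 2.0
# lam1 = 0 reduces to plain soft thresholding (Theorem 1)
W = prox_joint(V, lip, 0.0, 0.3)
assert np.allclose(W, np.sign(V) * np.maximum(np.abs(V) - 0.3 / lip, 0.0))
# lam2 = 0 reduces to row-wise group shrinkage (Theorem 2)
W = prox_joint(V, lip, 0.5, 0.0)
ref = np.maximum(1 - 0.5 / (lip * np.linalg.norm(V, axis=1)), 0.0)[:, None] * V
assert np.allclose(W, ref)
```

The single loop over rows and the elementwise thresholding make the cost linear in the number of entries of $V$, matching the complexity discussion above.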

To further accelerate convergence, we use the Accelerated Proximal Gradient (APG) method to solve the optimization problem in Eq. (8). In our setting, the function $g(\cdot)$ is smooth and strongly convex, and the proximal operator is computed efficiently by Theorem 3. According to [5], the APG algorithm converges at the rate $O(1/k^2)$. Algorithm 2 summarizes the optimization procedure using APG to derive the optimal solutions for the proposed MTL method. It is easy to see that the final algorithm is formulated as a single loop. The step length is denoted by $t_k$ instead of the Lipschitz constant $\mathrm{Lip}_k$, as it is determined by line search [6].
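A minimal sketch of Algorithm 2, assuming a fixed step $1/\mathrm{Lip}$ in place of the paper's line search (the function names `prox` and `mtl_apg` are ours, and the explicit factor 2 in the gradient and Lipschitz constant comes from differentiating the squared loss):

```python
import numpy as np

def prox(V, lip, lam1, lam2):
    # closed-form prox of lam1*||.||_{2,1} + lam2*||.||_{1,1} (Theorem 3),
    # applied to the rows of V; vectorized over rows
    U = np.sign(V) * np.maximum(np.abs(V) - lam2 / lip, 0.0)
    n2 = (U * U).sum(axis=1)
    num = lip * (U * V).sum(axis=1) - lam1 * np.sqrt(n2) - lam2 * np.abs(U).sum(axis=1)
    safe = np.where(n2 > 0.0, n2, 1.0)
    scale = np.where(n2 > 0.0, np.maximum(num, 0.0) / (lip * safe), 0.0)
    return scale[:, None] * U

def mtl_apg(Xs, ys, lam1, lam2, iters=300):
    """Sketch of Algorithm 2 (MTL-APG). Xs[l] is d x n_l, ys[l] has length n_l."""
    d, L = Xs[0].shape[0], len(Xs)
    # Lipschitz constant of the gradient of sum_l ||X_l^T w_l - y_l||^2
    lip = 2.0 * max(np.linalg.eigvalsh(X @ X.T).max() for X in Xs)
    W = np.zeros((d, L))
    U = W.copy()
    for k in range(1, iters + 1):
        G = np.column_stack([2.0 * Xs[l] @ (Xs[l].T @ U[:, l] - ys[l])
                             for l in range(L)])       # gradient at U
        W_new = prox(U - G / lip, lip, lam1, lam2)     # proximal step, Eq. (15)
        U = W_new + (k - 1.0) / (k + 2.0) * (W_new - W)  # Nesterov momentum
        W = W_new
    return W
```

The momentum weight $(k-1)/(k+2)$ mirrors step 5 of Algorithm 2 and is what yields the $O(1/k^2)$ rate.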

3.4. Relationship between MTL-IRLS and MTL-APG

MTL-IRLS and MTL-APG are closely related, as both belong to the family of operator splitting methods [30,16,11]. As discussed in [6], the proximal gradient method can be viewed as a forward-backward splitting method. Recalling the stationary point condition for our optimization problem (3), it can be written as:

$$\big( 2t X_l X_l^\top w_l^* - 2t X_l y_l - w_l^* \big) + \big( \lambda t \Pi_l w_l^* + w_l^* \big) = 0,$$

where the constant $t$ satisfies $t > 0$. We then obtain the optimal solution, defined in an iterative form:

$$w_l^* = (I + \lambda t \Pi_l)^{-1} \big( w_l^* - 2t (X_l X_l^\top w_l^* - X_l y_l) \big). \qquad (16)$$

As stated in [6],

$$\mathrm{Prox}_{t,g}(w^*) = (I + \lambda t \Pi_l)^{-1}$$

is the proximal operator, and $w_l^* - 2t (X_l X_l^\top w_l^* - X_l y_l)$ is a gradient descent step with step length $t$.

Considering the IRLS method, we split the objective function into a linear part and a nonlinear part:

$$\mathcal{L}(W) = \tilde{g}(W) + \tilde{h}(W), \qquad (17)$$

where the nonlinear part is

$$\tilde{g}(W) = \sum_{l=1}^{L} w_l^\top X_l X_l^\top w_l + \lambda \mathcal{J}(W),$$

and the linear part is

$$\tilde{h}(W) = \sum_{l=1}^{L} \big( -2 y_l^\top X_l^\top w_l + y_l^\top y_l \big).$$

The optimality condition for minimizing the objective function (17) is written as:

$$\nabla \tilde{h} + \partial \tilde{g}(W) = 0.$$

We have

$$W^* = - \Big( 2 X X^\top + \lambda \, \frac{\partial \mathcal{J}(W^*)}{W^*} \Big)^{-1} \nabla \tilde{h}. \qquad (18)$$

Comparing the iterative forms in Eqs. (16) and (18), the main effort of both methods is to evaluate an inversion: the APG method computes the proximal operator, while the IRLS method computes the solution of a weighted least squares problem. In general, the APG method is designed for simple penalties whose proximal operators admit analytical solutions, while the IRLS method is designed for simple models where pruning methods can be effectively applied. For our formulation with the joint $\ell_{1,1}/\ell_{2,1}$ regularization, though both MTL-IRLS and MTL-APG are little affected by the curse of dimensionality, the training time of MTL-APG grows in proportion to the number of tasks, which is confirmed by our empirical study in Section 6.4.

4. Convergence analysis

Note that the convergence property of the APG algorithm has been proved in earlier studies [5,7]. In this section, we provide a convergence analysis of the IRLS algorithm for multi-task learning. We start by introducing a lemma presented in [44].

Lemma 1. Given two $d$-dimensional vectors $x = [x_1, \ldots, x_d]^\top$ and $y = [y_1, \ldots, y_d]^\top$, define $E(x) = \sum_{i=1}^d |x_i|$ and $E(y) = \sum_{i=1}^d |y_i|$. Then the following inequality holds:

$$E(y) - E(x) \le \frac{1}{2} \big( y^\top C y - x^\top C x \big),$$

where $C = \mathrm{diag}\{c_{11}, \ldots, c_{ii}, \ldots, c_{dd}\}$ is a diagonal matrix with $c_{ii} = |x_i|^{-1}$.
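Lemma 1 reduces, entry by entry, to the AM-GM bound $|y_i| \le \frac{1}{2}(y_i^2/|x_i| + |x_i|)$. The following hedged sketch (random vectors, nonzero $x_i$ assumed) checks the inequality numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    x = rng.standard_normal(8)
    y = rng.standard_normal(8)
    C = np.diag(1.0 / np.abs(x))           # c_ii = |x_i|^{-1}, x_i != 0 assumed
    lhs = float(np.abs(y).sum() - np.abs(x).sum())
    rhs = 0.5 * float(y @ C @ y - x @ C @ x)
    assert lhs <= rhs + 1e-9               # Lemma 1 holds on every sample
```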

We will use the above lemma to prove the convergence of the MTL-IRLS algorithm. Since the cost function $\mathcal{L}(W)$ defined in Eq. (3) is clearly lower bounded by $\mathcal{L}(W) \ge 0$, we only need to show that the cost function decreases monotonically.

Theorem 4. In the proposed MTL-IRLS algorithm, the regularized cost function $\mathcal{L}(W)$ decreases monotonically, i.e., $\mathcal{L}(W^{k+1}) \le \mathcal{L}(W^k)$.

Proof 1. We have shown that the proposed algorithm is equivalent to solving a weighted least squares problem, as shown in Eq. (7). Decompose the diagonal matrix $\Pi$ into two diagonal matrices, $\Pi = \Phi + \Psi$ ($\Phi, \Psi \in \mathbb{R}^{dL \times dL}$), where the diagonal elements of $\Phi$ and $\Psi$ are defined as:

$$\Phi_{ld+i,\,ld+i} = (1-\gamma) \| w_{i\cdot} \|_2^{-1}, \qquad \Psi_{ld+i,\,ld+i} = \gamma \, | w_{il} |^{-1}, \quad l = 1, \ldots, L.$$

Let us denote a column vector $e = [e_1, \ldots, e_i, \ldots, e_d]^\top$ with elements $e_i = \| w_{i\cdot} \|_2$, and a diagonal matrix $\bar{\Phi} = \mathrm{diag}\{\bar{\Phi}_{11}, \ldots, \bar{\Phi}_{ii}, \ldots, \bar{\Phi}_{dd}\}$ with $\bar{\Phi}_{ii} = (1-\gamma) \| w_{i\cdot} \|_2^{-1}$. Then we can rewrite the weighted $\ell_2$ penalty term of Eq. (7) as:

$$\| \Pi^{1/2} \mathrm{vec}(W) \|_2^2 = [\mathrm{vec}(W)]^\top \Pi \, \mathrm{vec}(W) = [\mathrm{vec}(W)]^\top \Phi \, \mathrm{vec}(W) + [\mathrm{vec}(W)]^\top \Psi \, \mathrm{vec}(W) = e^\top \bar{\Phi} e + [\mathrm{vec}(W)]^\top \Psi \, \mathrm{vec}(W).$$

For simplicity, we denote a $dL$-dimensional vector $v = \mathrm{vec}(W)$. Recalling Eq. (2) and using Lemma 1, we can connect the joint $\ell_{1,1}/\ell_{2,1}$-norm $\mathcal{J}(W)$ with the weighted $\ell_2$ penalty:

$$\begin{aligned}
\mathcal{J}(W^{k+1}) - \mathcal{J}(W^k)
&= (1-\gamma) \Big( \sum_i \| w_{i\cdot}^{k+1} \|_2 - \sum_i \| w_{i\cdot}^k \|_2 \Big) + \gamma \Big( \sum_i | v_i^{k+1} | - \sum_i | v_i^k | \Big) \\
&\le \frac{1}{2} \Big( (e^{k+1})^\top \bar{\Phi}^k e^{k+1} - (e^k)^\top \bar{\Phi}^k e^k \Big) + \frac{1}{2} \Big( (v^{k+1})^\top \Psi^k v^{k+1} - (v^k)^\top \Psi^k v^k \Big) \\
&= \frac{1}{2} \Big( (v^{k+1})^\top \Phi^k v^{k+1} - (v^k)^\top \Phi^k v^k \Big) + \frac{1}{2} \Big( (v^{k+1})^\top \Psi^k v^{k+1} - (v^k)^\top \Psi^k v^k \Big) \\
&= \frac{1}{2} \Big( \| (\Pi^k)^{1/2} \mathrm{vec}(W^{k+1}) \|_2^2 - \| (\Pi^k)^{1/2} \mathrm{vec}(W^k) \|_2^2 \Big).
\end{aligned}$$

Adding the same quadratic empirical loss term to both sides of the above inequality yields $\mathcal{L}(W^{k+1}) - \mathcal{L}(W^k) \le \mathcal{G}(W^{k+1}) - \mathcal{G}(W^k)$. Since the IRLS algorithm guarantees $\mathcal{G}(W^{k+1}) - \mathcal{G}(W^k) < 0$, it follows that $\mathcal{L}(W^{k+1}) - \mathcal{L}(W^k) \le 0$, which proves the monotonicity of the cost function. Finally, since the cost function is lower bounded by zero, the convergence of the proposed MTL-IRLS algorithm is guaranteed. □

5. Performance bound

To further explore the performance bound, we define two index sets $\mathcal{I}(W)$ and $\mathcal{I}^c(W)$ for the nonzero and zero rows of a matrix $W$, respectively:

$$\mathcal{I}(W) = \{ i : \| w_{i\cdot} \|_1 \ne 0 \}, \qquad \mathcal{I}^c(W) = \{ i : \| w_{i\cdot} \|_1 = 0 \}.$$

The following three lemmas, i.e., Lemmas 2, 3, and 4, are introduced in [20].


Lemma 2. For any matrix pair $W, \bar{W}$, we have the following inequalities:

$$\| W - \bar{W} \|_{2,1} + \| W \|_{2,1} - \| \bar{W} \|_{2,1} \le 2 \| (W - \bar{W})_{\mathcal{I}(\bar{W})} \|_{2,1},$$

and

$$\| W - \bar{W} \|_{1,1} + \| W \|_{1,1} - \| \bar{W} \|_{1,1} \le 2 \| (W - \bar{W})_{\mathcal{I}(\bar{W})} \|_{1,1}.$$

Lemma 3. Let $\delta_i$ be i.i.d. random variables with $\delta_i \sim \mathcal{N}(0, \sigma^2)$, and $\sum_i \alpha_i^2 = 1$. Then

$$v = \frac{1}{\sigma} \sum_i \alpha_i \delta_i$$

is a standard normal random variable.

Lemma 4. Let $\chi^2(d)$ be a $\chi^2$ random variable with $d$ degrees of freedom. Then, for all $b > 0$, we have

$$\Pr\big( \chi^2(d) \le d + b \big) > 1 - \exp\Big( -\frac{1}{2} \Big( b - d \log\Big( 1 + \frac{b}{d} \Big) \Big) \Big).$$

We use the above lemmas to prove our performance bound in the following.

Theorem 5. Let $\hat{W} = [\hat{w}_1, \ldots, \hat{w}_L]$ be the optimal solution of Eq. (3) for $L \ge 2$ and $n, d \ge 1$. Assume that the regression model is given by a linear representation with Gaussian noise, i.e., $y_l = f_l + \delta_l = X_l^\top \bar{w}_l + \delta_l$, where $\delta_l$ is a noise vector with entries $\delta_{li} \sim \mathcal{N}(0, \sigma^2)$. If the data $X_l^\top$ is normalized, and we choose the regularization parameters $\lambda_1$ and $\lambda_2$ as

$$\lambda_1, \lambda_2 \ge \frac{\sigma}{nL} \sqrt{dL + b},$$

where $b$ is a positive scalar, then with probability at least $1 - \exp\big( -\frac{1}{2} \big( b - dL \log\big( 1 + \frac{b}{dL} \big) \big) \big)$, we have

$$\sum_{l=1}^{L} \frac{1}{nL} \| X_l^\top \hat{w}_l - f_l \|_2^2 \le \sum_{l=1}^{L} \frac{1}{nL} \| X_l^\top \bar{w}_l - f_l \|_2^2 + 2 \lambda_1 \| (\hat{W} - \bar{W})_{\mathcal{I}(\bar{W})} \|_{2,1} + 2 \lambda_2 \| (\hat{W} - \bar{W})_{\mathcal{I}(\bar{W})} \|_{1,1}.$$

Proof 2. As $\hat{W}$ is the optimal solution of Eq. (3), we have

$$\sum_{l=1}^{L} \frac{1}{nL} \| X_l^\top \hat{w}_l - y_l \|_2^2 \le \sum_{l=1}^{L} \frac{1}{nL} \| X_l^\top \bar{w}_l - y_l \|_2^2 + \lambda_1 \big( \| \bar{W} \|_{2,1} - \| \hat{W} \|_{2,1} \big) + \lambda_2 \big( \| \bar{W} \|_{1,1} - \| \hat{W} \|_{1,1} \big).$$

According to the linear regression model with Gaussian noise, we have

$$\sum_{l=1}^{L} \frac{1}{nL} \| X_l^\top \hat{w}_l - f_l \|_2^2 \le \sum_{l=1}^{L} \frac{1}{nL} \| X_l^\top \bar{w}_l - f_l \|_2^2 + \lambda_1 \big( \| \bar{W} \|_{2,1} - \| \hat{W} \|_{2,1} \big) + \lambda_2 \big( \| \bar{W} \|_{1,1} - \| \hat{W} \|_{1,1} \big) + \frac{2}{nL} \langle Z, \hat{W} - \bar{W} \rangle, \qquad (19)$$

where $Z = [X_1 \delta_1, \ldots, X_L \delta_L] \in \mathbb{R}^{d \times L}$, and its $(i,j)$-th entry is given by $z_{ij} = \sum_{k=1}^{n} x_{ik} \delta_{kj}$.

Defining $v_{ij} = \frac{1}{\sigma} z_{ij}$, it is easy to see that the $v_{ij}$ are i.i.d. standard normal variables, $v_{ij} \sim \mathcal{N}(0,1)$ (according to Lemma 3). Thus

$$\frac{1}{\sigma^2} \| Z \|_F^2 = \sum_i \sum_j v_{ij}^2$$

is a $\chi^2$ random variable with $dL$ degrees of freedom. Based on Lemma 4, we have

$$\Pr\Big( \frac{1}{nL} \| Z \|_F \le \alpha \Big) \ge 1 - \exp\Big( -\frac{1}{2} \Big( b - dL \log\Big( 1 + \frac{b}{dL} \Big) \Big) \Big),$$

where $\alpha \ge \frac{\sigma}{nL} \sqrt{dL + b}$.

Thus, with probability at least $1 - \exp\big( -\frac{1}{2} \big( b - dL \log\big( 1 + \frac{b}{dL} \big) \big) \big)$, we have

$$\frac{2}{nL} \langle Z, \hat{W} - \bar{W} \rangle \le \frac{2}{nL} \| Z \|_F \| \hat{W} - \bar{W} \|_F \le 2 \alpha \| \hat{W} - \bar{W} \|_F \le \alpha \| \hat{W} - \bar{W} \|_{2,1} + \alpha \| \hat{W} - \bar{W} \|_{1,1}.$$

Let $\lambda_1, \lambda_2 \ge \frac{\sigma}{nL} \sqrt{dL + b}$; then

$$\frac{2}{nL} \langle Z, \hat{W} - \bar{W} \rangle \le \lambda_1 \| \hat{W} - \bar{W} \|_{2,1} + \lambda_2 \| \hat{W} - \bar{W} \|_{1,1}.$$

Substituting the above inequality into inequality (19) and using Lemma 2, we verify the theorem. □

Now we make the following assumption about the training data and the weight matrix, which is a generalized case of the restricted eigenvalue assumption.

Assumption 1. For a matrix $\Gamma \in \mathbb{R}^{d \times L}$, let $s \le d$. We assume that there exists a constant $\kappa(s)$ such that

$$\kappa(s) = \min_{\Gamma \in \mathcal{R}(s)} \frac{\| X^\top \mathrm{vec}(\Gamma) \|}{\sqrt{nL} \, \| \Gamma_{\mathcal{I}(\bar{W})} \|_F} > 0,$$

where the restricted set $\mathcal{R}(s)$ is defined as

$$\mathcal{R}(s) = \big\{ \Gamma \in \mathbb{R}^{d \times L} : \Gamma \ne 0, \; |\mathcal{I}(\bar{W})| \le s, \; \| \Gamma_{\mathcal{I}^c(\bar{W})} \|_{2,1} \le \alpha_1 \| \Gamma_{\mathcal{I}(\bar{W})} \|_{2,1}, \; \| \Gamma_{\mathcal{I}^c(\bar{W})} \|_{1,1} \le \alpha_2 \| \Gamma_{\mathcal{I}(\bar{W})} \|_{1,1} \big\},$$

and $|\mathcal{I}|$ counts the number of elements in the set $\mathcal{I}$. Note that Assumption 1 is similar to the assumptions made by previous works on analyzing the performance bounds of MTL [13,20]. The following theorem gives a bound measuring how well our proposed method can approximate the true matrix defined in Eq. (3).

Theorem 6. Let $\hat{W}$ be the optimal solution of Eq. (3) for $L \ge 2$ and $n, d \ge 1$, and let $W^*$ be the oracle solution. The regularization parameters $\lambda_1$ and $\lambda_2$ are chosen as

$$\lambda_1, \lambda_2 \ge \frac{\sigma}{nL} \sqrt{dL + b},$$

where $b$ is a positive scalar. Then, under the above assumption, the following results hold with probability at least $1 - \exp\big( -\frac{1}{2} \big( b - dL \log\big( 1 + \frac{b}{dL} \big) \big) \big)$:

$$\frac{1}{nL} \| X^\top \mathrm{vec}(\hat{W}) - \mathrm{vec}(F) \|^2 \le \frac{1}{\kappa^2(s)} \big( 2 \lambda_1 \sqrt{s} + 2 \lambda_2 \sqrt{sL} \big)^2,$$

$$\| \hat{W} - W^* \|_{2,1} \le \frac{(\alpha_1 + 1)\, s}{\kappa^2(s)} \big( 2 \lambda_1 + 2 \lambda_2 \sqrt{L} \big),$$

$$\| \hat{W} - W^* \|_{1,1} \le \frac{(\alpha_2 + 1)\, s}{\kappa^2(s)} \big( 2 \lambda_1 \sqrt{L} + 2 \lambda_2 L \big).$$

Proof 3. Using Theorem 5 and setting $\bar{W} = W^*$, we have

$$\frac{1}{nL} \| X^\top \mathrm{vec}(\hat{W}) - \mathrm{vec}(F) \|^2 \le 2 \lambda_1 \| (\hat{W} - W^*)_{\mathcal{I}(W^*)} \|_{2,1} + 2 \lambda_2 \| (\hat{W} - W^*)_{\mathcal{I}(W^*)} \|_{1,1}.$$

Under Assumption 1, we have

$$\| (\hat{W} - W^*)_{\mathcal{I}(W^*)} \|_{2,1} \le \sqrt{s} \, \| (\hat{W} - W^*)_{\mathcal{I}(W^*)} \|_F \le \frac{\sqrt{s}}{\kappa(s) \sqrt{nL}} \| X^\top \mathrm{vec}(\hat{W}) - \mathrm{vec}(F) \|,$$

and

$$\| (\hat{W} - W^*)_{\mathcal{I}(W^*)} \|_{1,1} \le \sqrt{sL} \, \| (\hat{W} - W^*)_{\mathcal{I}(W^*)} \|_F \le \frac{\sqrt{sL}}{\kappa(s) \sqrt{nL}} \| X^\top \mathrm{vec}(\hat{W}) - \mathrm{vec}(F) \|.$$

Thus, we obtain

$$\| X^\top \mathrm{vec}(\hat{W}) - \mathrm{vec}(F) \| \le \frac{\sqrt{nL}}{\kappa(s)} \big( 2 \lambda_1 \sqrt{s} + 2 \lambda_2 \sqrt{sL} \big).$$

Note that Assumption 1 gives

$$\| \Gamma_{\mathcal{I}^c(\bar{W})} \|_{2,1} \le \alpha_1 \| \Gamma_{\mathcal{I}(\bar{W})} \|_{2,1}, \qquad \| \Gamma_{\mathcal{I}^c(\bar{W})} \|_{1,1} \le \alpha_2 \| \Gamma_{\mathcal{I}(\bar{W})} \|_{1,1}.$$

Setting $\Gamma = \hat{W} - W^*$, we obtain

$$\| \hat{W} - W^* \|_{2,1} \le (\alpha_1 + 1) \| (\hat{W} - W^*)_{\mathcal{I}(W^*)} \|_{2,1}, \qquad \| \hat{W} - W^* \|_{1,1} \le (\alpha_2 + 1) \| (\hat{W} - W^*)_{\mathcal{I}(W^*)} \|_{1,1}.$$

Finally, we can derive

$$\| \hat{W} - W^* \|_{2,1} \le \frac{(\alpha_1 + 1)\, s}{\kappa^2(s)} \big( 2 \lambda_1 + 2 \lambda_2 \sqrt{L} \big), \qquad \| \hat{W} - W^* \|_{1,1} \le \frac{(\alpha_2 + 1)\, s}{\kappa^2(s)} \big( 2 \lambda_1 \sqrt{L} + 2 \lambda_2 L \big). \qquad \square$$

The above theorem provides an important theoretical guarantee for the global optimum of the optimization problem (3). More specifically, the first inequality bounds the squared data-fitting loss, while the second and third inequalities bound the deviation from the target matrix $W^*$ in terms of the $\ell_{2,1}$ norm and $\ell_{1,1}$ norm, respectively.

6. Experiments

In this section, we conduct extensive experiments on both synthetic and real data to evaluate the effectiveness of our approaches. We compare with several state-of-the-art MTL methods, including grouping-based MTL (GMTL) [23], dirty MTL (DMTL) [22], and two robust MTL methods with different regularization terms: robust MTL (RMTL) [13] and robust multi-task feature learning (rMTFL) [20]. For the compared methods, we use the code provided by the authors. To ensure a fair comparison, we partition the data into training, validation, and test sets. The parameters of each method are fine-tuned using the validation set. Performance is measured by both the normalized mean squared error (nMSE) and the averaged mean squared error (aMSE), and we report the mean and standard deviation of the errors over 10 random trials.

6.1. Synthetic data

We first describe the procedure for generating the synthetic dataset. We set the number of tasks to $L = 50$, and each task has 100 training samples with 200-dimensional features. The indices of the nonzero entries for both grouped tasks and outlier tasks are chosen independently from a discrete uniform distribution, and the values of all these nonzero entries are drawn from a standard Gaussian distribution. The data matrices $\{X_l\}_{l=1}^{L}$ are sampled from a standard Gaussian distribution. The response is computed as $y_l = X_l^\top w_l + \xi_l$, where $\xi_l$ is Gaussian noise with zero mean and variance $\sigma_l^2$ specified by a certain signal-to-noise

l wlþξl. Here ξl is a Gaussian noise with zeromean and the variance σ2l specified by a certain signal-to-noise


Fig. 3. Illustration of the learned coefficient matrices from various MTL algorithms and the ground truth matrix: (a) a major task group peppered with outlier tasks and (b) multiple groups of major tasks peppered with outlier tasks. This figure is best viewed on screen with magnification.

Table 1
The error rates of the learned target matrix W on the synthetic data (mean ± std).

Settings                Measure    DMTL              rMTFL             MTL-IRLS          MTL-APG
One group + Outliers    ℓ0 error   0.2118 ± 0.0107   0.2460 ± 0.0535   0.1939 ± 0.0050   0.0904 ± 0.0098
                        ℓ1 error   0.0200 ± 0.0008   0.0216 ± 0.0020   0.0129 ± 0.0002   0.0085 ± 0.0007
                        ℓ2 error   0.0033 ± 0.0002   0.0062 ± 0.0003   0.0015 ± 0.0001   0.0009 ± 0.0001
Two groups + Outliers   ℓ0 error   0.2147 ± 0.0182   0.2232 ± 0.0100   0.1997 ± 0.0070   0.1139 ± 0.0472
                        ℓ1 error   0.0197 ± 0.0010   0.0276 ± 0.0002   0.0136 ± 0.0004   0.0091 ± 0.0009
                        ℓ2 error   0.0032 ± 0.0002   0.0083 ± 0.0001   0.0016 ± 0.0001   0.0008 ± 0.0002

(SNR) level as

$$\sigma_l^2 = \frac{1}{n_l} \| y_l \|_2^2 \cdot 10^{-\mathrm{SNR}/10}.$$
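The generation procedure above can be sketched as follows. The support sizes `k_group` and `k_out` are illustrative assumptions (the text does not specify them), and we compute the noise variance from the noiseless response $f_l = X_l^\top w_l$ to avoid the circularity of defining $\sigma_l$ through $y_l$:

```python
import numpy as np

def make_synthetic(L=50, n=100, d=200, snr_db=30, n_group=40, n_outlier=10,
                   k_group=20, k_out=5, seed=0):
    """Hedged sketch of the Section 6.1 generator."""
    rng = np.random.default_rng(seed)
    W = np.zeros((d, L))
    # one major group of tasks sharing a common row support
    group_rows = rng.choice(d, size=k_group, replace=False)
    W[np.ix_(group_rows, np.arange(n_group))] = rng.standard_normal((k_group, n_group))
    for l in range(n_group, n_group + n_outlier):   # outlier tasks, own supports
        rows = rng.choice(d, size=k_out, replace=False)
        W[rows, l] = rng.standard_normal(k_out)
    Xs, ys = [], []
    for l in range(L):
        X = rng.standard_normal((d, n))
        f = X.T @ W[:, l]
        sigma2 = float(f @ f) / n * 10.0 ** (-snr_db / 10.0)  # SNR-matched noise
        ys.append(f + rng.normal(0.0, np.sqrt(sigma2), n))
        Xs.append(X)
    return Xs, ys, W

Xs, ys, W = make_synthetic()
assert len(Xs) == 50 and Xs[0].shape == (200, 100) and W.shape == (200, 50)
```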


We start from a simple scenario where the ground truth contains one major task group with a total of 40 tasks, plus 10 outlier tasks. In addition, we also test a more complicated case with two major task groups and 10 outlier tasks, where each major group contains 20 tasks. Fig. 3 illustrates the coefficient matrices $W$ recovered by the different methods together with the ground truth. It can be seen that all these methods can recover the major task groups to some extent. However, neither the DMTL nor the rMTFL method can identify the outlier tasks effectively, due to the limitations of their task structure assumptions. In contrast, MTL-IRLS and MTL-APG are able to recover the major task group as well as the outlier tasks. To provide a quantitative evaluation of the recovered $W$ matrix, we report the error rates using the mean $\ell_0$, $\ell_1$ and $\ell_2$ errors in Table 1. Note that the proposed MTL-IRLS and MTL-APG methods significantly outperform the other competing methods, and the MTL-APG method achieves the best recovery in all the tests, which is consistent with the qualitative observations.

Finally, we also compare the nMSE and aMSE values of the regression results for all the tested cases. In particular, we vary the SNR levels, ranging from 20 dB to 50 dB, and


Table 2
Performance comparison of various methods on the synthetic data with different noise levels (mean ± std).

Measure  SNR (dB)  GMTL              DMTL              RMTL              rMTFL             MTL-IRLS          MTL-APG
nMSE     20        1.4344 ± 0.0642   0.6169 ± 0.0244   0.5806 ± 0.0199   0.6497 ± 0.0243   0.5849 ± 0.0416   0.5639 ± 0.0363
         30        1.4803 ± 0.0645   0.6007 ± 0.0113   0.6549 ± 0.0274   0.5748 ± 0.0728   0.3541 ± 0.0199   0.3071 ± 0.0350
         40        1.1957 ± 0.0467   0.4588 ± 0.0124   0.4895 ± 0.0137   0.4189 ± 0.0640   0.3739 ± 0.0246   0.2513 ± 0.0251
         50        1.2253 ± 0.0791   0.4821 ± 0.0236   0.5211 ± 0.0217   0.4147 ± 0.0263   0.3441 ± 0.0293   0.2261 ± 0.0335
aMSE     20        0.1032 ± 0.0046   0.0444 ± 0.0019   0.0418 ± 0.0015   0.0468 ± 0.0019   0.0417 ± 0.0035   0.0402 ± 0.0031
         30        0.0927 ± 0.0032   0.0376 ± 0.0011   0.0410 ± 0.0019   0.0360 ± 0.0048   0.0287 ± 0.0018   0.0249 ± 0.0030
         40        0.0980 ± 0.0035   0.0376 ± 0.0011   0.0401 ± 0.0014   0.0344 ± 0.0058   0.0268 ± 0.0016   0.0180 ± 0.0016
         50        0.0965 ± 0.0061   0.0380 ± 0.0022   0.0411 ± 0.0022   0.0327 ± 0.0023   0.0266 ± 0.0018   0.0175 ± 0.0023

Table 3
Performance comparison of various methods in terms of nMSE and aMSE on the SARCOS dataset (mean ± std).

Measure  Training #  GMTL              DMTL              RMTL              rMTFL             MTL-IRLS          MTL-APG
nMSE     50          0.0669 ± 0.0078   0.0668 ± 0.0088   0.0721 ± 0.0081   0.0749 ± 0.0264   0.0524 ± 0.0040   0.0530 ± 0.0035
         100         0.0457 ± 0.0020   0.0474 ± 0.0036   0.0575 ± 0.0266   0.0474 ± 0.0024   0.0448 ± 0.0025   0.0444 ± 0.0024
         150         0.0402 ± 0.0012   0.0426 ± 0.0020   0.0427 ± 0.0013   0.0427 ± 0.0019   0.0385 ± 0.0012   0.0383 ± 0.0012
aMSE     50          0.0633 ± 0.0074   0.0632 ± 0.0083   0.0683 ± 0.0076   0.0709 ± 0.0250   0.0497 ± 0.0038   0.0501 ± 0.0033
         100         0.0433 ± 0.0019   0.0449 ± 0.0034   0.0544 ± 0.0252   0.0449 ± 0.0022   0.0425 ± 0.0024   0.0421 ± 0.0022
         150         0.0380 ± 0.0012   0.0403 ± 0.0019   0.0404 ± 0.0013   0.0404 ± 0.0018   0.0364 ± 0.0012   0.0363 ± 0.0012

Table 4
Performance comparison of various methods in terms of nMSE and aMSE on the School dataset (mean ± std).

Measure  Training (%)  GMTL              DMTL              RMTL              rMTFL             MTL-IRLS          MTL-APG
nMSE     10            0.8939 ± 0.0258   0.9230 ± 0.0187   0.9004 ± 0.0151   0.9184 ± 0.0178   0.8267 ± 0.0154   0.9032 ± 0.0192
         20            0.7591 ± 0.0113   0.7866 ± 0.0110   0.7773 ± 0.0080   0.7865 ± 0.0099   0.7269 ± 0.0084   0.7691 ± 0.0079
         30            0.7098 ± 0.0142   0.7397 ± 0.0114   0.7341 ± 0.0121   0.7396 ± 0.0115   0.6905 ± 0.0086   0.7127 ± 0.0146
aMSE     10            0.2458 ± 0.0071   0.2538 ± 0.0053   0.2476 ± 0.0045   0.2526 ± 0.0049   0.2287 ± 0.0041   0.2499 ± 0.0046
         20            0.2091 ± 0.0030   0.2167 ± 0.0033   0.2141 ± 0.0026   0.2167 ± 0.0028   0.2009 ± 0.0019   0.2126 ± 0.0014
         30            0.1955 ± 0.0039   0.2037 ± 0.0032   0.2022 ± 0.0035   0.2037 ± 0.0032   0.1911 ± 0.0021   0.1972 ± 0.0029


report the results in Table 2. Apparently, reducing the noise level by increasing the SNR clearly improves the performance of all the approaches. However, among all the tested cases, MTL-APG provides the best performance with the lowest estimation errors, and MTL-IRLS is the second best choice in most cases. Notice that this quantitative analysis of the estimation error is consistent with the visualization of the recovered matrices in Fig. 3.

6.2. Real data

We now evaluate the proposed method and compare with the state-of-the-art approaches using several standard benchmarks from real-world applications. Note that the selected benchmarks, including the SARCOS dataset, the School dataset, and the ADNI dataset, have been widely used in the literature for validating the performance of various MTL algorithms [31,63,25,20,62].

The first dataset used in the experiments is the SARCOS dataset, collected from an inverse dynamics prediction system for a seven degrees-of-freedom anthropomorphic robot arm. The dataset consists of 48,933 observations corresponding to 7 joint torques; each observation is described by a 21-dimensional feature vector comprising 7 joint positions, 7 joint velocities, and 7 joint accelerations. The goal is to construct mappings from each observation to the 7 joint torques. Following the setup of the

xible structure regularization, Neurocomputing (2015), http://dx.

Page 11: Multiple task learning with flexible structure regularization

Table 5
Performance comparison of various methods in terms of nMSE and aMSE on the ADNI dataset (mean ± std).

Measure  Training (%)  GMTL              DMTL              RMTL              rMTFL             MTL-IRLS          MTL-APG
nMSE     20            0.9550 ± 0.0143   0.6886 ± 0.0210   0.6677 ± 0.0166   0.6976 ± 0.0194   0.5463 ± 0.0178   0.5664 ± 0.0168
         40            0.8717 ± 0.0166   0.6486 ± 0.0179   0.6137 ± 0.0213   0.6693 ± 0.0160   0.5424 ± 0.0182   0.5436 ± 0.0079
         60            0.8131 ± 0.0269   0.6198 ± 0.0302   0.5975 ± 0.0258   0.6429 ± 0.0310   0.5290 ± 0.0308   0.5373 ± 0.0295
aMSE     20            0.0252 ± 0.0008   0.0182 ± 0.0006   0.0176 ± 0.0005   0.0184 ± 0.0007   0.0146 ± 0.0006   0.0151 ± 0.0008
         40            0.0231 ± 0.0011   0.0172 ± 0.0008   0.0162 ± 0.0008   0.0177 ± 0.0008   0.0143 ± 0.0011   0.0143 ± 0.0011
         60            0.0211 ± 0.0014   0.0161 ± 0.0013   0.0155 ± 0.0014   0.0167 ± 0.0014   0.0139 ± 0.0011   0.0141 ± 0.0012


Fig. 4. The evaluation of parameter sensitivity of the MTL-IRLS (left) and MTL-APG (right) on three datasets: (a) SARCOS, (b) School, and (c) ADNI.


existing works, we vary the size of the training sets by randomly selecting 50, 100, and 150 observations, and use 200 and 5000 observations as the validation and test sets, respectively.


In addition, we also apply all the methods to the second real dataset, i.e., the School dataset, which was obtained from the Inner London Education Authority. The dataset consists of exam scores of



Fig. 5. Evaluation of the algorithm efficiency with respect to: (a) the data dimensionality, (b) the number of tasks.


Fig. 6. Convergence analysis of (a) MTL-IRLS and (b) MTL-APG.


15,362 students from 139 secondary schools. Each student is described by 27 attributes such as year, gender, and examination scores. Following prior works, we vary the training ratio as 10%, 20%, and 30%, respectively, fix the validation ratio at 30%, and use the rest for testing.

The third dataset is the ADNI dataset from the Alzheimer's Disease Neuroimaging Initiative database. The ADNI project is a longitudinal study which provides a variety of measurements from Alzheimer's disease patients, mild cognitive impairment patients, and normal controls. The measurements include MRI scans, PET scans, CSF measurements, and cognitive scores. In our experiment, we use the structural MRI data of 675 patients. In particular, the dataset was processed using the well-known tool FreeSurfer, and the preprocessing procedure includes the following steps: remove the features with more than 1000 missing entries; remove the records with failed quality control; exclude the patients without baseline MRI records; and fill the missing entries using the average value [62]. After this procedure, the MRI features can be grouped into 5 categories: cortical thickness average, cortical thickness standard deviation, volume of cortical parcellation, volume of white matter parcellation, and surface area. We test our method on predicting the future Mini-Mental State Exam (MMSE) score at six time points (M06, M12, M18, M24, M36, and M48), following the standard settings used in [63,62]. We vary the ratio of training set


as 20%, 40%, and 60%, respectively, fix the validation ratio at 20%, and use the rest for testing.

Tables 3, 4 and 5 report the performance measured by nMSE and aMSE on the SARCOS dataset, the School dataset, and the ADNI dataset, respectively. On the SARCOS dataset, MTL-APG performs better than the other methods, and MTL-IRLS is the second best in most cases. However, on the School dataset and the ADNI dataset, MTL-IRLS performs the best in all cases and MTL-APG performs the second best in most cases. In addition, both MTL-IRLS and MTL-APG often have significantly smaller standard deviations in performance, especially when there is less training data. This indicates that the proposed methods are more robust, owing to their unique power and flexibility in exploring the complex task relationships to compensate for the lack of training samples. Finally, we also observe that among the compared methods, the task grouping based method, i.e., GMTL, outperforms the outlier task based approaches, including DMTL, RMTL, and rMTFL. This indicates that recovering the main group structure is probably more important than identifying a few outlier tasks.

6.3. Parameter sensitivity

We also analyze the parameter sensitivity of both MTL-IRLS and MTL-APG using the SARCOS, School and ADNI datasets. Two key


parameters of the proposed methods are $\gamma$ and $\lambda$. The parameter $\lambda$ controls the strength of the structure penalty, while $\gamma$ balances the weights of group sparsity and element-wise sparsity. We use 30% of the data samples for training and the rest for testing. Varying $\gamma$ in $[0:0.1:1]$ and $\lambda$ in $10^i,\ i = -2:0.5:2$, we show the performance of the proposed MTL-IRLS and MTL-APG measured by nMSE in Fig. 4. It is clear that both algorithms are fairly robust over a wide range of parameter settings: the choice $1 \le \lambda \le 10$ works well for the MTL-IRLS method and $10 \le \lambda \le 100$ for the MTL-APG method. On the other hand, though the performance of both approaches is also stable for different choices of $\gamma$, the best choice always lies in $0 < \gamma < 1$, which balances the effects of task grouping and task outliers.
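The validation grid just described can be sketched as follows; the mapping $\lambda_1 = \lambda(1-\gamma)$, $\lambda_2 = \lambda\gamma$ follows the paper's parameterization of $\mathcal{J}(W)$ in Eq. (2):

```python
import numpy as np

# gamma in [0 : 0.1 : 1] and lambda = 10^i for i = -2 : 0.5 : 2
gammas = np.arange(0.0, 1.0 + 1e-12, 0.1)
lambdas = 10.0 ** np.arange(-2.0, 2.0 + 1e-12, 0.5)
assert len(gammas) == 11 and len(lambdas) == 9

# lam1 weights the grouping (l_{2,1}) term, lam2 the outlier (l_{1,1}) term
grid = [(lam * (1.0 - g), lam * g) for g in gammas for lam in lambdas]
assert len(grid) == 99
```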

6.4. Algorithm efficiency and convergence analysis

We also analyze the efficiency of the proposed methods (MTL-IRLS and MTL-APG) and compare with the other competing methods on the synthetic dataset. In particular, we evaluate the training time versus the feature dimensionality $d$ and the number of tasks $L$. Fig. 5 reports the experimental results, where the MTL-IRLS algorithm is in general the fastest to train and its efficiency is also the least sensitive to the settings. In particular, Fig. 5(a) shows the training time with respect to the feature dimensionality, ranging from 200 to 1200, where the efficiency of both the MTL-IRLS and MTL-APG methods is little affected by the increase in feature dimensionality. By contrast, the training time of the DMTL and rMTFL methods grows roughly linearly. In Fig. 5(b), the training cost with different numbers of tasks is evaluated, where we set the range as $20 \le L \le 640$. Note that as the number of tasks increases, the training time of MTL-IRLS is still little affected, while the training cost of MTL-APG increases significantly due to the time spent calculating the proximal operators. These empirical results confirm the theoretical analysis and the comparison of the MTL-IRLS and MTL-APG methods in Section 3.

The convergence analysis of the proposed methods is performed on the School dataset. The split ratios of the training, validation and test sets are 30%, 30% and 40%, respectively. As stated above, we use cross validation to choose all the parameters, and nMSE is used to measure the training and test errors. The mean nMSE curves over 20 random trials are shown in Fig. 6. Note that MTL-IRLS converges within five iterations of reweighting, and MTL-APG converges within 800 iterations of projection.

7. Conclusion

In this paper, we have presented a generic MTL framework with flexible structure regularization. Unlike existing methods such as the decomposition models, we directly impose a regularization term with a convex mixture of structure and outlier penalties, i.e., the joint $\ell_{1,1}/\ell_{2,1}$-norm regularization, which leads to a flexible yet robust formulation. To efficiently minimize the cost function, we propose two efficient algorithms, namely MTL-IRLS and MTL-APG, to learn the target model under a complex setting with both task grouping and task outliers. Besides analyzing the theoretical relatedness of the two solutions, we also provide rigorous proofs of the convergence property and the performance bound. Experiments on both synthetic and real data, and the comparison with several representative methods, have verified the superior performance of our methods. One interesting future direction is to deploy the proposed methods in more challenging real-world applications such as fine-grained object categorization.


Appendix

In the following, we prove Theorem 3 of Section 3.3. We omit the superscript $k$ for clarity in this section, and reformulate the optimization problem in Eq. (15) as

$$\begin{aligned}
W &= \arg\min_W \Big( \frac{\mathrm{Lip}}{2} \| W - V \|_F^2 + \lambda_1 \| W \|_{2,1} + \lambda_2 \| W \|_{1,1} \Big) \\
&= \arg\min_W \Big( \frac{\mathrm{Lip}}{2} \sum_i \| w_{i\cdot} - v_{i\cdot} \|_2^2 + \lambda_1 \sum_i \| w_{i\cdot} \|_2 + \lambda_2 \sum_i \| w_{i\cdot} \|_1 \Big) \\
&= \sum_i \arg\min_{w_{i\cdot}} \Big( \frac{\mathrm{Lip}}{2} \| w_{i\cdot} - v_{i\cdot} \|_2^2 + \lambda_1 \| w_{i\cdot} \|_2 + \lambda_2 \| w_{i\cdot} \|_1 \Big).
\end{aligned}$$

Thus, the minimization problem over the matrix has been split into a set of independent minimization problems over its rows. Once each row-wise problem has been solved, simply stacking the solutions forms the solution of the original problem. Therefore, we only consider the minimization problem over a single row.

Proof of Theorem 3.

Proof. The non-degenerate case is formulated as follows:

$$
w_{i\cdot} = \arg\min_{w_{i\cdot}}\ \left\{ \frac{L}{2}\|w_{i\cdot}-v_{i\cdot}\|_2^2 + \lambda_1\|w_{i\cdot}\|_2 + \lambda_2\|w_{i\cdot}\|_1 \right\}. \qquad (20)
$$

The optimality condition for the above unconstrained nonsmooth problem is that

$$
\left( L + \frac{\lambda_1}{\|w_{i\cdot}\|_2} \right) w_{ij} - L v_{ij} + \lambda_2 \xi = 0,
$$

where $\xi \in \partial |w^*_{ij}|$. Then we discuss the solution in three cases:

- If $|v_{ij}| \le \frac{\lambda_2}{L}$, then $w^*_{ij} = 0$ and $\xi = \frac{L}{\lambda_2} v_{ij}$.
- If $v_{ij} > \frac{\lambda_2}{L} > 0$, then $w^*_{ij} > 0$ and $\xi = 1$. More specifically,
$$
w^*_{ij} = \left( L + \frac{\lambda_1}{\|w_{i\cdot}\|_2} \right)^{-1} L \left( v_{ij} - \frac{\lambda_2}{L} \right).
$$
- If $v_{ij} < -\frac{\lambda_2}{L} < 0$, then $w^*_{ij} < 0$ and $\xi = -1$. More specifically,
$$
w^*_{ij} = \left( L + \frac{\lambda_1}{\|w_{i\cdot}\|_2} \right)^{-1} L \left( v_{ij} + \frac{\lambda_2}{L} \right).
$$

Then we can summarize the solution as
$$
w^*_{ij} =
\begin{cases}
0, & \text{if } |v_{ij}| \le \frac{\lambda_2}{L};\\[4pt]
\left( L + \frac{\lambda_1}{\|w_{i\cdot}\|_2} \right)^{-1} L \left( |v_{ij}| - \frac{\lambda_2}{L} \right) \operatorname{sign}(v_{ij}), & \text{otherwise}.
\end{cases}
\qquad (21)
$$

As the first case is already given in analytic form, we only consider the second case with $|v_{ij}| > \frac{\lambda_2}{L}$. Denoting $u_{ij} = \left( |v_{ij}| - \frac{\lambda_2}{L} \right)\operatorname{sign}(v_{ij})$, the optimal solution can be formulated as a linear representation of $u_{ij}$:
$$
w^*_{ij} = c_i u_{ij}, \qquad (22)
$$
where $c_i = \left( L + \frac{\lambda_1}{\|w_{i\cdot}\|_2} \right)^{-1} L$ is a nonnegative scalar.

By substituting the linear representation (22) into the original minimization problem (20), we have
$$
\min_{c_i}\ \left\{ \frac{L}{2}\|c_i u_{i\cdot} - v_{i\cdot}\|_2^2 + \lambda_1\|c_i u_{i\cdot}\|_2 + \lambda_2\|c_i u_{i\cdot}\|_1 \right\}.
$$

Solving the above unconstrained and smooth minimization over the nonnegative scalar $c_i$ and substituting the optimal solution into the linear representation (22), we obtain

$$
w^*_{ij} = \left( \frac{L\langle u_{i\cdot}, v_{i\cdot}\rangle - \lambda_1\|u_{i\cdot}\|_2 - \lambda_2\|u_{i\cdot}\|_1}{L\|u_{i\cdot}\|_2^2} \right)_{\!+} u_{ij}.
$$

Combining the above solution with the first case in Eq. (21), we have

$$
w^*_{ij} =
\begin{cases}
0, & \text{if } |v_{ij}| \le \frac{\lambda_2}{L};\\[4pt]
\left( \frac{L\langle u_{i\cdot}, v_{i\cdot}\rangle - \lambda_1\|u_{i\cdot}\|_2 - \lambda_2\|u_{i\cdot}\|_1}{L\|u_{i\cdot}\|_2^2} \right)_{\!+} u_{ij}, & \text{otherwise}.
\end{cases}
$$
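This closed form can be sketched in code: element-wise soft-thresholding by $\lambda_2/L$ handles the $\ell_{1,1}$ term, and the nonnegative bracketed factor, which algebraically simplifies to $(1 - \lambda_1/(L\|u_{i\cdot}\|_2))_+$, applies the row-wise $\ell_{2,1}$ shrinkage. A minimal NumPy sketch under those observations (the function names `prox_row` and `prox_matrix` are illustrative, not from the paper):

```python
import numpy as np

def prox_row(v, L, lam1, lam2):
    """Closed-form minimizer of (L/2)||w - v||_2^2 + lam1*||w||_2 + lam2*||w||_1
    for a single row v, following the case analysis of Theorem 3."""
    # Soft-thresholding: u_ij = (|v_ij| - lam2/L)_+ * sign(v_ij)
    u = np.sign(v) * np.maximum(np.abs(v) - lam2 / L, 0.0)
    norm_u = np.linalg.norm(u)
    if norm_u == 0.0:
        # First case of Eq. (21): every entry satisfies |v_ij| <= lam2/L
        return np.zeros_like(v)
    # Row-wise shrinkage; the factor equals
    # (L<u,v> - lam1*||u||_2 - lam2*||u||_1) / (L*||u||_2^2) clipped at 0
    scale = max(1.0 - lam1 / (L * norm_u), 0.0)
    return scale * u

def prox_matrix(V, L, lam1, lam2):
    """Row-separable prox of lam1*||W||_{2,1} + lam2*||W||_{1,1}:
    stack the per-row solutions, as in the splitting argument above."""
    return np.vstack([prox_row(V[i], L, lam1, lam2) for i in range(V.shape[0])])
```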


Jian Pu received the Ph.D. degree from Fudan University, Shanghai, in 2014. He is a postdoctoral researcher at the Institute of Neuroscience, Chinese Academy of Sciences, Shanghai. His current research interests include machine learning, computer vision, and medical image computing.


Jun Wang received the Ph.D. degree from Columbia University, NY, in 2011. Currently, he is a professor in the School of Computer Science and Software Engineering, East China Normal University, Shanghai, China, and an adjunct faculty member of Columbia University, New York, USA. He is also affiliated with the Institute of Data Science and Technology, Alibaba Group, Seattle, USA. He was a research staff member in the Business Analytics and Mathematical Sciences Department at the IBM T.J. Watson Research Center, Yorktown Heights, NY. He has been the recipient of several awards and scholarships, including the award of the "Youth 1000 Talents Plan" program in 2014 and the Jury thesis award from Columbia University in 2011. His research interests include machine learning, computer vision, and mobile intelligence.

Yu-Gang Jiang is an associate professor of computer science at Fudan University, Shanghai. His research focuses on novel algorithms and systems for big video data analysis. He is the lead architect of a few best-performing video analytic systems in the annual U.S. NIST TRECVID evaluation and the European MediaEval evaluation. His work has led to many awards, including the "emerging leader in multimedia" award from IBM T.J. Watson Research in 2009, an early career faculty award from Intel and CCF in 2013, the 2014 ACM China Rising Star Award, and the 2015 ACM SIGMM Rising Star Award. He is an associate editor of Machine Vision and Applications and has recently served as a program chair of ACM ICMR 2015. He received a Ph.D. in Computer Science at City University of Hong Kong and spent three years working at Columbia University before joining Fudan.


Xiangyang Xue received the B.S., M.S., and Ph.D. degrees in communication engineering from Xidian University, Xi'an, China, in 1989, 1992, and 1995, respectively. He joined the Department of Computer Science, Fudan University, Shanghai, China, in 1995. Since 2000, he has been a full professor. His current research interests include multimedia information processing and retrieval, pattern recognition, and machine learning. He has authored more than 100 research papers in these fields. He is an associate editor of the IEEE Transactions on Autonomous Mental Development. He is also an editorial board member of the Journal of Computer Research and Development, and the Journal of Frontiers of Computer Science and Technology.


