Neurocomputing
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/neucom
Multiple task learning with flexible structure regularization
Jian Pu a,*, Jun Wang b,c, Yu-Gang Jiang d, Xiangyang Xue d
a Institute of Neuroscience, Chinese Academy of Sciences, Shanghai, China
b School of Computer Science and Software Engineering, East China Normal University, Shanghai, China
c Institute of Data Science and Technology, Alibaba Group, Seattle, USA
d School of Computer Science, Fudan University, Shanghai, China
a r t i c l e i n f o

Article history:
Received 16 August 2015
Received in revised form 17 October 2015
Accepted 6 November 2015

Communicated by Jinhui Tang

Keywords:
Generic multiple task learning
Flexible structure regularization
Joint $\ell_{1,1}/\ell_{2,1}$-norm regularization
Iteratively reweighted least square
Accelerated proximal gradient

a b s t r a c t

Due to theoretical advances and empirical successes, Multi-task Learning (MTL) has become a popular design paradigm for training a set of tasks jointly. By exploring the hidden relationships among multiple tasks, many MTL algorithms have been developed to enhance learning performance. In general, the complicated hidden relationships can be considered as a combination of two key structural elements: task grouping and task outliers. Based on such task relationships, here we propose a generic MTL framework with flexible structure regularization, which aims to relax any type of specific structure assumption. In particular, we directly impose a joint $\ell_{1,1}/\ell_{2,1}$-norm as the regularization term to reveal the underlying task relationship in a flexible way. Such a flexible structure regularization term takes into account any convex combination of grouping and outlier structural characteristics among the multiple tasks. In order to derive efficient solutions for the generic MTL framework, we develop two algorithms, i.e., the Iteratively Reweighted Least Square (IRLS) method and the Accelerated Proximal Gradient (APG) method, with different emphases and strengths. In addition, the theoretical convergence and performance guarantees are analyzed for both algorithms. Finally, extensive experiments over both synthetic and real data, and comparisons with several state-of-the-art algorithms, demonstrate the superior performance of the proposed generic MTL method.

© 2015 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail address: [email protected] (J. Pu).
http://dx.doi.org/10.1016/j.neucom.2015.11.029
1. Introduction
Recognizing the scarcity of training data and the correlations among tasks, multiple task learning (MTL) is designed to train multiple models jointly and simultaneously, and often leads to better learnt models than those trained independently. The key idea of MTL is to explore the hidden relationships among multiple tasks to enhance learning performance. MTL has been shown to be particularly useful when there exist intrinsic relationships among the learning tasks and the training data is inadequate for each single task. Due to its empirical successes, MTL has been applied to various application domains, including social media categorization and search [12,54], fine-grained visual categorization [42], disease modeling and prediction [8,63], spam filtering [3], reinforcement learning [9] and even financial stock selection [19].
The key ingredient of MTL is to explore model commonality among the multiple learning tasks, and to use such commonality to improve the learning performance. Some earlier MTL works assume that there is a common structure or a common set of parameters shared by all the learning tasks [51,52]. However, sharing a model commonality among all the learning tasks is a fairly strong assumption, which is often invalid in real applications. Therefore, two compromised yet more realistic scenarios, i.e., task grouping and task outliers, have been explored recently. For task grouping, one assumes that the commonality only exists among tasks within the same group. During the learning process, through identifying such task groups, the unrelated tasks from different groups will not influence each other [21,23,52,55]. In the task outlier scenario [13], a robust MTL algorithm was proposed to capture the commonality for a major group of tasks while detecting the outlier tasks. A popular way to tackle the robust MTL problem is to use a decomposition framework, which forms the learning objective with a structure term and an outlier penalty term. To efficiently solve the optimization problem, the target model can be further decomposed into two components, reflecting the major group structure and the outliers [22]. Representative decomposition schemes for MTL include the low-rank structure [13] and the group sparsity based approaches [20].
Note that the aforementioned assumptions of task grouping and task outliers were considered exclusively in most of the existing works. In other words, the task grouping based methods neglected the existence of outlier tasks, and many robust MTL frameworks only assumed the case of one major task group peppered with a few outlier tasks. In this paper, we address MTL under a very general setting where multiple major task groups and outlier tasks can occur simultaneously. In particular, without decomposing the target model, we directly impose a flexible structure regularization term with a joint $\ell_{1,1}/\ell_{2,1}$-norm that reflects a mixture of structure and outlier penalties. The final objective is formulated as an unconstrained non-smooth convex problem, and two efficient algorithms, i.e., the Iteratively Reweighted Least Square (IRLS) method and the Accelerated Proximal Gradient (APG) method, are applied to derive optimal solutions with different strengths. In particular, the IRLS method can handle the learning process for a large number of tasks efficiently, while the APG method provides robust performance whether the active features are sparse or dense. In addition, we provide theoretical analysis on both the convergence and the performance bound of the proposed MTL method. Finally, empirical studies on synthetic and real benchmark datasets corroborate that the proposed MTL method clearly outperforms several state-of-the-art MTL approaches.

Fig. 1. The illustration of different target models W learned using various assumptions of task structures: (a) shared model commonality, (b) task grouping, (c) outlier tasks, and (d) generic multi-tasks. Each column of W corresponds to a single task and each row represents a feature dimension. For each element of W, white indicates zero-valued elements and gray indicates non-zero values, with the intensity indicating the magnitude of the values.
The remainder of this paper is organized as follows. Section 2 briefly reviews several major MTL schemes in the existing literature. Section 3 presents our proposed generic MTL framework and two efficient solutions. Sections 4 and 5 provide theoretical analysis of the proposed methods, including convergence properties and performance bounds. Section 6 gives experimental validations and comparative studies, and, finally, Section 7 concludes this paper.
2. Related work
Here, we first define the notations used in this paper. Then we briefly survey several major multi-task learning paradigms and summarize their strengths and weaknesses.
2.1. Notations
Assume the data is represented as a matrix $X \in \mathbb{R}^{d \times n}$, where the column vector $x_i \in \mathbb{R}^d$ is the $i$-th data point and $d$ is the dimension. In addition, we denote by $x_{i\cdot}$ the $i$-th row of $X$, which corresponds to the $i$-th feature of the data. The norm of the matrix $X$ is denoted as $\|X\|_{p,q} = \big(\sum_i \|x_{i\cdot}\|_p^q\big)^{1/q} = \big(\sum_i \big(\sum_j |x_{ij}|^p\big)^{q/p}\big)^{1/q}$. For example, $\|X\|_{2,1} = \sum_i \|x_{i\cdot}\|_2 = \sum_i \big(\sum_j x_{ij}^2\big)^{1/2}$.
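As a concrete illustration of the $(p,q)$-norm defined above, the following short sketch (our own helper, not from the paper) computes $\|X\|_{p,q}$ row by row:

```python
import numpy as np

# Illustrative helper (ours): the (p, q)-norm defined above,
# ||X||_{p,q} = (sum_i ||x_{i.}||_p^q)^(1/q), with x_{i.} the i-th row of X.
def matrix_pq_norm(X, p, q):
    row_norms = np.sum(np.abs(X) ** p, axis=1) ** (1.0 / p)   # ||x_{i.}||_p
    return np.sum(row_norms ** q) ** (1.0 / q)

X = np.array([[3.0, 4.0],
              [0.0, 5.0]])
# ||X||_{2,1} = ||(3,4)||_2 + ||(0,5)||_2 = 5 + 5
print(matrix_pq_norm(X, p=2, q=1))  # 10.0
```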
In a typical setting of multiple task regression or classification, we are given $L$ tasks associated with training data $\{(X_1, y_1), \ldots, (X_L, y_L)\}$, where $X_l \in \mathbb{R}^{d \times n_l}$ and $y_l \in \mathbb{R}^{n_l}$ are the input and response of the $l$-th task with a total of $n_l$ samples. We want to employ MTL to derive optimal prediction models for all the tasks simultaneously. In particular, for linear regression models, the prediction model for the $l$-th task is represented as $f(w_l, X_l) = X_l^\top w_l$. We then use a coefficient matrix $W = [w_1, w_2, \ldots, w_L]$ to represent all the regression tasks. The goal of MTL is to derive an optimal $W^*$ across all the learning tasks while satisfying the desired structure characteristics.
2.2. Shared model commonality
One of the most straightforward ways to design an MTL algorithm is to assume that all the tasks share certain model commonality. Typically, such commonality can be represented as common structures or parameters shared by the learned models. For instance, structure commonality includes subspace sharing [28,35,40] and feature set sharing [2,24,31,32,34,56,61,18]. In terms of parameter commonality, there is a wide range of options depending on the learning methods used, such as the hidden units in neural networks [10], kernels [17], the priors in hierarchical Bayesian models [4,49,57,58,60], the parameters in Gaussian process covariance [26,43,50], the feature mapping matrices [1], and the similarity metrics [39,59]. Fig. 1(a) demonstrates an example of feature sharing in the learned model W, where all the learning tasks select the same subset of features. By exploring various types of model commonality, either structures or parameters, simultaneously learned tasks benefit from the learning of each other. Hence, the MTL paradigm is expected to achieve better generalization performance than independently learning a prediction model for each task. However, real applications tend to involve more complicated situations, and often there is no commonly shared structure or set of parameters among all the tasks [47,38].
2.3. MTL with task grouping
Note that in many real applications of multi-task learning, the tasks gather into several groups according to their relatedness. Intuitively, the tasks in the same group are more related than the tasks in different groups. As shown in Fig. 1(b), the learning tasks form two groups, where the tasks within the same group select the same subset of features and share no common features with the tasks from the other group.
To deal with this scenario, one representative method is to use grouping matrices to model the task grouping effect explicitly, as suggested in [23]:
$$\arg\min_{W, Q} \; V(f(w_l, X_l), Y_l) + \gamma \sum_g \|W Q_g\|,$$
where $w_l$ is the $l$-th column vector of $W$ and $Q_g$ is the group assignment matrix for the $g$-th group. For a standard regression problem, the empirical loss takes the quadratic form:
$$V(f(W, X_l), y_l) = \sum_{l=1}^{L} \|X_l^\top w_l - y_l\|^2.$$
The above objective function is minimized using an alternating optimization strategy that converges to a local minimum [23]. To deal with a more complex setting with overlapping groups, one can decompose the weight matrix $W$ as [25]
$$W = M S,$$
where each column of the latent task matrix $M$ represents one task group, and $S$ is the coefficient matrix of the representation in the latent task space. The objective function is formulated as
$$\arg\min_{M, S} \; V(f(M s_l, X_l), Y_l) + \alpha \|S\|_{1,1} + \beta \|M\|_F^2.$$
The sparsity penalty on the matrix $S$ enforces that each observed task is related to only a few latent tasks. The second penalty regularizes the $\ell_2$ norm of the latent matrix and avoids overfitting. An alternating optimization strategy is employed to solve the above problem [25]. More recently, task grouping and sparse structural analysis have been combined by Li et al. [29], where the problem is solved by nonnegative spectral clustering.
2.4. MTL with outlier tasks
Another scenario in real applications of multi-task learning is that there exists a certain number of tasks which are independent of the other tasks. Hence, they are called outlier tasks. For instance, Fig. 1(c) shows a major task group peppered with two outlier tasks. Though this can be regarded as a special case of task grouping, with one big group and several groups containing only one task, such an imbalanced grouping structure is problematic for many task grouping methods.
As mentioned earlier, an intuitive yet effective way to distinguish those outlier tasks is to employ the decomposition framework to design robust MTL methods. Robust MTL assumes that the target model $W$ can be represented by the superposition of a block-structured component $P$ with row sparsity and an outlier component $Q$ with elementwise sparsity [20], as shown below:
$$W = P + Q,$$
where $P$ reveals a set of tasks with a commonly shared structure and $Q$ identifies the outlier tasks through enforcing elementwise sparsity.
Then the objective for robust MTL can be formed as the following optimization problem with the two types of sparsity regularization:
$$\arg\min_{P, Q} \; V(f(p_l, q_l, X_l), Y_l) + \alpha \|P\|_{2,1} + \beta \|Q\|_{1,1},$$
where $p_l$ and $q_l$ are the $l$-th column vectors of $P$ and $Q$, respectively. The linear prediction model is written as
$$f(w_l, X_l) = X_l^\top w_l = X_l^\top (p_l + q_l) = f(p_l, q_l, X_l).$$
The above decomposition based objective can be efficiently optimized via various techniques, such as an accelerated gradient descent method [20]. However, this formulation does not consider the case with multiple groups of tasks, whose structures cannot be simply represented by a single block-structured component. In addition, although the early work provided the error bounds for recovering the two decomposed parts $P$ and $Q$ [20], the error bound for recovering the true target model $W$ remains unknown.
3. Generic MTL via flexible structure regularization
In this section, we first describe a generic MTL formulation with a flexible structure regularization, which imposes no specific assumptions on the tasks' structure. In order to train the target models and identify the task structure from the data simultaneously, we propose two efficient algorithms, i.e., the Iteratively Reweighted Least Square (IRLS) method and the Accelerated Proximal Gradient (APG) method. Finally, we analyze the relationship between these two methods.
3.1. Structure regularization with the joint $\ell_{1,1}/\ell_{2,1}$-norm

Here we consider using a linear regression model for learning $L$ tasks simultaneously, where a single prediction model is represented as $f(X_l) = X_l^\top w_l$, $l = 1, 2, \ldots, L$. Motivated by the existing works [22,20], we formulate a minimization problem with the cost function as a regularized quadratic loss:
$$\hat{W} = \arg\min_W \mathcal{L}(W) = \arg\min_W \sum_{l=1}^{L} \|X_l^\top w_l - y_l\|_2^2 + \lambda \mathcal{J}(W). \quad (1)$$
We use $\mathcal{J}(W)$ to denote a structure regularization term with $\lambda$ as its coefficient. Without using the superposition assumption to decompose $W$ into two structure terms, we use a combination of a structure inducing norm and an outlier detecting norm, namely the joint $\ell_{1,1}/\ell_{2,1}$-norm regularization:
$$\mathcal{J}(W) = (1-\gamma)\|W\|_{2,1} + \gamma \|W\|_{1,1}, \quad (2)$$
where $\gamma \in [0, 1]$ is a constant balancing the two norms. Though sharing a similar form with the elastic net regularization [64], the above formulation uses an $\ell_{2,1}$ norm instead of a squared $\ell_{2,1}$ norm. For a non-degenerate setting with $\gamma \in (0, 1)$, we illustrate the 3D norm balls of the elastic net and the joint $\ell_{1,1}/\ell_{2,1}$ norm in Fig. 2. Compared to the elastic net ball, the joint $\ell_{1,1}/\ell_{2,1}$ norm ball has sharper corner regions, which may induce a sparser solution.

Replacing the regularization term of the cost in Eq. (1) by the definition in Eq. (2), we obtain the following regularized convex cost function:
$$\mathcal{L}(W) = \sum_{l=1}^{L} \|X_l^\top w_l - y_l\|_2^2 + \lambda_1 \|W\|_{2,1} + \lambda_2 \|W\|_{1,1}, \quad (3)$$
where the constant coefficients are absorbed as $\lambda_1 = \lambda(1-\gamma)$ and $\lambda_2 = \lambda\gamma$. Instead of decomposing the target model into a fixed combination of structure and outlier components, here we exploit a flexible structure penalty without imposing any specific assumptions. Such a formulation is more flexible in terms of handling various types of tasks, including both grouped tasks and outlier tasks. A similar cost function with a joint regularization term has been used to detect eQTLs [27], to predict memory performance [53] and to analyze fMRI data [46]. However, the convergence property of [27] is not provided, and the solution of [53] relies on iteratively solving the inverse of a Gram matrix, which is computationally infeasible for high dimensional data. Below, we present two efficient approaches to perform the training and provide theoretical analysis of their convergence properties.
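To make the cost in Eq. (3) concrete, the following sketch (our own illustrative code; variable names are ours) evaluates the regularized objective for a given coefficient matrix:

```python
import numpy as np

# Illustrative evaluation (ours) of the cost in Eq. (3):
#   L(W) = sum_l ||X_l^T w_l - y_l||_2^2 + lam1 * ||W||_{2,1} + lam2 * ||W||_{1,1},
# with lam1 = lam * (1 - gamma) and lam2 = lam * gamma as in the text.
def mtl_cost(W, Xs, ys, lam, gamma):
    loss = sum(np.sum((Xs[l].T @ W[:, l] - ys[l]) ** 2) for l in range(len(Xs)))
    l21 = np.sum(np.sqrt(np.sum(W ** 2, axis=1)))   # sum of row-wise l2 norms
    l11 = np.sum(np.abs(W))                          # elementwise l1 norm
    return loss + lam * (1 - gamma) * l21 + lam * gamma * l11

rng = np.random.default_rng(0)
d, L_tasks = 5, 3
Xs = [rng.standard_normal((d, 4)) for _ in range(L_tasks)]
ys = [rng.standard_normal(4) for _ in range(L_tasks)]
W = np.zeros((d, L_tasks))
print(mtl_cost(W, Xs, ys, lam=0.1, gamma=0.5))  # equals sum_l ||y_l||^2 at W = 0
```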
3.2. Optimization via iteratively reweighted least square method
We first apply an iteratively reweighted least square method to minimize the cost function in Eq. (3), as suggested in [41]. By adopting the factored representation for the gradient vector of the regularizer [45], we can set the partial derivatives $\frac{\partial \mathcal{L}}{\partial w_l}$ to zero to derive the optimal solution:
$$\frac{\partial \mathcal{L}}{\partial w_l} = 0 \;\Rightarrow\; 2 X_l X_l^\top w_l - 2 X_l y_l + \lambda \Pi_l w_l = 0,$$
where $\Pi_l$ is a diagonal matrix with the elements $(\Pi_l)_{ii}$ defined as:
$$(\Pi_l)_{ii} = (1-\gamma)\|w_{i\cdot}\|_2^{-1} + \gamma |w_{il}|^{-1}.$$
Apparently, the element $(\Pi_l)_{ii}$ consists of two components. The first component $\|w_{i\cdot}\|_2^{-1}$ represents the group impact, since it imposes row sparsity on the $i$-th row of $W$. The second component $|w_{il}|^{-1}$ represents the individual impact, measuring the impact of the $i$-th feature on the $l$-th task. The parameter $\gamma$ balances the contributions of these two components to the diagonal matrix $\Pi_l$. After some derivation, we obtain the following equation:
$$\Big(X_l X_l^\top + \frac{\lambda}{2} \Pi_l\Big) w_l = X_l y_l. \quad (4)$$
The above equation indicates a solution to the following weighted least square problem [15]:
$$\arg\min_{w_l} \|X_l^\top w_l - y_l\|_2^2 + \frac{\lambda}{2} \|\Pi_l^{1/2} w_l\|_2^2. \quad (5)$$

Fig. 2. The illustration of (a) the 3D elastic net ball, and (b) the joint $\ell_{1,1}/\ell_{2,1}$-norm ball, with $\gamma = 0.2, 0.5, 0.8$, respectively.
Since solving the linear system in Eq. (4) requires computing the inverse of a $d \times d$ matrix, a standard algorithm has a complexity of $O(d^3)$. To derive an efficient solution for high dimensional data, we first reformulate Eq. (4) as a preconditioned linear system using the Jacobi method, similar to the solution presented in Section 10.2 of [48]:
$$M_l^{-1}\Big(X_l X_l^\top + \frac{\lambda}{2} \Pi_l\Big) w_l = M_l^{-1} X_l y_l, \quad (6)$$
where $M_l = \mathrm{diag}\big(X_l X_l^\top + \frac{\lambda}{2} \Pi_l\big)$. Although Jacobi iteration can be directly employed to solve Eq. (6), a certain condition on the matrix $M_l$ is required to guarantee convergence. Here, we instead use a preconditioned conjugate gradient (PCG) algorithm, which provides better asymptotic performance in solving the linear system [48].
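A single preconditioned solve of Eq. (6) can be sketched as follows (our own illustration using SciPy's conjugate gradient with a Jacobi preconditioner; all variable names and problem sizes are ours):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# One inner solve of Eq. (4)/(6) (illustrative): solve
#   (X_l X_l^T + (lam/2) Pi_l) w_l = X_l y_l
# by PCG with the Jacobi preconditioner M_l = diag(X_l X_l^T + (lam/2) Pi_l).
rng = np.random.default_rng(0)
d, n_l, lam = 40, 25, 0.5
X_l = rng.standard_normal((d, n_l))
y_l = rng.standard_normal(n_l)
Pi_l = np.ones(d)                                  # initial weights (identity)

A = X_l @ X_l.T + 0.5 * lam * np.diag(Pi_l)        # SPD system matrix
b = X_l @ y_l
diag_A = np.diag(A)
M = LinearOperator((d, d), matvec=lambda v: v / diag_A)  # Jacobi preconditioner
w_l, info = cg(A, b, M=M)
print("converged:", info == 0)
```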
Algorithm 1. MTL-IRLS.

Require: $X_l$: data matrix of the $l$-th task; $y_l$: response of the $l$-th task;
1: Initialize $\{\Pi_l^0\}_{l=1}^L$ with the identity matrix;
2: while not converged do
3:   for $l = 1$ to $L$ do
4:     Update the matrix $M_l$: $M_l = \mathrm{diag}\big(X_l X_l^\top + \frac{\lambda}{2}\Pi_l^k\big)$;
5:     Solve the preconditioned linear system using PCG: $M_l^{-1}\big(X_l X_l^\top + \frac{\lambda}{2}\Pi_l^k\big) w_l = M_l^{-1} X_l y_l$;
6:     Update the weight matrix: $\Pi_l^{k+1} = \mathrm{diag}\big((1-\gamma)\|w_{i\cdot}\|_2^{-1} + \gamma|w_{il}|^{-1}\big)$;
7:   end for
8:   Update the iteration counter: $k = k+1$;
9: end while
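For illustration, a simplified dense-solver version of Algorithm 1 can be sketched as follows (our own code, not the authors'; the PCG inner solver is replaced by a direct solve of Eq. (4) for clarity, and a small eps guards the denominators against division by zero):

```python
import numpy as np

# Simplified sketch (ours) of Algorithm 1 (MTL-IRLS). The PCG inner solver
# is replaced by a dense solve of Eq. (4); eps avoids division by zero in
# the reweighting step.
def mtl_irls(Xs, ys, lam=0.1, gamma=0.5, n_iter=30, eps=1e-8):
    d, L = Xs[0].shape[0], len(Xs)
    W = np.zeros((d, L))
    Pi = [np.ones(d) for _ in range(L)]            # Pi_l^0 = identity
    for _ in range(n_iter):                        # while not converged
        for l in range(L):                         # for l = 1 to L
            A = Xs[l] @ Xs[l].T + 0.5 * lam * np.diag(Pi[l])
            W[:, l] = np.linalg.solve(A, Xs[l] @ ys[l])
        row_norms = np.sqrt(np.sum(W ** 2, axis=1))
        for l in range(L):                         # (Pi_l)_ii update
            Pi[l] = (1 - gamma) / (row_norms + eps) + gamma / (np.abs(W[:, l]) + eps)
    return W

# Tasks sharing a 2-feature support should yield a nearly row-sparse W.
rng = np.random.default_rng(0)
d, L, n = 10, 4, 50
w_true = np.zeros((d, L)); w_true[:2] = rng.standard_normal((2, L))
Xs = [rng.standard_normal((d, n)) for _ in range(L)]
ys = [Xs[l].T @ w_true[:, l] for l in range(L)]
W = mtl_irls(Xs, ys, lam=0.5, gamma=0.3)
```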
Note that the diagonal matrix $\Pi_l$ can be interpreted as a weight matrix, since it essentially assigns different weights to different feature dimensions, i.e., to each element of $w_l$. In addition, the calculation of the weight matrix $\Pi_l$ depends on the current $w_l$, which suggests an iterative algorithm to derive the optimal $w_l$. Specifically, in each iteration, we first use the PCG algorithm to solve a preconditioned linear system (equivalent to the weighted least square problem in Eq. (5)), and then recalculate the weight matrix $\Pi_l$. Hence, this optimization procedure is a typical iteratively reweighted least square (IRLS) method [15]. Algorithm 1 summarizes the multi-task learning procedure via the IRLS method, namely the MTL-IRLS method. The optimization process is formed of nested loops, where the outer loop (while-loop) pursues global convergence and the inner loop (for-loop) updates each single task. The weight matrices $\{\Pi_l^0\}_{l=1}^L$ are initialized as identity matrices, which give equal weights to each dimension for all the tasks.
To further analyze how the above MTL-IRLS algorithm groups major tasks and identifies outlier tasks simultaneously, we first simplify the representation by letting $X \in \mathbb{R}^{dL \times \sum_l n_l}$ be a block diagonal matrix with $X_l \in \mathbb{R}^{d \times n_l}$ as the $l$-th block. We then define a vectorization operator "vec" over an arbitrary matrix $Z \in \mathbb{R}^{d \times L}$ as $\mathrm{vec}(Z) = [z_1^\top, \ldots, z_l^\top, \ldots, z_L^\top]^\top$, where $z_l$ is the $l$-th column vector of $Z$. Thus a single while-loop iteration of Algorithm 1 can be viewed as solving a weighted least square problem:
$$\min_W \|X^\top \mathrm{vec}(W) - \mathrm{vec}(Y)\|_2^2 + \frac{\lambda}{2}\|\Pi^{1/2} \mathrm{vec}(W)\|_2^2, \quad (7)$$
where $\Pi \in \mathbb{R}^{dL \times dL}$ is a concatenated diagonal matrix with $\Pi_l$ as the $l$-th block. Recalling the definition of $\Pi_l$ in Eq. (4), the $(ld+i)$-th diagonal element of $\Pi$ is computed as:
$$\Pi_{ld+i,\, ld+i} = (1-\gamma)\|w_{i\cdot}\|_2^{-1} + \gamma |w_{il}|^{-1},$$
which indicates the weight of the $i$-th feature for the $l$-th task. Note that a small positive number is added to the denominators to avoid numerical problems. We simply assume a balanced case with equal emphasis on row and element-wise sparsity. Note that if both $\|w_{i\cdot}\|_2$ and $|w_{il}|$ are small, the value $\Pi_{ld+i,\, ld+i}$ becomes large. Thus, it imposes a heavy penalty on the $i$-th feature of the $l$-th task. As a result, the value of $w_{il}$ becomes even smaller after each iteration of the updates. This indeed helps maintain both group sparsity and element-wise sparsity, since the $i$-th feature is chosen by neither the grouped tasks nor the outlier tasks. On the other hand, large values of both terms make $\Pi_{ld+i,\, ld+i}$ small and impose only a slight penalty, which encourages the increase of $w_{il}$ after each iteration. Hence, the iterative algorithm helps recover the group structure to which the $l$-th task belongs.
The other two more complicated cases are that the $i$-th feature is relevant only to the $l$-th task and irrelevant to all the other tasks, or that the $i$-th feature is irrelevant only to the $l$-th task but relevant to all the others. In both cases, the $l$-th task will be identified as an outlier. If $|w_{il}|$ is small and $\|w_{i\cdot}\|_2$ is large, the current task considers this feature irrelevant while the other tasks consider it important. Apparently, the penalty $\Pi_{ld+i,\, ld+i}$ will become large and encourage the updated $w_{il}$ to become smaller. Thus, it further helps identify outlier tasks. If the $i$-th feature is relevant to the $l$-th task but irrelevant to the others, the value of $|w_{il}|$ is large. However, $\|w_{i\cdot}\|_2$ will not be very small, since it also counts the values $w_{ij}$ and satisfies $\|w_{i\cdot}\|_2 = \big(\sum_{j=1}^{L} w_{ij}^2\big)^{1/2} \geq |w_{il}|$. Hence, the value of $w_{il}$ will remain fairly large after each iteration, which means that the element-wise sparsity is still preserved. In summary, the iterative reweighting scheme in Algorithm 1 also provides a unique ability to identify outlier tasks.
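The cases above can be checked numerically; the toy snippet below (our own, with $\gamma = 0.5$ and made-up coefficient values) shows that a coordinate that is small in both senses receives a much larger weight than one that is large in both:

```python
import numpy as np

# Toy check (ours) of the reweighting behavior: Pi_{ld+i,ld+i} =
# (1-gamma)/||w_i.||_2 + gamma/|w_il|, with eps added to the denominators.
gamma, eps = 0.5, 1e-8
W = np.array([[0.01, 0.02, 0.01],   # feature 1: irrelevant to every task
              [2.00, 1.50, 1.80],   # feature 2: relevant to every task
              [0.01, 0.02, 1.90]])  # feature 3: relevant only to task 3 (outlier)
row_norms = np.sqrt(np.sum(W ** 2, axis=1))
Pi = (1 - gamma) / (row_norms[:, None] + eps) + gamma / (np.abs(W) + eps)
print(Pi[0, 0] > Pi[2, 0] > Pi[1, 0])   # True: heavier penalty where w is smaller
```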
3.3. Optimization via accelerated proximal gradient method
Now we propose to use the accelerated proximal gradient (APG) method [36] to minimize the above convex cost function with joint $\ell_{1,1}/\ell_{2,1}$-norm regularization. Due to the existence of the non-smooth term, one tends to use a smoothing technique [37] to substitute the non-smooth part with its smooth approximation [14]. In this paper, we instead directly solve the proximal operator of a convex combination of the $\ell_{2,1}$ norm and the $\ell_{1,1}$ norm efficiently. We first rewrite the cost function in Eq. (3) as the sum of two components:
$$\mathcal{L}(W) = g(W) + h(W), \quad (8)$$
with the smooth part
$$g(W) = \sum_{l=1}^{L} \|X_l^\top w_l - y_l\|_2^2,$$
and the non-smooth part
$$h(W) = \lambda_1 \|W\|_{2,1} + \lambda_2 \|W\|_{1,1}. \quad (9)$$
For the smooth part $g(W)$, we denote the gradient as $\nabla_W g(W) = [\nabla_{w_1} g(W), \ldots, \nabla_{w_L} g(W)]$, where $\nabla_{w_l} g(W)$ is the gradient with respect to the variable $w_l$:
$$\nabla_{w_l} g(W) = X_l (X_l^\top w_l - y_l).$$
Moreover, $\nabla_W g(W)$ is Lipschitz continuous with Lipschitz constant $\mathrm{Lip} = \max_l \lambda_{\max}(X_l X_l^\top)$, where $\lambda_{\max}(X_l X_l^\top)$ is the largest eigenvalue of the matrix $X_l X_l^\top$ [37]. We then approximate the smooth function $g(W)$ at the point $W_0$ using a Taylor series expansion:
$$g(W) \approx g(W_0) + \langle \nabla_{W_0} g(W_0), W - W_0 \rangle + \frac{\mathrm{Lip}}{2}\|W - W_0\|_F^2, \quad (10)$$
where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The above approximation of $g(W)$ consists of two parts: the first-order Taylor expansion of the smooth function $g(\cdot)$ at the point $W_0$, i.e., $g(W_0) + \langle \nabla_{W_0} g(W_0), W - W_0 \rangle$, and the strongly convex regularization term $\frac{\mathrm{Lip}}{2}\|W - W_0\|_F^2$. Hence, substituting the smooth part $g(W)$ by Eq. (10), we optimize the approximation of the cost function $\mathcal{L}(W)$ as follows:
$$\hat{W} = \arg\min_W \mathcal{L}(W) = \arg\min_W g(W) + h(W)$$
$$\approx \arg\min_W g(W_0) + \langle \nabla_{W_0} g(W_0), W - W_0 \rangle + \frac{\mathrm{Lip}}{2}\|W - W_0\|_F^2 + h(W)$$
$$= \arg\min_W \frac{\mathrm{Lip}}{2}\Big\|W - \Big(W_0 - \frac{1}{\mathrm{Lip}}\nabla_{W_0} g(W_0)\Big)\Big\|_F^2 + h(W), \quad (11)$$
where the last line of the above equation is obtained by incorporating the inner product into the squared term.
Given a function $h(\cdot)$ and a parameter $\mathrm{Lip}$, the proximal operator $\mathrm{Prox}_{h,\mathrm{Lip}}(\cdot)$ is defined as [33]:
$$\mathrm{Prox}_{h,\mathrm{Lip}}(V) = \arg\min_W \frac{\mathrm{Lip}}{2}\|W - V\|_F^2 + h(W). \quad (12)$$
Using the above definition of the proximal operator, the approximate solution of Eq. (11) can be written as:
$$\hat{W} = \mathrm{Prox}_{h,\mathrm{Lip}}\Big(W_0 - \frac{1}{\mathrm{Lip}} \nabla_{W_0} g(W_0)\Big). \quad (13)$$
Note that the proximal gradient method in Eq. (13) indeed suggests solving the minimization problem in Eq. (8) by the following iterates:
$$W_k = \mathrm{Prox}_{h,\mathrm{Lip}_k}\Big(W_{k-1} - \frac{1}{\mathrm{Lip}_k} \nabla_{W_{k-1}} g(W_{k-1})\Big), \quad (14)$$
where $\mathrm{Lip}_k$ is a sequence of nonnegative scalars determined by line search. Hence, we obtain an iterative solution of the proposed minimization problem in Eq. (8) using the proximal operator.
Algorithm 2. MTL-APG Method.

Require: $X_l$: data matrix of the $l$-th task; $y_l$: response of the $l$-th task; Initialize $U_0$ as a zero matrix.
1: while not converged do
2:   $V_k = U_{k-1} - t_k \nabla g(U_{k-1})$;
3:   Determine the step length $t_k$ by line search;
4:   Evaluate the proximal operator associated with the convex combination of the matrix $(2,1)$ norm and $(1,1)$ norm in Eq. (15);
5:   $U_k = W_k + \frac{k-1}{k+2}(W_k - W_{k-1})$;
6:   Update the iteration counter: $k = k+1$;
7: end while
Next, we discuss how to evaluate the proximal operator of the nonsmooth term $h(\cdot)$ in Eq. (9). Let $V_k = W_{k-1} - \frac{1}{\mathrm{Lip}_k} \nabla_{W_{k-1}} g(W_{k-1})$; the evaluation of the proximal operator of $h(\cdot)$ is then equivalent to minimizing the following cost function:
$$W_k = \arg\min_W \frac{\mathrm{Lip}_k}{2}\|W - V_k\|_F^2 + \lambda_1 \|W\|_{2,1} + \lambda_2 \|W\|_{1,1}. \quad (15)$$
The evaluation of such a proximal operator remains difficult since it contains two nonsmooth terms. Though the smoothing technique [37] can be applied to transform one of the nonsmooth terms into a smooth one, doing so degrades the convergence rate to a suboptimal one [7]. However, we notice that the evaluation of the proximal operator in Eq. (15) and its two degraded cases have analytic solutions, which are summarized in the following theorems.
Theorem 1. The closed-form solution to the degraded case with $\lambda_1 = 0$ is obtained by the following operation:
$$w_{ij}^k = \Big[|v_{ij}^k| - \frac{\lambda_2}{\mathrm{Lip}_k}\Big]_+ \mathrm{sign}(v_{ij}^k),$$
where $\mathrm{sign}(\cdot)$ denotes the sign of a scalar, and $[\cdot]_+$ is an operator extracting the positive part of a scalar, defined as:
$$[v_{ij}]_+ = \begin{cases} 0, & v_{ij} \leq 0, \\ v_{ij}, & v_{ij} > 0. \end{cases}$$
Theorem 2. The closed-form solution to the degraded case with $\lambda_2 = 0$ is obtained by the following operation:
$$w_{i\cdot}^k = \Big[1 - \frac{\lambda_1}{\mathrm{Lip}_k \|v_{i\cdot}^k\|_2}\Big]_+ v_{i\cdot}^k.$$
Theorem 3. The closed-form solution to the proximal problem in Eq. (15) is obtained by the following operation:
$$w_{i\cdot}^k = \Bigg[\frac{\mathrm{Lip}_k \langle u_{i\cdot}^k, v_{i\cdot}^k \rangle - \lambda_1 \|u_{i\cdot}^k\|_2 - \lambda_2 \|u_{i\cdot}^k\|_1}{\mathrm{Lip}_k \|u_{i\cdot}^k\|_2^2}\Bigg]_+ u_{i\cdot}^k,$$
where $u_{ij} = \big[|v_{ij}^k| - \lambda_2/\mathrm{Lip}_k\big]_+ \mathrm{sign}(v_{ij}^k)$.
The above solution for the non-degenerate case can be viewed as a combination of the two degraded cases. In particular, it first performs $\ell_1$ thresholding, and then conducts a combined shrinkage which uses the $\ell_1$ and $\ell_2$ norms and the inner product of the vectors before and after the $\ell_1$ thresholding. Obtaining the proximal operator associated with the $\ell_{1,1}$ norm requires going over the entire matrix $V$ once, while obtaining the solution with the $\ell_{2,1}$ norm requires scanning $V$ twice. The combination of both the $\ell_{2,1}$ norm and the $\ell_{1,1}$ norm requires scanning the matrix $V$ multiple times to perform the $\ell_1$ thresholding and to compute the $\ell_1$ norm, the $\ell_2$ norm and the inner product. However, the complexity of all these proximal operators remains $O(nL)$.
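The closed forms in Theorems 1–3 can be sketched as follows (our own implementation of the stated formulas, looping over rows for clarity rather than speed; setting $\lambda_1 = 0$ or $\lambda_2 = 0$ recovers the two degraded cases):

```python
import numpy as np

# Sketch (ours) of the proximal operator in Eq. (15), following Theorems 1-3:
# elementwise soft-thresholding (the lambda_1 = 0 case of Theorem 1) followed
# by the combined row-wise shrinkage of Theorem 3.
def prox_joint(V, lip, lam1, lam2):
    U = np.sign(V) * np.maximum(np.abs(V) - lam2 / lip, 0.0)  # l1 thresholding
    W = np.zeros_like(V)
    for i in range(V.shape[0]):
        u, v = U[i], V[i]
        denom = lip * np.sum(u ** 2)
        if denom > 0:
            scale = (lip * np.dot(u, v) - lam1 * np.linalg.norm(u)
                     - lam2 * np.sum(np.abs(u))) / denom
            W[i] = max(scale, 0.0) * u
    return W

V = np.array([[0.5, -2.0, 1.0],
              [0.1, -0.2, 0.1]])
# With lam1 = 0 this reduces to plain soft-thresholding of each entry.
print(np.allclose(prox_joint(V, 1.0, 0.0, 0.3),
                  np.sign(V) * np.maximum(np.abs(V) - 0.3, 0.0)))  # True
```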
To further accelerate convergence, we use the Accelerated Proximal Gradient (APG) method to solve the optimization problem in Eq. (8). In our setting, the function $g(\cdot)$ is smooth and strongly convex, and the proximal operator is computed efficiently by Theorem 3. According to [5], the APG algorithm converges quadratically. Algorithm 2 summarizes the optimization procedure using APG to derive the optimal solutions for the proposed MTL method. The final algorithm is formulated as a single loop. The step length is denoted by $t_k$ instead of the Lipschitz constant $\mathrm{Lip}_k$, and is determined by line search [6].
3.4. Relationship between MTL-IRLS and MTL-APG
MTL-IRLS and MTL-APG are closely related, as both belong to the family of operator splitting methods [30,16,11]. As discussed in [6], the proximal gradient method can be viewed as a forward-backward splitting method. Recalling the stationary point condition of our optimization problem (3), it can be written as:
$$\big(2t X_l X_l^\top w_l^* - 2t X_l y_l - w_l^*\big) + \big(\lambda t \Pi_l w_l^* + w_l^*\big) = 0,$$
where the constant $t$ satisfies $t > 0$. We then obtain the optimal solution in an iterative form:
$$w_l^* = (I + \lambda t \Pi_l)^{-1}\big(w_l^* - 2t(X_l X_l^\top w_l^* - X_l y_l)\big). \quad (16)$$
As stated in [6], $\mathrm{Prox}_{t,g}(w^*) = (I + \lambda t \Pi_l)^{-1}$ is the proximal operator, and $w_l^* - 2t(X_l X_l^\top w_l^* - X_l y_l)$ is a gradient descent step with step length $t$.
Considering the IRLS method, we split the objective function into a linear part and a nonlinear part:
$$\mathcal{L}(W) = \tilde{g}(W) + \tilde{h}(W), \quad (17)$$
where the nonlinear part is
$$\tilde{g}(W) = \sum_{l=1}^{L} w_l^\top X_l X_l^\top w_l + \lambda \mathcal{J}(W),$$
and the linear part is
$$\tilde{h}(W) = \sum_{l=1}^{L} \big({-2}\, y_l^\top X_l^\top w_l + y_l^\top y_l\big).$$
The optimality condition for minimizing the objective function (17) is written as:
$$\nabla \tilde{h} + \partial \tilde{g}(W) = 0.$$
We have
$$W^* = -\big(2 X X^\top + \lambda \Pi\big)^{-1} \nabla \tilde{h}, \quad (18)$$
where $\Pi$ is the reweighting matrix from the factored representation of the subgradient, $\partial \mathcal{J}(W^*) = \Pi\, \mathrm{vec}(W^*)$.
Comparing the iterative forms in Eqs. (16) and (18), the main effort of both methods is to evaluate an inversion: the APG method computes the proximal operator, while the IRLS method computes the solution of a weighted least square problem. In general, the APG method is designed for simple penalties that admit an analytical solution of the proximal operator, while the IRLS method is designed for simple models where pruning methods can be effectively applied. For our formulation with the joint $\ell_{1,1}/\ell_{2,1}$ regularization, though both MTL-IRLS and MTL-APG are little affected by the curse of dimensionality, the training time of MTL-APG is proportional to the number of tasks, which is confirmed by our empirical study in Section 6.4.
4. Convergence analysis
Note that the convergence property of the APG algorithm has been proved in earlier studies [5,7]. In this section, we provide a convergence analysis of the IRLS algorithm for multi-task learning. We start by introducing a lemma presented in [44].
Lemma 1. Given two $d$-dimensional vectors $x = [x_1 \cdots x_d]^\top$ and $y = [y_1 \cdots y_d]^\top$, we define $E(x) = \sum_{i=1}^{d} |x_i|$ and $E(y) = \sum_{i=1}^{d} |y_i|$. Then the following inequality holds:
$$E(y) - E(x) \leq \frac{1}{2}\big(y^\top C y - x^\top C x\big),$$
where $C = \mathrm{diag}\{c_{11}, \ldots, c_{ii}, \ldots, c_{dd}\}$ is a diagonal matrix with $c_{ii} = |x_i|^{-1}$.
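The inequality of Lemma 1 follows coordinate-wise from $2|x_i||y_i| \leq x_i^2 + y_i^2$; a quick randomized check (our own, not part of the paper) confirms it numerically:

```python
import numpy as np

# Randomized numerical check (ours) of Lemma 1:
#   E(y) - E(x) <= (1/2)(y^T C y - x^T C x),  C = diag(1/|x_i|),
# which holds coordinate-wise since |y_i| - |x_i| <= (y_i^2/|x_i| - |x_i|)/2.
rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.standard_normal(8)          # nonzero almost surely
    y = rng.standard_normal(8)
    c = 1.0 / np.abs(x)
    lhs = np.sum(np.abs(y)) - np.sum(np.abs(x))
    rhs = 0.5 * (np.sum(c * y ** 2) - np.sum(c * x ** 2))
    assert lhs <= rhs + 1e-12
print("Lemma 1 verified on 1000 random pairs")
```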
We will use the above lemma to prove the convergence of the MTL-IRLS algorithm as follows. Since it is straightforward to see that the cost function $\mathcal{L}(\mathbf{W})$ defined in Eq. (3) is lower bounded as $\mathcal{L}(\mathbf{W}) \ge 0$, we only need to show that the cost function decreases monotonically.
Theorem 4. In the proposed MTL-IRLS algorithm, the regularized cost function $\mathcal{L}(\mathbf{W})$ decreases monotonically, i.e., $\mathcal{L}(\mathbf{W}^{k+1}) \le \mathcal{L}(\mathbf{W}^{k})$.
Proof 1. We have shown that the proposed algorithm is equivalent to solving a weighted least squares problem, as shown in Eq. (7). Decompose the diagonal matrix $\Pi$ into two diagonal matrices as $\Pi = \Phi + \Psi$ ($\Phi, \Psi \in \mathbb{R}^{dL \times dL}$), where the diagonal elements of $\Phi$ and $\Psi$ are defined as

$\Phi_{ld+i,ld+i} = (1-\gamma)\,\|\mathbf{w}_{i\cdot}\|_2^{-1}, \quad \Psi_{ld+i,ld+i} = \gamma\, |w_{il}|^{-1}, \quad l = 1, \ldots, L.$

Let us denote a column vector $\mathbf{e} = [e_1, \ldots, e_i, \ldots, e_d]^\top$ with elements $e_i = \|\mathbf{w}_{i\cdot}\|_2$, and a diagonal matrix $\bar{\Phi} = \mathrm{diag}\{\bar{\Phi}_{11}, \ldots, \bar{\Phi}_{ii}, \ldots, \bar{\Phi}_{dd}\}$ with $\bar{\Phi}_{ii} = (1-\gamma)\|\mathbf{w}_{i\cdot}\|_2^{-1}$. Then we can derive the weighted $\ell_2$ penalty term of Eq. (7) as

$\|\Pi^{1/2} \mathrm{vec}(\mathbf{W})\|_2^2 = [\mathrm{vec}(\mathbf{W})]^\top \Pi\, \mathrm{vec}(\mathbf{W}) = [\mathrm{vec}(\mathbf{W})]^\top \Phi\, \mathrm{vec}(\mathbf{W}) + [\mathrm{vec}(\mathbf{W})]^\top \Psi\, \mathrm{vec}(\mathbf{W}) = \mathbf{e}^\top \bar{\Phi}\, \mathbf{e} + [\mathrm{vec}(\mathbf{W})]^\top \Psi\, \mathrm{vec}(\mathbf{W}).$

For simplicity, we denote the dL-dimensional vector $\mathbf{v} = \mathrm{vec}(\mathbf{W})$. Recalling Eq. (2) and using Lemma 1, we can connect the joint $\ell_{11}/\ell_{21}$-norm $\mathcal{J}(\mathbf{W})$ with the weighted $\ell_2$ penalty:

$\mathcal{J}(\mathbf{W}^{k+1}) - \mathcal{J}(\mathbf{W}^{k}) = (1-\gamma)\Big(\sum_i \|\mathbf{w}^{k+1}_{i\cdot}\|_2 - \sum_i \|\mathbf{w}^{k}_{i\cdot}\|_2\Big) + \gamma\Big(\sum_i |v_i^{k+1}| - \sum_i |v_i^{k}|\Big)$

$\le \frac{1}{2}\big((\mathbf{e}^{k+1})^\top \bar{\Phi}^{k} \mathbf{e}^{k+1} - (\mathbf{e}^{k})^\top \bar{\Phi}^{k} \mathbf{e}^{k}\big) + \frac{1}{2}\big((\mathbf{v}^{k+1})^\top \Psi^{k} \mathbf{v}^{k+1} - (\mathbf{v}^{k})^\top \Psi^{k} \mathbf{v}^{k}\big)$

$= \frac{1}{2}\big((\mathbf{v}^{k+1})^\top \Phi^{k} \mathbf{v}^{k+1} - (\mathbf{v}^{k})^\top \Phi^{k} \mathbf{v}^{k}\big) + \frac{1}{2}\big((\mathbf{v}^{k+1})^\top \Psi^{k} \mathbf{v}^{k+1} - (\mathbf{v}^{k})^\top \Psi^{k} \mathbf{v}^{k}\big)$

$= \frac{1}{2}\big(\|(\Pi^{k})^{1/2} \mathrm{vec}(\mathbf{W}^{k+1})\|_2^2 - \|(\Pi^{k})^{1/2} \mathrm{vec}(\mathbf{W}^{k})\|_2^2\big).$

Adding the same quadratic empirical loss term to both sides of the above inequality, it becomes $\mathcal{L}(\mathbf{W}^{k+1}) - \mathcal{L}(\mathbf{W}^{k}) \le G(\mathbf{W}^{k+1}) - G(\mathbf{W}^{k})$. Since the IRLS algorithm guarantees $G(\mathbf{W}^{k+1}) - G(\mathbf{W}^{k}) \le 0$, it is easy to see that $\mathcal{L}(\mathbf{W}^{k+1}) - \mathcal{L}(\mathbf{W}^{k}) \le 0$. The monotonicity of the cost function is thus proved. Finally, since the cost function is lower bounded by zero, the convergence of the proposed MTL-IRLS algorithm is guaranteed. □
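The monotone decrease can be observed directly by running the reweighting loop. Below is a simplified sketch of MTL-IRLS under assumptions of our own (per-task design matrices stored in a list; a small ε guards the $|\cdot|^{-1}$ weights, which the paper's pruning strategy handles more carefully):

```python
import numpy as np

def joint_norm(W, gamma):
    """Joint l11/l21 penalty: (1-gamma)*sum_i ||w_i.||_2 + gamma*sum_{i,l} |w_il|."""
    return (1 - gamma) * np.linalg.norm(W, axis=1).sum() + gamma * np.abs(W).sum()

def mtl_irls(Xs, ys, lam=0.1, gamma=0.5, iters=15, eps=1e-8):
    """Sketch of MTL-IRLS: each outer iteration rebuilds the diagonal weights
    from the current W (frozen for the whole sweep) and solves one weighted
    least squares system per task, (2 X_l X_l^T + lam*Pi_l) w_l = 2 X_l y_l."""
    d, L = Xs[0].shape[0], len(Xs)
    # ridge initialization so the first reweighting is well defined
    W = np.column_stack([np.linalg.solve(Xs[l] @ Xs[l].T + np.eye(d),
                                         Xs[l] @ ys[l]) for l in range(L)])
    history = []
    for _ in range(iters):
        row_norms = np.maximum(np.linalg.norm(W, axis=1), eps)  # frozen weights
        for l in range(L):
            pi_l = (1 - gamma) / row_norms + gamma / np.maximum(np.abs(W[:, l]), eps)
            A = 2 * Xs[l] @ Xs[l].T + lam * np.diag(pi_l)
            W[:, l] = np.linalg.solve(A, 2 * Xs[l] @ ys[l])
        loss = sum(np.sum((Xs[l].T @ W[:, l] - ys[l]) ** 2) for l in range(L))
        history.append(loss + lam * joint_norm(W, gamma))
    return W, history
```

On random data the recorded objective $\mathcal{L}(\mathbf{W})$ is non-increasing (up to the tiny slack introduced by the ε clamp), matching Theorem 4.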
5. Performance bound
To further explore the performance bound, we define two index sets $\mathcal{I}(\mathbf{W})$ and $\mathcal{I}^c(\mathbf{W})$ for the nonzero and zero rows of matrix $\mathbf{W}$, respectively:

$\mathcal{I}(\mathbf{W}) = \{i : \|\mathbf{w}_{i\cdot}\|_1 \ne 0\}, \quad \mathcal{I}^c(\mathbf{W}) = \{i : \|\mathbf{w}_{i\cdot}\|_1 = 0\}.$

The following three lemmas, i.e., Lemmas 2-4, are introduced in [20].
Lemma 2. For any matrix pair $\hat{\mathbf{W}}, \mathbf{W}$, we have the following inequalities:

$\|\hat{\mathbf{W}} - \mathbf{W}\|_{2,1} + \|\mathbf{W}\|_{2,1} - \|\hat{\mathbf{W}}\|_{2,1} \le 2\,\|(\hat{\mathbf{W}} - \mathbf{W})_{\mathcal{I}(\mathbf{W})}\|_{2,1},$

and

$\|\hat{\mathbf{W}} - \mathbf{W}\|_{1,1} + \|\mathbf{W}\|_{1,1} - \|\hat{\mathbf{W}}\|_{1,1} \le 2\,\|(\hat{\mathbf{W}} - \mathbf{W})_{\mathcal{I}(\mathbf{W})}\|_{1,1}.$

Lemma 3. Let $\delta_i$ be i.i.d. random variables with $\delta_i \sim \mathcal{N}(0, \sigma^2)$, and $\sum_i \alpha_i^2 = 1$. Then

$v = \frac{1}{\sigma} \sum_i \alpha_i \delta_i$

is a standard normal random variable.

Lemma 4. Let $\chi^2(d)$ be a $\chi^2$ random variable with d degrees of freedom. Then, for any $b > 0$, we have

$\Pr\big(\chi^2(d) \le d + b\big) > 1 - \exp\Big({-\frac{1}{2}}\Big(b - d \log\Big(1 + \frac{b}{d}\Big)\Big)\Big).$
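For even degrees of freedom the $\chi^2$ CDF has a simple closed form, so the tail bound of Lemma 4 can be sanity-checked without any statistics library (helper names are ours):

```python
import math

def chi2_cdf_even(x, d):
    """CDF of chi^2(d) for even d, via the closed form
    1 - exp(-x/2) * sum_{j < d/2} (x/2)^j / j!."""
    assert d % 2 == 0 and d > 0
    return 1.0 - math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j)
                                        for j in range(d // 2))

def lemma4_bound(d, b):
    """Lower bound of Lemma 4: 1 - exp(-(b - d*log(1 + b/d)) / 2)."""
    return 1.0 - math.exp(-0.5 * (b - d * math.log(1 + b / d)))

# Pr(chi^2(d) <= d + b) exceeds the bound for a grid of (d, b)
for d in (2, 4, 10, 50):
    for b in (0.5, 5.0, 20.0):
        assert chi2_cdf_even(d + b, d) > lemma4_bound(d, b)
```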
We use the above lemmas to prove our performance bound inthe following.
Theorem 5. Let $\hat{\mathbf{W}} = [\hat{\mathbf{w}}_1, \cdots, \hat{\mathbf{w}}_L]$ be the optimal solution of Eq. (3) for $L \ge 2$ and $n, d \ge 1$. Assume that the regression model is given by a linear representation with Gaussian noise, i.e., $\mathbf{y}_l = \mathbf{f}_l + \boldsymbol{\delta}_l = \mathbf{X}_l^\top \mathbf{w}_l + \boldsymbol{\delta}_l$, where $\boldsymbol{\delta}_l$ is a vector with each entry $\delta_{li} \sim \mathcal{N}(0, \sigma^2)$. If the data $\mathbf{X}_l^\top$ is normalized, and we choose the regularization parameters $\lambda_1$ and $\lambda_2$ as

$\lambda_1, \lambda_2 \ge \frac{\sigma}{nL}\sqrt{dL + b},$

where b is a positive scalar, then with probability at least $1 - \exp\big({-\frac{1}{2}}\big(b - dL \log\big(1 + \frac{b}{dL}\big)\big)\big)$, we have

$\sum_{l=1}^{L} \frac{1}{nL} \|\mathbf{X}_l^\top \hat{\mathbf{w}}_l - \mathbf{f}_l\|_2^2 \le \sum_{l=1}^{L} \frac{1}{nL} \|\mathbf{X}_l^\top \mathbf{w}_l - \mathbf{f}_l\|_2^2 + 2\lambda_1 \|(\hat{\mathbf{W}} - \mathbf{W})_{\mathcal{I}(\mathbf{W})}\|_{2,1} + 2\lambda_2 \|(\hat{\mathbf{W}} - \mathbf{W})_{\mathcal{I}(\mathbf{W})}\|_{1,1}.$

Proof 2. As $\hat{\mathbf{W}}$ is the optimal solution of Eq. (3), we have

$\sum_{l=1}^{L} \frac{1}{nL} \|\mathbf{X}_l^\top \hat{\mathbf{w}}_l - \mathbf{y}_l\|_2^2 \le \sum_{l=1}^{L} \frac{1}{nL} \|\mathbf{X}_l^\top \mathbf{w}_l - \mathbf{y}_l\|_2^2 + \lambda_1\big(\|\mathbf{W}\|_{2,1} - \|\hat{\mathbf{W}}\|_{2,1}\big) + \lambda_2\big(\|\mathbf{W}\|_{1,1} - \|\hat{\mathbf{W}}\|_{1,1}\big).$

According to the linear regression model with Gaussian noise, we have

$\sum_{l=1}^{L} \frac{1}{nL} \|\mathbf{X}_l^\top \hat{\mathbf{w}}_l - \mathbf{f}_l\|_2^2 \le \sum_{l=1}^{L} \frac{1}{nL} \|\mathbf{X}_l^\top \mathbf{w}_l - \mathbf{f}_l\|_2^2 + \lambda_1\big(\|\mathbf{W}\|_{2,1} - \|\hat{\mathbf{W}}\|_{2,1}\big) + \lambda_2\big(\|\mathbf{W}\|_{1,1} - \|\hat{\mathbf{W}}\|_{1,1}\big) + \frac{2}{nL}\langle \mathbf{Z}, \hat{\mathbf{W}} - \mathbf{W}\rangle, \quad (19)$

where $\mathbf{Z} = [\mathbf{X}_1\boldsymbol{\delta}_1, \ldots, \mathbf{X}_L\boldsymbol{\delta}_L] \in \mathbb{R}^{d \times L}$, and its (i, j)-th entry is given by $z_{ij} = \sum_{k=1}^{n} x_{ik}\delta_{kj}$.

If we define $v_{ij} = \frac{1}{\sigma} z_{ij}$, it is easy to see that the $v_{ij}$ are i.i.d. standard normal variables, $v_{ij} \sim \mathcal{N}(0,1)$ (according to Lemma 3). Thus

$\frac{1}{\sigma^2}\|\mathbf{Z}\|_F^2 = \sum_i \sum_j v_{ij}^2$

is a $\chi^2$ random variable with dL degrees of freedom. Based on Lemma 4, we have

$\Pr\Big(\frac{1}{nL}\|\mathbf{Z}\|_F \le \alpha\Big) \ge 1 - \exp\Big({-\frac{1}{2}}\Big(b - dL \log\Big(1 + \frac{b}{dL}\Big)\Big)\Big),$

for any $\alpha \ge \frac{\sigma}{nL}\sqrt{dL + b}$.
Thus, with probability at least $1 - \exp\big({-\frac{1}{2}}\big(b - dL\log\big(1 + \frac{b}{dL}\big)\big)\big)$, we have

$\frac{2}{nL}\langle \mathbf{Z}, \hat{\mathbf{W}} - \mathbf{W}\rangle \le \frac{2}{nL}\|\mathbf{Z}\|_F \|\hat{\mathbf{W}} - \mathbf{W}\|_F \le 2\alpha\|\hat{\mathbf{W}} - \mathbf{W}\|_F \le \alpha\|\hat{\mathbf{W}} - \mathbf{W}\|_{2,1} + \alpha\|\hat{\mathbf{W}} - \mathbf{W}\|_{1,1}.$

Let $\lambda_1, \lambda_2 \ge \frac{\sigma}{nL}\sqrt{dL + b}$; then

$\frac{2}{nL}\langle \mathbf{Z}, \hat{\mathbf{W}} - \mathbf{W}\rangle \le \lambda_1\|\hat{\mathbf{W}} - \mathbf{W}\|_{2,1} + \lambda_2\|\hat{\mathbf{W}} - \mathbf{W}\|_{1,1}.$

Substituting the above inequality into inequality (19) and using Lemma 2, we verify the theorem. □
Now we make the following assumption about the training data and the weight matrix, which is a generalized version of the restricted eigenvalue assumption.

Assumption 1. For a matrix $\Gamma \in \mathbb{R}^{d \times L}$, let $s \le d$. We assume that there exists a constant $\kappa(s)$ such that

$\kappa(s) = \min_{\Gamma \in \mathcal{R}(s)} \frac{\|\mathbf{X}^\top \mathrm{vec}(\Gamma)\|}{\sqrt{nL}\,\|\Gamma_{\mathcal{I}(\mathbf{W})}\|_F} > 0,$

where the restricted set $\mathcal{R}(s)$ is defined as

$\mathcal{R}(s) = \big\{\Gamma \in \mathbb{R}^{d \times L} : \Gamma \ne 0,\ |\mathcal{I}(\mathbf{W})| \le s,\ \|\Gamma_{\mathcal{I}^c(\mathbf{W})}\|_{2,1} \le \alpha_1 \|\Gamma_{\mathcal{I}(\mathbf{W})}\|_{2,1},\ \|\Gamma_{\mathcal{I}^c(\mathbf{W})}\|_{1,1} \le \alpha_2 \|\Gamma_{\mathcal{I}(\mathbf{W})}\|_{1,1}\big\},$

and $|\mathcal{I}|$ counts the number of elements in the set $\mathcal{I}$. Note that Assumption 1 is similar to the assumptions made in previous works analyzing the performance bounds of MTL [13,20]. The following theorem gives a bound measuring how well our proposed method can approximate the true $\mathbf{W}$ matrix defined in Eq. (3).
Theorem 6. Let $\hat{\mathbf{W}}$ be the optimal solution of Eq. (3) for $L \ge 2$ and $n, d \ge 1$, and let $\mathbf{W}^*$ be the oracle solution. The regularization parameters $\lambda_1$ and $\lambda_2$ are chosen as

$\lambda_1, \lambda_2 \ge \frac{\sigma}{nL}\sqrt{dL + b},$

where b is a positive scalar. Then, under the above assumption, the following results hold with probability at least $1 - \exp\big({-\frac{1}{2}}\big(b - dL\log\big(1 + \frac{b}{dL}\big)\big)\big)$:

$\frac{1}{nL}\|\mathbf{X}^\top \mathrm{vec}(\hat{\mathbf{W}}) - \mathrm{vec}(\mathbf{F})\|^2 \le \frac{1}{\kappa^2(s)}\big(2\lambda_1\sqrt{s} + 2\lambda_2\sqrt{sL}\big)^2,$

$\|\hat{\mathbf{W}} - \mathbf{W}^*\|_{2,1} \le \frac{(\alpha_1 + 1)s}{\kappa^2(s)}\big(2\lambda_1 + 2\lambda_2\sqrt{L}\big),$

$\|\hat{\mathbf{W}} - \mathbf{W}^*\|_{1,1} \le \frac{(\alpha_2 + 1)s}{\kappa^2(s)}\big(2\lambda_1\sqrt{L} + 2\lambda_2 L\big).$
Proof 3. Using Theorem 5 and setting $\mathbf{W} = \mathbf{W}^*$, we have

$\frac{1}{nL}\|\mathbf{X}^\top \mathrm{vec}(\hat{\mathbf{W}}) - \mathrm{vec}(\mathbf{F})\|^2 \le 2\lambda_1\|(\hat{\mathbf{W}} - \mathbf{W}^*)_{\mathcal{I}(\mathbf{W}^*)}\|_{2,1} + 2\lambda_2\|(\hat{\mathbf{W}} - \mathbf{W}^*)_{\mathcal{I}(\mathbf{W}^*)}\|_{1,1}.$

Under Assumption 1, we have

$\|(\hat{\mathbf{W}} - \mathbf{W}^*)_{\mathcal{I}(\mathbf{W}^*)}\|_{2,1} \le \sqrt{s}\,\|(\hat{\mathbf{W}} - \mathbf{W}^*)_{\mathcal{I}(\mathbf{W}^*)}\|_F \le \frac{\sqrt{s}}{\kappa(s)\sqrt{nL}}\,\|\mathbf{X}^\top \mathrm{vec}(\hat{\mathbf{W}}) - \mathrm{vec}(\mathbf{F})\|,$

and

$\|(\hat{\mathbf{W}} - \mathbf{W}^*)_{\mathcal{I}(\mathbf{W}^*)}\|_{1,1} \le \sqrt{sL}\,\|(\hat{\mathbf{W}} - \mathbf{W}^*)_{\mathcal{I}(\mathbf{W}^*)}\|_F \le \frac{\sqrt{sL}}{\kappa(s)\sqrt{nL}}\,\|\mathbf{X}^\top \mathrm{vec}(\hat{\mathbf{W}}) - \mathrm{vec}(\mathbf{F})\|.$

Thus, we obtain

$\|\mathbf{X}^\top \mathrm{vec}(\hat{\mathbf{W}}) - \mathrm{vec}(\mathbf{F})\| \le \frac{\sqrt{nL}}{\kappa(s)}\big(2\lambda_1\sqrt{s} + 2\lambda_2\sqrt{sL}\big).$

Note that Assumption 1 gives

$\|\Gamma_{\mathcal{I}^c(\mathbf{W})}\|_{2,1} \le \alpha_1\|\Gamma_{\mathcal{I}(\mathbf{W})}\|_{2,1}, \quad \|\Gamma_{\mathcal{I}^c(\mathbf{W})}\|_{1,1} \le \alpha_2\|\Gamma_{\mathcal{I}(\mathbf{W})}\|_{1,1}.$

Setting $\Gamma = \hat{\mathbf{W}} - \mathbf{W}^*$, we obtain

$\|\hat{\mathbf{W}} - \mathbf{W}^*\|_{2,1} \le (\alpha_1 + 1)\,\|(\hat{\mathbf{W}} - \mathbf{W}^*)_{\mathcal{I}(\mathbf{W}^*)}\|_{2,1}, \quad \|\hat{\mathbf{W}} - \mathbf{W}^*\|_{1,1} \le (\alpha_2 + 1)\,\|(\hat{\mathbf{W}} - \mathbf{W}^*)_{\mathcal{I}(\mathbf{W}^*)}\|_{1,1}.$

Finally, we can derive

$\|\hat{\mathbf{W}} - \mathbf{W}^*\|_{2,1} \le \frac{(\alpha_1 + 1)s}{\kappa^2(s)}\big(2\lambda_1 + 2\lambda_2\sqrt{L}\big), \quad \|\hat{\mathbf{W}} - \mathbf{W}^*\|_{1,1} \le \frac{(\alpha_2 + 1)s}{\kappa^2(s)}\big(2\lambda_1\sqrt{L} + 2\lambda_2 L\big). \quad \square$
The above theorem provides an important theoretical guarantee for the global optimum of the optimization problem (3). More specifically, the first inequality bounds the squared data-fitting loss, while the second and third inequalities bound the deviation of the estimate from the oracle solution in terms of the ℓ2,1-norm and the ℓ1,1-norm, respectively.
6. Experiments
In this section, we conduct extensive experiments on both synthetic and real data to evaluate the effectiveness of our approaches. We compare with several state-of-the-art MTL methods, including grouping-based MTL (GMTL) [23], dirty MTL (DMTL) [22], and two robust MTL methods with different regularization terms: robust MTL (RMTL) [13] and robust multi-task feature learning (rMTFL) [20]. For the compared methods, we use the code provided by the authors. To provide a fair comparison, we partition the data into training, validation and test sets. The parameters of each method are fine-tuned on the validation set. Performance is measured by both the normalized mean squared error (nMSE) and the averaged mean squared error (aMSE), and we report the mean and standard deviation of the errors over 10 random trials.
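For reproducibility, here is one common convention for the two error measures (MTL papers differ slightly in the normalization, so treat this as an assumption on our part rather than the exact formulas used in the experiments):

```python
import numpy as np

def nmse(y_true_tasks, y_pred_tasks):
    """Normalized MSE: total squared error divided by the size-weighted
    variance of the targets, so predicting each task's mean gives 1.0."""
    num = sum(np.sum((yt - yp) ** 2) for yt, yp in zip(y_true_tasks, y_pred_tasks))
    den = sum(len(yt) * np.var(yt) for yt in y_true_tasks)
    return num / den

def amse(y_true_tasks, y_pred_tasks):
    """Averaged MSE: total squared error divided by the total squared
    magnitude of the targets."""
    num = sum(np.sum((yt - yp) ** 2) for yt, yp in zip(y_true_tasks, y_pred_tasks))
    den = sum(np.sum(yt ** 2) for yt in y_true_tasks)
    return num / den
```

Both measures are zero for perfect predictions, and nMSE equal to one corresponds to the trivial per-task mean predictor.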
6.1. Synthetic data
We first describe the procedure for generating the synthetic dataset. We set the number of tasks to L = 50, and each task has 100 training samples with 200-dimensional features. The indices of the nonzero entries for both grouped tasks and outlier tasks are chosen independently from a discrete uniform distribution, and the values of all these nonzero entries are drawn from a standard Gaussian distribution. The data matrices $\{\mathbf{X}_l\}_{l=1}^{L}$ are sampled from a standard Gaussian distribution. The response is computed as $\mathbf{y}_l = \mathbf{X}_l^\top \mathbf{w}_l + \boldsymbol{\xi}_l$, where $\boldsymbol{\xi}_l$ is Gaussian noise with zero mean and variance $\sigma_l^2$ specified by a certain signal-to-noise
Fig. 3. Illustration of the learned coefficient matrices using various MTL algorithms and the ground truth matrix: (a) a major task group peppered with outlier tasks and (b) multiple groups of major tasks peppered with outlier tasks. This figure is best viewed on screen with magnification.
Table 1
The error rates of the learned target matrix W on the synthetic data (mean ± std).

Settings               Measure    DMTL             rMTFL            MTL-IRLS         MTL-APG
One group + outliers   ℓ0 error   0.2118±0.0107    0.2460±0.0535    0.1939±0.0050    0.0904±0.0098
                       ℓ1 error   0.0200±0.0008    0.0216±0.0020    0.0129±0.0002    0.0085±0.0007
                       ℓ2 error   0.0033±0.0002    0.0062±0.0003    0.0015±0.0001    0.0009±0.0001
Two groups + outliers  ℓ0 error   0.2147±0.0182    0.2232±0.0100    0.1997±0.0070    0.1139±0.0472
                       ℓ1 error   0.0197±0.0010    0.0276±0.0002    0.0136±0.0004    0.0091±0.0009
                       ℓ2 error   0.0032±0.0002    0.0083±0.0001    0.0016±0.0001    0.0008±0.0002
(SNR) level as

$\sigma_l^2 = \frac{1}{n_l}\|\mathbf{y}_l\|_2^2 \cdot 10^{-\mathrm{SNR}/10}.$
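The generation protocol above can be scripted as follows (a sketch; the support size per task and the single-group layout are our own simplifications of the setups described in the text):

```python
import numpy as np

def make_synthetic(L=50, n=100, d=200, n_outliers=10, k=20, snr_db=30, seed=0):
    """Grouped tasks share one row support of size k; each outlier task draws
    its own support. Noise variance follows sigma_l^2 = ||y_l||^2/n * 10^(-SNR/10),
    computed here from the clean responses."""
    rng = np.random.default_rng(seed)
    W = np.zeros((d, L))
    shared = rng.choice(d, size=k, replace=False)
    W[np.ix_(shared, np.arange(L - n_outliers))] = \
        rng.standard_normal((k, L - n_outliers))
    for l in range(L - n_outliers, L):
        own = rng.choice(d, size=k, replace=False)
        W[own, l] = rng.standard_normal(k)
    Xs, ys = [], []
    for l in range(L):
        X = rng.standard_normal((d, n))
        y_clean = X.T @ W[:, l]
        sigma2 = np.mean(y_clean ** 2) * 10.0 ** (-snr_db / 10)
        Xs.append(X)
        ys.append(y_clean + rng.normal(0.0, np.sqrt(sigma2), size=n))
    return Xs, ys, W
```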
We start from a simple scenario where the ground truth contains one major task group with a total of 40 tasks and 10 outlier tasks. In addition, we also test a more complicated case with two major task groups and 10 outlier tasks, where each major group contains 20 tasks. Fig. 3 illustrates the recovered coefficient matrices W using different methods, together with the ground truth. It can be seen that all these methods can recover the major task groups to some extent. However, neither the DMTL nor the rMTFL method can identify the outlier tasks effectively, due to the limitations of their task structure assumptions. In contrast, MTL-IRLS and MTL-APG are able to recover the major task groups as well as the outlier tasks. To provide quantitative evaluations of the recovered W matrix, we report the error rates using the mean ℓ0, ℓ1 and ℓ2 errors in Table 1. Note that the proposed MTL-IRLS and MTL-APG methods significantly outperform the other competing methods, and the MTL-APG method achieves the best recovery in all the tests, which is consistent with the qualitative observations.
Finally, we also compare the values of nMSE and aMSE of the regression results for all the tested cases. In particular, we vary the magnitudes of the SNR levels, ranging from 20 dB to 50 dB, and
Table 2
Performance comparison of various methods on the synthetic data with different noise levels (mean ± std).

Measure  SNR (dB)  GMTL           DMTL           RMTL           rMTFL          MTL-IRLS       MTL-APG
nMSE     20        1.4344±0.0642  0.6169±0.0244  0.5806±0.0199  0.6497±0.0243  0.5849±0.0416  0.5639±0.0363
         30        1.4803±0.0645  0.6007±0.0113  0.6549±0.0274  0.5748±0.0728  0.3541±0.0199  0.3071±0.0350
         40        1.1957±0.0467  0.4588±0.0124  0.4895±0.0137  0.4189±0.0640  0.3739±0.0246  0.2513±0.0251
         50        1.2253±0.0791  0.4821±0.0236  0.5211±0.0217  0.4147±0.0263  0.3441±0.0293  0.2261±0.0335
aMSE     20        0.1032±0.0046  0.0444±0.0019  0.0418±0.0015  0.0468±0.0019  0.0417±0.0035  0.0402±0.0031
         30        0.0927±0.0032  0.0376±0.0011  0.0410±0.0019  0.0360±0.0048  0.0287±0.0018  0.0249±0.0030
         40        0.0980±0.0035  0.0376±0.0011  0.0401±0.0014  0.0344±0.0058  0.0268±0.0016  0.0180±0.0016
         50        0.0965±0.0061  0.0380±0.0022  0.0411±0.0022  0.0327±0.0023  0.0266±0.0018  0.0175±0.0023
Table 3
Performance comparison of various methods in terms of nMSE and aMSE on the SARCOS dataset (mean ± std).

Measure  Training #  GMTL           DMTL           RMTL           rMTFL          MTL-IRLS       MTL-APG
nMSE     50          0.0669±0.0078  0.0668±0.0088  0.0721±0.0081  0.0749±0.0264  0.0524±0.0040  0.0530±0.0035
         100         0.0457±0.0020  0.0474±0.0036  0.0575±0.0266  0.0474±0.0024  0.0448±0.0025  0.0444±0.0024
         150         0.0402±0.0012  0.0426±0.0020  0.0427±0.0013  0.0427±0.0019  0.0385±0.0012  0.0383±0.0012
aMSE     50          0.0633±0.0074  0.0632±0.0083  0.0683±0.0076  0.0709±0.0250  0.0497±0.0038  0.0501±0.0033
         100         0.0433±0.0019  0.0449±0.0034  0.0544±0.0252  0.0449±0.0022  0.0425±0.0024  0.0421±0.0022
         150         0.0380±0.0012  0.0403±0.0019  0.0404±0.0013  0.0404±0.0018  0.0364±0.0012  0.0363±0.0012
Table 4
Performance comparison of various methods in terms of nMSE and aMSE on the School dataset (mean ± std).

Measure  Training (%)  GMTL           DMTL           RMTL           rMTFL          MTL-IRLS       MTL-APG
nMSE     10            0.8939±0.0258  0.9230±0.0187  0.9004±0.0151  0.9184±0.0178  0.8267±0.0154  0.9032±0.0192
         20            0.7591±0.0113  0.7866±0.0110  0.7773±0.0080  0.7865±0.0099  0.7269±0.0084  0.7691±0.0079
         30            0.7098±0.0142  0.7397±0.0114  0.7341±0.0121  0.7396±0.0115  0.6905±0.0086  0.7127±0.0146
aMSE     10            0.2458±0.0071  0.2538±0.0053  0.2476±0.0045  0.2526±0.0049  0.2287±0.0041  0.2499±0.0046
         20            0.2091±0.0030  0.2167±0.0033  0.2141±0.0026  0.2167±0.0028  0.2009±0.0019  0.2126±0.0014
         30            0.1955±0.0039  0.2037±0.0032  0.2022±0.0035  0.2037±0.0032  0.1911±0.0021  0.1972±0.0029
report the results in Table 2. As expected, reducing the noise level by increasing the SNR clearly improves the performance of all the approaches. Among all the tested cases, however, MTL-APG provides the best performance with the lowest estimation errors, and MTL-IRLS is the second best choice in most cases. Notice that these quantitative estimation-error results are consistent with the visualization of the recovered matrices in Fig. 3.
6.2. Real data
We now evaluate the proposed methods and compare them with the state-of-the-art approaches using several standard benchmarks
from real-world applications. Note that the selected benchmarks, including the SARCOS dataset, the School dataset, and the ADNI dataset, have been widely used in the literature for validating the performance of various MTL algorithms [31,63,25,20,62].

The first dataset used in the experiments is the SARCOS data, collected from an inverse dynamics prediction system for a seven degrees-of-freedom anthropomorphic robot arm. The dataset consists of 48,933 observations corresponding to 7 joint torques; each observation is described by a 21-dimensional feature vector including 7 joint positions, 7 joint velocities, and 7 joint accelerations. The goal here is to construct mappings from each observation to the 7 joint torques. Following the setup of the
Table 5
Performance comparison of various methods in terms of nMSE and aMSE on the ADNI dataset (mean ± std).

Measure  Training (%)  GMTL           DMTL           RMTL           rMTFL          MTL-IRLS       MTL-APG
nMSE     20            0.9550±0.0143  0.6886±0.0210  0.6677±0.0166  0.6976±0.0194  0.5463±0.0178  0.5664±0.0168
         40            0.8717±0.0166  0.6486±0.0179  0.6137±0.0213  0.6693±0.0160  0.5424±0.0182  0.5436±0.0079
         60            0.8131±0.0269  0.6198±0.0302  0.5975±0.0258  0.6429±0.0310  0.5290±0.0308  0.5373±0.0295
aMSE     20            0.0252±0.0008  0.0182±0.0006  0.0176±0.0005  0.0184±0.0007  0.0146±0.0006  0.0151±0.0008
         40            0.0231±0.0011  0.0172±0.0008  0.0162±0.0008  0.0177±0.0008  0.0143±0.0011  0.0143±0.0011
         60            0.0211±0.0014  0.0161±0.0013  0.0155±0.0014  0.0167±0.0014  0.0139±0.0011  0.0141±0.0012
Fig. 4. The evaluation of parameter sensitivity of the MTL-IRLS (left) and MTL-APG (right) on three datasets: (a) SARCOS, (b) School, and (c) ADNI.
existing works, we vary the size of the training sets by randomly selecting 50, 100, or 150 observations, and use 200 and 5000 observations as the validation and test sets, respectively.
In addition, we also apply all the methods to the second real dataset, i.e., the School dataset, which was obtained from the Inner London Education Authority. The dataset consists of exam scores of
Fig. 5. Evaluation of the algorithm efficiency with respect to: (a) the data dimensionality, (b) the number of tasks.
Fig. 6. Convergence analysis of (a) MTL-IRLS and (b) MTL-APG.
15,362 students from 139 secondary schools. Each student is described by 27 attributes such as year, gender, and examination scores. Following prior works, we vary the training ratio among 10%, 20% and 30%, fix the validation ratio at 30%, and use the rest for testing.
The third dataset is the ADNI dataset from the Alzheimer's Disease Neuroimaging Initiative database. The ADNI project is a longitudinal study, which provides a variety of measurements from Alzheimer's disease patients, mild cognitive impairment patients and normal controls. The measurements include MRI scans, PET scans, CSF measurements, and cognitive scores. In our experiment, we use the structural MRI data of 675 patients. In particular, the dataset was processed using the well-known tool FreeSurfer, and the preprocessing procedure includes the following steps: remove the features with more than 1000 missing entries; remove the records that failed quality control; exclude the patients without baseline MRI records; and fill the missing entries with the average value [62]. After this procedure, the MRI features can be grouped into 5 categories: cortical thickness average, cortical thickness standard deviation, volume of cortical parcellation, volume of white matter parcellation, and surface area. We test our method on predicting future Mini-Mental State Exam (MMSE) scores at six time points (M06, M12, M18, M24, M36, and M48), following the standard settings used in [63,62]. We vary the ratio of the training set
among 20%, 40% and 60%, fix the validation ratio at 20%, and use the rest for testing.
Tables 3, 4 and 5 report the performance measured by nMSE and aMSE on the SARCOS, School and ADNI datasets, respectively. On the SARCOS dataset, MTL-APG performs better than the other methods, and MTL-IRLS is the second best in most cases. On the School and ADNI datasets, however, MTL-IRLS performs best in all cases and MTL-APG is the second best in most cases. In addition, both MTL-IRLS and MTL-APG often have significantly smaller standard deviations, especially when there is less training data. This indicates that the proposed methods are more robust, owing to their flexibility in exploring complex task relationships, which compensates for the lack of training samples. Finally, we also observe that among the compared methods, the task-grouping-based method, i.e., GMTL, outperforms the outlier-task-based approaches, including DMTL, RMTL, and rMTFL, on the SARCOS and School datasets. This suggests that recovering the main group structure is probably more important than identifying a few outlier tasks.
6.3. Parameter sensitivity
We also analyze parameter sensitivity for both MTL-IRLS and MTL-APG, using the SARCOS, School and ADNI datasets. Two key
parameters of the proposed methods are γ and λ. The parameter λ controls the strength of the structure penalty, while γ balances the weights of group sparsity and element-wise sparsity. We use 30% of the data samples for training and the rest for testing. Varying γ in [0:0.1:1] and λ in $10^i$, i = −2:0.5:2, we show the performance measured by nMSE of the proposed MTL-IRLS and MTL-APG in Fig. 4. It is clear that both algorithms are fairly robust over a wide range of parameter settings: the choice 1 ≤ λ ≤ 10 works well for the MTL-IRLS method, and 10 ≤ λ ≤ 100 for the MTL-APG method. On the other hand, though the performance of both approaches is also stable over different choices of γ, the best choice always lies in 0 < γ < 1, which balances the effects of task grouping and task outliers.
6.4. Algorithm efficiency and convergence analysis
We also analyze the efficiency of the proposed methods (MTL-IRLS and MTL-APG) and compare with the other competing methods using the synthetic dataset. In particular, we evaluate the training time versus the feature dimensionality d and the number of tasks L. Fig. 5 reports the experimental results, where the MTL-IRLS algorithm is in general the fastest for training and its efficiency is also less sensitive to the settings. In particular, Fig. 5(a) shows the training time with respect to the feature dimensionality ranging from 200 to 1200, where the efficiency of both the MTL-IRLS and the MTL-APG methods is little affected by the increase of the feature dimensionality. However, the training time of the DMTL and rMTFL methods grows roughly linearly. In Fig. 5(b), the training cost with different numbers of tasks is evaluated, where we set the range as 20 ≤ L ≤ 640. Note that as the number of tasks increases, the training time of MTL-IRLS is still little affected, while the training cost of MTL-APG increases significantly due to the time spent calculating the proximal operators. These empirical results also confirm the theoretical analysis and the comparison of the MTL-IRLS and MTL-APG methods in Section 3.
The convergence analysis of the proposed methods is performed using the School dataset. The split ratio of the training, validation and test sets is 30%, 30% and 40%, respectively. As stated above, we use cross-validation to choose all the parameters, and nMSE is used to measure the training and test errors. The mean nMSE curves over 20 random trials are shown in Fig. 6. Note that MTL-IRLS converges within five iterations of reweighting, while MTL-APG converges within 800 iterations of projection.
7. Conclusion
In this paper, we have presented a generic MTL framework with flexible structure regularization. Unlike existing methods such as the decomposition model, we directly impose a regularization term with a convex mixture of grouping and outlier penalties, i.e., the joint ℓ11/ℓ21-norm regularization, which leads to a flexible yet robust formulation. To efficiently minimize the cost function, we propose two efficient algorithms, namely MTL-IRLS and MTL-APG, to learn the target model under a complex setting with both task grouping and task outliers. Besides analyzing the theoretical relatedness of these two solutions, we also provide rigorous proofs of the convergence property and the performance bound. Experiments on both synthetic and real data, together with a comparison study against several representative methods, have verified the superior performance of our methods. One interesting future direction is to deploy the proposed methods in more challenging real-world applications such as fine-grained object categorization.
Appendix
In the following, we prove Theorem 3 in Section 3.3. We omit the superscript for clarity in this section, and reformulate the optimization problem in Eq. (15) as

$\hat{\mathbf{W}} = \arg\min_{\mathbf{W}} \Big\{\frac{L}{2}\|\mathbf{W} - \mathbf{V}\|_F^2 + \lambda_1\|\mathbf{W}\|_{2,1} + \lambda_2\|\mathbf{W}\|_{1,1}\Big\}$

$= \arg\min_{\mathbf{W}} \Big\{\frac{L}{2}\sum_i \|\mathbf{w}_{i\cdot} - \mathbf{v}_{i\cdot}\|_2^2 + \lambda_1\sum_i \|\mathbf{w}_{i\cdot}\|_2 + \lambda_2\sum_i \|\mathbf{w}_{i\cdot}\|_1\Big\}$

$= \sum_i \arg\min_{\mathbf{w}_{i\cdot}} \Big\{\frac{L}{2}\|\mathbf{w}_{i\cdot} - \mathbf{v}_{i\cdot}\|_2^2 + \lambda_1\|\mathbf{w}_{i\cdot}\|_2 + \lambda_2\|\mathbf{w}_{i\cdot}\|_1\Big\}.$
Thus, the minimization problem over the matrix has been split into a set of independent minimization problems over row vectors. After each vector problem is solved, simply stacking the solutions forms the solution of the original problem. Therefore, we only need to consider the minimization problem over a single row vector.
Proof of Theorem 3.

Proof 4. The non-degenerate case is formulated as follows:

$\mathbf{w}_{i\cdot} = \arg\min_{\mathbf{w}_{i\cdot}} \Big\{\frac{L}{2}\|\mathbf{w}_{i\cdot} - \mathbf{v}_{i\cdot}\|_2^2 + \lambda_1\|\mathbf{w}_{i\cdot}\|_2 + \lambda_2\|\mathbf{w}_{i\cdot}\|_1\Big\}. \quad (20)$

The optimality condition for the above unconstrained nonsmooth problem is that

$\Big(L + \frac{\lambda_1}{\|\mathbf{w}_{i\cdot}\|_2}\Big) w_{ij} - L v_{ij} + \lambda_2 \xi = 0,$

where $\xi \in \partial|w^*_{ij}|$. Then we discuss the solution in three cases:

- If $|v_{ij}| \le \frac{\lambda_2}{L}$, then $w^*_{ij} = 0$ and $\xi = \frac{L}{\lambda_2} v_{ij}$.
- If $v_{ij} > \frac{\lambda_2}{L} > 0$, then $w^*_{ij} > 0$ and $\xi = 1$. More specifically, $w^*_{ij} = \big(L + \frac{\lambda_1}{\|\mathbf{w}_{i\cdot}\|_2}\big)^{-1} L\big(v_{ij} - \frac{\lambda_2}{L}\big).$
- If $v_{ij} < -\frac{\lambda_2}{L} < 0$, then $w^*_{ij} < 0$ and $\xi = -1$. More specifically, $w^*_{ij} = \big(L + \frac{\lambda_1}{\|\mathbf{w}_{i\cdot}\|_2}\big)^{-1} L\big(v_{ij} + \frac{\lambda_2}{L}\big).$

Then we can summarize the solution as

$w^*_{ij} = \begin{cases} 0, & \text{if } |v_{ij}| \le \frac{\lambda_2}{L}, \\ \big(L + \frac{\lambda_1}{\|\mathbf{w}_{i\cdot}\|_2}\big)^{-1} L\big(|v_{ij}| - \frac{\lambda_2}{L}\big)\,\mathrm{sign}(v_{ij}), & \text{otherwise}. \end{cases} \quad (21)$
As the first case is already given in an analytic form, we only consider the second case with $|v_{ij}| > \frac{\lambda_2}{L}$. Denoting $u_{ij} = \big(|v_{ij}| - \frac{\lambda_2}{L}\big)\,\mathrm{sign}(v_{ij})$, the optimal solution can be formulated as a linear representation of $u_{ij}$:

$w^*_{ij} = c_i u_{ij}, \quad (22)$

where $c_i = \big(L + \frac{\lambda_1}{\|\mathbf{w}_{i\cdot}\|_2}\big)^{-1} L$ is a nonnegative scalar.

By substituting the linear representation (22) into the original minimization problem (20), we have

$\min_{c_i} \Big\{\frac{L}{2}\|c_i \mathbf{u}_{i\cdot} - \mathbf{v}_{i\cdot}\|_2^2 + \lambda_1\|c_i \mathbf{u}_{i\cdot}\|_2 + \lambda_2\|c_i \mathbf{u}_{i\cdot}\|_1\Big\}.$

Solving the above unconstrained minimization over the nonnegative scalar $c_i$ and substituting the optimal solution into the linear representation (22), we obtain

$w^*_{ij} = \bigg(\frac{L\langle \mathbf{u}_{i\cdot}, \mathbf{v}_{i\cdot}\rangle - \lambda_1\|\mathbf{u}_{i\cdot}\|_2 - \lambda_2\|\mathbf{u}_{i\cdot}\|_1}{L\|\mathbf{u}_{i\cdot}\|_2^2}\bigg)_{+} u_{ij}.$

Combining the above solution with the first case in Eq. (21), we have

$w^*_{ij} = \begin{cases} 0, & \text{if } |v_{ij}| \le \frac{\lambda_2}{L}, \\[4pt] \bigg(\dfrac{L\langle \mathbf{u}_{i\cdot}, \mathbf{v}_{i\cdot}\rangle - \lambda_1\|\mathbf{u}_{i\cdot}\|_2 - \lambda_2\|\mathbf{u}_{i\cdot}\|_1}{L\|\mathbf{u}_{i\cdot}\|_2^2}\bigg)_{+} u_{ij}, & \text{otherwise}, \end{cases}$

where $(\cdot)_+ = \max(\cdot, 0)$. □
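The closed form above is a sparse-group soft-thresholding: elementwise shrinkage at $\lambda_2/L$ followed by a group shrinkage of the surviving row (the scalar of Theorem 3 simplifies to $\max(0,\, 1 - \lambda_1/(L\|\mathbf{u}_{i\cdot}\|_2))$). A sketch, with a function name of our own and the quadratic coefficient L of Eq. (15) passed explicitly:

```python
import numpy as np

def prox_row(v, lam1, lam2, L):
    """Minimizer of (L/2)||w - v||^2 + lam1*||w||_2 + lam2*||w||_1 for one row.
    Step 1: elementwise soft-thresholding at lam2/L gives u.
    Step 2: rescale u by the nonnegative scalar from Theorem 3."""
    u = np.sign(v) * np.maximum(np.abs(v) - lam2 / L, 0.0)
    if not u.any():
        return np.zeros_like(v)
    c = (L * (u @ v) - lam1 * np.linalg.norm(u) - lam2 * np.abs(u).sum()) \
        / (L * (u @ u))
    return max(c, 0.0) * u
```

Because the row problem is convex, the output can be checked against random perturbations of the candidate minimizer.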
References
[1] R.K. Ando, T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data, J. Mach. Learn. Res. 6 (2005) 1817-1853.
[2] A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, Mach. Learn. 73 (3) (2008) 243-272.
[3] J. Attenberg, K. Weinberger, A. Dasgupta, A. Smola, M. Zinkevich, Collaborative email-spam filtering with the hashing trick, in: The Sixth Conference on Email and Anti-Spam, 2009.
[4] B. Bakker, T. Heskes, Task clustering and gating for Bayesian multitask learning, J. Mach. Learn. Res. 4 (2003) 83-99.
[5] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (1) (2009) 183-202.
[6] A. Beck, M. Teboulle, Gradient-based algorithms with applications to signal recovery problems, in: Y. Eldar, D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications, Cambridge University Press, New York, NY, 2010, pp. 42-88.
[7] A. Beck, M. Teboulle, Smoothing and first order methods: a unified framework, SIAM J. Optim. 22 (2) (2012) 557-580.
[8] S. Bickel, J. Bogojeska, T. Lengauer, T. Scheffer, Multi-task learning for HIV therapy screening, in: ICML, 2008.
[9] D. Calandriello, A. Lazaric, M. Restelli, Sparse multi-task reinforcement learning, in: NIPS, 2014.
[10] R. Caruana, Multitask learning, Mach. Learn. 28 (1) (1997) 41-75.
[11] G.H.-G. Chen, Forward-backward splitting techniques: theory and applications (Ph.D. thesis), University of Washington, 1994.
[12] J. Chen, L. Tang, J. Liu, J. Ye, A convex formulation for learning shared structures from multiple tasks, in: ICML, 2009.
[13] J. Chen, J. Zhou, J. Ye, Integrating low-rank and group-sparse structures for robust multi-task learning, in: SIGKDD, 2011.
[14] X. Chen, Q. Lin, S. Kim, J.G. Carbonell, E.P. Xing, Smoothing proximal gradient method for general structured sparse learning, in: UAI, 2011.
[15] I. Daubechies, R. DeVore, M. Fornasier, C.S. Güntürk, Iteratively reweighted least squares minimization for sparse recovery, Commun. Pure Appl. Math. 63 (2010) 1-38.
[16] J. Eckstein, Splitting methods for monotone operators with applications to parallel optimization (Ph.D. thesis), Massachusetts Institute of Technology, 1989.
[17] T. Evgeniou, C.A. Micchelli, M. Pontil, Learning multiple tasks with kernel methods, J. Mach. Learn. Res. 6 (2005) 615-637.
[18] H. Fei, J. Huan, Structured feature selection and task relationship inference for multi-task learning, Knowl. Inf. Syst. 2 (2013) 345-364.
[19] J. Ghosn, Y. Bengio, Multi-task learning for stock selection, in: NIPS, 1996.
[20] P. Gong, J. Ye, C. Zhang, Robust multi-task feature learning, in: SIGKDD, 2012.
[21] L. Jacob, F. Bach, J.-P. Vert, Clustered multi-task learning: a convex formulation, in: NIPS, 2008.
[22] A. Jalali, P.D. Ravikumar, S. Sanghavi, C. Ruan, A dirty model for multi-task learning, in: NIPS, 2010.
[23] Z. Kang, K. Grauman, F. Sha, Learning with whom to share in multi-task feature learning, in: ICML, 2011.
[24] S. Kim, E.P. Xing, Tree-guided group lasso for multi-task regression with structured sparsity, in: ICML, 2010.
[25] A. Kumar, H. Daumé III, Learning task grouping and overlap in multi-task learning, in: ICML, 2012.
[26] N.D. Lawrence, J.C. Platt, Learning to learn with the informative vector machine, in: ICML, 2004.
[27] S. Lee, J. Zhu, E.P. Xing, Adaptive multi-task lasso: with application to eQTL detection, in: NIPS, 2010.
[28] Z. Li, J. Liu, J. Tang, H. Lu, Robust structured subspace learning for data representation, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015) 2085-2098.
[29] Z. Li, J. Liu, Y. Yang, X. Zhou, H. Lu, Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Trans. Knowl. Data Eng. 26 (2014) 2138-2150.
[30] P.L. Lions, B. Mercier, Splitting algorithms for the sum of two nonlinear operators, SIAM J. Numer. Anal. 16 (6) (1979) 964-979.
[31] J. Liu, S. Ji, J. Ye, Multi-task feature learning via efficient ℓ2,1-norm minimization, in: UAI, 2009.
[32] K. Lounici, M. Pontil, A.B. Tsybakov, S.A. van de Geer, Taking advantage of sparsity in multi-task learning, in: COLT, 2009.
[33] J.J. Moreau, Fonctions convexes duales et points proximaux dans un espace hilbertien, C. R. Acad. Sci. Paris, Sér. A 255 (1962) 2897-2899.
[34] S. Negahban, M.J. Wainwright, Joint support recovery under high-dimensional scaling: benefits and perils of ℓ1,1-regularization, in: NIPS, 2008.
[35] S. Negahban, M.J. Wainwright, Estimation of (near) low-rank matrices with noise and high-dimensional scaling, in: ICML, 2010.
[36] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, Boston, Dordrecht, London, 2003.
[37] Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program. 103 (1) (2005) 127-152.
[38] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345-1359.
[39] S. Parameswaran, K. Weinberger, Large margin multi-task metric learning, in: NIPS, 2010.
[40] T.K. Pong, P. Tseng, S. Ji, J. Ye, Trace norm regularization: reformulations, algorithms, and multi-task learning, SIAM J. Optim. 20 (6) (2010) 3465-3489.
[41] J. Pu, Y.-G. Jiang, J. Wang, X. Xue, Multiple task learning using iteratively reweighted least square, in: IJCAI, 2013.
[42] J. Pu, Y.-G. Jiang, J. Wang, X. Xue, Which looks like which: exploring inter-class relationships in fine-grained visual categorization, in: ECCV, 2014.
[43] B. Rakitsch, C. Lippert, K. Borgwardt, O. Stegle, It is all in the noise: efficient multi-task Gaussian process inference with structured residuals, in: NIPS, 2013.
[44] B.D. Rao, K. Engan, S.F. Cotter, J. Palmer, K. Kreutz-Delgado, Subset selection in noise based on diversity measure minimization, IEEE Trans. Signal Process. (2003) 760-770.
[45] B.D. Rao, K. Kreutz-Delgado, An affine scaling methodology for best basis selection, IEEE Trans. Signal Process. 47 (1) (1999) 187-200.
[46] N. Rao, C. Cox, R. Nowak, T.T. Rogers, Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis, in: NIPS, 2013.
[47] M.T. Rosenstein, Z. Marx, L.P. Kaelbling, T.G. Dietterich, To transfer or not to transfer, in: NIPS 2005 Workshop on Inductive Transfer: 10 Years Later, 2005.
[48] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2003.
[49] A. Schwaighofer, V. Tresp, K. Yu, Learning Gaussian process kernels via hierarchical Bayes, in: NIPS, 2004.
[50] K. Swersky, J. Snoek, R.P. Adams, Multi-task Bayesian optimization, in: NIPS, 2013.
[51] S. Thrun, Is learning the n-th thing any easier than learning the first?, in: NIPS, 1996.
[52] S. Thrun, J. O'Sullivan, Discovering structure in multiple learning tasks: the TC algorithm, in: ICML, 1996.
[53] H. Wang, F. Nie, H. Huang, S.L. Risacher, C.H.Q. Ding, A.J. Saykin, L. Shen, ADNI, Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance, in: ICCV, 2011.
[54] X. Wang, C. Zhang, Z. Zhang, Boosted multi-task learning for face verification with applications to web image and video search, in: CVPR, 2009.
[55] Y. Xue, X. Liao, L. Carin, B. Krishnapuram, Multi-task learning for classification with Dirichlet process priors, J. Mach. Learn. Res. 8 (2007) 35-63.
[56] X. Yang, S. Kim, E.P. Xing, Heterogeneous multitask learning with joint sparsity constraints, in: NIPS, 2009.
[57] K. Yu, V. Tresp, A. Schwaighofer, Learning Gaussian processes from multiple tasks, in: ICML, 2005.
[58] J. Zhang, Z. Ghahramani, Y. Yang, Learning multiple related tasks using latent independent component analysis, in: NIPS, 2005.
[59] Y. Zhang, D.-Y. Yeung, Transfer metric learning by learning task relationships,in: SIGKDD, 2010.
[60] Y. Zhang, D.-Y. Yeung, Learning high-order task relationships in multi-tasklearning, in: IJCAI, 2013.
[61] Y. Zhang, D.-Y. Yeung, Q. Xu, Probabilistic multi-task feature selection, in: NIPS,2010.
[62] J. Zhou, J. Liu, V.A. Narayan, J. Ye, Modeling disease progression via multi-tasklearning, NeuroImage 78 (0) (2013) 233–248.
[63] J. Zhou, L. Yuan, J. Liu, J. Ye, A multi-task learning formulation for predictingdisease progression, in: SIGKDD, 2011.
[64] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R.Stat. Soc. Ser. B Stat. Methodol. 67 (2) (2003) 301–320.
Jian Pu received the Ph.D. degree from Fudan University, Shanghai, in 2014. He is a postdoctoral researcher at the Institute of Neuroscience, Chinese Academy of Sciences, Shanghai. His current research interests include machine learning, computer vision, and medical image computing.
Jun Wang received the Ph.D. degree from Columbia University, NY, in 2011. Currently, he is a professor in the School of Computer Science and Software Engineering, East China Normal University, Shanghai, China, and an adjunct faculty member of Columbia University, New York, USA. He is also affiliated with the Institute of Data Science and Technology, Alibaba Group, Seattle, USA. He was a research staff member in the Business Analytics and Mathematical Sciences Department at the IBM T.J. Watson Research Center, Yorktown Heights, NY. He has been the recipient of several awards and scholarships, including the "Youth 1000 Talents Plan" award in 2014 and the Jury thesis award from Columbia University in 2011. His research interests include machine learning, computer vision, and mobile intelligence.
Yu-Gang Jiang is an associate professor of computer science at Fudan University, Shanghai. His research focuses on novel algorithms and systems for big video data analysis. He is the lead architect of several best-performing video analytic systems in the annual U.S. NIST TRECVID evaluation and the European MediaEval evaluation. His work has led to many awards, including the "emerging leader in multimedia" award from IBM T.J. Watson Research in 2009, the early career faculty award from Intel and CCF in 2013, the 2014 ACM China Rising Star Award, and the 2015 ACM SIGMM Rising Star Award. He is an associate editor of Machine Vision and Applications and recently served as a program chair of ACM ICMR 2015. He received a Ph.D. in Computer Science from City University of Hong Kong and spent three years working at Columbia University before joining Fudan.
Xiangyang Xue received the B.S., M.S., and Ph.D. degrees in communication engineering from Xidian University, Xi'an, China, in 1989, 1992, and 1995, respectively. He joined the Department of Computer Science, Fudan University, Shanghai, China, in 1995, and has been a full professor since 2000. His current research interests include multimedia information processing and retrieval, pattern recognition, and machine learning. He has authored more than 100 research papers in these fields. He is an associate editor of the IEEE Transactions on Autonomous Mental Development, and an editorial board member of the Journal of Computer Research and Development and the Journal of Frontiers of Computer Science and Technology.