Clustered Multi-task Feature Learning for Attribute Prediction

Anonymous CVPR submission

Paper ID 1105

Abstract

Semantic attributes have been proposed to bridge the semantic gap between low-level feature representations and high-level semantic understanding of visual objects. Obtaining a good representation of semantic attributes usually requires learning from high-dimensional low-level features, which often suffers from the curse of dimensionality. Designing a good feature-selection approach would benefit attribute prediction and, in turn, its related applications. Since the semantic attributes of an object are usually "related", multi-task learning has been introduced in the literature for multi-attribute prediction, either by assuming that all attributes are somehow correlated or by manually dividing attributes into related groups. However, the performance of such approaches relies heavily on the assumed task structure: prediction degrades if that structure does not match the problem. What is desired is an approach that can automatically detect problem-specific clustering structures of the attributes. In this paper, we propose a novel clustered multi-task feature-selection approach utilizing K-means and group-sparsity regularizers, and develop an efficient alternating optimization algorithm. Experiments demonstrate that the proposed approach can automatically capture the task structure and hence yields a clear performance gain in attribute prediction compared with existing state-of-the-art approaches.

1. Introduction

Recent literature has witnessed the fast development of representations using semantic attributes, whose goal is to bridge the semantic gap between low-level feature representations and high-level semantic understanding of visual objects. Attributes refer to visual properties that help describe visual objects or scenes, such as "natural" scenes, "fluffy" dogs, or "formal" shoes. Visual attributes exist across object-category boundaries and have been employed in applications including object recognition [7, 6], face verification [23], and image search [17, 21].

In a real-world problem, one attribute is often related to some other attributes. For example, as shown in Figure 1,

Figure 1. Illustration of shoe images with three corresponding attributes: "High Heel", "Formal", and "Red". The illustration demonstrates that "High Heel" is highly correlated with "Formal" but weakly correlated with "Red".

a "high-heel" shoe is usually also considered a "formal" shoe. To achieve better generalization performance by capturing such correlation, recent work [3, 13] has introduced multi-task learning approaches into attribute prediction and ranking. However, assuming that such correlation exists across all attributes is too strong an assumption. For example, it is hard to tell whether "high-heel" or "formal" shoes are red. This suggests that in real applications the correlation may exist only within sub-groups of the attributes. A naive approach that learns attributes from all the groups jointly, whether or not they are truly related, would obviously lead to less-than-optimal performance due to the unnecessary and incorrect constraints. On the other hand, manually defining grouping/clustering structures for the attributes may be possible only for very simple problems with a small number of attributes. In short, existing approaches still lack the capability of automatically detecting grouping structures of the attributes to facilitate learning.

Good representations of semantic attributes are often built on top of high-dimensional, low-level features. Attribute learning directly based on such raw, high-dimensional features may suffer from the curse of dimensionality. Further, it is often reasonable to assume that not all low-level features contribute equally to all attributes. In other words, clustering structures may exist across the feature dimensions as well. Identifying grouping/clustering structures both among the attributes and across the dimensions would naturally contribute to improved learning. The latter is essentially an attribute-dependent feature-selection problem.


In this paper, we propose a regularization-based multi-task learning approach that automatically partitions the attributes into groups while simultaneously utilizing the group structure for attribute-dependent feature selection. We employ a K-means regularizer for attribute clustering, under the assumption that strong attribute correlation exists within each cluster. In addition, a group-sparsity regularizer is imposed on the objective function to encourage intra-cluster feature sharing and inter-cluster feature competition. Under this formulation, we also propose an alternating structure optimization algorithm that efficiently solves the problem. We verify the effectiveness and generalization capability of our approach on one synthetic dataset and three real image datasets. The results show that our approach outperforms state-of-the-art approaches on prediction accuracy and zero-shot learning.

In the remainder of the paper, we first discuss related work in Section 2. The proposed approach is presented in Section 3. Experiments and results are reported in Section 4. We conclude the paper in Section 5.

Notation: In this paper, we represent scalars, vectors, matrices, and sets as lower-case letters $x$, bold lower-case letters $\mathbf{x}$, capital letters $X$, and calligraphic capital letters $\mathcal{X}$, respectively. $\mathbf{x}_i$ and $\mathbf{x}^{(j)}$ denote the $i$-th column and the $j$-th row of the matrix $X$. The scalar $x_{ij}$ denotes the $(i,j)$-th entry of $X$. $\mathrm{tr}(X)$ denotes the trace of $X$. $\|\cdot\|_n$ and $\|\cdot\|_F$ denote the $\ell_n$ norm and the Frobenius norm, respectively.

2. Related Work

As our work is most closely related to multi-task learning and attribute learning, we briefly review the literature on these two topics below and conclude that neither current multi-task learning nor attribute-learning approaches can be adopted directly to automatically partition the attributes into correlation-based groups for feature selection.

2.1. Multi-task Learning

Assuming that several different but similar tasks are "related", multi-task learning aims to learn the tasks together to improve generalization performance by capturing their intrinsic correlation. Multi-task learning has been successfully applied to many computer vision applications, including visual classification [26], action recognition [19], and attribute learning [3, 13].

There have been two main ways to define task relatedness. The first assumes that all tasks share a similar parameter space. Ji and Ye [14] introduced the trace norm as a regularizer and obtained a low-rank projection matrix to capture task relatedness. Bach [2] and Jacob et al. [11] assume the tasks have special structures and assign tasks into groups, where tasks in the same group are closer to each other than tasks in different groups. Kang et al. [15] assign tasks to groups through integer programming. Kim and Xing [16] organize tasks in a tree structure, where tasks under the same node are closer to each other and relatedness among nodes depends on their depth in the tree. Similarly, Chen et al. [4] represent tasks in a graph, where task relatedness depends on the edge weight between two nodes. Some of these approaches consider task structures, but only to learn the shared parameter space.

The other way models task relatedness as a common subset of latent features shared by different tasks. Argyriou et al. [1] obtain a sparse projection matrix through an $\ell_1/\ell_q$-norm group-lasso regularizer. Jacob et al. [10] further introduce a graph of covariates as a prior for feature selection. Zhang et al. [28] provide a probabilistic interpretation of an appropriate $q$ in the generalized $\ell_1/\ell_q$ norm. Jalali et al. [12] and Gong et al. [9] introduce an extra $\ell_1$ or $\ell_1/\ell_q$-norm regularization term, respectively, to detect outliers. However, current multi-task feature-learning approaches select features by treating all tasks as a single group under the assumption that all tasks are strongly correlated, and thus cannot be adopted directly for our problem.

2.2. Attribute Learning

A visual attribute learner is a binary predictor that indicates whether or not a visual property is present. Standard approaches learn the attribute predictor independently for each attribute. Ferrari and Zisserman [8] presented a probabilistic generative model that learns attributes by distinguishing unary properties of single segments or patterns of alternating segments. Lampert et al. [18] considered zero-shot learning, where the test set consists of entirely unseen object categories and information is transferred from the training set to the test phase entirely through the attribute labels. Farhadi et al. [7] described unfamiliar objects and new categories by visual attributes of object parts, e.g., "has head", or appearance adjectives, e.g., "spotty". Farhadi et al. [6] first learned part and category detectors of objects and then described objects by the spatial arrangement of the attributes and their interactions. Kovashka et al. [17] and Scheirer et al. [21] used attributes to facilitate human-machine interaction in image search, enabling the user to specify precise semantic queries.

While most methods learn attributes independently, some initial steps have been taken towards modeling attribute relationships. Wang et al. [25] treated attributes as latent variables and captured the correlations among attributes using an undirected graphical model built from training data. Song et al. [23] proposed a method that models attribute relationships for face verification based on a discriminative distributed representation for attribute description. Siddiquie et al. [22] proposed a retrieval approach in which attribute correlations are treated as multi-attribute queries over the vocabulary.


Figure 2. Demonstration of feature selection based on grouping structure. Tasks in the same group are strongly related to each other and share the same subset of feature dimensions; tasks in different groups are weakly correlated and are characterized by different non-zero dimensions.

Regarding the use of the multi-task learning framework for attribute learning, Chen et al. [3] proposed a ranking framework that learns a common feature space among all attributes while detecting outliers; Jayaraman et al. [13] select appropriate subsets of features for different attributes by manually dividing the attributes into semantic groups and encouraging intra-group feature sharing and inter-group feature competition.

These approaches either make the strong assumption that all tasks are correlated or require human intervention to specify appropriate semantic groups. In contrast, our approach automatically detects the semantic groups and learns an effective subset of features representing the attributes.

3. Methodology

In this section, we first give a formal definition of the problem, then propose a clustered multi-task feature-selection framework together with an optimization algorithm to solve it.

3.1. Problem Definition

Suppose that we are given a multi-task learning problem with $m$ tasks (attributes); each task $i$ is associated with a set of $n$ training samples of dimension $d$: $(\mathbf{x}^i_1, y^i_1), \ldots, (\mathbf{x}^i_n, y^i_n) \subset \mathbb{R}^d \times \mathbb{R}$. We denote by $W = [\mathbf{w}_1, \ldots, \mathbf{w}_m] \in \mathbb{R}^{d \times m}$ the projection weight matrix to be estimated, where each column $\mathbf{w}_i$ is the weight vector of the $i$-th task. Tasks may exhibit grouping structures, as illustrated in Figure 2. Tasks in the same group are highly correlated and thus share the same subset of feature dimensions; tasks in different groups are weakly correlated and have different subsets of non-zero feature dimensions. Our goal is to design an approach that automatically detects such group structure and utilizes the group correlation information for feature selection.

This problem can be formulated as:

$$\min_{W, \mathcal{I}} \; \mathcal{L}(W \mid X, Y) + \alpha \mathcal{F}(\mathcal{I}) + \beta \mathcal{G}(W) \tag{1}$$

where $\mathcal{L}(W \mid X, Y)$ is the logistic regression loss $\sum_{i=1}^{m} \sum_{j=1}^{n} \log\left(1 + \exp\left(-y^i_j \, (\mathbf{x}^i_j)^T \mathbf{w}_i\right)\right)$, $\mathcal{I}$ is the clustering assignment of the tasks, $\mathcal{F}(\cdot)$ is the regularization term encouraging good clustering, and $\mathcal{G}(\cdot)$ is the regularizer encouraging feature sharing within clusters and feature competition across clusters.
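For concreteness, the loss term can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' code; it assumes all tasks share one design matrix X and that labels are in {-1, +1}):

```python
import numpy as np

def logistic_loss(W, X, Y):
    """Multi-task logistic loss L(W | X, Y).

    W : (d, m) weight matrix, one column per task (attribute).
    X : (n, d) design matrix shared by all tasks (an assumption of this sketch).
    Y : (n, m) labels in {-1, +1}, one column per task.
    """
    # Margins y_j^i * (x_j^T w_i) for every sample j and task i.
    margins = Y * (X @ W)
    # log(1 + exp(-margin)), summed over samples and tasks;
    # logaddexp(0, z) = log(1 + exp(z)) is numerically stable.
    return np.logaddexp(0.0, -margins).sum()
```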

3.2. Clustered Multi-task Feature Selection

In this subsection, we introduce the proposed choices for the two regularizers. A popular regularizer encouraging good clustering is the sum of squared errors (SSE) used in K-means clustering:

$$\mathcal{F}(\mathcal{I}) = \sum_{j=1}^{k} \sum_{v \in \mathcal{I}_j} \|\mathbf{w}_v - \bar{\mathbf{w}}_j\|_2^2 \tag{2}$$

where $\mathcal{I}_j$ denotes the $j$-th cluster, whose mean is $\bar{\mathbf{w}}_j$. According to [5, 27], it can also be written as:

$$\sum_{j=1}^{k} \sum_{v \in \mathcal{I}_j} \|\mathbf{w}_v - \bar{\mathbf{w}}_j\|_2^2 = \mathrm{tr}(W^T W) - \mathrm{tr}(F^T W^T W F) \tag{3}$$

where $F \in \mathbb{R}^{m \times k}$ is an orthogonal cluster indicator matrix with $F_{i,j} = \frac{1}{\sqrt{n_j}}$ if $i \in \mathcal{I}_j$ and $F_{i,j} = 0$ otherwise. This function imposes the constraint that each vector in a cluster should be close to the cluster's mean vector.
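The identity in Eq. (3) is easy to check numerically; the following sketch (illustrative only, with a fixed toy assignment) builds the indicator matrix F from a hard clustering and compares both sides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 30, 10, 3
W = rng.normal(size=(d, m))
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])   # toy hard assignment

# Orthogonal cluster indicator: F[i, j] = 1/sqrt(n_j) if task i is in cluster j.
F = np.zeros((m, k))
for j in range(k):
    idx = np.flatnonzero(labels == j)
    F[idx, j] = 1.0 / np.sqrt(len(idx))

# Left-hand side of Eq. (3): squared distances of each w_v to its cluster mean.
sse = sum(
    np.sum((W[:, labels == j] - W[:, labels == j].mean(axis=1, keepdims=True)) ** 2)
    for j in range(k)
)

# Right-hand side: tr(W^T W) - tr(F^T W^T W F).
G = W.T @ W
assert np.isclose(sse, np.trace(G) - np.trace(F.T @ G @ F))
```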

By employing the above SSE function and adding an additional term $\mathrm{tr}(W^T W)$ to improve generalization, [29] derived a relaxed clustered multi-task learning penalty:

$$\min_{M} \; \alpha \, \eta (1 + \eta) \, \mathrm{tr}\!\left(W (\eta I + M)^{-1} W^T\right) \tag{4}$$
$$\text{s.t.} \quad \mathrm{tr}(M) = k, \; M \preceq I, \; M \in \mathbb{S}^m_+$$

where $M = F F^T$ potentially embeds the cluster assignment information, and $\alpha$ and $\eta$ are hyperparameters.

Given the clustering assignment $\mathcal{I}$, the following group regularizer is proposed to encourage intra-group feature sharing and inter-group feature competition:

$$\mathcal{G}(W) = \sum_{i=1}^{d} \sum_{j=1}^{k} \|\mathbf{w}^{(i)}_{\mathcal{I}_j}\|_2 = \sum_{j=1}^{k} \mathrm{tr}\!\left(\sqrt{W I_j W^T}\right) \tag{5}$$

where $\mathbf{w}^{(i)}_{\mathcal{I}_j}$ is the row vector containing the elements $w_{ik}$ with $\mathbf{w}_k \in \mathcal{I}_j$, and $I_j$ is a diagonal matrix with $(I_j)_{ii} = 1$ if $\mathbf{w}_i \in \mathcal{I}_j$ and $(I_j)_{ii} = 0$ otherwise. This regularizer first applies an $\ell_2$ norm to the row vector of each group, "collapsing" each group into a column vector, and then applies an $\ell_1$ norm for sparsity to select features.
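A direct way to evaluate Eq. (5) given a hard assignment is sketched below (illustrative code; `labels` holds the cluster index of each task, as in the previous sketch):

```python
import numpy as np

def group_sparsity(W, labels, k):
    """G(W) = sum_i sum_j || w_{I_j}^{(i)} ||_2  (Eq. 5).

    For every feature dimension i and cluster j, take the l2 norm of the
    row entries of W restricted to the tasks in cluster j, then sum them
    (an l1 norm over the collapsed rows, which drives feature selection).
    """
    total = 0.0
    for j in range(k):
        Wj = W[:, labels == j]                       # (d, |I_j|) block of cluster j
        total += np.linalg.norm(Wj, axis=1).sum()    # sum_i ||row_i of Wj||_2
    return total
```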


Putting all these terms together, the proposed objective function for clustered multi-task feature selection with group detection is:

$$\min_{W, M} \; \mathcal{L}(W \mid X, Y) + \beta \sum_{j=1}^{k} \mathrm{tr}\!\left(\sqrt{W I_j W^T}\right) + \alpha \, \eta (1 + \eta) \, \mathrm{tr}\!\left(W (\eta I + M)^{-1} W^T\right) \tag{6}$$
$$\text{s.t.} \quad \mathrm{tr}(M) = k, \; M \preceq I, \; M \in \mathbb{S}^m_+$$

where the group information $\mathcal{I}$ is embedded in $M$ and is estimated by the method described in the next subsection. The key idea is to use the K-means regularizer to partition the tasks into groups within which strong correlation exists; feature selection based on such group structures then ensures that appropriate feature subsets are selected to represent the respective semantic attributes.

The method is related to some existing methods in the literature, e.g., [13], which requires the clustering assignment of the attributes $\mathcal{I}$ as input. However, the optimal clustering assignment is not always available. A non-optimal clustering assignment dramatically degrades the performance of the method in [13], as shown in the experiment section. In contrast, the proposed method does not require such input: it learns the grouping of the attributes automatically, and the learned grouping is then used to update the tasks.

3.3. Cluster Assignment Identification

Since the K-means regularizer is spectrally relaxed, we cannot obtain the cluster assignment information $\mathcal{I}$ directly from $M$. In this subsection we design a procedure to estimate the cluster assignment, summarized in Algorithm 1.

We first need a good approximation of the cluster indicator matrix $F$. Given $M$, we apply the eigendecomposition $M = U \Lambda U^T$, where each column of $U$ is an eigenvector and each diagonal element of $\Lambda$ is an eigenvalue. The $k$ columns of $U$ with the $k$ largest eigenvalues in $\Lambda$ then give an approximation of the cluster indicator matrix $F \in \mathbb{R}^{m \times k}$. The number of clusters can be detected automatically from the magnitudes of the eigenvalues; in our approach, we keep all eigenvalues greater than $10^{-8}$. Note that the cluster number can also be specified by the user as supervision.

After obtaining an approximation of $F$, we use a technique similar to [27] to obtain the assignment information. Specifically, QR decomposition with column pivoting is first applied to $F$:

$$F^T = Q [R_{11}, R_{12}] P^T \tag{7}$$

where $Q$ is a $k \times k$ orthogonal matrix, $R_{11}$ is a $k \times k$ upper triangular matrix, and $P$ is a permutation matrix. Then we calculate the matrix $\hat{R}$ by

$$\hat{R} = [I_k, R_{11}^{-1} R_{12}] \, P^T. \tag{8}$$

The assignment information is implied by $\hat{R}$: the cluster membership of each task (column) is determined by the row index of the largest element in absolute value in the corresponding column of $\hat{R}$ (Algorithm 1).

Algorithm 1 Obtain cluster assignment information
Input: $M$;
Output: cluster assignment vector $\mathbf{c}$;
1: Apply the eigendecomposition $M = U \Lambda U^T$;
2: Obtain $F$ from the $k$ columns of $U$ with the $k$ largest eigenvalues;
3: Apply QR decomposition with column pivoting: $F^T = Q [R_{11}, R_{12}] P^T$;
4: Calculate $\hat{R} = [I_k, R_{11}^{-1} R_{12}] P^T$;
5: For each task $i$, set $c_i = \arg\max_j |\hat{R}_{ji}|$;
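A minimal Python sketch of Algorithm 1, assuming SciPy for the pivoted QR (the function name and tolerance default are ours, not from the paper):

```python
import numpy as np
from scipy.linalg import qr

def cluster_assignment(M, tol=1e-8):
    """Recover a hard cluster assignment from the relaxed matrix M (Alg. 1)."""
    evals, evecs = np.linalg.eigh(M)        # eigendecomposition of M
    F = evecs[:, np.abs(evals) > tol]       # (m, k): detected number of clusters
    k, m = F.shape[1], F.shape[0]
    # QR with column pivoting: F.T[:, piv] = Q @ R, i.e. F^T P = Q [R11, R12].
    Q, R, piv = qr(F.T, pivoting=True)
    R11, R12 = R[:, :k], R[:, k:]
    # R_hat = [I_k, R11^{-1} R12] P^T: undo the pivoting permutation.
    R_hat = np.hstack([np.eye(k), np.linalg.solve(R11, R12)])
    R_hat = R_hat[:, np.argsort(piv)]
    # Cluster of task i = row index of the largest |entry| in column i.
    return np.argmax(np.abs(R_hat), axis=0)
```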

3.4. Optimization Algorithm

We propose an alternating optimization algorithm to solve the problem in Eq. (6), which updates the task weight matrix $W$ and the grouping $M$ alternately until a convergence criterion is satisfied. The whole procedure is summarized in Algorithm 2. The details of solving for $W$ and $M$ are presented below.

Optimization of $M$. The optimal $M$ can be obtained by solving:

$$\min_{M} \; \mathrm{tr}\!\left[W (\eta I + M)^{-1} W^T\right] \tag{9}$$
$$\text{s.t.} \quad \mathrm{tr}(M) = k, \; M \preceq I, \; M \in \mathbb{S}^m_+$$

Let $W = U \Sigma V^T$ be the SVD of $W$ and $M = V \Lambda V^T$ be the eigendecomposition of $M$, where $\Lambda$ is a diagonal matrix. Then we have

$$\min_{\Lambda} \; \mathrm{tr}\!\left(U \Sigma V^T V (\eta I + \Lambda)^{-1} V^T V \Sigma U^T\right) \tag{10}$$
$$\text{s.t.} \quad \mathrm{tr}(V \Lambda V^T) = k, \; 0 \leq \Lambda \leq 1$$

This problem is equivalent to:

$$\min_{\lambda_1, \ldots, \lambda_q} \; \sum_{i=1}^{q} \frac{\sigma_i^2}{\eta + \lambda_i} \tag{11}$$
$$\text{s.t.} \quad \sum_{i=1}^{q} \lambda_i = k, \; 0 \leq \lambda_i \leq 1$$

where $\Lambda = \mathrm{diag}([\lambda_1, \lambda_2, \ldots, \lambda_q])$ and $\Sigma = \mathrm{diag}([\sigma_1, \sigma_2, \ldots, \sigma_m])$.
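Eq. (11) is a separable convex problem. One standard way to solve it, consistent with but not necessarily identical to the solver used in [29], follows from the KKT conditions: $\lambda_i = \mathrm{clip}(\sigma_i/\sqrt{\nu} - \eta,\, 0,\, 1)$ for a multiplier $\nu > 0$ chosen so the sum constraint holds, which can be found by bisection since the sum is monotone in $\nu$. A sketch:

```python
import numpy as np

def solve_lambda(sigma, eta, k, iters=100):
    """Solve min sum_i sigma_i^2/(eta + lam_i) s.t. sum lam_i = k, 0 <= lam <= 1.

    Assumes k <= len(sigma). A sketch of one standard KKT/bisection solution,
    not necessarily the exact routine used in [29].
    """
    def lam(nu):
        return np.clip(sigma / np.sqrt(nu) - eta, 0.0, 1.0)

    lo, hi = 1e-12, 1e12                 # bracket for the multiplier nu
    for _ in range(iters):
        mid = np.sqrt(lo * hi)           # bisect on a log scale
        if lam(mid).sum() > k:
            lo = mid                     # sum too large -> increase nu
        else:
            hi = mid
    return lam(np.sqrt(lo * hi))
```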


Optimization of $W$. By first squaring the mixed-norm regularizer, the objective can be bounded from above as follows:

$$\left( \sum_{i=1}^{d} \sum_{j=1}^{k} \|\mathbf{w}^{(i)}_{\mathcal{I}_j}\|_2 \right)^2 \leq \sum_{i=1}^{d} \sum_{j=1}^{k} \frac{\|\mathbf{w}^{(i)}_{\mathcal{I}_j}\|_2^2}{\delta_{ij}} \tag{12}$$

where the $\delta_{ij}$ are positive dummy variables satisfying $\sum_{i,j} \delta_{ij} = 1$. The $\delta_{ij}$ can then be updated by holding the equality:

$$\delta_{ij} = \|\mathbf{w}^{(i)}_{\mathcal{I}_j}\|_2 \Big/ \sum_{i,j} \|\mathbf{w}^{(i)}_{\mathcal{I}_j}\|_2 \tag{13}$$

For fixed $M$ and $\delta_{ij}$, each weight vector $\mathbf{w}$ can then be updated by gradient-type approaches applied to:

$$\min_{W} \; \mathcal{L}(W \mid X, Y) + \gamma \, \mathrm{tr}\!\left(W (\eta I + M)^{-1} W^T\right) + \beta \sum_{i=1}^{d} \sum_{j=1}^{k} \frac{\|\mathbf{w}^{(i)}_{\mathcal{I}_j}\|_2^2}{\delta_{ij}} \tag{14}$$
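The $\delta$ update of Eq. (13) is a simple normalization of the per-group row norms; a sketch (illustrative, reusing the `labels` convention above), after which $W$ can be updated by any gradient method on the smooth objective of Eq. (14):

```python
import numpy as np

def update_delta(W, labels, k, eps=1e-12):
    """delta_ij = ||w_{I_j}^{(i)}||_2 / sum_{i,j} ||w_{I_j}^{(i)}||_2  (Eq. 13)."""
    d = W.shape[0]
    norms = np.empty((d, k))
    for j in range(k):
        # l2 norm of row i restricted to the tasks in cluster j.
        norms[:, j] = np.linalg.norm(W[:, labels == j], axis=1)
    return norms / (norms.sum() + eps)   # eps guards against an all-zero W
```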

Optional: imposing cluster structure as supervision. After obtaining the cluster assignment from $M$ and before calculating $\delta$, an optional step can be taken if a specific cluster structure is preferred based on domain knowledge. For example, if we know a priori that $\mathbf{w}_i$ and $\mathbf{w}_j$ should be in the same cluster, an additional operation can be applied to the cluster assignment vector $\mathbf{c}$ to ensure those two weight vectors share a cluster. Note that either the whole group structure or only partial group information can be imposed as supervision.

Algorithm 2 Optimization algorithm
Input: $W_0$, $\mathcal{I}$, maximum iteration number $q$;
Output: $M$, $W$;
1: Set $W_0 = 0$;
2: for $i = 1$ to $q$ do
3:   Update $M$ by solving Eq. (10);
4:   Calculate the cluster assignment vector $\mathbf{c}$ by Algorithm 1;
5:   [Optional] Edit $\mathbf{c}$ to impose a prior cluster structure;
6:   Update $W$ by solving Eq. (14);
7:   if the stopping criterion is satisfied then
8:     break;
9:   end if
10: end for
11: Set $M = M_{i+1}$, $W = W_{i+1}$;

3.5. Complexity Analysis

Following the previous discussion, we use $n$, $d$, $m$, and $k$ to denote the number of instances, the feature dimension, the number of attributes, and the number of clusters, respectively. During cluster assignment identification, the computation includes the eigendecomposition, taking $O(dm^2)$; the QR decomposition, taking $O(dk^2)$; and the computation of $\hat{R}$, taking $O(k^3)$. Since $d > m > k$, calculating the assignment by Algorithm 1 takes $O(dm^2)$ in total. For the optimization process, the update of the mixed norm takes $O(dk)$. The update of $M$ takes $O(mnd^2) + O(md^2) + O(mnd) + O(dm^2)$; since $n > d > m$, this complexity is $O(mnd^2)$. All in all, the total computational complexity of each iteration is $O(mnd^2)$.

The convergence of our algorithm can be proved using techniques similar to those in [1].

4. Experiments

In this section, we first verify the effectiveness of our proposed approach in recovering correct cluster structures on a synthetic dataset; we then evaluate attribute prediction and zero-shot learning capability on three real-world image datasets.

4.1. Simulation Experiment

The synthetic dataset is constructed by a procedure similar to [11, 29]. Specifically, the dataset consists of 5 clusters, each containing 10 tasks, and each task is represented by a weight vector of dimension $d = 30$. Denote by $\mathbf{w}^c_i$ the weight vector of the $i$-th task in the $c$-th cluster; then $\mathbf{w}^c_i$ can be expressed as the sum of the cluster center $\bar{\mathbf{w}}^c$ and a task-specific component $\mathbf{v}^c_i$: $\mathbf{w}^c_i = \bar{\mathbf{w}}^c + \mathbf{v}^c_i$. The cluster centers and the task-specific components are generated independently.

The cluster centers $\bar{\mathbf{w}}^c$ are first drawn from a normal distribution, and we then randomly set half of their dimensions to zero. Note that we keep each $\bar{\mathbf{w}}^c$ orthogonal to the other cluster centers by choosing appropriate locations for the non-zero entries. Each task-specific component $\mathbf{v}^c_i$ is first drawn from the same normal distribution; then the dimensions that are zero in its cluster center are set to zero as well.

For each task, we generate 60 samples for training and 1000 samples for testing. Denoting the data matrix and the corresponding response of task $i$ as $X_i$ and $\mathbf{y}_i$ respectively, the responses are generated as $\mathbf{y}_i = X_i \mathbf{w}^c_i + \boldsymbol{\varepsilon}_i$, where $\boldsymbol{\varepsilon}_i$ is a noise vector drawn from a normal distribution.
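For reference, one plausible reading of this construction is sketched below (assumed details: the noise scale, and the use of disjoint non-zero supports as a way to keep the centers orthogonal; the paper only states that half of the dimensions are zeroed and that the centers are kept orthogonal):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clusters, tasks_per_cluster, n_train = 30, 5, 10, 60

support = rng.permutation(d)                  # shuffle feature indices
blocks = np.array_split(support, n_clusters)  # disjoint supports -> orthogonal centers

tasks = []
for c in range(n_clusters):
    center = np.zeros(d)
    center[blocks[c]] = rng.normal(size=len(blocks[c]))    # cluster center
    for _ in range(tasks_per_cluster):
        part = np.zeros(d)
        part[blocks[c]] = rng.normal(size=len(blocks[c]))  # task-specific component
        w = center + part                                  # w_i^c = center + component
        X = rng.normal(size=(n_train, d))
        y = X @ w + 0.1 * rng.normal(size=n_train)         # y_i = X_i w_i^c + eps_i
        tasks.append((X, y, w))
```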

(a) Groundtruth model (b) Learned model

Figure 3. The learned projection matrix and the corresponding groundtruth in the simulation experiment. White entries are zeros and black entries are non-zeros.


Dataset          | aPascal/aYahoo | AwA   | SUN
# of images      | 15339          | 30475 | 14340
# of attributes  | 64             | 85    | 102
# of classes     | 32             | 50    | 611
# of features    | 2429           | 1200  | 1112

Table 1. Statistics of the real-world datasets.

We verify effectiveness by applying our approach to this dataset and comparing the learned projection matrix with the groundtruth. Figure 3 shows an example of the learned projection matrix (Figure 3(b)) alongside the groundtruth (Figure 3(a)), where white entries represent zeros and black entries represent non-zeros. The result shows that our approach correctly captures the group-sparse structure.

4.2. Real Data Experiments

We compare our approach with three regularization-based feature-selection approaches:

No sharing. Each attribute is treated as its own cluster by extending the lasso regularizer [24] to the multi-task learning framework. Specifically, the approach is formulated as:

$$\arg\min_{W} \; \mathcal{L}(W \mid X, Y) + \rho \sum_{i=1}^{m} \|\mathbf{w}_i\|_1 \tag{15}$$

All sharing. All attributes are treated as one group by using an $\ell_{2,1}$ norm for feature selection [1].

Off-line grouped. Attributes are grouped by off-line K-means clustering for feature selection [13]. Since group information may not always be available, for a fair comparison we first learn a model $W_0$ independently for each attribute by logistic regression; K-means is then applied to $W_0$ to estimate the attribute clusters, followed by the group feature selection of [13].

For all approaches, the hyperparameters are selected via cross-validation. Since the number of clusters $k$ cannot be known without prior knowledge, we also select $k$ by prediction accuracy on a small subset of the data.

The experiments are conducted on three benchmark datasets: aYahoo [7], Animals with Attributes (AwA) [18], and the SUN attribute dataset [20]. All datasets are standardized to zero mean and normalized by the standard deviation. Attributes with extremely unbalanced labels are eliminated, e.g., when fewer than 0.1% of samples are labeled 0 or 1. When only continuous attribute labels are given, we binarize them with a threshold of 0.5. The statistics of the datasets are summarized in Table 1.

To obtain a good representation of the high-level attributes, we require features that capture both spatial and context information. We therefore construct the features by pooling a variety of feature histograms, including GIST, HoG, and SSIM.

4.2.1 Attribute Prediction Accuracy

We first compare our proposed approach with the baselines on attribute prediction accuracy. For the aPascal/aYahoo and AwA datasets we use the predefined seen/unseen splits published with the datasets. For the SUN dataset, 60% of the categories are randomly split out as "seen" categories in each round, with the rest as "unseen" categories. During training, 50% of the samples are randomly and carefully drawn from each seen category to keep the positive and negative attribute labels balanced. The remaining samples from "seen" classes and all samples from "unseen" classes are used for testing.

Overall accuracy. Table 2 shows the average prediction accuracy of each approach over all attributes, averaged over 10 rounds. For both "seen" and "unseen" categories, the "no sharing" and "all sharing" approaches yield similar accuracies, both lower than the "off-line grouped" approach. Our proposed approach further outperforms the "off-line grouped" approach by 2%~4%. The "off-line grouped" approach exploits more correlation during learning than "no sharing" and, compared with "all sharing", decorrelates weakly correlated attributes by off-line clustering, thus achieving better prediction performance. However, off-line K-means usually cannot obtain an optimal clustering structure on high-dimensional data and therefore typically yields a suboptimal result. Our approach iteratively optimizes both the clustering structure and the projection model, achieving the best performance.

Figure 4 illustrates the prediction accuracies when only 20 attributes of the AwA dataset are involved in learning. Considering each attribute individually, our approach still outperforms the baseline approaches in most cases.

Human vs. machine. We also compare the proposed approach with human-defined semantic groups. Following [13], the experiments are conducted on aPY-25, with 25 attributes in 3 groups, and on AwA, with 81 attributes in 9 groups. On aPY-25, the proposed approach achieves accuracies of 64.24%\60.03% on "seen"\"unseen" categories, compared with 64.20%\60.07% achieved with human-defined groups. On AwA, the proposed approach achieves 62.54%\58.37%, compared with 62.50%\58.34% with human-defined groups. The results show that our proposed approach achieves performance comparable to human-defined groups.

Learning curves. We next examine the learning curves of prediction performance with respect to the number of attributes and training samples on the three datasets, shown in Figure 5. In this experiment we intentionally hold out only a small portion of data for training, to observe the performance of multi-task learning with limited training samples [1].


Dataset           | aPascal/aYahoo              | AwA                         | SUN
Method            | Seen         | Unseen       | Seen         | Unseen       | Seen         | Unseen
no sharing        | 0.5969±0.021 | 0.5665±0.022 | 0.5979±0.010 | 0.5581±0.011 | 0.6323±0.021 | 0.5985±0.022
all sharing       | 0.5967±0.020 | 0.5663±0.022 | 0.5976±0.011 | 0.5587±0.012 | 0.6326±0.021 | 0.6020±0.022
off-line grouped  | 0.6105±0.018 | 0.5826±0.019 | 0.6053±0.015 | 0.5622±0.018 | 0.6469±0.025 | 0.6165±0.027
Proposed          | 0.6363±0.014 | 0.6011±0.015 | 0.6254±0.007 | 0.5837±0.008 | 0.6682±0.011 | 0.6324±0.013

Table 2. Average prediction accuracies over all attributes on seen and unseen categories (higher is better).

Figure 4. Prediction accuracy for 20 attributes, comparing the proposed approach with the baseline approaches. For most attributes, our approach achieves significant performance gains over all three baselines.

The left three plots in Fig. 5 show the learning curves with different numbers of attributes involved in the learning task. For all approaches, the prediction accuracies first rise as the number of attributes increases and then drop. This means that, in the first phase, more attributes provide more information that can be utilized for learning. However, as the number of attributes keeps growing, noise and weak correlations are also misused during learning, causing the prediction accuracies to drop in the second phase. Since our proposed approach captures correlation based on the grouped structure, it achieves larger performance gains as the number of attributes increases; e.g., on the aYahoo dataset a 2% gain over the "all sharing" approach is achieved with 5 attributes, versus a 4% gain with 19 attributes.

The right three plots in Fig. 5 show the learning curves with different numbers of training samples. All approaches achieve higher performance as the number of training samples increases, while the performance gain of the proposed approach shrinks. This suggests that the "correct" amount of correlation captured by the group structure contributes more when less information is available to share. More shared information brings additional knowledge but also noise during learning for all approaches, which decreases the performance gain.

Case demonstration. For an intuitive understanding, we illustrate some successful and failed attribute predictions of the proposed approach.

The left three columns of Figure 6(a) show examples where our approach successfully predicts attributes such as "Eating", "Wood", and "Sport" while the "no sharing" approach fails. In such cases, the attribute is either not very salient, e.g., the third image in the first row is easily confused with a generic "traffic" scene, or appears only in a small portion of the image, e.g., the "Eating" attribute of the first image in the second row is mainly reflected by the pizza in the left corner. Such attributes are not easily detected from the visual image alone but can be inferred through correlation with other attributes; e.g., the attribute "wood" usually co-occurs with attributes like "table" in the "dorm" scene. The right two columns show examples where our approach fails but "no sharing" succeeds. The main reason is that the image content is dominated by other objects, like the buildings in the fourth column, which may lead to a "no flower" prediction under the influence of other attributes frequently associated with building images.

The left three columns of Figure 6(b) show examples where the proposed approach successfully predicts the attributes but the "all sharing" approach fails. When all attributes are learned together, the prediction is easily degraded by instances with weak correlation, e.g., the expectation that farms should have the attribute "grass". The proposed approach can filter out such inappropriate information by partitioning the attributes into groups. The right two columns show failure cases of our approach where "all sharing" succeeds. From our observation, in such cases the objects reflecting the attributes are usually blurred (the fifth image in row one) or occupy a very small region (the fourth image in row one), which is hard for our approach to cluster and detect. The correlation captured by "all sharing" may benefit such predictions.


Figure 5. Learning curves of prediction accuracy with respect to the number of attributes and training samples on (a) aYahoo, (b) SUN, and (c) AwA.


4.2.2 Zero-shot Learning

We also experiment with zero-shot learning on all three datasets. Zero-shot learning aims to train a classifier on samples from seen categories and classify new samples into unseen categories. We adopt the Direct Attribute Prediction (DAP) framework proposed in [18], with the attribute prediction probabilities from each approach as input. Since the SUN dataset provides only continuous image-level attribute labels, we construct class-level attribute labels by thresholding the average attribute label values of all samples in each class. The same "seen"\"unseen" category splits are adopted as in the previous experiments. Average classification accuracies over 10 rounds are reported in Table 3. On aYahoo and AwA, our approach achieves significant performance gains over the baseline approaches.
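In its simplest form, DAP scores each unseen class by the product of its per-attribute likelihoods; the sketch below omits the class-prior normalization used in [18] and uses illustrative names:

```python
import numpy as np

def dap_classify(attr_prob, class_attr):
    """Simplified Direct Attribute Prediction [18].

    attr_prob  : (n, m) predicted probability of each attribute per image.
    class_attr : (c, m) binary attribute signature of each unseen class.
    Returns the index of the unseen class maximizing the log-product of
    per-attribute likelihoods (the attribute-prior term of [18] is omitted).
    """
    logp = np.log(np.clip(attr_prob, 1e-12, 1.0 - 1e-12))
    log1m = np.log(np.clip(1.0 - attr_prob, 1e-12, 1.0))
    # scores[i, z] = sum_m [ a_m^z log p_m(x_i) + (1 - a_m^z) log(1 - p_m(x_i)) ]
    scores = logp @ class_attr.T + log1m @ (1.0 - class_attr).T
    return scores.argmax(axis=1)
```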

(a) Successful (left three columns) and failed (right two columns) predictions of the proposed approach compared with "no sharing".

(b) Successful (left three columns) and failed (right two columns) predictions of the proposed approach compared with "all sharing".

Figure 6. Image illustration of prediction results.

Method            | aYahoo | AwA    | SUN
No sharing        | 0.1822 | 0.2945 | 0.1866
All sharing       | 0.1834 | 0.2953 | 0.1842
Off-line grouped  | 0.2052 | 0.3085 | 0.2010
Proposed          | 0.2262 | 0.3258 | 0.2133

Table 3. Zero-shot learning accuracy on the three real datasets.

The large number of categories in the SUN dataset makes the classification problem very hard, leading to low performance for all approaches; our approach still works better than the baselines.

5. Conclusions

In this paper, we proposed a clustered multi-task feature-learning framework for semantic attribute prediction. Our approach employs both K-means and group-sparsity regularizers for feature selection. The K-means regularizer partitions the attributes into groups such that strong correlation exists among attributes within a group and weak correlation exists between groups. The group-sparsity regularizer encourages intra-group feature sharing and inter-group feature competition. With an efficient alternating optimization algorithm, the proposed approach obtains a good group structure and selects appropriate features to represent the semantic attributes. The approach was verified on both synthetic and real image datasets in comparison with state-of-the-art approaches. The results show the effective group-structure identification capability of our method, as well as significant gains in both attribute prediction accuracy and zero-shot classification accuracy.


References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. J. Mach. Learn. Res., 73(3):243–272, Dec. 2008.
[2] F. R. Bach. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res., 9:1179–1225, June 2008.
[3] L. Chen, Q. Zhang, and B. Li. Predicting multiple attributes via relative multi-task learning. In Proc. of CVPR'14, pages 1027–1034, June 2014.
[4] X. Chen, S. Kim, Q. Lin, J. G. Carbonell, and E. P. Xing. Graph-structured multi-task regression and an efficient optimization method for general fused lasso. CoRR, 2010.
[5] C. Ding and X. He. K-means clustering via principal component analysis. In Proc. of ICML'04, 2004.
[6] A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In Proc. of CVPR'10, pages 2352–2359, June 2010.
[7] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Proc. of CVPR'09, pages 1778–1785, June 2009.
[8] V. Ferrari and A. Zisserman. Learning visual attributes. In Proc. of NIPS'08, pages 433–440, 2008.
[9] P. Gong, J. Ye, and C. Zhang. Robust multi-task feature learning. In Proc. of KDD'12, 2012.
[10] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proc. of ICML'09, pages 433–440, 2009.
[11] L. Jacob, J.-P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Proc. of NIPS'09, pages 745–752, 2009.
[12] A. Jalali, P. D. Ravikumar, and S. Sanghavi. A dirty model for multiple sparse regression. CoRR, 2011.
[13] D. Jayaraman, F. Sha, and K. Grauman. Decorrelating semantic visual attributes by resisting the urge to share. In Proc. of CVPR'14, pages 1629–1636, June 2014.
[14] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proc. of ICML'09, pages 457–464, 2009.
[15] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In Proc. of ICML'11, pages 521–528, 2011.
[16] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In Proc. of ICML'10, pages 543–550, 2010.
[17] A. Kovashka, D. Parikh, and K. Grauman. WhittleSearch: Image search with relative attribute feedback. In Proc. of CVPR'12, pages 2973–2980, June 2012.
[18] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Proc. of CVPR'09, pages 951–958, June 2009.
[19] B. Mahasseni and S. Todorovic. Latent multitask learning for view-invariant action recognition. In Proc. of ICCV'13, pages 3128–3135, Dec. 2013.
[20] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In Proc. of CVPR'12, 2012.
[21] W. Scheirer, N. Kumar, P. Belhumeur, and T. Boult. Multi-attribute spaces: Calibration for attribute fusion and similarity search. In Proc. of CVPR'12, pages 2933–2940, June 2012.
[22] B. Siddiquie, R. Feris, and L. Davis. Image ranking and retrieval based on multi-attribute queries. In Proc. of CVPR'11, pages 801–808, June 2011.
[23] F. Song, X. Tan, and S. Chen. Exploiting relationship between attributes for improved face verification. In Proc. of BMVC'12, pages 27.1–27.11, 2012.
[24] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
[25] Y. Wang and G. Mori. A discriminative latent model of object classes and attributes. In Proc. of ECCV'10, pages 155–168, 2010.
[26] X.-T. Yuan and S. Yan. Visual classification with multi-task joint sparse representation. In Proc. of CVPR'10, pages 3493–3500, June 2010.
[27] H. Zha, X. He, C. Ding, M. Gu, and H. D. Simon. Spectral relaxation for k-means clustering. In Proc. of NIPS'02, pages 1057–1064, 2002.
[28] Y. Zhang, D.-Y. Yeung, and Q. Xu. Probabilistic multi-task feature selection. In Proc. of NIPS'10, pages 2559–2567, 2010.
[29] J. Zhou, J. Chen, and J. Ye. Clustered multi-task learning via alternating structure optimization. In Proc. of NIPS'11, pages 702–710, 2011.
