
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 4, AUGUST 2012 1021

Web Image Annotation Via Subspace-Sparsity Collaborated Feature Selection

Zhigang Ma, Feiping Nie, Yi Yang, Jasper R. R. Uijlings, and Nicu Sebe, Senior Member, IEEE

Abstract—The number of web images has been explosively growing due to the development of network and storage technology. These images make up a large amount of current multimedia data and are closely related to our daily life. To efficiently browse, retrieve and organize the web images, numerous approaches have been proposed. Since the semantic concepts of the images can be indicated by label information, automatic image annotation becomes one effective technique for image management tasks. Most existing annotation methods use image features that are often noisy and redundant. Hence, feature selection can be exploited for a more precise and compact representation of the images, thus improving the annotation performance. In this paper, we propose a novel feature selection method and apply it to automatic image annotation. There are two appealing properties of our method. First, it can jointly select the most relevant features from all the data points by using a sparsity-based model. Second, it can uncover the shared subspace of original features, which is beneficial for multi-label learning. To solve the objective function of our method, we propose an efficient iterative algorithm. Extensive experiments are performed on large image databases that are collected from the web. The experimental results together with the theoretical analysis have validated the effectiveness of our method for feature selection, thus demonstrating its feasibility of being applied to web image annotation.

Index Terms—Image annotation, shared subspace uncovering, sparse feature selection, supervised learning.

I. INTRODUCTION

AS DIGITAL cameras become very common gadgets in our daily life, we have witnessed an explosive growth of digital images. On the other hand, the popularity of many social networks such as Facebook and Flickr helps boost the sharing of these personal images on the web. In fact, digital images now take up a very large proportion of multimedia contents in the network and are utilized intensively for different purposes. However, it is not straightforward to effectively organize and access these web images because we are facing an overwhelmingly large amount of them. Aiming to manage the images efficiently, automatic image annotation has been proposed as an important technique in multimedia analysis. The key idea of image annotation is to correlate keywords or detailed text descriptions with images to facilitate image indexing, retrieval, organization and management.

Manuscript received June 24, 2011; revised December 15, 2011; accepted January 18, 2012. Date of publication February 06, 2012; date of current version July 13, 2012. The work of Z. Ma, J. Uijlings, and N. Sebe was supported in part by the European Commission under the contract FP7-248984 GLOCAL. The work of F. Nie was supported in part by the National Basic Research Program of China (2012CB316400). The work of Y. Yang was supported in part by the National Science Foundation under Grant IIS-0917072 and Grant CNS-0751185. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Qi Tian. Z. Ma, J. Uijlings, and N. Sebe are with the Department of Information Engineering and Computer Science, University of Trento, 38123 Trento, Italy (e-mail: [email protected]; [email protected]; [email protected]). F. Nie is with the Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019 USA (e-mail: [email protected]). Y. Yang is with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2012.2187179

1520-9210/$31.00 © 2012 IEEE

The sheer amount of web images itself provides us with a free and rich image repository for research. Researchers have been developing many automatic image annotation methods by leveraging web-scale databases such as Flickr, which consist of a large number of user-generated images annotated with user-defined tags [1]. Appearance-based annotation, which is one popular approach, is generally realized through two processes, namely searching and mining. Similar images to the unannotated images are first found in the web-scale databases through the searching process, and then the mining process extracts annotations from the textual information of these retrieved similar images. Research using this approach has demonstrated promising performance for automatic image annotation [2], [3]. Appearance-based image annotation has its effectiveness, but a major problem is that it can be negatively affected when user-generated tags do not reflect the concepts precisely. Learning-based automatic annotation is another effective approach and has gained much research interest. This approach depends on a certain amount of available annotated images as the training data to learn classifiers for image annotation. Many algorithms using the learning-based approach have been proposed in recent years, with varying degrees of success for multimedia semantic analysis [4]–[8]. Therefore, this paper focuses on exploiting learning-based methods for image annotation.

Images are normally represented by multiple features, which can be quite different from each other [9]. As it is inevitable to bring in irrelevant and/or redundant information in the feature representation, feature selection can be used to preprocess the data to facilitate the subsequent image annotation task [11]. Hence, it is of great value to propose effective feature selection methods. Existing feature selection algorithms work by different means. For instance, classical feature selection algorithms such as Fisher Score [12] compute the weights of different features, rank them accordingly and then select features one by one. These classical algorithms generally evaluate the importance of each feature individually and neglect the useful information in the correlation between different features. To overcome the disadvantage of selecting features individually, researchers have proposed another approach which selects features jointly across all data points by taking into account the relationship between different features [11], [13]. These methods have shown promising performance in different applications. In this paper we propose a feature selection technique which builds upon the latest mathematical advances in sparse, joint feature selection and apply it to automatic image annotation.

Image annotation is basically a classification problem. However, most web images are multi-labeled, that is to say, an image can reflect several semantic concepts. This intrinsic characteristic of web images makes it a complicated problem to classify them. A simple way to annotate multi-label images is to transform the problem into a set of binary classification problems, one for each concept. Though it is easy to implement, this approach neglects the correlation between different concept labels, which is potentially useful. Therefore, many recent works [15] have proposed to exploit shared subspace learning for multi-label tasks by incorporating the relational information of concept labels into multi-label learning. Inspired by their success, we apply shared subspace learning to the problem of feature selection.

To summarize, we combine the latest advances in joint, sparse feature selection with multi-label learning to create a novel feature selection technique which uncovers a feature subspace that is shared among classes. We name our method Sub-Feature Uncovering with Sparsity and demonstrate its effectiveness for automatic web image annotation. The main contributions of our work are:

• our method leverages the prominent joint feature selection with sparsity, which can select the most discriminative features by exploiting the whole feature space;

• our method considers the correlation between different concept labels to facilitate the feature selection;

• we conduct several experiments on large scale databases collected from the web. The results demonstrate the effectiveness of utilizing sparse feature selection and label correlation simultaneously.
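To make the contrast above concrete, the classical one-by-one ranking procedure (of which Fisher Score is one example) can be sketched as follows. This is a toy illustration using the common between-class to within-class variance ratio; the exact formula in [12] may differ in detail, and all variable names are illustrative.

```python
import numpy as np

def fisher_scores(X, y):
    """Score each feature independently: ratio of between-class
    variance to within-class variance, computed per feature."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        nc = len(Xc)
        num += nc * (Xc.mean(axis=0) - mean_all) ** 2
        den += nc * Xc.var(axis=0)
    return num / (den + 1e-12)

# toy data: feature 0 separates the two classes, feature 1 is noise
X = np.array([[0.0, 1.0], [0.1, -1.0], [1.0, 1.0], [1.1, -1.0]])
y = np.array([0, 0, 1, 1])
scores = fisher_scores(X, y)
ranking = np.argsort(scores)[::-1]  # best-scoring feature first
```

Note that each feature is scored in isolation: the procedure then keeps the top-ranked features one by one, which is exactly the per-feature evaluation the joint approach is designed to overcome.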

This paper is organized as follows. We briefly introduce the state of the art on shared feature subspace uncovering, feature selection and automatic image annotation in Section II. Then we elaborate the formulation of our method, followed by the proposed solution, in Section III. We conduct extensive experiments in Section IV to verify the advantage of our method for web image annotation. The conclusion is drawn in Section V.

II. RELATED WORK

Our work is geared towards better image annotation performance by exploiting effective feature selection. In this section, we briefly review the three topics related to our work, i.e., shared feature subspace uncovering, feature selection and automatic image annotation.

A. Shared Feature Subspace Uncovering

Let $x \in \mathbb{R}^{d}$ be a datum represented by a feature vector. The general goal of supervised learning is to predict for the input $x$ an output $y$. To achieve this objective, learning algorithms usually use training data $\{x_i, y_i\}_{i=1}^{n}$ to learn a prediction function $f$ that can correlate $x_i$ with $y_i$. A common approach to obtain $f$ is to minimize the following regularized empirical error:

$$\min_{f} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \gamma\,\Omega(f) \qquad (1)$$

where $\ell(\cdot,\cdot)$ is the loss function and $\Omega(f)$ is the regularization with $\gamma$ as its parameter.

It is reasonable to assume that multi-label images share certain common attributes. For example, a picture related to "parade," "people" and "street" shares the component "people" with another picture related to "party" and "people." Intuitively, we can leverage such label correlations for image annotation. In multi-label learning problems, Ando et al. assume that there is a shared subspace for the original feature space [17]. The concepts of an image are predicted by its vector representation in the original feature space together with its embedding in the shared subspace, which can be generalized as follows:

$$f(x) = w^{T}x + v^{T}Q^{T}x \qquad (2)$$

where $w$ and $v$ are the weight vectors and $Q$ spans a common subspace shared by all the features.

Suppose the images are related to $c$ concepts in multi-label learning and there are $n_t$ training data $\{x_i^t, y_i^t\}_{i=1}^{n_t}$ belonging to the $t$-th concept, labeled as $y_i^t$. Then (1) can be redefined as

$$\min_{\{w_t, v_t\},\, Q} \; \sum_{t=1}^{c} \Big( \sum_{i=1}^{n_t} \ell\big(f_t(x_i^t),\, y_i^t\big) + \gamma\,\Omega(f_t) \Big) \quad \text{s.t. } Q^{T}Q = I \qquad (3)$$

Note that the constraint $Q^{T}Q = I$ in (3) is imposed to make the problem tractable.

By incorporating the shared feature subspace uncovering of (2) into (3), we get

$$\min_{\{w_t, v_t\},\, Q^{T}Q = I} \; \sum_{t=1}^{c} \Big( \sum_{i=1}^{n_t} \ell\big(w_t^{T}x_i^t + v_t^{T}Q^{T}x_i^t,\; y_i^t\big) + \gamma\,\Omega(w_t, v_t) \Big) \qquad (4)$$

Shared feature subspace learning has received increasing attention for its effectiveness on multi-label data [15]. Its theory has also been applied in multimedia analysis and has proved its advantage. For instance, Amores et al. have leveraged the idea of sharing features across multiple classes for object-class recognition and achieved prominent performance [18]. As a result, we adopt shared feature subspace uncovering in our feature selection framework and build our mathematical formulation on (4).
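The prediction rule of (2), together with the orthonormality constraint of (3), can be sketched numerically as follows. The names $W$, $V$, $Q$ follow the formulation above; the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, c = 6, 2, 3   # feature dim, shared-subspace dim, number of concepts

# orthonormal shared subspace, Q^T Q = I, obtained via QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
W = rng.standard_normal((d, c))   # per-concept weights w_t as columns
V = rng.standard_normal((r, c))   # per-concept weights v_t in the subspace

x = rng.standard_normal(d)
# f_t(x) = w_t^T x + v_t^T (Q^T x), evaluated for all c concepts at once
scores = W.T @ x + V.T @ (Q.T @ x)
```

Each concept's score mixes the raw feature representation with the low-dimensional embedding $Q^{T}x$ that all concepts share, which is what lets correlated labels reinforce one another.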

B. Feature Selection

Feature selection is widely adopted in many multimedia analysis applications. Its principle is to select the most discriminating features from the original ones while simultaneously eliminating the noise, thus resulting in better performance in practice. Another advantage of feature selection is that it reduces the dimensionality of the original data, which in turn reduces the computational cost of the classification.

According to the availability of label information, feature selection algorithms can be classified into two groups: supervised and unsupervised. Unsupervised feature selection [19]–[21] is used when there is no label information. An effective way of unsupervised feature selection is to use the manifold structure of the whole feature set to select the most meaningful features [21].

MA et al.: WEB IMAGE ANNOTATION VIA SUBSPACE-SPARSITY COLLABORATED FEATURE SELECTION 1023

In contrast, supervised feature selection is preferable when there is available label information that can be leveraged by using the correlation between features and labels. In the literature, plenty of supervised feature selection methods have been proposed. For example, Fisher Score [12] and ReliefF [22] are traditional supervised feature selection methods and are exploited widely in multimedia analysis. However, traditional feature selection usually neglects the correlation among different features [21]. Therefore, another approach has been developed recently, namely sparsity-based feature selection [13], [23], which can exploit the feature correlation. This approach is built upon the observation that much real-world data can be sparsely represented, thus making it possible to search for a sparse representation of the data to realize feature selection. The $\ell_{2,1}$-norm regularization is known to be an effective model for sparse feature selection [24] and has drawn increasing attention [13], [16]. The $\ell_{2,1}$-norm of an arbitrary matrix $M \in \mathbb{R}^{d \times c}$ is defined as

$$\|M\|_{2,1} = \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{c} M_{ij}^{2}} \qquad (5)$$

In [13] and [16], the $\ell_{2,1}$-norm is leveraged to conduct feature selection jointly across the entire feature space with promising performance. Their works demonstrate that penalizing the $\ell_{2,1}$-norm of a projection matrix makes it row-sparse, meaning that some of its rows shrink to zero. Consequently, the projection matrix can be viewed as the combination coefficients for the most discriminative features. Feature selection is then realized by keeping only the features associated with the nonzero rows. Sparsity-based feature selection is efficient as it can select discriminative features jointly across all data points. However, few works have incorporated sparsity-based feature selection and shared feature subspace uncovering into one joint framework.
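The $\ell_{2,1}$-norm of (5) and the resulting row-wise selection rule can be sketched directly. The projection matrix below is a hand-made illustration of the row sparsity described above.

```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1}: sum of the Euclidean norms of the rows of M."""
    return np.sqrt((M ** 2).sum(axis=1)).sum()

# a row-sparse projection matrix: all-zero rows correspond to
# features that the joint model has effectively discarded
W = np.array([[0.8, -0.6],
              [0.0,  0.0],   # feature 1: discarded
              [0.3,  0.4],
              [0.0,  0.0]])  # feature 3: discarded

row_norms = np.linalg.norm(W, axis=1)
selected = np.flatnonzero(row_norms > 1e-8)  # features kept jointly
```

Because the norm couples all columns (i.e., all labels) within each row, a feature is kept or dropped for every label at once, which is what "joint selection across all data points" amounts to in practice.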

C. Automatic Image Annotation

Image annotation can be viewed as a classification task. It aims to correlate concept labels with specific images by classifying images into different classes. The ultimate goal is that the predicted labels produced by annotation algorithms precisely reflect the real semantic contents of images. Nonetheless, the web image resources are countless, so it is infeasible to annotate all of them manually. Hence, automatic image annotation becomes an essential tool for handling web-scale images for retrieval, indexing and other management tasks.

Existing automatic image annotation methods have utilized a plethora of techniques [1], [3], [4], [10], [25]. Since images are usually represented by different features, much work [10], [11], [17] has focused on optimizing the feature selection process in their annotation frameworks. By finding the discriminative subset of original features and eliminating the noise, feature selection can help improve image annotation performance. For instance, Ma et al. have exploited a sparse selection model to select discriminative features that are closely related to image concepts for image annotation [17].

Thanks to the continuous effort made by researchers, we have witnessed great advances in automatic annotation for web images. However, the performance of automatic image annotation is not yet satisfactory, thus requiring more research work in this domain. Inspired by recent advanced techniques for feature selection and shared feature subspace uncovering, we propose a novel framework to extract the most discriminating features to boost image annotation performance.

III. THE PROPOSED FRAMEWORK

In this section, we first present the formulation of our Sub-Feature Uncovering with Sparsity (SFUS) framework. Then a detailed approach is presented to solve the objective problem.

A. Problem Formulation

Our method is rooted in the shared feature subspace uncovering given by (4). Denote the training data matrix as $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $x_i$ is the $i$-th datum and $n$ is the total number of training data. Let $Y = [y_1, \ldots, y_n]^{T} \in \mathbb{R}^{n \times c}$ be the label matrix, where $c$ stands for the class number and $y_i$ is the label vector over the $c$ classes. Denote $W = [w_1, \ldots, w_c] \in \mathbb{R}^{d \times c}$ and $V = [v_1, \ldots, v_c] \in \mathbb{R}^{r \times c}$, where $r$ is the dimension of the shared subspace. We can then present (4) in a more compact way as

(6)

By defining a combined projection from these weight matrices, the above function equivalently becomes

(7)

It can be seen from the above function that by applying a different loss function and regularization, we can realize shared feature subspace uncovering in different ways. The least square loss, which has been widely used in research, can be written as $\|X^{T}W - Y\|_{F}^{2}$, where $\|\cdot\|_{F}$ denotes the Frobenius norm of a matrix. Using the least square loss, Ji et al. [15] have proposed to achieve shared subspace learning in the following way:

(8)

In the above function, the regularization term has two parts: the first part regulates the information of each specific label and the second part controls the complexity of the objective function. This approach is mathematically tractable and can be easily implemented. However, there are two issues worthy of further consideration. First, the least square loss is sensitive to outliers, thus demanding a more robust loss function. Second, as we aim to conduct effective feature selection, it is advantageous to exert sparse feature selection models on the regularization term. In [13], Nie et al. have proved that $\ell_{2,1}$-norm based models can handle both of the aforementioned issues.

Page 4: Web Image Annotation Via Subspace-Sparsity Collaborated Feature Selection

1024 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 4, AUGUST 2012

We therefore propose the following objective function as our foundation to realize feature selection

(9)

The $\ell_{2,1}$-based loss function in our objective is robust to outliers, as indicated in [13]. At the same time, the $\ell_{2,1}$-norm in the regularization term guarantees that the projection matrix is row-sparse, achieving feature selection across all data points [13], [16].
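The robustness claim can be illustrated in isolation. This is a sketch assuming a loss of the form $\|X^{T}W - Y\|_{2,1}$ in the spirit of [13]; the variable names and toy data are illustrative, not the exact quantities of (9). Each sample contributes its residual norm un-squared, so an outlier is penalized linearly rather than quadratically.

```python
import numpy as np

def l21_loss(X, W, Y):
    """sum_i ||W^T x_i - y_i||_2: residual norms are not squared,
    so a single outlier contributes linearly, not quadratically."""
    R = X.T @ W - Y                     # one residual row per sample
    return np.linalg.norm(R, axis=1).sum()

def frobenius_loss(X, W, Y):
    """||X^T W - Y||_F^2: the least square loss, squaring residuals."""
    return np.linalg.norm(X.T @ W - Y) ** 2

X = np.eye(3)                           # 3 samples, 3 features (toy data)
W = np.zeros((3, 2))
Y = np.array([[0.1, 0.0], [0.1, 0.0], [10.0, 0.0]])  # last sample: outlier

# the outlier dominates the squared loss far more than the l21 loss
# (approximately 10.2 vs. 100.02 here)
print(l21_loss(X, W, Y), frobenius_loss(X, W, Y))
```

The single outlying sample accounts for nearly all of the squared loss but only a proportional share of the $\ell_{2,1}$ loss, which is why the latter is less sensitive to outliers.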

B. Solution

As can be seen in (9), our problem involves the $\ell_{2,1}$-norm, which is non-smooth and cannot be solved in closed form. As a result, we propose to solve it as follows. By introducing two diagonal matrices defined from the current solution, the objective in (9) is equivalent to

(10)

where the diagonal elements of the two matrices are determined by the rows of the current solution, following the reweighting technique of [13].

Note that for an arbitrary matrix, its $\ell_{2,1}$-norm can be expressed as a trace involving such a diagonal reweighting matrix. Thus, (10) becomes

(11)

By setting the derivative of (11) w.r.t to zero, we have

(12)

Substituting in (11) with (12) we have

(13)

Since , the problem becomes

(14)

By setting the derivative of (14) w.r.t to zero, we get

(15)

where , and.

Note that (14) can be rewritten as

(16)

By incorporating the result obtained in (15) into the above function, we have

(17)

The above problem is equivalent to the following:

(18)

Applying the Sherman–Morrison–Woodbury formula to the matrix inverse above, (18) becomes

(19)

which is equivalent to

(20)


Since $\operatorname{Tr}(ABC) = \operatorname{Tr}(CAB)$ for arbitrary matrices $A$, $B$ and $C$, the above function becomes

(21)

where and . Equation (21) can be easily solved by eigen-decomposition. However, as this step requires inputs that are themselves related to the solution, it is still not straightforward to obtain the optimum directly. To solve this problem, we propose an iterative approach demonstrated in Algorithm 1. The complexity of the proposed algorithm is briefly discussed as follows. The complexity of calculating the inverse of a few matrices is cubic in the matrix dimension. To obtain the shared subspace, we need to conduct the eigen-decomposition in (21), which is also of cubic complexity.

Algorithm 1: The algorithm for solving the SFUS objective function.

Input:

The training data ;

The training data labels ;

Parameters and .

Output:

Optimized .

1: Set the iteration counter to zero and initialize the unknowns randomly;

2: repeat

Compute ;

Compute the diagonal matrix as

Compute the diagonal matrix as

Compute ;

Compute ;

Compute ;

Obtain by the eigen-decomposition of ;

Update according to (15);

.

until Convergence;

3: Return .

The proposed iterative approach in Algorithm 1 can be verified to converge to the optimum by the following theorem.

Theorem 1: The objective function value shown in (9) monotonically decreases in each iteration until convergence using the iterative approach in Algorithm 1.

Proof: See Appendix A.
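The reweighting strategy underlying Algorithm 1 follows the $\ell_{2,1}$ technique of [13]. Since the exact SFUS update quantities are not reproduced here, the sketch below applies the same alternation to the simpler problem $\min_{W} \|X^{T}W - Y\|_{F}^{2} + \alpha \|W\|_{2,1}$: a diagonal matrix $D$ with $D_{ii} = 1/(2\|w^{i}\|_{2})$ is recomputed from the current $W$, then $W$ is re-solved in closed form. All names are illustrative; this is not the full SFUS iteration.

```python
import numpy as np

def l21_regularized_ls(X, Y, alpha=0.1, iters=50, eps=1e-8):
    """Iteratively reweighted solver for
    min_W ||X^T W - Y||_F^2 + alpha * ||W||_{2,1}."""
    d = X.shape[0]
    W = np.linalg.solve(X @ X.T + alpha * np.eye(d), X @ Y)  # ridge init
    for _ in range(iters):
        # D_ii = 1 / (2 ||w^i||_2), smoothed to avoid division by zero
        D = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps))
        # closed-form update: (X X^T + alpha D) W = X Y
        W = np.linalg.solve(X @ X.T + alpha * D, X @ Y)
    return W

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 40))        # d=5 features, n=40 samples
W_true = np.zeros((5, 2))
W_true[0, 0] = 1.0
W_true[2, 1] = -1.0                     # only features 0 and 2 matter
Y = X.T @ W_true + 0.01 * rng.standard_normal((40, 2))

W = l21_regularized_ls(X, Y, alpha=0.5)
row_norms = np.linalg.norm(W, axis=1)   # rows 0 and 2 should dominate
```

As a row of $W$ shrinks, its weight in $D$ grows, pushing the row further towards zero; this is the mechanism by which the diagonal matrices in Algorithm 1 produce row-sparse solutions, and the monotone decrease of the objective under this alternation is what Theorem 1 asserts for the full SFUS problem.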

IV. EXPERIMENTS

To validate the efficacy of our method when applied to automatic image annotation, we conduct several experiments, particularly on image databases collected from web image resources.

A. Compared Methods

We compare our method with one baseline and several feature selection algorithms on automatic image annotation to understand how our method progresses towards better annotation performance. The compared methods are enumerated as follows.

• Using all features (All-Fea): our baseline. It means that we use the original data without feature selection for annotation.

• Fisher score (F-score) [12]: a classical method. It selects the most discriminative features by evaluating the importance of each feature individually.

• Sparse multinomial logistic regression via Bayesian L1 regularization (SBMLR) [14]: a sparsity-based state-of-the-art method. It realizes sparse feature selection by using a Laplace prior.

• Spectral feature selection (SPEC) [26]: a state-of-the-art method using spectral regression. It selects features one by one by leveraging work from spectral graph theory. The supervised implementation is used in our experiments for a fair comparison.

• Group lasso with logistic regression (GLRR) [11]: a recently proposed method based on a sparse model. It utilizes group lasso extended with logistic regression to select both sparse and discriminative groups of homogeneous features.

• Feature selection via joint $\ell_{2,1}$-norms minimization (FSNM) [13]: a recent sparse feature selection algorithm. It employs joint $\ell_{2,1}$-norm minimization on both the loss function and the regularization for joint feature selection.

As our framework is built upon regularized least square regression, we use it as the classifier for all the compared approaches.
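The regularized least squares classifier used for all compared approaches can be sketched as follows. This is a minimal version under the assumption that images are stored as columns of $X$; the symbols and toy dimensions are illustrative.

```python
import numpy as np

def rls_train(X, Y, lam=1.0):
    """Regularized least squares: W = (X X^T + lam I)^{-1} X Y,
    for X holding one d-dimensional training image per column."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y)

def rls_predict(W, X_test):
    """One score per concept label for each test image."""
    return X_test.T @ W

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 30))               # 30 images, 10-D features
Y = (rng.random((30, 4)) > 0.5).astype(float)   # 4 binary concept labels
W = rls_train(X, Y, lam=0.5)
scores = rls_predict(W, X)                      # (30, 4) concept scores
```

Using one shared classifier isolates the comparison to the feature selection step itself: each method only changes which rows of $X$ (i.e., which features) are fed to this regression.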

B. Image Databases

Web images cover almost all the concepts people are interested in, thus justifying their use as a research corpus for automatic image annotation. For the sake of studies on multimedia analysis, researchers have also managed to collect and process web images to create good image databases for experimental purposes.

In our experiments, we select two large scale databases which are both made up of web images. The first one is the MSRA-MM 2.0 database, which was created by Microsoft Research Asia [27]. This database was collected from the web through a commercial search engine and consists of 50 000 images belonging to 100 concepts. However, 7734 images of the original database are not associated with any labels, so we have removed these images and obtained a subset of 42 266 labeled images. In 2009, the Lab for Media Search at the National University of Singapore proposed another large scale image database, i.e., NUS-WIDE, where all images are from Flickr [28]. NUS-WIDE includes 269 000 real-world images. The very large size of NUS-WIDE, from our perspective, can well validate the scalability of our framework for real-world annotation tasks. Hence, we choose this database in our experiments as well. Nonetheless, 59 653 images within NUS-WIDE are unlabeled; we therefore have removed them and used the remaining 209 347 labeled images related to 81 concepts as the experimental corpus.

Considering computational efficiency, we combine three feature types, i.e., color correlogram, edge-direction histogram, and wavelet texture, provided by the authors, to represent the images of the two databases. As a consequence, the corresponding feature dimensions for MSRA-MM 2.0 and NUS-WIDE are 347 and 345, respectively [27], [28].

TABLE I: PERFORMANCE COMPARISON (± STANDARD DEVIATION) WHEN IMAGES WORK AS TRAINING DATA

TABLE II: PERFORMANCE COMPARISON (± STANDARD DEVIATION) WHEN IMAGES WORK AS TRAINING DATA
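Combining the feature types amounts to concatenating the per-image vectors. The individual dimensionalities below are assumptions for illustration; only the 347-dimensional total for MSRA-MM 2.0 is stated in the text.

```python
import numpy as np

# assumed per-type dimensionalities; only the 347 total is given above
color_correlogram = np.zeros(144)
edge_direction_hist = np.zeros(75)
wavelet_texture = np.zeros(128)

# one combined descriptor per image: 144 + 75 + 128 = 347 dimensions
feature = np.concatenate(
    [color_correlogram, edge_direction_hist, wavelet_texture]
)
```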

C. Experiment Setup

The procedure of our experiments can be generalized as fol-lows. We first randomly generate a training set comprised of

images for each database similarly to the experimentalsetting in [29]. The remaining images are used as testing sets.To understand the performance variation w.r.t the number oftraining data, we set as 10 and 20, respectively, and reportthe corresponding results. We generate the training and testingsets for five times and report the average results for fair com-parison with other methods.Note that our objective function in (9) involves

two parameters and . We tune both of them fromand report the best results.

The number of selected features ranges over {100, 150, 200, 250, 300}, and we use the corresponding feature subset to represent the images. Regularized least-squares regression is then applied as the classifier for image annotation.

To evaluate the annotation performance, we use three evaluation metrics, i.e., mean average precision (MAP), MicroAUC, and MacroAUC, which are all widely used for multi-label classification tasks [11], [30]–[32].
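As a sketch of this evaluation pipeline, the following combines a regularized least-squares classifier with self-contained implementations of the three metrics (MAP as the label-wise mean of average precision, AUC via the rank-sum formula). All names are ours; this is an illustration, not the exact SFUS evaluation code, and ties in the scores are broken arbitrarily:

```python
import numpy as np

def ridge_train(X, Y, lam=1.0):
    """Regularized least squares: W = argmin ||XW - Y||_F^2 + lam*||W||_F^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def auc(y, s):
    """ROC AUC for one binary label via the Mann-Whitney rank-sum formula."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n_pos, n_neg = int(y.sum()), int((1 - y).sum())
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(y, s):
    """AP for one label: precision@k averaged over the positive hits."""
    hits = y[np.argsort(-s)]
    prec_at_k = np.cumsum(hits) / np.arange(1, len(y) + 1)
    return (prec_at_k * hits).sum() / hits.sum()

def evaluate(Y, S):
    """Return (MAP, MicroAUC on the flattened matrix, MacroAUC)."""
    labels = range(Y.shape[1])
    return (np.mean([average_precision(Y[:, j], S[:, j]) for j in labels]),
            auc(Y.ravel(), S.ravel()),
            np.mean([auc(Y[:, j], S[:, j]) for j in labels]))
```

For annotation, the real-valued scores `X @ W` are ranked per label; thresholding them yields the final label assignment.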

D. Performance on Image Annotation

Table I and Table II show the annotation results for the two training-set sizes, respectively. The results in bold indicate the best performance under the corresponding evaluation metric. According to the annotation results, we observe that our method demonstrates consistently better performance on both databases.

Take MAP as an example. First, our method is better than All-Fea, i.e., annotation without feature selection, on both data sets. In particular, SFUS obtains a notable improvement over All-Fea on NUS-WIDE. Second, our method achieves better annotation performance than the compared feature selection methods. With the smaller training set, SFUS outperforms the second best feature selection method by about 2.6%, and it is better than the other feature selection algorithms on both data sets; with the larger training set, SFUS is better than the second best feature selection method by about 1.6% and 3% on MSRA-MM 2.0 and NUS-WIDE, respectively, and it shows a clear advantage over the other algorithms. Hence, we conclude that our algorithm is a good feature selection mechanism for web image annotation.

The good performance of SFUS for image annotation can be attributed to the appealing property that it selects features jointly across the whole feature space while simultaneously considering the correlation of multiple labels by exploring the shared feature subspace. The combination of the sparse model and shared subspace uncovering facilitates feature selection by finding the most discriminative features, which are subsequently used in the annotation process.
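Joint selection across the whole feature space can be illustrated as follows: once a row-sparse projection matrix has been learned, each feature is scored by the ℓ2-norm of its row (the ℓ2,1 structure drives whole rows toward zero), and the top-ranked features are kept. This is a generic sketch of ℓ2,1-based ranking, not the exact SFUS pipeline:

```python
import numpy as np

def select_features(W, k):
    """Score feature i by ||w_i||_2, the l2-norm of row i of the learned
    projection W (shape: n_features x n_labels), and return the indices
    of the k highest-scoring features."""
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(-scores)[:k]
```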

E. Influence of Feature Type

To further evaluate the effectiveness of our method, we use a different original feature set: only color correlogram and wavelet texture are combined to represent the images, and we present the corresponding annotation results. The experiment is conducted on the MSRA-MM dataset, with the results shown in Table III.

It can be seen that our method still outperforms the other feature selection algorithms when the images are represented by color correlogram and wavelet texture. The results demonstrate that our algorithm is robust to variations of the original feature set.


MA et al.: WEB IMAGE ANNOTATION VIA SUBSPACE-SPARSITY COLLABORATED FEATURE SELECTION 1027

TABLE III
PERFORMANCE COMPARISON (± STANDARD DEVIATION) USING COLOR CORRELOGRAM AND WAVELET TEXTURE ON MSRA-MM WHEN TRAINING DATA ARE LABELED

Fig. 1. Performance variation w.r.t. the number of selected features using our feature selection algorithm. (a) MSRA-MM. (b) NUS-WIDE.

F. Influence of Selected Features

As feature selection aims at both accuracy and computational efficiency, we perform an experiment to study how the number of selected features affects the annotation performance. This experiment illustrates the general trade-off between performance and computational efficiency on the two image databases.

Fig. 1 shows the performance variation w.r.t. the number of selected features in terms of MAP. We have the following observations. 1) When the number of selected features is too small, MAP is not competitive with using all features for annotation, which can be attributed to too much information loss; for instance, when using fewer than 150 features of MSRA-MM 2.0, MAP is worse than using all features. 2) MAP increases as the number of selected features increases up to 200. 3) MAP reaches its peak when 200 features are used. 4) MAP remains stable from 200 to 300 features for MSRA-MM 2.0, while it drops for NUS-WIDE; this different behavior is presumably related to the properties of the two datasets. 5) When all features are used, in other words, without feature selection, MAP is lower than when selecting 200 features for MSRA-MM 2.0 or 100 features for NUS-WIDE. Since MAP improves on both databases, we conclude that our method reduces feature noise.

Fig. 2. Performance variation w.r.t. the two regularization parameters when the number of selected features is fixed at 200 for annotation. The figure shows the annotation results for different combinations of the two parameter values. (a) MAP-MSRA. (b) MAP-NUS.

G. Parameter Sensitivity Study

Our method involves two regularization parameters in (9). To learn how they affect the feature selection and, consequently, the performance on image annotation, we conduct a parameter sensitivity experiment. Following the above experiment, we use the same training data setting for image annotation, and MAP is used to reflect the performance variation.

Fig. 2 shows the MAP variation w.r.t. the two parameters on the two databases. We notice that the annotation performance changes for different combinations of the parameter values. The impact of the regularization parameters is presumably related to the characteristics of the database. On our experimental datasets, better results are generally obtained when the two parameters are comparable in value.

Fig. 3. Convergence curves of the objective function value in (9) using Algorithm 1. The figure shows that the objective function value monotonically decreases until convergence by applying the proposed algorithm. (a) MSRA-MM. (b) NUS-WIDE.

H. Convergence Study

As mentioned before, the proposed iterative approach monotonically decreases the objective function value in (9) until convergence. We conduct an experiment to validate this claim and to understand how the iterative approach behaves. Following the above experiments, we use the same training data setting. The two parameters are both fixed at 1, the median of the range over which the parameters are tuned.

Fig. 3 shows the convergence curves of our algorithm in terms of the objective function value in (9). It can be observed that the objective function value converges quickly. We also measured the convergence time, which is 17.6 and 10.9 seconds for MSRA-MM 2.0 and NUS-WIDE, respectively, on a personal PC with an Intel Core 2 Quad 2.83-GHz CPU. The convergence experiment demonstrates the efficiency of our algorithm.
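A minimal sketch of the kind of iterative scheme involved is given below, written for the plain ℓ2,1-regularized least-squares problem of [13] rather than our full objective (9), which additionally uncovers a shared subspace; the monotone decrease of the objective can be checked numerically. The `eps` guard against zero rows is our own numerical safeguard:

```python
import numpy as np

def l21_least_squares(X, Y, gamma=1.0, n_iter=30, eps=1e-8):
    """Iteratively reweighted solver for
        min_W ||X W - Y||_F^2 + gamma * ||W||_{2,1},
    following the reweighting scheme of [13]."""
    n, d = X.shape
    # Ridge solution as initialization.
    W = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ Y)
    objective = []
    for _ in range(n_iter):
        row_norms = np.maximum(np.linalg.norm(W, axis=1), eps)
        D = np.diag(0.5 / row_norms)          # D_ii = 1 / (2 ||w_i||_2)
        W = np.linalg.solve(X.T @ X + gamma * D, X.T @ Y)
        objective.append(np.linalg.norm(X @ W - Y, "fro") ** 2
                         + gamma * np.linalg.norm(W, axis=1).sum())
    return W, objective
```

On random data the recorded objective values decrease monotonically, mirroring the behavior observed in Fig. 3.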

V. CONCLUSION

In this paper, we have proposed a novel feature selection method and applied it to web image annotation. Our work integrates two state-of-the-art ideas, shared feature subspace uncovering and joint sparse feature selection, thus endowing our method with the following appealing properties. First, our method jointly selects the most discriminative features across the entire feature space. Additionally, our method considers the correlation between different labels, which has proved to be effective in multi-label learning tasks.

To validate the efficacy of our method for web image annotation, we conducted experiments on two popular databases of web images. The experimental results show that our method outperforms classical and state-of-the-art algorithms for image annotation. Therefore, we conclude that our method is a robust feature selection method whose feature subspace sharing foundation makes it particularly suitable for web images, which are usually multi-labeled.

APPENDIX A
PROOF OF THEOREM 1

Proof: According to Algorithm 1, it can be inferred from (11) that

Therefore, we have



It has been shown in [13], [16] that for any nonzero vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{c}$,
$$\|\mathbf{u}\|_{2} - \frac{\|\mathbf{u}\|_{2}^{2}}{2\|\mathbf{v}\|_{2}} \leq \|\mathbf{v}\|_{2} - \frac{\|\mathbf{v}\|_{2}^{2}}{2\|\mathbf{v}\|_{2}},$$
where $c$ is an arbitrary number. Thus, we can easily get the following inequality:

which indicates that the objective function value of (9) monotonically decreases until converging to the optimum via the proposed approach in Algorithm 1.
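For completeness, the inequality cited from [13], [16] reduces to a completing-the-square argument; with generic nonzero vectors $\mathbf{u}$ and $\mathbf{v}$ (our notation):

```latex
\|\mathbf{u}\|_{2} - \frac{\|\mathbf{u}\|_{2}^{2}}{2\|\mathbf{v}\|_{2}}
\;\le\;
\|\mathbf{v}\|_{2} - \frac{\|\mathbf{v}\|_{2}^{2}}{2\|\mathbf{v}\|_{2}}
\;\Longleftrightarrow\;
2\,\|\mathbf{u}\|_{2}\|\mathbf{v}\|_{2} - \|\mathbf{u}\|_{2}^{2} \le \|\mathbf{v}\|_{2}^{2}
\;\Longleftrightarrow\;
0 \le \bigl(\|\mathbf{u}\|_{2} - \|\mathbf{v}\|_{2}\bigr)^{2},
```

and the last statement always holds.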

ACKNOWLEDGMENT

Any opinions, findings, and conclusions or recommenda-tions expressed in this material are those of the author(s) anddo not necessarily reflect the views of the National ScienceFoundation.

REFERENCES

[1] A. Ulges, M. Worring, and T. Breuel, “Learning visual contexts for image annotation from Flickr groups,” IEEE Trans. Multimedia, vol. 13, no. 2, pp. 330–341, Apr. 2011.

[2] X. Wang, L. Zhang, X. Li, and W. Ma, “Annotating images by mining image search results,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1919–1932, Nov. 2008.

[3] B. Russell, A. Torralba, K. Murphy, and W. Freeman, “LabelMe: A database and web-based tool for image annotation,” Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 157–173, 2008.

[4] Y. Lu and Q. Tian, “Discriminant subspace analysis: An adaptive approach for image classification,” IEEE Trans. Multimedia, vol. 11, no. 7, pp. 1289–1300, Nov. 2009.

[5] Y. Yang, Y. Zhuang, F. Wu, and Y. Pan, “Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval,” IEEE Trans. Multimedia, vol. 10, no. 3, pp. 437–446, Apr. 2008.

[6] Y. Zhuang, Y. Yang, and F. Wu, “Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval,” IEEE Trans. Multimedia, vol. 10, no. 2, pp. 221–229, Feb. 2008.

[7] Z. Ma, Y. Yang, F. Nie, J. Uijlings, and N. Sebe, “Exploiting the entire feature space with sparsity for automatic image annotation,” in Proc. ACM MM, 2011, pp. 283–292.

[8] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, “A multimedia retrieval framework based on semi-supervised ranking and relevance feedback,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 723–742, Apr. 2012.

[9] Y. Yang, Y. Zhuang, D. Xu, Y. Pan, D. Tao, and S. Maybank, “Retrieval based interactive cartoon synthesis via unsupervised bi-distance metric learning,” in Proc. ACM MM, 2009, pp. 311–320.

[10] Y. Gao, J. Fan, X. Xue, and R. Jain, “Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers,” in Proc. ACM MM, 2006, pp. 901–910.

[11] F. Wu, Y. Yuan, and Y. Zhuang, “Heterogeneous feature selection by group lasso with logistic regression,” in Proc. ACM MM, 2010, pp. 983–986.

[12] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. New York: Wiley-Interscience, 2001.

[13] F. Nie, H. Huang, X. Cai, and C. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” in Proc. NIPS, 2010, pp. 1813–1821.

[14] G. Cawley, N. Talbot, and M. Girolami, “Sparse multinomial logistic regression via Bayesian L1 regularisation,” in Proc. NIPS, 2006, pp. 209–216.

[15] S. Ji, L. Tang, S. Yu, and J. Ye, “A shared-subspace learning framework for multi-label classification,” ACM Trans. Knowl. Discovery Data, vol. 2, no. 1, pp. 8:1–8:29, 2010.

[16] Y. Yang, H. Shen, Z. Ma, Z. Huang, and X. Zhou, “ℓ2,1-norm regularized discriminative feature selection for unsupervised learning,” in Proc. IJCAI, 2011, pp. 1589–1594.

[17] R. Ando and T. Zhang, “A framework for learning predictive structures from multiple tasks and unlabeled data,” J. Mach. Learning Res., vol. 6, pp. 1817–1853, 2005.

[18] J. Amores, N. Sebe, and P. Radeva, “Context-based object-class recognition and retrieval by generalized correlograms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, pp. 1818–1833, Oct. 2007.

[19] M. Law, M. Figueiredo, and A. Jain, “Simultaneous feature selection and clustering using mixture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1154–1166, Sep. 2004.

[20] H. Wei and S. Billings, “Feature subset selection and ranking for data dimensionality reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 162–166, Jan. 2007.

[21] D. Cai, C. Zhang, and X. He, “Unsupervised feature selection for multi-cluster data,” in Proc. ACM SIGKDD, 2010, pp. 333–342.

[22] I. Kononenko, “Estimating attributes: Analysis and extensions of RELIEF,” in Proc. ECML, 1994, pp. 171–182.

[23] B. Krishnapuram, A. Hartemink, L. Carin, and M. Figueiredo, “A Bayesian approach to joint feature selection and classifier design,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1105–1111, Sep. 2004.

[24] Z. Zhao, L. Wang, and H. Liu, “Efficient spectral feature selection with minimum redundancy,” presented at the AAAI, Atlanta, GA, 2010.

[25] G. Carneiro, A. Chan, P. Moreno, and N. Vasconcelos, “Supervised learning of semantic classes for image annotation and retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 3, pp. 394–410, Mar. 2007.

[26] Z. Zhao and H. Liu, “Spectral feature selection for supervised and unsupervised learning,” in Proc. ICML, 2007.

[27] H. Li, M. Wang, and X. Hua, “MSRA-MM 2.0: A large-scale web multimedia dataset,” in Proc. IEEE Int. Conf. Data Mining Workshops, 2009.

[28] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: A real-world web image database from National University of Singapore,” presented at the CIVR, Santorini, Greece, 2009.

[29] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang, “Learning with ℓ1-graph for image analysis,” IEEE Trans. Image Process., vol. 19, no. 4, pp. 858–866, Apr. 2010.

[30] S. Nowak, A. Llorente, E. Motta, and S. Rueger, “The effect of semantic relatedness measures on multi-label classification evaluation,” in Proc. CIVR, 2010, pp. 303–310.

[31] M. Wang, X. Hua, J. Tang, and R. Hong, “Beyond distance measurement: Constructing neighborhood similarity for video annotation,” IEEE Trans. Multimedia, vol. 11, no. 3, pp. 465–476, Apr. 2009.

[32] Y. Han, F. Wu, J. Jia, Y. Zhuang, and B. Yu, “Multi-task sparse discriminant analysis (MtSDA) with overlapping categories,” presented at the AAAI, Atlanta, GA, 2010.



Zhigang Ma received the B.S. and M.S. degrees from Zhejiang University, Hangzhou, China, in 2004 and 2006, respectively, and is currently working toward the Ph.D. degree at the University of Trento, Trento, Italy. His research interests include machine learning and its application to computer vision and multimedia analysis.

Feiping Nie received the B.S. degree in computer science from the North China University of Water Conservancy and Electric Power, Zhengzhou, China, in 2000, the M.S. degree in computer science from Lanzhou University, Lanzhou, China, in 2003, and the Ph.D. degree in computer science from Tsinghua University, Beijing, China, in 2009. Currently, he is a Research Assistant Professor at the University of Texas at Arlington. His research interests include machine learning and its application fields, such as pattern recognition, data mining, computer vision, image processing, and information retrieval.

Yi Yang received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 2010. He was a Postdoctoral Research Fellow at the University of Queensland from 2010 to May 2011. He is now a Postdoctoral Research Fellow at the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. His research interests include machine learning and its applications to multimedia content analysis and computer vision, e.g., multimedia indexing and retrieval, image annotation, and video semantics understanding.

Jasper R. R. Uijlings received the M.Sc. degree in artificial intelligence from the University of Amsterdam, Amsterdam, The Netherlands, in 2006, and the Ph.D. degree on the topic of object recognition in computer vision from the ISIS Lab, University of Amsterdam, in 2011. Currently, he is a Postdoctoral Research Fellow at the University of Trento, Trento, Italy. His research interests include computer vision, image retrieval, and statistical pattern recognition.

Nicu Sebe (M’01–SM’11) received the Ph.D. degree in computer science from Leiden University, Leiden, The Netherlands, in 2001. Currently, he is with the Department of Information Engineering and Computer Science, University of Trento, Italy, where he leads the research in the areas of multimedia information retrieval and human-computer interaction in computer vision applications. He has been a Visiting Professor at the Beckman Institute, University of Illinois at Urbana-Champaign, and in the Electrical Engineering Department, Darmstadt University of Technology, Darmstadt, Germany. He has been involved in the organization of the major conferences and workshops addressing the computer vision and human-centered aspects of multimedia information retrieval.

Dr. Sebe is a senior member of ACM. He was General Co-Chair of the IEEE Automatic Face and Gesture Recognition Conference (FG 2008), the ACM International Conference on Image and Video Retrieval (CIVR) 2007 and 2010, and WIAMIS 2009, and was one of the initiators and a Program Co-Chair of the Human-Centered Multimedia track of the ACM Multimedia 2007 Conference. He is the General Chair of ACM Multimedia 2013 and was a Program Chair of ACM Multimedia 2011. He is the Co-Chair of the IEEE Computer Society Task Force on Human-Centered Computing and an Associate Editor of Machine Vision and Applications, Image and Vision Computing, Electronic Imaging, and the Journal of Multimedia. He has served as a Guest Editor for several special issues in IEEE COMPUTER, Computer Vision and Image Understanding, Image and Vision Computing, Multimedia Systems, and ACM TOMCCAP.

