
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 5, MAY 2014 2019

Click Prediction for Web Image Reranking Using Multimodal Sparse Coding

Jun Yu, Member, IEEE, Yong Rui, Fellow, IEEE, and Dacheng Tao, Senior Member, IEEE

Abstract— Image reranking is effective for improving the performance of a text-based image search. However, existing reranking algorithms are limited for two main reasons: 1) the textual meta-data associated with images is often mismatched with their actual visual content and 2) the extracted visual features do not accurately describe the semantic similarities between images. Recently, user click information has been used in image reranking, because clicks have been shown to more accurately describe the relevance of retrieved images to search queries. However, a critical problem for click-based methods is the lack of click data, since only a small number of web images have actually been clicked on by users. Therefore, we aim to solve this problem by predicting image clicks. We propose a multimodal hypergraph learning-based sparse coding method for image click prediction, and apply the obtained click data to the reranking of images. We adopt a hypergraph to build a group of manifolds, which explore the complementarity of different features through a group of weights. Unlike a graph that has an edge between two vertices, a hyperedge in a hypergraph connects a set of vertices, and helps preserve the local smoothness of the constructed sparse codes. An alternating optimization procedure is then performed, and the weights of different modalities and the sparse codes are simultaneously obtained. Finally, a voting strategy is used to describe the predicted click as a binary event (click or no click), from the images' corresponding sparse codes. Thorough empirical studies on a large-scale database including nearly 330K images demonstrate the effectiveness of our approach for click prediction when compared with several other methods. Additional image reranking experiments on real-world data show the use of click prediction is beneficial to improving the performance of prominent graph-based image reranking algorithms.

Index Terms— Image reranking, click, manifolds, sparse codes.

I. INTRODUCTION

DUE to the tremendous number of images on the web, image search technology has become an active and challenging research topic. Well-recognized image search engines,

Manuscript received February 22, 2013; revised August 31, 2013 and February 20, 2014; accepted March 6, 2014. Date of publication March 11, 2014; date of current version March 31, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61100104, in part by ARC under Grants FT130101457 and DP-140102164, in part by the Program for New Century Excellent Talents in University under Grant NCET-12-0323, and in part by the Hong Kong Scholar Programme under Grant XJ2013038. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mark H.-Y. Liao.

J. Yu is with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China (e-mail: [email protected]).

Y. Rui is with Microsoft Research Asia, Beijing, China (e-mail: [email protected]).

D. Tao is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Ultimo, NSW 2007, Australia (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2014.2311377

such as Bing [1], Yahoo [2], and Google [3], usually use textual meta-data included in the surrounding text, titles, captions, and URLs, to index web images. Although the performance of text-based image retrieval for many searches is acceptable, the accuracy and efficiency of the retrieved results could still be improved significantly.

One major problem impacting performance is the mismatch between the actual content of images and the textual data on the web page [4]. One method used to solve this problem is image re-ranking, in which both textual and visual information are combined to return improved results to the user. The ranking of images based on a text-based search is considered a reasonable baseline, albeit with noise. Extracted visual information is then used to re-rank related images to the top of the list.

Most existing re-ranking methods use a tool known as pseudo-relevance feedback (PRF) [34], where a proportion of the top-ranked images are assumed to be relevant and subsequently used to build a model for re-ranking. This is in contrast to relevance feedback, where users explicitly provide feedback by labeling the top results as positive or negative. In the classification-based PRF method [35], the top-ranked images are regarded as pseudo-positive, and low-ranked images as pseudo-negative, examples used to train a classifier that then re-ranks the images. Hsu et al. [36] also adopt this pseudo-positive and pseudo-negative image method to develop a clustering-based re-ranking algorithm.

The problem with these methods is that the reliability of the obtained pseudo-positive and pseudo-negative images is not guaranteed. PRF has also been used in graph-based re-ranking [37] and Bayesian visual re-ranking [38]. In these methods, low-rank images are promoted by receiving reinforcement from related high-rank images. However, these methods are limited by the fact that irrelevant high-rank images are not demoted. Therefore, both explicit and implicit re-ranking methods suffer from the unreliability of the original ranking list, since the textual information cannot accurately describe the semantics of the queries.

Instead of related textual information, user clicks have recently been used as a more reliable measure of the relationship between the query and retrieved objects [5], [6], since clicks have been shown to more accurately reflect relevance [7]. Joachims et al. [39] conducted an eye-tracking experiment to observe the relationship between clicked links and the relevance of the target pages, while Shokouhi et al. [8] investigated the effect of reordering web search results based on click-through data on search effectiveness.

In the case of image searching, clicks have proven to be very reliable [7]; 84% of clicked images were relevant, compared

1057-7149 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


to 39% relevance of documents found using a general web search. Based on this fact, Jain et al. [9] proposed a method which utilizes clicks for query-dependent image searching. However, this method only takes clicks into consideration and neglects the visual features which might improve the retrieved image relevance to the query. In another study, Jain and Varma [10] proposed a Gaussian regression model which directly concatenates the clicks and various visual features into a long vector. Unfortunately, the diversity of multiple visual features was not taken into consideration. According to commercial search engine analysis reports, only 15% of web images are clicked by web users. This lack of clicks is a problem that makes effective click-based re-ranking challenging for both theoretical studies and real-world implementation. In order to solve this problem, we adopt sparse coding to predict click information for web images.

Sparse coding is a popular signal processing method and performs well in many applications, e.g. signal reconstruction [11], signal decomposition [12], and signal denoising [13]. Although orthogonal bases like Fourier or wavelets have been widely adopted, the latest trend is to adopt an overcomplete basis, in which the number of basis vectors is greater than the dimensionality of the input vector. A signal can be described by a set of overcomplete bases using a very small number of nonzero elements [18]. This yields high sparsity in the transform domain, and many applications need this compact representation of signals. In computer vision, signals are image features, and sparse coding is adopted as an efficient technique for feature reconstruction [14]–[16]. It has been widely used in many different applications, such as image classification [14], face recognition [15], image annotation [17], and image restoration [13].

In this paper, we formulate and solve the problem of click prediction through sparse coding. Based on a group of web images with associated clicks (known as a codebook), and a new image without any clicks, sparse coding is utilized to choose as few basis images as possible from the codebook in order to linearly reconstruct the new input image while minimizing reconstruction errors. A voting strategy is utilized to predict the click as a binary event (click or no click) from the sparse codes of the corresponding images. The overcomplete characteristic of the codebook guarantees the sparsity of the reconstruction coefficients.

However, in addition to sparsity, the overcompleteness of the codebook causes loss of locality in the features to be represented. This results in similar web images being described by totally different sparse codes, and unstable performance in image reconstruction; clicks are thus not predicted successfully. In order to address this issue, one feasible solution is to add an additional locality-preserving term to the formulation of sparse coding. Laplacian sparse coding (LSC) [19], in which a locality-preserving Laplacian term is added to the sparse coding objective, makes the sparse codes more discriminative while maintaining the similarity of features, and enhances the robustness of sparse coding.

However, LSC [19] can only handle single-feature images; in practice, web images are usually described by multiple features. For instance, commercial search engines extract and

Fig. 1. Example images and their click numbers according to the queries of “bull” and “White Tiger”.

preserve different features such as color histograms, edge direction histograms, and SIFT descriptors. Two categories of methods are used to deal with multimodal data: early fusion and late fusion [20], [50]. They differ in the way they integrate the results from feature extraction on the various modalities. In early fusion, feature vectors from different modalities are concatenated into a new vector. However, this concatenation is problematic due to the specific characteristics of each feature. In late fusion, the results obtained by learning on each modality are integrated, but these fused results may not be satisfactory since the results for each modality might be poor, and assigning appropriate weights to different modalities is difficult.
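The contrast between the two fusion strategies can be shown in a few lines of code. This is a minimal sketch with illustrative toy features and a placeholder per-modality scorer; none of these names or sizes come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                               # number of images (toy sizes)
color = rng.random((n, 64))         # stand-in for a color histogram feature
edge = rng.random((n, 8))           # stand-in for an edge-direction histogram

# Early fusion: concatenate the per-modality vectors into one long vector.
# The specific characteristics of each feature get mixed together.
early = np.concatenate([color, edge], axis=1)

# Late fusion: score each modality separately, then combine the per-modality
# results with weights (choosing good weights is the hard part).
def score(feat):                    # placeholder for a per-modality learner
    return feat.sum(axis=1)

weights = [0.7, 0.3]
late = weights[0] * score(color) + weights[1] * score(edge)

print(early.shape, late.shape)  # (4, 72) (4,)
```

The proposed method avoids committing to either extreme: early fusion appears in the sparse coding term and late fusion in the manifold learning term, as described next.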

In this paper we propose a novel method named multimodal hypergraph learning-based sparse coding for click prediction, and apply the predicted clicks to re-rank web images. Both early and late fusion of multiple features are used in this method, through three main steps.

First, we construct a web image base with associated click annotations, collected from a commercial search engine. As shown in Fig. 1, the search engine has recorded clicks for each image. Fig. 1(a), (b), (e), and (f) indicate that images with high click counts are strongly relevant to the queries, while Fig. 1(c), (d), (g), and (h) present non-relevant images with zero clicks. These two components form the image bases.

Second, we consider both early and late fusion in the proposed objective function. Early fusion is realized by directly concatenating multiple visual features, and is applied in the sparse coding term. Late fusion is accomplished in the manifold learning term. For web images without clicks, we implement hypergraph learning [29] to construct a group of manifolds, which preserves local smoothness using hyperedges. Unlike a graph that has an edge between two vertices, a hypergraph connects a set of vertices by a hyperedge. Common graph-based learning methods usually only consider the pairwise relationship between two vertices, ignoring the higher-order relationship among three or more vertices. Using this term helps the proposed method preserve the local smoothness of the constructed sparse codes.

Finally, an alternating optimization procedure is conducted to explore the complementary nature of different modalities. The weights of different modalities and the sparse codes are simultaneously obtained using this optimization strategy. A voting strategy is then adopted to predict whether an input image will be clicked or not, based on its sparse code.


The obtained click is then integrated within a graph-based learning framework [37] to achieve image re-ranking.

In summary, the main contributions of this paper are:

• First, we effectively utilize search-engine-derived images annotated with clicks, and successfully predict the clicks for new input images without clicks. Based on the obtained clicks, we re-rank the images, a strategy which could be beneficial for improving commercial image searching.

• Second, we propose a novel method named multimodal hypergraph learning-based sparse coding. This method uses both early and late fusion in multimodal learning. By simultaneously learning the sparse codes and the weights of different hypergraphs, the performance of sparse coding improves significantly.

• We conduct comprehensive experiments to empirically analyze the proposed method on real-world web image datasets, collected from a commercial search engine. Their corresponding clicks are collected from internet users. The experimental results demonstrate the effectiveness of the proposed method.

The rest of this paper is organized as follows. In Section II, we briefly review some related work. The proposed method of multimodal hypergraph learning-based sparse coding is presented in Section III. Section IV presents our experimental results. In Section V we apply clicks to image re-ranking, and finally, in Section VI we draw our conclusions.

II. RELATED WORK

A. Multimodal Learning for Web Images

We can assume that each web image $i$ is described by $t$ visual features as $x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(t)}$. A common method for handling multimodal features is to directly concatenate them into a long vector $[x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(t)}]$, but this representation may reduce the performance of algorithms [20], especially when the features are independent or heterogeneous. It is also possible that the structural information of each feature may be lost in feature concatenation [20].

In [20], the methods of multimodal feature fusion are classified into two categories, namely early fusion and late fusion. It has been shown that if an SVM classifier is used, late fusion tends to result in better performance [20]. Wang et al. [30] have provided a method to integrate graph representations generated from multiple modalities for the purpose of video annotation. Geng et al. [31] have integrated graph representations using a kernelized learning approach. Our work integrates multiple features into a graph-based learning algorithm for click prediction.

B. Graph-Based Learning Methods

Graph-based learning methods have been widely used in the fields of image classification [21], ranking [22], and clustering. In these methods, a graph is built according to the given data, where vertices represent data samples and edges describe their similarities. The Laplacian matrix [23] is constructed from the graph and used in a regularization scheme. The local geometry of the graph is preserved during the optimization, and the learned function is forced to be smooth on the graph. However, a simple graph-based method cannot capture higher-order information. Unlike a simple graph, a hyperedge in a hypergraph links several (two or more) vertices, and thereby captures this higher-order information.

Hypergraph learning has achieved excellent performance in many applications. For instance, Shashua [24] utilized the hypergraph for image matching using convex optimization. Hypergraphs have been applied to solve problems in multilabel learning [25] and video segmentation [26]. Tian et al. [27] have provided a semi-supervised learning method named HyperPrior to classify gene expression data, using biological knowledge as a constraint. In [28], a hypergraph-based image retrieval approach has been proposed. In this paper, we construct the hypergraph Laplacian using the algorithm presented in [29].

III. MULTIMODAL HYPERGRAPH LEARNING-BASED SPARSE CODING FOR CLICK PREDICTION

Here we present definitions for multimodal hypergraph learning-based sparse coding for click prediction, and define important notations used in the rest of the paper. Capital letters, e.g. $X$, represent the database of web images. Lower case letters, e.g. $x$, represent images, and $x_i$ is the $i$-th image of $X$. The superscript $(i)$, e.g. $X^{(i)}$ and $x^{(i)}$, represents the web image's feature from the $i$-th modality. A multimodal image database with $n$ images and $t$ representations can be represented as $X = \{X^{(i)} = [x_1^{(i)}, \ldots, x_n^{(i)}] \in \mathbb{R}^{m_i \times n}\}_{i=1}^{t}$.

Fig. 2 illustrates the details of the proposed framework. First, multiple features are extracted to describe the web images. Second, from these features, we construct multiple hypergraph Laplacians, and perform sparse coding based on the integration of the multiple features. Meanwhile, the local smoothness of the sparse codes is preserved by using manifold learning on the hypergraphs. The sparse codes of the images, and the weights for the different hypergraphs, are obtained by simultaneous optimization using an iterative two-stage procedure. A voting strategy is adopted to predict the click as a binary event (click or no click) from the obtained sparse codes. Specifically, the non-zero positions in a sparse code represent a group of base images, which are used to reconstruct the image. If more than 50% of these images have clicks, then the image is predicted as clicked; otherwise, the image is predicted as not clicked. Finally, a graph-based schema [37] is conducted with the predicted clicks to achieve image re-ranking. Some important notations are presented in Table I.
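The voting step described above is easy to state in code. A minimal sketch, assuming the sparse code and the per-base click labels are already available (function and variable names are ours, not the paper's):

```python
import numpy as np

def predict_click(sparse_code, base_clicked, tol=1e-8):
    """Voting strategy: the non-zero positions of the sparse code select
    the base images used in the reconstruction; if more than 50% of the
    selected bases were clicked, predict 'click', else 'no click'."""
    selected = np.abs(sparse_code) > tol       # bases that contribute
    if not selected.any():
        return False                            # no bases selected: no click
    return bool(base_clicked[selected].mean() > 0.5)

base_clicked = np.array([True, True, False, True, False])
code = np.array([0.0, 0.9, 0.2, 0.0, 0.0])      # bases 1 and 2 selected
print(predict_click(code, base_clicked))        # 1 of 2 selected clicked -> False
```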

A. Definition of Hypergraph-Based Sparse Coding

Given an image $x \in \mathbb{R}^d$, and web image bases with associated clicks $A = [a_1, a_2, \ldots, a_s] \in \mathbb{R}^{d \times s}$, sparse coding can build a linear reconstruction of the given image $x$ by using the bases in $A$: $x = c_1 a_1 + c_2 a_2 + \cdots + c_s a_s = Ac$. The reconstruction coefficient vector $c$ for click prediction is


Fig. 2. The framework of multimodal hypergraph learning-based sparse coding for click prediction. First, multiple features are extracted from both the input images and the image bases. Second, multiple hypergraph Laplacians are constructed, and the sparse codes are built. Meanwhile, the locality of the obtained sparse codes is preserved by using manifold learning on hypergraphs. Then, the sparse codes of the images and the weights for the different hypergraphs are obtained by simultaneous optimization through an iterative two-stage procedure. A voting strategy is used to achieve click data propagation. Finally, the obtained sparse codes are integrated with the graph-based schema for image re-ranking.

TABLE I

IMPORTANT NOTATIONS AND THEIR DESCRIPTIONS

sparse, meaning that only a small proportion of the entries in $c$ are non-zero. $\|c\|_0$ denotes the number of non-zero entries of the vector $c$, and sparse coding can be described as: $\min \|c\|_0$ s.t. $x = Ac$. However, this minimization problem is NP-hard. It has been proven in [18] that minimizing the $l_1$-norm approximates the sparsest near-solution. Therefore, most studies describe the sparse coding problem as the minimization of the $l_1$-norm of the reconstruction coefficients. The objective of sparse coding can be defined as [32]:

$$\min_{c} \; \|x - Ac\|^2 + \alpha \|c\|_1. \tag{1}$$

The reconstruction error is represented by the first term in (1), and the second term is adopted to control the sparsity of the sparse code $c$. $\alpha$ is the tuning parameter used to balance sparsity and reconstruction error. Using this sparse coding method, web images are represented independently, and similar web images can be described by totally different sparse codes. One reason for this is the loss of locality information in equation (1).

Therefore, to preserve the locality information, the hypergraph Laplacian is utilized in (1). We adopt $V$ and $E$ to represent the vertex set and the hyperedge set of a hypergraph $G = (V, E)$. The hyperedge weight vector is $w$; each hyperedge $e_i$ is assigned a weight $w(e_i)$. A $|V| \times |E|$ incidence matrix $H$ denotes $G$, with the following elements:

$$H(v, e) = \begin{cases} 1, & \text{if } v \in e \\ 0, & \text{if } v \notin e. \end{cases} \tag{2}$$

Based on $H$, the vertex degree of each vertex $v \in V$ is

$$d(v) = \sum_{e \in E} w(e) H(v, e), \tag{3}$$

and the edge degree of hyperedge $e \in E$ is

$$\delta(e) = \sum_{v \in V} H(v, e). \tag{4}$$

$D_v$ and $D_e$ are used to denote the diagonal matrices of the vertex degrees and hyperedge degrees, respectively. Let $W$ denote the diagonal matrix of the hyperedge weights. The value of each hyperedge's weight is set according to the rules used in [28]. First, we construct the $|V| \times |V|$ affinity matrix $\Phi$ according to $\Phi_{ij} = \exp(-\|v_i - v_j\|/\sigma^2)$, where $\sigma$ is the average distance among all vertices. Then, the weight for each hyperedge is calculated as $W_i = \sum_{v_j \in e_i} \Phi_{ij}$, where $e_i$ is the vertex set composed of the $K$ nearest neighbors of vertex $v_i$. The unnormalized hypergraph Laplacian matrix [29] can be defined as follows:

$$L = D_v - H W D_e^{-1} H^T. \tag{5}$$
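Eqs. (2)–(5) translate almost line for line into code. A small sketch that builds one K-NN hyperedge per vertex and returns the unnormalized hypergraph Laplacian; the affinity and weight rules follow the description above, while details such as including the hub vertex in its own hyperedge are our assumptions:

```python
import numpy as np

def hypergraph_laplacian(V, k=3):
    """V: (n, d) array of vertex features. One hyperedge per vertex,
    containing that vertex and its k nearest neighbors; returns
    L = Dv - H W De^{-1} H^T as in eq. (5)."""
    n = V.shape[0]
    dist = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    sigma = dist.mean()                                # average pairwise distance
    Phi = np.exp(-dist / sigma**2)                     # affinity Phi_ij
    H = np.zeros((n, n))                               # |V| x |E| incidence, eq. (2)
    for i in range(n):
        members = np.argsort(dist[i])[: k + 1]         # vertex i plus k neighbors
        H[members, i] = 1.0
    # hyperedge weight: sum of affinities of the member vertices (rule of [28])
    w = np.array([Phi[H[:, e] > 0, e].sum() for e in range(n)])
    W = np.diag(w)
    Dv = np.diag((H * w[None, :]).sum(axis=1))         # vertex degrees, eq. (3)
    De = np.diag(H.sum(axis=0))                        # edge degrees, eq. (4)
    return Dv - H @ W @ np.linalg.inv(De) @ H.T        # eq. (5)

L = hypergraph_laplacian(np.random.default_rng(0).random((10, 5)), k=3)
print(L.shape)  # (10, 10)
```

A useful sanity check on this construction is that $L$ is symmetric and annihilates the constant vector, just like an ordinary graph Laplacian.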

Therefore, we want the sparse codes of images within the same hyperedge to be similar to each other. Weighting the sum of pairwise distances between the sparse codes within each hyperedge by $w(e)/\delta(e)$, the hypergraph-based sparse coding can be formulated as

$$\min_{c_1,\ldots,c_n} \; \sum_i \|x_i - Ac_i\|^2 + \alpha \sum_i \|c_i\|_1 + \frac{\beta}{2} \sum_{e \in E} \sum_{(p,q) \in e} \frac{w(e)}{\delta(e)} \|c_p - c_q\|^2, \tag{6}$$

where $c_i$ is the sparse code vector of image $x_i$. Hence, web images connected by the same hyperedge are encoded as similar sparse codes using this formulation, and the similarity among web images within the same hyperedge is preserved. Denoting $X = [x_1, x_2, \ldots, x_n]$ and $C = [c_1, c_2, \ldots, c_n]$, Equation (6) can be rewritten as

$$\min_{C} \; \|X - AC\|_F^2 + \alpha \sum_i \|c_i\|_1 + \beta \, \mathrm{tr}(C L C^T). \tag{7}$$
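For reference, the value of objective (7) can be evaluated term by term. A sketch with random stand-in matrices; a plain graph Laplacian stands in for the hypergraph Laplacian $L$, and all sizes are illustrative:

```python
import numpy as np

def objective_value(X, A, C, L, alpha, beta):
    """Evaluate the three terms of objective (7): Frobenius reconstruction
    error, l1 sparsity of the codes, and smoothness tr(C L C^T)."""
    recon = np.linalg.norm(X - A @ C, "fro") ** 2
    sparsity = alpha * np.abs(C).sum()          # equals sum_i ||c_i||_1
    smooth = beta * np.trace(C @ L @ C.T)
    return recon + sparsity + smooth

rng = np.random.default_rng(2)
d, s, n = 8, 20, 6                              # toy sizes
X = rng.random((d, n))
A = rng.random((d, s))
C = rng.random((s, n))
S = rng.random((n, n)); S = (S + S.T) / 2       # stand-in affinity matrix
L = np.diag(S.sum(axis=1)) - S                  # stand-in (graph) Laplacian
val = objective_value(X, A, C, L, alpha=0.1, beta=0.5)
print(val > 0)  # True
```

Since the stand-in Laplacian is positive semidefinite, the smoothness term is non-negative, so all three terms push the objective in a consistent direction.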

B. Multimodal Feature Combinations

In real applications, images are described by multimodal features. Given a dataset with multiple features $X = \{X^{(i)} = [x_1^{(i)}, \ldots, x_n^{(i)}] \in \mathbb{R}^{m_i \times n}\}_{i=1}^{t}$, in which each representation $X^{(i)}$ is a feature matrix from view $i$, we can formulate the objective function based on (7) as:

$$\min_{c_1,\ldots,c_n} \; \sum_{j=1}^{t} \sum_i \big\|x_i^{(j)} - A^{(j)} c_i\big\|^2 + \alpha \sum_i \|c_i\|_1 + \beta \, \mathrm{tr}\Big(C \Big(\sum_{j=1}^{t} \lambda_j L^{(j)}\Big) C^T\Big), \tag{8}$$

where $L^{(j)}$ is the constructed hypergraph Laplacian matrix for the $j$-th view, and $\lambda_j$ is the corresponding weight. $A^{(j)} = [a_1^{(j)}, a_2^{(j)}, \ldots, a_s^{(j)}] \in \mathbb{R}^{m_j \times s}$ is a specified codebook for the $j$-th view. Eq. (8) can be rewritten as

$$\min_{C,\lambda} \; \sum_{j=1}^{t} \big\|X^{(j)} - A^{(j)} C\big\|_F^2 + \alpha \sum_i \|c_i\|_1 + \beta \, \mathrm{tr}\Big(C \Big(\sum_{j=1}^{t} \lambda_j L^{(j)}\Big) C^T\Big) \quad \text{s.t.} \; \sum_{j=1}^{t} \lambda_j = 1, \; \lambda_j > 0. \tag{9}$$

The objective of Equation (9) is to find a sparse linear reconstruction of the given images in $X$ by using multiple bases from the different features. The reconstruction coefficients $c_i$ for each image are sparse, which means that only a small fraction of the entries of $c_i$ are non-zero. We present implementation details of (9) in Sections III.C and III.D. We propose an alternating optimization procedure with two stages to solve the problem in (9). First, we fix the weights $\lambda$ and optimize $C$. Then we fix $C$ and optimize $\lambda$. These two stages are iterated until the objective function converges.
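The two-stage alternation has a simple generic skeleton. The update functions below are placeholders for the procedures of Sections III.C and III.D, and the toy instantiation only demonstrates that the loop converges:

```python
import numpy as np

def alternating_optimization(objective, update_C, update_lmbda, C0, lmbda0,
                             max_iter=50, tol=1e-6):
    """Stage 1: fix lambda, update C.  Stage 2: fix C, update lambda.
    Iterate until the objective stops decreasing (generic skeleton)."""
    C, lmbda = C0, lmbda0
    prev = np.inf
    for _ in range(max_iter):
        C = update_C(C, lmbda)          # e.g. per-column sparse-code updates
        lmbda = update_lmbda(C)         # e.g. closed-form weight update
        val = objective(C, lmbda)
        if prev - val < tol:
            break
        prev = val
    return C, lmbda

# Toy instantiation: minimize ||C||_F^2 + sum(lambda^2) to show the loop runs.
obj = lambda C, l: (C**2).sum() + (l**2).sum()
C, lm = alternating_optimization(obj,
                                 update_C=lambda C, l: 0.5 * C,
                                 update_lmbda=lambda C: np.array([0.5, 0.5]),
                                 C0=np.ones((3, 3)), lmbda0=np.array([1.0, 0.0]))
print(obj(C, lm) < obj(np.ones((3, 3)), np.array([1.0, 0.0])))  # True
```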

C. Implementations for Sparse Codes

Instead of optimizing the entire sparse code matrix Csimultaneously, we optimize each ck sequentially until thewhole C converges. To optimize ck , we should fix all theleft sparse codes cp (p �= k), and the weights λ. Therefore,

we can obtain∧L =

t∑j=1

λ j L( j ). The optimization of (9) can be

rewritten with respect to ck as follows:min

ckQ (ck) + α ‖ck‖1 ∀k, (10)

where

Q (ck) =t∑

i=1

∥∥∥x(i)k − A(i)ck

∥∥∥2

+β(

cTk

(C

∧L

)+

(C

∧L)T

ck − cTk

∧Lkk

ck

). (11)

To solve the problem in (11), we expand the first term $\sum_{i=1}^{t} \| x_k^{(i)} - A^{(i)} c_k \|^2$ of (11):

$$\begin{aligned}
\sum_{i=1}^{t} \big\| x_k^{(i)} - A^{(i)} c_k \big\|^2
&= \sum_{i=1}^{t} \big( x_k^{(i)} - A^{(i)} c_k \big)^T \big( x_k^{(i)} - A^{(i)} c_k \big) \\
&= \sum_{i=1}^{t} \Big( x_k^{(i)T} x_k^{(i)} - x_k^{(i)T} A^{(i)} c_k - \big( A^{(i)} c_k \big)^T x_k^{(i)} + \big( A^{(i)} c_k \big)^T A^{(i)} c_k \Big) \\
&= \| x_k - A c_k \|^2,
\end{aligned} \quad (12)$$

where $x_k = [x_k^{(1)}; \ldots; x_k^{(t)}] \in \mathbb{R}^{(m_1 + m_2 + \cdots + m_t) \times 1}$ and $A = [A^{(1)}; \cdots; A^{(t)}] \in \mathbb{R}^{(m_1 + m_2 + \cdots + m_t) \times N}$. According to the method in [32], we adopt the feature-sign search algorithm to solve for $c_k$. Fig. 3 shows the details of this algorithm. We should emphasize that, in order to speed up the convergence of the sparse codes, we initialize them with the results of general sparse coding. After the optimization of $c_k$ is completed, C is updated accordingly. In our experiments, C converges in only a few iterations. In Fig. 3, the matrix $C_{-k}$ is the submatrix obtained by removing the $k$-th column of matrix C, and the vector $\hat{L}_{k,-k}$ is the subvector formed by removing the $k$-th element of $\hat{L}_k$. $\Upsilon_{c_k}$ and $\Upsilon_{c_k c_k}$ in Fig. 3 are calculated as

$$\Upsilon_{c_k} = \frac{\partial Q(c_k)}{\partial c_k} = 2 \Big( A^T A c_k - A^T x_k + \beta C_{-k} \hat{L}_{k,-k} + \beta \hat{L}_{kk} c_k \Big),$$

$$\Upsilon_{c_k c_k} = \frac{\partial^2 Q(c_k)}{\partial c_k^2} = 2 \Big( A^T A + \beta \hat{L}_{kk} I \Big). \quad (13)$$
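The gradient and Hessian in (13) can be validated against finite differences of the smooth part $Q(c_k)$ of (11). In this sketch the helper names are our own, and $\hat{L}_{k,-k}$ is passed as a plain vector:

```python
import numpy as np

def Q(ck, xk, A, C_minus_k, L_row_minus_k, Lkk, beta):
    # Smooth part of the per-code objective (Eq. (11)), other columns fixed.
    recon = np.sum((xk - A @ ck) ** 2)
    lap = 2.0 * ck @ (C_minus_k @ L_row_minus_k) + Lkk * ck @ ck
    return recon + beta * lap

def grad_hess(ck, xk, A, C_minus_k, L_row_minus_k, Lkk, beta):
    # Gradient and Hessian of Q, as in Eq. (13).
    g = 2.0 * (A.T @ A @ ck - A.T @ xk
               + beta * C_minus_k @ L_row_minus_k + beta * Lkk * ck)
    H = 2.0 * (A.T @ A + beta * Lkk * np.eye(len(ck)))
    return g, H

# Finite-difference check of the gradient on random data.
rng = np.random.default_rng(2)
m, s, n = 7, 5, 4
A = rng.normal(size=(m, s))
xk = rng.normal(size=m)
C_minus_k = rng.normal(size=(s, n - 1))
L_row = rng.normal(size=n - 1)
Lkk, beta = 1.3, 0.2
ck = rng.normal(size=s)

g, H = grad_hess(ck, xk, A, C_minus_k, L_row, Lkk, beta)
eps = 1e-6
num = np.array([
    (Q(ck + eps * e, xk, A, C_minus_k, L_row, Lkk, beta)
     - Q(ck - eps * e, xk, A, C_minus_k, L_row, Lkk, beta)) / (2 * eps)
    for e in np.eye(s)])
assert np.allclose(g, num, atol=1e-4)
```

Since Q is quadratic in $c_k$, the central difference agrees with the analytic gradient up to floating-point error.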


2024 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 5, MAY 2014

Fig. 3. Algorithm details of implementations for sparse codes.

D. Implementations for Obtaining Weights

In the second stage, we fix C to update $\lambda$. Therefore, in (9), only the term $\mathrm{tr}\big( C \big( \sum_{j=1}^{t} \lambda_j L^{(j)} \big) C^T \big)$ can affect the objective function's value, and Eq. (9) can be rewritten as

$$\min_{\lambda} \; \mathrm{tr}\Big( C \Big( \sum_{j=1}^{t} \lambda_j L^{(j)} \Big) C^T \Big) \quad \text{s.t.} \; \sum_{j=1}^{t} \lambda_j = 1, \; \lambda_j > 0. \quad (14)$$

The solution to $\lambda$ in (14) is $\lambda_p = 1$ for the modality $p$ with the minimum $\mathrm{tr}(C L^{(p)} C^T)$ over the different modalities, and $\lambda_p = 0$ otherwise. This indicates that only one modality will be selected, so this solution cannot effectively explore the complementary characteristics of the different modalities. To solve this problem, the method in [33] is utilized: $\lambda_j$ in (14) is replaced by $\lambda_j^z$ with $z > 1$. Then $\sum_{j=1}^{t} \lambda_j^z$ reaches its minimum when $\lambda_j = 1/t$ under the constraints $\sum_{j=1}^{t} \lambda_j = 1$, $\lambda_j > 0$, so the relaxation discourages degenerate single-modality solutions. In this case, Eq. (14) can be reformulated as

$$\min_{\lambda} \; \mathrm{tr}\Big( C \Big( \sum_{j=1}^{t} \lambda_j^z L^{(j)} \Big) C^T \Big) \quad \text{s.t.} \; \sum_{j=1}^{t} \lambda_j = 1, \; \lambda_j > 0, \quad (15)$$

where $z > 1$. The constraint $\sum_{j=1}^{t} \lambda_j = 1$ is taken into consideration through a Lagrange multiplier, and the objective function in (15) can be rewritten as the Lagrangian

$$\Psi(\lambda, \varsigma) = \mathrm{tr}\Big( C \Big( \sum_{j=1}^{t} \lambda_j^z L^{(j)} \Big) C^T \Big) - \varsigma \Big( \sum_{j=1}^{t} \lambda_j - 1 \Big). \quad (16)$$


YU et al.: CLICK PREDICTION FOR WEB IMAGE RERANKING 2025

Fig. 4. The details of the alternating algorithm to obtain the sparse codes and optimal weights.

Setting the partial derivatives of the Lagrangian in (16) with respect to $\lambda_j$ and $\varsigma$ to zero, $\lambda_j$ can be obtained as

$$\lambda_j = \frac{\big( 1 / \mathrm{tr}( C L^{(j)} C^T ) \big)^{1/(z-1)}}{\sum_{j=1}^{t} \big( 1 / \mathrm{tr}( C L^{(j)} C^T ) \big)^{1/(z-1)}}. \quad (17)$$

The Laplacian matrix $L^{(j)}$ is positive semidefinite, so we have $\lambda_j \geq 0$. When C is fixed, the global optimum of $\lambda$ can be obtained from (17). The algorithm details are listed in Fig. 4.
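A minimal numerical check of the closed-form update (17): the weights it returns sum to one, are positive, and satisfy the stationarity condition $z \lambda_j^{z-1} \mathrm{tr}(C L^{(j)} C^T) = \varsigma$ with the same multiplier $\varsigma$ for every $j$. The helper names and the toy data are our own:

```python
import numpy as np

def update_weights(C, Ls, z=3.0):
    # Closed-form weight update of Eq. (17).
    tr = np.array([np.trace(C @ L @ C.T) for L in Ls])
    inv = (1.0 / tr) ** (1.0 / (z - 1.0))
    return inv / inv.sum(), tr

rng = np.random.default_rng(3)
s, n, z = 4, 6, 3.0
C = rng.normal(size=(s, n))

def rand_laplacian(n):
    # Laplacian of a random weighted complete graph (symmetric PSD).
    W = rng.uniform(0.1, 1.0, size=(n, n))
    W = (W + W.T) / 2.0
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(1)) - W

Ls = [rand_laplacian(n) for _ in range(3)]
lam, tr = update_weights(C, Ls, z)

assert np.isclose(lam.sum(), 1.0) and np.all(lam > 0)
# Stationarity: z * lam_j^(z-1) * tr_j is the same multiplier for every j.
station = z * lam ** (z - 1) * tr
assert np.allclose(station, station[0])
```

The constant stationarity value confirms that (17) is the KKT point of (15) for fixed C.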

E. Time Complexity Analysis

We suppose the experiment is conducted on a dataset with image bases $A = \{ A^{(i)} = [a_1^{(i)}, \ldots, a_{N_1}^{(i)}] \in \mathbb{R}^{m_i \times N_1} \}_{i=1}^{t}$ and image set $X = \{ X^{(i)} = [x_1^{(i)}, \ldots, x_{N_2}^{(i)}] \in \mathbb{R}^{m_i \times N_2} \}_{i=1}^{t}$. The time complexity of the alternating algorithm for obtaining the sparse codes of X consists of two parts:

• The calculation of $L^{(j)}$ for the visual manifolds: the time complexity of this part is $O\big( \big( \sum_{i=1}^{t} m_i \big) N_2^2 \big)$.
• The alternating algorithm for obtaining the sparse codes and optimal weights: for the update of $\lambda$, the time complexity is $O(t \cdot N_2^2)$. For the sparse coding part, we adopt the efficient sparse coding algorithm [32] to calculate the sparse code c for each x, and C converges in only a few iterations. We use $\Phi$ to denote the time complexity of efficient sparse coding, so the time complexity of this part is $O(T_1 \cdot N_2 \Phi)$.

Therefore, the entire time complexity of the proposed algorithm is $O\big( \big( \sum_{i=1}^{t} m_i \big) N_2^2 + \big( t \cdot N_2^2 + T_1 \cdot N_2 \Phi \big) T_2 \big)$, where $T_2$ is the number of iterations of the alternating optimization. $T_1$ is less than three and $T_2$ is less than five in all experiments. According to the demonstrations in [32], the time complexity $\Phi$ of efficient sparse coding is lower than that of several state-of-the-art sparse coding methods, including a generic QP solver, a modified version of LARS [47], grafting [48], and Chen et al.'s interior point method [49]. Therefore,

TABLE II

DETAILS OF THE REAL-WORLD WEB QUERIES DATASETS

it can be guaranteed that our method achieves state-of-the-art time complexity.

IV. EXPERIMENTAL RESULTS AND DISCUSSION

To demonstrate the effectiveness of the proposed method, we conduct experiments on a real-world dataset with images collected from a commercial search engine. We compare the performance of the proposed method with representative algorithms, such as single hypergraph learning-based sparse coding [19], single graph learning-based sparse coding [19], regular sparse coding [41], and the k-nearest neighbor (k-NN) algorithm. The experiments are conducted in two stages. In the first stage, we compare our method with the others for click prediction. In the second stage, we conduct experiments to test the sensitivity of the parameters. The details are provided below.

A. Dataset Description

We use the real-world Web Queries dataset, which contains 200 diverse and representative queries collected from the query log of a commercial search engine. In total, it contains 330,665 images. Table II provides details of the real-world web query datasets, including the number of queries for each category and some examples. We select this dataset to assess our method for click prediction for two main reasons. First, the web queries and their related images originate directly from the internet, and the queries are mainly 'hot' (i.e., current) queries that have appeared frequently over the past six months. Second, this dataset contains real click data, making it easy to evaluate whether our method accurately predicts clicks on web images. The labels of the images in the dataset are assigned according to their click counts. The images fall into two categories: images whose click count is larger than zero, and images whose click count is zero. We represent each image by extracting five different visual features: block-wise color moments (CM), the HSV color histogram (HSV), the color autocorrelogram (CO), wavelet texture (WT), and a face feature.

B. Experiment Configuration

To evaluate the performance of the proposed method for click prediction, we compare the following seven methods, including the proposed method:

1. Multimodal hypergraph learning-based sparse coding (MHL). Parameters α and β in (9) are selected by five-fold cross-validation. The neighborhood size k in the hyperedge generation process and the value of z in (15) are tuned to their optimal values.


TABLE III

PERFORMANCE COMPARISONS OF CLASSIFICATION ACCURACY (%) FOR CLICK PREDICTION WITH A FIXED SIZE OF IMAGE BASES AND VARIED SIZES OF TEST IMAGE SETS. THE COMPARISON INCLUDES MHL, MGL, SHL, SGL, SC, KNN, AND GP. THE SIZE OF THE TEST IMAGE SET IS VARIED AMONG [5%, 10%, 15%, 20%, 25%], AND THE SIZE OF THE IMAGE BASE IS FIXED AT 75%. THE RESULTS SHOWN IN BOLD ARE SIGNIFICANTLY BETTER THAN THE OTHERS

2. Multimodal graph learning-based sparse coding (MGL). Following the framework of (9), we adopt a simple graph [40] in place of the hypergraph. The parameters α and β in (9) are determined using five-fold cross-validation. The neighborhood size k and the value of z are tuned to their optimal values.

3. Single hypergraph learning-based sparse coding (SHL) [19]. The framework in (7) is adopted for each visual feature separately. The average performance of SHL is reported, and we name it SHL(A). In addition, we concatenate the visual features into a long vector and conduct SHL on it; the results are denoted SHL(L). The parameters of this method are tuned to their optimal values.

4. Single graph learning-based sparse coding (SGL) [19]. We adopt a simple graph [40] in place of the hypergraph in (7). The performance of SGL(A) and SGL(L) is recorded. The parameters are tuned to their optimal values.

5. Regular sparse coding (SC). Sparse coding is conducted directly on each visual feature separately using the Lasso [41]. The average performance of SC is reported and denoted SC(A). In addition, we conduct sparse coding on the concatenated long vector and record the results as SC(L).

6. The k-nearest neighbor algorithm (KNN). To provide a baseline for the experiment, we adopt KNN for each visual feature. This method classifies a sample by finding its closest samples in the training set. Each parameter is tuned to its optimal value, and KNN(A) and KNN(L) are reported.

7. Gaussian Process regression (GP) [10]. This method identifies a group of clicked images and conducts dimensionality reduction on the concatenated visual features. A Gaussian Process regressor is trained on the set of clicked images and is then used to predict click counts for all images. This method is denoted "GP" in the experimental results.

We randomly select images to form the image bases and test images. Since different queries contain different numbers of images, it would be inappropriate to use a fixed-size setting for all queries. Therefore, we choose different percentages of images to form the image bases. Specifically, the experiments are separated into two stages: first, the size of the test image set is fixed at 5% and the size of the image base is varied among [10%, 30%, 50%, 70%, 90%]; second, the size of the image base is fixed at 75% and the size of the test image set is varied among [5%, 10%, 15%, 20%, 25%]. In addition, we conduct experiments to show the effects of different parameters. For all methods, we independently repeat the experiments five times with randomly selected image bases and report the averaged results.

C. Results on Click Prediction

The performance of MHL was compared with the other methods. We performed MHL, MGL, SHL, SGL, and SC to obtain sparse codes for the input images, and the voting strategy was utilized to predict whether each image would be clicked or not. We report the classification accuracy (%) as the measure of click prediction. Table III lists the estimated average classification accuracy for the different methods. We used 75% of the images from each query to form the image base. The experiments were conducted under five different conditions, where the proportion of input images varied in the range of 5-25%. According to the experimental results, we observe that nearly all the methods effectively improve on the baseline. Our method, MHL, achieved the best results for click prediction, with the hypergraph-based methods performing better than the simple-graph-based methods. The high-order information preserved by the hypergraph construction is beneficial for preserving local smoothness. Compared with a normal graph, the use of the hypergraph can effectively improve click prediction performance. In addition, we observe that the multimodality methods (MHL and MGL) outperformed the single-modality methods (SHL and SGL).

This suggests that the multimodality design in Eq. (9) and its corresponding alternating optimization algorithm are effective in obtaining optimal classification results. Another interesting finding is that the graph-based learning methods (MHL, MGL, SHL, and SGL) perform better than regular sparse coding (SC); the graph- and hypergraph-based regularizers are efficient in obtaining good sparse coding results. Additionally, we found that MHL is stable when the size of the test image set increases from 5% to 25%. Table IV compares the performance of the different methods when the size of the test image set is fixed at 5% and the size of the training image base is varied in the range of 10-90%. In general, MHL had the best performance. We observed that the classification accuracy of MHL increases from 64.8% to 66.9% when the size of the image base increases from 10% to 90%, indicating that a small image base can adversely affect click prediction performance.


TABLE IV

PERFORMANCE COMPARISONS OF CLASSIFICATION ACCURACY (%) FOR CLICK PREDICTION WITH VARIED SIZES OF IMAGE BASES AND A FIXED SIZE OF TEST IMAGE SET. THE COMPARISON INCLUDES MHL, MGL, SHL, SGL, SC, KNN, AND GP. THE SIZE OF THE IMAGE BASE IS VARIED AMONG [10%, 30%, 50%, 70%, 90%], AND THE SIZE OF THE TEST IMAGE SET IS FIXED AT 5%. THE RESULTS SHOWN IN BOLD ARE SIGNIFICANTLY BETTER THAN THE OTHERS

Fig. 5. Average classification accuracy (%) with different values of the parameters α, β, K, and z. The sizes of the image bases and test image sets are fixed at 10% and 5%, respectively. (a) Classification accuracy versus α. (b) Classification accuracy versus β. (c) Classification accuracy versus K. (d) Classification accuracy versus z.

Another interesting finding in Tables III and IV is that the sparse coding method does not perform better than KNN. The reason is that the overcompleteness of sparse coding causes a loss of locality in the features: similar web images can be described by totally different sparse codes, so the performance in click data prediction is unstable. To address this issue, an additional locality-preserving term with a Laplacian matrix is added to the sparse coding formulation. The experimental results in Tables III and IV show that SHL and SGL perform better than KNN.

D. The Effect of Changing Parameter Values

In Fig. 5, we show the sensitivity of the parameters α, β, K, and z in the graph-based sparse coding algorithms. In these experiments, we fixed the percentage of the image base at 10% and the percentage comprising the test image set at 5%. We first fixed β to β_opt and varied α over [10^{-2}α_opt, 10^{-1}α_opt, α_opt, 10^{1}α_opt, 10^{2}α_opt]; the average classification accuracies of the methods are shown in Fig. 5(a). We then fixed α to α_opt and varied β over [10^{-2}β_opt, 10^{-1}β_opt, β_opt, 10^{1}β_opt, 10^{2}β_opt]; the average classification accuracies are shown in Fig. 5(b). From these figures we see that MHL performs best. From 10^{-2}α_opt to 10^{1}α_opt, the methods perform stably, as shown in Fig. 5(a); however, from 10^{1}α_opt to 10^{2}α_opt, the performance degrades severely. In Fig. 5(b), the methods are stable as β increases from 10^{-2}β_opt to 10^{2}β_opt. In Fig. 5(c), we observe that the hypergraph-based methods


Fig. 6. Performance comparisons of classification accuracy (%). The comparison is conducted among MHL, SHL(BOF), and SHL(LLC). The size of the image bases is fixed at 75% and the size of the test image set is varied among [5%, 10%, 15%, 20%, 25%].

(MHL, SHL(A), and SHL(L)) obtain the highest performance when K is fixed at 10. The graph-based methods (MGL, SGL(A), and SGL(L)) have the best performance when K is set to 5. In Fig. 5(d), we varied the parameter z from 2 to 6 and observed that the classification accuracies of MHL and MGL are highest when z is 5.

E. Discussion About Features

We adopted conventional features in the experiments whose results are presented in Section IV.C. Recently, some state-of-the-art features have been proposed, such as the Bag of Features (BOF) built on SIFT descriptors [45], and locality-constrained linear coding (LLC) [46], an SPM-like descriptor that has obtained state-of-the-art performance in image classification. In this part, 1024-dimensional BOF features and 21504-dimensional LLC features with spatial block structure [1 2 4] are extracted for the images. The experimental results are presented in Fig. 6. The comparison is conducted among MHL, SHL(BOF), and SHL(LLC). The size of the image bases is fixed at 75% and the size of the test image set is varied among [5%, 10%, 15%, 20%, 25%]. The experimental results demonstrate that our proposed method MHL performs better than SHL with BOF or LLC features, which indicates that the multimodal learning adopted in MHL is effective in enhancing classification performance.

F. Experimental Results on Scene Recognition

In this part, we demonstrate that MHL performs well by conducting image classification experiments on the standard Scene 15 dataset [43], which contains 1500 images belonging to 15 natural scene categories: bedroom, CALsuburb, industrial, kitchen, livingroom, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice, and store. Five different features are adopted to describe the scenes: the Color Histogram (CH), Edge Direction Histogram (EDH), SIFT [45], Gist [44], and Locality-constrained Linear Coding (LLC) [46].

In our experiments, the labeled sample images in the bases are randomly selected. Since the size of each class varies significantly, it is inappropriate to use a fixed number for

Fig. 7. Performance comparisons of classification accuracy (%) for scene recognition. The comparison is conducted on MHL, MGL, SHL, SGL, SC, and KNN. (a) The size of the image bases is fixed at 75% and the size of the test image set is varied among [5%, 10%, 15%, 20%, 25%]. (b) The size of the test image set is fixed at 5% and the size of the image base is varied among [10%, 30%, 50%, 70%, 90%].

Fig. 8. Comparisons of the average NDCG measurements. From these results, we can see that the click-based method, which combines the multimodal visual features and click information, outperforms the other methods.

different classes. Therefore, we fix a percentage, which is used to choose images from the different classes to form the image bases and test images. For the experiments on scene recognition, we compare the performance of six methods: multimodal hypergraph learning-based sparse coding (MHL), multimodal graph learning-based sparse coding (MGL), single hypergraph learning-based sparse coding (SHL), single graph learning-based sparse coding (SGL), regular sparse coding (SC), and the k-nearest neighbor algorithm (KNN). The details of these six methods were presented in Section IV.B. As shown in Fig. 7(a), our proposed method MHL outperforms the other


Fig. 9. The top 10 images in the commercial ranking list (baseline) and the re-ranking lists obtained using the graph-based method and the click-based method for the queries "Butterfly" and "Music". The orders of the images are provided, with green, blue, and red indicating relevance scales 2, 1, and 0, respectively. From the figure, it is clear that the click-based method obtains the best results. (a) Results of image re-ranking for the query "Butterfly". (b) Results of image re-ranking for the query "Music".

methods in all cases. This demonstrates that the proposedmethod can perform well on standard datasets.

V. APPLICATION OF THE ALGORITHM

FOR IMAGE RERANKING

To evaluate the efficacy of the proposed method for image re-ranking, we conduct experiments on a new dataset consisting of two subsets, A and B. Subset A includes the 330,665 images with 200 queries used in Section IV, and subset B contains 94,925 images for the same 200 queries. Images in subset A have associated click data and are used to form the image base for sparse coding. The images in subset B retain the original ranking information from a popular search engine, so we can easily evaluate whether our approach improves on the search engine's algorithm. In subset B, each image was labeled by a human oracle according to its relevance to the corresponding query, as either "not relevant", "relevant", or "highly relevant". We utilize

scores of 0, 1, and 2 to indicate the three relevance levels,respectively.

We use the popular graph-based re-ranking scheme [38] in the design of our experiment. In this framework, a ranking score list $s = [s_1, s_2, \ldots, s_n]^T$ is a vector of ranking scores corresponding to an image set $X = \{x_1, x_2, \ldots, x_n\}$. The purpose of graph-based re-ranking is to calculate a new ranking score list through learning, based on the images' visual features. Therefore, the re-ranking process can be formulated as a function $f: s = f(X, s^*)$, in which $s^* = [s_1^*, s_2^*, \ldots, s_n^*]^T$ is the initial ranking score list. Hence, graph-based re-ranking can be formulated as a regularization framework as follows:

$$\arg\min_{s} \; \big\{ \Omega(s) + \gamma R_{emp}(s) \big\}, \quad (18)$$

in which $\Omega(s)$ is the regularization term that makes the ranking scores of visually similar images close, and $R_{emp}(s)$ is an empirical loss. $\gamma$ is a tradeoff parameter to balance the


empirical loss and the regularizer. According to [38], (18) canbe rewritten as:

$$\arg\min_{s} \; s^T L s + \gamma \big\| s - s^* \big\|^2, \quad (19)$$

in which $L$ is a Laplacian matrix. Since we can calculate a group of weights $\lambda_j$ for the modalities, the Laplacian in (19) is assigned as $L = \sum_{j=1}^{t} \lambda_j L^{(j)}$.
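Since (19) is an unconstrained quadratic in $s$, its minimizer is available in closed form: setting the gradient $2Ls + 2\gamma(s - s^*)$ to zero gives $s = \gamma (L + \gamma I)^{-1} s^*$. The sketch below (our own helper, assuming the per-modality Laplacians and weights are given) solves it directly rather than iteratively:

```python
import numpy as np

def rerank_scores(Ls, lams, s_init, gamma=1.0):
    """Closed-form minimizer of s^T L s + gamma * ||s - s*||^2 (Eq. (19)),
    with L the weighted sum of the per-modality Laplacians."""
    L = sum(l * Lj for l, Lj in zip(lams, Ls))
    n = L.shape[0]
    # (L + gamma I) s = gamma s*  =>  s = gamma (L + gamma I)^{-1} s*
    return gamma * np.linalg.solve(L + gamma * np.eye(n), s_init)
```

For large image sets, an iterative solver (e.g. conjugate gradient) would replace the dense solve, but the stationarity condition is the same.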

Based on the graph-based framework, we compare two methods which create the initial scores differently. In the first method, the initial score $s^*$ is obtained using a traditional algorithm [38], which associates $s_i^*$ with the position $\tau_i$ using a heuristic strategy: $s_i^* = 1 - \tau_i / n$, where $n$ is the number of images in the query. We name this the "graph-based method". In the second method, the predicted click information is considered along with the position, so we name it the "click-based method". For a specified image $i$, if it is predicted to be clicked by the user, the initial score is calculated as $s_i^* = (1 - \tau_i/n + 1)/2$; on the other hand, if the image is not predicted to be clicked, the score is $s_i^* = (1 - \tau_i/n)/2$. The ranking results of the commercial search engine are provided as the baseline, which we name the "commercial baseline". In order to evaluate the re-ranking performance for each query, we apply the normalized discounted cumulative gain (NDCG) [42], a standard measurement in information retrieval. For a given query, the NDCG at position $p$ can be calculated as:
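The two initial-score rules can be written compactly. This is an illustrative helper of our own, taking 1-based positions $\tau_i$ and a binary click prediction per image; note that any predicted-clicked image lands in [0.5, 1] while any unclicked image lands in [0, 0.5):

```python
import numpy as np

def initial_scores(positions, n, clicked):
    """Initial ranking scores: s*_i = 1 - tau_i/n for the graph-based
    method; the click-based method shifts predicted-clicked images up."""
    base = 1.0 - np.asarray(positions, dtype=float) / n
    clicked = np.asarray(clicked, dtype=bool)
    return np.where(clicked, (base + 1.0) / 2.0, base / 2.0)

scores = initial_scores([1, 2, 3, 4], n=4, clicked=[0, 1, 0, 1])
# -> [0.375, 0.75, 0.125, 0.5]: the clicked image at position 2
#    now outranks the unclicked image at position 1.
```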

$$NDCG@p = Z_p \sum_{i=1}^{p} \frac{2^{l(i)} - 1}{\log(1 + i)}, \quad (20)$$

where $p$ is the considered depth, $l(i)$ is the relevance level of the $i$-th image in the refined ranking list, and $Z_p$ is a normalization constant chosen so that NDCG@$p$ equals 1 for a perfect ranking. For the Web Queries dataset, $Z_p$ was calculated based on the labels provided. To compute the overall performance, NDCGs were averaged over all queries for each dataset.
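A direct implementation of (20), normalizing by the ideal ordering of the provided labels so that a perfect ranking scores 1 (the function name and argument layout are our own):

```python
import numpy as np

def ndcg_at_p(all_labels, ranking, p):
    """NDCG@p per Eq. (20). `all_labels` are relevance levels (0/1/2);
    `ranking` lists image indices in ranked order; Z_p is realized by
    dividing by the DCG of the ideal ordering."""
    gains = 2.0 ** np.asarray(all_labels, dtype=float) - 1.0
    discounts = 1.0 / np.log(1.0 + np.arange(1, p + 1))
    dcg = np.sum(gains[np.asarray(ranking)[:p]] * discounts)
    idcg = np.sum(np.sort(gains)[::-1][:p] * discounts)
    return dcg / idcg if idcg > 0 else 0.0

ndcg_at_p([2, 1, 0, 2], [0, 3, 1, 2], 4)   # -> 1.0 (perfect ranking)
```

Because the normalization cancels, the choice of logarithm base in the discount does not affect the score.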

Fig. 8 shows the average NDCG scores obtained by the different methods at depths of [3, 5, 10, 20, 50]. From these results we can see that both the graph-based method and the click-based method effectively improve on the baseline, which indicates that graph-based learning performs well for web image re-ranking. The click-based method, which integrates all visual features and the click data, achieved the best results; specifically, it outperformed the graph-based method at all depths. This shows that by including the click data, semantic gaps can be bridged and the performance of image re-ranking improves. Fig. 9 provides the top ten returned images obtained by the three methods for two example queries. The blocks with green, blue, and red indicate images of relevance scales 2, 1, and 0, respectively. In the first and second rows of Fig. 9(a) and (b), there are many irrelevant and weakly relevant images, but the images in the third row are all relevant to the query. This clearly shows that click data is effective in refining the ranking results.

VI. CONCLUSION

In this paper, we proposed a new multimodal hypergraph learning-based sparse coding method for the click prediction of images. The obtained sparse codes can be used for image re-ranking by integrating them with a graph-based scheme. We adopt a hypergraph to build a group of manifolds, which explore the complementary characteristics of different features through a group of weights. Unlike a graph, which has an edge between two vertices, a hypergraph connects a set of vertices by a hyperedge, which helps preserve the local smoothness of the constructed sparse codes. An alternating optimization procedure is then performed, and the weights of the different modalities and the sparse codes are obtained simultaneously. Finally, a voting strategy is used to predict the click from the corresponding sparse code. Experimental results on real-world datasets demonstrate that the proposed method is effective for click prediction. Additional experimental results on image re-ranking suggest that the method can improve the results returned by commercial search engines.

REFERENCES

[1] Y. Gao, M. Wang, Z. J. Zha, Q. Tian, Q. Dai, and N. Zhang, “Lessis more: Efficient 3D object retrieval with query view selection,” IEEETrans. Multimedia, vol. 13, no. 5, pp. 1007–1018, Oct. 2011.

[2] S. Clinchant, J. M. Renders, and G. Csurka, “Trans-media pseudorelevance feedback methods in multimedia retrieval,” in Proc. CLEF,2007, pp. 1–12.

[3] L. Duan, W. Li, I. W. Tsang, and D. Xu, “Improving web image searchby bag-based reranking,” IEEE Trans. Image Process., vol. 20, no. 11,pp. 3280–3290, Nov. 2011.

[4] B. Geng, L. Yang, C. Xu, and X. Hua, “Content-aware Ranking forvisual search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,Jun. 2010, pp. 3400–3407.

[5] B. Carterette and R. Jones, “Evaluating search engines by modeling therelationship between relevance and clicks,” in Proc. Adv. Neural Inf.Process. Syst., 2007, pp. 1–9.

[6] G. Dupret and C. Liao, “A model to estimate intrinsic documentrelevance from the clickthrough logs of a web search engine,” in Proc.ACM Int. Conf. Web Search Data Mining, 2010, pp. 181–190.

[7] G. Smith and H. Ashman, “Evaluating implicit judgments from imagesearch interactions,” in Proc. WebSci. Soc., 2009, pp. 1–3.

[8] M. Shokouhi, F. Scholer, and A. Turpin, “Investigating the effectivenessof clickthrough data for document reordering,” in Proc. Eur. Conf. Adv.Inf. Retr., 2008, pp. 591–595.

[9] V. Jain and M. Varma, “Random walks on the click graph,” in Proc.ACM SIGIR Conf. Res. Develop. Inf. Retr., 2007, pp. 239–246.

[10] V. Jain and M. Varma, “Learning to re-rank: Query-dependent imagere-ranking using click data,” in Proc. Int. Conf. World Wide Web, 2011,pp. 277–286.

[11] E. Candès, J. Romberg, and T. Tao, “Stable signal recovery fromincomplete and inaccurate measurements,” Commun. Pure Appl. Math.,vol. 59, no. 8, pp. 1207–1223, 2006.

[12] A. Eriksson and A. van den Hengel, “Efficient computation of robustlow-rank matrix approximations in the presence of missing data usingthe l1 norm,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,Jun. 2010, pp. 771–778.

[13] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Non-localsparse models for image restoration,” in Proc. IEEE Int. Conf. Comput.Vis., Oct. 2009, pp. 2272–2279.

[14] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramidmatching using sparse coding for image classification,” in Proc. IEEEConf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1794–1801.

[15] S. Gao, I. Tsang, and L. T. Chia, “Kernel sparse representation for imageclassification and face recognition,” in Proc. Eur. Conf. Comput. Vis.,2010, pp. 1–14.

[16] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust facerecognition via sparse representation,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[17] C. Wang, S. Yan, L. Zhang, and H. Zhang, “Multi-label sparse coding forautomatic image annotation,” in Proc. IEEE Conf. Comput. Vis. PatternRecognit., Jun. 2009, pp. 1643–1650.


[18] D. Donoho, “For most large underdetermined systems of linear equa-tions, the minimal l1-norm solution is also the sparsest solution,” Dept.Statist., Stanford Univ., Stanford, CA, USA, Tech. Rep., 2004.

[19] S. Gao, I. Tsang, and L. Chia, “Laplacian sparse coding, hypergraphLaplacian sparse coding, and applications,” IEEE Trans. Pattern Anal.Mach. Intell., vol. 35, no. 1, pp. 92–104, Jan. 2013.

[20] C. Snoek, M. Worring, and A. Smeulders, “Early versus late fusion insemantic video analysis,” in Proc. ACM Int. Conf. Multimedia, 2005,pp. 399–402.

[21] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf, “Learningwith local and global consistency,” in Proc. Int. Conf. Neural Inf.Process. Syst., 2004, pp. 321–328.

[22] X. He, W. Y. Ma, and H. J. Zhang, “Learning an image manifold forretrieval,” in Proc. ACM Int. Conf. Multimedia, 2004, pp. 17–23.

[23] M. Belkin and P. Niyogiy, “Laplacian eigenmaps for dimensionalityreduction and data representation,” Neural Comput., vol. 15, no. 6,pp. 1373–1396, 2003.

[24] R. Zass and A. Shashua, “Probabilistic graph and hypergraph matching,”in Proc. Int. Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.

[25] L. Sun, S. Ji, and J. Ye, “Hypergraph spectral learning for multi-labelclassification,” in Proc. Int. Conf. Know. Discovery Data Mining, 2008,pp. 668–676.

[26] Y. Huang, Q. Liu, and D. Metaxas, “Video object segmentation byhypergraph cut,” in Proc. Int. Conf. Comput. Vis. Pattern Recognit.,2009, pp. 1738–1745.

[27] Z. Tian, T. Hwang, and R. Kuang, “A hypergraph-based learningalgorithm for classifying gene expression and array CGH data with priorknowledge,” Bioinformatics, vol. 25, no. 21, pp. 2831–2838, 2009.

Jun Yu (M’13) received the B.Eng. and Ph.D. degrees from Zhejiang University, Zhejiang, China. He is currently a Professor with the School of Computer Science and Technology, Hangzhou Dianzi University. He was an Associate Professor with the School of Information Science and Technology, Xiamen University. From 2009 to 2011, he was with Nanyang Technological University, Singapore. From 2012 to 2013, he was a Visiting Researcher with Microsoft Research Asia. His research interests include multimedia analysis, machine learning, and image processing. He has authored and co-authored more than 50 scientific articles. He has chaired or co-chaired several special sessions, invited sessions, and workshops, and has served as a Program Committee Member or reviewer for top conferences and prestigious journals. He is a Professional Member of the IEEE, ACM, and CCF.

Yong Rui (F’10) is currently a Senior Director at Microsoft Research Asia, leading research efforts in multimedia search, knowledge mining, and social and urban computing. A Fellow of the IEEE, IAPR, and SPIE, and a Distinguished Scientist of the ACM, he is recognized as a leading expert in his research areas. He holds more than 50 U.S. and international patents. He has authored 16 books and book chapters, and more than 100 refereed journal and conference papers. His publications are among the most cited: his top five papers have been cited more than 6000 times, and his h-index is 43.

He is the Associate Editor-in-Chief of the IEEE MULTIMEDIA MAGAZINE, an Associate Editor of the ACM Transactions on Multimedia Computing, Communications and Applications, a founding Editor of the International Journal of Multimedia Information Retrieval, and a founding Associate Editor of IEEE ACCESS. He was an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA from 2004 to 2008, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY from 2006 to 2010, the ACM/Springer Multimedia Systems Journal from 2004 to 2006, and the International Journal of Multimedia Tools and Applications from 2004 to 2006. He also serves on the Advisory Board of the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING. He is or has been on the organizing and program committees of numerous conferences, including ACM Multimedia, IEEE Computer Vision and Pattern Recognition, the European Conference on Computer Vision, the Asian Conference on Computer Vision, the IEEE International Conference on Image Processing, the IEEE International Conference on Acoustics, Speech, and Signal Processing, the IEEE International Conference on Multimedia and Expo, SPIE ITCom, the International Conference on Pattern Recognition, and the Conference on Image and Video Retrieval (CIVR). He was a General Co-Chair of CIVR 2006, ACM Multimedia 2009, and ICIMCS 2010, and a Program Co-Chair of ACM Multimedia 2006, Pacific Rim Multimedia (PCM) 2006, and IEEE ICME 2009. He is or was on the Steering Committees of ACM Multimedia, ACM ICMR, IEEE ICME, and PCM, and is the founding Chair of the ACM SIG Multimedia China Chapter.

Dr. Rui received the B.S. degree from Southeast University, the M.S. degree from Tsinghua University, and the Ph.D. degree from the University of Illinois at Urbana-Champaign. He also holds a Microsoft Leadership Training Certificate from the Wharton School, University of Pennsylvania.


Dacheng Tao (M’07–SM’12) is a Professor of Computer Science with the Centre for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney. He mainly applies statistics and mathematics to data analysis problems in computer vision, machine learning, multimedia, data mining, and video surveillance. He has authored and co-authored more than 100 scientific articles at top venues, including the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the International Conference on Artificial Intelligence and Statistics, IEEE Computer Vision and Pattern Recognition, and the IEEE International Conference on Data Mining (ICDM), and received the Best Theory/Algorithm Paper Runner-Up Award at IEEE ICDM’07.

