
462 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 5, AUGUST 2010

Bridging the Semantic Gap Between Image Contents and Tags

Hao Ma, Jianke Zhu, Member, IEEE, Michael Rung-Tsong Lyu, Fellow, IEEE, and Irwin King, Senior Member, IEEE

Abstract—With the exponential growth of Web 2.0 applications, tags have been used extensively to describe image contents on the Web. Because human-generated tags are noisy and sparse, how to understand and utilize them for image retrieval tasks has become an emerging research direction. Since low-level visual features provide fruitful information, they are employed to improve the image retrieval results. However, it is challenging to bridge the semantic gap between image contents and tags. To address this problem, we propose a unified framework that stems from a two-level data fusion between image contents and tags: 1) a unified graph is built to fuse the visual feature-based image similarity graph with the image-tag bipartite graph; 2) a novel random walk model is then proposed, which utilizes a fusion parameter to balance the influences of image contents and tags. Furthermore, the presented framework not only naturally incorporates pseudo relevance feedback, but can also be directly applied to applications such as content-based image retrieval, text-based image retrieval, and image annotation. Experimental analysis on a large Flickr dataset shows the effectiveness and efficiency of our proposed framework.

Index Terms—Content-based image retrieval, image annotation, random walk, text-based image retrieval.

I. INTRODUCTION

IMAGE retrieval has been adopted by most of the major search engines, including Google, Yahoo!, Bing, etc. A large number of image search engines mainly employ the text surrounding an image and the image file name to index images. However, this limits the capability of search engines in retrieving semantically related images for a given query. On the other hand, although the current state of the art in content-based image retrieval is progressing, it has not yet succeeded in bridging the semantic gap between human concepts, e.g., keyword-based queries, and the low-level visual features extracted from images [22], [36]. Hence, there is an urgent need to develop novel and effective paradigms that go beyond these conventional approaches and retrieval models.

Manuscript received October 31, 2009; revised February 28, 2010; accepted April 25, 2010. Date of publication May 27, 2010; date of current version July 16, 2010. This work was supported by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK 4128/08E and CUHK 4154/09E). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Abdulmotaleb El Saddik.

The authors are with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Kowloon, Hong Kong (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2010.2051360

Fig. 1. Example image with its tags.

Recently, with the prevalence of Web 2.0 applications and social games, more and more users contribute numerous tags to Web images, on sites such as Flickr [17] and the ESP game [14]. These tags provide meaningful descriptors of images, which are especially important for images containing little or no textual context. The success of Flickr proves that users are willing to provide this semantic context through manual annotations. Recent user studies on this topic reveal that users do annotate their photos with the motivation to make them better accessible to the general public [1]. Fig. 1 shows an example image extracted from Flickr with its user-generated tags. This is a photo taken in Hong Kong, and it is described by users with the tags Hong Kong, night, IFC (a building name, standing for International Finance Centre), Bank of China Tower, and skyline, all of which are semantically relevant to this image. However, it is very difficult for current content-based image retrieval methods to produce such meaningful results. Hence, tag data is an ideal source for improving many tasks in image retrieval.

Unfortunately, tags inevitably contain noise injected during the manual labeling process. As shown in Fig. 1, the last tag associated with this photo is travel. To the owner who submitted this photo, this tag is obviously not noisy, since the photo probably was taken while the owner was traveling to the city. But for image retrieval tasks, the tag travel is most probably noise, since it is too general a term. Other popular tags on Flickr, like nature and 2008, also belong to this category. Therefore, simply using tags in image retrieval tasks is not a reliable and reasonable solution; the visual information of images should also be taken into consideration to improve image search engines, since visual information gives the most direct correlations between images, and 80% of human cognition comes from visual information [40].

In this paper, we investigate the research problem of how to incorporate both image content and tag information into image retrieval and annotation tasks. To take advantage of both the visual information and the user-contributed tags for image retrieval, we need to tackle two main challenges. The first challenge is how to bridge the semantic gap between image contents and image tags. Essentially, visual features and tags are two different but closely related aspects of images. Although content-based image retrieval using visual features has been extensively studied since the 1990s, the semantic gap between low-level image features and high-level semantic concepts is still the key hindrance to the effectiveness of content-based image retrieval systems [39]. The second challenge is how to create a scalable and effective algorithm. There is a huge number of images on the Web, and more and more new photos are uploaded to photo sharing Web sites like Flickr every day by numerous independent Web users. Hence, a scalable and effective algorithm is necessary to analyze both the visual information and the tags of images.

Aiming at the above challenges, we propose a unified framework for performing tasks related to image retrieval, including content-based image retrieval, text-based image retrieval, and image annotation. This framework relies on a two-level data fusion between image contents and tags. Based on the global features extracted from every image, we first infer an image similarity graph, and then form a hybrid graph with the image-tag bipartite graph. In this hybrid graph, one part of the weighted edges connects different images, with the weights representing the similarities between them, while the other part of the weighted edges bonds images and tags, with the weights reflecting their co-occurrence frequencies. After building the hybrid graph, we then propose a novel and effective random walk model that employs a fusion parameter to balance the importance of the image contents and the tags. The fusion parameter determines whether to accelerate the diffusion of random walks on the image-tag subgraph or on the image-image subgraph. Moreover, our framework also provides a natural solution for including pseudo relevance feedback in image retrieval and annotation tasks. The experimental results of three applications on a large Flickr dataset show the advantage of our proposed framework.

The rest of the paper is organized as follows. We review related work in Section II. Section III describes the proposed unified framework, including the global feature extraction, the hybrid graph construction, and the random walk model. In Section IV, we demonstrate the empirical analysis of our framework on three image retrieval applications. Finally, conclusions and future work are given in Section V.

II. RELATED WORK

Considerable research effort [5], [11], [12], [19], [26], [36] has been devoted to attacking the semantic gap between low-level features and high-level semantic concepts, which is the key hindrance in content-based image retrieval [10], [35].

Machine learning techniques have been shown to be one way to bridge the semantic gap between image features and semantic concepts. Recent research has shown significant interest in employing graphical models and distance metric learning algorithms. The work in [5] is inspired by natural language processing methods, in which the process of building the relation between the visual features and the keywords is analogous to a language translation. Similarly, Djeraba [11] tries to learn the associations between the visual features and the semantic descriptions via a visual dictionary. As for distance metric learning, it is mainly employed to construct the semantic map [39], which learns a distance measure to approximate the similarity in the textual space. The learnt similarity measure can then be further employed in the image annotation task. Moreover, the semantic concept relationship can be captured by the visual correlation between concepts [40], [34], which is essential to concept clustering, semantic distance estimation, and image annotation. Additionally, learning with relevance feedback, which takes advantage of the users' interaction to improve the retrieval performance, has been extensively studied [35], [38]. One disadvantage of the learning-based methods is their limited generalization capability. A remedy is to raise the total number of representative training examples; however, this requires more manually labeled data and increases the computational cost significantly.

Another approach to the semantic gap issue is to take advantage of advances in the computer vision domain, which is closely related to object recognition and image analysis. Duygulu et al. [12] present a machine translation model which maps the keyword annotation onto a discrete vocabulary of clustered image segmentations. Moreover, Blei and Jordan [6] extend this approach by employing a mixture of latent factors to generate keywords and blob features. Jeon et al. [21] reformulate the problem as cross-lingual information retrieval, and propose a cross-media relevance model for the image annotation task. In contrast to the image-based and region-based methods, the image content is represented by salient objects in [16], which can achieve automatic image annotation at the content level. A hierarchical classification framework is proposed in [15], which employs salient objects to characterize the intermediate image semantics. Those salient objects are defined as the connected image regions that capture the dominant visual properties linked to the corresponding physical objects in an image. However, these methods rely on the results of image segmentation and salience detection, which are sensitive to illumination conditions and cluttered backgrounds. Most recently, bag-of-words representations [8], [42] of local feature descriptors have demonstrated promising performance in calculating image similarity. To deal with the high dimensionality of the feature space, efficient hashing index methods have been investigated in [8] and [24]. These approaches did not take into consideration the tag information, which is very important for the image retrieval task.

Apart from its connection with research in content-based image retrieval, our work is also related to the broad research topic of graph-based methods. Graph-based methods are intensively studied with the aim of reducing the gap between visual features and semantic concepts. In [23], images are represented by attributed relational graphs, in which each node represents an image region and each edge represents a relation between two regions. In [18], an image is represented as a sequence of feature vectors characterizing low-level visual features, and is modeled as if it were stochastically generated by a hidden Markov model whose states represent concepts. Most recently, Jing and Baluja [22] present an intuitive graph model-based method for product image search. They directly view images as documents and their similarities as probabilistic visual links. Moreover, the likelihood of images is estimated by a random walk algorithm on the image similarity graph. The image similarity is based on local feature matching using the SIFT descriptor [27]; unfortunately, this incurs heavy computational cost. Recently, several random walk-based methods have been proposed for image and video retrieval tasks [4], [9], [20]. However, the image-tag [9] and video-view graph [4] based approaches did not take into consideration the contents of the images or videos, losing the opportunity to retrieve more accurate results. In [20], a re-ranking scheme is developed using a random walk over the video story graph. Multiple-instance learning can also take advantage of the graph-based representation [37] in the image annotation task.

Our work is also related to recommender systems, since it aims to recommend relevant images and tags. However, different from traditional recommendation or collaborative filtering algorithms [28]–[30], our work does not have user-item rating information. Hence, in some sense, the problem we study in this work is more difficult than some of the traditional recommendation problems.

Instead of relying on the complicated models and representative training examples of the machine learning-based methods, we propose an effective and efficient framework based on the Markov random walk [33], which can take advantage of both image visual contents and image tags. Our method does not need to train any functions or models, and can easily be scaled to very large datasets.

III. UNIFIED FRAMEWORK

In this section, we detail our framework, including how to extract global features, how to build the hybrid graph based on visual features and tags, and how to perform a random walk on it.

A. Global Feature Extraction

Global feature representation techniques have been extensively studied in image processing and content-based image retrieval. In contrast to the local feature-based approaches [22], global features are very efficient in computation and storage due to their compact representation. A wide variety of global feature extraction techniques have been proposed in the past decade. In this paper, we extract four kinds of effective global features.

• Grid color moment. We adopt the grid color moment to extract color features from images. Specifically, an image is partitioned into a 3 × 3 grid. For each grid cell, we extract three kinds of color moments: color mean, color variance, and color skewness in each color channel (R, G, and B), respectively. Thus, an 81-dimensional grid color moment vector is adopted for color features.

• Local binary pattern (LBP). The local binary pattern [31] is defined as a gray-scale invariant texture measure, derived from a general definition of texture in a local neighborhood. In our experiment, a 59-dimensional LBP histogram vector is adopted.

• Gabor wavelets texture. To extract Gabor texture features, each image is first scaled to 64 × 64 pixels. The Gabor wavelet transform [25], [43] is then applied to the scaled image with five levels and eight orientations, which results in 40 subimages. For each subimage, three moments are calculated: mean, variance, and skewness. Thus, a 120-dimensional vector is used for Gabor texture features.

• Edge. An edge orientation histogram is extracted for each image. We first convert each image into a grayscale image, and then employ a Canny edge detector [7] to obtain the edge map for computing the edge orientation histogram. The edge orientation histogram is quantized into 36 bins of 10 degrees each. An additional bin is used to count the number of pixels without edge information. Hence, a 37-dimensional vector is used for shape features.

In total, a 297-dimensional vector is used to represent all the global features for each image in the dataset, and each dimension is further normalized to zero mean and unit variance. Note that this feature representation has shown promising performance on the duplicate image retrieval task [44], [45], which requires an accurate similarity measure for image pairs.
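To make the assembly concrete, here is a minimal Python sketch of the 297-dimensional layout and the dataset-level normalization. The four extractor functions are hypothetical stubs (real grid color moment, LBP, Gabor, and edge-histogram code would replace them); only the dimensionalities and the z-score step reflect the description above.

```python
import numpy as np

# Hypothetical stand-ins for the four extractors described above; each returns
# a vector of the stated dimensionality so that the assembly step is runnable.
def grid_color_moments(image):
    return np.zeros(81)    # 3x3 grid x (mean, variance, skewness) x (R, G, B)

def lbp_histogram(image):
    return np.zeros(59)    # 59-bin LBP histogram

def gabor_moments(image):
    return np.zeros(120)   # 5 scales x 8 orientations x 3 moments

def edge_orientation_histogram(image):
    return np.zeros(37)    # 36 bins of 10 degrees each + 1 no-edge bin

def extract_global_features(image):
    """Concatenate the four groups into one 297-dimensional vector."""
    return np.concatenate([
        grid_color_moments(image),
        lbp_histogram(image),
        gabor_moments(image),
        edge_orientation_histogram(image),
    ])

def normalize_dataset(X):
    """Z-score each dimension over the whole dataset (rows = images)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0.0] = 1.0  # guard against constant dimensions
    return (X - mu) / sigma
```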

B. Hybrid Graph Construction

Once the visual features are extracted, we can build the image similarity graph. Let $d$ denote the dimensionality of each image feature, and let $\mathcal{I}$ denote the total image set. For each image $i \in \mathcal{I}$, let the $d$-dimensional vector $\mathbf{x}_i$ represent the image feature vector corresponding to image $i$. We employ the cosine function to measure the similarity between two images $i$ and $j$:

$$s(i, j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|}. \qquad (1)$$

We then build the image similarity graph based on the calculated similarities. Usually, there are several methods to construct similarity graphs, including the $k$NN graph, the $\epsilon$NN graph, the exp-weighted graph, etc. As reported in [46], the $k$NN graph tends to perform well empirically. Hence, for an image $i$, we employ the $k$ most similar images as its neighbors. More specifically, if an image $j$ is in the $k$-nearest-neighborhood of image $i$, then we create a directed edge from node $i$ to node $j$, and the weight is the similarity $s(i, j)$. This $k$NN graph is an asymmetric graph, since if node $j$ is in the $k$-nearest-neighborhood of node $i$, it does not mean node $i$ is also in the $k$-nearest-neighborhood of node $j$. Fig. 2(a) illustrates an example $k$NN graph. Note that it is time-consuming to find the $k$ most similar images by brute-force search in a very large dataset. Fortunately, we can take advantage of the nearest neighbor searching method proposed in [2] to efficiently build the image similarity graph.
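A brute-force version of this construction can be sketched in a few lines of Python; the quadratic similarity computation is only for illustration, and at Flickr scale it would be replaced by the approximate nearest-neighbor index of [2].

```python
import numpy as np
from scipy.sparse import csr_matrix

def knn_similarity_graph(X, k=40):
    """Directed kNN graph over row-wise feature vectors X, with cosine
    similarity (1) as edge weights."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                      # n x n cosine similarities
    np.fill_diagonal(S, -np.inf)       # exclude self-edges
    n = X.shape[0]
    rows, cols, vals = [], [], []
    for i in range(n):
        nbrs = np.argpartition(-S[i], k)[:k]   # k most similar images to i
        for j in nbrs:
            rows.append(i)
            cols.append(j)
            vals.append(S[i, j])
    # j in kNN(i) does not imply i in kNN(j), so the graph is asymmetric.
    return csr_matrix((vals, (rows, cols)), shape=(n, n))
```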

Besides the image similarity graph, we also build the image-tag bipartite graph. However, we cannot simply incorporate the image-tag bipartite graph into our framework. This is because the bipartite graph is an undirected graph, which cannot accurately interpret the relationships between images and tags. To tackle this problem, we need to convert it into a directed graph. As shown in Fig. 2(b), the left part of the bipartite graph represents the image nodes, while the right part denotes the tag nodes. In the converted graph, every undirected edge in the original bipartite graph is converted into two directed edges. The weight on a new directed image-tag edge is normalized by the total number of times that the image is tagged, while the weight on a directed tag-image edge is normalized by the total number of times that this tag has been assigned.

Fig. 2. Hybrid graph construction.

After building the image similarity graph and the image-tag directed graph, we consolidate these two graphs and create a directed hybrid graph, as shown in Fig. 2(c). This directed hybrid graph forms the foundation of the random walk model that will be introduced in the next section.
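Under the assumption that image nodes are indexed before tag nodes, the construction of Fig. 2 can be sketched as one sparse block matrix; C below is a hypothetical image-by-tag count matrix, and the two normalizations follow the rules just described.

```python
import numpy as np
from scipy.sparse import csr_matrix, bmat

def hybrid_graph(S_images, C):
    """Assemble the directed hybrid graph of Fig. 2(c).

    S_images : (n_img x n_img) sparse directed kNN similarity graph.
    C        : (n_img x n_tag) sparse counts; C[i, t] is the number of
               times tag t was assigned to image i.
    """
    # Image -> tag edges, normalized by how often each image was tagged.
    img_totals = np.asarray(C.sum(axis=1)).ravel()
    img_totals[img_totals == 0] = 1.0
    img_to_tag = csr_matrix(C.multiply(1.0 / img_totals[:, None]))

    # Tag -> image edges, normalized by how often each tag was assigned
    # overall; this is what shrinks edges leaving overly popular tags.
    tag_totals = np.asarray(C.sum(axis=0)).ravel()
    tag_totals[tag_totals == 0] = 1.0
    tag_to_img = csr_matrix(C.T.multiply(1.0 / tag_totals[:, None]))

    # Block layout: [[image-image, image-tag], [tag-image, empty]].
    return bmat([[S_images, img_to_tag], [tag_to_img, None]], format='csr')
```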

C. Random Walk Model

The Markov random walk model has been extensively studied in many Web applications. In this section, we introduce a novel random walk model on our hybrid graph that can smoothly incorporate both the visual and the textual information into several image retrieval tasks.

Let $G = (V, E)$ denote the directed hybrid graph, where $V = V_I \cup V_T$ is the vertex set; $V_I$ represents the set of image nodes, while $V_T$ denotes the set of tag nodes. $E = E_{II} \cup E_{IT}$ is the edge set, which consists of two types of edges. If an edge $(u, v)$ is in the edge set $E_{II}$, then $u \in V_I$ and $v \in V_I$. If an edge $(u, v)$ is in the edge set $E_{IT}$, then $u \in V_I$ and $v \in V_T$, or $u \in V_T$ and $v \in V_I$.

For all the edges $(u, v)$ in the edge set $E_{IT}$, we define the transition probability from node $u$ to node $v$ as $p(v \mid u) = c(u, v) / \sum_{v'} c(u, v')$, where $c(u, v)$ is the tag-assignment count. If $u \in V_I$ and $v \in V_T$, then $c(u, v)$ is the number of times that the tag node $v$ has been assigned to the image node $u$, while $\sum_{v'} c(u, v')$ is the total number of times that the image node $u$ has been tagged. If $u \in V_T$ and $v \in V_I$, then $c(u, v)$ is the number of times that the tag node $u$ has been assigned to the image node $v$, while $\sum_{v'} c(u, v')$ is the total number of times that the tag node $u$ has been assigned to all the images. In effect, this normalization denoises popular tags with little meaning, like "nature" and "travel", since the weights on the edges starting from these nodes will be very small. The notation $p(v \mid u)$ denotes the transition probability from node $u$ at one time step to node $v$ at the next time step. While the counts are symmetric, $c(u, v) = c(v, u)$, the transition probabilities generally are not, because the normalization varies across different nodes.

For the other edges, in the edge set $E_{II}$, we define the transition probability from node $u$ to node $v$ as $p(v \mid u) = s(u, v) / \sum_{v'} s(u, v')$, where $s(u, v)$ is the image visual similarity between the image nodes $u$ and $v$ defined in (1). This is slightly different from the example we show in Fig. 2, since we normalize the similarities here.

In general, the transition probability is

$$p(v \mid u) = \begin{cases} \dfrac{c(u, v)}{\sum_{v'} c(u, v')}, & (u, v) \in E_{IT}, \\[1ex] \dfrac{s(u, v)}{\sum_{v'} s(u, v')}, & (u, v) \in E_{II}. \end{cases} \qquad (2)$$
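The per-node normalization in (2) is just a division of each row by its out-weight sum. A small helper, applied here to the image-image similarity block (the count blocks were already normalized during the bipartite conversion), might look like this.

```python
import numpy as np
from scipy.sparse import diags

def row_normalize(M):
    """Divide every row of a nonnegative sparse matrix by its row sum,
    turning edge weights into the transition probabilities of (2)."""
    out = np.asarray(M.sum(axis=1)).ravel()
    out[out == 0] = 1.0          # leave nodes without out-edges as zero rows
    return diags(1.0 / out) @ M
```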

The random walk can only diffuse through the links that connect nodes in the given graph; in fact, there are random relations among different nodes even if these nodes are not connected. For example, in the similarity relations on the image similarity graph, we explicitly calculate the similarities between images based on (1); yet there are some implicit hidden similarity relations among these images that cannot be observed or captured. Hence, to capture these relations, without any prior knowledge, we propose to add a uniform random relation among different nodes. More specifically, let $\alpha$ denote the probability of following the explicit links, so that $1 - \alpha$ is the probability of taking a "random jump". Without any prior knowledge, we set $\mathbf{g} = \frac{1}{n}\mathbf{1}$, where $\mathbf{g}$ is a uniform stochastic distribution vector, $\mathbf{1}$ is the vector of all ones, and $n$ is the number of nodes. Based on the above considerations, we modify our model to use the following transition probability matrix:

$$A = \alpha P + (1 - \alpha)\,\mathbf{1}\mathbf{g}^{\top}, \qquad (3)$$

where matrix $P$ is the transition probability matrix with the entry in the $u$th row and the $v$th column defined in (2). Following the setting of the damping factor in PageRank [13], [32], we set $\alpha = 0.85$ in all of our experiments conducted in Section IV.
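Since the rank-one jump term of (3) would densify the matrix, it is usually applied implicitly during each step of the walk; a sketch of one such step, with alpha = 0.85 as in the text, follows.

```python
import numpy as np

def step_with_jump(r, P, alpha=0.85):
    """One random-walk step under (3): follow the sparse graph P with
    probability alpha, or jump to a uniformly random node with
    probability 1 - alpha. r is a row distribution over nodes."""
    n = P.shape[0]
    follow = P.T.dot(r)                       # r P, written for sparse P
    return alpha * follow + (1.0 - alpha) * r.sum() / n
```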

Our graph is a hybrid graph which consists of two quite different subgraphs: the image similarity graph and the image-tag bipartite graph. Intuitively, the contributions of these two graphs are not likely to be the same. In some applications, the image-tag subgraph is more important than the image similarity subgraph, while in other applications, the image similarity subgraph should contribute more. Hence, in order to endow our random walk model with more flexibility, we apply a fusion parameter $\lambda$ to the transition matrix introduced in (3). We define the transition probability matrix $\tilde{A}$ with the entry $\tilde{a}_{uv}$ from node $u$ to node $v$ as

$$\tilde{a}_{uv} = \begin{cases} \lambda\, a_{uv}, & (u, v) \in E_{IT}, \\ (1 - \lambda)\, a_{uv}, & (u, v) \in E_{II}, \end{cases} \qquad (4)$$

where $a_{uv}$ is the corresponding entry of the matrix $A$ in (3).

The parameter $\lambda$ plays a very important role in our random walk model: it defines how fast the random walk diffuses on the two subgraphs. Following a physical intuition, when $\lambda = 1$, the random walk is performed only on the image-tag subgraph. In the other extreme case, when $\lambda = 0$, no random walk diffuses on the image-tag subgraph at all; it diffuses only on the image similarity subgraph. In the intermediate case, when $\lambda$ is relatively large, the diffusion on the image-tag subgraph is faster than the diffusion on the image similarity subgraph; as a result, the random walk depends more on the image-tag information. If $\lambda$ is relatively small, the results depend more on the image visual information. In Section IV, we give a detailed analysis of the impact of parameter $\lambda$.
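A sketch of applying (4) to the three blocks of the hybrid matrix follows; it reuses row_normalize from above to implement the renormalization of footnote 1, and one plausible ordering is that the jump term of (3) is then applied during the walk itself, as in step_with_jump.

```python
from scipy.sparse import bmat

def fuse(P_ii, P_it, P_ti, lam=0.7):
    """Weight the image-tag blocks by lambda and the image-image block by
    (1 - lambda), as in (4), then renormalize rows (footnote 1) so the
    result is again a transition probability matrix."""
    A = bmat([[(1.0 - lam) * P_ii, lam * P_it],
              [lam * P_ti, None]], format='csr')
    return row_normalize(A)  # helper defined above
```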

With the transition probability matrix $\tilde{A}$ defined in (4),¹ we can now perform the random walk on the hybrid graph: we calculate the probability of transitioning from node $u$ to node $v$ in $t$ steps as

$$p_t(v \mid u) = \big[\tilde{A}^t\big]_{uv}. \qquad (5)$$

¹Before the start of any random walks, we normalize each row of $\tilde{A}$ by its row sum to make sure that $\tilde{A}$ is a transition probability matrix.

The random walk sums the probabilities of all paths of length $t$ between the two nodes. It gives a measure of the volume of paths between these two nodes; if there are many paths, the transition probability will be higher [9]. The larger the transition probability $p_t(v \mid u)$ is, the more similar the node $v$ is to the node $u$.

Since the image dataset is very large, computing the matrix power explicitly is infeasible. Hence, we compute the random walk in an efficient way as follows. If we want to start a random walk at node $u$, we employ a row vector $\mathbf{e}_u$ with a unit entry at node $u$, and then calculate the transition probabilities as

$$\mathbf{r} = \mathbf{e}_u \tilde{A}^t, \qquad (6)$$

evaluated as $t$ successive vector-matrix products, where $t$ controls the number of walk steps. This is very efficient since the matrix operations are quite sparse.
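Concretely, (6) amounts to t sparse vector-matrix products starting from the indicator vector of the query node; the value t = 4 below is a placeholder, since the paper only states that a small t is used.

```python
import numpy as np

def random_walk(A, start, t=4):
    """Compute r = e_u A^t as t successive sparse products, including the
    random-jump term of (3) at every step."""
    r = np.zeros(A.shape[0])
    r[start] = 1.0                  # e_u: all probability mass on the query
    for _ in range(t):
        r = step_with_jump(r, A)    # helper defined above
    return r
```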

With the hybrid graph and the random walk model, similar to [9], we can then apply our framework to several application areas, including the following.

• Image-to-image retrieval. Given an image, find relevant images based on visual information and tags. The relevant images should be ranked highly regardless of whether they are adjacent to the original image in the hybrid graph.

• Image-to-tag suggestion. This is also called image annotation. Given an image, find related tags that have semantic relations to the contents of this image.

• Tag-to-image retrieval. Given a tag, find a ranked list of images related to this tag. This is essentially text-based image retrieval.

• Tag-to-tag suggestion. Given a tag, suggest other tags relevant to it. This is also known as the tag recommendation problem.

In Section IV, we will show the performance of our model on the first three applications. We do not show experiments for the tag recommendation application, since it is beyond the scope of this paper.

D. Pseudo Relevance Feedback

Relevance feedback is an effective scheme for bridging the gap between high-level semantics and low-level features in content-based image retrieval. However, it involves user interaction in the retrieval process, which is infeasible in some retrieval applications. Pseudo relevance feedback provides an effective method for automatic local analysis. It automates the manual part of relevance feedback, so that users obtain improved retrieval performance without an extended interaction.

Taking advantage of the proposed random walk algorithm, our framework can be naturally extended with pseudo relevance feedback. Consider the image-to-image retrieval example: given an image, we first conduct a round of random walk, assuming that the top ranked images are relevant; we then conduct another round of random walk, using the original image together with the top ranked images. In effect, the top ranked images are used to form an expansion of the original image. We then re-rank all the images based on the expanded image set. The detailed algorithm for pseudo relevance feedback is summarized in Algorithm 1.

Algorithm 1: Pseudo Relevance Feedback Algorithm

1) Given the query node $u$, form a vector $\mathbf{e}_u$ of length $n$ ($n$ is the total number of nodes), with the $u$th entry equal to 1 and all other entries equal to 0.

2) Perform a $t$-step random walk and get a new vector $\mathbf{r} = \mathbf{e}_u \tilde{A}^t$.

3) Get the top-$m$ nodes with the highest values in vector $\mathbf{r}$ (notice that, in the image-to-image retrieval task, the top-$m$ nodes are in the image node set $V_I$, while in the image-to-tag, i.e., image annotation, task, the top-$m$ nodes are in the tag node set $V_T$).

4) Form a new vector $\mathbf{e}'$ with the $u$th entry equal to 1, the entries corresponding to the top-$m$ nodes equal to 1, and all other entries equal to 0.

5) Conduct a new $t$-step random walk and get the result vector $\mathbf{r}' = \mathbf{e}' \tilde{A}^t$. Rank by the vector $\mathbf{r}'$ to obtain the retrieval results.

In the above algorithm, we conduct only one round of feedback. The algorithm can also be run for multiple feedback rounds.
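Algorithm 1 translates almost line for line into code; the sketch below assumes the random_walk and step_with_jump helpers above, and candidate_nodes is the set V_I or V_T depending on the task.

```python
import numpy as np

def pseudo_relevance_feedback(A, query, candidate_nodes, t=4, m=5):
    """One feedback round of Algorithm 1."""
    r = random_walk(A, query, t)                      # steps 1-2
    ranked = [v for v in np.argsort(-r) if v in candidate_nodes]
    top_m = ranked[:m]                                # step 3
    e = np.zeros(A.shape[0])                          # step 4
    e[query] = 1.0
    e[top_m] = 1.0
    for _ in range(t):                                # step 5
        e = step_with_jump(e, A)
    return e                                          # rank nodes by this vector
```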

IV. EXPERIMENTAL ANALYSIS

In this section, we evaluate our proposed framework using the Flickr² dataset on content-based image retrieval, text-based image retrieval, and image annotation problems.

A. Data Description

Flickr is an image hosting Web site and online community platform. It creates a popular platform for users to share personal photographs, tag photographs, and communicate with other users. As of November 2007, it claimed to host more than 2 billion images [3]. Hence, Flickr is an ideal source for the investigation of image-related research.

²http://www.flickr.com.

Fig. 3. Examples for CBIR. (a) Query Image 1. (b) Rank 1. (c) Rank 2. (d) Rank 3. (e) Rank 4. (f) Rank 5. (g) Query Image 2. (h) Rank 1. (i) Rank 2. (j) Rank 3. (k) Rank 4. (l) Rank 5. (m) Query Image 3. (n) Rank 1. (o) Rank 2. (p) Rank 3. (q) Rank 4. (r) Rank 5. (s) Query Image 4. (t) Rank 1. (u) Rank 2. (v) Rank 3. (w) Rank 4. (x) Rank 5.

In this paper, we randomly sample 597 332 images spanning January 1, 2007 to December 31, 2007. For each image, we record the image file and the associated tags. In total, we find 566 293 unique tags and 4 929 093 edges (tag assignments) between images and tags, which indicates that, on average, each image is associated with 8.25 tags (4 929 093 / 597 332 ≈ 8.25).

B. Parameter Discussions

In addition to the fusion parameter $\lambda$, we need to set two other parameters: the parameter $k$ for building the $k$NN image similarity graph and the parameter $t$ for the number of random walk steps.

For the parameter $k$, as suggested in [46], a small value of $k$ normally performs well in practice. In this paper, we set $k = 40$ empirically in all of the experiments, which indicates that in the image similarity subgraph, every image has its 40 most similar images as neighbors. Hence, the outdegree of every image node in the image similarity subgraph is 40.

The parameter $t$ determines the resolution of the Markov random walk. If we choose a large enough $t$, the random walk converges to the stationary distribution, where the final results depend mostly on the graph structure, with little information about the query node preserved. On the other hand, a short walk preserves information about the starting node at a fine scale. Since we wish to preserve the information about the query node, a relatively small $t$ is chosen in order to stay far away from the stationary distribution. In this paper, we fix $t$ to a small value in all of our experiments.

The parameter $\lambda$ smoothly fuses the image-tag information with the image visual information, and directly controls how much the image-tag information should be trusted relative to the image visual information. We will discuss the impact of this parameter in the three different applications considered in this paper.

C. Content-Based Image Retrieval

In the content-based image retrieval (CBIR) task, we start the random walk at an image node. After a $t$-step random walk, we retrieve the top-ranked images as the retrieval results.
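As a usage sketch, assuming the node layout from the hybrid graph above (image nodes first), CBIR reduces to one walk plus a restriction to image nodes.

```python
import numpy as np

def retrieve_images(A, query, n_images, top_n=5, t=4):
    """Walk from an image node and return the top-ranked image nodes,
    excluding the query itself (image nodes occupy indices 0..n_images-1)."""
    r = random_walk(A, query, t)
    scores = r[:n_images].copy()
    scores[query] = -np.inf
    return np.argsort(-scores)[:top_n]
```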

In Fig. 3, we perform four image retrievals with parameter $\lambda = 0.7$. (This is the best setting based on our empirical analysis, and we will discuss the impact of $\lambda$ later in this section.) The first retrieval is based on the image plotted in Fig. 3(a), which is a picture depicting a baseball player. Fig. 3(b)–(f) shows the top-5 images returned by our method. We can observe that these results are all semantically related to the original picture. Fig. 3(g), (m), and (s) are another three examples, with Fig. 3(h)–(l), (n)–(r), and (t)–(x) as the retrieval results, respectively. The results show the excellent performance of our approach. We also list in Fig. 4 some of the results generated by the RWIT method proposed in [9]. We can see that our method generates more reasonable results than the RWIT method. We also notice that in some cases, our algorithm cannot generate satisfactory results. Two such examples are illustrated in Fig. 5. For the first example, we can see that the query image (a flower) and the recommended images have very similar color and edge distributions. This is because the extracted visual features mainly account for the global color and edge distributions. Although methods using local features [22], [42], [24] can alleviate this problem, they require high computational power and large storage space to calculate the local feature descriptors, especially for Flickr photos of relatively large size. The second example shows a similar problem.

Fig. 4. Examples for the RWIT method. (a) Query Image 1. (b) Rank 1. (c) Rank 2. (d) Rank 3. (e) Rank 4. (f) Rank 5. (g) Query Image 4. (h) Rank 1. (i) Rank 2. (j) Rank 3. (k) Rank 4. (l) Rank 5.

Fig. 5. Two failed examples for our method in CBIR. (a) Query Image 1. (b) Rank 1. (c) Rank 2. (d) Rank 3. (e) Rank 4. (f) Rank 5. (g) Query Image 2. (h) Rank 1. (i) Rank 2. (j) Rank 3. (k) Rank 4. (l) Rank 5.

Fig. 6. P@N comparisons in CBIR.

Fig. 7. Impact of parameter $\lambda$ in CBIR.

Fig. 8. Examples of top images using text queries. (a) Rainbow. (b) Grand Canyon. (c) Fireworks. (d) Basketball. (e) Pyramid.

In order to show the performance improvement of our approach, we then compare our fusion by random walk (FRW) method and our fusion by random walk with pseudo relevance feedback (FRWPRF) method with three other methods.

1) RVF: this is a baseline method, which is purely based on image visual features. For every query image, we retrieve the top-$N$ images using the similarity function defined in (1). We call this method the retrieval by visual features (RVF) method.

2) RWIT: this method is based on the forward random walk model with self-transitions on the image-tag bipartite graph, proposed in [9]. For every query image, we start the random walk at the query image node on the image-tag bipartite graph, and retrieve the top images as the results. We call this method random walk using image-tag (RWIT) relationships.

3) DiffusionRank: this method is a random walk method based on the heat diffusion phenomenon, proposed in [41]. For every query image, we start the heat diffusion process on the image-tag bipartite graph, and retrieve the top images with the largest heat values as the results. We call this method DiffusionRank.

In order to evaluate these methods, we use the Precision at $N$ metric ($P@N$). We select a set of 200 testing query images, and ask a panel of three experts to measure the relevance between the testing query images and the retrieved images. The Precision at $N$ is defined as

$$P@N = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{r_i}{N}, \qquad (7)$$

where the set $Q$ contains all the testing query images, $|Q|$ is the number of testing query images, $r_i$ refers to the number of relevant images retrieved for the $i$th testing query image, and $N$ is the number of top images retrieved for every testing query image.

Fig. 9. Examples of top images using text queries with the RWIT method. (a) Basketball. (b) Pyramid.
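The metric itself is a one-liner; given the per-query relevant counts judged by the panel, P@N of (7) is the mean of r_i / N.

```python
import numpy as np

def precision_at_n(relevant_counts, N):
    """P@N as in (7): relevant_counts[i] is r_i, the number of relevant
    images among the top N retrieved for the i-th testing query."""
    r = np.asarray(relevant_counts, dtype=float)
    return (r / N).mean()

# e.g. for 200 queries judged at depth 5: precision_at_n(counts, N=5)
```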

The comparison results are shown in Fig. 6. We can observe that our FRW method (with $\lambda = 0.7$) performs much better than the RVF, RWIT, and DiffusionRank methods. If we incorporate the pseudo relevance feedback algorithm proposed in Algorithm 1, the performance is further improved (in FRWPRF, we use the top-5 results as the feedback images). This shows the promise of our proposed framework.

The parameter $\lambda$ balances the information from image visual features and image-tag information, taking advantage of these two types of information. If $\lambda = 1$, we only utilize the information from the image-tag bipartite graph; for $\lambda = 0$, we only mine the information from the image similarity graph. In other cases, we fuse these two sources together for the image retrieval tasks. To investigate the impact of parameter $\lambda$, we evaluate our FRW method with different values of $\lambda$. Fig. 7 plots the trend of $P@N$ as parameter $\lambda$ changes.

We can conclude that the value of $\lambda$ affects the retrieval results significantly. This indicates that fusing these two sources will not always generate the best performance; we need to manually choose an appropriate value to avoid overtuning the parameter. Another interesting observation is that, as the value of $\lambda$ increases, the value of $P@N$ first increases, but beyond the optimum (around $\lambda = 0.7$), the value of $P@N$ starts to drop. This phenomenon demonstrates that in most cases, low-level visual features contain less information than textual tags (that is why the optimal value of $\lambda$ is closer to 1 than to 0), but a combination of both sources of information usually achieves better results than using only one of them (that is why the optimal value of $\lambda$ is less than 1).

D. Text-Based Image Retrieval

In text-based image retrieval (TBIR), we start the random walk at a tag node. After a $t$-step random walk, we select the top-ranked images as the retrieval results.

Fig. 10. P@N comparisons in TBIR.

Fig. 8 shows five TBIR examples (with $\lambda = 0.7$). The queries are "Rainbow", "Grand Canyon", "Fireworks", "Basketball", and "Pyramid", respectively. From the retrieved top-5 results, we can observe that our method performs very well. We also list some of the results generated by RWIT in Fig. 9 for comparison. We create a set of 200 queries, and compare our FRW and FRWPRF methods on TBIR with the RWIT and DiffusionRank methods, which only utilize the image-tag relationships for retrieval. Fig. 10 shows the comparison results. We find that both FRW and the relevance feedback method FRWPRF perform much better than the RWIT and DiffusionRank methods. The parameter $\lambda$ also plays an important role in TBIR. Basically, it shares the same trend as in CBIR, and the optimal value of $\lambda$ is also around 0.7.

E. Image Annotation

Automated image annotation has been an active and challenging research topic in computer vision and pattern recognition for years. Automated image annotation is essential to make huge collections of unlabeled digital photos indexable by existing text-based indexing and search solutions. In general, an image annotation task consists of assigning a set of semantic tags or labels to a novel image based on models learned from certain training data. Conventional image annotation approaches often attempt to detect semantic concepts with a collection of human-labeled training images. Due to the long-standing challenge of object recognition, such approaches, though working reasonably well for small-sized testbeds, often perform poorly on large datasets in the real world. Besides, it is often expensive and time-consuming to collect the training data.

Fig. 11. Examples of image annotations.

TABLE I. P@N IN AUTOMATED IMAGE ANNOTATION

In addition to its success in image retrieval, our FRW framework also provides a natural, effective, and efficient solution for automated image annotation tasks. For every new image, we first extract a 297-dimensional feature vector using the method described in Section III-A. Then, we find the top-$k$ most similar images using (1), and link the new image to these top-$k$ images in the hybrid graph. Finally, we start the random walk at this newly added image node, and return the top-$m$ tags as the annotations for this image.
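A sketch of this out-of-sample procedure, reusing the helpers above and assuming image nodes precede tag nodes in the hybrid layout (k, t, and the number of returned tags are placeholder values), follows.

```python
import numpy as np
from scipy.sparse import bmat, csr_matrix

def annotate_new_image(A, x_new, X, n_images, k=40, t=4, top_tags=5):
    """Link an unseen image to its k most similar indexed images by (1),
    extend the hybrid graph by one node, walk, and read off tag scores."""
    # Cosine similarity of the new 297-d vector against indexed images.
    sims = (X @ x_new) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x_new))
    nbrs = np.argpartition(-sims, k)[:k]
    w = sims[nbrs] / sims[nbrs].sum()         # normalized out-edge weights
    n = A.shape[0]
    new_row = csr_matrix((w, (np.zeros(k, dtype=int), nbrs)), shape=(1, n))
    A_ext = bmat([[A, csr_matrix((n, 1))],
                  [new_row, csr_matrix((1, 1))]], format='csr')
    r = np.zeros(n + 1)
    r[n] = 1.0                                # start at the new image node
    for _ in range(t):
        r = step_with_jump(r, A_ext)          # helper defined above
    tag_scores = r[n_images:n]                # tag nodes follow the images
    return np.argsort(-tag_scores)[:top_tags] # indices into the tag block
```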

Fig. 11 gives six examples demonstrating the qualitative performance of the annotation results produced by our framework (with $\lambda = 0.2$). We select a set of 50 images as the testing images for automated image annotation. The image annotation accuracy is shown in Table I, which demonstrates a very competitive result, since automated image annotation is a very challenging problem. We also observe that the trend of accuracy as parameter $\lambda$ changes is not similar to the one in CBIR and TBIR. As shown in Fig. 12, the optimal value of parameter $\lambda$ is around 0.2. This is because, at the beginning of the random walk, the starting image does not have any links connected to the tag nodes; hence, we need to rely more on the image similarity subgraph. Otherwise, we cannot generate accurate image annotations, and the overall precision suffers.

Fig. 12. Impact of parameter $\lambda$ in image annotation.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we present a novel framework for several image retrieval tasks based on the Markov random walk. The proposed framework bridges the semantic gap between visual contents and textual tags in a simple but efficient way. We do not need to train any learning function or collect any training data; hence, our method can easily be adapted to very large datasets. Finally, the experimental results on a large Flickr dataset show the effectiveness of our approach.

In the future, we plan to incorporate more information into our proposed framework. Specifically, we only utilize the image contents and the image tag information in this paper. Actually, there is a lot of metadata on the Flickr Web site, such as the social network information among users and the image notes information, which can also be employed to improve the retrieval performance. Moreover, we need to design a more flexible model to include all these pieces of information. Another problem worthy of investigation is to develop other Markov random walk models: instead of using the "forward" random walk model of this paper, we can also try models like the "backward" model, and compare their performance. A further question worth investigating is how the amount and quality of tags affect the performance of our method.

ACKNOWLEDGMENT

The authors would like to thank the reviewers and the associate editor for their helpful comments.

REFERENCES

[1] M. Ames and M. Naaman, "Why we tag: Motivations for annotation in mobile and online media," in Proc. CHI'07, San Jose, CA, 2007, pp. 971–980.
[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," J. ACM, vol. 45, no. 6, pp. 891–923, 1998.
[3] E. Auchard, "Flickr to map the world's latest photo hotspots," Reuters, 2007.
[4] S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly, "Video suggestion and discovery for YouTube: Taking random walks through the view graph," in Proc. WWW'08, Beijing, China, 2008, pp. 895–904.
[5] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan, "Matching words and pictures," J. Mach. Learn. Res., vol. 3, pp. 1107–1135, 2003.
[6] D. M. Blei and M. I. Jordan, "Modeling annotated data," in Proc. SIGIR'03, Toronto, ON, Canada, 2003, pp. 127–134.
[7] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[8] O. Chum, M. Perdoch, and J. Matas, "Geometric min-hashing: Finding a (thick) needle in a haystack," in Proc. CVPR'09, 2009, pp. 17–24.
[9] N. Craswell and M. Szummer, "Random walks on the click graph," in Proc. SIGIR'07, Amsterdam, The Netherlands, 2007, pp. 239–246.
[10] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Comput. Surv., vol. 40, no. 2, pp. 1–60, 2008.
[11] C. Djeraba, "Association and content-based retrieval," IEEE Trans. Knowl. Data Eng., vol. 15, no. 1, pp. 118–135, Jan.–Feb. 2003.
[12] P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth, "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," in Proc. 7th Eur. Conf. Computer Vision—Part IV (ECCV'02), London, U.K., 2002, pp. 97–112, Springer-Verlag.
[13] N. Eiron, K. S. McCurley, and J. A. Tomlin, "Ranking the web frontier," in Proc. WWW'04, New York, 2004, pp. 309–318.
[14] ESP Game. [Online]. Available: http://www.espgame.org.
[15] J. Fan, Y. Gao, H. Luo, and R. Jain, "Mining multilevel image semantics via hierarchical classification," IEEE Trans. Multimedia, vol. 10, no. 2, pp. 167–187, Feb. 2008.
[16] J. Fan, Y. Gao, H. Luo, and G. Xu, "Automatic image annotation by using concept-sensitive salient objects for image content representation," in Proc. SIGIR'04, Sheffield, U.K., 2004, pp. 361–368.
[17] Flickr. [Online]. Available: http://www.flickr.com.
[18] A. Ghoshal, P. Ircing, and S. Khudanpur, "Hidden Markov models for automatic annotation and content-based retrieval of images and video," in Proc. SIGIR'05, Salvador, Brazil, 2005, pp. 544–551.
[19] A. Grigorova, F. G. B. D. Natale, C. K. Dagli, and T. S. Huang, "Content-based image retrieval by feature adaptation and relevance feedback," IEEE Trans. Multimedia, vol. 9, no. 6, pp. 1183–1192, Oct. 2007.
[20] W. H. Hsu, L. S. Kennedy, and S.-F. Chang, "Video search reranking through random walk over document-level context graph," in Proc. MM'07, Augsburg, Germany, 2007, pp. 971–980.
[21] J. Jeon, V. Lavrenko, and R. Manmatha, "Automatic image annotation and retrieval using cross-media relevance models," in Proc. SIGIR'03, Toronto, ON, Canada, 2003, pp. 119–126.
[22] Y. Jing and S. Baluja, "PageRank for product image search," in Proc. WWW'08, Beijing, China, 2008, pp. 307–316.
[23] R. Krishnapuram, S. Medasani, S.-H. Jung, Y. Choi, and R. Balasubramaniam, "Content-based image retrieval based on a fuzzy approach," IEEE Trans. Knowl. Data Eng., vol. 16, no. 10, pp. 1185–1199, Oct. 2004.
[24] Y.-H. Kuo, K.-T. Chen, C.-H. Chiang, and W. H. Hsu, "Query expansion for hash-based image object retrieval," in Proc. MM'09, Beijing, China, 2009, pp. 65–74.
[25] M. Lades, J. C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P. Wurtz, and W. Konen, "Distortion invariant object recognition in the dynamic link architecture," IEEE Trans. Comput., vol. 42, no. 3, pp. 300–311, Mar. 1993.
[26] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Trans. Multimedia Comput. Commun. Appl., vol. 2, no. 1, pp. 1–19, 2006.
[27] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[28] H. Ma, I. King, and M. R. Lyu, "Effective missing data prediction for collaborative filtering," in Proc. SIGIR'07, Amsterdam, The Netherlands, 2007, pp. 39–46.
[29] H. Ma, I. King, and M. R. Lyu, "Learning to recommend with social trust ensemble," in Proc. SIGIR'09, Boston, MA, 2009, pp. 203–210.
[30] H. Ma, H. Yang, M. R. Lyu, and I. King, "SoRec: Social recommendation using probabilistic matrix factorization," in Proc. CIKM'08, Napa Valley, CA, 2008, pp. 931–940.
[31] T. Ojala, M. Pietikainen, and D. Harwood, "A comparative study of texture measures with classification based on feature distributions," Pattern Recognit., vol. 29, no. 1, pp. 51–59, Jan. 1996.
[32] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the Web," Tech. Rep. SIDL-WP-1999-0120, 1999 (version of 11/11/1999).
[33] K. Pearson, "The problem of the random walk," Nature, vol. 72, p. 294, 1905.
[34] G.-J. Qi, X.-S. Hua, and H.-J. Zhang, "Learning semantic distance from community-tagged media collection," in Proc. MM'09, Beijing, China, 2009, pp. 243–252.
[35] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: A power tool in interactive content-based image retrieval," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 644–655, Sep. 1998.
[36] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-based image retrieval at the end of the early years," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1349–1380, Dec. 2000.
[37] J. Tang, H. Li, G.-J. Qi, and T.-S. Chua, "Image annotation by graph-based inference with integrated multiple/single instance representations," IEEE Trans. Multimedia, vol. 12, no. 2, pp. 131–141, Feb. 2010.
[38] S. Tong and E. Chang, "Support vector machine active learning for image retrieval," in Proc. MM'01, Ottawa, ON, Canada, 2001, pp. 107–118.
[39] C. Wang, L. Zhang, and H.-J. Zhang, "Learning to reduce the semantic gap in web image retrieval and annotation," in Proc. SIGIR'08, Singapore, 2008, pp. 355–362.
[40] L. Wu, X.-S. Hua, N. Yu, W.-Y. Ma, and S. Li, "Flickr distance," in Proc. MM'08, Vancouver, BC, Canada, 2008, pp. 31–40.
[41] H. Yang, I. King, and M. R. Lyu, "DiffusionRank: A possible penicillin for web spamming," in Proc. SIGIR'07, Amsterdam, The Netherlands, 2007, pp. 431–438.
[42] S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li, "Descriptive visual words and visual phrases for image applications," in Proc. MM'09, Beijing, China, 2009, pp. 75–84.
[43] J. Zhu, S. C. Hoi, and M. R. Lyu, "Face annotation by transductive kernel fisher discriminant," IEEE Trans. Multimedia, vol. 10, no. 1, pp. 86–96, Jan. 2008.
[44] J. Zhu, S. C. Hoi, M. R. Lyu, and S. Yan, "Near-duplicate keyframe retrieval by nonrigid image matching," in Proc. MM'08, 2008, pp. 41–50.
[45] J. Zhu, S. C. Hoi, M. R. Lyu, and S. Yan, "Near-duplicate keyframe retrieval by semi-supervised learning and nonrigid image matching," ACM Trans. Multimedia Comput., Commun. Appl., to be published.
[46] X. Zhu, "Semi-supervised learning with graphs," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, 2005.


Hao Ma received the B.Eng. and M.Eng. degrees from the School of Information Science and Engineering at Central South University, Changsha, China, in 2002 and 2005, respectively, and the Ph.D. degree from the Computer Science and Engineering Department, The Chinese University of Hong Kong, Kowloon, in 2010.

He worked as a System Engineer at Intel Shanghai, Shanghai, China, before he joined CUHK as a Ph.D. student in November 2006. His research interests include information retrieval, data mining, machine learning, social network analysis, recommender systems, human computation, and social media analysis.

Jianke Zhu (M'09) received the B.S. degree in mechatronics and computer engineering from Beijing University of Chemical Technology, Beijing, China, the M.S. degree in electrical and electronics engineering from the University of Macau, Taipa, Macau, and the Ph.D. degree in computer science and engineering from The Chinese University of Hong Kong, Kowloon.

He is currently a postdoctoral fellow in the Computer Science and Engineering Department of The Chinese University of Hong Kong. His research interests include computer vision, machine learning, pattern recognition, and multimedia information retrieval.

Michael Rung-Tsong Lyu (F'04) received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1981; the M.S. degree in computer engineering from the University of California, Santa Barbara, in 1985; and the Ph.D. degree in computer science from the University of California, Los Angeles, in 1988.

He was with the Jet Propulsion Laboratory as a Technical Staff Member from 1988 to 1990. From 1990 to 1992, he was with the Department of Electrical and Computer Engineering, The University of Iowa, Iowa City, as an Assistant Professor. From 1992 to 1995, he was a Member of the Technical Staff in the applied research area of Bell Communications Research (Bellcore), Morristown, NJ. From 1995 to 1997, he was a Research Member of the Technical Staff at Bell Laboratories, Murray Hill, NJ. In 1998, he joined The Chinese University of Hong Kong, Kowloon, where he is now a Professor in the Department of Computer Science and Engineering. He is also Founder and Director of the Video over InternEt and Wireless (VIEW) Technologies Laboratory. His research interests include software reliability engineering, distributed systems, fault-tolerant computing, mobile and sensor networks, Web technologies, multimedia information processing and retrieval, and machine learning. He has published 330 refereed journal and conference papers in these areas. He was the editor of two book volumes: Software Fault Tolerance (New York: Wiley, 1995) and The Handbook of Software Reliability Engineering (Piscataway, NJ: IEEE and New York: McGraw-Hill, 1996).

Dr. Lyu initiated the First International Symposium on Software Reliability Engineering (ISSRE) in 1990. He was the Program Chair for ISSRE 1996 and General Chair for ISSRE 2001. He was also PRDC 1999 Program Co-Chair, WWW10 Program Co-Chair, SRDS 2005 Program Co-Chair, PRDC 2005 General Co-Chair, ICEBE 2007 Program Co-Chair, and SCC 2010 Program Co-Chair. He will be the General Chair for DSN 2011 in Hong Kong. He was on the Editorial Board of the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, the IEEE TRANSACTIONS ON RELIABILITY, the Journal of Information Science and Engineering, and the Wiley Software Testing, Verification & Reliability Journal. Dr. Lyu is an AAAS Fellow and a Croucher Senior Research Fellow.

Irwin King (SM'08) received the B.Sc. degree in engineering and applied science from the California Institute of Technology, Pasadena, and the M.Sc. and Ph.D. degrees in computer science from the University of Southern California, Los Angeles.

He is with The Chinese University of Hong Kong, Kowloon. His research interests include machine learning, web intelligence, social computing, data mining, and multimedia information processing. In these research areas, he has over 200 technical publications in journals and conferences. In addition, he has contributed over 20 book chapters and edited volumes. Moreover, he has over 30 research and applied grants. One notable patented system he has developed is the VeriGuide System, which detects similar sentences and performs readability analysis of text-based documents in both English and Chinese to promote academic integrity and honesty.

Dr. King is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS (TNN) and the IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE (CIM). He is a member of the Editorial Board of the Open Information Systems Journal, the Journal of Nonlinear Analysis and Applied Mathematics, and the Neural Information Processing—Letters and Reviews Journal (NIP-LR). He has also served as Special Issue Guest Editor for Neurocomputing, the International Journal of Intelligent Computing and Cybernetics (IJICC), the Journal of Intelligent Information Systems (JIIS), and the International Journal of Computational Intelligence Research (IJCIR). He is a member of ACM, the International Neural Network Society (INNS), and the Asian Pacific Neural Network Assembly (APNNA). Currently, he is serving on the Neural Network Technical Committee (NNTC) and the Data Mining Technical Committee under the IEEE Computational Intelligence Society (formerly the IEEE Neural Network Society). He is also a member of the Board of Governors of INNS and a Vice-President and Governing Board Member of APNNA.

