IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 6, JUNE 2011

Contextual Kernel and Spectral Methods for Learning the Semantics of Images

Zhiwu Lu, Horace H. S. Ip, and Yuxin Peng

Abstract—This paper presents contextual kernel and spectral methods for learning the semantics of images that allow us to automatically annotate an image with keywords. First, to exploit the context of visual words within images for automatic image annotation, we define a novel spatial string kernel to quantify the similarity between images. Specifically, we represent each image as a 2-D sequence of visual words and measure the similarity between two 2-D sequences using the shared occurrences of $l$-length 1-D subsequences by decomposing each 2-D sequence into two orthogonal 1-D sequences. Based on our proposed spatial string kernel, we further formulate automatic image annotation as a contextual keyword propagation problem, which can be solved very efficiently by linear programming. Unlike the traditional relevance models that treat each keyword independently, the proposed contextual kernel method for keyword propagation takes into account the semantic context of annotation keywords and propagates multiple keywords simultaneously. Significantly, this type of semantic context can also be incorporated into spectral embedding for refining the annotations of images predicted by keyword propagation. Experiments on three standard image datasets demonstrate that our contextual kernel and spectral methods can achieve significantly better results than the state of the art.

Index Terms—Annotation refinement, kernel methods, keyword propagation, linear programming, spectral embedding, string kernel, visual words.

I. INTRODUCTION

WITH the rapid growth of image archives, there is an increasing need for effectively indexing and searching these images. Although many content-based image retrieval systems [1], [2] have been proposed, it is rather difficult for users to represent their queries using visual image features such as color and texture. Instead, most users prefer image search by textual queries, which is typically achieved by manually providing image annotations and then searching over these annotations using a textual query. However, manual annotation is an expensive and tedious task. Hence, automatic image annotation

Manuscript received June 28, 2010; revised September 25, 2010; accepted December 13, 2010. Date of publication December 30, 2010; date of current version May 18, 2011. This work was supported in part by the Research Council of Hong Kong under Grant CityU 114007, the City University of Hong Kong under Grant 7008040, and the National Natural Science Foundation of China under Grants 60873154 and 61073084. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sharath Pankanti.

Z. Lu and Y. Peng are with the Institute of Computer Science and Technology, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]).

H. H. S. Ip is with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2010.2103082

plays an important role in efficient image retrieval. Recently, many methods for learning the semantics of images based on machine learning techniques have emerged to pave the way for automatic annotation of an image with keywords, and we can roughly classify them into two categories.

The traditional methods for automatic image annotation treat each annotation keyword or concept as an independent class and train a corresponding classifier to identify images belonging to this class. This strategy has been adopted by methods such as linguistic indexing of pictures [2] and image annotation using the support vector machine [3] or Bayes point machine [4]. The problem with these classification-based methods is that they are not particularly scalable to a large-scale concept space. In the context of image annotation and retrieval, the concept space grows significantly large due to the large number (i.e., hundreds or even thousands) of keywords involved in the annotation of images. Therefore, the problems of semantic overlap and data imbalance among different semantic classes become very serious, which lead to significantly degraded classification performance.

Another category of automatic image annotation methods takes a different viewpoint and learns the correlation between images and annotation keywords by means of keyword propagation. Many such methods are based on probabilistic generative models, among which an influential work is the cross-media relevance model [5], which estimates the joint probability of image regions and annotation keywords on the training image set. The relevance model for learning the semantics of images has subsequently been improved through the development of the continuous-space relevance model [6], the multiple Bernoulli relevance model [7], the dual cross-media relevance model [8], and, more recently, our generalized relevance model [9]. Moreover, graph-based semi-supervised learning [10] has also been applied to keyword propagation for automatic image annotation in [11]. However, these keyword propagation methods ignore either the context of image regions or the correlation information of annotation keywords.

This paper focuses on keyword propagation for learning the semantics of images. To overcome the problems with the above keyword propagation methods, we propose a 2-D string kernel, called the spatial spectrum kernel (SSK) [12], which quantifies the similarity between images and enables us to exploit the context of visual words within images for keyword propagation. To compute the proposed contextual kernel, we represent each image as a 2-D sequence of visual words and measure the similarity between two 2-D sequences using the shared occurrences of $l$-length 1-D subsequences by decomposing each 2-D sequence into two orthogonal 1-D sequences (i.e., the row-wise and column-wise ones).



Fig. 1. Illustration of the proposed framework for learning the semantics of images using visual and semantic context.

To the best of our knowledge, this is the first application of a string kernel for matching 2-D sequences of visual words. Here, it should be noted that string kernels were originally proposed for protein classification [13], and the number of amino acids (similar to the visual words used here) typically involved in the kernel definition was very small. In contrast, in the present work, string kernels are used to capture and compare the context of a large number of visual words within an image, and the associated problem of sequence matching becomes significantly more challenging. As compared with our previous work [12], this paper presents significantly more extensive and convincing results, in particular, on the large IAPR TC-12 image dataset [14]. More importantly, we further present a new and significant technical development that addresses the issue of annotation refinement (see Section V). The novelty of the proposed refinement method is that it directly considers the manifold structure of annotation keywords, which gives rise to additional new and significant contributions upon our previous work.

Moreover, to exploit the semantic context of annotation keywords for automatically learning the semantics of images, a contextual kernel method is proposed based on our spatial spectrum kernel. We first formulate automatic image annotation as a contextual keyword propagation problem where multiple keywords can be propagated simultaneously from the training images to the test images. Meanwhile, not to be confused by the training images that are far away (i.e., not in the same manifold), each test image is limited to absorbing the keyword information (e.g., confidence scores) only from its $k$-nearest neighbors. Since this contextual keyword propagation problem is further solved very efficiently by linear programming [15], [16], our contextual kernel method is highly scalable and can be applied to large image datasets. It should be noted that our contextual keyword propagation distinguishes itself from previous work in that multiple keywords can be propagated simultaneously, which means that the semantic context of annotation keywords can be exploited for learning the semantics of images. More importantly, this type of semantic context can be further used for refining the annotations predicted by keyword propagation. Here, we first obtain spectral embedding [17]–[19] by exploiting the semantic context of annotation keywords and then perform annotation refinement in the resulting embedding space.

Finally, the above contextual kernel and spectral methods for learning the semantics of images can be integrated in a unified framework as shown in Fig. 1, which contains three main components: visual context analysis with the spatial spectrum kernel, learning with visual and semantic context, and annotation refinement by contextual spectral embedding. In this paper, the proposed framework is tested on three standard image datasets: University of Washington (UW), Corel [20], and IAPR [14]. Particularly, the Corel image dataset has been widely used

for the evaluation of image annotation [7], [21]. Experimental results on these image datasets demonstrate that the proposed framework outperforms the state-of-the-art methods. In summary, the proposed framework has the following advantages.

1) Our spatial string kernel defined as the similarity between images can capture the context of visual words within images.

2) Our contextual spectral embedding method directly considers the manifold structure of annotation keywords for annotation refinement, and more importantly, the semantic context of annotation keywords can be incorporated into manifold learning.

3) Our kernel and spectral methods can achieve promising results by exploiting both visual and semantic context for learning the semantics of images.

4) Our contextual kernel and spectral methods are very scalable with respect to the data size and can be used for large-scale image applications.

5) Our contextual kernel and spectral methods are very general techniques and have the potential to improve the performance of other machine learning methods that are widely used in computer vision and image processing.

The remainder of this paper is organized as follows. Section II gives a brief review of previous work. In Section III, we present our spatial spectrum kernel to capture the context of visual words, which can be further used for keyword propagation. In Section IV, we propose our contextual kernel method for keyword propagation based on the proposed spatial spectrum kernel. In Section V, the annotations predicted by our contextual keyword propagation are further refined by a novel contextual spectral embedding. Section VI presents the evaluation of the proposed framework on three standard image datasets. Finally, Section VII gives conclusions drawn from our experimental results.

II. RELATED WORK

Our keyword propagation method differs from the traditional approaches that are based on the relevance model [5]–[7] and graph-based semi-supervised learning [11] in that the keyword correlation information has been exploited for image annotation. Although much effort has also been made in [22] to exploit the keyword correlation information, it was limited to pairwise correlation of annotation keywords. In contrast, our method can simultaneously propagate multiple keywords from the training images to the test images. In [23], a particular structure of the annotation keywords was assumed in order to exploit the keyword correlation information. We argue that such an assumption could be violated in practice because the relationships between annotation keywords may become too complicated.


On the contrary, our method can exploit the semantic context of annotation keywords of any order.

This semantic context is further exploited for refining the annotations of images predicted by our contextual keyword propagation. Specifically, we first obtain contextual spectral embedding by incorporating the semantic context of annotation keywords into graph construction, and then perform annotation refinement in the obtained more descriptive embedding space. This differs from previous methods, e.g., [21], [24], which directly exploited the semantic context of keywords for annotation refinement without considering the manifold structure hidden among them.

More importantly, another type of context has also been incorporated into our contextual keyword propagation. This can be achieved by first representing each image as a 2-D sequence of visual words and then defining a spatial string kernel to capture the context of visual words. This contextual kernel can be used as a similarity measure between images for keyword propagation. In fact, both local and global context can be captured in our work. The spatial dependency between visual words learnt with our spatial spectrum kernel can be regarded as the local context, while the spatial layout of visual words obtained with multiscale kernel combination (see Section III-C) provides the global context. In the literature, most previous methods only considered either local or global context of visual words. For example, the collapsed graph [25] and Markov stationary analysis [26] only learnt the local context, while the constellation model [27] and spatial pyramid matching [28] only captured the global context.

To reduce the semantic gap between visual features and semantic annotations, we make use of an intermediate representation with a learnt vocabulary of visual words, which is similar to bag-of-words methods such as probabilistic latent semantic analysis (PLSA) [29] and latent Dirichlet allocation (LDA) [30]. However, these methods typically ignore the spatial structure of images because the regions within images are assumed to be independently drawn from a mixture of latent topics. In contrast, our present work captures the spatial context of regions based on the proposed spatial spectrum kernel. It is shown in our experiments that this type of visual context is effective for keyword propagation in the challenging application of automatic image annotation.

III. VISUAL CONTEXT ANALYSIS WITH SPATIAL SPECTRUM KERNEL (SSK)

To capture the context of visual words within images, we propose an SSK which can be used as a similarity measure between images for keyword propagation. We further present an efficient kernel computation method based on a tree data structure. Finally, we propose multiscale kernel combination to capture the global layout of visual words within images. Hence, both local and global context within images can be captured and exploited in our present work.

A. Kernel Definition

Similar to the bag-of-words methods, we first divide images into equivalent blocks on a regular grid and then extract some representative properties from each block by incorporating the color and texture features. Through performing $k$-means clustering on the extracted feature vectors, we generate a vocabulary $V$ of visual words which describes the content similarities among the image blocks. Based on this universal vocabulary $V$, each block is annotated automatically with a visual word and an image is subsequently represented by a 2-D sequence of visual words.
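As a concrete illustration of this step, the following Python sketch (not the authors' code) builds a visual-word vocabulary with k-means and quantizes the block features of an image into a 2-D grid of visual-word indices. The toy feature extractor and the vocabulary size are assumptions made only for illustration; the actual color and texture features are described in Section VI-A.

    import numpy as np
    from sklearn.cluster import KMeans

    def extract_block_features(blocks):
        # Placeholder feature extractor: one vector per block (e.g., mean color).
        return np.array([b.mean(axis=(0, 1)) for b in blocks])

    def build_vocabulary(all_block_features, vocab_size=500, seed=0):
        # Cluster block features into a vocabulary of visual words.
        return KMeans(n_clusters=vocab_size, random_state=seed, n_init=10).fit(all_block_features)

    def image_to_word_grid(block_features, grid_shape, vocabulary):
        # Assign each block to its nearest cluster center, keeping the 2-D layout.
        return vocabulary.predict(block_features).reshape(grid_shape)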

The basic idea of defining a spatial spectrum kernel is to map the 2-D sequence $m$ into a high-dimensional feature space via a mapping $\Phi(\cdot)$. We first scan this 2-D sequence in the horizontal and vertical directions, which results in a row-wise 1-D sequence $m_r$ and a column-wise 1-D sequence $m_c$, respectively. The feature mapping can be formulated as follows:

$$\Phi(m) = [\Phi(m_r), \Phi(m_c)] \qquad (1)$$

where $\Phi(m_r)$ and $\Phi(m_c)$ denote the feature vectors for the row-wise and column-wise sequences, respectively. The above formulation means that these two feature vectors are stacked together to form a higher dimensional feature vector for the original 2-D sequence $m$.

More formally, for an image that is divided into $n_r \times n_c$ blocks on a regular grid, we can now denote it as a row-wise sequence $m_r$ and a column-wise one $m_c$ as

$$m_r = w_{11}\, w_{12} \cdots w_{1 n_c}\, w_{21} \cdots w_{n_r n_c} \qquad (2)$$

$$m_c = w_{11}\, w_{21} \cdots w_{n_r 1}\, w_{12} \cdots w_{n_r n_c} \qquad (3)$$

where $w_{ij}$ is the visual word of block $(i, j)$ in the image. In the following, we will only give the details of the feature mapping for the row-wise sequences. The column-wise sequences can be mapped to a high dimensional feature space similarly.

Since the $l$-spectrum of an input sequence is the set of all of the $l$-length subsequences that it contains, our feature mapping used to define the spatial spectrum kernel is indexed by all possible $l$-length subsequences from the vocabulary $V$ (i.e., $V^l$), that is, we can define the following mapping that maps $m_r$ to a $|V|^l$-dimensional feature space:

$$\Phi(m_r) = \big(\phi_u(m_r)\big)_{u \in V^l} \qquad (4)$$

where $\phi_u(m_r)$ is the number of times that $u$ occurs in $m_r$. We can find that $m_r$ in the feature space is now denoted as a weighted representation of its $l$-spectrum. For example, given $V = \{A, B\}$ and $l = 2$, the feature vector of a row-wise sequence collects the occurrence counts of all of the possible $l$-length subsequences (i.e., $V^l$), which are AA, AB, BA, and BB, respectively.

Since the feature mapping for the column-wise sequences can be defined similarly, our spatial spectrum kernel can be computed as the following inner product:

$$K(m, m') = \langle \Phi(m), \Phi(m') \rangle = \langle \Phi(m_r), \Phi(m'_r) \rangle + \langle \Phi(m_c), \Phi(m'_c) \rangle \qquad (5)$$

where $m$ and $m'$ are two 2-D sequences (i.e., two images). Although the 2-D sequences are mapped to a high-dimensional (i.e., $|V|^l$-dimensional) feature space even for fairly small $l$, the feature vectors are extremely sparse: the number of nonzero coordinates is bounded by the number of $l$-length subsequences contained in the row-wise and column-wise scans. This property enables us to compute our SSK very efficiently.
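As a rough illustration of (4) and (5), the sketch below represents each scan of a word grid by its sparse $l$-spectrum (a dictionary of subsequence counts) and computes the kernel as the sum of two sparse inner products. Whether an $l$-length window may cross a row or column boundary of the grid is an assumption made here for simplicity.

    from collections import Counter

    def spectrum_counts(seq, l):
        # Sparse l-spectrum: counts of every l-length subsequence of a 1-D sequence.
        return Counter(tuple(seq[i:i + l]) for i in range(len(seq) - l + 1))

    def sparse_dot(c1, c2):
        # Inner product of two sparse count vectors stored as dictionaries.
        return sum(v * c2[u] for u, v in c1.items() if u in c2)

    def ssk(grid1, grid2, l=2):
        # Spatial spectrum kernel (5) between two 2-D grids of visual-word indices.
        row1, col1 = grid1.reshape(-1), grid1.T.reshape(-1)   # row-wise and column-wise scans
        row2, col2 = grid2.reshape(-1), grid2.T.reshape(-1)
        return (sparse_dot(spectrum_counts(row1, l), spectrum_counts(row2, l)) +
                sparse_dot(spectrum_counts(col1, l), spectrum_counts(col2, l)))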


Fig. 2. Suffix tree constructed to compute the kernel value for two example sequences $m$ and $m'$: (a) the tree for $m$ and (b) the tree after $m'$ is compared with $m$. Each branch of the tree is labeled with a visual word from $V$, and each rectangular node denotes a leaf that stores two counts: one represents the number of times that an $l$-length subsequence of $m$ ends at the leaf, while the other represents a similar count for $m'$.

B. Efficient Kernel Computation

A very efficient method for computing $K(m, m')$ is to build a suffix tree for the collection of $l$-length subsequences of $m$ and $m'$, obtained by moving an $l$-length sliding window across either of $m$ and $m'$. Each branch of the tree is labeled with a visual word from $V$. Each depth-$l$ leaf node of the tree stores two counts: one represents the number of times that an $l$-length subsequence of $m$ ends at the leaf, while the other represents a similar count for $m'$.

Fig. 2 shows a suffix tree constructed to compute the kernel value for two example sequences $m$ and $m'$. To compare these two sequences, we first construct a suffix tree to collect all of the $l$-length subsequences of $m$. Moreover, to make the kernel computation more efficient, we ignore the $l$-length subsequences of $m'$ that do not occur in $m$, as they do not contribute to the kernel computation. Therefore, these subsequences (e.g., AA) are not shown in Fig. 2.

It should be noted that this suffix tree has only a modest number of nodes because each 2-D sequence on a regular grid contains only as many $l$-length subsequences as, roughly, the number of blocks. Using a linear time construction algorithm for the suffix tree, we can build and annotate the suffix tree in time linear in the number of blocks. The kernel value is then calculated by traversing the suffix tree and computing the sum of the products of the counts stored at the depth-$l$ nodes. Hence, the overall time cost of calculating the spatial spectrum kernel grows only linearly with the number of blocks. Moreover, this idea of efficient kernel computation can be similarly used to build a suffix tree for all of the input sequences at once and compute all of the kernel values in one traversal of the tree. This is essentially the method that we adopt to compute our kernel matrices in later experiments, though we use a recursive function rather than explicitly constructing the suffix tree.
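In the same spirit as the shared suffix-tree traversal just described, the following hedged sketch computes the whole SSK Gram matrix for a set of images in one pass: every $l$-length subsequence records which images contain it and how often, and the products of these counts are accumulated into the kernel matrix. A hash map plays the role of the explicit suffix tree here; this is an illustrative equivalent rather than the authors' recursive implementation.

    from collections import defaultdict
    import numpy as np

    def ssk_gram(grids, l=2):
        # Gram matrix of the spatial spectrum kernel over a list of 2-D word grids.
        n = len(grids)
        # For each (scan direction, subsequence), map image index -> occurrence count.
        occurrences = defaultdict(lambda: defaultdict(int))
        for idx, g in enumerate(grids):
            for d, seq in enumerate((g.reshape(-1), g.T.reshape(-1))):
                for i in range(len(seq) - l + 1):
                    occurrences[(d, tuple(seq[i:i + l]))][idx] += 1
        K = np.zeros((n, n))
        for per_image in occurrences.values():        # one "leaf" of the shared structure
            items = list(per_image.items())
            for a, (ia, ca) in enumerate(items):
                for ib, cb in items[a:]:
                    K[ia, ib] += ca * cb
                    if ia != ib:
                        K[ib, ia] += ca * cb
        return K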

C. Multiscale Kernel Combination

We further take into account multiscale kernel combination to capture the global layout of visual words within images. Similar to the idea of the wavelet transform, we place a series of increasingly finer grids over the 2-D sequences of visual words, that is, each subsequence at level $s$ will be divided into 2 × 2 parts at level $s + 1$, where $s = 0, 1, \ldots, S - 1$ and $S$ is the finest scale. Hence, we can obtain $4^s$ subsequences at level $s$. Based on these subsequences, we can define a series of spatial spectrum kernels and then combine them by a weighted sum.

Let $m^{(s,i)}$ be the $i$th subsequence at level $s$ for a 2-D sequence $m$, that is, $m^{(s,i)}$ is in the $i$th cell on the grid at this level. The spatial spectrum kernel at this scale can be computed as follows:

$$K_s(m, m') = \sum_{i=1}^{4^s} K\big(m^{(s,i)}, m'^{(s,i)}\big) \qquad (6)$$

where $m$ and $m'$ are two sequences, that is, we first define the spatial spectrum kernel for each subsequence at level $s$ and then take a sum of the obtained kernels. Intuitively, $K_s(m, m')$ not only measures the number of the same co-occurrences (i.e., spatial dependency) of visual words found at level $s$ in both $m$ and $m'$, but also captures the spatial layout (e.g., from top or from bottom) of these co-occurrences on the grid at this level.

Since the co-occurrences of visual words found at level $s$ also include all of the co-occurrences found at the finer level $s + 1$, the increment of the same co-occurrences found at level $s$ in both $m$ and $m'$ is measured by $K_s(m, m') - K_{s+1}(m, m')$ for $s = 0, \ldots, S - 1$. The spatial spectrum kernels at multiple scales can then be combined by a weighted sum

$$\tilde{K}(m, m') = K_S(m, m') + \sum_{s=0}^{S-1} \frac{1}{2^{S-s}} \big( K_s(m, m') - K_{s+1}(m, m') \big) \qquad (7)$$

where a coarser scale is assumed to play a less important role. When $S = 0$, the above multiscale kernel degrades to the original spatial spectrum kernel.
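A minimal sketch of the multiscale combination is given below, reusing the ssk() function from the earlier snippet: the grid is split into 2^s × 2^s cells at level s, the SSK values over corresponding cells are summed as in (6), and the per-level kernels are combined with weights that shrink for coarser levels as in (7). The handling of unevenly sized grids is an implementation assumption.

    import numpy as np

    def cells(grid, s):
        # Split a 2-D grid into the 4**s cells of the level-s partition.
        rows = np.array_split(grid, 2 ** s, axis=0)
        return [c for r in rows for c in np.array_split(r, 2 ** s, axis=1)]

    def level_kernel(g1, g2, s, l=2):
        # K_s in (6): sum of SSK values over corresponding level-s cells.
        return sum(ssk(a, b, l) for a, b in zip(cells(g1, s), cells(g2, s)))

    def multiscale_ssk(g1, g2, S=2, l=2):
        # Weighted combination (7): coarser levels contribute less.
        ks = [level_kernel(g1, g2, s, l) for s in range(S + 1)]
        total = ks[S]
        for s in range(S):
            total += (ks[s] - ks[s + 1]) / (2 ** (S - s))
        return total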


IV. LEARNING WITH VISUAL AND SEMANTIC CONTEXT

Here, we propose a contextual kernel method for keyword propagation based on our spatial spectrum kernel. Since the semantic context of keywords can also be exploited for keyword propagation, we succeed in learning the semantics of images using both visual and semantic context.

A. Notations and Problems

We first present the basic notations for automatic image annotation. Let $\{(m_i, W_i) : i = 1, \ldots, N\}$ denote the set of training images annotated with keywords, where $N$ is the number of training images. Here, $m_i$ is the $i$th training image represented as a 2-D sequence of visual words, while $W_i$ contains the annotation keywords that are assigned to the image $m_i$. We further employ a binary vector to represent a set of annotation keywords. In particular, for a keyword set $W$, its vector representation $\mathbf{y} = [y_1, \ldots, y_M]^T$ has its $j$th element $y_j$ set to 1 only when the $j$th keyword $w_j \in W$ and zero otherwise. Given a query image $m$ from the test set, our goal is to determine a confidence vector $\mathbf{f} = [f_1, \ldots, f_M]^T$ such that each element $f_j$ indicates the confidence score of assigning the $j$th keyword to the query image $m$.

Our contextual kernel method for keyword propagation derives from a class of single-step keyword propagation. Suppose the similarity between two images is measured by a kernel $K(\cdot, \cdot)$. The confidence score of assigning the $j$th keyword to the test image $m$ could be estimated by

$$f_j = \sum_{i=1}^{N} K(m, m_i)\, y_{ij} \qquad (8)$$

where $y_{ij}$ is set to 1 when the $j$th keyword $w_j \in W_i$ and zero otherwise. It should be noted that both graph-based semi-supervised learning [11] and the probabilistic relevance models [6]–[8] can be regarded as variants of the above kernel method for keyword propagation.

However, there are two problems with the above kernel method for keyword propagation. The first problem is that the confidence scores assigned to the test image $m$ are overestimated, that is, all of the training images are assumed to propagate their annotation keywords to $m$, and in the meantime each training image is assumed to propagate all of its keywords to $m$. These two assumptions are not necessarily true in many complex real-world applications. The second problem is that each keyword is propagated from the training images to the test image $m$ independently of the other keywords, that is, the keyword correlation information is not used for keyword propagation.

B. Contextual Keyword Propagation

To solve the above problems associated with automatic image annotation, we propose a contextual kernel method for keyword propagation in the following. First, not to overestimate the confidence scores assigned to the test image $m$, we replace the equality constraint for keyword propagation in (8) with the following inequality constraint:

$$f_j \le \sum_{m_i \in \mathcal{N}_k(m)} K(m, m_i)\, y_{ij} \qquad (9)$$

where $\mathcal{N}_k(m)$ is the set of $k$-nearest neighbors of the test image $m$. The above inequality indicates that the confidence score propagated from the training images to the test image is upper bounded by the weighted sum of the pairwise similarities and cannot be obtained explicitly. Meanwhile, not to be confused by the training images that are far away (i.e., not in the same manifold structure), the test image is limited to absorbing the confidence scores only from its $k$-nearest neighbors.

Moreover, we exploit the keyword correlation information for keyword propagation so that the annotation keywords are not assigned to the test image independently. Given any set of annotation keywords represented as a binary vector $\mathbf{y} \in \{0, 1\}^M$, it follows from (9) that

$$\sum_{j=1}^{M} f_j\, y_j \le \sum_{m_i \in \mathcal{N}_k(m)} K(m, m_i) \sum_{j=1}^{M} y_{ij}\, y_j. \qquad (10)$$

When the inequality is presented in the vector form of the annotation keywords, it can be simplified as

$$\mathbf{f}^T \mathbf{y} \le \sum_{m_i \in \mathcal{N}_k(m)} K(m, m_i)\, \mathbf{y}_i^T \mathbf{y}. \qquad (11)$$

Hence, given the $M$ different annotation keywords and the training examples, the confidence vector $\mathbf{f}$ of assigning individual annotation keywords to the test image is subject to the following constraints:

$$\mathbf{f}^T \mathbf{y} \le \sum_{m_i \in \mathcal{N}_k(m)} K(m, m_i)\, \mathbf{y}_i^T \mathbf{y}, \quad \forall\, \mathbf{y} \in \{0, 1\}^M. \qquad (12)$$

Actually, we can generalize the inner product of binary vectors of annotation keywords (i.e., $\mathbf{y}_i^T \mathbf{y}$) to a concave function $g(\cdot)$ (see examples in Fig. 3), which means that the above inequality constraints are forced to be tighter. Thus, the constraints in (12) are generalized in the following form:

$$\mathbf{f}^T \mathbf{y} \le \sum_{m_i \in \mathcal{N}_k(m)} K(m, m_i)\, g(\mathbf{y}_i^T \mathbf{y}), \quad \forall\, \mathbf{y} \in \{0, 1\}^M. \qquad (13)$$

In this paper, we only consider an exponential concave function $g(\cdot)$, although there are other types of concave functions. As shown in Fig. 3, this function ensures that we can obtain tighter constraints in (13). Since it is insufficient to identify the appropriate $\mathbf{f}$ only with the constraints, we assume that, among all of the confidence scores that satisfy the constraints in (13), the optimal solution is the one that "maximally" satisfies the constraints.


Fig. 3. Exponential concave functions $g(\cdot)$ used by our method, shown for two example parameter settings.

This assumption leads to the following optimization problem:

$$\max_{\mathbf{f} \ge 0} \ \sum_{j=1}^{M} c_j f_j \quad \text{subject to the constraints in (13)} \qquad (14)$$

where $c_1, \ldots, c_M$ are the weights of the annotation keywords. This is actually a linear programming problem, and we can solve it efficiently by the following discrete optimization algorithm [16].

Step 1) Sort the annotation keywords in descending order of their weights $c_j$.
Step 2) For each keyword taken in this order, compute the marginal increase of the right-hand side of (13) that results from adding this keyword to the set of keywords already considered.
Step 3) Output these marginal increases as the confidence scores of the corresponding keywords.

According to [15], the concavity of $g(\cdot)$ ensures that the above algorithm can find the optimal solution of the linear programming problem defined in (14). Here, it should be noted that our algorithm differs from [15] in three ways: 1) the motivation of keyword propagation is explained in more detail and the constraints for linear programming are derived with fewer assumptions; 2) each test image is limited to absorbing the confidence scores only from its $k$-nearest neighbors in order to speed up the process of keyword propagation and avoid overestimating the confidence scores; and 3) the visual context is incorporated into keyword propagation by defining the similarity between images with our spatial spectrum kernel so that both visual and semantic context can be exploited for learning the semantics of images. The above algorithm for contextual keyword propagation is denoted as CKP in the following. The time complexity of CKP for annotating a single query image is small, since only the $k$-nearest neighbors and the sorted annotation keywords are involved in the propagation. In this paper, we set $k$ to a small value to ensure that the annotation process is very efficient for a large image dataset.
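A hedged sketch of CKP is given below. It follows the greedy scheme for correlated label propagation in [15], [16] as outlined in Steps 1)–3): keywords are processed in decreasing order of their weights, and each keyword receives as its confidence score the marginal increase of the right-hand side of (13). The concave function g(x) = 1 − exp(−x) and the neighborhood size k = 30 are illustrative assumptions, not necessarily the paper's exact choices.

    import numpy as np

    def ckp_scores(k_test, Y, weights, k=30, g=lambda x: 1.0 - np.exp(-x)):
        # k_test  : (N,) kernel values between the test image and all training images
        # Y       : (N, M) binary matrix of training annotations
        # weights : (M,) keyword weights, e.g., inverse keyword frequency
        neighbors = np.argsort(-k_test)[:k]          # k-nearest training images
        sims, Yn = k_test[neighbors], Y[neighbors]
        order = np.argsort(-weights)                 # keywords by decreasing weight
        f = np.zeros(Y.shape[1])
        selected = np.zeros(Y.shape[1])
        prev = np.sum(sims * g(Yn @ selected))       # g(0) = 0, so prev starts at 0
        for j in order:                              # marginal gain of adding keyword j
            selected[j] = 1.0
            cur = np.sum(sims * g(Yn @ selected))
            f[j], prev = cur - prev, cur
        return f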

V. ANNOTATION REFINEMENT BY CONTEXTUAL SPECTRAL EMBEDDING

Here, the semantic context of annotation keywords is further exploited for annotation refinement based on manifold learning techniques, that is, we first present our contextual spectral embedding using the semantic context of annotation keywords, and then perform annotation refinement in the more descriptive embedding space.

A. Contextual Spectral Embedding

To exploit the semantic context for spectral embedding, we first represent the correlation information of annotation keywords by the Pearson product moment (PPM) correlation measure [31] as follows. Given a set of $N$ training images annotated with $M$ keywords, we collect the histogram of keyword $w_j$ as $h_j = [n_{1j}, n_{2j}, \ldots, n_{Nj}]$, where $n_{ij}$ is the count of times that keyword $w_j$ occurs in image $m_i$. The PPM correlation between two keywords $w_j$ and $w_{j'}$ can be defined by

$$\rho(w_j, w_{j'}) = \frac{\frac{1}{N} \sum_{i=1}^{N} (n_{ij} - \mu_j)(n_{ij'} - \mu_{j'})}{\sigma_j\, \sigma_{j'}} \qquad (15)$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of $h_j$, respectively. It is worth noting that the semantic context of annotation keywords has actually been captured from the set of training images using the above correlation measure.

We now construct an undirected weighted graph for spectral embedding with the set of annotation keywords as the vertex set. We set the affinity matrix $A = [\rho(w_j, w_{j'})]_{M \times M}$ to measure the similarity between annotation keywords. The distinct advantage of using this similarity measure is that we have eliminated the need to tune any parameter for graph construction, which can significantly affect the performance and has been noted as an inherent weakness of graph-based methods. Here, it should be noted that the PPM correlation will be negative if $w_j$ and $w_{j'}$ are not positively correlated. In this case, we set the corresponding entry to zero to ensure that the affinity matrix is nonnegative. While the negative correlation does reveal useful information among the keywords and serves to measure the dissimilarity between the keywords, our goal here, however, is to compute the affinity (or similarity) between the keywords and to construct the affinity matrix of the graph used for spectral embedding. Although the dissimilarity information is not exploited directly, by setting the entries between negatively correlated keywords to zero, we have effectively unlinked the negatively correlated keywords in the graph (e.g., given two keywords "sun" and "moon" that are unlikely to appear in the same image, we set their similarity to zero to ensure that they are not linked in the graph). In this way, we have made use of the negative correlation information for annotation refinement based on spectral embedding. In future work, we will look into other possible ways to make use of the negative correlation information for image annotation.

The goal of spectral embedding is to represent each vertex in the graph as a lower dimensional vector that preserves the similarities between the vertex pairs. Actually, this is equivalent to finding the leading eigenvectors of the normalized graph Laplacian constructed from $A$ and $D$, where $D$ is a diagonal matrix with its $(j, j)$-element equal to the sum of the $j$th row of the affinity matrix $A$.


Fig. 4. Illustration of annotation refinement in the spectral embedding space: (a) an example image associated with the ground truth, refined, and unrefined annotations (the incorrect keywords are red-highlighted); (b) annotation refinement based on linear neighborhoods. Here, $\mathcal{N}(\cdot)$ denotes the set of top 7 keywords that are most highly correlated with a keyword, and in this neighborhood the keywords that also belong to the ground truth annotations of the image are blue-highlighted.

In this paper, we only consider this type of normalized Laplacian [19], regardless of other normalized versions [18]. Let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_K$ be the set of eigenvalues and $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_K$ the associated eigenvectors of the normalized Laplacian. The spectral embedding of the graph can be represented by

$$E = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_K] \qquad (16)$$

where the $j$th row of $E$ is the new representation for vertex (keyword) $w_j$. Since we usually set $K \ll M$, the annotation keywords have actually been represented as lower dimensional vectors. In the following, we will present our approach to annotation refinement using this more descriptive representation.
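A small sketch of the contextual spectral embedding is given below: PPM correlations between keywords are computed from the training annotation counts, negative entries are set to zero as described above, and the keywords are embedded with the smallest nontrivial eigenvectors of a normalized graph Laplacian. The symmetric normalization used here is one common choice and is an assumption, not necessarily the exact normalization of [19] adopted in the paper.

    import numpy as np

    def contextual_embedding(counts, dim=20):
        # counts: (N, M) matrix, counts[i, j] = occurrences of keyword j in image i.
        A = np.nan_to_num(np.corrcoef(counts, rowvar=False))   # PPM correlations (15)
        A[A < 0] = 0.0                                          # unlink negatively correlated keywords
        d = A.sum(axis=1) + 1e-12
        L = np.eye(A.shape[0]) - (A / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
        vals, vecs = np.linalg.eigh(L)                          # eigenvalues in ascending order
        return vecs[:, 1:dim + 1]                               # row j embeds keyword j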

B. Annotation Refinement

To exploit the semantic context of annotation keywords for annotation refinement, the confidence scores of a query image estimated by our contextual keyword propagation can be adjusted based on linear neighborhoods in the new embedding space. The corresponding algorithm is summarized as follows.

Step 1) Find the smallest nontrivial eigenvectors and the associated eigenvalues of the normalized graph Laplacian of $A$. Here, $A$ is the PPM correlation matrix (with negative entries set to zero).
Step 2) Form the embedding matrix $E$ from these (suitably weighted) eigenvectors, and normalize each row of $E$ to have unit length. Here, the $j$th row of $E$ is a new feature vector for keyword $w_j$.
Step 3) Compute the new affinity matrix $\tilde{A}$ between keywords as the inner products of these row vectors. Here, if an entry of $\tilde{A}$ is negative, we set it to zero to ensure that $\tilde{A}$ is nonnegative.
Step 4) Adjust the confidence scores of each query image so that the score of each keyword is combined with the scores of its most highly correlated keywords, weighted by the new affinity $\tilde{A}$. Here, a weight parameter controls this combination, $\mathcal{N}(w_j)$ is the set of top keywords that are most highly correlated with keyword $w_j$ in the embedding space, and the combined scores are those predicted by our contextual keyword propagation.
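The refinement can be sketched as follows: the rows of the embedding are normalized to unit length, the new keyword affinity is taken as their inner products with negative values zeroed, and each confidence score is smoothed over its most highly correlated keywords. The linear mixing rule and the weight beta are assumptions standing in for the exact adjustment formula of Step 4).

    import numpy as np

    def refine_scores(f, E, top=7, beta=0.5):
        # f: (M,) CKP confidence scores; E: (M, dim) keyword embedding.
        f = np.asarray(f, dtype=float)
        R = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)   # unit-length rows (Step 2)
        A_new = R @ R.T                                              # new keyword affinity (Step 3)
        A_new[A_new < 0] = 0.0
        f_new = np.zeros_like(f)
        for j in range(len(f)):
            nbrs = np.argsort(-A_new[j])[1:top + 1]                  # most correlated keywords
            f_new[j] = (1 - beta) * f[j] + beta * np.dot(A_new[j, nbrs], f[nbrs])
        return f_new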

It is worth noting that Step 2) slightly differs from (16). Here, we aim to achieve better refinement results through preprocessing (i.e., weighting and normalizing) the new feature vectors. Moreover, in Step 4), we perform annotation refinement based on linear neighborhoods in the new embedding space, as illustrated in Fig. 4. More importantly, the example given by this figure presents a detailed explanation of how the semantic context of keywords encoded in the new embedding space is used to refine the annotations of the image. Before refinement, the three keywords "yellow_lines", "people", and "pole" are ranked according to their predicted confidence scores as follows: "pole" > "people" > "yellow_lines". Hence, the two keywords "people" and "pole" are incorrectly attached to the image, while the ground truth annotation "yellow_lines" is wrongly discarded. However, we can find that the keyword "yellow_lines" is highly semantically correlated with the ground truth annotations of the image (see the five blue-highlighted keywords in $\mathcal{N}$("yellow_lines") shown in Fig. 4). This semantic context is further exploited here for annotation refinement, and the confidence score of "yellow_lines" is accordingly increased to the largest among the three keywords, i.e., this keyword can now be annotated correctly.


Fig. 5. Some annotated examples selected from UW (first row), Corel (second row), and IAPR (third row) image datasets.

On the contrary, since the keyword "people" is not at all semantically correlated with the ground truth annotations of the image, its confidence score is decreased to the smallest value and it is discarded successfully by our annotation refinement. Additionally, as for the keyword "pole", although not included in the ground truth annotations of the image, we can still consider that this keyword is semantically correlated with the image (see the three blue-highlighted keywords in $\mathcal{N}$("pole") shown in Fig. 4).

The above algorithm for annotation refinement by contextual spectral embedding is denoted as CSE in the following. The time complexity of CSE for refining the annotations of a single query image is low [the main cost is the spectral embedding in Step 1)]. Since the number of annotation keywords is modest compared to the number of images, our algorithm is very efficient even for a large image dataset (see the later experiments on the IAPR dataset). Moreover, our algorithm for annotation refinement has another distinct advantage. That is, besides the semantic context of annotation keywords captured from the training images using the PPM correlation measure, other types of semantic context derived from prior knowledge (e.g., ontology) can also be readily exploited for annotation refinement by incorporating them into graph construction.

VI. EXPERIMENTAL RESULTS

Here, our SSK combined with CKP and CSE (i.e., SSK+CKP+CSE) is compared to three other representative methods for image annotation: 1) spatial pyramid matching (SPM) [28] combined with CKP and CSE (i.e., SPM+CKP+CSE); 2) PLSA [29] combined with CKP and CSE (i.e., PLSA+CKP+CSE); and 3) multiple Bernoulli relevance models (MBRM) [7] combined with CSE (i.e., MBRM+CSE). Moreover, we also make a comparison between annotation using the semantic context and that without using the semantic context. These two groups of comparison are carried out over three image datasets: University of Washington (UW),1 Corel [20], and IAPR TC-12 [14]. Some annotated examples selected from these image datasets are shown in Fig. 5.

A. Experimental Setup

Our annotation method is tested on three standard image datasets. The first image dataset comes from the University of Washington (UW) and contains 1109 images annotated with 338 keywords. Each image is annotated with 1–13 keywords. The images are of the size 378 × 252 pixels. The second image dataset is Corel [20], which consists of 5000 images annotated with 371 keywords. Each image is annotated with 1–5 keywords. The images are of the size 384 × 256 pixels. This image dataset has been widely used for the evaluation of image annotation in previous work, e.g., [7], [21]. The third image dataset is IAPR TC-12 [14], which contains 20 000 images annotated with 275 keywords. Each image is annotated with 1–18 keywords. The images are of the size 480 × 360 pixels. It is worth noting that the task of image annotation is very challenging on such a large image dataset.

For the three image datasets, we first divide images into blocks on a regular grid, and the size of the blocks is empirically selected: 64 × 64 pixels for MBRM just as in [7], but 8 × 8 pixels for the three annotation methods that adopt our CKP. Furthermore, we extract a 30-D feature vector from each block: six color features (block color average and standard deviation) and 24 texture features (average and standard deviation of Gabor outputs over three scales and four orientations). Here, it should be noted that these feature vectors are directly used by MBRM and the computational cost in the annotation process thus becomes extremely large, while this problem can be solved by the other three methods that adopt our CKP through first quantizing these feature vectors into visual words. In this paper, we consider a moderate vocabulary size for the three image datasets.

1http://www.cs.washington.edu/research/imagedatabase/groundtruth/


Fig. 6. Effect of different parameters on the annotation performance measured by $F_1$ for the UW image dataset. (a) Varying the neighborhood size $k$. (b) Varying the length of subsequences $l$. (c) Varying the scale $S$.

TABLE I. RESULTS OF ANNOTATION USING VISUAL AND SEMANTIC CONTEXT ON THE UW IMAGE DATASET

In the experiments, we divide the UW image dataset randomly into 909 training images and 200 test images, and annotate each test image with the top seven keywords. For the Corel image dataset, we split it into 4500 training images and 500 test images just as in [20], and annotate each test image with the top five keywords. The IAPR image dataset is partitioned into 16 000 training images and 4000 test images, and each test image is annotated with the top nine keywords. After splitting the datasets, as with previous work, we evaluate the obtained annotations of the test images through the process of retrieving these test images with a single keyword. For each keyword, the number of correctly annotated images is denoted as $N_c$, the number of retrieved images is denoted as $N_a$, and the number of truly related images in the test set is denoted as $N_t$. Then, the recall, precision, and $F_1$ measures are computed as follows:

$$\text{recall} = \frac{N_c}{N_t}, \qquad \text{precision} = \frac{N_c}{N_a} \qquad (17)$$

$$F_1 = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \qquad (18)$$

which are further averaged over all of the keywords in the test set. Besides, we give a measure to evaluate the coverage of correctly annotated keywords, i.e., the number of keywords with recall greater than zero, which is denoted by "# keywords". This measure is important because a biased model can achieve high precision and recall values by only performing quite well on a small number of common keywords.
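For completeness, the retrieval-based evaluation in (17) and (18) can be computed as in the following sketch, which takes binary prediction and ground-truth matrices over the test set, averages recall, precision, and F1 over all keywords, and counts the keywords with nonzero recall.

    import numpy as np

    def evaluate(pred, truth):
        # pred, truth: (num_test_images, M) binary annotation matrices.
        n_correct = (pred * truth).sum(axis=0).astype(float)   # N_c per keyword
        n_retrieved = pred.sum(axis=0)                          # N_a per keyword
        n_relevant = truth.sum(axis=0)                          # N_t per keyword
        recall = np.divide(n_correct, n_relevant,
                           out=np.zeros_like(n_correct), where=n_relevant > 0)
        precision = np.divide(n_correct, n_retrieved,
                              out=np.zeros_like(n_correct), where=n_retrieved > 0)
        f1 = np.divide(2 * recall * precision, recall + precision,
                       out=np.zeros_like(n_correct), where=(recall + precision) > 0)
        return recall.mean(), precision.mean(), f1.mean(), int((recall > 0).sum())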

Since the solution returned by our CKP algorithm is dependent only on the relative order of the weights of the annotation keywords, we only need to sort the weights without providing their exact values. One straightforward way is to order the weights to be in the reverse order of keyword frequency, namely $c_j \propto 1/t_j$, where $t_j$ is the frequency of the $j$th keyword in the training set. Moreover, according to Fig. 6(a), we choose the neighborhood size $k$ for our CKP algorithm on the UW dataset; here, we can find that our CKP algorithm is not sensitive to this parameter. Finally, according to Fig. 6(b) and (c), we choose the subsequence length $l$ for our SSK and the scale $S$ for both SSK and SPM on the UW dataset. The other parameters are also set to their respective optimal values similarly.

B. Results on UW Image Dataset

The results of annotation using visual and semantic context are averaged over ten random partitions of the UW image dataset and then listed in Table I. We can observe that our annotation method (i.e., SSK+CKP+CSE) performs much better than all of the other three methods. This observation may be due to the fact that our method not only exploits the context of annotation keywords for keyword propagation and annotation refinement but also captures the context of visual words within images to define the similarity between images for keyword propagation. That is, we have successfully exploited both visual and semantic context for image annotation. Particularly, as compared with MBRM, which propagates a single keyword independently of the other keywords, our annotation method leads to a 23% gain on the $F_1$ measure through contextual keyword propagation using our spatial spectrum kernel.

We make further observations on the three methods that adopt CKP for image annotation. It is shown in Table I that both SSK and SPM achieve better results than PLSA, which does not consider the context of visual words within images. Moreover, since SPM can only capture the global context of visual words, our SSK performs better than SPM due to the fact that both local and global context are used to define the similarity between images. These observations show that the context of visual words indeed helps to improve the annotation performance of keyword propagation.

More importantly, to demonstrate the gain of exploiting the semantic context of annotation keywords for image annotation,


Fig. 7. Comparison between annotation using the semantic context (CKP and Refined) and that without using the semantic context (KP and Unrefined) on the UW image dataset. (a) Keyword propagation versus contextual keyword propagation. (b) Unrefined annotation versus refined annotation by contextual spectral embedding.

TABLE II. RESULTS OF ANNOTATION USING VISUAL AND SEMANTIC CONTEXT ON THE COREL IMAGE DATASET

Fig. 8. Comparison between annotation using the semantic context (CKP and Refined) and that without using the semantic context (KP and Unrefined) on the Corel image dataset. (a) Keyword propagation versus contextual keyword propagation. (b) Unrefined annotation versus refined annotation by contextual spectral embedding.

we also compare annotation using this semantic context to annotation without using this semantic context. The comparison is shown in Fig. 7. Here, keyword propagation given by (8) is denoted as KP (without using the semantic context), while our proposed contextual keyword propagation is denoted as CKP (using the semantic context). Meanwhile, the refined annotation results by contextual spectral embedding are denoted as Refined (using the semantic context), while the annotation results before refinement are denoted as Unrefined (without using the semantic context). We can observe from Fig. 7 that the semantic context of annotation keywords plays an important role in both keyword propagation and annotation refinement.

C. Results on Corel Image Dataset

The results of annotation using visual and semantic context on the Corel image dataset are listed in Table II. From this table, we can draw similar conclusions (compared to Table I). Through exploiting both visual and semantic context for keyword propagation and annotation refinement, our method still performs the best on this image dataset. Moreover, our method is also compared with more recent state-of-the-art methods [21], [32] using their own reported results. As shown in Table II, our method outperforms [21], [32] because both visual and semantic context are used for learning the semantics of images. To the best of our knowledge,


TABLE III. RESULTS OF ANNOTATION USING VISUAL AND SEMANTIC CONTEXT ON THE IAPR IMAGE DATASET

the results reported in [32] are the best in the literature. However, our method can still achieve an 8% gain on the $F_1$ measure over this method. More importantly, we show in Fig. 8 the comparison between annotation using the semantic context of annotation keywords and annotation without using this semantic context. We can similarly find that this semantic context plays an important role in both keyword propagation and annotation refinement on this image dataset.

D. Results on IAPR Image Dataset

To verify that our annotation method is scalable to large image datasets, we present the annotation results on the IAPR dataset in Table III. In the experiments, we do not compare our annotation method with PLSA and MBRM, since PLSA needs huge memory and MBRM incurs a large time cost when the data size is 20 000. From Table III, we find that our SSK can achieve a 17% gain over SPM (see SSK+CKP+CSE versus SPM+CKP+CSE). That is, the visual context captured by our method indeed helps to improve the annotation performance. Moreover, we also find that both our CKP and CSE can achieve improved results by exploiting the semantic context of annotation keywords. Another distinct advantage of these kernel and spectral methods is that they are very scalable with respect to the data size. The time taken by our CKP and CSE on the large IAPR dataset is 21 and 1 min, respectively. We run these two algorithms (Matlab code) on a PC with a 2.33 GHz CPU and 2 GB RAM.

VII. CONCLUSION

We have proposed contextual kernel and spectral methods for learning the semantics of images in this paper. To capture the context of visual words within images, we first define a spatial string kernel to measure the similarity between images. Based on this spatial string kernel, we further formulate automatic image annotation as a contextual keyword propagation problem, which can be solved very efficiently by linear programming. Different from the traditional relevance models that treat each keyword independently, our contextual kernel method considers the semantic context of annotation keywords and propagates multiple keywords simultaneously from the training images to the test images. More importantly, such semantic context can also be incorporated into spectral embedding for refining the annotations of images predicted by keyword propagation. Experiments on three standard image datasets demonstrate that our contextual kernel and spectral methods can achieve superior results. In future work, these kernel and spectral methods will be extended to the temporal domain for problems such as video semantic learning and retrieval. Moreover, since our contextual kernel and spectral methods are very general techniques, they will be adopted to improve the performance of other machine learning methods that are widely used in computer vision and image processing.

REFERENCES

[1] R. Zhang and Z. Zhang, "Effective image retrieval based on hidden concept discovery in image database," IEEE Trans. Image Process., vol. 16, no. 2, pp. 562–572, Feb. 2007.
[2] J. Li and J. Wang, "Automatic linguistic indexing of pictures by a statistical modeling approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1075–1088, Sep. 2003.
[3] Y. Gao, J. Fan, X. Xue, and R. Jain, "Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers," in Proc. ACM Multimedia, 2006, pp. 901–910.
[4] E. Chang, G. Kingshy, G. Sychay, and G. Wu, "CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 26–38, Jan. 2003.
[5] J. Jeon, V. Lavrenko, and R. Manmatha, "Automatic image annotation and retrieval using cross-media relevance models," in Proc. SIGIR, 2003, pp. 119–126.
[6] V. Lavrenko, R. Manmatha, and J. Jeon, "A model for learning the semantics of pictures," Adv. Neural Inf. Process. Syst., vol. 16, pp. 553–560, 2004.
[7] S. Feng, R. Manmatha, and V. Lavrenko, "Multiple Bernoulli relevance models for image and video annotation," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., 2004, vol. 2, pp. 1002–1009.
[8] J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, and S. Ma, "Dual cross-media relevance model for image annotation," in Proc. ACM Multimedia, 2007, pp. 605–614.
[9] Z. Lu and H. Ip, "Generalized relevance models for automatic image annotation," in Proc. Pacific Rim Conf. Multimedia, 2009, pp. 245–255.
[10] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf, "Ranking on data manifolds," Adv. Neural Inf. Process. Syst., vol. 16, pp. 169–176, 2004.
[11] J. Liu, M. Li, W. Ma, Q. Liu, and H. Lu, "An adaptive graph model for automatic image annotation," in Proc. ACM Int. Workshop Multimedia Inf. Retrieval, 2006, pp. 61–70.
[12] Z. Lu, H. Ip, and Q. He, "Context-based multi-label image annotation," in Proc. ACM Int. Conf. Image Video Retrieval, 2009, pp. 1–7.
[13] C. Leslie, E. Eskin, and W. Noble, "The spectrum kernel: A string kernel for SVM protein classification," in Proc. Pacific Symp. Biocomputing, 2002, pp. 566–575.
[14] H. Escalante, C. Hernández, J. Gonzalez, A. López-López, M. Montes, E. Morales, L. Sucar, L. Villasenor, and M. Grubinger, "The segmented and annotated IAPR TC-12 benchmark," Comput. Vis. Image Underst., vol. 114, no. 4, pp. 419–428, 2010.
[15] F. Kang, R. Jin, and R. Sukthankar, "Correlated label propagation with application to multi-label learning," in Proc. CVPR, 2006, pp. 1719–1726.
[16] R. Parker and R. Rardin, Discrete Optimization. New York: Academic, 1988.
[17] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.
[18] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Adv. Neural Inf. Process. Syst., vol. 14, pp. 849–856, 2002.
[19] S. Lafon and A. Lee, "Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1393–1403, Sep. 2006.
[20] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," in Proc. ECCV, 2002, pp. 97–112.
[21] J. Liu, M. Li, Q. Liu, H. Lu, and S. Ma, "Image annotation via graph learning," Pattern Recognit., vol. 42, no. 2, pp. 218–228, 2009.
[22] S. Zhu, X. Ji, W. Xu, and Y. Gong, "Multi-labelled classification using maximum entropy method," in Proc. SIGIR, 2005, pp. 274–281.
[23] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor, "On maximum margin hierarchical multi-label classification," in Proc. NIPS Workshop Learning Structured Outputs, 2004, pp. 1–4.
[24] C. Wang, F. Jing, L. Zhang, and H.-J. Zhang, "Image annotation refinement using random walk with restarts," in Proc. ACM Multimedia, 2006, pp. 647–650.
[25] R. Behmo, N. Paragios, and V. Prinet, "Graph commute times for image representation," in Proc. CVPR, 2008, pp. 1–8.
[26] J. Li, W. Wu, T. Wang, and Y. Zhang, "One step beyond histograms: Image representation using Markov stationary features," in Proc. CVPR, 2008, pp. 1–8.
[27] A. Holub, M. Welling, and P. Perona, "Hybrid generative-discriminative object recognition," Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 239–258, 2008.
[28] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. CVPR, 2006, pp. 2169–2178.
[29] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Mach. Learn., vol. 41, no. 1–2, pp. 177–196, 2001.
[30] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, no. 4–5, pp. 993–1022, 2003.
[31] J. Rodgers and W. Nicewander, "Thirteen ways to look at the correlation coefficient," Amer. Stat., vol. 42, no. 1, pp. 59–66, Feb. 1988.
[32] A. Makadia, V. Pavlovic, and S. Kumar, "A new baseline for image annotation," in Proc. ECCV, 2008, pp. 316–329.

Zhiwu Lu received the M.Sc. degree in applied mathematics from Peking University, Beijing, China, in 2005. He is currently working toward the Ph.D. degree in the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.

From July 2005 to August 2007, he was a Software Engineer with Founder Corporation, Beijing, China. From September 2007 to June 2008, he was a Research Assistant with the Institute of Computer Science and Technology, Peking University, Beijing, where he is currently an Assistant Professor. He has authored or coauthored over 30 papers in international journals and conference proceedings. His research interests lie in pattern recognition, machine learning, multimedia information retrieval, and computer vision.

Horace H. S. Ip received the B.Sc. (first-class honors) degree in applied physics and the Ph.D. degree in image processing from University College London, London, U.K., in 1980 and 1983, respectively.

Currently, he is the Chair Professor of computer science, the Founding Director of the Centre for Innovative Applications of Internet and Multimedia Technologies (AIMtech Centre), and the Acting Vice-President of City University of Hong Kong, Kowloon, Hong Kong. He has authored or coauthored over 200 papers in international journals and conference proceedings. His research interests include pattern recognition, multimedia content analysis and retrieval, virtual reality, and technologies for education.

Prof. Ip is a Fellow of the Hong Kong Institution of Engineers, the U.K. Institution of Electrical Engineers, and the IAPR.

Yuxin Peng received the Ph.D. degree in computer science and technology from Peking University, Beijing, China, in 2003.

He joined the Institute of Computer Science and Technology, Peking University, as an Assistant Professor in 2003 and was promoted to Professor in 2010. From 2003 to 2004, he was a Visiting Scholar with the Department of Computer Science, City University of Hong Kong. His current research interests include content-based video retrieval, image processing, and pattern recognition.

