Pattern Recognition 38 (2005) 865–885
www.elsevier.com/locate/patcog

Statistical modeling and conceptualization of natural images ☆

Jianping Fan a,∗, Yuli Gao a, Hangzai Luo a, Guangyou Xu b

a Department of Computer Science, University of North Carolina, 9201 Univ. City Blvd., Charlotte, NC 28223, USA

b Department of Computer Science, Tsinghua University, Beijing, PR China

Received 1 November 2003; received in revised form 30 June 2004; accepted 1 July 2004

Abstract

Multi-level annotation of images is a promising solution for enabling semantic image retrieval with various keywords at different semantic levels. In this paper, we propose a multi-level approach to interpret and annotate the semantics of natural images by using both the dominant image components and the relevant semantic image concepts. In contrast to the well-known image-based and region-based approaches, we use the concept-sensitive salient objects as the dominant image components to achieve automatic image annotation at the content level. By using the concept-sensitive salient objects for image content representation and feature extraction, a novel image classification technique is developed to achieve automatic image annotation at the concept level. To detect the concept-sensitive salient objects automatically, a set of detection functions is learned from the labeled image regions by using support vector machine (SVM) classifiers with an automatic scheme for searching the optimal model parameters. To generate the semantic image concepts, finite mixture models are used to approximate the class distributions of the relevant concept-sensitive salient objects. An adaptive EM algorithm has been proposed to determine the optimal model structure and model parameters simultaneously. In addition, a large number of unlabeled samples have been integrated with a limited number of labeled samples to achieve more effective classifier training and knowledge discovery. We have also demonstrated that our algorithms are very effective in enabling multi-level interpretation and annotation of natural images.
© 2004 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Semantic image classification; Salient object detection; Adaptive EM algorithm; SVM

☆ This project is supported by National Sciences Foundation under 0208539-IIS and the grant from AO Research Foundation, Switzerland. The work of Prof. Xu is supported by Chinese National Sciences Foundation under 60273005.

∗ Corresponding author.
E-mail addresses: [email protected] (J. Fan), [email protected] (Y. Gao), [email protected] (H. Luo), [email protected] (G. Xu).

0031-3203/$30.00 © 2004 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2004.07.011

1. Introduction

As high-resolution digital cameras become more affordable and widespread, high-quality digital natural images have exploded on the Internet. With the exponential growth of high-quality digital natural images, the need for semantic image classification is becoming increasingly important to support effective image database indexing and retrieval at the semantic level [1–6].

As shown in Fig. 1, the semantic image similarity can be categorized into two major classes [7]: (a) similar image components (e.g., similar image objects such as sky, grass [28–35]) or similar global visual properties (e.g., similar global configurations such as openness, naturalness [27]); (b) similar semantic image concepts such as garden, beach, mountain view, or similar abstract image concepts (e.g., image events such as sailing, skiing) [8–26].

Fig. 1. Human beings interpret the semantics of images based on: (a) certain types of the concept-sensitive salient objects, (b) the global configuration among the concept-sensitive salient objects.

To achieve the first class of semantic image similarity, it is very important to achieve a middle-level understanding of the semantics of image contents [28–35]. To achieve the second class of semantic image similarity, semantic image classification has been reported as a promising approach, but its performance largely depends on two key issues [1]: (1) the effectiveness of the visual patterns used for image content representation and feature extraction; (2) the significance of the algorithms for semantic image concept modeling and classifier training. Many techniques for semantic image classification have been proposed in the literature [8–26]. However, few existing works have provided a good framework that addresses the following inter-related problems jointly:

1.1. Quality of features

The success of most existing techniques for semantic image classification is often limited and largely depends on the discrimination power of the low-level visual features, because of the semantic gap. On the other hand, the discrimination power of the low-level visual features largely depends on the effectiveness of the underlying visual patterns that are selected for image content representation and feature extraction. Two approaches are widely used for image content representation and feature extraction: (1) image-based approaches that treat whole images as the individual visual patterns for feature extraction; (2) region-based or Blob-based approaches that take the homogeneous image regions, or connected homogeneous image regions with the same color or texture (i.e., Blobs), as the underlying visual patterns for feature extraction. Both approaches have yielded successes as well as negative evidence [27–35].

One common weakness of the region-based approaches is that the homogeneous image regions have little correspondence to the semantic image concepts; thus they are not effective for multi-class image classification [28–35]. Because they do not require image segmentation, the image-based approaches are very attractive for enabling a low-cost framework for feature extraction and image classification [20,21,24,25,27]. Since only the global visual properties are used for image content representation [27], however, the image-based approaches may not work very well for images that contain individual objects, especially when those individual objects are used by human beings to interpret the semantics of images, as shown in Fig. 1a [28–35].

To enhance the quality of features for discriminating between different semantic image concepts, we need some means of detecting suitable semantic-sensitive visual patterns so that a middle-level understanding of the semantics of image contents can be achieved effectively.

1.2. Semantic image concept modeling and interpretation

In order to interpret the semantic image concepts intuitively, it is very important to explore the contextual relationships and joint effects among the relevant concept-sensitive salient objects. However, no existing work has addressed what kind of image context integration model can be used to link the concept-sensitive salient objects to the most relevant semantic image concepts. Statistical image modeling is a promising approach to formulate the interpretations of the semantic image concepts quantitatively by using generative or discriminative models [15,45]. However, no existing work has addressed the origin of the concept models and the mathematical spaces for the concept models.

1.3. Semantic image classification

Semantic image classification plays an important role in achieving the second class of semantic image similarity, and many techniques have been proposed in the past [8–26]. The limitation of pages does not allow us to survey all these works; instead, we emphasize those that are most relevant to our proposed work. A Bayesian framework has been developed to enable binary semantic classification of vacation images by using image-based global visual features and vector quantization for density estimation [20]. SIMPLIcity was reported as a binary image classification system that uses region-based visual features and a distance-based nearest neighbor classifier [19]. By using image Blobs for content representation, Barnard et al. have also developed a Bayesian framework for interpreting the semantics of images [10–12]. Without detecting and recognizing the individual objects, Lipson et al. have presented a configuration-based technique for image classification that explores the spatial relationships and arrangements of various image regions in an image [21]. Schyns and Oliva have developed a novel approach for natural image understanding and classification by treating the discriminant power spectrum templates in the low-frequency domain as the concept-sensitive visual patterns [27], but this technique may fail for natural images that contain individual objects which human beings use to interpret the semantics of images, as shown in Fig. 1a. As mentioned above, image semantics can be described at multiple levels (i.e., both the content level and the concept level). Thus a good image classification and annotation scheme should enable the interpretation and annotation of both the dominant image components and the relevant semantic image concepts. However, few existing works have achieved such multi-level annotation of images [2,3,8].

Based on these observations, we propose a novel framework to enable more effective interpretation of semantic image concepts and multi-level annotation of natural images. This paper is organized as follows: Section 2 presents a novel framework for semantic-sensitive image content representation by using the concept-sensitive salient objects; Section 3 proposes a new framework for semantic image concept interpretation that uses the finite mixture models to approximate the class distributions of the relevant concept-sensitive salient objects; Section 4 introduces an automatic technique for salient object detection; Section 5 presents a novel algorithm for incremental classifier training that uses an adaptive EM algorithm for parameter estimation and model selection; Section 6 describes the benchmark environment for evaluating our techniques for semantic image classification and interpretation; we conclude in Section 7.

2. Semantic-sensitive image content representation

As mentioned above, the quality of features largely depends on the underlying visual patterns that are selected for image content representation and feature extraction. The visual features that are extracted from whole images or homogeneous image regions may not be effective for discriminating between different semantic image concepts, because the homogeneous image regions have little correspondence to the semantic image concepts and a single image may comprise multiple semantic image concepts under different vision purposes. Thus a good framework for semantic-sensitive image content representation should be able to achieve a middle-level understanding of the semantics of image contents and enhance the quality of features for discriminating between different semantic image concepts [28–35].

In order to enhance the quality of features, we propose a novel framework that uses the concept-sensitive salient objects to enable a more expressive representation of image contents. The concept-sensitive salient objects are defined as the dominant image components that are semantic to human beings and visually distinguishable [28–35], or the global visual properties of whole images that can be identified by using the spectrum templates in the frequency domain [27]. For example, the concept-sensitive salient object "sky" is defined as the connected image regions with large sizes (i.e., dominant image components) that are related to the human semantics "sky". The concept-sensitive salient objects that are related to the global visual properties in the frequency domain can be obtained easily by using the wavelet transformation [27]. In the following discussion, we focus on modeling and detecting the concept-sensitive salient objects that correspond to the visually distinguishable image components. In addition, the basic vocabulary of such concept-sensitive salient objects can be obtained from the taxonomy of the dominant image components of natural images, as shown in Fig. 2.

Since the concept-sensitive salient objects are semantic to human beings, they can act as a middle-level representation of image contents and break the semantic gap into two "smaller" and "bridgeable" gaps, as shown in Fig. 3: (a) Gap 1: the bridgeable gap between the low-level image signals and the concept-sensitive salient objects; (b) Gap 2: the bridgeable gap between the concept-sensitive salient objects and the relevant semantic image concepts.

Through this multi-level framework, the semantic gap can be bridged efficiently in two steps: (1) bridging Gap 1 by detecting the concept-sensitive salient objects automatically, with detection functions learned from the labeled image regions; (2) bridging Gap 2 by using the finite mixture models to approximate the class distributions of the relevant concept-sensitive salient objects.

3. Semantic image concept modeling and interpretation

In order to achieve automatic image annotation at the concept level, we have also proposed a novel technique for semantic image concept modeling and interpretation. To quantitatively interpret the contextual relationships between the semantic image concepts and the relevant concept-sensitive salient objects, we use the finite mixture model (FMM) to approximate the class distribution of the concept-sensitive salient objects that are relevant to a specific semantic image concept C_j:


[Fig. 2. Examples of the taxonomy of natural images: a tree rooted at "Nature" with branches such as "Sky" (clear, cloudy, blue), "Ground" (rock, water, grass, sand, floor) and "Foliage" (branch, floral, green foliage).]

[Fig. 3. The proposed image content representation and semantic context integration framework: images 1, ..., M are segmented into color/texture patterns 1, ..., R; automatic salient object detection bridges Gap 1 from these patterns to the salient object types 1, ..., Ne; learning for image classification bridges Gap 2 from the salient objects to the semantic image concepts 1, ..., Nc.]

P(X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) = \sum_{i=1}^{\kappa} P(X \mid S_i, \theta_{s_i})\, \omega_{s_i},   (1)

where P(X | S_i, θ_{s_i}) is the ith multivariate mixture component with n independent means and a common n × n covariance matrix, κ indicates the optimal number of multivariate mixture components, Θ_{c_j} = {θ_{s_i}, i = 1, ..., κ} is the set of the model parameters for these multivariate mixture components, Ω_{c_j} = {ω_{s_i}, i = 1, ..., κ} is the set of the relative weights among these multivariate mixture components, and X is the n-dimensional visual feature vector used to represent the relevant concept-sensitive salient objects. For example, the semantic image concept "beach scene" is related to at least three types (classes) of concept-sensitive salient objects, such as "sea water", "sky" and "beach sand", plus other hidden visual patterns.
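To make Eq. (1) concrete, here is a minimal NumPy/SciPy sketch of evaluating the finite mixture model for one concept; the component parameters are illustrative toy values, not values from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def concept_likelihood(x, weights, means, covariances):
    """Evaluate the finite mixture model of Eq. (1):
    P(X, C_j) = sum_i P(X | S_i, theta_si) * omega_si."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covariances))

# Toy concept with two mixture components (illustrative values only).
weights = [0.6, 0.4]                        # relative weights omega_si
means = [np.zeros(3), np.ones(3)]           # component means
covariances = [np.eye(3), 0.5 * np.eye(3)]  # covariance matrices
x = np.array([0.2, -0.1, 0.4])              # an n-dimensional feature vector X
print(concept_likelihood(x, weights, means, covariances))
```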

It is known that an image object has various appearances under different viewing conditions; thus its principal visual properties may look different under different lighting and capturing conditions [29,34,35]. For example, the concept-sensitive salient object "sky" has various appearances, such as the "blue sky pattern", "white (clear) sky pattern", "cloudy sky pattern", "dark (night) sky pattern", and "sunset/sunrise sky pattern", which have very different color and texture properties under different viewing conditions. Thus, the data distribution for a specific type of concept-sensitive salient object is approximated by multiple mixture components to accommodate the variability of the same type of concept-sensitive salient object (i.e., presence/absence of distinctive parts, variability in overall shape, changing visual properties due to lighting conditions, viewpoints, etc.).

The fundamental assumptions of our finite mixture models are: (a) there is a many-to-one correspondence between the multivariate mixture components and each type (class) of concept-sensitive salient object; (b) different types (classes) of the concept-sensitive salient objects are independent in their visual feature space.

For a specific semantic image concept, the optimal number of mixture components and their relative weights are acquired automatically through a machine learning process. Using the finite mixture models for probabilistic interpretation of the semantic image concepts makes it possible to maintain the variability (heterogeneity) among different semantic image concepts, and thus offers a number of additional theoretical advantages.

4. Automatic salient object detection

The objective of image analysis is to parse the natural images into the concept-sensitive salient objects in the basic vocabulary. Based on the basic vocabulary shown in Fig. 2, we have implemented 32 functions to detect 32 types of concept-sensitive salient objects in natural images, where each function detects a specific type of concept-sensitive salient object in the basic vocabulary. Each detection function consists of three parts: (a) automatic image segmentation by using the mean shift technique [42]; (b) binary image region classification by using the SVM classifiers with an automatic scheme for searching the optimal model parameters [40,41]; (c) label-based aggregation of the connected similar image regions for salient object generation.

We use our detection function for the concept-sensitive salient object "beach sand" as an example to show how we design our detection functions. As shown in Fig. 4, the image regions with homogeneous color or texture are first obtained by using the mean shift technique [42]. Since the visual properties of a certain type of concept-sensitive salient object may look different under different lighting and capturing conditions [29], a single image is insufficient to represent its principal visual characteristics. Thus this automatic image segmentation procedure is performed on a set of training images which contain the concept-sensitive salient object "beach sand".

The homogeneous image regions in the training images that are related to the concept-sensitive salient object "beach sand" are selected and labeled as the training samples through human interaction. It is worth noting that the homogeneous image regions related to the same type of concept-sensitive salient object may have different color or texture properties. Region-based low-level visual features are extracted to characterize the principal visual properties of these labeled image regions that are explicitly related to the concept-sensitive salient object "beach sand": a 1-dimensional coverage ratio (i.e., density ratio) for a coarse shape representation; 6-dimensional region locations (i.e., 2 dimensions for the region center and 4 dimensions for the rectangular bounding box of a coarse shape representation); 7-dimensional LUV dominant colors and color variances (i.e., overall colors and color purity); 14-dimensional Tamura texture; and 28-dimensional wavelet texture/color features. The 6-dimensional region locations are used to determine the spatial contexts among different types of concept-sensitive salient objects and to increase the expressiveness [34]. The spatial context refers to the relationship and arrangement of the concept-sensitive salient objects in a natural image. The full feature vector can be assembled as sketched below.
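The 56-dimensional region descriptor above is a simple concatenation of the five feature groups; here is a sketch with hypothetical extractor stubs (the paper describes these descriptors but does not publish the extractors, so zeros stand in for real values):

```python
import numpy as np

# Hypothetical extractor stubs; replace with real descriptor code.
def extract_coverage_ratio(region):   return np.zeros(1)   # density ratio
def extract_location(region):         return np.zeros(6)   # center + bounding box
def extract_luv_color(region):        return np.zeros(7)   # dominant colors + variances
def extract_tamura_texture(region):   return np.zeros(14)  # Tamura texture
def extract_wavelet_features(region): return np.zeros(28)  # wavelet texture/color

def region_feature_vector(region):
    """Concatenate the descriptors listed above: 1 + 6 + 7 + 14 + 28 = 56 dims."""
    return np.concatenate([
        extract_coverage_ratio(region), extract_location(region),
        extract_luv_color(region), extract_tamura_texture(region),
        extract_wavelet_features(region)])

assert region_feature_vector(None).shape == (56,)
```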

We use the one-against-all rule to label the training samples Ω_{g_j} = {X_l, L_j(X_l) | l = 1, ..., N}: positive samples for the specific concept-sensitive salient object "beach sand", and negative samples. Each labeled training sample is a pair (X_l, L_j(X_l)) that consists of a set of region-based low-level visual features X_l and the semantic label L_j(X_l) for the corresponding homogeneous image region.

The image region classifier is learned from these labeled training samples. We use the well-known SVM classifiers for binary image region classification [40,41]. Consider a binary classification problem with a linearly separable sample set Ω_{g_j} = {X_l, L_j(X_l) | l = 1, ..., N}, where the semantic label L_j(X_l) for the homogeneous image region with visual features X_l is either +1 or −1. For the positive samples X_l with L_j(X_l) = +1, there exist transformation parameters ω and b such that ω · X_l + b ≥ +1. Similarly, for the negative samples X_l with L_j(X_l) = −1, we have ω · X_l + b ≤ −1. The margin between these two supporting planes is 2/‖ω‖₂. The SVM classifier is then designed to maximize the margin subject to the constraints ω · X_l + b ≥ +1 for the positive samples and ω · X_l + b ≤ −1 for the negative samples.

Given the training set Ω_{g_j} = {X_l, L_j(X_l) | l = 1, ..., N}, the margin maximization procedure is transformed into the following optimization problem:

\arg\min_{\omega, b, \xi} \; \frac{1}{2}\,\omega^{T}\omega + C \sum_{l=1}^{N} \xi_l, \quad \text{subject to } L_j(X_l)\,(\omega \cdot \phi(X_l) + b) \ge 1 - \xi_l,   (2)

where ξ_l ≥ 0 represents the training error, C > 0 is the penalty parameter that trades off the training error against the regularization term ω^{T}ω/2, φ(X_l) is the function that maps X_l into a higher-dimensional space, and the kernel function is defined as K(X_i, X_j) = φ(X_i)^{T} φ(X_j). In our current implementation, we select the radial basis function (RBF) kernel, K(X_i, X_j) = exp(−γ‖X_i − X_j‖²), γ > 0.

Fig. 4. The flowchart for automatic salient object extraction.

Fig. 5. The detection results of the concept-sensitive salient object "sunset/sunrise".

We have developed an efficient search algorithm to determine the optimal model parameters (C, γ) for the SVM classifiers: (a) the labeled image regions for a specific type of concept-sensitive salient object are partitioned into τ subsets of equal size, where τ − 1 subsets are used for classifier training and the remaining one is used for classifier validation; (b) our feature set for image region representation is first normalized, so that features with greater numeric ranges do not dominate those with smaller numeric ranges; because the inner product is usually used to calculate the kernel values, this normalization procedure also avoids numerical problems; (c) the numeric ranges of the parameters C and γ are exponentially partitioned into small pieces with M pairs, and for each pair, τ − 1 subsets are used to train the classifier model; when the M classifier models are available, cross-validation is used to determine the underlying optimal parameter pair (C, γ); (d) given the optimal parameter pair (C, γ), the final classifier model (i.e., support vectors) is trained again by using the whole training data set; (e) the spatial contexts among different types of concept-sensitive salient objects (i.e., coherence among different types of concept-sensitive salient objects) are also used to cope with over-segmented images [34,35].
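Steps (a)-(d) amount to a cross-validated grid search over exponentially spaced (C, γ) values with feature normalization; here is a sketch using scikit-learn (an anachronistic stand-in for the paper's own implementation) on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for labeled region features (+1 = "beach sand", -1 = negative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 56))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

# (b) normalize the features; (c) exponentially spaced (C, gamma) grid with
# cross-validation; (d) GridSearchCV refits the best pair on the whole set.
param_grid = {"svc__C": 2.0 ** np.arange(-3, 14, 2),
              "svc__gamma": 2.0 ** np.arange(-9, 4, 2)}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```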

We have currently implemented 32 detection functions for 32 types of concept-sensitive salient objects in natural images. If all these detection functions fail to detect the 32 types of concept-sensitive salient objects in a test image, the wavelet transformation is performed on the test image to obtain the 33rd type of concept-sensitive salient object, i.e., the spectrum templates in the frequency domain that represent the global visual properties of the test image.

Some results of our detection functions are shown in Figs. 5–8. From these experimental results, one can see that the concept-sensitive salient objects are visually distinguishable and that the principal visual properties of the dominant image components are expressively represented. As shown in Fig. 9, the mean shift technique often partitions a single object into multiple homogeneous image regions, with none of them being representative of the object; thus the homogeneous image regions have little correspondence to the semantic image concepts. On the other hand, the concept-sensitive salient objects have the capability to characterize the principal visual properties of the corresponding image object; thus using the concept-sensitive salient objects for image content representation and feature extraction can enhance the quality of features and result in more effective semantic image classification. In addition, the concept-sensitive salient objects are semantic to human beings, so the keywords for interpreting the concept-sensitive salient objects can also be used to achieve automatic image annotation at the content level.

Fig. 6. The detection results of the concept-sensitive salient object "sand field".

Fig. 7. The detection results of the concept-sensitive salient object "cat".

Fig. 8. The detection results of the concept-sensitive salient object "water".

Fig. 9. The concept-sensitive salient objects can enable more expressive representation of image contents: (a) original images, (b) homogeneous image regions, (c) concept-sensitive salient objects.

The optimal parameters (C, γ) for some detection functions are given in Table 1.

Table 1
The optimal parameters (C, γ) of some detection functions

Salient objects   Brown horse     Grass            Purple flower
C                 8               10               6
γ                 0.5             1.0              0.5

Salient objects   Red flower      Rock             Sand field
C                 32              32               8
γ                 0.125           2                2

Salient objects   Water           Human skin       Sky
C                 2               32               8192
γ                 0.5             0.125            0.03125

Salient objects   Snow            Sunset/sunrise   Waterfall
C                 512             8                32
γ                 0.03125         0.5              0.0078125

Salient objects   Yellow flower   Forest           Sail cloth
C                 8               32               64
γ                 0.5             0.125            0.125

Salient objects   Elephant        Cat              Zebra
C                 32              512              512
γ                 0.5             1.125            4

Precision ρ and recall ϱ are used to measure the average performance of our detection functions:

\rho = \frac{\pi}{\pi + \varepsilon}, \qquad \varrho = \frac{\pi}{\pi + \vartheta},   (3)

where π is the number of true positive samples that are related to the corresponding type of concept-sensitive salient object and are detected correctly, ε is the number of samples that are irrelevant to the corresponding type of concept-sensitive salient object but are detected incorrectly, and ϑ is the number of samples that are related to the corresponding type of concept-sensitive salient object but are mis-detected. The average performance of some detection functions is given in Table 2.
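A minimal sketch of Eq. (3), using counts for the three sample sets:

```python
def detection_precision_recall(correct, spurious, missed):
    """Eq. (3): precision rho = pi / (pi + eps), recall varrho = pi / (pi + theta),
    where pi, eps and theta are counts of correct detections, spurious
    detections and missed instances, respectively."""
    rho = correct / (correct + spurious)
    varrho = correct / (correct + missed)
    return rho, varrho

# e.g. 96 correct detections, 4 spurious ones, 2 misses:
print(detection_precision_recall(96, 4, 2))  # -> (0.96, 0.9795...)
```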

It is worth noting that the procedure for salient object detection is automatic; human interaction is only involved in labeling the training samples (i.e., homogeneous image regions) for learning the detection functions. After the concept-sensitive salient objects are extracted automatically from the images, a set of visual features is calculated to characterize their principal visual properties. These visual features include: a 1-dimensional coverage ratio (i.e., density ratio) for a coarse shape representation; 6-dimensional object locations (i.e., 2 dimensions for the object center and 4 dimensions for the rectangular bounding box of a coarse shape representation of the salient object); 7-dimensional LUV dominant colors and color variances (i.e., the overall color and color purity of a certain concept-sensitive salient object can be described in terms of the presence/absence of the dominant colors and the color variances); 14-dimensional Tamura texture; and 28-dimensional wavelet texture/color features.

Table 2
The average performance of some detection functions

Salient objects   Brown horse     Grass            Purple flower
ρ (%)             95.6            92.9             96.1
ϱ (%)             100             94.8             95.2

Salient objects   Red flower      Rock             Sand field
ρ (%)             87.8            98.7             98.8
ϱ (%)             86.4            100              96.6

Salient objects   Water           Human skin       Sky
ρ (%)             86.7            86.2             87.6
ϱ (%)             89.5            85.4             94.5

Salient objects   Snow            Sunset/sunrise   Waterfall
ρ (%)             86.7            92.5             88.5
ϱ (%)             87.5            95.2             87.1

Salient objects   Yellow flower   Forest           Sail cloth
ρ (%)             87.4            85.4             96.3
ϱ (%)             89.3            84.8             94.9

Salient objects   Elephant        Cat              Zebra
ρ (%)             85.3            90.5             87.2
ϱ (%)             88.7            87.5             85.4

5. Semantic image classification and annotation

For each semantic image concept, the semantic labels for a limited number of training samples are provided by human–computer interaction. We use the one-against-all rule to organize the labeled samples Ω^l_{c_j} = {X_l, C_j(S_l) | l = 1, ..., N_L} into positive samples for a specific semantic image concept C_j and negative samples. Each labeled sample is a pair (X_l, C_j(S_l)) that consists of a set of n-dimensional visual features X_l and the semantic label C_j(S_l) for the corresponding sample S_l. The unlabeled samples Ω^u_{c_j} = {X_k, S_k | k = 1, ..., N_u} can be used to achieve a better approximation of the class distribution and to select a more accurate model structure. For a certain semantic image concept C_j, we then define the mixture training sample set as Ω = Ω^l_{c_j} ∪ Ω^u_{c_j}.


5.1. Adaptive EM algorithm

Since the maximum likelihood estimate prefers complex models with more free parameters [20,36], a penalty term is added to determine the underlying optimal model structure. Thus the optimal model structure and parameters (κ̂, Θ̂_{c_j}, Ω̂_{c_j}) for a certain semantic image concept are determined by

(\hat{\kappa}, \hat{\Theta}_{c_j}, \hat{\Omega}_{c_j}) = \arg\max_{\kappa, \Theta_{c_j}, \Omega_{c_j}} \{ L(C_j, X, \kappa, \Theta_{c_j}, \Omega_{c_j}) \},   (4)

where L(C_j, X, κ, Θ_{c_j}, Ω_{c_j}) = Σ_{X_l ∈ Ω^l_{c_j}} log P(X_l, C_j, κ, Θ_{c_j}, Ω_{c_j}) + log p(κ, Θ_{c_j}, Ω_{c_j}) is the objective function, Σ_{X_l ∈ Ω^l_{c_j}} log P(X_l, C_j, κ, Θ_{c_j}, Ω_{c_j}) is the log-likelihood function, and

\log p(\kappa, \Theta_{c_j}, \Omega_{c_j}) = -\frac{n + \kappa + 3}{2} \sum_{l=1}^{\kappa} \log \frac{N \omega_l}{12} - \frac{\kappa}{2} \log \frac{N}{12} - \frac{\kappa (N + 1)}{2}

is the minimum description length (MDL) term that penalizes complex models [20,36], where N is the total number of training samples and n is the number of dimensions of the visual features.
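A direct transcription of the MDL penalty term as reconstructed above; the symbols follow Eq. (4):

```python
import numpy as np

def mdl_penalty(weights, n, N):
    """log p(kappa, Theta, Omega) of Eq. (4); kappa = len(weights),
    n = feature dimensions, N = number of training samples."""
    kappa = len(weights)
    return (-(n + kappa + 3) / 2.0 * sum(np.log(N * w / 12.0) for w in weights)
            - kappa / 2.0 * np.log(N / 12.0)
            - kappa * (N + 1) / 2.0)

# More components (a longer weight list) means a larger (more negative) penalty:
print(mdl_penalty([0.5, 0.5], n=56, N=1800))
print(mdl_penalty([0.4, 0.3, 0.2, 0.1], n=56, N=1800))
```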

The maximum likelihood estimate can be obtained by using the EM algorithm [37,38]. Unfortunately, the EM iteration needs knowledge of κ, and a "suitable" value of κ is usually pre-defined based on personal experience. However, it is critical to determine the value of κ automatically for different semantic image concepts, because they may consist of different types of concept-sensitive salient objects. Thus a pre-defined κ (i.e., a fixed number of mixture components) may misfit the class distribution of the relevant concept-sensitive salient objects.

To estimate the optimal number of mixture components, we propose an adaptive EM algorithm that integrates parameter estimation and model selection (i.e., selecting the optimal number κ of mixture components) seamlessly in a single algorithm. It takes the following steps:

Step 1: To avoid the initialization problem, our adaptive EM algorithm starts from a reasonably large value of κ to explain the essential structure of the training samples, and then reduces the number of mixture components sequentially. The model parameters are initialized by using the labeled samples.

Step 2: When the mixture components overpopulate some sample areas but underpopulate others, the EM iteration encounters local extrema. To escape the local extrema, our adaptive EM algorithm performs automatic merging, splitting and death operations to re-organize the distributions of the mixture components and modify the optimal number of mixture components according to the real distributions of the training samples.

The Kullback divergence KL(P(X | S_i, θ_{s_i}), P(X | S_j, θ_{s_j})) is used to measure the divergence between the ith mixture component P(X | S_i, θ_{s_i}) and the jth mixture component P(X | S_j, θ_{s_j}) in the same concept model [39]:

KL(P(X \mid S_i, \theta_{s_i}), P(X \mid S_j, \theta_{s_j})) = \int P(X \mid S_i, \theta_{s_i}) \log \frac{P(X \mid S_i, \theta_{s_i})}{P(X \mid S_j, \theta_{s_j})} \, dX.   (5)
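Since the mixture components are multivariate Gaussians, the divergence of Eq. (5) has a closed form; here is a sketch (the closed-form Gaussian KL is a standard identity, not spelled out in the paper):

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form Kullback divergence of Eq. (5) between two Gaussian
    mixture components N(mu0, cov0) and N(mu1, cov1)."""
    n = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - n
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

# Strongly overlapped components yield a small divergence (merge candidates):
mu = np.zeros(3)
print(gaussian_kl(mu, np.eye(3), mu + 0.01, np.eye(3)))  # close to 0
```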

If KL(P(X | S_i, θ_{s_i}), P(X | S_j, θ_{s_j})) is small, these two strongly overlapped mixture components provide similar densities and overpopulate the relevant sample areas; thus they can potentially be merged into one single mixture component. If they are merged, the local Kullback divergence KL(P(X | S_ij, θ_{s_ij}), P(X | θ_{s_ij})) is used to measure the local divergence between the merged mixture component P(X | S_ij, θ_{s_ij}) and the local sample density P(X | θ_{s_ij}). The local sample density P(X | θ_{s_ij}) is the empirical distribution weighted by the posterior probability and is defined as [20]:

P(X \mid \theta_{s_{ij}}) = \frac{\sum_{l=1}^{N} \delta(X - X_l)\, P(S_{ij} \mid X_l, C_j, \theta_{s_{ij}})}{\sum_{l=1}^{N} P(S_{ij} \mid X_l, C_j, \theta_{s_{ij}})},   (6)

where P(S_ij | X_l, C_j, θ_{s_ij}) is the posterior probability and δ(·) is the Dirac delta function.

To detect the best candidate for merging, our adaptive EM algorithm tests κ(κ − 1)/2 pairs of mixture components, and the pair with the minimum value of the local Kullback divergence is selected as the best candidate for merging.

At the same time, our adaptive EM algorithm also calculates the local Kullback divergence KL(P(X, C_j | S_l, θ_{s_l}), P(X, C_j | Θ)) to measure the divergence between the lth mixture component P(X, C_j | S_l, θ_{s_l}) and the local sample density P(X, C_j | Θ). If the local Kullback divergence for a specific mixture component P(X, C_j | S_l, θ_{s_l}) is large, the relevant sample area is underpopulated, and the elongated mixture component P(X, C_j | S_l, θ_{s_l}) is selected as the best candidate to be split into two representative mixture components.

In order to achieve discriminative classifier training, the classifiers for multiple semantic image concepts are trained simultaneously, where the positive samples for a certain semantic image concept can be used as the negative samples for other semantic image concepts. To control the potential overlap among the class distributions of different semantic image concepts, our adaptive EM algorithm calculates the Kullback divergence KL(P(X, C_j | S_l, θ_{s_l}), P(X, C_i | S_m, θ_{s_m})) between two mixture components from the class distributions of two different semantic image concepts C_j and C_i. If the Kullback divergence between the two mixture components P(X, C_j | S_l, θ_{s_l}) and P(X, C_i | S_m, θ_{s_m}) is small, these two mixture components overlap in the feature space, and they are selected as the best candidates to be removed from the concept models so that discriminative classifier training can be achieved. By removing the overlapped mixture components (i.e., death), our classifier training technique is able to maximize the margin among the classifiers for different semantic image concepts, and thus it results in higher prediction power.

Step 3: To choose among the three kinds of operations (merging, splitting and death), their probabilities are defined as:

J_{merge}(l, k, \Theta) = \frac{KL(P(X, C_j \mid S_{lk}, \theta_{s_{lk}}),\, P(X, C_j \mid \Theta))}{\Gamma(\Theta)},   (7)

J_{death}(l, m, \Theta) = \frac{KL(P(X, C_j \mid S_l, \theta_{s_l}),\, P(X, C_i \mid S_m, \theta_{s_m}))}{\Gamma(\Theta)},   (8)

J_{split}(l, \Theta) = \frac{\Gamma(\Theta)}{KL(P(X, C_j \mid S_l, \theta_{s_l}),\, P(X, C_j \mid \Theta))},   (9)

where Γ(Θ) is a normalization factor chosen such that

\sum_{l=1}^{\kappa_j} J_{split}(l, \Theta) + \sum_{l=1}^{\kappa_j} \sum_{k=l+1}^{\kappa_j} J_{merge}(l, k, \Theta) + \sum_{l=1}^{\kappa_j} \sum_{m=1}^{\kappa_i} J_{death}(l, m, \Theta) = 1.   (9a)

The acceptance probability for performing the merging, splitting or death operation is defined as

P_{accept} = \min\left( \exp\left[ -\frac{|L(X, \Lambda_1) - L(X, \Lambda_2)|}{\tau} \right],\; 1 \right),   (10)

where L(X, Λ_1) and L(X, Λ_2) are the objective functions of Eq. (4) for the models Λ_1 and Λ_2 (i.e., before and after performing the merging, splitting or death operation), and τ is a constant that is determined by experiments; in our current experiments, τ is set to 9.8.
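A sketch of the acceptance test of Eq. (10); placing τ as a divisor inside the exponential is one reading of the garbled original:

```python
import numpy as np

def accept_operation(L_before, L_after, tau=9.8, rng=np.random.default_rng()):
    """Eq. (10): accept a merge/split/death operation with probability
    min(exp(-|L1 - L2| / tau), 1), with tau = 9.8 as in the experiments."""
    p_accept = min(np.exp(-abs(L_before - L_after) / tau), 1.0)
    return rng.random() < p_accept

print(accept_operation(-120.0, -118.5))
```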

Step 4: Given the finite mixture model with a certain number of mixture components (i.e., after performing the merging, splitting or death operation), the EM iteration is performed to estimate the mixture parameters, such as the means, covariances and weights of the different mixture components.

E step: Calculate the expected likelihood function and the posterior probability by using the mixture parameters obtained at the tth iteration:

P(S_i \mid X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) = \frac{P(X \mid S_i, \theta_{s_i})\, \omega_{s_i}}{\sum_{i=1}^{\kappa} P(X \mid S_i, \theta_{s_i})\, \omega_{s_i}}.   (11)

M step: Find the (t + 1)th estimate of the mixture parameters:

\omega_{s_i}^{t+1} = \frac{1}{N_l} \sum_{l=1}^{N_l} P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}),

\mu_{s_i}^{t+1} = \frac{\sum_{l=1}^{N_l} X_l\, P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}{\sum_{l=1}^{N_l} P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})},

\sigma_{s_i}^{t+1} = \frac{\sum_{l=1}^{N_l} (X_l - \mu_{s_i}^{t+1})(X_l - \mu_{s_i}^{t+1})^{T}\, P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}{\sum_{l=1}^{N_l} P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}.   (12)
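Together, Eqs. (11) and (12) form one round of EM; here is a compact sketch for Gaussian components (the small ridge added to each covariance is a numerical-stability convenience, not part of the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One EM iteration: E step (Eq. 11) computes the posteriors
    P(S_i | X_l); M step (Eq. 12) re-estimates weights, means, covariances."""
    N, n = X.shape
    # E step: responsibilities R[l, i] = P(S_i | X_l, ...)
    R = np.column_stack([w * multivariate_normal.pdf(X, m, c)
                         for w, m, c in zip(weights, means, covs)])
    R /= R.sum(axis=1, keepdims=True)
    # M step (Eq. 12); the tiny ridge keeps covariances invertible.
    Nk = R.sum(axis=0)
    weights = Nk / N
    means = [(R[:, i] @ X) / Nk[i] for i in range(len(Nk))]
    covs = [((R[:, i, None] * (X - means[i])).T @ (X - means[i])) / Nk[i]
            + 1e-6 * np.eye(n) for i in range(len(Nk))]
    return weights, means, covs

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
w, m, c = em_step(X, [0.5, 0.5], [np.zeros(2), 4 * np.ones(2)],
                  [np.eye(2), np.eye(2)])
print(w, m[0], m[1])
```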

After the EM iteration converges, a weak classifier is built. The performance of this weak classifier is obtained by testing on a small number of labeled samples that were not used for classifier training. If the average performance of this weak classifier is good enough, P(C_j | X, κ, Θ_{c_j}, Ω_{c_j}) ≥ δ_1, go to Step 5; otherwise, go back to Step 2. δ_1 is set to 80% in our current experiments.

Step 5: Output the mixture model and parameters κ̂, Θ̂_{c_j}, Ω̂_{c_j}.

By performing the merging, splitting and death operations automatically, our adaptive EM algorithm has the following advantages: (a) it does not require a careful initialization of the model structure, since it starts with a reasonably large number of mixture components, and the model parameters are initialized directly from the labeled samples; (b) it is able to take advantage of the negative samples to achieve discriminative classifier training; (c) it is able to escape local extrema and enable a global solution by re-organizing the distributions of the mixture components and modifying the optimal number of mixture components.

We have also achieved a theoretical justification for the convergence of the proposed adaptive EM algorithm. In our proposed adaptive EM algorithm, the parameter spaces for the two approximated models that are estimated incrementally have the following relationships:

Merging operation: two original mixture components, P(X, C_j | S_i, θ_{s_i}) and P(X, C_j | S_j, θ_{s_j}), are merged into one single representative mixture component P(X | S_ij, θ_{s_ij}):

\omega_{S_{ij}} = \omega_{S_i} + \omega_{S_j},
\kappa = \kappa - 1,
\omega_{S_{ij}}\, \mu_{S_{ij}} = \omega_{S_i}\, \mu_{S_i} + \omega_{S_j}\, \mu_{S_j},
\omega_{S_{ij}}\, P(X \mid S_{ij}, \theta_{s_{ij}}) = \omega_{S_i}\, P(X \mid S_i, \theta_{s_i}) + \omega_{S_j}\, P(X \mid S_j, \theta_{s_j}).   (13)

Split operation: the original mixture component P(X, C_j | S_l, θ_{s_l}) is split into two representative mixture components, P(X, C_j | S_m, θ_{s_m}) and P(X, C_j | S_k, θ_{s_k}):

\omega_{S_m} = \omega_{S_k} = \frac{\omega_{S_l}}{2},
\kappa = \kappa + 1,
\omega_{S_m}\, \mu_{S_m} = \omega_{S_l}\, \mu_{S_l} + \Delta_1, \quad \omega_{S_k}\, \mu_{S_k} = \omega_{S_l}\, \mu_{S_l} - \Delta_1,
\omega_{S_l}\, P(X \mid S_l, \theta_{s_l}) = \omega_{S_m}\, P(X \mid S_m, \theta_{s_m}) + \omega_{S_k}\, P(X \mid S_k, \theta_{s_k}).   (14)

Death operation: the overlapped mixture component P(X, C_j | S_l, θ_{s_l}) is removed from the finite mixture model:

\kappa = \kappa - 1,
P(X, C_j, \kappa - 1, \Theta_{c_j}, \Omega_{c_j}) = \frac{1}{1 - \omega_{s_l}} \sum_{i=1}^{\kappa - 1} P(X \mid S_i, \theta_{s_i})\, \omega_{s_i}.   (15)
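A sketch of the weight and mean bookkeeping of the merging operation in Eq. (13):

```python
import numpy as np

def merge_components(w_i, mu_i, w_j, mu_j):
    """Moment-preserving merge of Eq. (13): the merged weight is the sum of
    the two weights; the merged mean is their weight-averaged mean."""
    w_ij = w_i + w_j                          # omega_Sij = omega_Si + omega_Sj
    mu_ij = (w_i * mu_i + w_j * mu_j) / w_ij  # from omega_Sij * mu_Sij = ...
    return w_ij, mu_ij                        # kappa decreases by one

print(merge_components(0.3, np.zeros(2), 0.2, np.ones(2)))
```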

The real class distribution P(X, C_j, κ*, Θ*_{c_j}, Ω*_{c_j}) is defined as the underlying optimal model to which our proposed adaptive EM algorithm should converge. Given the approximated class distributions P(X, C_j, κ̂, Θ̂_{c_j}, Ω̂_{c_j}) and P(X, C_j, κ, Θ_{c_j}, Ω_{c_j}) that are estimated sequentially, the Kullback divergences between the real class distribution and the approximated class distributions are calculated as

\Delta_1 = \int P(X, C_j, \kappa^*, \Theta^*_{c_j}, \Omega^*_{c_j}) \log \frac{P(X, C_j, \kappa^*, \Theta^*_{c_j}, \Omega^*_{c_j})}{P(X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})} \, dX,   (16)

\Delta_2 = \int P(X, C_j, \kappa^*, \Theta^*_{c_j}, \Omega^*_{c_j}) \log \frac{P(X, C_j, \kappa^*, \Theta^*_{c_j}, \Omega^*_{c_j})}{P(X, C_j, \hat{\kappa}, \hat{\Theta}_{c_j}, \hat{\Omega}_{c_j})} \, dX,   (17)

where Δ_1 and Δ_2 are always non-negative [39].

The difference D between Δ_1 and Δ_2 reflects the convergence of our adaptive EM algorithm. The difference D is calculated as

D = \Delta_1 - \Delta_2 = \int P(X, C_j, \kappa^*, \Theta^*_{c_j}, \Omega^*_{c_j}) \log \frac{P(X, C_j, \hat{\kappa}, \hat{\Theta}_{c_j}, \hat{\Omega}_{c_j})}{P(X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})} \, dX.   (18)

By considering the implicit relationships among κ, κ̂, κ*, Θ_{c_j}, Θ̂_{c_j}, Θ*_{c_j}, Ω_{c_j}, Ω̂_{c_j}, Ω*_{c_j} and the distributions P(X, C_j, κ*, Θ*_{c_j}, Ω*_{c_j}), P(X, C_j, κ̂, Θ̂_{c_j}, Ω̂_{c_j}), P(X, C_j, κ, Θ_{c_j}, Ω_{c_j}), we can prove:

D \le 0, \text{ if } \hat{\kappa}, \kappa \le \kappa^*; \qquad D > 0, \text{ if } \hat{\kappa}, \kappa > \kappa^*.   (19)

Hence our adaptive EM algorithm reduces the divergence sequentially, and thus it converges to the underlying optimal model incrementally. Our experimental results have also demonstrated the convergence of our adaptive EM algorithm, as shown in Figs. 19 and 20.

5.2. Classifier training with unlabeled samples

After the weak classifiers for the semantic image concepts are available, we use the Bayesian framework to achieve a "soft" classification of the unlabeled images (i.e., each unlabeled image may belong to different semantic image concepts with different posterior probabilities). The confidence score for an unlabeled image (i.e., unlabeled sample) {X_l, S_l} can be defined as [25]

\mu(X_l, S_l) = \sqrt{\mu_{\alpha}(X_l, S_l)\, \mu_{\beta}(X_l, S_l)},   (20)

where μ_α(X_l, S_l) = max_j {P(C_j | X_l, κ, Θ_{c_j}, Ω_{c_j})} is the maximum posterior probability for the unlabeled sample {X_l, S_l}, and μ_β(X_l, S_l) = μ_α(X_l, S_l) − max_j {P(C_j | X_l, κ, Θ_{c_j}, Ω_{c_j}) | P(C_j | X_l, κ, Θ_{c_j}, Ω_{c_j}) ≠ μ_α(X_l, S_l)} is the multi-concept margin (i.e., the gap between the largest and the second-largest posterior probability) for the unlabeled sample {X_l, S_l}. For a specific unlabeled sample {X_l, S_l}, its confidence score μ(X_l, S_l) can be used as the criterion for deciding whether it should be treated as an outlier.
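A minimal sketch of the confidence score of Eq. (20):

```python
import numpy as np

def confidence_score(posteriors):
    """Eq. (20): geometric mean of the maximum posterior probability and
    the multi-concept margin (gap to the second-best concept)."""
    top, second = np.sort(posteriors)[::-1][:2]
    return np.sqrt(top * (top - second))

print(confidence_score(np.array([0.80, 0.10, 0.10])))  # confident sample
print(confidence_score(np.array([0.35, 0.34, 0.31])))  # uncertain sample
```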

Based on their confidence scores, the unlabeled samples can be categorized into two classes: (a) samples from the known context classes of the existing semantic image concepts; (b) uncertain samples. The unlabeled samples with high confidence scores originate from the known context classes (i.e., known mixture components of the concept models) of the existing semantic image concepts. On the other hand, the unlabeled samples with low confidence scores are treated as the uncertain samples. The unlabeled samples with high confidence scores can be used to improve the density estimation (i.e., regular updating of the model parameters μ and σ) incrementally. However, they cannot provide additional image contexts for more accurate modeling of the semantic image concepts (i.e., they cannot discover additional mixture components), and thus they do not have the capability to improve the model selection.

By adding the unlabeled samples with high confidence scores for incremental classifier training, the confidence scores for the uncertain samples can be updated over time. Thus the uncertain samples can be further categorized into two classes according to their updated confidence scores: (1) the uncertain samples that originate from the unknown context classes (i.e., unknown mixture components of the concept models) of the existing semantic image concepts; (2) the uncertain samples that originate from the uncertain concepts.

The uncertain samples with a significant change of confidence scores originate from the unknown context classes of the existing semantic image concepts, and thus they should be included for incremental classifier training, because they can provide additional image contexts for more accurate modeling of the semantic image concepts. On the other hand, the uncertain samples without a significant change of confidence scores may originate from the uncertain concepts, and thus they should be eliminated from the training set.

After the unlabeled samples for the uncertain concepts are eliminated, the likelihood function described in Eq. (4) is replaced by a joint likelihood function over both the labeled samples and the unlabeled samples with high confidence scores. This joint likelihood function is defined as [43,44]

\log P(X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) = \sum_{S_l \in \Omega^l_{c_j}} \log P(X, C_j \mid S_l \in C_j, \theta_{s_l})\, \omega_{s_l} + \eta \sum_{S_m \in \Omega^u_{c_j}} \log \sum_{m=1}^{\kappa_j} P(Y, C_j \mid S_m, \theta_{s_m})\, \omega_{s_m},   (21)

where the discount factor η determines the relative contribution of the unlabeled samples to the density estimation. Using the joint likelihood function in Eq. (21) in place of the likelihood function in Eq. (4), our adaptive EM algorithm is then performed on the set of mixture training samples, both originally and probabilistically labeled, to learn a new classifier incrementally. By eliminating the outlying visual patterns, the unlabeled samples can be used to obtain more accurate classifiers through improved estimation of the mixture-based sample density and model structure.

E step: Calculate the expected likelihood function and the posterior probability by using the mixture parameters obtained at the tth iteration:

P(S_i \mid X, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) = \frac{P(X \mid S_i, C_j, \theta_{s_i})\, \omega_{s_i}}{\sum_{i=1}^{\kappa} P(X \mid S_i, C_j, \theta_{s_i})\, \omega_{s_i}}.   (22)

M step: Find the (t + 1)th estimate of the mixture parameters by integrating the unlabeled samples with high confidence scores:

\omega_{s_i}^{t+1} = \frac{1}{N_l + \eta N_u} \left[ \sum_{l=1}^{N_l} P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) \right],

\mu_{s_i}^{t+1} = \frac{\sum_{l=1}^{N_l} X_l\, P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} X_l\, P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}{\sum_{l=1}^{N_l} P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})},

\sigma_{s_i}^{t+1} = \frac{\sum_{l=1}^{N_l} \Phi_l\, P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} \Phi_l\, P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})}{\sum_{l=1}^{N_l} P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j}) + \eta \sum_{l=1}^{N_u} P(S_i \mid X_l, C_j, \kappa, \Theta_{c_j}, \Omega_{c_j})},   (23)

where \Phi_l = (X_l - \mu_{s_i}^{t+1})(X_l - \mu_{s_i}^{t+1})^{T}.
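A sketch of the discounted weight update in Eq. (23); the labeled and unlabeled responsibilities come from the E step of Eq. (22):

```python
import numpy as np

def discounted_weight_update(R_labeled, R_unlabeled, eta=0.1):
    """Weight update of Eq. (23): the responsibilities of the unlabeled
    samples are discounted by eta before being combined with the labeled
    ones. Both inputs are (num_samples, kappa) posterior matrices."""
    N_l, N_u = len(R_labeled), len(R_unlabeled)
    return ((R_labeled.sum(axis=0) + eta * R_unlabeled.sum(axis=0))
            / (N_l + eta * N_u))

rng = np.random.default_rng(0)
R_l = rng.dirichlet(np.ones(3), size=120)   # 120 labeled samples, 3 components
R_u = rng.dirichlet(np.ones(3), size=1000)  # high-confidence unlabeled samples
print(discounted_weight_update(R_l, R_u))
```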

In order to incorporate feature subset selection and model selection within our framework, we have proposed a wrapper strategy that combines the feature subset selection with the underlying classifier training procedure seamlessly. Given two feature subsets with different dimensions, F1 and F2, our adaptive EM algorithm performs the model selection and parameter estimation on each subset and obtains two models, Λ_1 and Λ_2. To achieve the feature selection and the model selection simultaneously, a novel technique has been developed to compare classifier models that live in the spaces of different feature subsets and different numbers of mixture components:

P(X_{F1}, C_j, \Lambda_1) = \sum_{l=1}^{\kappa_1} P(X_{F1}, C_j \mid S_l, \theta_{s_l}, \phi_{s_l})\, \omega_{s_l}, \quad P(X_{F2}, C_j, \Lambda_2) = \sum_{m=1}^{\kappa_2} P(X_{F2}, C_j \mid S_m, \theta_{s_m}, \phi_{s_m})\, \omega_{s_m},   (24)

where Λ_1 = {κ_1, θ_{s_l}, ω_{s_l} | l = 1, ..., κ_1} and Λ_2 = {κ_2, θ_{s_m}, ω_{s_m} | m = 1, ..., κ_2}. The local Kullback divergence KL(P(X_{F1}, C_j, Λ_1), P(X_{F1}, Λ_1)) is used to measure the divergence between the mixture distribution P(X_{F1}, C_j, Λ_1) and the local sample density P(X_{F1}, Λ_1):

KL(P(X_{F1}, C_j, \Lambda_1), P(X_{F1}, \Lambda_1)) = \sum_{l=1}^{\kappa_1} \omega_{s_l} \int P(X_{F1}, C_j \mid S_l, \theta_{s_l}, \phi_{s_l}) \log \frac{P(X_{F1}, C_j \mid S_l, \theta_{s_l}, \phi_{s_l})}{P(X_{F1}, \theta_{s_l})} \, dX.   (25)

If KL(P(X_{F1}, C_j, Λ_1), P(X_{F1}, Λ_1)) < KL(P(X_{F2}, C_j, Λ_2), P(X_{F2}, Λ_2)), the feature subset F1 and the concept model Λ_1 are selected; if KL(P(X_{F1}, C_j, Λ_1), P(X_{F1}, Λ_1)) = KL(P(X_{F2}, C_j, Λ_2), P(X_{F2}, Λ_2)), the model with the smaller feature subset is selected. To implement this feature subset selection scheme, we have proposed a backward search scheme that eliminates the irrelevant features sequentially. In addition, the feature correlations are also considered in our feature selection procedure: if a certain feature dimension is eliminated, the other feature dimensions that are highly correlated with it (i.e., those with large values in the covariance matrix) are selected as the prior candidates to be eliminated.
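A sketch of the backward search, with a hypothetical score callable standing in for the local Kullback divergence of Eq. (25):

```python
import numpy as np

def backward_feature_search(X, score, min_features=8):
    """Backward elimination: drop one feature dimension at a time, keeping a
    drop only if the (lower-is-better) score does not get worse. `score` is
    a hypothetical callable, e.g. the local Kullback divergence of Eq. (25)
    evaluated on the candidate feature subset."""
    selected = list(range(X.shape[1]))
    improved = True
    while improved and len(selected) > min_features:
        improved = False
        best = score(X[:, selected])
        for f in list(selected):
            trial = [g for g in selected if g != f]
            if score(X[:, trial]) <= best:
                selected, improved = trial, True
                break
    return selected

# Toy usage: prefer subsets with low mean absolute feature correlation.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))
print(backward_feature_search(X, lambda Z: np.abs(np.corrcoef(Z.T)).mean()))
```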


Fig. 10. The flowchart for semantic image classification and annotation.

Fig. 11. The result for our multi-level image annotation system, where the image annotation includes the keywords for the concept-sensitive salient objects "sky", "rock", "snow", "forest" and the semantic concept "mountain view".

By integrating the unlabeled samples for parameter estimation, model selection and feature subset selection, our incremental classifier training technique is able to achieve knowledge discovery (i.e., discovering the unknown context classes of the existing semantic image concepts), and it results in more accurate modeling of the semantic image concepts.

5.3. Semantic image classification

Once the classifiers for the N_c pre-defined semantic image concepts are in place, our system takes the following steps for semantic image classification, as shown in Fig. 10: (1) Given a specific test image I_l, the underlying concept-sensitive salient objects are detected automatically. It is important to note that one specific test image may consist of multiple types of concept-sensitive salient objects in the basic vocabulary; thus I_l = {S_1, ..., S_i, ..., S_n}. This scheme of concept-sensitive image segmentation enables us to interpret the semantics of complex natural images collectively. (2) The class distribution of these concept-sensitive salient objects I_l = {S_1, ..., S_i, ..., S_n} is then modeled as a finite mixture model P(X, C_j | κ, Θ_{c_j}, Ω_{c_j}) [15]. (3) The test image I_l is finally classified into the best matching semantic image concept C_j with the maximum posterior probability:

P(C_j \mid X, I_l, \Lambda) = \frac{P(X, C_j \mid \kappa, \Theta_{c_j}, \Omega_{c_j})\, P(C_j)}{\sum_{j=1}^{N_c} P(X, C_j \mid \kappa, \Theta_{c_j}, \Omega_{c_j})\, P(C_j)},   (26)

where Λ = {κ, Θ_{c_j}, Ω_{c_j}, j = 1, ..., N_c} is the set of the mixture parameters and relative weights for the classifiers, and P(C_j) is the prior probability (i.e., relative weight) of the semantic image concept C_j in the database. Thus the approach to semantic image classification taken here is to model the class distribution of the relevant concept-sensitive salient objects by using the finite mixture models.
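A minimal sketch of the Bayes decision of Eq. (26); the likelihoods and priors are illustrative numbers:

```python
import numpy as np

def classify_image(likelihoods, priors):
    """Eq. (26): posterior over the N_c semantic concepts via Bayes' rule;
    the image receives the concept with the maximum posterior probability."""
    posterior = likelihoods * priors
    posterior /= posterior.sum()
    return int(np.argmax(posterior)), posterior

concepts = ["beach", "garden", "mountain view"]
best, post = classify_image(np.array([1e-4, 3e-4, 2e-5]),
                            np.array([0.3, 0.4, 0.3]))
print(concepts[best], post)
```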

Our current experiments focus on generating 15 basic semantic image concepts, such as "beach", "garden", "mountain view", "sailing" and "skiing", which are widely distributed in natural images. It is important to note that once an unlabeled test image is classified into a specific semantic image concept, the text keywords that are used for interpreting the relevant semantic image concept and the underlying concept-sensitive salient objects become the text keywords for annotating the multi-level semantics of the corresponding image. The text keywords for interpreting the concept-sensitive salient objects (i.e., dominant image components) provide the annotations of the images at the content level. The text keywords for interpreting the relevant semantic image concepts provide the annotations of the images at the concept level. Thus our multi-level image annotation framework can support more expressive interpretations of natural images, as shown in Figs. 11–14. In addition, our multi-level image annotation technique is very attractive for enabling semantic image retrieval, since naive users will have more flexibility to specify their query concepts via various keywords at different semantic levels.

6. Performance evaluation

Our experiments are conducted on two image databases: a photography database obtained from the Google search engine and a Corel image database. The photography database consists of 35,000 digital pictures. The Corel image database includes more than 125,000 pictures covering different image concepts. These images (160,000 in total) are classified into 15 pre-defined classes of semantic image concepts and one additional category for outliers. Our training sets for the 15 semantic image concepts consist of 1800 labeled samples, where each semantic image concept has 120 positive labeled samples.

Fig. 12. The result for our multi-level image annotation system, where the image annotation includes the keywords for the concept-sensitive salient objects "sky", "grass", "flower", "forest" and the semantic concept "garden".

Fig. 13. The result for our multi-level image annotation system, where the image annotation includes the keywords for the concept-sensitive salient objects "sky", "sea water", "sail cloth" and the semantic concept "sailing".

Fig. 14. The result for our multi-level image annotation system, where the image annotation includes the keywords for the concept-sensitive salient objects "sky", "sand field", "sea water" and the semantic concept "beach".

Our algorithm and system evaluation focuses on: (1) using the same classifier, evaluating the performance differences between two image content representation frameworks: concept-sensitive salient objects versus image blobs; (2) under the same image content representation framework (i.e., using the concept-sensitive salient objects), comparing the performance differences between our proposed classifiers and the well-known SVM classifiers; (3) using the concept-sensitive salient objects for image content representation, evaluating the performance differences of our proposed classifiers with different sizes of unlabeled sample sets for classifier training.

6.1. Benchmark metric

The benchmark metric for classifier evaluation includes the classification precision α and the classification recall β. They are defined as

\alpha = \frac{\pi}{\pi + \varepsilon}, \qquad \beta = \frac{\pi}{\pi + \vartheta},   (27)

where π is the number of true positive samples that are related to the corresponding semantic image concept and are classified correctly, ε is the number of samples that are irrelevant to the corresponding semantic image concept but are classified into it incorrectly, and ϑ is the number of samples that are related to the corresponding semantic image concept but are mis-classified.

As mentioned above, two key issues may affect the per-formance of the classifiers: (a) the performance of our de-tection functions of concept-sensitive salient objects; (b) theperformance of the underlying classifier training techniques.Thus the real impact for semantic image classification comesfrom these two key issues, theaverage precision� and


Table 3
The classification performance (i.e., average precision ρ̄ versus average recall ϱ̄) comparison for our classifiers

Concept                       Mountain view   Beach   Garden
Salient objects   ρ̄ (%)      81.7            80.5    80.6
                  ϱ̄ (%)      84.3            84.7    90.6
Image blobs       ρ̄ (%)      78.5            74.6    73.3
                  ϱ̄ (%)      75.5            75.9    78.2

Concept                       Sailing   Skiing   Desert
Salient objects   ρ̄ (%)      87.6      85.4     89.6
                  ϱ̄ (%)      85.5      83.7     82.8
Image blobs       ρ̄ (%)      79.5      79.3     76.6
                  ϱ̄ (%)      77.3      78.2     78.5

average recall ϱ̄ are then defined as

ρ̄ = α × ρ,   ϱ̄ = β × ϱ,   (28)

where α and β are the precision and recall of our detection functions for the relevant concept-sensitive salient objects, and ρ and ϱ are the classification precision and recall of the classifiers.
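Eq. (28) simply composes the two stages by multiplication, as the following sketch with invented numbers illustrates.

# Illustrative numbers only (not from the paper's experiments).
alpha, beta = 0.90, 0.92      # detection precision / recall for the salient objects
rho, varrho = 0.88, 0.90      # classification precision / recall of the classifier

avg_precision = alpha * rho   # 0.792: Eq. (28), end-to-end average precision
avg_recall = beta * varrho    # 0.828: Eq. (28), end-to-end average recall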

6.2. Performance comparison

To measure the real impact of using the concept-sensitive salient objects for semantic image classification, we compared the performance of the same semantic image classifier when using image blobs versus the concept-sensitive salient objects for image content representation. The average performance differences are given in Tables 3 and 4 for some semantic image concepts. For the SVM approach, the search scheme introduced in Section 4 is used to obtain the optimal model parameters given in Table 5. The average performances are obtained by averaging precision and recall over 125,000 Corel images and 35,000 photographs. One can find that using the concept-sensitive salient objects for image content characterization significantly improves the accuracy of the semantic image classifiers (i.e., both the finite mixture models and the SVM classifiers). It is worth noting that the average performance results ρ̄ and ϱ̄ shown in Tables 3 and 4 already include the potential detection errors induced by our detection functions for the relevant concept-sensitive salient objects. In addition, the problem of over-detection of semantic image concepts can also be avoided.
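The parameter-search scheme of Section 4 is not reproduced here; a widely used stand-in that produces the same kind of powers-of-two (C, γ) tables is cross-validated grid search over an RBF-kernel SVM, sketched below with scikit-learn on placeholder features (the data, grid ranges and fold count are our assumptions).

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: per-image feature vectors extracted from the salient-object regions;
# y: 1 for images relevant to the concept, 0 otherwise. Placeholder random
# data is used here purely so the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
y = rng.integers(0, 2, size=200)

param_grid = {"C": [2.0 ** k for k in range(-3, 11)],
              "gamma": [2.0 ** k for k in range(-9, 3)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # a (C, gamma) pair in the same form as Table 5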

By using the concept-sensitive salient objects for image content representation and feature extraction, the performance comparison between our classifiers and the SVM classifiers is given in Fig. 15. The experimental results are obtained for 15 semantic image concepts from the same test

Table 4
The classification performance (i.e., average precision ρ̄ versus average recall ϱ̄) comparison for the SVM classifiers

Concept                       Mountain view   Beach   Garden
Salient objects   ρ̄ (%)      81.2            81.1    79.3
                  ϱ̄ (%)      80.5            82.3    84.2
Image blobs       ρ̄ (%)      80.1            75.4    74.7
                  ϱ̄ (%)      76.6            76.3    79.4

Concept                       Sailing   Skiing   Desert
Salient objects   ρ̄ (%)      85.5      84.6     85.8
                  ϱ̄ (%)      86.3      87.3     88.3
Image blobs       ρ̄ (%)      81.2      78.9     80.2
                  ϱ̄ (%)      75.6      79.4     81.7

Table 5
The optimal parameters (C, γ) of the SVM classifiers for some semantic concepts

Semantic concept   Mountain view   Beach   Garden
C                  512             32      312
γ                  0.0078          0.125   0.03125

Semantic concept   Sailing   Skiing   Desert
C                  56        128      8
γ                  0.625     4        2

dataset. By determining the optimal model structure and reorganizing the distributions of the mixture components, our proposed classifiers are very competitive with the SVM classifiers. Another advantage of our classifiers is that the models for image concept modeling are semantically interpretable.

Some results of our multi-level image annotation system are given in Figs. 16 and 17, where the keywords for automatic image annotation include the multi-level keywords for interpreting both the visually distinguishable concept-sensitive salient objects and the relevant semantic image concepts.

Given the limited sizes of the labeled training sets, we have tested the performance of our classifier training algorithm with different numbers of unlabeled samples (i.e., different ratios of unlabeled to labeled samples in the mixed training set). The average performance differences are given in Fig. 18 for some semantic image concepts. One can find that the unlabeled samples can improve the classifier's performance significantly when only a limited number of labeled samples is available for classifier training. The reason is that a limited number of labeled samples cannot capture the necessary image contexts for


Fig. 15. The performance comparison (i.e., classification precision versus classification recall) between the finite mixture model (FMM) and the SVM approach.

Fig. 16. The semantic image classification and annotation results of the natural scenes that consist of the semantic image concept “garden” and the most relevant concept-sensitive salient objects.

semantic image concept interpretation, whereas the unlabeled samples have the capability to provide additional image contexts and thus allow the finite mixture models to be learned more accurately. If the sizes of the available labeled sample sets are large enough,

the benefit from the unlabeled samples is limited, because the labeled samples have already provided the necessary image contexts for interpreting the semantic image concepts correctly.
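The paper's exact semi-supervised procedure is built on the adaptive EM algorithm; the sketch below only illustrates the underlying idea in one dimension with two Gaussian components: labeled samples keep fixed, hard responsibilities while unlabeled samples receive soft responsibilities in every E-step. All data and initial values are synthetic assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x_lab = np.concatenate([rng.normal(0.0, 1.0, 20), rng.normal(4.0, 1.0, 20)])
y_lab = np.array([0] * 20 + [1] * 20)            # known component labels
x_unl = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
x_all = np.concatenate([x_lab, x_unl])

mu = np.array([-1.0, 5.0])                       # deliberately poor start
sigma = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

resp = np.zeros((x_all.size, 2))
resp[np.arange(y_lab.size), y_lab] = 1.0         # labeled: fixed, hard responsibilities

for _ in range(50):
    # E-step on the unlabeled samples only
    like = w * norm.pdf(x_unl[:, None], mu, sigma)
    resp[y_lab.size:] = like / like.sum(axis=1, keepdims=True)
    # M-step over all samples, labeled and unlabeled together
    nk = resp.sum(axis=0)
    mu = (resp * x_all[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x_all[:, None] - mu) ** 2).sum(axis=0) / nk)
    w = nk / nk.sum()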


Fig. 17. The semantic image classification and annotation results of the natural scenes that consist of the semantic image concept “sailing” and the most relevant concept-sensitive salient objects.

Fig. 18. The relationship between the classifier performance (i.e., precision) and the ratio of unlabeled to labeled samples, N_u/N_L, used for classifier training, plotted for the semantic image concepts “mountain view”, “garden”, “skiing”, “sailing” and “beach” (ratio axis from 0.5 to 2.5, precision axis from 0.5 to 1.0).

6.3. Convergence evaluation

We have also tested the convergence of our adaptive EM algorithm experimentally. As shown in Figs. 19 and 20, one can find that the classifier's performance increases incrementally until our adaptive EM algorithm converges to the underlying optimal model. After our adaptive EM algorithm converges to the underlying optimal model, merging the overlapped mixture components in the finite

Fig. 19. The classifier performance versus the number of mixture components, where the optimal number of mixture components for the semantic image concept “sailing” is 12.

mixture models decreases the classifier's performance. The standard EM algorithm is guaranteed to converge only to a local extremum and does not guarantee a globally optimal solution. Our adaptive EM algorithm, on the other hand, is able to escape such local extrema by incorporating automatic merging, splitting and death procedures. Thus it can support reasonably


Fig. 20. The classifier performance versus the number of mixture components, where the optimal number of mixture components for the semantic concept “garden” is 36.

stable convergence to the global solution, as shown in Figs. 19 and 20.
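The merge, split and death moves of the adaptive EM algorithm are specific to the paper; as a simpler stand-in that exhibits the same performance-versus-model-size trade-off visible in Figs. 19 and 20, the sketch below selects the number of mixture components by sweeping a penalized-likelihood criterion (BIC here, in the spirit of the MDL criterion of Ref. [36]; data synthetic).

import numpy as np
from sklearn.mixture import GaussianMixture

# Three well-separated 2-D clusters, so the sweep should recover k = 3.
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(m, 0.5, size=(100, 2)) for m in (0.0, 3.0, 6.0)])

# Fit one Gaussian mixture per candidate component count and score it by BIC.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 10)}
best_k = min(bic, key=bic.get)   # lowest BIC wins
print(best_k)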

7. Conclusions and future work

This paper has proposed a novel framework to enable more effective semantic image classification and multi-level image annotation. Based on a novel framework for semantic-sensitive image content representation and classifier training, our multi-level image annotation system has achieved very good performance. Integrating unlabeled samples for classifier training not only dramatically reduces the cost of labeling the sufficient samples required for accurate classifier training but also increases the classifier accuracy significantly. Experimental results have also demonstrated the efficiency of our new framework and strategies for semantic image classification.

It is worth noting that the proposed automatic salient object detection and semantic image classification techniques can also be applied to other image domains when labeled training samples are available. It is also very important to classify images into multi-level semantic image concepts via a concept hierarchy. Our future work will focus on addressing these problems.

8. Summary

The semantic image similarity can be categorized into two major classes: (a) similar image components (e.g., sky, grass) or similar global visual properties (e.g., openness, naturalness); (b) similar semantic image concepts (e.g., garden, beach, mountain view) or similar abstract image concepts (e.g., image events such as sailing, skiing). To achieve the first class of semantic image similarity, it is very important

to enable more expressive representation and interpretation of the semantics of image contents. To achieve the second class of semantic image similarity, semantic image classification has been reported as a promising approach, but its performance largely depends on two key issues: (1) the effectiveness of the visual patterns used for image content representation and feature extraction; (2) the significance of the algorithms for semantic image concept modeling and classifier training.

Based on these observations, we have proposed a multi-level approach to interpret the semantics of natural images by using both the dominant image components and the relevant semantic image concepts. The major contributions of this paper include: (a) using the concept-sensitive salient objects as the dominant image components to achieve a middle-level understanding of the semantics of image contents and to enhance the quality of features for discriminating between different semantic image concepts; (b) automatic detection of the concept-sensitive salient objects by using SVM classifiers with an automatic scheme for searching the optimal model parameters; (c) semantic image concept modeling and interpretation by using finite mixture models to approximate the class distributions of the relevant concept-sensitive salient objects; (d) an adaptive EM algorithm to achieve optimal model selection and model parameter estimation simultaneously; (e) integrating a large number of unlabeled samples with a limited number of labeled samples to achieve knowledge discovery from large sets of natural images (i.e., to discover the hidden image contexts); (f) semantic image classification and multi-level image annotation to enable more effective image retrieval at the semantic level, so that naive users have more flexibility to specify their query concepts by using various keywords at different semantic levels.

References

[1] Y. Rui, T.S. Huang, S.F. Chang, Image retrieval: past, present, and future, J. Visual Commun. Image Represent. 10 (1999) 39–62.

[2] A.B. Benitez, J.R. Smith, S.-F. Chang, MediaNet: a multimedia information network for knowledge representation, Proceedings of the SPIE, vol. 4210, 2000.

[3] J.R. Smith, S.F. Chang, Visually searching the web for content, IEEE Multimedia, 1997.

[4] E. Chang, Statistical learning for effective visual information retrieval, Proceedings of the ICIP, 2003.

[5] X. He, W.-Y. Ma, O. King, M. Li, H.J. Zhang, Learning and inferring a semantic space from user's relevance feedback, ACM MM (2002).

[6] X. Zhu, T.S. Huang, Unifying keywords and visual contents in image retrieval, IEEE Multimedia (2002) 23–33.

[7] R. Oami, A. Benitez, S.-F. Chang, N. Dimitrova, Understanding and modeling user interests in consumer videos, ICME (2004).

[8] J.R. Smith, S.-F. Chang, Multi-stage classification of images from features and related text, Proceedings of the DELOS (1997).


[9] J.R. Smith, C.S. Li, Image classification and querying using composite region templates, Computer Vision and Image Understanding 75 (1999).

[10] P. Duygulu, K. Barnard, N. de Freitas, D. Forsyth, Object recognition as machine translation: learning a lexicon for a fixed image vocabulary, ECCV (2002).

[11] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, M.I. Jordan, Matching words and pictures, J. Mach. Learning Res. 3 (2003) 1107–1135.

[12] K. Barnard, D. Forsyth, Learning the semantics of words and pictures, Proceedings of the ICCV, 2001, pp. 408–415.

[13] M. Szummer, R.W. Picard, Indoor–outdoor image classification, Proceedings of the ICAIVL, 1998.

[14] M.R. Naphade, T.S. Huang, A probabilistic framework for semantic video indexing, filtering, and retrieval, IEEE Trans. Multimedia 3 (2001) 141–151.

[15] H. Greenspan, J. Goldberger, A. Mayer, Probabilistic space-time video modeling via piecewise GMM, IEEE Trans. PAMI 26 (3) (2004).

[16] C. Carson, S. Belongie, H. Greenspan, J. Malik, Region-based image querying, ICAIVL (1997).

[17] J. Huang, S.R. Kumar, R. Zabih, An automatic hierarchical image classification scheme, ACM MM (1998).

[18] N. Campbell, B. Thomas, T. Troscianko, Automatic segmentation and classification of outdoor images using neural networks, Int. J. Neural Systems 8 (1997) 137–144.

[19] J.Z. Wang, J. Li, G. Wiederhold, SIMPLIcity: semantics-sensitive integrated matching for picture libraries, IEEE Trans. PAMI 23 (2001) 947–963.

[20] A. Vailaya, M. Figueiredo, A.K. Jain, H.J. Zhang, Image classification for content-based indexing, IEEE Trans. Image Process. 10 (2001).

[21] P. Lipson, E. Grimson, P. Sinha, Configuration based scene classification and image indexing, CVPR (1997).

[22] W.H. Adams, G. Iyengar, C.-Y. Lin, M.R. Naphade, C. Neti, H.J. Nock, J.R. Smith, Semantic indexing of multimedia content using visual, audio and text cues, EURASIP JASP 2 (2003) 170–185.

[23] F. Jing, M. Li, L. Zhang, H.J. Zhang, B. Zhang, Learning in region-based image retrieval, CIVR (2003).

[24] E. Chang, K. Goh, G. Sychay, G. Wu, CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines, IEEE Trans. CSVT (2002).

[25] B. Li, K. Goh, E. Chang, Confidence-based dynamic ensemble for image annotation and semantic discovery, ACM MM (2003).

[26] A. Mojsilovic, J. Gomes, B. Rogowitz, ISee: perceptual features for image library navigation, Proceedings of the SPIE, 2001.

[27] A.B. Torralba, A. Oliva, Semantic organization of scenes using discriminant structural templates, Proceedings of the IEEE ICCV, 1999.

[28] M.R. Naphade, J.R. Smith, A hybrid framework for detecting the semantics of concepts and context, CIVR (2003).

[29] J. Luo, S. Etz, A physical model-based approach to detecting sky in photographic images, IEEE Trans. Image Process. 11 (2002).

[30] S.F. Chang, W. Chen, H. Sundaram, Semantic visual template: linking visual features to semantics, Proceedings of the ICIP, 1998.

[31] S. Li, X. Lv, H.J. Zhang, View-based clustering of object appearances based on independent subspace analysis, Proceedings of the IEEE ICCV, 2001, pp. 295–300.

[32] J. Luo, C. Guo, Perceptual grouping of segmented regions in color images, Pattern Recognition 36 (2003) 2781–2792.

[33] Y.-F. Ma, H.J. Zhang, Contrast-based image attention analysis by using fuzzy growing, ACM MM (2003) 374–381.

[34] A. Singhal, J. Luo, W. Zhu, Probabilistic spatial context models for scene content understanding, Proceedings of the CVPR, 2003.

[35] J. Luo, A. Singhal, S. Etz, R. Gray, A computational approach to determination of main subject regions in photographic images, Image Vision Comput. 22 (2004) 227–241.

[36] M. Hansen, B. Yu, Model selection and the principle of minimum description length, J. Am. Stat. Assoc. 96 (2001) 746–774.

[37] G. McLachlan, T. Krishnan, The EM Algorithm and Extensions, Wiley, New York, 2000.

[38] Y. Wu, Q. Tian, T.S. Huang, Discriminant-EM algorithm with application to image retrieval, Proceedings of the CVPR, 2000, pp. 222–227.

[39] S. Kullback, R. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1951) 76–86.

[40] S. Tong, E. Chang, Support vector machine active learning for image retrieval, ACM MM (2001).

[41] T. Joachims, Transductive inference for text classification using support vector machines, Proceedings of the ICML, 1999.

[42] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. PAMI 24 (2002) 603–619.

[43] M.R. Naphade, X. Zhou, T.S. Huang, Image classification using a set of labeled and unlabeled images, Proceedings of the SPIE, November 2000.

[44] Y. Wu, Q. Tian, T.S. Huang, Integrating unlabeled images for image retrieval based on relevance feedback, ICPR (2000).

[45] S.C. Zhu, Statistical modeling and conceptualization of visual patterns, IEEE Trans. PAMI 25 (2003) 691–712.

About the Author —JIANPING FAN received his M.S. degree in theoretical physics from Northwestern University, Xi'an, China, in 1994, and the Ph.D. degree in optical storage and computer science from the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Shanghai, China, in 1997.

He was a researcher at Fudan University, Shanghai, China, during 1998. From 1998 to 1999, he was a researcher with the Japan Society for the Promotion of Science (JSPS), Department of Information System Engineering, Osaka University, Osaka, Japan. From September 1999 to 2001, he was a researcher in the Department of Computer Science, Purdue University, West Lafayette, IN. He is now an assistant professor in the Department of Computer Science, University of North Carolina at Charlotte, NC. His research interests include semantic video computing and content-based video retrieval for medical education.

About the Author —HANGZAI LUO received his B.S. degree in computer science from Fudan University, Shanghai, China, in 1998. From 1998 to 2002, he was a lecturer in the Department of Computer Science, Fudan University. He is now pursuing his Ph.D. degree in Information Technology at the University of North Carolina at Charlotte, NC. His research interests include video analysis and content-based video retrieval.


About the Author —YULI GAO received his B.S. degree in computer science from Fudan University, Shanghai, China, in 2002. He is now pursuing his Ph.D. degree in the Department of Computer Science, University of North Carolina at Charlotte. His research areas are image segmentation and classification.

About the Author —GUANGYOU XU received his B.S. degree in computer science from Tsinghua University, Beijing, China, in 1963. He joined Tsinghua University as an assistant professor in 1963 and became a professor in 1989. From 1982 to 1984, he was a visiting professor at Purdue University, West Lafayette, IN, USA. From 1993 to 1994, he was a visiting professor at the Beckman Institute, University of Illinois at Urbana-Champaign. His research areas include computer vision, content-based image/video analysis and retrieval.

