GENERATING VOCABULARY FOR GLOBAL FEATURE REPRESENTATION TOWARDS COMMERCE IMAGE RETRIEVAL

Zhang Chen, Ling-Yu Duan, Chunyu Wang, Tiejun Huang, Wen Gao

The Institute of Digital Media, School of EE&CS, Peking University, Beijing, 100871, China
{azhang, lingyu, cywang, tjhuang, wgao}@pku.edu.cn

ABSTRACT

This paper studies the problem of retrieving images by color, texture and shape in the context of visually assisted product recommendation on E-commerce sites. Different from general CBIR applications, commerce image retrieval puts more emphasis on outlier-free ranking (top N) to ensure a satisfying user experience. We propose extending the bag-of-words (BoW) model to global feature characterization, rather than the commonly used histogram-based low-level feature representation. Although BoW is common practice in object recognition, we argue that generating a feature vocabulary is useful for global feature characterization and can be elegantly adapted to domain-specific commerce image search. The representation is compact and discriminative, and can adapt to individual websites. Quantitative as well as subjective evaluation demonstrates the effectiveness of the proposed method. In practice, the vocabulary-based global features greatly reduce outliers among the top-ranked images, so that a desirable user experience can be obtained in E-commerce applications.

Index Terms— Image retrieval, global feature representation, codebook, vocabulary tree, commerce images

1. INTRODUCTION

Emerging visually assisted product indexing, retrieval, and recommendation is boosting successful applications of image analysis in E-commerce, such as the two popular sites http://www.like.com and http://www.departmentofgoods.com/. Such sites allow users to select or upload a query image and retrieve images in an order that users are comfortable with in terms of color, texture or shape. When text cannot exactly describe a product (e.g., garments, devices, etc.), visual assistance undoubtedly plays an important role. In the CBIR research domain, many works have been devoted to representing global features such as color, edge, texture and shape. As a typical nonparametric method, various histograms have been widely developed, where feature spaces are quantized into equal or unequal bins[1]. However, histogram representation suffers from either considerable quantization errors or the curse of dimensionality. An alternative is to use the dominant color descriptor[2] to generate very compact signatures. However, dominant components are insufficient for the finer characterization required in the context of commerce image retrieval.

Targeting visually assisted recommendation, the traditional histogram-based representation is deficient in meeting user requirements of nearly outlier-free ranking in the top list. Extensive empirical study shows that quantitative retrieval evaluation (e.g., MAP@N) cannot exactly reflect actual user experience. Such a problem arises when there are few positive instances per query in a small-scale (say 10,000) image dataset (a typical volume on a B2C website). As shown in Fig. 1, even one outlier in the top-ranked results can deteriorate user experience.

Fig. 1. Two rows of retrieval results with comparable average precision but quite different user experience (the first row is much better).

Recent research on visual search has put more effort into the problem of near-duplicate image search (finding instances containing similar objects (e.g., logos, landmarks, etc.)[3][4] that may undergo different transformations or viewpoint changes). It has been greatly advanced by BoW models[3] and invariant local descriptors[5]. Salient regions or interest points are detected, followed by computing a descriptor for each local patch. To quantize local descriptors into a vocabulary of visual words, either flat[6] or hierarchical[4] k-means clustering is applied to learn the vocabulary.

The success of the BoW model in local-feature-based image search motivates us to derive a compact and discriminative global feature representation. We made successful attempts to apply BoW to represent color, texture and shape features by dense sampling. A distinct vocabulary for each feature is constructed by applying hierarchical k-means to densely sampled features from a large number of training images. Based on each vocabulary, we quantize the densely sampled local features in an image to generate a compact representation. Image similarity is computed by the Earth Mover's Distance (EMD). Experiments over product images from real E-commerce sites show that our method greatly improves image search performance, in both quantitative and subjective evaluation.

The paper is organized as follows. Section 2 reviews related work on visual search using the bag-of-words model. Section 3 introduces the codebook generation. Experiment settings and results are summarized in Section 4. Section 5 concludes the work.

2. RELATED WORK

The idea of using bag-of-words to generate a codebook has been studied in visual search, where local patches are obtained either by dense sampling or by key-point detectors. Leung et al.[7] tried dense sampling in images, applying a bank of Gabor-like filters to each patch, followed by vector quantization to generate a codebook for recognizing textures. Winn et al.[8] extract features from all pixels of an image and quantize them into visual words using a k-means-derived codebook.


Csurka et al.[9] applied bag-of-words to object classification. K-means was used to quantize SIFT descriptors over Harris-affine keypoints. Philbin et al.[3] adopted key-point-based sampling and quantization for landmark search. To deal with a large-scale codebook, Nister et al.[4] proposed the scalable vocabulary tree, which assigns descriptors to visual words with hierarchical k-means.

Basically, our work is inspired by the well-known scalable vocabulary tree[4]. However, our approach differs in two aspects. Firstly, we use hierarchical clustering to construct compact, discriminative and adaptive codebooks for improving global feature representation rather than for hierarchical relevance scoring; the similarity measure is computed separately. Secondly, our approach focuses on commerce images, and we revisit the CBIR potential in emerging visually assisted commerce image indexing, search, and recommendation rather than object recognition[4].

3. CODEBOOK GENERATION

3.1. Approach Overview

Our approach involves both an off-line and an on-line stage. In the off-line stage, object/background segmentation is first performed on the whole image set to reduce the negative contribution of background to the foreground feature representation. Generic object/background segmentation is challenging; however, for commerce images we propose an effective algorithm. After segmentation, features are densely sampled and extracted from the foreground pixels of all images or of a subset of training images. The codebook is constructed by performing hierarchical k-means on those features. In the on-line stage, for each image, we segment the foreground object out, densely extract visual features from the object region and quantize them into visual words using the codebook generated off-line. The final signature is formed by the most significant visual words and their percentages.

3.2. Object/background segmentation

Segmentation deserves attention because there are quite a few cases in which background features negatively dominate the global feature representation. Referring to the tie image in Fig. 2, the white background occupies a major part of the image, so objects in white are more likely to be retrieved. It is easy to see that a simple intensity-thresholding technique can hardly achieve desirable results: taking the first row in Fig. 2 as an example, a major part of the object region would be regarded as background because their intensities are similar.

Fig. 2. Segmentation results of the variance-based Otsu threshold. Variance is calculated in non-overlapping patches of 16×16.

Through extensive observation of images from several popular E-commerce websites such as Amazon and Vancl, we found that most images share a common property: pixel values in background regions do not change much, while foreground pixels do (refer to Fig. 2). This observation inspires us to perform segmentation using a variance-threshold technique. We divide an image into non-overlapping sub-regions R = {r_1, r_2, ..., r_n} of fixed size (16×16 in our experiments). The variance of each region is defined as the sum of the variances of the R, G, B channels:

$$\mathrm{Var}_i = \frac{1}{|r_i|} \sum_{P \in r_i} \left[ (P_R - u_R)^2 + (P_G - u_G)^2 + (P_B - u_B)^2 \right] \qquad (1)$$

where u_R, u_G, u_B denote the mean values of the R, G, B channels in the region, P denotes a pixel and |r_i| denotes the number of pixels in the i-th region. Then, for all regions of an image, we search for a threshold to separate them into either the object or the background group by their variance values. The threshold is calculated by Otsu's method[10]. Experiments conducted on Vancl and Mbaobao (refer to Section 4.1) show that more than 94% of the images can be correctly segmented (we regard an image as correctly segmented if more than 90% of its pixels are labeled correctly). The proposed algorithm requires no human intervention and is ready to be applied to other commerce image sets.
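To make the procedure concrete, the following sketch labels fixed-size patches by the variance criterion of Eq. (1) and thresholds them with Otsu's method. It assumes scikit-image's threshold_otsu and the 16×16 patch size shown in Fig. 2; it is an illustration, not the authors' exact implementation.

import numpy as np
from skimage.filters import threshold_otsu  # Otsu's method [10]

def segment_foreground(img, patch=16):
    """Label fixed-size patches as foreground/background by their RGB variance.
    img is an H x W x 3 array; returns a boolean per-pixel foreground mask."""
    h, w, _ = img.shape
    img = img.astype(np.float64)
    mask = np.zeros((h, w), dtype=bool)
    variances, coords = [], []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            region = img[y:y + patch, x:x + patch, :]
            # Eq. (1): sum of per-channel variances within the patch
            variances.append(region.reshape(-1, 3).var(axis=0).sum())
            coords.append((y, x))
    t = threshold_otsu(np.asarray(variances))      # data-driven threshold
    for v, (y, x) in zip(variances, coords):
        mask[y:y + patch, x:x + patch] = v > t     # high variance -> object
    return mask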

3.3. Generating codebooks

For each image I_i, feature vectors q_i = {q_{i1}, q_{i2}, ..., q_{il}} are extracted from densely sampled foreground pixels. The codebook V = {v_1, v_2, ..., v_n} is then constructed by performing hierarchical k-means on the feature vectors of all images, Q = q_1 ∪ q_2 ∪ ... ∪ q_m, where v_i denotes the i-th visual word, n denotes the codebook size and m is the number of images. We define the quantization error of the i-th node as

$$E_i = \sum_{j=1}^{m_i} \| q_{ij} - c_i \|^2$$

the sum of squared distances between the m_i features q_{ij} belonging to the node and its center c_i. After generating a codebook for the image database, densely sampled features are quantized into visual words to form a signature (see Section 3.4).

First, an initial k-means step is performed on all descriptor vectors to partition them into k groups, where each group consists of the descriptor vectors closest to a particular cluster center. The same process is then recursively applied to each group of descriptor vectors. The process stops at node i if the quantization error or the number of features belonging to the node falls below a threshold, i.e., E_i ≤ a or m_i ≤ b. When the algorithm stops, the centers of all leaf nodes compose the codebook.
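A minimal sketch of this recursive construction follows. It assumes scikit-learn's KMeans; the stopping thresholds max_error and min_points stand in for the constants a and b, and their values here are placeholders.

import numpy as np
from sklearn.cluster import KMeans

def hierarchical_codebook(features, k=4, max_error=1e3, min_points=50):
    """Recursively split features with k-means; leaf centers form the codebook."""
    center = features.mean(axis=0)
    error = np.sum((features - center) ** 2)       # node quantization error E_i
    if error <= max_error or len(features) <= min_points or len(features) < k:
        return [center]                            # stop: this node is a leaf
    km = KMeans(n_clusters=k, n_init=10).fit(features)
    leaves = []
    for j in range(k):                             # recurse into each child
        child = features[km.labels_ == j]
        if len(child) > 0:
            leaves.extend(hierarchical_codebook(child, k, max_error, min_points))
    return leaves

# codebook = np.vstack(hierarchical_codebook(all_training_descriptors))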

3.4. Online quantization

In the on-line stage, for each image, segmentation is first performed to remove background regions. Descriptors F = {f_1, f_2, ..., f_l} are densely extracted from the foreground pixels. Each feature f_i is recursively propagated down the vocabulary tree by comparing, at each level, the descriptor vector to the k candidate cluster centers and choosing the closest one, until it reaches a leaf node v_k into which it is quantized. The resulting histogram has the form R = {(c_1, w_1), (c_2, w_2), ..., (c_n, w_n)}, where w_i denotes the portion of c_i. From R, we derive a signature S = {(c_1, w_1), (c_2, w_2), ..., (c_k, w_k)} for the image by first ranking the items in R in descending order of w_i and then retaining the smallest number of items whose accumulated weight is no less than a threshold ℓ, i.e.,

$$k = \min j \quad \text{s.t.} \quad \sum_{i=1}^{j} w_i \ge \ell$$

so the representation is compact. Taking color as an example, the 15 most populated centers cover more than 85% of the foreground pixels for more than 95% of the images. Such compactness enables us to use powerful but computationally expensive similarity measures such as the EMD. To further reduce computation time, the fast EMD[11] is used in our implementation.
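The signature construction can be sketched as below. For brevity the sketch assigns each descriptor to its nearest codeword by brute force instead of walking the vocabulary tree, and the 0.85 weight threshold is an illustrative value for ℓ.

import numpy as np

def image_signature(descriptors, codebook, weight_threshold=0.85):
    """Quantize dense foreground descriptors against the codebook and keep the
    smallest set of visual words whose accumulated weight reaches the threshold."""
    # nearest codeword for every descriptor
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)
    counts = np.bincount(assignments, minlength=len(codebook))
    weights = counts / counts.sum()                      # w_i: portion of each word
    order = np.argsort(weights)[::-1]                    # rank words by weight
    cum = np.cumsum(weights[order])
    k = int(np.searchsorted(cum, weight_threshold)) + 1  # smallest k with sum >= threshold
    keep = order[:k]
    return [(codebook[i], weights[i]) for i in keep]     # signature S = {(c_i, w_i)}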


Table 1. Number of queries and the sizes of the resulting codebooks obtained with our approach on each image set.

image set | queries (color / texture / shape) | codebook size (color / texture / shape)
Vancl     | 384 / 370 / 387                   | 1017 / 574 / 312
Mbaobao   | 397 / 390 / 400                   | 729 / 215 / 278
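As an illustration of the similarity measure used in Section 3.4, the sketch below computes the EMD between two truncated signatures by solving the underlying transportation problem with SciPy's linprog. It does not reproduce the fast EMD of [11], which our implementation actually uses.

import numpy as np
from scipy.optimize import linprog

def emd(sig_a, sig_b):
    """Earth Mover's Distance between two signatures [(center, weight), ...]
    with Euclidean ground distance, solved as a small transportation LP."""
    ca, wa = np.array([c for c, _ in sig_a]), np.array([w for _, w in sig_a])
    cb, wb = np.array([c for c, _ in sig_b]), np.array([w for _, w in sig_b])
    m, n = len(wa), len(wb)
    D = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=2)  # ground distances
    total = min(wa.sum(), wb.sum())                              # total flow to ship
    A_ub, b_ub = [], []
    for i in range(m):                                           # row supplies
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        A_ub.append(row); b_ub.append(wa[i])
    for j in range(n):                                           # column demands
        col = np.zeros(m * n); col[j::n] = 1
        A_ub.append(col); b_ub.append(wb[j])
    A_eq = [np.ones(m * n)]                                      # ship the full flow
    res = linprog(D.ravel(), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[total], bounds=(0, None))
    return res.fun / total                                       # normalized cost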

4. EXPERIMENTS

To validate our codebook generation scheme, we conducted color, texture and shape based image retrieval on two data sets. Average precision is used for quantitative evaluation. As discussed above, quantitative evaluation cannot comprehensively reflect retrieval performance (refer to Fig. 1), so we further conducted a user study for subjective evaluation.

4.1. Dataset

Our empirical datasets were collected from two popular E-commerce websites in China, Vancl (http://vancl.com/) and Mbaobao (http://mbaobao.com/), which promote dresses and bags, respectively. Vancl contains 12,826 images that were manually grouped into 18 categories by color, 36 categories by shape and 12 categories by texture. Similarly, 13,260 images from Mbaobao were grouped into 10, 15 and 17 categories by color, shape and texture, respectively. These manual categorizations serve as the ground truth for retrieval evaluation. About 3% of the images are selected as queries by uniform sampling from each category for each kind of visual feature (refer to Table 1).

4.2. Visual features

For color features, R, G, B values are first extracted from each foreground pixel and then converted to the CIE-Lab space, which forms a 3-dimensional descriptor. For uniform quantization, we conducted several experiments using different bin sizes and chose the setting (l = 20, a = 32, b = 32) achieving the best retrieval precision.
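A minimal sketch of the color descriptor extraction follows, assuming scikit-image's rgb2lab and the foreground mask from Section 3.2; the sampling step is an illustrative parameter.

import numpy as np
from skimage.color import rgb2lab

def dense_color_descriptors(img, mask, step=2):
    """3-D CIE-Lab descriptors densely sampled from foreground pixels."""
    if img.dtype == np.uint8:
        img = img.astype(np.float64) / 255.0
    lab = rgb2lab(img)                     # H x W x 3, channels L*, a*, b*
    ys, xs = np.nonzero(mask)              # foreground pixel coordinates
    keep = slice(None, None, step)         # take every `step`-th foreground pixel
    return lab[ys[keep], xs[keep], :]      # (N, 3) descriptor matrix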

Shape context[12] is adopted to characterize shape features. The contours of an image are first detected. Then, for each point on the contour, the distribution of the relative positions of the remaining points is computed. The bins are taken to be uniform in log-polar space, and we use 5 bins for log r and 12 bins for θ.
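The log-polar histogram of a shape context can be sketched as follows; the radial bin edges and the mean-distance scale normalization are reasonable defaults rather than the paper's exact settings.

import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Compute a shape-context descriptor for every contour point.
    points: (N, 2) array of contour coordinates."""
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]        # pairwise offsets
    r = np.linalg.norm(diff, axis=2)                      # pairwise distances
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)
    r_norm = r / (r[r > 0].mean() + 1e-12)                # scale invariance
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    t_edges = np.linspace(0, 2 * np.pi, n_theta + 1)
    descriptors = np.zeros((n, n_r, n_theta))
    for i in range(n):
        mask = np.arange(n) != i                          # exclude the point itself
        h, _, _ = np.histogram2d(r_norm[i, mask], theta[i, mask],
                                 bins=[r_edges, t_edges])
        descriptors[i] = h / max(h.sum(), 1)
    return descriptors.reshape(n, -1)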

To characterize texture, Gabor features[13] are extracted at 3 scales and 8 orientations for each sub-region in an image. We subdivide the image at 4 different levels of resolution; at each level l, the image is divided into l² sub-regions, where texture feature vectors are computed and concatenated.
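A sketch of this pyramid-pooled Gabor texture feature is given below; the three filter frequencies and the per-region mean-magnitude pooling are assumptions, since they are not specified above.

import numpy as np
from skimage.color import rgb2gray
from skimage.filters import gabor

def gabor_pyramid_features(img, scales=(0.1, 0.2, 0.4), n_orient=8, levels=4):
    """Gabor magnitude at 3 scales x 8 orientations, pooled over an l x l grid
    at each pyramid level l = 1..4 and concatenated."""
    gray = rgb2gray(img)
    responses = []
    for f in scales:
        for k in range(n_orient):
            real, imag = gabor(gray, frequency=f, theta=k * np.pi / n_orient)
            responses.append(np.sqrt(real ** 2 + imag ** 2))   # magnitude response
    feats = []
    h, w = gray.shape
    for l in range(1, levels + 1):                              # spatial pyramid
        for i in range(l):
            for j in range(l):
                ys, ye = i * h // l, (i + 1) * h // l
                xs, xe = j * w // l, (j + 1) * w // l
                feats.extend(r[ys:ye, xs:xe].mean() for r in responses)
    return np.asarray(feats)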

4.3. Quantitative evaluation

4.3.1. Our method vs. Uniform quantization

We follow the same convention as previous work[1] and use uniform quantization as the baseline. The parameter k, defining the number of branches of each node, is set to 4. The resulting codebook sizes are listed in Table 1. The intersection measure is employed for the baseline, as in [1].
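For reference, the intersection measure used by the baseline reduces to a one-liner on normalized histograms (a minimal sketch):

import numpy as np

def histogram_intersection(h1, h2):
    """Intersection similarity between two normalized feature histograms."""
    return np.minimum(h1, h2).sum()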

As shown in Fig. 4, our method outperforms the baseline in all cases. Similar results are achieved on Mbaobao; due to limited space, we do not list those results here. It is noteworthy that the codebook sizes of the two image sets are different. Taking the color codebook as an instance, there are 1017 visual words for Vancl but only 729 for Mbaobao. This is because the images in Mbaobao are not as colorful as those in Vancl, and our method accordingly discards colors that are absent in the image set to reduce the codebook size.

4.3.2. Retrieval with/without segmentation

To investigate the impact of object segmentation, we compare the results with and without segmentation. Both codebooks are generated using the vocabulary tree method with the same parameters as in Section 4.3.1. Fig. 5 shows that retrieval precision is increased by about 5% for the color feature. Although the improvement looks small, the subsequent subjective user study shows that users are quite satisfied with it because the number of outliers is further reduced (see Fig. 3). In addition, there is little improvement in texture and shape based retrieval because, in our current datasets, the background is relatively homogeneous and would hardly degrade their performance.

4.3.3. Hierarchical clustering vs. flat clustering

We also compared the retrieval precision of our approach using hierarchical k-means and flat k-means. For flat k-means, we choose k to be equal to the codebook size of the hierarchical one. As shown in Fig. 6, the codebook generated by hierarchical clustering achieves performance comparable to flat clustering. However, it is noteworthy that we selected a near-optimal k for flat clustering, which is actually difficult to determine. In addition, the image databases of E-commerce sites may change frequently over time. Our approach, powered by hierarchical clustering, is more efficient than flat clustering at reconstructing a new codebook. For example, it takes about 40 minutes to generate a codebook for 2M color feature descriptors using hierarchical k-means (k = 4), whereas flat k-means takes about 8 hours.

4.4. User study

We have implemented a real commerce image retrieval platform based on four different approaches, i.e., the vocabulary tree, the vocabulary tree without segmentation, flat clustering and uniform quantization. A user study was conducted on the Vancl data set using the platform to explicitly investigate how much the retrieval performance is improved. In total, 200 queries whose AP values are similar across the four approaches (within the AP range of 80%–85%) were selected. We invited 40 volunteers, including 36 ordinary graduate students or teachers and 4 researchers involved in the system development. They were required to score (1–10, where 1 is the worst and 10 is the best) the retrieval performance from the following three perspectives:

1. How is the overall impression of the retrieval results?

2. How is the ranking order of retrieval results?

3. How seriously do the outliers degrade the user experience?

The average scores of the 40 volunteers for the 4 prototypes are calculated and summarized in Fig. 3. Our method achieves user satisfaction comparable to flat clustering but at lower computational cost. Besides, it is noteworthy that although the AP values are not distinctly increased (about 5%) when performing segmentation (see Fig. 5), user satisfaction is improved by 20% as the number of outliers is reduced.


Fig. 3. User study results on four prototype systems.

Fig. 4. Retrieval precision using the codebook generated by our method and the uniform binning strategy on Vancl.

Fig. 5. Retrieval precision with/without segmentation on Vancl.

Fig. 6. Retrieval precision of hierarchical clustering and flat clustering on Vancl.

The user study demonstrates the superiority of our approach and provides an insight: although some state-of-the-art algorithms obtain sound quantitative retrieval precision, they may not meet user expectations, and carefully designed algorithms are needed for specific domains.

5. CONCLUSION

A commerce image retrieval approach is proposed. The BoW-based global feature representation is compact and discriminative with data-driven quantization. The fast EMD distance is employed to speed up the similarity computation. As retrieval outliers can be largely reduced by removing background regions, the advantages of our approach have been demonstrated in both quantitative and subjective evaluation. However, the current segmentation cannot handle images with cluttered backgrounds. Developing more powerful object/background segmentation algorithms is an important direction of future work towards generic and robust commerce image retrieval applications.

6. ACKNOWLEDGEMENTS

This work was supported by the National Basic Research Program of China under contract no. 2009CB320902, in part by a grant from the National Natural Science Foundation of China under contract no. 60902057, in part by a grant from the Beijing Natural Science Foundation under contract no. 4102023, and in part by the fund from NEC Labs China.

7. REFERENCES

[1] R. Datta et al., "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys (CSUR), vol. 40, no. 2, pp. 1–60, 2008.

[2] B. S. Manjunath et al., "Color and texture descriptors," IEEE Trans. CSVT, vol. 11, no. 6, pp. 703–715, 2002.

[3] J. Philbin et al., "Object retrieval with large vocabularies and fast spatial matching," in CVPR. IEEE, 2007, pp. 1–8.

[4] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in CVPR. IEEE, 2006, vol. 2, pp. 2161–2168.

[5] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, pp. 91–110, 2004.

[6] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in ICCV. IEEE, 2003.

[7] T. Leung and J. Malik, "Representing and recognizing the visual appearance of materials using three-dimensional textons," IJCV, vol. 43, no. 1, pp. 29–44, 2001.

[8] J. Winn et al., "Object categorization by learned universal visual dictionary," in ICCV. IEEE, 2005, vol. 2, pp. 1800–1807.

[9] G. Csurka et al., "Visual categorization with bags of keypoints," in ECCV International Workshop on Statistical Learning in Computer Vision, 2004, vol. 4.

[10] N. Otsu, "A threshold selection method from gray-level histograms," Automatica, vol. 11, pp. 285–296, 1975.

[11] Y. Rubner et al., "The earth mover's distance as a metric for image retrieval," IJCV, vol. 40, no. 2, pp. 99–121, 2000.

[12] S. Belongie et al., "Shape matching and object recognition using shape contexts," IEEE PAMI, pp. 509–522, 2002.

[13] I. Fogel and D. Sagi, "Gabor filters as texture discriminator," Biological Cybernetics, vol. 61, no. 2, pp. 103–113, 1989.


