
Combining Descriptors Extracted from Feature Maps of Deconvolutional Networks and SIFT Descriptors in Scene Image Classification

Dung A. Doan (1), Ngoc-Trung Tran (1), Dinh-Phong Vo (1), Bac Le (1), and Atsuo Yoshitaka (2)

(1) University of Science, 227 Nguyen Van Cu Street, District 5, Ho Chi Minh City, Viet Nam
(2) Japan Advanced Institute of Science and Technology, Japan

[email protected], {tntrung,vdphong,lhbac}@fit.hcmus.edu.vn, [email protected]

Abstract. This paper presents a new method to combine descriptors extracted from feature maps of Deconvolutional Networks with SIFT descriptors by converting both into histograms of local patterns, so that the concatenation operation can be applied and the classification rate improves. We use the K-means clustering algorithm to construct codebooks and compute Spatial Histograms to represent the distribution of local patterns in an image. Consequently, we can concatenate these histograms to make a new one that represents more local patterns than the originals. In the classification step, an SVM with the Histogram Intersection Kernel is utilized. In experiments on the Scene-15 Dataset, which contains 15 categories, the classification rates of our method are around 84%, outperforming Reconfigurable Bag-of-Words (RBoW), Sparse Covariance Patterns (SCP), Spatial Pyramid Matching (SPM), Spatial Pyramid Matching using Sparse Coding (ScSPM), and Visual Word Reweighting (VWR).

Keywords: Scene image classification, Deconvolutional Networks, Bag-of-Words model, Spatial Pyramid Matching.

1 Introduction

Scene classification is an essential and challenging open problem in computer vision with many applications, for example content-based image retrieval, automatically assigning labels to images, and grouping images by given keywords. Because of natural conditions of images such as ambiguity, illumination, and scaling, scene classification is a difficult problem, and many approaches have been proposed to overcome these challenges.

Specifically, early works on classifying scenes extract appearance features (color, texture, power spectrum, etc.) [1][2][3] and use dissimilarity measures [4][5] to distinguish scene categories, but these approaches are only applicable when



classifying images into a small number of categories, such as indoor/outdoor or human-made/natural. In 2006, Svetlana Lazebnik et al. [6] presented Spatial Pyramid Matching (SPM), a remarkable extension of bag-of-features (BoF). SPM exploits descriptors inside each local patch, partitions the image into segments, and computes histograms of these local features within each segment. The SPM framework then uses the histogram intersection kernel with a Support Vector Machine (SVM) to classify images. According to Lazebnik [6], SPM achieves high performance when using SIFT descriptors [7] or "gist" [8]. The method is also a major component of state-of-the-art systems [9].

In 2010, Matthew Zeiler et al. [10] proposed Deconvolutional Networks (DN), which reconstruct images while maintaining stable latent representations and local information such as edge intersections, parallelism, and symmetry. More specifically, DN is built principally on the convolution operator, so the network supports grouping behavior and pooling operations on feature maps to extract descriptors (briefly, DN descriptors).

However, because edges in scene images are complex and DN cannot control the shape of its learned filters, it is difficult for DN to represent edge information comprehensively. Figure 1 shows some scene images poorly reconstructed by a 1-layer DN. One way to overcome this disadvantage of DN is to use SIFT descriptors, because SIFT features provide directional information that makes the edge representation more powerful. Nevertheless, DN and SIFT descriptors are not the same representation, so naively combining them does not guarantee better performance.

In this paper, we propose a new method to convert DN and SIFT descriptors into histograms of local patterns; these histograms can then be concatenated to produce a new one for the classification step. Specifically, we first use the K-means clustering algorithm to construct two codebooks, where each word corresponds to a local pattern. Then, two spatial histograms are built to represent the distribution of local patterns in an image. After that, histogram concatenation is carried out to make a new histogram that represents more local patterns than the originals. Finally, an SVM with the Histogram Intersection Kernel [6] is used to assign images their appropriate labels. Our approach to representing local patterns is similar to bag-of-features [11][12]. Note, however, that to improve performance, Spatial Pyramid Histograms are constructed by following the approach of Lazebnik [6]. To evaluate the method, we ran experiments on a large database of fifteen natural scene categories [6] and obtained a significant improvement in accuracy over some recent methods.

The paper is organized as follows: Section 2 reviews recent related work. Section 3 introduces our method in detail. Section 4 presents our results on the Scene-15 Dataset. Finally, Section 5 gives conclusions, discussion, and future work.


Fig. 1. Original images and the corresponding images reconstructed by the first layer of Deconvolutional Networks

2 Related Work

2.1 Deconvolutional Networks

Deconvolutional Networks, proposed by Matthew Zeiler et al. [10] in 2010, have only a decoder, which tries to reconstruct feature maps that are expectedly close to the original image. More specifically, from an input image and learned filters, DN sums over the values generated by convolving feature maps with filters in each layer, hoping these values approximate the image. A sparseness constraint is also added to the feature maps to encourage an economical representation at each level of the hierarchy, so more complex and high-level features emerge naturally. With both sparsity and convolution, DN preserves locality, mid-level concepts, and basic geometric elements, which opens the way for pooling operations and grouping behavior to extract descriptors. In practice, DN descriptors are particularly successful when applied to object recognition and image denoising.

2.2 Image Classification by Spatial Pyramid Matching

Motivated by Grauman and Darrell's method [13] of pyramid matching, Spatial Pyramid Matching (SPM) was proposed by Svetlana Lazebnik [6] in 2006. First, SPM extracts local descriptors (for example, SIFT descriptors) inside each sub-region of the image, quantizes these descriptors into vectors, and then performs


K-means to construct a dictionary. Secondly, with the constructed dictionary, SPM computes histograms of local descriptors, and these histograms are multiplied by appropriate weights at each increasing resolution. Finally, all the histograms are put together to form a pyramid match kernel for the classification step.
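As an illustration, the following minimal Python sketch builds the weighted pyramid histogram just described, assuming descriptors have already been quantized to codeword indexes; the function name and layout are our own paraphrase of [6][13], not the authors' code.

    import numpy as np

    def spatial_pyramid_histogram(codes, xs, ys, width, height, K, levels=3):
        # codes: codeword index per descriptor; xs, ys: descriptor coordinates;
        # K: codebook size; levels: number of pyramid levels (finest = levels-1).
        parts = []
        for l in range(levels):
            cells = 2 ** l  # the image is split into cells x cells segments
            # Pyramid match weights from [6]: level 0 weighs 1/2^L and level
            # l >= 1 weighs 1/2^(L - l + 1), with L = levels - 1.
            w = 1.0 / 2 ** (levels - l) if l > 0 else 1.0 / 2 ** (levels - 1)
            cx = np.minimum(xs * cells // width, cells - 1).astype(int)
            cy = np.minimum(ys * cells // height, cells - 1).astype(int)
            for i in range(cells):
                for j in range(cells):
                    mask = (cx == i) & (cy == j)
                    parts.append(w * np.bincount(codes[mask], minlength=K))
        h = np.concatenate(parts).astype(float)
        return h / max(h.sum(), 1e-12)  # normalize so images are comparable

With levels = 1 this reduces to the plain (non-pyramid) histogram of codewords used in Section 4.1.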

3 The Proposed Method

3.1 Descriptor Extraction

Deconvolutional Networks Descriptors. To train the filters of DN, we are given a set of $I_u$ unlabeled images $y_u^{(1)}, y_u^{(2)}, \ldots, y_u^{(I_u)}$. With $K_0$ denoting the number of color channels, the cost function for the first layer is:

$$C_1(y_u) = \frac{\lambda_T}{2}\sum_{i=1}^{I_u}\sum_{c=1}^{K_0}\Big\|\sum_{k=1}^{K_1}\big(z_{k,1}^{(i)} \oplus f_{k,c}^{1}\big) - y_{u,c}^{(i)}\Big\|_2^2 + \sum_{i=1}^{I_u}\sum_{k=1}^{K_1}\big|z_{k,1}^{(i)}\big|^p \tag{1}$$

where $z_{k,l}^{(i)}$ and $f_{k,c}^{l}$ are, respectively, feature map $k$ and filter $k$ of layer $l$, and $K_l$ indicates the number of feature maps in layer $l$ (so $l = 0$ and $l = 1$ in equation (1)). $\lambda_T$ is a constant that balances the contribution of the reconstruction of $y_{u,c}^{(i)}$ against the sparsity of $z_{k,1}^{(i)}$. We can also follow Matthew Zeiler [10] to form a hierarchy, in which the feature maps of layer $l$ become the inputs of layer $l+1$.
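For concreteness, here is a sketch of evaluating the first-layer cost (1) for a single image, assuming (as in [10]) that the feature maps are larger than the image by the filter size minus one, so that a "valid" convolution reproduces the image size; the function and its defaults ($\lambda_T = 10$ as in Section 4, $p = 1$) are our own illustration, not the authors' code.

    import numpy as np
    from scipy.signal import convolve2d

    def first_layer_cost(y, z, f, lam_t=10.0, p=1.0):
        # y: (K0, H, W) image channels; z: (K1, H+h-1, W+w-1) feature maps;
        # f: (K1, K0, h, w) learned filters.
        K0, K1 = y.shape[0], z.shape[0]
        recon_err = 0.0
        for c in range(K0):
            # Sum of feature maps convolved with their filters: the inner
            # term of equation (1) for channel c.
            recon = sum(convolve2d(z[k], f[k, c], mode="valid") for k in range(K1))
            recon_err += np.sum((recon - y[c]) ** 2)
        sparsity = np.sum(np.abs(z) ** p)  # sparseness term on the feature maps
        return lam_t / 2.0 * recon_err + sparsity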

In the reconstruction step, with the learned filters $f_{k,c}^{1}$ and a label/scene image $y^{scene}$, we infer $z_{k,1}$ by minimizing the reconstruction error:

$$\min_{z_{k,1}}\ \frac{\lambda_R}{2}\sum_{c=1}^{K_0}\Big\|\sum_{k=1}^{K_1}\big(z_{k,1} \oplus f_{k,c}^{1}\big) - y_c^{scene}\Big\|_2^2 + \sum_{k=1}^{K_1}\big|z_{k,1}\big|^p$$

We can also infer feature maps in higher layers by following [10]. Each feature map $z_{k,1}$ is split into overlapping $p_1 \times p_1$ patches with a spacing of $s_1$ pixels. Each patch is pooled and then grouped to give local descriptors.
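The patch pooling step might look as follows in Python; since the text only states that patches are "pooled and then grouped", the choice of max-pooling and of stacking the $K_1$ maps into one vector per grid position is an assumption of ours ($p_1 = 16$, $s_1 = 8$ follow the settings in Section 4).

    import numpy as np

    def pool_patches(feature_maps, p1=16, s1=8):
        # feature_maps: (K1, H, W) inferred maps z_{k,1} for one image.
        K1, H, W = feature_maps.shape
        descriptors = []
        for top in range(0, H - p1 + 1, s1):
            for left in range(0, W - p1 + 1, s1):
                patch = feature_maps[:, top:top + p1, left:left + p1]
                # Max-pool each map over the patch (assumed), then group the
                # K1 pooled values into one local descriptor.
                descriptors.append(patch.reshape(K1, -1).max(axis=1))
        return np.asarray(descriptors)  # one row per p1 x p1 patch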

SIFT Descriptors. From our experiments, we observe that because edges in scene images are very complicated, the feature maps $z$ of a 1-layer DN are not enough (figure 1). Therefore, we utilize SIFT descriptors to support edge representation in the scene recognition problem. Concretely, we densely extract local SIFT descriptors in overlapping $p_2 \times p_2$ patches at a stride of $s_2$ pixels.
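The paper does not name a particular SIFT implementation; as one possible realization, the dense extraction can be sketched with OpenCV, placing a keypoint at the center of every $p_2 \times p_2$ patch on a stride-$s_2$ grid:

    import cv2

    def dense_sift(gray, p2=16, s2=8):
        # gray: 8-bit grayscale image. One 128-D descriptor per grid point.
        sift = cv2.SIFT_create()
        h, w = gray.shape
        keypoints = [cv2.KeyPoint(float(x), float(y), float(p2))
                     for y in range(p2 // 2, h - p2 // 2 + 1, s2)
                     for x in range(p2 // 2, w - p2 // 2 + 1, s2)]
        _, descriptors = sift.compute(gray, keypoints)
        return descriptors  # shape (N, 128)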

3.2 Building Histograms

Given a set of SIFT descriptors $X_{SIFT} = [x_{SIFT}^{(1)}, x_{SIFT}^{(2)}, \ldots, x_{SIFT}^{(N)}]^T \in \mathbb{R}^{N \times 128}$ and DN descriptors $X_{DN} = [x_{DN}^{(1)}, x_{DN}^{(2)}, \ldots, x_{DN}^{(M)}]^T \in \mathbb{R}^{M \times D}$, we represent $x_{SIFT}^{(i)}$ and $x_{DN}^{(i)}$ in 128- and $D$-dimensional feature spaces respectively.


With $B_{SIFT}$ and $B_{DN}$ being the codebooks of SIFT and DN descriptors, the K-means method is applied to minimize the following cost function:

$$\min_{\substack{V_{SIFT},\,B_{SIFT}\\ V_{DN},\,B_{DN}}}\ \sum_{i=1}^{N}\big\|x_{SIFT}^{(i)} - (B_{SIFT})^T v_{SIFT}^{(i)}\big\|_2^2 + \sum_{i=1}^{M}\big\|x_{DN}^{(i)} - (B_{DN})^T v_{DN}^{(i)}\big\|_2^2 \tag{2}$$

subject to: $\|v_{SIFT}^{(i)}\|_0 = 1$, $\|v_{SIFT}^{(i)}\|_1 = 1$, $v_{SIFT,j}^{(i)} \ge 0\ \forall i,j$; and $\|v_{DN}^{(i)}\|_0 = 1$, $\|v_{DN}^{(i)}\|_1 = 1$, $v_{DN,j}^{(i)} \ge 0\ \forall i,j$,

where $v_{SIFT,j}^{(i)}$ and $v_{DN,j}^{(i)}$ are the elements of the vectors $v_{SIFT}^{(i)}$ and $v_{DN}^{(i)}$ respectively, and $V_{SIFT} = [v_{SIFT}^{(1)}, v_{SIFT}^{(2)}, \ldots, v_{SIFT}^{(N)}]^T$ and $V_{DN} = [v_{DN}^{(1)}, v_{DN}^{(2)}, \ldots, v_{DN}^{(M)}]^T$ are index vectors. In the training phase, we minimize cost function (2) with respect to $B_{DN}$, $B_{SIFT}$, $V_{DN}$, and $V_{SIFT}$; in the coding phase, with the learned $B_{DN}$ and $B_{SIFT}$, we minimize (2) only with respect to $V_{DN}$ and $V_{SIFT}$.

After obtaining $V_{DN}$ and $V_{SIFT}$, we compute the histograms:

$$H_{SIFT} = \frac{1}{N}\sum_{i=1}^{N} v_{SIFT}^{(i)} \qquad\qquad H_{DN} = \frac{1}{M}\sum_{i=1}^{M} v_{DN}^{(i)}$$

With the aim of improving performance, Spatial Pyramid Histograms are made by following the approach of SPM [6].
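Because the constraints in (2) make each $v^{(i)}$ a hard 1-of-K assignment to the nearest codeword, the two phases reduce to standard K-means, and each histogram is just the normalized count of assignments. A minimal sketch with scikit-learn (the function names and the $K = 200$ default are ours; 200 is the size fixed in Section 4.1):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(X, K=200, seed=0):
        # Training phase of (2): learn the codebook B by K-means.
        return KMeans(n_clusters=K, random_state=seed).fit(X)

    def pattern_histogram(kmeans, X):
        # Coding phase: each v^(i) picks the nearest codeword, so
        # H = (1/N) * sum_i v^(i) is a normalized bin count.
        codes = kmeans.predict(X)
        return np.bincount(codes, minlength=kmeans.n_clusters) / float(len(X))

Applying the two functions to $X_{SIFT}$ and $X_{DN}$ in turn yields $H_{SIFT}$ and $H_{DN}$.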

3.3 Image Classification

At this stage, the DN and SIFT descriptors have the same representation, so the following equation can be applied:

$$H = H_{SIFT} \odot H_{DN}$$

where $\odot$ denotes the concatenation operation. The new histogram $H$, which is fed into an SVM with the Histogram Intersection Kernel [6], represents more local patterns than $H_{SIFT}$ and $H_{DN}$.

Figure 2 shows all steps of our method.
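A sketch of this classification stage, again under our own naming: the Histogram Intersection Kernel is $K(x, y) = \sum_j \min(x_j, y_j)$, which scikit-learn accepts as a precomputed kernel. Note that SVC's built-in multi-class rule is one-vs-one, whereas Section 4 uses 1-vs-All, so this is an approximation of the pipeline, not a reproduction.

    import numpy as np
    from sklearn.svm import SVC

    def intersection_kernel(A, B):
        # Histogram Intersection Kernel: K(x, y) = sum_j min(x_j, y_j).
        return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

    def classify(H_sift_tr, H_dn_tr, y_tr, H_sift_te, H_dn_te):
        # H = H_SIFT concatenated with H_DN, per image.
        H_tr = np.hstack([H_sift_tr, H_dn_tr])
        H_te = np.hstack([H_sift_te, H_dn_te])
        clf = SVC(kernel="precomputed")
        clf.fit(intersection_kernel(H_tr, H_tr), y_tr)
        return clf.predict(intersection_kernel(H_te, H_tr))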

4 Experiments

We adopt the Scene-15 Dataset [6], which contains 15 categories (office, kitchen, living room, mountain, etc.). Each category has from 200 to 400 images with an average size of 300 × 250 pixels; the major image sources are the COREL collection, Google image search, and personal photographs.


Fig. 2. Each step of our method. After extracting DN and SIFT descriptors, K-means is applied to build two codebooks. Then, the distribution of local patterns is represented in the Histogram Building stage. Histogram Concatenation is carried out to produce a new histogram that represents more local patterns. Finally, an SVM with the Histogram Intersection Kernel is used to classify images.

Example images of the Scene-15 Dataset are illustrated in figure 5.

In our experiments, all images are converted to gray-scale and contrast-normalized before being passed to DN. We train the 8 feature maps of a 1-layer DN using only 20 images, consisting of 10 fruit and 10 city images; Scene-15 images are used only to train the supervised classifier. Specifically, we follow the experimental setup of Lazebnik [6] for the Scene-15 Dataset, training on 100 images per class and testing on the rest.

In the classification step, the multi-class SVM rule is 1-vs-All: an SVM classifier is trained to separate each class from the rest, and a test image is assigned the label with the highest response.

4.1 Codebook Size

Selecting the codebook size involves a trade-off between discriminative power and generalizability. Concretely, a small codebook can lack discriminative power, since dissimilar features may be assigned to the same cluster/local pattern. On the other hand, a large codebook is more discriminative but less generalizable and less tolerant to noise, because similar features may be mapped to different local patterns.

Therefore, in this experiment, we survey how the sizes of the DN and SIFT codebooks affect the classification rate. Specifically, in the survey of the SIFT codebook size, we keep the DN codebook fixed at 200 words and gradually increase the SIFT codebook size from 50 to 2000. Similarly, in the survey of the DN codebook size, the SIFT codebook is fixed at 200 words, and the DN codebook is raised from 50 to 2000. Note that the spatial pyramid histogram is not used in this experiment; detailed results are illustrated in figure 3.

In both cases, as the dictionary size increases from 50 to 1000, the performance rises rapidly and then peaks. Increasing the dictionary size further makes the classification rate decrease gradually.


Fig. 3. The classification rates at different sizes of the DN and SIFT codebooks

4.2 Histogram Combination and Naive Combination

In this experiment, we compare histogram combination with naive combination. Concretely, in naive combination, the following equation is first applied to the DN and dense SIFT descriptors:

$$x = x_{DN} \odot x_{SIFT} \tag{3}$$

where $x_{DN}$ and $x_{SIFT}$ denote DN and dense SIFT descriptors, respectively. Then, a codebook is constructed using K-means, and local pyramid histograms are built in each sub-region of the image at increasing resolutions. Finally, an SVM with the Histogram Intersection Kernel is utilized to classify images.
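In code, the only difference from Section 3 is where the concatenation happens; a sketch under the assumption, made explicit in the parameter settings below, that both descriptor sets are extracted on the same grid so they pair up one-to-one:

    import numpy as np

    def naive_combine(X_dn, X_sift):
        # Equation (3): concatenate the DN and SIFT descriptors of each patch
        # before building a single codebook; requires matching grids (N == M).
        assert len(X_dn) == len(X_sift), "descriptor grids must match"
        return np.hstack([X_dn, X_sift])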

Fig. 4. In naive combination, DN and SIFT descriptors are concatenated first. Then, a codebook is constructed using K-means. The distribution of local patterns is represented in Histogram Building. Finally, an SVM with the Histogram Intersection Kernel is utilized in the classification step.


The details of these steps are illustrated in figure 4. In the naive combination approach, the parameters we use are $K_1 = 8$, $p_1 = 16$, $s_1 = 8$, $p_2 = 16$, $s_2 = 8$, and $\lambda_T = \lambda_R = 10$, and images are resized to $150 \times 150$ before extracting dense SIFT descriptors so that equation (3) can be performed easily. Experiments are conducted 5 times with different randomly selected training and testing images, and the mean and standard deviation are calculated.

Table 1. Histogram combination compared with naive combination, DN descriptors, and SIFT descriptors

Method                  Classification rate (%)
DN descriptors          75.3 ± 0.9
SIFT descriptors        81.5 ± 0.3
Naive combination       72.8 ± 0.9
Histogram combination   84.3 ± 0.2

The mean and standard deviation results are shown in table 1, where we compare histogram combination not only with the naive combination approach but also with DN and SIFT descriptors individually. As table 1 shows, the proposed histogram combination outperforms the others.

4.3 Comparison with Other Methods

In this experiment, we compare the performance of our method with other recent methods; the parameters we use are $K_1 = 8$, $p_1 = 16$, $s_1 = 2$, $p_2 = 16$, $s_2 = 8$, and $\lambda_T = \lambda_R = 10$.

The experimental process is repeated 10 times with different randomly selected training and testing images. The final results are reported as the mean and standard deviation of the recognition rates; our results, using 3-fold cross-validation in SVM training, are shown in table 2. As shown, our method outperforms some recent methods.

Table 2. Classification rate (%) comparison on the 15-Scene Dataset

Method        Classification rate (%)   Year
SPM [6]       81.4 ± 0.5                2006
ScSPM [14]    80.3 ± 0.9                2009
VWR [15]      83.0 ± 0.2                2011
RBoW [16]     78.6 ± 0.7                2012
SCP [17]      80.4 ± 0.5                2012
Our method    84.4 ± 0.4


Fig. 5. Some example images of the Scene-15 Dataset

5 Conclusion, Discussion, and Future Work

Motivated by our observations and experiments, we find that because the edges of scene images are very complicated and the learned filters of Deconvolutional Networks cannot be controlled, the feature maps of DN are not enough to represent edge information comprehensively. In this paper, we proposed a new method that uses SIFT descriptors to support edge representation: both DN and SIFT descriptors are converted into histograms of local patterns, so these histograms can be concatenated into a new one that represents more local patterns than the originals. Consequently, our method makes the data for the classification step more discriminative, and experimental results on the 15-Scene Dataset showed that our method performs better than the recent methods.

However, our method has some disadvantages: it is still difficult to build a real-time application because the processing time is slow. Furthermore, because K-means is used for codebook construction, the method is too restrictive in assigning each sample to only one local pattern.

In future work, we would like to make the codebook construction step more flexible, rather than restrictive like K-means; constructing the codebook in a supervised fashion will also be considered, and the implementation of a practical application is ongoing.


Acknowledgments. This research is supported by funding from the Advanced Program in Computer Science, University of Science, Vietnam National University - Ho Chi Minh City.

References

1. Faloutsos, C., Barber, R., Flickner, M., Hafner, J., Niblack, W., Petkovic, D., Equitz, W.: Efficient and effective querying by image content. Journal of Intelligent Information Systems (1994)

2. Hampapur, A., Gupta, A., Horowitz, B., Shu, C., Fuller, C., Bach, J., Gorkani, M., Jain, R.: Virage video engine. In: Electronic Imaging 1997, International Society for Optics and Photonics (1997)

3. Ma, W., Manjunath, B.: Netra: A toolbox for navigating large image databases. In: International Conference on Image Processing (1997)

4. Puzicha, J., Buhmann, J., Rubner, Y., Tomasi, C.: Empirical evaluation of dissimilarity measures for color and texture. In: The Proceedings of the Seventh IEEE International Conference on Computer Vision (1999)

5. Santini, S., Jain, R.: Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence (1999)

6. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2006)

7. Lowe, D.G.: Towards a computational model for object recognition in IT cortex. In: Lee, S.-W., Bulthoff, H.H., Poggio, T. (eds.) BMCV 2000. LNCS, vol. 1811, pp. 20–31. Springer, Heidelberg (2000)

8. Torralba, A., Murphy, K., Freeman, W., Rubin, M.: Context-based vision system for place and object recognition. In: Proceedings of the Ninth IEEE International Conference on Computer Vision (2003)

9. Bosch, A., Zisserman, A., Muñoz, X.: Image classification using random forests and ferns. In: IEEE 11th International Conference on Computer Vision, ICCV 2007 (2007)

10. Zeiler, M., Krishnan, D., Taylor, G., Fergus, R.: Deconvolutional networks. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010)

11. Yang, J., Jiang, Y., Hauptmann, A., Ngo, C.: Evaluating bag-of-visual-words representations in scene classification. In: Proceedings of the International Workshop on Multimedia Information Retrieval (2007)

12. Jiang, Y., Yang, J., Ngo, C., Hauptmann, A.: Representations of keypoint-based semantic concept detection: A comprehensive study. IEEE Transactions on Multimedia (2010)

13. Grauman, K., Darrell, T.: The pyramid match kernel: Discriminative classification with sets of image features. In: Tenth IEEE International Conference on Computer Vision, ICCV 2005 (2005)

14. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009 (2009)

15. Zhang, C., Liu, J., Wang, J., Tian, Q., Xu, C., Lu, H., Ma, S.: Image classification using spatial pyramid coding and visual word reweighting. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 239–249. Springer, Heidelberg (2011)

16. Parizi, S., Oberlin, J., Felzenszwalb, P.: Reconfigurable models for scene recognition. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)

17. Wang, L., Li, Y., Jia, J., Sun, J., Wipf, D., Rehg, J.: Learning sparse covariance patterns for natural scenes. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)

