Similarity learning for object recognition based on derived kernel

Neurocomputing 83 (2012) 110–120

Contents lists available at SciVerse ScienceDirect

Neurocomputing

0925-23

doi:10.1

n Corr

E-m

journal homepage: www.elsevier.com/locate/neucom

Similarity learning for object recognition based on derived kernel

Hong Li a, Yantao Wei b, Luoqing Li c, Yuan Yuan d,n

a School of Mathematics and Statistics, Huazhong University of Science and Technology, Wuhan 430074, Chinab Institute for Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology, Wuhan 430074, Chinac Faculty of Mathematics and Computer Science, Hubei University, Wuhan 430062, Chinad Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics,

Chinese Academy of Sciences, Xi’an 710119, China

a r t i c l e i n f o

Article history:

Received 1 August 2011

Received in revised form

21 November 2011

Accepted 3 December 2011

Communicated by: X. Gaomeasure, a new template selection method is proposed in this paper. In this method, the redundancy is

Available online 24 December 2011

Keywords:

Derived kernel

Hierarchical learning

Image similarity

Neural response

Object recognition

Template selection

12/$ - see front matter & 2012 Elsevier B.V. A

016/j.neucom.2011.12.005

esponding author.

ail address: [email protected] (Y. Yuan).

a b s t r a c t

Recently, derived kernel method which is a hierarchical learning method and leads to an effective

similarity measure has been proposed by Smale. It can be used in a variety of application domains such

as object recognition, text categorization and classification of genomic data. The templates involved in

the construction of the derived kernel play an important role. To learn more effective similarity

reduced and the label information of the training images is used. In this way, the proposed method can

obtain compact template sets with better discrimination ability. Experiments on four standard

databases show that the derived kernel based on the proposed method achieves high accuracy with

low computational complexity.

& 2012 Elsevier B.V. All rights reserved.

1. Introduction

Object recognition is one of the active research topics in thecomputer vision [1–10]. The reliable recognition of interestedobjects in an input image with arbitrary background clutter andocclusion has always been an attractive but elusive goal. For themost part, the performance of the object recognition system iscritically dependent on the similarity measure [11–14]. Conse-quently, it is interesting to define an effective similarity measure.In the last few decades, various image similarity measures havebeen proposed such as Minkowski distance, fractional distance,tangent distance [15], Hausdorff distance [16], histogram inter-section and so on. Among these measures, Euclidean distance(Minkowski distance of order 2) is the most commonly usedmethod due to its simplicity. But it suffers often from a highsensitivity even to small deformation. Tangent distance has beenwidely used in handwritten digit recognition. Although tangentdistance compensates for global affine transformations, it issensitive to local image deformations [17]. The generalizedHausdorff distance is not only robust to noise, but also allowsportions of one image to be compared with another [18]. How-ever, how to realize the rotation and scale invariant object

ll rights reserved.

matching remains a problem for it. In addition, a lot of similaritymeasures require the image to be previously segmented intoregions that represent entire objects. Unfortunately this kind ofbottom-up high-level perceptual grouping seems to be infeasiblein the presence of clutter, and is impossible in the presence ofocclusion [19].

As is well known, the similarity between two images is notonly related to the similarity function but also to the features tobe measured. Features usually fall into one of two categories:patch-based and histogram-based [2,1]. The patch-based featuresare very selective for object shape, but they are lack of invariancewith respect to object transformations. By contrast, histogram-based features [3] are robust to transformations. For example, thescale-invariant feature transform (SIFT) are local and based on theappearance of the object at particular interest points, and areinvariant to image scale and rotation [3]. However, as proved in[2], it is unlikely that the SIFT-based features could perform wellon a generic object recognition task with such degree of invar-iance. Given the vastly superior performance of human vision onobject recognition, it is reasonable to look to biology for inspira-tion. Currently, designing biologically inspired algorithm is animportant direction in computer vision [1,20].

In [21], Smale et al. proposed the notion of derived kernel onthe space of images in light of neuroscience of visual cortex. It isthe inner product defined by the neural response which is anatural image representation [22,21,23]. This model gives an

www.elsevier.com/locate/neucom

www.elsevier.com/locate/neucom

dx.doi.org/10.1016/j.neucom.2011.12.005

mailto:[email protected]

dx.doi.org/10.1016/j.neucom.2011.12.005

Fig. 1. Nested patch domains [21].

Fig. 2. Translation from u to v, where h1 and h2 correspond to two different

translations.

Fig. 3. A specific image patch restricted by a translation.

H. Li et al. / Neurocomputing 83 (2012) 110–120 111

effective similarity measure by alternating pooling and filteringsteps [1,21,24,25]. The filtering operation measures the similaritybetween the input representations and a given set of templates,while the pooling operation combines the filtered outputs into asingle value. In this way, the filtering step induces discriminationwhile the pooling step induces invariance in the architecture. Thisarchitecture can lead to a balanced trade-off between invarianceand selectivity [24,26]. In addition, the dimensionality of theneural response depends only on the number of templates at thelayer which the neural response is being considered and isirrelevant to the size of the input image. For this reason, it candirectly measure the similarity between the images with differentsizes. The last but not the least is that the derived kernel canmeasure the similarity between images without the need forsegmentation.

A key semantic component in the derived kernel is a dictionaryof templates. Currently, templates are often obtained by randomlyextracting image patches from an image set [21]. Random sam-pling will lead to different similarities if the template sets are re-sampled, especially when the number of templates is small. On theother hand, a large number of randomly selected templates doesnot always lead to better similarity measure and the running timealso increases. An effective template selection method will enhancethe similarity measure ability of the derived kernel. Consequently,a template selection method is proposed in this paper. Theproposed method makes use of the discriminative information inthe image space and leads to a small number of templates. Thetheoretical analysis and experiments performed on real databaseshow that the proposed algorithm is adaptable to object recogni-tion task and can improve recognition accuracy and efficiency.

The remaining parts of the paper are organized as follows. InSection 2, we briefly review the necessary background for thederived kernel. In Section 3, we present a template selectionmethod and the necessary theory analysis of it. The proposedmethod can obtain compact template sets with better discrimina-tion ability. Experiments are given in Section 4 to demonstratethe effectiveness of the method. Finally, we give a conclusion inSection 5.

2. A review of derived kernel

In this section, we will give a brief introduction to the derivedkernel. The reader is encouraged to consult [21,26,27] for a moredetailed treatment.

2.1. Preliminaries

The ingredients needed to define the derived kernel are givenas follows [21]: (1) a finite architecture defined by nested patches(for example subdomains of the square in R2, see Fig. 1), (2) trans-formations from a patch to the next larger one (see Fig. 2),(3) function spaces defined on each patch, and (4) templatesconnecting the mathematical model to a real world setting.

Here we give the definition of the derived kernel in the case ofan architecture composed of three layers of patches u, v and Sq inR2 (see Fig. 1), with u� v� Sq. Assume that we are given afunction space on Sq, denoted by I Sq ¼ ff : Sq-½0;1�g (where fis the image of size Sq), as well as the function spaces Iu, Iv

defined on sub-patches u, v respectively. Then we define a finiteset Hu of transformations that are maps from u to v with h : u-v,and similarly Hv with h0 : v-Sq. The transformation can betranslation, scaling and rotation. In this paper, the sets oftransformations are assumed to be finite and limited to transla-tions for simplicity. Finally, the template sets T u � Iu and

T v � Iv are assumed to be discrete and finite. Templates can beseen as image patches frequently encountered in the daily life.

2.2. Derived kernel

The definition is based on a recursion which defines ahierarchy of local kernel. For the sake of simplicity, we describethe construction of the derived kernel in a bottom-up fashion:

(i)
The definition begins with a non-negative valued, normal-ized, reproducing kernel on Iu � Iu denoted by bK uðf ,gÞ. Wecan choose the inner product of square integrable functionson u, namely
Kuðf ,gÞ ¼Z

uf ðxÞgðxÞ dðxÞ,

where f ,gAIu. The normalized kernel bK u can be obtained by

bK uðf ,gÞ ¼Kuðf ,gÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

Kuðf ,f ÞKuðg,gÞp : ð1Þ

We rule out the functions which satisfy Kuðf ,f Þ ¼ 0, in orderto make sense of the normalization.

(ii)
Then we define the first layer neural response of f at t
Nvðf ÞðtÞ ¼maxhAH

bK uðfJh,tÞ, ð2Þ

where f AIv, tAT u and H¼Hu. Note that transformationhAHu or Hv implicitly involves restriction. For example, iff AIv is an image patch of size v and hAHu, then fJh is apiece of the image patch of size u (see Fig. 3). Nvðf ÞðtÞ is the

H. Li et al. / Neurocomputing 83 (2012) 110–120112

best match of the template t in the image patch f . The neuralresponse can be interpreted as a vector in R9T u9 withcoordinates Nvðf ÞðtÞ, where 9T u9 is the cardinality of thetemplate set T u.The derived kernel on Iv � Iv is then defined as

Kvðf ,gÞ ¼/Nvðf Þ,NvðgÞSL2ðT uÞ

,

where f ,gAIv, and / � , �SL2ðT uÞ

is the L2 inner product withrespect to the uniform measure [21]. bK vðf ,gÞ can be obtainedby normalizing Kvðf ,gÞ according to (1).

(iii)
The above process repeats by defining the second layerneural response as
NSqðf Þðt0Þ ¼max

h0AHbK vðfJh0,t0Þ, ð3Þ

where in this case f AI Sq, t0AT v and H¼Hv. Consequently,the derived kernel on I Sq � ISq is given by

KSqðf ,gÞ ¼/NSqðf Þ,NSqðgÞSL2ðT vÞ

,

where f ,gAI Sq and / � , �SL2ðT vÞ

is the L2 inner product. Asbefore, the final derived kernel bK Sq is obtained by normal-izing KSq.

This process can continue if appropriate higher level patches aredefined. It is worth pointing out that, although Smale et al. havedrawn on the example of vision as an interpretation of the derivedkernel, the setting is general and is not limited to images [21].

3. Template selection

In [21], the templates are randomly sampled. However, thesetemplates may not fully reflect the structure characteristics of theobject to be recognized. Template selection is an open (andinvolved) problem that could include learning techniques [26].For the reasons above, we investigate the template selectionalgorithm which is suitable for object recognition task. Here, weconsider the derived kernel with three-layer architecture. There-fore, two template sets need to be constructed. Hereinafter, thetemplate tAT u is called the first layer template and the templatet0AT v is called the second layer template.

3.1. Motivation

In object recognition, the derived kernel must have theproperty that they are good at discriminating between objectsfrom different categories while simultaneously being invariant toimaging variations (such as those due to camera rotation or zoom,changes in illumination, etc.) as well as within class variability.For example, we can recognize a specific face among many,despite changes in viewpoint, scale, illumination or expression.That is to say, we must balance the trade-off between invarianceand selectivity. Smale [21] and Bouvrie [26] have suggested thatShannon entropy is a useful concept towards understanding thetrade-off between invariance and selectivity properties of thederived kernel. Their conclusion indicates that if we use enoughtemplates then the derived kernel will be maximally selective.Conversely, its invariance increase since some discriminativetemplates cannot be sampled [26]. In this paper, a small numberof templates is selected. This will increase the invariance accord-ing to the conclusion of Smale et al. On the other hand, the labelinformation of the training images is used. Consequently, thederived kernel based on our method has high discriminativepower. In this way, we can balance the trade-off betweeninvariance and selectivity.

The ability of machines to classify patterns relies on anassumption that classes occupy distinct regions in the featurespace. This tells us that we need to select the templates at whichthe between-class neural responses are separated maximally andmeanwhile the within-class neural responses are as close aspossible. Note that the second layer templates are large enoughto include more discriminative structure, so the supervisedmethod is designed for selecting the second layer templates. Firstof all, a discriminant criterion is given in Section 3.2.

3.2. Optimality criteria

Denote by Pv the initial template set

Pv ¼ fp1,p2, . . . ,pMg,

where pi ði¼ 1;2, . . . ,MÞ is the ith candidate template andPv � Iv. The training set is denoted by

I ¼ fI1,I2, . . . ,INg,

where Ij ðj¼ 1;2, . . . ,NÞ is the jth training image and I � I Sq.Suppose that the training images are divided into C classes. Thesecond layer neural response of training image Ij at a givencandidate template piAPv is denoted by NSqðIjÞðpiÞ.

For the kth class, the mean of the neural responses of thetraining images at pi is given by

mkðpiÞ ¼1

Nk

XNk

n ¼ 1

NSqðIknÞðpiÞ,

where Nk is the number of the images which belong to kth classand NSqðI

knÞðpiÞ is the neural response of the nth sample in the kth

class at pi. Let SwðpiÞ be the within-class scatter, i.e.

SwðpiÞ ¼1

N

XC

k ¼ 1

XNk

n ¼ 1

9NSqðIknÞðpiÞ�mkðpiÞ9:

The total mean of the neural responses of training images at pi isgiven by

mðpiÞ ¼1

N

XN

n ¼ 1

NSqðInÞðpiÞ,

where NSqðInÞðpiÞ is the neural response of the nth training imageat pi. Therefore the between-class scatter of the neural responsesat pi is defined as

SbðpiÞ ¼XC

k ¼ 1

Nk

N9mkðpiÞ�mðpiÞ9:

As mentioned before, our aim is to select the image patches atwhich the between-class scatters of the neural responses is largewhile the within-class scatters are small. Consequently, we cangive a possible criterion for selecting the templates as follows:

SðpiÞ ¼SbðpiÞ

SwðpiÞ: ð4Þ

Note that (4) has been incorporated into feature selection. In mostcases, criterion (4) can achieve the desired results. However, itshould be noted that the ratios of between-class scatters towithin-class scatters are relative values. Consequently, theycannot reflect the absolute differences in between-class scattersas well as within-class scatters. In some cases, the between-classscatter and within-class scatter have the same order of magni-tude. This means that there may be overlaps between differentclasses in such condition. So SbðpiÞ will lead to unreliablejudgement.

It is easy to prove that even if SbðpjÞ=SwðpjÞ4SbðpiÞ=SwðpiÞ, wecannot guarantee that SwðpjÞoSwðpiÞ. That is to say, a larger ratiodoes not always mean a smaller within-class scatter. Fig. 4 shows

0.999965 0.999970 0.999975 0.999980 0.999985 0.999990

0.99955

0.99960

0.99965

0.99970

0.99975

0.99980

0.99985

0.99990

0.99995

class 1

class 2

Fig. 4. The neural responses of artificial data set.


a set of artificial data, where abscissas and ordinates are neuralresponses at two different image patches. Let pa be the imagepatch corresponding to the abscissa, and po be the image patchcorresponding to the ordinate. In this condition, we havethat SwðpaÞ ¼ 3:33481� 10�6, SbðpaÞ ¼ 4:12525� 10�6, SbðpoÞ ¼

6:10402� 10�5 and SwðpoÞ ¼ 7:91333� 10�5, where six signifi-cant figures are retained. It is easy to find that we can correctlyclassify 10 samples into two classes by using the neural responsescorresponding to abscissas. However, we can obtain SbðpaÞ=SwðpaÞ

¼ 1:23703 and SbðpoÞ=SwðpoÞ ¼ 1:29641 respectively. This meansthat criterion (4) is not appropriate for this data set in whichbetween-class scatter and within-class scatter have the sameorder of magnitude. This argument has also been demonstratedon California Institute of Technology (Caltech) face database [28].To solve this problem, we can use within-class scatter as acriterion to measure the discriminative ability of the given imagepatches. So we may minimize the objective function SwðpiÞ, or

maxpi

1

SwðpiÞ: ð5Þ

However (5) does not work when SbðpiÞ is the dominant factor.For presenting a more versatile criterion, we combine with twocriterions mentioned above and maximize (4) and (5) simulta-neously. An adaptive combination is given as following:

maxpi

1

SwðpiÞþ

1

SwðpiÞ�

SbðpiÞ

SwðpiÞ

� �: ð6Þ

The mixture coefficient of (6) is determined by 1=SwðpiÞ. Thelarger the value of 1=SwðpiÞ is, the more useful is the patch. Theadvantage of (6) is that this criterion can be adjusted automati-cally. For two given image patches pi,pj in initial template set, wehave

1

SwðpjÞþ

1

SwðpjÞ�

SbðpjÞ

SwðpjÞ

1

SwðpiÞþ

1

SwðpiÞ�

SbðpiÞ

SwðpiÞ

¼

1þSbðpjÞ

SwðpjÞ

1þSbðpiÞ

SwðpiÞ

�SwðpiÞ

SwðpjÞ: ð7Þ

Eq. (7) means that if

1þSbðpjÞ

SwðpjÞ

1þSbðpiÞ

SwðpiÞ

�SwðpiÞ

SwðpjÞ41,

then we choose pj instead of pi, and vice versa. It is easy to findthat quantity (7) is dominated by SwðpiÞ=SwðpjÞ when the var-iances of ratios of between-class scatters and within-class scattersare small. On the contrary, it is dominated by ratios of between-class scatters and within-class scatters if the differences amongthe within-class scatters are small. Therefore the criterion pre-sented above can be used in various aspects of applications.

Finally, the score of the pi is denoted by

SðpiÞ9SwðpiÞþSbðpiÞ

S2wðpiÞ

: ð8Þ

Here, SðpiÞ is a criterion for template selection.

3.3. The algorithm

Given the above, the template selection algorithm can be givenas follows:

(i)
Selecting the first layer templates(1) An initial template set, Pu, is constructed, where Pu is a
large pool of image patches of size u randomly extractedfrom cropped images that include the instance of theobject to be recognized.

(2) We use the K-means algorithm to cluster all the imagepatches into 9T u9 clusters. Note that the image patches inPu are previously transformed into vectors.

(3) Taking the clustering centers as the first layer templates.

In this way, a small number of templates can be obtained. Inthis step, the size of the template is determined experimen-tally. In general, the first layer templates should be largeenough to include basic elements of the objects, e.g., cornersof the eyes.


(ii)
Selecting the second layer templates
(1) Sample a large number of candidate templates, denotedby piAPv ði¼ 1;2, . . . ,MÞ. They are obtained randomlyfrom the cropped image set C in which the objects fillmost of the images.

(2) Compute the score of each candidate template in Pv by(8). In this case, the label information of the trainingimages is used.

(3) Rank the candidate templates in descending order accord-ing to their scores. Consequently, we rewrite Pv as

Pv9Prv ¼ fp

r1,pr

2, . . . ,prMg,

where pri ði¼ 1;2, . . . ,MÞ is the ith image patch in Pr

v.(4) The ith image patch is selected as a template, if the

correlation coefficients between the ith image patch andthe selected templates are all smaller than a given thresh-old T. The correlation coefficient of two image patchespr

i ,prj APr

v is defined by

Cðpri ,p

rj Þ ¼

~Cðpri ,p

rj Þffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

~Cðpri ,p

ri Þ~Cðpr

j ,prj Þ

q ,

where

~Cðpri ,pr

j Þ ¼

NSqðI1Þðpri Þ

. . .

NSqðINÞðpri Þ

0B@

1CA

T NSqðI1Þðprj Þ

. . .

NSqðINÞðprj Þ

0BB@

1CCA:

It is easy to find that the correlation coefficient is relate tocandidate templates and training images. In addition, Tcan be determined experimentally. The detailed imple-mentation is shown as in Algorithm 1 (see Fig. 5), wherecount is the loop variable and d is the number of the finaltemplates.

Note that the derived kernel on Iv � Iv is the inner productof the first layer neural response of the image patch and thatof the second layer templates. So the second layer templateset is constructed based on the first layer template set T u

obtained in previous step. A small number of the first layertemplates will lead to the information reside in the secondlayer templates missing, while a large number of the firstlayer templates will increase the selectivity of the secondlayer templates. Since the second layer templates are

Fig. 5. The construction of the second layer template set.

represented by the first layer templates in the definition ofthe derived kernel. In practice, the first layer template selec-tion affects the second layer template selection, and we willdemonstrate this in the Section 4.

Remarks

(i)
In the derived kernel, the size of the first layer template issmall. These templates possess features shared by manyobjects. Our aim is to obtain a small number of representativeimage patches. Here K-means algorithm is used. The obtainedtemplates are learned form a large pool of image patches. Inthis way, the obtained template set has less redundancy andis discriminative. We have tested many clustering methodssuch as K-means, spectral clustering and manifold learning.These methods all perform better than random method.Particularly, the K-means algorithm is effective and simple.Note that it is unlikely that the cortex performs K-meansclustering, since there are more plausible models of corticalself-organization [29]. It would be an interesting topic todesign effective clustering method for template selection.
(ii)
As mentioned before, the second layer templates are largeenough to include more discriminative structure. A giventemplate may belong to a certain class. So we make use ofthe label information of the training samples. A feasiblemethod for template selection has been given in Algorithm1. It is easy to find that the template set we constructed issmall and has good discriminative ability. That is to say, thetrade-off between invariance and selectivity can be achieved.Consequently, the selected templates will lead to betterperformance in similarity measure.
3.4. Computational complexity

The derived kernel can be generalized to an n layer architec-ture given by sub-patches v1 � v2 � � � � � vm � � � � � vn ¼ Sq. Inthis case we use the notation Hm ¼Hvm and similarly T m ¼ T vm

[21]. Let Hm be a set of maps from vm to vmþ1. The globaltransformations can be defined recursively. For 1rmrn�1 set

Hgm ¼ fh : vm-Sq9h¼ h0Jh00,h0AHg

mþ1,h00AHmg,

where Hgn contains only the identity fI : Sq-Sqg [21]. Here, we

have v1 ¼ u, v2 ¼ v and v3 ¼ Sq. Then we can estimate thecomputational complexity of the template selection algorithm.

Proposition 1. The time complexity of the template selection

algorithm in derived kernel is

Oð9T 19ðNð9Hg199T 09þ9Hg

29ð9H19þ9Pv9ÞÞþItQ9Pu9ÞÞ, ð9Þ

where 9T 19 is the number of the first layer templates, N is the

number of the training images, 9T 09 is the cost of computing the

initial kernel, It is the number of the iterations in the K-means

algorithm, 9Pu9 is the cardinality of Pu and Q is the cost of

calculating the distance between an image patch and a centroid in

the K-means algorithm.

In addition, the number of operations required to compute thederived kernel is given by

X2

m ¼ 1

ð9Hgm99T m99T m�19þ9Hg

mþ199Hm99T m9Þ: ð10Þ

The proof of Proposition 1 is provided in Appendix A. Obviously,(10) is determined by the numbers of templates and transforma-tions [21]. The derived kernel based on the random extractionmethod requires a larger number of templates. Consequently, the


cost of it increases rapidly when the number of the test imagesincreases. Fortunately, the proposed method can obtain a smallnumber of effective templates. Apart from this, the templateselection process is executed only once in the object recognitiontask. In other words, the cost of selecting the templates canbe ignored when the number of the test images is large. So wecan conclude that the proposed method can accelerate thecalculation.

Table 1Recognition accuracies with different methods on the Yale database.

Method Accuracy (%) 9T v9 9T u9

Fisherfaces 70.96 N/A N/A

Orthogonal Laplacianfaces 78.11 N/A N/A

Random extraction 77.22 105 105

Fisher-based method 84.22 105 105

Our method 86.52 105 105

4. Experimental results and analysis

In this section we demonstrate the effectiveness of theproposed method on four standard databases. These databasesinclude the Yale face database [30], the Caltech face database, theFace Recognition Grand Challenge (FRGC) database [31] and theTU Darmstadt database [32]. In the experiments, all the imagepatches in Pu and Pv come from the cropped images. And thesecropped images are not included in the training or test sets [21].After removing the images from which the templates areobtained, a random subset with l images per class was takenwith labels to form the training set, and the rest of the databasewas considered to be the test set. The test image is identified bythe 1-Nearest Neighbor (1-NN) classifier. The performance of thederived kernel based on different template selection methods isevaluated in the following way. Holding the initial template setsPi

u and Piv (i¼ 1;2, . . . ,w, w is the number of the initial template

sets) fixed, we can obtain averaged accuracy Ai over q randomtraining set and test set splits. Then the performance of the givenmethod is indicate by

A¼1

w

Xw

i ¼ 1

Ai, ð11Þ

where A is the average accuracy over different initial templatesets. The experiments were conducted on a computer with 2 GBRAM and 2.66 GHz Intel Core 2 Duo processor and the code wasimplemented in MATLAB.

4.1. Experiments with the Yale face database

We first evaluate the performance of the proposed method onYale face database consisting of 165 grayscale images of 15individuals with variations in lighting condition and facial expres-sion. The original 243�320 pixels images were scaled to 98�128using Matlab’s imresize function. Fig. 6 displays the sample faceimages from this database.

We randomly selected one image from each individual toconstruct template sets. For each subject, two images werechosen randomly to form the training set (30 images), and therest images were used for testing (120 images). The cardinalities

Fig. 6. Sample images from

of the template sets are shown in Table 1, where 9T u9 and 9T v9denote the cardinalities of T u and T v respectively. In theproposed method, the size of the first layer template was set to9�9 and that of the second layer template was set to 39�39. Inaddition, the cardinalities of Pv and Pu were 450 and 1500,respectively. This experiment was carried out on q¼50 randomsplits of the database. The average accuracies over w¼10 trialsare shown in Table 1, where Fisher-based method selects thesecond layer templates according to (4) and constructs the firstlayer template set using K-means algorithm. As can be seen, ourmethod performs the best. The results indicate that the proposedcriterion is superior to standard criterion described in (4). Theresults also show that the derived kernel based on randomextraction performs comparably to Orthogonal Laplacianfaces[33] when dealing with these images with simple background.In particular, the Fisherfaces performs the worst [30]. The reasonfor this is that Fisherfaces is sensitive to translation. This experi-ment not only demonstrates the effectiveness of the proposedmethod but also the feasibility of using the derived kernel torecognize face from image.

Fig. 7 shows the effect of different template size selections onrecognition accuracy. The results show that the proposed methodconsistently outperforms random method for all template sizes.In practice, the template sizes can be chosen via cross validationor on a separate validation set distinct from the test set [21]. Inour method, the optimal size of the first layer template was 9�9pixels. The one-tail t-tests underline the significance of differ-ences between accuracies. The test statistics are shown in Table 2.Since t0:005ð9Þ ¼ 3:2498, it is easy to find that 9�9 is the bestvalue for the size of the first layer template when the size of thesecond layer template set is fixed to 39�39.

In Fig. 8, we show how the change in the first layer templateset selection affects the performance. In this case, T v in ourmethod was constructed by randomly extracting image patches.The size of the second layer template set, which was set to 375,remained unchanged. We can conclude from Fig. 8 that the firstlayer template selection method proposed is effective.

Furthermore, we compared running times of the two methodsat the same precision level. This experiment was carried out on

Yale face database.

Fig. 7. The comparison of accuracy with different template sizes on the Yale

database.

Table 2The test statistics.

Parameters Test statistics

7�7 & 9�9 3.795

11�11 & 9�9 4.655

13�13 & 9�9 10.44

15�15 & 9�9 21.30

17�17 & 9�9 26.25

50 60 70 80 9065

70

75

80

85

The number of the first layer templates

Rco

gniti

on a

ccur

acy

(%)

Our methodRandom extraction

Fig. 8. The effects of the change in the first layer template set on the performance.

Table 3Running time comparison on the Yale database.

Method Accuracy (%) Time (s) 9T v9 9T u9

Random extraction 78.00 311.31 105 105

Our method 79.50 150.23 20 40

Fig. 9. Recognition accuracy versus the number of the training samples on Yale

database.


one of the training-test splits. The results are shown in Table 3.Note that the data given here are the average results over fiveruns. Evidently, the results in Table 3 illustrate that the templatesets constructed by the proposed method are small. Thus, thederived kernel based on our method is faster.

The performance of the proposed method was also evaluatedby using training sets with different sizes. A random subset withlð ¼ 2;3,4;5Þ images per individual was taken with labels to formthe training set, and the rest of the database was considered to bethe test set. Recognition accuracy versus the number of thetraining samples is shown in Fig. 9 (note that 9T v9¼ 9T u9¼ 105). As can be seen, our method always performs better thanrandom extraction method, and the performance can be improvedwith more training images. Note that the derived kernel based onthe proposed method can achieve high accuracy when thetraining set is small.

4.2. Experiments with the Caltech face database

The Caltech face database consists of 450 color images of 27individuals. This database is substantially more challenging than

the Yale database, since the images demonstrate variations inlighting, expression and background. The size of the images in it is592�896 pixels. We pruned out individuals who have less thanfive photos to maintain uniformity within the database. Thepruned Caltech database is composed of 408 images of 19individuals. Six sample images from the pruned database areshown in Fig. 10. In our experiments, each image was convertedinto grayscale image and resized to 119�180 pixels.

Two images per individual were used to construct templatesets. And five images per subject were extracted randomly toform the training set. Consequently, there were 275 images fortesting. In the experiments, u was set to 15�15 and v was set to39�39. The cardinalities of template sets are given in Table 3.The performance of a pair of template sets (T v and T u) wasmeasured by the average accuracy over 30 training/test splits. Werepeated above process 10 times, and the average accuracies areshown in Table 4. Obviously, the performance of standardcriterion described in (4) is poor. As mentioned in previoussection, criterion described in (4) does not work when imageshave large within class variability. The results also indicate thatthe derived kernel based on the proposed method is effectiveenough to recognize objects in cluttered scenes. In addition, thisexperiment confirms that the derived kernel is robust to transla-tion and illumination variance. The reason for the illuminationinvariance is that the normalized cross correlation is used in thederived kernel.

Fig. 11 shows how the first layer template selection affects thesecond layer template selection, where K is the number of thecluster centers. It is easy to find that setting the size of the firstlayer template set too large or too small will decrease theaccuracy. Our method obtains the best performance whenK¼380. The reason for this phenomenon is that the number ofthe templates can balance the trade-off between the invarianceand selectivity. A small number of the first layer templates willlead to the information reside in the second layer templatesmissing. On the other hand, a large number of the first layertemplates will increase the selectivity.

4.3. Experiments with the FRGC database

In this subsection, we conduct experiments on the FRGC2.0 database which is one of the most challenging face databases.This database contains male and female face images of adultsfrom different races [31]. The subset used in this experimentconsists of 989 face images corresponding to 17 subjects. Theseimages were chosen from a subset called Spring 2003 Session.This subset consists of controlled images and uncontrolledimages. The controlled images were taken in a studio setting.They are full frontal facial images taken under two lightingconditions and with two facial expressions (smiling and neutral).The uncontrolled images were taken in varying illuminationconditions, e.g., hallways, atriums, or outside. These images also

Fig. 10. Sample images from Caltech face database.

Table 4Recognition accuracies with different methods on the Caltech face database.






Our method 84.19 190 380

100 150 200 25070

75

80

85

The number of the second layer templates

Rco

gniti

on a

ccur

acy

(%)

K=50K=380K=500

Fig. 11. The effects of Tu construction on the Tv construction.


contain two expressions, smiling and neutral. The size of thecontrolled images is 2272�1704 pixels and that of the uncon-trolled images is 1704�2272 pixels [31]. Fig. 12 shows someexample images, where the first row is composed of uncontrolledimages and the second row is composed of controlled images. Inthis experiment, the images were resized to one-tenth the size ofthe original images and converted into grayscale images.

In this experiment, 15 images per subject were extracted to formthe training set and 4 images (2 controlled and 2 uncontrolled) perindividual were extracted to construct template sets. The rest imageswere used for testing (666 images). The performance of template sets(T v and T u) was measured by the average accuracy over 30 samplesplits. In this experiment, u was 15�15 and v was 39�39. Theresults shown in Table 5 are averaged over 5 independent runs. Ascan be seen, the proposed method outperforms random extractionmethod and the Fisher-based method, since the number of thetemplates is small. Obviously, the performances of Fisherfaces andOrthogonal Laplacianfaces are very poor.

4.4. Experiments with the TU Darmstadt database

Here, we demonstrate the effectiveness of the proposedmethod in generic object recognition task. Generic object recog-nition is the task of classifying an individual object into a certain

category, thus it is also termed object categorization [34]. Thegeneric object recognition has the difficulty of the intra-class andinter-class variability. In the experiments, the TU Darmstadtdatabase was used [32]. This database has a total of 326 images,115 of which are motorbike images, 100 side views of cars (thereis only one car in each image) and 111 caw images. All cars haveroughly the same scale and occur in the center of the image. Themotorbike images are more varied and include everyday scenes ofpeople riding their motorbikes in cluttered environments. Someexample images from this database are shown in Fig. 13. It isimportant to note that the sizes of the images in this database arenot the same. The derived kernel method can measure thesimilarities between these images, since the dimensionality ofthe neural response is independent of the size of the input image.

Each image was converted into grayscale image and scaled.The final images were 0.5 times the size of original images. Fiveimages per category were used to construct template sets. Andthe number of the templates are given in Table 6. Ten images percategory were randomly extracted to form the training set, andthe remaining images were used as test images. For a pair oftemplate sets, its performance was measured by the averageaccuracy over 30 sample splits. In addition, we had u¼9�9 andv¼25�25. Table 6 shows the average accuracies obtained byrepeating the template selection procedure 10 times. The experi-ment results show that the proposed method is more effective.Since our method can achieve better performance with a smallnumber of templates. The availability of derived kernel in recog-nizing generic object is also proved by this experiment.

Finally, we show how the change in the second layer templateset selection affects the performance. In this case, the first layertemplate set was constructed randomly in the proposed method.The number of the first layer templates was fixed to 75. As shownin Fig. 14, the second layer template selection method proposed ismore effective. We find that the accuracy is lower than that inTable 6. The reason is that the first layer templates are extractedrandomly in this experiment. This also demonstrates the effec-tiveness of the first layer template selection method.

4.5. Discussion

To sum up, these experiments demonstrate that the proposedmethod can lead to more accurate similarity between images. Inthis paper, we make some comparisons between the templateselection method proposed and the state of the art methods.Through extensive experiments, we also find that the derivedkernel has better performance than other traditional methods onthe object recognition task, especially when the number oftraining samples is insufficient. In addition, the optimal size of

Table 5Comparison of recognition accuracy on the FRGC database.





Fisher-based 75.31 550 136

Our method 75.44 400 136

Fig. 13. Example images from TU Darmstadt database.

Fig. 12. Example images from FRGC face database.

Table 6Comparison of recognition accuracy on the TU Darmstadt database.




Our method 80.57 100 70

40 60 80 100 12070

75

80

The number of the second layer templates

Rco

gniti

on a

ccur

acy

(%)

Our methodRandom extraction

Fig. 14. The effects of change in the second layer template set on the performance.


the templates dependent on the particular task at hand. Usuallythe first layer templates should be large enough to include basicelement of the object, and the second layer templates should belarge enough to include more discriminative structure.

5. Conclusion

In this paper, an effective template selection method has beenproposed. It is designed to obtain compact template sets with large

discrimination power. Consequently, the derived kernel based on theproposed method is more suitable for object recognition. We experi-mentally compared the proposed method with the other methods onfour standard databases. The results show that the derived kernel


based on the method proposed can lead to increased recognitionperformance. In addition, we have demonstrated that derived kernelbased on the proposed method is powerful for object recognition incluttered scenes. Note that the proposed method could be incorpo-rated into other state of the art models (i.e., the model of objectrecognition developed at the Center for Biological and ComputationalLearning (CBCL) [1,2]). Although a new template is proposed in thispaper, template selection is still an open problem. We will study thetemplate selection method based on other machine learning methodssuch as manifold learning and non-negative matrix factorization inthe future.

Acknowledgment

This work was supported by the National Basic ResearchProgram of China (973 Program) (Grant no. 2011CB707100), bythe National Natural Science Foundation of China (Grant nos.61075116, 61172143, 11071058), Natural Science Foundation ofHubei Province (No. 2009CDB387, 2010CDA008) and Fundamen-tal Research Funds for the Central Universities (No. 2010ZD038).

Appendix A. Proof of proposition 1

The time complexity of the method is dominated by four parts:carrying out the K-means, computing the neural responses oftraining images at initial templates belonging to Pv, computingthe scores, and removing the redundant image patches.

It is well known that the time complexity of K-means algo-rithm is

OðItQ9Pu99T 19Þ,

where It is the number of the iterations, 9T 19 is the cardinality ofT 1, 9Pu9 is the cardinality of Pu and Q is cost of calculating thedistance between an image patch and a centroid. Note that Q isdetermined by the size of the first layer template.

Ignoring the cost of normalization and of pre-computing theneural responses of the initial templates, the number of requiredoperations for computing neural responses is given by

Nð9Hg199T 199T 09þ9Hg

299H199T 19þ9Hg299Pv99T 19þ9H299Pv9Þ,

where we denote for notational convenience the cost of comput-ing the initial kernel by 9T 09 and N is the number of the trainingimages. Thus, the complexity of it is

OðN9T 19ð9Hg199T 09þ9Hg

29ð9H19þ9Pv9ÞÞÞ:

The time complexities of the third part and the fourth part areOðN9Pv9Þ and Oð9T 299Pv9Þ, respectively. Note that 9T 29 is notlarger than N9Hg

299T 19 in the practical application.Therefore, the time complexity of the proposed method is

given by (9).

References

[1] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, T. Poggio, Robust objectrecognition with cortex-like mechanisms, IEEE Trans. Pattern Anal. Mach.Intell. 29 (2007) 411–426.

[2] T. Serre, L. Wolf, T. Poggio, Object recognition with features inspired by visualcortex, in: Proceedings of IEEE Conference on Computer Vision and PatternRecognition, 2005, pp. 994–1000.

[3] D.G. Lowe, Object recognition from local scale-invariant Features, in: Pro-ceedings of Conference on Computer Vision, 1999, pp. 1150–1157.

[4] H. Li, S. Xu, L. Li, Dim target detection and tracking based on empirical modedecomposition, Signal Process Image Comm. 23 (2008) 788–797.

[5] H. Li, Y. Wei, L. Li, Y.Y. Tang, Infrared moving target detection and trackingbased on tensor locality preserving projection, Infrared Phys. Technol. 53(2010) 77–83.

[6] Y.W. Chen, Y.Q. Chen, Invariant description and retrieval of planar shapesusing Radon composite features, IEEE Trans. Signal Process. 56 (2008)4762–4771.

[7] D. Xu, S. Chang, Video event recognition using kernel methods with multileveltemporal alignment, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 1985–1997.

[8] J. Shen, D. Tao, X. Li, Modality mixture projections for semantic video eventdetection, IEEE Trans. Circuits Syst. Video Techn. 18 (2008) 1587–1596.

[9] X. Li, Y. Pang, Deterministic column-based matrix decomposition, IEEE Trans.Knowledge Data Eng. 22 (2010) 145–149.

[10] Y. Pang, Z. Liu, N. Yu, A new nonlinear feature extraction method for facerecognition, Neurocomputing 69 (2006) 949–953.

[11] X. Tan, S. Chen, Z. Zhou, J. Liu, Face recognition under occlusions and variantexpressions with partial similarity, IEEE Trans. Inf. Forensics Security 4(2009) 217–230.

[12] Y. Tang, L. Li, X. Li, Learning similarity with multikernel method, IEEE Trans.Syst. Man Cybern. B Cybern. 41 (2011) 131–138.

[13] L. Jia, L. Kitchen, Object-based image similarity computation using inductivelearning of contour-segment relations, IEEE Trans. Image Process. 9 (2000) 80–87.

[14] C.Y. Yen, K.J. Cios, Image recognition system based on novel measures ofimage similarity and cluster validity, Neurocomputing 72 (2008) 401–412.

[15] P. Simard, Y. Le Cun, J.S. Denker, Efficient pattern recognition using a newtransformation distance, in: Adv. Neural Inf. Process. Syst., 1992, pp. 50–58.

[16] D.P. Huttenlocher, G.A. Klanderman, W.J. Rucklidge, Comparing images using theHausdorff distance, IEEE Trans. Pattern Anal. Mach. Intell. 15 (1993) 850–863.

[17] M. Zahedi, D. Keysers, T. Deselaers, H. Ney, Combination of tangent distanceand an image distortion model for appearance-based sign language recogni-tion, in: Proceedings of 27th Annual Meeting of the German Association forPattern Recognition, 2005, pp. 401–408.

[18] L. Wang, Y. Zhang, J. Feng, On the Euclidean distance of images, IEEE Trans.Pattern Anal. Mach. Intell. 27 (2005) 1334–1339.

[19] A. Selinger, R.C. Nelson, Improving appearance-based object recognition incluttered backgrounds, in: Proceedings of 15th International Conference onPattern Recognition, 2000, pp. 46–50.

[20] S. Rasool, M. Bell, Biologically inspired processing of Radar waveforms forenhanced delay-doppler resolution, IEEE Trans. Signal Process. 59 (2011)2698–2709.

[21] S. Smale, L. Rosasco, J. Bouvrie, A. Caponnetto, T. Poggio, Mathematics of theneural response, Found. Comput. Math. 10 (2010) 67–91.

[22] M. Riesenhuber, T. Poggio, Hierarchical models of object recognition incortex, Nat. Neurosci. 2 (1999) 1019–1025.

[23] T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, T. Poggio, A Theory ofObject Recognition: Computations and Circuits in the Feedforward Path ofthe Ventral Stream in Primate Visual Cortex, AI Memo 2005-036/CBCL Memo259, Massachusetts Institute of Technology, Cambridge, 2005.

[24] S. Smale, T. Poggio, A. Caponnetto, J. Bouvrie, Derived Distance: Towards aMathematical Theory of Visual Cortex, CBCL Paper, MIT, Cambridge, MA, 2007.

[25] M. Kouh, T. Poggio, A canonical neural circuit for cortical nonlinear opera-tions, Neural Comp. 20 (2008) 1427–1451.

[26] J. Bouvrie, Hierarchical Learning: Theory with Applications in Speech andVision, Ph.D. Thesis, Massachusetts Institute of Technology, 2009.

[27] J. Bouvrie, L. Rosasco, T. Poggio, On invariance in hierarchical models. in: Advancesin Neural Information Processing Systems, vol. 22, 2009, pp. 162–170.

[28] L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from fewtraining examples: an incremental Bayesian approach tested on 101 objectcategories, Comput. Vis. Image Understand. 106 (2007) 59–70.

[29] T. Serre, M. Riesenhuber, J. Louie, T. Poggio, On the role of object-specificfeatures for real world recognition in biological vision, in: Proceedings ofWorkshop Biologically Motivated Computer Vision, 2002, pp. 387–397.

[30] P.N. Bellhumer, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: recogni-tion using class specific linear projection, IEEE Trans. Pattern Anal. Mach.Intell. 19 (1997) 711–720.

[31] P.J. Phillips, P.J. Flynn, T. Scruggs, K.W. Bowyer, J. Chang, K. Hoffman,J. Marques, J. Min, W. Worek, Overview of the face recognition grandchallenge, in: Proceedings of IEEE Conference on Computer Vision andPattern Recognition, 2005, pp. 947–954.

[32] B. Leibe, A. Leonardis, B. Schiele, Combined object categorization andsegmentation with an implicit shape model, in: ECCV Workshop on Statis-tical Learning in Computer Vision, 2004, pp. 17–32.

[33] D. Cai, X. He, J. Han, H.J. Zhang, Orthogonal Laplacianfaces for face recogni-tion, IEEE Trans. Image Process. 15 (2006) 3608–3614.

[34] A. Opelt, A. Pinz, M. Fussenegger, P. Auer, Generic object recognition withBoosting, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 416–431.

Hong Li received MS degrees in mathematics fromHuazhong University of Science and Technology,China, in 1986, and the PhD degree in pattern recogni-tion and intelligence control from Huazhong Univer-sity of Science and Technology, China, in 1999. She iscurrently a Professor in the School of Mathematics andStatistics at Huazhong University of Science and Tech-nology. She visited Chinese Academy of Sciences, HongKong Baptist University and University of Macao. Herresearch interests are approximation theory, computa-tional learning, signal processing and analysis, patternrecognition and wavelet analysis.


Yantao Wei received the BS degree in information and
computing science from Qingdao University of Science and Technology, China, in 2006, and the MS degree incomputational mathematics from Huazhong Univer-sity of Science and Technology, China, in 2008. He iscurrently pursuing PhD degree in control science andengineering at Institute for Pattern Recognition andArtificial Intelligence, Huazhong University of Scienceand Technology, China. His research interests includecomputational intelligence, computer vision, and pat-tern recognition.
Luoqing Li received the BSc degree from Hubei Uni-
versity, the MSc degree from Wuhan University, and the PhD degree from Beijing Normal University, China.He is currently with the Key Laboratory of AppliedMathematics, Hubei Province, and the Faculty ofMathematics and Computer Science, Hubei University,where he became a Full Professor in 1994. His researchinterests include approximation theory, wavelet ana-lysis, learning theory, signal processing, and patternrecognition. Dr. Li has been the Managing Editor of theInternational Journal on Wavelets, Multiresolution,and Information Processing.
Yuan Yuan is currently a Researcher (Full Professor) with the State Key Laboratoryof Transient Optics and Photonics and the Director of the Center for OPTicalIMagery Analysis and Learning (OPTIMAL), Xi’an Institute of Optics and PrecisionMechanics, Chinese Academy of Sciences, China.

Date post:	10-Sep-2016
Category:	Documents
Upload:	hong-li
View:	213 times
Download:	0 times

Similarity learning for object recognition based on derived kernel

Documents