    Generic Instance Search and Re-identification from One Example via Attributes and Categories

    Ran Tao (a, *), Arnold W.M. Smeulders (a), Shih-Fu Chang (b)

    (a) ISIS, University of Amsterdam, Science Park 904, Amsterdam, the Netherlands
    (b) Department of Electrical Engineering, Columbia University, 500 W. 120th St., Mudd 1310, New York, USA

    Abstract

    This paper aims for generic instance search from one example where the instance can be an arbitrary 3D object like shoes, not just near-planar and one-sided instances like buildings and logos. First, we evaluate state-of-the-art instance search methods on this problem. We observe that what works for buildings loses its generality on shoes. Second, we propose to use automatically learned category-specific attributes to address the large appearance variations present in generic instance search. Searching among instances from the same category as the query, the category-specific attributes outperform existing approaches by a large margin on shoes and cars and perform on par with the state-of-the-art on buildings. Third, we treat person re-identification as a special case of generic instance search. On the popular VIPeR dataset, we reach state-of-the-art performance with the same method. Fourth, we extend our method to search objects without restriction to the specifically known category. We show that the combination of category-level information and the category-specific attributes is superior to the alternative method combining category-level information with low-level features such as Fisher vector.

    This technical report is an extended version of our previous conference paper "Attributes and Categories for Generic Instance Search from One Example" (CVPR 2015).

    Keywords: Instance search, Attribute

    1. Introduction

    In instance search, the objective is to retrieve all images of a specific object given a few query examples of that object [1, 2, 3, 4, 5]. We consider the challenging case of only 1 query image, admitting large differences in the imaging angle and other imaging conditions between the query image and the target images. A very hard case is a query specified in frontal view while the relevant images in the search set show a view from the back which has never been seen before. Humans solve the search task by employing two types of general knowledge. First, when the query instance is of a certain class, say a female, answers should be restricted to be from the same class. And, queries in the frontal view showing one attribute, say brown hair, will limit answers to show the same attribute, even when the viewpoint is from the back. In this paper, we exploit these two types of knowledge to handle a wide variety of viewpoints, illumination and other conditions for instance search.

    In instance search, excellent results have been achieved by restricting the search to buildings [1, 6, 7, 8]. Searching buildings can be used in location recognition and 3D reconstruction. Another set of good results has been achieved in searching for logos [9, 10, 11] for the estimation of brand exposure. And, [12] searches for book and magazine covers.

    * Corresponding author. Email address: [email protected] (Ran Tao)

    All these cases of instance search show good results for near-planar, one-sided objects which are recorded under a limited range of imaging angles. In this work, we aim for broader classes of query instances. We aim to perform generic instance search from 1 example. Generic implies we consider arbitrary objects, not just one-sided objects. And, generic implies we aim to use one approach not specially designed for a certain kind of instance, such as RANSAC-based geometric verification for rigid and highly textured objects. In our case, instances can be buildings and logos, but also shoes, clothes and other objects. In this paper, we illustrate on a diverse set of instances, including shoes, cars, buildings and persons. In this work, we treat person re-identification [13] as a special case of generic instance search, and address the problem using the same method as for other kinds of instances.

    The challenge in instance search is to represent the query image invariant to the (unknown) appearance variations of the query while maintaining a sufficiently rich representation to permit distinction from other, similar instances. To solve this, most existing approaches in instance search match the appearance of local spots [14, 15] in the potential target to the query [2, 16, 7, 11, 17]. The quality of the match between two images in these approaches is the sum of similarities over all local descriptor pairs. The difference between the cited approaches lies in the way local descriptors are encoded and in the computation of the similarity. Good performance has been achieved by this paradigm on buildings, logos and scenes from a distance.


    However, when searching for an arbitrary object with a wider range of viewpoint variability, more sides, and possibly having self-occlusion and non-rigid deformation, these methods are likely to fail as local descriptor matching becomes unreliable in these cases [18].

    In this paper we propose to use automatically learned attributes [19, 20] to address generic instance search. Attributes, as higher-level abstractions of visual properties, have been shown advantageous in classification when the training examples insufficiently cover the variations in the original feature space [19, 21, 22], which is surely the case in the challenging one-example setting. By employing attributes, we aim to be robust against intra-instance appearance variations. Further, we optimize the attributes such that they are at the same time discriminative among different instances. Concretely, in this paper, we learn a set of category-specific non-semantic attributes that are optimized to recognize different instances of a certain category, e.g., shoes. With the learned attributes, an instance can be represented as a specific combination of the attributes, and instance search boils down to finding the most similar combinations of attributes.

    In order to address the possible confusion of the query with instances from other categories, we further propose to supplement the learned category-specific attributes with category-level information. The category-level information is incorporated to reduce the search space by filtering out instances of other categories. When there is only 1 query image, it is advantageous to use slightly more user-provided information. In addition to the interactive specification of the object region in the query image, we require the specification of the category the query instance belongs to.

    2. Related work

    Most approaches in instance search rely on gathering matches of local image descriptors [23, 7, 2, 17, 11, 16], where the differences reside in the way the local descriptors are encoded and the matching score of two descriptors is evaluated. Bag-of-words (BoW) [23, 7] encodes a local descriptor by the index of the nearest visual word. Hamming embedding [2] improves upon BoW by adding an extra binary code to better describe the position of the local descriptor in space. The matching score of a pair of descriptors is 1 if they are encoded to the same word and the Hamming distance between their binary signatures is smaller than a certain threshold. VLAD [24] and Fisher vector [25] improve over BoW by representing the local descriptor with an extra residual vector, obtained by subtracting the mean of the visual word or the Gaussian component respectively. In VLAD and Fisher vector, the score of two descriptors is the dot product of the residuals when they are encoded to the same word, and 0 otherwise. [17, 11] improve VLAD and Fisher vector by replacing the dot product by a thresholded polynomial similarity and an exponential similarity respectively, giving disproportionally more credit to closer descriptor pairs. [16] encodes a local descriptor by only considering the directions to the visual word centers, not the magnitudes, outperforming Fisher vector on instance search.

    With these methods, good performance has been achieved on buildings, logos, and scenes from a distance. These instances can be conceived as near-planar and one-sided. For buildings, logos, and scenes from a distance, the variation in the viewing angle is limited to a quadrant of 90 degrees at most out of the full 360-degree circle. For limited variations in viewpoint, matches of local descriptors can be reliably established between the query and a relevant example. In this work, we consider generic instance search, where the instance can be an arbitrary object with a wider range of viewpoint variability and more sides. We evaluate existing methods for approximately one-sided instance search on this problem of generic instance search.

    Attributes [19, 26, 20] have received much attention recently. They are used to represent common visual properties of different objects. Attribute representations have been used for image classification [19, 21, 22]. Attributes have been shown to be advantageous when the training examples insufficiently cover the appearance variations in the original feature space [19, 21]. Inspired by this, we propose to use an attribute representation to address generic instance search, where only 1 example is available and there still exists a wide range of appearance variations.

    Attributes have been used for image retrieval [27, 28, 29, 21, 30]. In [27, 28, 29], the query is defined by textual attributes instead of images and the goal is to return images exhibiting the query attributes. In these references, the query attributes need to be semantically meaningful such that the query can be specified by text. In this work, we address instance search given one query image, which is a different task as the correct answers have to exhibit the same instance (not just the same attributes), and we use automatically learned attributes which as a consequence may or may not be semantic. [21, 30] consider non-semantic attributes for category retrieval, while this work addresses generic instance retrieval.

    The use of category-level information to improve instance search has been explored in [31, 32, 33]. [33] uses category labels to learn a projection to map the original feature to a lower-dimensional space such that the lower-dimensional feature incorporates certain category-level information. In this work, instead of learning a feature mapping, we augment the original representation with additional features to capture the category-level information. In [32], the Fisher vector representation is expanded with the concept classifier output vector of the 2659 concepts from the Large Scale Concept Ontology for Multimedia (LSCOM) [34]. In [31], a 1000-dimensional concept representation [35] is utilized to refine the inverted index on the basis of semantic consistency between images. Both [31] and [32] combine category-level information with a low-level representation. In this work, we consider the combination of category-level information with category-specific attributes rather than a low-level representation. We argue this is a more principled combination as the category-level information by definition makes category-level distinctions and the category-specific attributes are optimized for within-category discrimination.

    Person re-identification is a well-studied topic [13, 36, 37], where the work mainly branches into two aspects, feature design [38, 39, 40] and metric learning [41, 42, 43].


    Among the vast amount of work in the literature, most related to this paper are papers focusing on building a good representation [38, 39, 44, 45, 46, 40, 47, 48, 49]. [38] uses AdaBoost to select features from an ensemble of localized features. [39] encodes the local descriptors using Fisher vector. [44] exploits the symmetry and asymmetry properties of the human body to capture the cues on the human body only, pruning out background clutter. [45] learns human saliency in an unsupervised manner to find reliable and discriminative patches. [46] proposes to learn mid-level patch filters that are viewpoint invariant and discriminative in differentiating identities. [40] employs a salient color names based representation. [47] records the maximal local occurrence of a pattern to achieve invariance to viewpoint changes. [48] simultaneously learns features and a similarity metric using deep learning. [49] proposes to learn a semantic fashion-related attribute representation from auxiliary datasets and adapt the representation to target datasets. In this work, we propose to learn a non-semantic attribute representation without using auxiliary data to handle the large appearance variations caused by viewpoint differences, illumination variations, deformation and others. Furthermore, in this paper, inspired by [50], we treat person re-identification as a special case of the generic instance search problem, where the instance of interest is now a specific person, and address the problem using the same attribute-based approach as for other types of instance search, e.g., shoes and buildings.

    2.1. Contributions

    Our work makes the following contributions. We propose to pursue generic instance search from 1 example where the instance can be an arbitrary 3D object recorded from a wide range of imaging angles. We argue that this problem is harder than the approximately one-sided instance search of buildings [7], logos [9] and remote scenes [2]. We evaluate state-of-the-art methods on this problem. We observe that what works best for buildings loses its generality for shoes and, reversely, what works worse for buildings may work well for shoes.

    Second, we propose to use automatically learned category-specific attributes to handle the wide range of appearance variations in generic instance search. Here we assume we know the category of the query instance, which provides critical knowledge when there is only 1 query image. Information on the query category can be given through an interactive user interface or automatic image categorization (e.g., shoe, dress, etc.). On the problem of searching among instances from the same category as the query, our category-specific attributes outperform existing instance search methods by a large margin when large appearance variations exist.

    Third, inspired by [50], we treat person re-identification as a special case of generic instance search, where the instance of interest is a specific person. On the popular VIPeR dataset [51], we reach state-of-the-art performance with the same attribute-based method.

    As our fourth contribution, we extend our method to search instances without restricting to the known category. We propose to augment the category-specific attributes with category-level information, which is carried by high-level deep learning features learned from large-scale image categorization and by the category-level classification scores. We show that combining category-level information with category-specific attributes achieves performance superior to combining category information with low-level features such as Fisher vector.

    A preliminary version of the paper appeared as [52]. In this paper, we include several new studies. First, we conduct an empirical study of the parameters of the attribute learning method. We also analyze the impact of the underlying features for attribute learning on the search performance. Using multiple features for learning, which as a whole can better capture the various types of visual properties than individual features, we improve the performance over [52] substantially. And we treat person re-identification [13, 36, 37] as another special case of generic instance search where the query is a specific person, using the same attribute-based method. On the popular VIPeR dataset [51], a competitive result is achieved, on par with the state-of-the-art. This demonstrates the generic capability of our attribute-based instance search algorithm.

    3. The difficulty of generic instance search

    The first question we raise in this work is how the state-of-the-art methods perform on generic instance search from 1 example where the query instance can be an arbitrary object. Can we search for other objects like shoes using the same method that has been shown promising for buildings? To that end, we evaluate several existing instance search algorithms on both buildings and shoes.

    We evaluate the following methods. ExpVLAD: [11] introduces locality at two levels to improve instance search from one example. The method considers locality in the picture by evaluating multiple candidate locations in each of the database images. It also considers locality in the feature space by efficiently employing a large visual vocabulary for VLAD and Fisher vector and by an exponential similarity function to give disproportionally high scores to close local descriptor pairs. The locality in the picture was shown effective when searching for instances covering only a part of the image. And the locality in the feature space was shown useful on all the datasets considered in the reference. Triemb: [16] proposes triangulation embedding and democratic aggregation. The triangulation embedding encodes a local descriptor with respect to the visual word centers using only directions, not magnitudes. As shown in the paper, the triangulation embedding outperforms Fisher vector [53]. The democratic aggregation assigns a weight to each local descriptor extracted from an image to ensure all descriptors contribute equally to the self-similarity of the image. This aggregation scheme was shown better than sum aggregation. Fisher: We also consider Fisher vector as it has been widely applied in instance search and object categorization where good performance has been reported [54, 53]. Deep-FC: It has been shown recently that the activations in the fully connected layers of a deep convolutional neural network (CNN) [55] serve as good features for several computer vision tasks [56, 57, 58]. VLAD-Conv: Very recently, [59] proposed to apply VLAD encoding [54] on the output of the convolutional layers of a CNN for instance search.


    Figure 1: (a) Examples of two buildings from Oxford5k, and (b) examples of three shoes from our CleanShoes dataset. There exists a much wider range of viewpoint variability in the shoe images.


    Datasets. The Oxford buildings dataset [7], often referred to as Oxford5k, contains 5062 images downloaded from Flickr. 55 queries of Oxford landmarks are defined, each by a query example. Oxford5k is one of the most popular datasets for instance search, which has been used by many works to evaluate their approaches. Figure 1a shows examples of two buildings from the dataset.

    As a second dataset, we collect a set of shoe images from Amazon¹. It consists of 1000 different shoes and in total 6624 images. Each shoe is recorded from multiple imaging angles including views from the front, back, top, bottom, side and some others. One image of a shoe is considered as the query and the goal is to retrieve all the other images of the same shoe. Although these images have a clean background, as often seen on shopping websites, this is a challenging dataset mainly due to the presence of considerably large viewpoint variations and self-occlusion. We refer to this dataset as CleanShoes. Figure 1b shows examples of three shoes from CleanShoes.

    ¹ The properties are with the respective owners. The images are shown here only for scientific purposes.

    Figure 2: Performance of various state-of-the-art methods for instance search measured in mean average precision (%): ExpVLAD [11], Triemb [16], Fisher [54], VLAD-Conv [59] and Deep-FC [55]. For Fisher vector, we consider two versions. Fisher denotes the version with interest points and SIFT descriptors, and Fisher-D uses densely sampled RGB-SIFT descriptors. ExpVLAD achieves better performance than the others on Oxford5k, but gives the lowest result on CleanShoes. On the other hand, Deep-FC obtains the best performance on CleanShoes, but has a lower result than the others on Oxford5k.

    There is a shoe dataset available, proposed by [60]. However, this dataset is not suited for instance search as it does not contain multiple images of one shoe. [61] also considers shoe images, but those images are well aligned, whereas the images in CleanShoes exhibit a much wider range of viewpoint variations.

    Implementation details. For ExpVLAD, Triemb and Fisher, we use the Hessian-Affine detector [62] to extract interest points. The SIFT descriptors are turned into RootSIFT [6]. The full 128D descriptors are used for ExpVLAD and Triemb, following [11, 16], while for Fisher, the local descriptor is reduced to 64D using PCA, as the PCA reduction has been shown important for Fisher vector [54, 53]. The vocabulary size is 20k, 64 and 256 for ExpVLAD, Triemb and Fisher respectively, following the corresponding references [11, 16, 54]. We additionally run a version of Fisher vector with densely sampled RGB-SIFT descriptors [63] and a vocabulary of 256 components, denoted by Fisher-D. For Deep-FC, we use an in-house implementation of the AlexNet [55] trained on ImageNet categories, and take the ℓ2-normalized output of the second fully connected layer as the image representation. For VLAD-Conv, we apply VLAD encoding with a vocabulary of 100 centers on the conv5_1 responses of the VGGNet [64], following [59]. For Triemb, Fisher, Fisher-D and VLAD-Conv, power normalization [65] and ℓ2 normalization are applied.
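    As an aside on the RootSIFT step, the conversion is simple enough to state exactly: each SIFT descriptor is L1-normalized and then square-rooted element-wise [6]. A minimal NumPy sketch (ours, not the authors' code):

```python
import numpy as np

def root_sift(descriptors, eps=1e-12):
    """Convert SIFT descriptors to RootSIFT [6]:
    L1-normalize each descriptor, then take the element-wise square root."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    l1 = np.abs(descriptors).sum(axis=1, keepdims=True) + eps
    return np.sqrt(descriptors / l1)
```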

    Results and discussions. Figure 2 summarizes the results on Oxford5k and CleanShoes. ExpVLAD adopts a large vocabulary with 20k visual words and the exponential similarity function. As a result, only close local descriptor pairs in the feature space matter in measuring the similarity of two examples. This results in better performance than the others on Oxford5k, where close and relevant local descriptor pairs do exist. However, on the shoe images, where close and true matches of local descriptors are rarely present due to the large appearance variations, ExpVLAD achieves the lowest performance.


    Both Triemb and Fisher obtain quite good results on buildings but the results on shoes are low. This is again caused by the fact that local descriptor matching is not reliable on the shoe images where large viewing angle differences are present. Triemb outperforms Fisher, consistent with the observations in [16]. In this work, we do not consider the RN normalization [16] because it requires extra training data to learn the projection matrix and it does not affect the conclusion we draw here. Fisher-D works better than Fisher on CleanShoes by using color information and densely sampled points. Color is a useful cue for discriminating different shoes, and dense sampling is better than an interest point detector on shoes, which do not have rich textural patterns. However, Fisher-D does not improve over Fisher on Oxford5k. VLAD-Conv is in the middle on both sets. Deep-FC has the lowest performance on buildings, but outperforms the others on shoes.

    Overall, the performance on shoes is much lower than on the buildings. More interestingly, ExpVLAD achieves better performance than the others on Oxford5k, but gives the lowest result on CleanShoes. On the other hand, Deep-FC obtains the best performance on CleanShoes, but has a lower result than the others on Oxford5k. We conclude that none of the existing methods work well on both buildings, as an example of 2D one-sided instance search, and shoes, as an example of 3D full-view instance search.

    4. Attributes for generic instance search

    Attributes, as a higher-level abstraction of visual properties, have been shown advantageous in categorization when the training examples insufficiently cover the appearance variations in the original feature space [19, 21, 22]. In our problem, there is only 1 example available and there still exists a wide range of appearance variations. Can we employ attributes to address generic instance search?

    In the literature, two types of attributes have been studied: manually defined attributes with names [20, 22] and automatically learned unnameable attributes [21, 66]. Obtaining manually defined attributes requires a considerable amount of human effort and sometimes domain expertise, making it hard to scale up to a large number of attributes. Moreover, the manually picked attributes are not necessarily machine-detectable, and not guaranteed to be useful for the task under consideration [21]. On the other hand, learned attributes do not need human annotation and have the capacity to be optimized for the task [21, 66]. For some tasks, like zero-shot learning [22] and image retrieval by textual query [27], it is necessary to use human-understandable attributes with names. However, in instance search given 1 image query, having attributes with names is not really necessary. In this work, we use automatically learned attributes. Specifically, in this section we focus on searching among instances known to be of the same category, using automatically learned category-specific attributes.

    Provided with a set of training instances from a certain category, we aim to learn a list of category-specific attributes and use them to perform instance search on new (unseen) instances from the same category.

    Concretely, given m training images of n objects (m > n, as each object has one or multiple examples), the goal is to learn k attribute detectors. In the search phase, the query image and the dataset images are represented by k-dimensional attribute detection scores, and the search is performed by comparing distances in the k-dimensional feature space, as sketched below.
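    The search phase under these definitions reduces to a nearest-neighbor ranking. The following sketch ranks database images by Euclidean distance in the k-dimensional attribute-score space; the variable names are ours and the Euclidean distance is one reasonable reading, as the text only states that distances in the k-dimensional space are compared.

```python
import numpy as np

def rank_by_attributes(query_scores, db_scores):
    """Rank database images by Euclidean distance to the query in the
    k-dimensional attribute-score space; returns indices, closest first.
    query_scores: (k,) attribute scores of the query image.
    db_scores:    (N, k) attribute scores of the N database images."""
    dists = np.linalg.norm(db_scores - query_scores[None, :], axis=1)
    return np.argsort(dists)
```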

    Analogous to the class-attribute mapping in attribute-based categorization [19, 20, 22], an instance-attribute mapping A ∈ R^{n×k} is designed automatically. The challenge is how to obtain a useful A. As the goal in instance search is to differentiate different instances, the attributes should be able to make distinctions among the training instances. On the other hand, as the attributes will be used later for instance search on new, unseen instances, the learned attributes need to be able to generalize to unseen instances. To that end, visually similar training instances are encouraged to share attributes. Attributes specific to one training instance are less likely to generalize to unknown instances than those shared by several training instances. And sharing needs to be restricted to visually similar training instances, as latent common visual patterns among visually dissimilar instances are less likely to be present and detected on new instances even if they can be learned given a high-dimensional feature space. Besides, to make the best out of the k attributes, it is desirable to have low redundancy among the attributes. Formally, taking the above considerations into account, we design A by

    \[
    \underset{A}{\text{maximize}} \;\; f_1(A) + \lambda f_2(A) + \gamma f_3(A), \tag{1}
    \]

    where f1(A), f2(A) and f3(A) are defined as follows:

    \[
    f_1(A) = \sum_{i,j}^{n} \lVert A_{i\cdot} - A_{j\cdot} \rVert_2^2, \qquad
    f_2(A) = -\sum_{i,j}^{n} S_{ij} \lVert A_{i\cdot} - A_{j\cdot} \rVert_2^2, \qquad
    f_3(A) = -\lVert A^\top A - I \rVert_F^2. \tag{2}
    \]

    A_i·, the i-th row of A, is the attribute representation of the i-th instance. f1(A) ensures instance separability. S in f2(A) is the visual proximity matrix, where S_ij represents the visual similarity between instance i and instance j, measured a priori in a certain visual feature space. The similarity between two training instances is computed as the average similarity between the images of the two instances. f2(A) encourages similar attribute representations between visually similar instances, inducing shareable attributes. f3(A) penalizes redundancy between attributes. λ and γ are two parameters of the objective. A larger λ encourages more attribute sharing among visually similar instances and a larger γ penalizes the redundancy in the learned attributes more. This formulation was originally proposed in [21] for category recognition. Following [21], the optimization problem is solved incrementally by obtaining one column of A, i.e., one attribute, at each step. Next we briefly describe the optimization procedure.

    The objective (Equation 1) can be rewritten as

    \[
    \underset{A}{\text{maximize}} \;\; \mathrm{Tr}(A^\top P A) - \gamma \lVert A^\top A - I \rVert_F^2, \tag{3}
    \]


    Figure 3: Examples of two cars from the dataset Cars.

    where P = Q − λL. Q is an n × n matrix with diagonal elements equal to n − 1 and off-diagonal elements equal to −1. L is the Laplacian of S [67]. Initializing A as an empty matrix, A can be learned incrementally, one column at each step, by

    \[
    \underset{a}{\text{maximize}} \;\; a^\top R a \quad \text{s.t.} \quad a^\top a = 1, \tag{4}
    \]

    where R = P − 2γAA^⊤. The optimal a is the eigenvector of R with the largest eigenvalue. A is updated by A = [A, a] at every step. In this work, each attribute, i.e., a, is binarized during the optimization.
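    For concreteness, the incremental procedure of Equations 1-4 can be sketched in NumPy as below. This is our reading, not the authors' code: S is assumed precomputed (Section 4.1.2 describes it as a mutual 60-NN matrix), and the median-split binarization is a simple stand-in, since the text states each attribute is binarized without fixing the scheme.

```python
import numpy as np

def learn_attribute_mapping(S, k, lam=2.0, gamma=7.0):
    """Incrementally design the instance-attribute mapping A (Eqs. 1-4).
    S: (n, n) visual proximity matrix over the n training instances.
    Returns A of shape (n, k), one attribute per column."""
    n = S.shape[0]
    Q = n * np.eye(n) - np.ones((n, n))   # diagonal n-1, off-diagonal -1
    L = np.diag(S.sum(axis=1)) - S        # graph Laplacian of S
    P = Q - lam * L
    A = np.zeros((n, 0))
    for _ in range(k):
        R = P - 2.0 * gamma * A @ A.T     # objective of Eq. 4
        _, V = np.linalg.eigh(R)          # R is symmetric
        a = V[:, -1]                      # eigenvector with largest eigenvalue
        a = np.where(a >= np.median(a), 1.0, -1.0)  # binarize (our choice)
        A = np.hstack([A, a[:, None]])    # A = [A, a]
    return A
```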

    Attribute detectors. Once the instance-attribute mapping A has been obtained, the next step is to learn the attribute detectors. In this work, the attribute detectors are formulated as linear SVM classifiers. To train the j-th attribute detector, images of the training instances with A_ij > 0 are used as positive examples and the remaining images as negative examples².

    Attribute representation. Given a new image, the attribute representation is generated by applying all the learned attribute detectors and concatenating the SVM classification scores. The attribute representation is discriminative in distinguishing different instances as it is optimized to be so when designing A. The attribute representation is invariant to the appearance variations of an instance as the invariance is built into the attribute detectors, which take all the images of one instance as either all positive or all negative during learning.
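    A sketch of the detector training and the resulting representation, assuming scikit-learn's LinearSVC as the linear SVM (the paper does not name an implementation, and C=1.0 is our placeholder):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_attribute_detectors(X, instance_ids, A):
    """One linear SVM per attribute (column of A).
    X: (m, d) image features; instance_ids: (m,) training-instance index
    per image; A: (n, k) instance-attribute mapping. Images of instances
    with A[i, j] > 0 are positives for detector j, the rest negatives."""
    detectors = []
    for j in range(A.shape[1]):
        labels = (A[instance_ids, j] > 0).astype(int)
        detectors.append(LinearSVC(C=1.0).fit(X, labels))
    return detectors

def attribute_representation(X, detectors):
    """Concatenate the k SVM classification scores per image: (m, k)."""
    return np.stack([d.decision_function(X) for d in detectors], axis=1)
```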

    4.1. Experiments

    4.1.1. Datasets

    Evaluation sets. The category-specific attributes as learned are evaluated on shoes, cars and buildings. For shoes, the dataset CleanShoes described in Section 3 is used. For cars, we collect 1110 images of 270 cars from eBay, denoted by Cars. Figure 3 shows some images of two cars³. For buildings, a dataset is composed by gathering all 567 images of the 55 Oxford landmarks from Oxford5k, denoted by OxfordPure. We reuse the 55 queries defined in Oxford5k.

    ² We have also tried designing the instance-attribute mapping A with continuous values and learning a regressor for each attribute. However, this is not better in terms of instance search performance.

    ³ The properties are with the respective owners. The images are shown here only for scientific purposes.

    Figure 4: The impact of the parameters of the attribute learning algorithm (λ and γ) on the search performance, measured in mean average precision. The experiments are conducted on CleanShoes. When there is no attribute sharing enforced between instances (λ = 0) or there is large redundancy in the learned attributes (γ = 0.01), the search performance is low. This indicates the importance of enforcing attribute sharing and low redundancy. The observation on the impact of λ holds when fixing γ to other values, and the same holds for the observation on γ.

    Training sets. To learn shoe-specific attributes, we collect 2100 images of 300 shoes from Amazon. To train car-specific attributes, we collect 1520 images of 300 cars from eBay. To learn building-specific attributes, we use a subset of the large building dataset introduced in [57]. We randomly pick 30 images per class and automatically select the 300 classes that are most relevant to OxfordPure according to visual similarity. We end up with in total 8756 images, as some URLs are broken and some classes have fewer than 30 examples. For shoes, cars and buildings alike, the instances in the evaluation sets are not present in the training sets.

    4.1.2. Empirical parameter study

    We empirically investigate the effect of the two parameters of the learning algorithm (λ and γ in Equation 1) on the search performance. We learn different sets of category-specific attributes with different λ and γ values and evaluate the instance search performance. The study is conducted on the shoe dataset.

    Fisher vector [53] with densely sampled RGB-SIFT [63] is used as the underlying representation to compute the visual proximity matrix S in Equation 2 and to learn the attribute detectors. S is built as a mutual 60-NN adjacency matrix throughout the paper, as sketched below.
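    A sketch of the mutual k-NN construction as we read it here, assuming a precomputed pairwise similarity matrix with maximal self-similarity; an entry survives only if each instance is among the other's k nearest neighbors:

```python
import numpy as np

def mutual_knn_matrix(sim, k=60):
    """Keep sim[i, j] only when j is among i's k nearest neighbors AND
    i is among j's; all other entries (including the diagonal) are 0."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)          # descending similarity
    nn = order[:, 1:k + 1]                    # k neighbors, self excluded
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[np.repeat(np.arange(n), k), nn.ravel()] = True
    return np.where(is_nn & is_nn.T, sim, 0.0)
```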

    First, we study the effect of λ by fixing γ. An extreme case is setting λ to 0, which means no attribute sharing among training instances. As shown in Figure 4 (left), when λ is 0, the search performance is much worse than when λ is between 1 and 5, especially when the number of attributes is low. When there is no sharing induced, the attributes learned on the training instances cannot generalize well to the new instances in the search set. As long as sharing is enabled, the search performance is robust to the value of λ.

    Second, we study the effect of γ by fixing λ. As can be seen from Figure 4 (right), when γ is small (0.01), which means large redundancy in the learned attributes, the search performance is very low, but it stabilizes once γ is large enough.


                            number   CleanShoes
    Manual attributes       40       18.99
    Learned attributes      40       39.44
    Learned attributes      1000     56.57

    Table 1: Comparison of learned attributes and manually defined attributes on shoe search. The performance is measured in mean average precision (%).

    The above study shows the importance of enforcing attribute sharing and low redundancy during learning, as well as the robustness of the learning algorithm against the values of λ and γ, in terms of the instance search performance. In the rest of the paper, we set λ and γ to 2 and 7 respectively to be consistent with the earlier version of the work [52].

    4.1.3. Comparison with manual attributes

    We compare the learned attributes with manually defined attributes on shoe search. For manually defined attributes, we use the list of attributes proposed by [68]. We manually annotate the same 2100 training images. In the reference, 42 attributes are defined. However, we merge super-high and high of "upper" and "heel height" because it is hard to annotate super-high and high as two different attributes. This results in 40 attributes.

    Again, Fisher vector is used as the underlying representation to learn the attribute detectors. As shown in Table 1, with the same number of attributes, the automatically learned attributes work significantly better than the manual attributes. Moreover, automatically learned attributes are easily scalable, improving performance further. Figure 5 shows four automatically learned attributes. Although the attributes have no explicit names, they do capture common visual properties between shoes.

    4.1.4. Empirical study of underlying feature representation

    In theory, attributes can be learned from any underlying feature representation. In this section, we empirically evaluate the impact of various underlying features for attribute learning on the instance search performance. We consider 5 different feature representations investigated in Section 3, i.e., Triemb, Fisher, Fisher-D, VLAD-Conv and Deep-FC. ExpVLAD is not included as it does not explicitly form a vector representation to facilitate the learning. The proximity matrix S is measured in the same feature space as used for learning the attributes.

    First, we evaluate the attributes learned from single underlying features and compare them with existing approaches. The results are summarized in Table 2. We observe that when the underlying feature representation for attribute learning is based on sparse interest points, including Triemb and Fisher, the learned attribute representation does not always improve the search performance over the original representation. However, when the underlying feature representation is based on densely extracted visual cues, including Fisher-D, VLAD-Conv and Deep-FC, the attribute representation always outperforms the underlying feature representation by a large margin.

    Figure 5: Four automatically designed attributes. Each row is one attribute and the shoes are the ones that have a high response for that attribute. Although the automatically learned attributes have no semantic names, they apparently capture sharing patterns among shoes. The first attribute represents high boots. The second describes high heels. The third is probably about colorfulness. The last one is about openness. The first two are also found among the manually defined attributes while the other two are novel ones discovered automatically.

    This indicates that the mapping from the original feature representation to the attribute representation is selective. It selects the useful information which is discriminative among different instances and invariant to the variations of the same instance, while discarding other disturbing information. We argue that a large amount of useful information has already been filtered out by the internal selection step of the interest point detector, and therefore an attribute representation learned on interest point based features does not help much. The attribute representation learned using VLAD-Conv achieves better performance than those learned from other underlying representations. On the shoe and car datasets, the learned attribute representation significantly outperforms existing approaches. Attributes are superior in addressing the large appearance variations caused by the large imaging angle differences present in the shoe and car images, even though the attributes are learned from other instances. The attribute representation also works well on the buildings. In addition, the attribute representation has a much lower dimensionality than the other representations.

    Second, in Table 3, we investigate the effects of using multiple underlying features for attribute learning. Again, the attribute representation outperforms the underlying feature representation significantly. Comparing Table 3 and Table 2, it is clear that the attribute representation learned on the combination of multiple underlying features outperforms those learned on single features. This demonstrates the advantage of using multiple underlying feature representations, which as a whole can better capture the various types of visual properties than a single representation. Interestingly, combining the same underlying features and directly using them for instance search without attributes does not necessarily improve over individual features, which confirms again the advantage of attributes.


                                                        dim     CleanShoes   Cars    OxfordPure
    Fisher-D [54, 63], VLAD-Conv [59]                   92160   35.64        26.18   71.29
    Fisher-D [54, 63], Deep-FC [55]                     45056   41.55        22.65   69.41
    VLAD-Conv [59], Deep-FC [55]                        55296   36.25        26.25   69.84
    Fisher-D [54, 63], VLAD-Conv [59], Deep-FC [55]     96256   39.04        25.58   71.42
    Attributes(Fisher-D, VLAD-Conv)                     1000    63.97        69.19   83.22
    Attributes(Fisher-D, Deep-FC)                       1000    67.45        59.96   78.66
    Attributes(VLAD-Conv, Deep-FC)                      1000    67.06        69.02   83.75
    Attributes(Fisher-D, VLAD-Conv, Deep-FC)            1000    67.87        71.74   83.06

    Table 3: Performance in mean average precision (%) of combining multiple existing representations (top part of the table) and the attributes learned from multiple underlying features (bottom part). The learned attribute representation significantly outperforms the underlying representation. Comparison with Table 2 shows that the attribute representation learned from multiple underlying features outperforms those learned on single features. Interestingly, combining the same underlying features and directly using them for instance search without attributes does not necessarily improve over individual features.

                                dim     CleanShoes   Cars    OxfordPure
    ExpVLAD [11]                —       16.14        23.70   87.01
    Triemb [16]                 8064    25.06        18.56   75.33
    Fisher [54]                 16384   20.94        18.37   70.81
    Fisher-D [54, 63]           40960   36.27        20.89   67.41
    VLAD-Conv [59]              51200   29.37        27.27   69.05
    Deep-FC [55]                4096    36.73        22.36   59.48
    Attributes(Triemb)          1000    19.83        28.15   71.58
    Attributes(Fisher)          1000    17.67        31.21   69.33
    Attributes(Fisher-D)        1000    56.57        51.11   77.36
    Attributes(VLAD-Conv)       1000    63.19        63.99   82.86
    Attributes(Deep-FC)         1000    57.11        38.07   69.51

    Table 2: Performance in mean average precision (%) of existing methods (top part of the table) and the attributes learned from single underlying features (bottom part). The attributes learned from Fisher-D, VLAD-Conv or Deep-FC outperform existing methods significantly on shoes and cars, and achieve comparable performance on buildings. Attributes learned from the underlying features that densely capture the visual cues (Fisher-D, VLAD-Conv and Deep-FC) are better than those learned from the underlying features based on sparse interest points (Triemb and Fisher).

    The attribute representation learned on the combination of Fisher-D, VLAD-Conv and Deep-FC achieves the best performance on shoes and cars, and close to the best performance on buildings, improving the results reported in the earlier version of the work [52] from 56.57% to 67.87% on CleanShoes, from 51.11% to 71.74% on Cars, and from 77.36% to 83.06% on OxfordPure in mean average precision.

    5. Person re-identification as instance search

    Person re-identification is the problem of identifying the images in a database which depict the same person as in the probe image. The probe image and the relevant images in the database are usually captured by different cameras with different recording settings, causing large viewpoint and illumination variations. Besides, a person might have different poses in different recordings and might be partially occluded. All these result in large intra-person variations, making person re-identification a challenging problem. In this work, we treat person re-identification as a specific person search problem, and address the problem using the attribute-based method presented in Section 4.

    Dataset and evaluation protocol. We use the VIPeR dataset [51]. It has been widely used for benchmark evaluation. It contains 632 pedestrians, each recorded by two cameras. One view is considered as the probe image and the goal is to identify the other view of the same person. The 632 pairs are randomly divided into two halves, one for training and one for testing. The performance is evaluated using the Cumulative Match Characteristic (CMC) curve [51], which estimates the expectation of finding the correct answer in the top k results. The experiment is repeated 10 times to report an average performance⁴.
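    The CMC evaluation itself reduces to counting, for each k, the fraction of probes whose correct match appears within the top k results. A small sketch under that reading (variable names ours):

```python
import numpy as np

def cmc_curve(correct_ranks, max_rank=20):
    """Cumulative Match Characteristic from the 1-based rank of the
    correct gallery image for each probe; returns matching rates for
    k = 1..max_rank."""
    ranks = np.asarray(correct_ranks)
    return np.array([(ranks <= k).mean() for k in range(1, max_rank + 1)])
```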

    Implementation details. 1000 attribute detectors are learned using the training split. To learn the attributes, we employ multiple underlying features.

    ⁴ We use the 10 divisions provided by [50].


    Figure 6: Performance on the VIPeR dataset [51], plotted as matching rate against rank (ranks 1 to 20) for the learned attribute representation, Attributes (CN+CH+HOG+LBP+Fisher-D+Deep-FC+VLAD-Conv), versus the underlying combination CN+CH+HOG+LBP+Fisher-D+Deep-FC+VLAD-Conv. The learned attribute representation significantly outperforms the original underlying representation.

                                     rank=1   rank=5   rank=10   rank=20
    Zheng et al. [50]                30.2     51.6     62.4      73.8
    Ahmed et al. [48]                34.8     —        —         —
    Chen et al. [42]                 36.8     70.4     83.7      91.7
    Shi et al. [49]                  31.1     68.6     82.8      94.9
    Liao et al. [47]                 40.0     —        80.5      91.1
    Paisitkriangkrai et al. [43]     45.9     —        —         —
    Ours                             43.6     71.6     82.2      90.7

    Table 4: Comparison with the state-of-the-art on the VIPeR dataset [51] in correct matching rate (%). Although not specialized for persons, our method keeps up with the state-of-the-art at all ranks.

    We use the bag-of-words histograms on local color histograms (CH), local color naming descriptors (CN), local HOG (HOG) and local LBP descriptors (LBP), provided by [50]⁵. Besides, we employ Deep-FC, Fisher-D and VLAD-Conv. Vocabularies with 16 components and 8 centers are used for Fisher-D and VLAD-Conv respectively. The visual proximity matrix S in Equation 2 is built as a mutual 60-NN adjacency matrix, the same as in the previous sections.

    Results. As shown in Figure 6, the learned attribute representation significantly outperforms the original underlying representation. The learned attributes can handle the large appearance variations well. Table 4 summarizes the comparison with the state-of-the-art. Although the proposed attribute-based method is not specially designed for person re-identification, it achieves good performance, on par with the state-of-the-art.

    6. Categories and attributes for generic instance search

    In this section, we consider searching for an instance in a dataset which contains instances from various categories.

    ⁵ http://www.liangzheng.com.cn/Project/project_fusion.html

    As the category-specific attributes are optimized to make distinctions among instances of the same category, they might not be able to distinguish the instance of interest from instances of other categories. In order to address the possible confusion of the query instance with instances from other categories, we propose to also use category-level information.

    Ideally one could first categorize all the images in the database and then search using category-specific attributes among the images from the same category as the query. However, as errors made in categorization are irreversible, we choose to avoid explicit binary classification and instead augment the attributes with category-level information.

    We consider two ways to capture the category-level information. First, we adopt the 4096-dimensional output of the second fully connected layer of a CNN [55] as an additional feature, as it has been shown that the activations of the top layers of a CNN capture high-level category-related information [69]. The CNN is trained using ImageNet categories. Second, we build a general category classifier to alleviate a potential problem of the deep learning feature, namely that it may bring in examples that have common elements with the query instance even if they are irrelevant, such as skin for shoes. Combining the two types of category-level information with the category-specific attributes, the similarity between a query q and an example d in the search set is computed by

    \[
    S(q, d) = S_{\text{deep}}(q, d) + S_{\text{class}}(d) + S_{\text{attr}}(q, d), \tag{5}
    \]

    where S_deep(q, d) is the similarity of q and d in the deep learning feature space, S_class(d) is the classification response on d, and S_attr(q, d) is the similarity in the attribute space. The three scores are normalized to [0, 1].
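    A minimal sketch of this fusion, assuming min-max normalization as the mapping to [0, 1] (the paper does not specify the normalization; function names are ours):

```python
import numpy as np

def minmax(scores, eps=1e-12):
    """Linearly rescale a score vector to [0, 1]."""
    return (scores - scores.min()) / (scores.max() - scores.min() + eps)

def fused_similarity(s_deep, s_class, s_attr):
    """Eq. 5: sum of the three per-image score vectors over the search
    set, each normalized to [0, 1] before adding."""
    return minmax(s_deep) + minmax(s_class) + minmax(s_attr)
```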

    Datasets. We evaluate on shoes. A set of 15 shoes and in total 59 images is collected from two fashion blogs⁶. These images are recorded in streets with cluttered background, different from the 'clean' images in CleanShoes. We consider one image of a shoe as the query and aim to find the other images of the same shoe. The shoe images are inserted into the test and validation parts of the Pascal VOC 2007 classification dataset [70]. The Pascal dataset provides distractor images. We refer to the dataset containing the shoe images plus distractors as StreetShoes. Figure 7 shows two examples. To learn the shoe classifier, we use the 300 'clean' shoes used for attribute learning in Section 4 as positive examples and consider the training part of the Pascal VOC 2007 classification dataset as negative examples.

    Implementation details. As there is 1 query image, by manual annotation we only consider the object region to ensure the target is clear. It is worthwhile to mention that although only the object part in the query image is considered, we cannot completely get rid of skin for some shoes, as shown in Figure 7. We use selective search [71] to generate many candidate locations in each database image and search over these local objects in the images, as in [11]. We adopt a short representation with 128 dimensions.

    ⁶ http://www.pursuitofshoes.com/ and http://www.seaofshoes.com/. The properties are with the respective owners. The images are shown here only for scientific purposes.



    Figure 7: Examples of two shoes from StreetShoes (query and target images). As there is only 1 query example, by manual annotation we only consider the object region to ensure the object to search for is clear, as shown in the second column. The goal is to retrieve from an image collection the target images which depict the same shoe. Note the large differences in scale and viewpoint between the query and target images.

                                      StreetShoes
    Deep (128D)                       21.68
    Fisher (128D)                     9.38
    Attributes (128D)                 3.10
    Deep + Fisher                     19.76
    Deep + Attributes                 18.43
    Deep + Classifier + Fisher        22.70
    Deep + Classifier + Attributes    30.45

    Table 5: Performance in mean average precision (%) on StreetShoes. The proposed method of combining the category-specific attributes with two types of category-level information outperforms the combination of category-level information with Fisher vector.

    Specifically, we reduce the dimensionality of the deep learning features and the attribute representations with PCA. And for the Fisher vectors, we adopt the whitening technique proposed in [72], proven better than PCA. We reuse the attribute detectors from Section 4.

    Results and discussions. The results are shown in Table 5. On StreetShoes, the proposed method of combining category-specific attributes with two types of category-level information achieves the best performance, 30.45% in mean average precision. We observe that when considering deep features alone as the category-level information, the system brings in many examples of skin. The shoe classifier trained on clean shoe images helps eliminate these irrelevant examples. We conclude that the proposed method of combining the category-specific attributes with two types of category-level information is effective, outperforming the combination of category-level information with Fisher vector. Figure 8 shows the search results of three query instances returned by the proposed method, two success cases and a failure case.

    7. Conclusion

    In this paper, we pursue generic instance search from 1 example. Firstly, we evaluate existing instance search approaches on the problem of generic instance search, illustrated on buildings and shoes, two contrasting categories of objects. We observe that what works for buildings does not necessarily work for shoes. For instance, [11] employs large visual vocabularies and the exponential similarity function to emphasize close matches of local descriptors, resulting in a large improvement over other methods when searching for buildings. However, the same approach achieves the worst performance when searching for shoes. The reason is that for shoes, which have a much wider range of viewpoint variability and more sides than buildings, matching local descriptors precisely between two images is not reliable.

    Secondly, we propose to use category-specific attributes to handle the large appearance variations present in generic instance search. We assume the category of the query is known, e.g., from user input. When searching among instances from the same category as the query, attributes outperform existing approaches by a large margin on shoes and cars, at the expense of knowing the category of the instance and learning the attributes. For instance search from only one example, it may be reasonable to use more user input. On the building set, the category-specific attributes obtain comparable performance.

Thirdly, we consider person re-identification as a special case of generic instance search where the query is a specific person. We show the same attribute-based approach achieves competitive performance, on par with the state-of-the-art in person re-identification.

Fourthly, we consider searching for an instance in datasets containing instances from various categories. We propose to use category-level information to address the possible confusion of the query instance with instances from other categories. We show that combining the category-level information carried by deep learning features and the categorization scores with the learned category-specific attributes outperforms combining the category information with Fisher vector.



Figure 8: Search results of three query instances: two success cases (the first two) and a failure case (the third one). Only the segment is used as the query. The first instance has 5 relevant images in the search set, and 4 of them are returned in the top 5 positions. The second instance has only 1 relevant example in the search set, and it is returned at the first position. The instance at the bottom has 3 relevant images, and none of them are returned in the top 5. It is a very hard case, as the shoe is only partially visible and the majority of the query segment covers the bare feet. Images of barefoot people appear in the top results. The correct images are ranked at positions 18, 21 and 53, and they are in fact retrieved based on the wrong information.


Returning to the experiments using attributes alone, the same proposed method achieves 67.87% in mean average precision (mAP) on CleanShoes for shoe search (Table 3), 71.74% in mAP on Cars for car search (Table 3), 83.06% in mAP on OxfordPure for building search (Table 3) and 43.6% in matching rate at rank 1 on VIPeR for person search (Table 4), while the best performances of existing methods are 36.73% (Table 2), 27.27% (Table 2), 87.01% (Table 2) and 45.9% (Table 4) respectively. The method is indeed generic for instance search.

    Acknowledgments

This research is supported by the Dutch national program COMMIT/.

    References

[1] R. Arandjelović, A. Zisserman, Multiple queries for large scale specific object retrieval, in: British Machine Vision Conference, 2012.

[2] H. Jégou, M. Douze, C. Schmid, Hamming embedding and weak geometric consistency for large scale image search, in: European Conference on Computer Vision, 2008.

[3] P. Over, G. Awad, J. Fiscus, G. Sanders, B. Shaw, TRECVID 2012 - an introduction of the goals, tasks, data, evaluation mechanisms and metrics, in: The TREC Video Retrieval Evaluation (TRECVID), 2012.

[4] D. Qin, S. Gammeter, L. Bossard, T. Quack, L. Van Gool, Hello neighbor: accurate object retrieval with k-reciprocal nearest neighbors, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[5] C.-Z. Zhu, H. Jégou, S. Satoh, Query-adaptive asymmetrical dissimilarities for visual object retrieval, in: International Conference on Computer Vision, 2013.

[6] R. Arandjelović, A. Zisserman, Three things everyone should know to improve object retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[7] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[8] E. Gavves, C. G. M. Snoek, A. W. M. Smeulders, Visual synonyms for landmark image retrieval, Computer Vision and Image Understanding 116 (2) (2012) 238–249.

[9] A. Joly, O. Buisson, Logo retrieval with a contrario visual query expansion, in: ACM Multimedia Conference, 2009.

[10] J. Revaud, M. Douze, C. Schmid, Correlation-based burstiness for logo retrieval, in: ACM Multimedia Conference, 2012.

[11] R. Tao, E. Gavves, C. G. M. Snoek, A. W. M. Smeulders, Locality in generic instance search from one example, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[12] X. Wang, M. Yang, T. Cour, S. Zhu, K. Yu, T. X. Han, Contextual weighting for vocabulary tree based image retrieval, in: International Conference on Computer Vision, 2011.

[13] S. Gong, M. Cristani, S. Yan, C. C. Loy, Person re-identification, Vol. 1, Springer, 2014.

[14] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.

[15] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool, Speeded-up robust features (SURF), Computer Vision and Image Understanding 110 (3) (2008) 346–359.

[16] H. Jégou, A. Zisserman, Triangulation embedding and democratic aggregation for image search, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[17] G. Tolias, Y. Avrithis, H. Jégou, To aggregate or not to aggregate: Selective match kernels for image search, in: International Conference on Computer Vision, 2013.

[18] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (10) (2005) 1615–1630.

[19] A. Farhadi, I. Endres, D. Hoiem, D. Forsyth, Describing objects by their attributes, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[20] C. H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[21] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, S.-F. Chang, Designing category-level attributes for discriminative visual recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[22] Z. Akata, F. Perronnin, Z. Harchaoui, C. Schmid, Label-embedding for attribute-based classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[23] J. Sivic, A. Zisserman, Video Google: A text retrieval approach to object matching in videos, in: International Conference on Computer Vision, 2003.

[24] H. Jégou, M. Douze, C. Schmid, P. Pérez, Aggregating local descriptors into a compact image representation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[25] F. Perronnin, Y. Liu, J. Sánchez, H. Poirier, Large-scale image retrieval with compressed Fisher vectors, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[26] V. Ferrari, A. Zisserman, Learning visual attributes, in: Conference on Neural Information Processing Systems, 2008.

[27] B. Siddiquie, R. S. Feris, L. S. Davis, Image ranking and retrieval based on multi-attribute queries, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[28] A. Kovashka, D. Parikh, K. Grauman, WhittleSearch: Image search with relative attribute feedback, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[29] F. X. Yu, R. Ji, M.-H. Tsai, G. Ye, S.-F. Chang, Weak attributes for large-scale image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[30] M. Rastegari, A. Farhadi, D. Forsyth, Attribute discovery via predictable discriminative binary codes, in: European Conference on Computer Vision, 2012.

[31] S. Zhang, M. Yang, X. Wang, Y. Lin, Q. Tian, Semantic-aware co-indexing for image retrieval, in: International Conference on Computer Vision, 2013.

[32] M. Douze, A. Ramisa, C. Schmid, Combining attributes and Fisher vectors for efficient image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[33] A. Gordo, J. A. Rodríguez-Serrano, F. Perronnin, E. Valveny, Leveraging category-level labels for instance-level image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[34] M. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, J. Curtis, Large-scale concept ontology for multimedia, IEEE MultiMedia 13 (3) (2006) 86–91.

[35] Large scale visual recognition challenge, http://www.imagenet.org/challenges/LSVRC/2010 (2010).

[36] A. Bedagkar-Gala, S. K. Shah, A survey of approaches and trends in person re-identification, Image and Vision Computing 32 (4) (2014) 270–286.

[37] R. Vezzani, D. Baltieri, R. Cucchiara, People reidentification in surveillance and forensics: A survey, ACM Computing Surveys (CSUR) 46 (2) (2013) 29.

[38] D. Gray, H. Tao, Viewpoint invariant pedestrian recognition with an ensemble of localized features, in: European Conference on Computer Vision, 2008.

[39] B. Ma, Y. Su, F. Jurie, Local descriptors encoded by Fisher vectors for person re-identification, in: European Conference on Computer Vision Workshops, 2012.

[40] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, S. Z. Li, Salient color names for person re-identification, in: European Conference on Computer Vision, 2014.

[41] M. Hirzer, P. M. Roth, M. Köstinger, H. Bischof, Relaxed pairwise learned metric for person re-identification, in: European Conference on Computer Vision, 2012.

[42] D. Chen, Z. Yuan, G. Hua, N. Zheng, J. Wang, Similarity learning on an explicit polynomial kernel feature map for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[43] S. Paisitkriangkrai, C. Shen, A. v. d. Hengel, Learning to rank in person re-identification with metric ensembles, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[44] L. Bazzani, M. Cristani, V. Murino, Symmetry-driven accumulation of local features for human characterization and re-identification, Computer Vision and Image Understanding 117 (2) (2013) 130–144.

[45] R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[46] R. Zhao, W. Ouyang, X. Wang, Learning mid-level filters for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[47] S. Liao, Y. Hu, X. Zhu, S. Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[48] E. Ahmed, M. Jones, T. K. Marks, An improved deep learning architecture for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[49] Z. Shi, T. M. Hospedales, T. Xiang, Transferring a semantic representation for person re-identification and search, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[50] L. Zheng, S. Wang, L. Tian, F. He, Z. Liu, Q. Tian, Query-adaptive late fusion for image search and person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[51] D. Gray, S. Brennan, H. Tao, Evaluating appearance models for recognition, reacquisition, and tracking, in: IEEE International Workshop on Performance Evaluation for Tracking and Surveillance, 2007.

[52] R. Tao, A. W. M. Smeulders, S.-F. Chang, Attributes and categories for generic instance search from one example, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[53] J. Sánchez, F. Perronnin, T. Mensink, J. Verbeek, Image classification with the Fisher vector: theory and practice, International Journal of Computer Vision 105 (3) (2013) 222–245.

[54] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, C. Schmid, Aggregating local image descriptors into compact codes, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9) (2012) 1704–1716.

[55] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: Conference on Neural Information Processing Systems, 2012.

[56] A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014.

[57] A. Babenko, A. Slesarev, A. Chigorin, V. Lempitsky, Neural codes for image retrieval, in: European Conference on Computer Vision, 2014.

[58] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[59] J. Y.-H. Ng, F. Yang, L. S. Davis, Exploiting local features from deep networks for image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015.

[60] T. L. Berg, A. C. Berg, J. Shih, Automatic attribute discovery and characterization from noisy web data, in: European Conference on Computer Vision, 2010.

[61] X. Shen, Z. Lin, J. Brandt, Y. Wu, Mobile product image search by automatic query object extraction, in: European Conference on Computer Vision, 2012.

[62] M. Perdoch, O. Chum, J. Matas, Efficient representation of local geometry for large scale object retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[63] K. van de Sande, T. Gevers, C. G. M. Snoek, Evaluating color descriptors for object and scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9) (2010) 1582–1596.

[64] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.

[65] F. Perronnin, J. Sánchez, T. Mensink, Improving the Fisher kernel for large-scale image classification, in: European Conference on Computer Vision, 2010.

[66] V. Sharmanska, N. Quadrianto, C. H. Lampert, Augmented attribute representations, in: European Conference on Computer Vision, 2012.


[67] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (4) (2007) 395–416.

[68] J. Huang, S. Liu, J. Xing, T. Mei, S. Yan, Circle & search: Attribute-aware shoe retrieval, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 11 (1) (2014) 3.

[69] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conference on Computer Vision, 2014.

[70] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[71] J. R. R. Uijlings, K. van de Sande, T. Gevers, A. W. M. Smeulders, Selective search for object recognition, International Journal of Computer Vision 104 (2) (2013) 154–171.

[72] H. Jégou, O. Chum, Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening, in: European Conference on Computer Vision, 2012.




