BLasso for object categorization and retrieval: Towards interpretable visual models


Pattern Recognition 45 (2012) 2377–2389


Ahmed Rebai *, Alexis Joly, Nozha Boujemaa

INRIA Rocquencourt, Domaine de Voluceau, Rocquencourt - B.P. 105, Le Chesnay 78153, France

Article info

Article history:

Received 10 November 2010

Received in revised form 26 July 2011

Accepted 26 November 2011

Available online 13 December 2011

Keywords:

Object categorization

Feature selection

Boosting

Lasso

Sparsity

Interpretability

0031-3203/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.

doi:10.1016/j.patcog.2011.11.022

* Corresponding author. Tel.: +33 139635196; fax: +33 139635674.

Abstract

We propose a new supervised object retrieval method based on the selection of local visual features learned with the BLasso algorithm. BLasso is a boosting-like procedure that efficiently approximates the Lasso path through backward regularization steps. The advantage compared to a classical boosting strategy is that it produces a sparser selection of visual features. This allows us to improve the efficiency of the retrieval and, as discussed in the paper, it facilitates human visual interpretation of the models generated. We carried out our experiments on the Caltech-256 dataset with state-of-the-art local visual features. We show that our method outperforms AdaBoost in effectiveness while significantly reducing the model complexity and the prediction time. We discuss the evaluation of the visual models obtained in terms of human interpretability.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Object recognition in still images has been widely studied in the field of computer vision [1–3]. Tasks may involve classification, retrieval and even detection. Classification, also called object categorization, consists in classifying images into categories by giving them the labels of the objects they contain. For each object category, images are attributed the value 1 or 0 depending on whether or not they contain the object. Following the same idea, we can perform object retrieval. Users can query the system to retrieve images which contain the objects they are looking for. This should be based upon an efficient and fast index structure to ensure a reasonable response time, particularly when dealing with large databases and/or complex models. The detection task is trickier because it must answer the two following questions: how many objects are there, and where are they? One may indicate a bounding box in which the object is localized. A better way is to specify a segmented region which defines the boundaries of the object itself.

Object recognition involves many challenges, starting with the definition of the training dataset and the choice of the visual descriptors, moving to the way the computer learns and the construction of a reliable classifier. Some researchers prefer to use a bottom-up approach, tracing the very low-level information in the signal and trying to interpret it so as to get powerful models. Others find that it is more intuitive to use a top-down approach [4–6]. They try to characterize the signal by exploiting their knowledge of what the objects are. This raises issues about visual stimulus and how we humans recognize things [7]. For example, is the contextual information always useful? Does it help recognition when scale change occurs or does it make the learning error-prone in the presence of occlusion and clutter? As a rule of thumb, using varied backgrounds during the training improves the generalization ability of the classifier [8]. From now on, we will place ourselves in the context of weak learning. A training image is labeled as a whole sample. It takes the label +1 if it contains the object and −1 if not. This is also known as multiple instance learning. It deals with uncertainty of instance labels. An image is viewed as a bag of multiple features which are the local visual signatures. The bag has only one label according to whether or not it includes at least one positive instance. It follows that the only certainty concerns negative bags, which contain no objects. Using a weak learning approach also makes it possible to train on images without much knowledge about the objects inside, so there is no need to construct a ground truth per object location.

An object is viewed as a tangible concept. We believe that good recognition comes with a good description, especially one that uses multiple criteria such as shape, texture, scale and color. Image descriptors are indeed the raw material and the basic data for learning. In order to cover the difference in the nature of the objects to be learned and at the same time the intra-class variability of the same object, a multiple description scheme is needed. It is then up to the learning algorithm to choose a descriptor or a combination of many descriptors that best suits a given category. Interpretability could derive from this fact. Interpretability aims to reduce the semantic gap that exists between human knowledge and the computational representation of the models learned. Computer models are usually too abstract for users to understand where bad results come from. By generating interpretable models, we create a link between the numerical representation of objects and our visual representation. Not only does interpretability enhance our understanding of the results produced, but it is also a very effective tool for user interactivity. It allows users to comprehend what the generic model is composed of and to choose, in different situations, the visual patches that best match their needs. Therefore, interpretability is a means to achieve genericity. Users can perform object retrieval in large database collections which may contain heterogeneous data from different sources.

The paper provides four main contributions. First of all, we define image features that can be easily interpreted in order to help produce sparse models. Second, we apply the idea of the Lasso technique [9] to a multi-instance learning scheme through the use of a modified version of BLasso. The third contribution consists in applying the principle of the Lasso to a discriminative approach for categorical object classification and retrieval. The last contribution concerns the models generated. They are interpretable and flexible, thus allowing for user interactivity.

The next section briefly reviews related works. Then, in Section 3, we describe the algorithm in more detail. After that, in Section 4, we give an overview of the models produced. Section 5 presents the experiments and discusses the results obtained. Finally, we set out our conclusions in Section 6.

2. Related works

This paper proposes a new supervised object retrieval method based on visual local features selection learned with a modified version of the BLasso algorithm. We demonstrate that our method gives equivalent or better prediction performance than AdaBoost while simplifying the object class model. BLasso is an innovative machine learning algorithm which efficiently constrains the loss function. BLasso was introduced by Zhao and Yu [10]. It mixes two successful learning methods: boosting and Lasso. The boosting mechanism was proposed by Schapire [11] in 1990. Since then, many algorithms have emerged [12–16] and boosting has become one of the most successful machine learning techniques. The underlying idea is to combine many weak classifiers, called hypotheses, in order to obtain one final "strong" classifier. Boosting is an additive model which builds up one hypothesis after another by re-weighting the data for the next iteration: the weights of misclassified images are increased and those of well classified ones are decreased. This concept helps to generate different hypotheses, putting emphasis on misclassified examples, typically those located near the decision boundary in the feature space. In addition to that, boosting is able to build a model containing hypotheses of different natures in one learning stage. That is, the feature selection mechanism can process features which belong to different image descriptors. By the term "feature selection", we mean the process of selecting the most discriminant local signatures of the image.

Boosting has been considered as a stagewise gradient descent method on an empirical cost function; in particular, AdaBoost uses the exponential loss [13,17]. Although it is an intuitive algorithm, boosting may overfit the training data, particularly when it runs for a large number of iterations T on high dimensional and noisy data [17,18]. Moreover, a large value of T implies a long prediction time. On the other hand, setting T to a small value may lead to underfitting. In that case, the model may be non-discriminant, inconsistent and might not cover the variability inside the category itself. The boosting procedure can also be qualified as oblivious as it always functions in a forward manner aiming to minimize the empirical loss. Although the concept of re-weighting is interesting, at an iteration t+1 we have no idea whether the t previously generated hypotheses are good enough or not with respect to the model complexity.

Tibshirani observed that the ordinary least squares minimization technique is not always satisfactory since the estimates often have a low bias but a large variance. In 1996, he introduced the Lasso [9], which shrinks some coefficients and sets others to zero. Lasso stands for least absolute shrinkage and selection operator. The idea has two goals: first, to gain interpretability by focusing on relevant predictors and, second, to improve the prediction accuracy by reducing the variance of the predicted values. Lasso minimizes the L2 loss function penalized by the L1 norm on the parameters. This is a quadratic programming problem with linear inequality constraints and it is intractable when the vector of parameters is very large.

In the literature, some efficient methods have been proposed to solve the exact Lasso, namely the least angle regression by Efron et al. [19] and the homotopy method by Osborne et al. [20]. These methods were developed specifically to solve the least squares problem (i.e. using the L2 loss). They work well when the number of predictors is small. However, they are not adapted to nonparametric and classification tasks. The advantage of BLasso (Boosted Lasso) lies in its ability to function with an infinite number of predictors and with various loss functions. Unlike standard boosting, and in order to approximate Lasso solutions, BLasso adds a backward step after each iteration of boosting. Thus, one is able to build up solutions in a coordinate descent manner and then look back at the consistency of these solutions with respect to the model complexity. It has been demonstrated [10] that BLasso solutions converge to the Lasso path, hence favoring sparsity.

In this paper, we use image features that are easily understandable by humans to help produce sparse models. Sparsity is preferable because it reduces the model complexity and subsequently the prediction time. Moreover, the features used are mapped to their exact geometric locations in the training images. Therefore, the models generated represent real entities of what is described and they are not a vague approximation of the image content, as is usually the case with discriminative training. Our choice here allows the learning algorithm to concentrate on the most useful parts of the object. Furthermore, using a multi-instance approach shares an advantage of unsupervised learning, where the algorithm tries to find hidden structure and relations between data. It indeed gives more freedom to the algorithm to select background features whenever they turn out to be useful to characterize the category. In addition to that, the models generated are extensible if we ever want to use additional training data. They are also shrinkable and can be modified according to the needs of a human operator. Users can query the retrieval engine using only the visual features that they think are the best for their purpose.

3. Multiple-instance learning with the BLasso mechanism

Boosted Lasso is a machine learning tool which generates sparse models. Producing sparser solutions helps researchers and users to understand what the model is composed of, but this is not guaranteed unless each individual feature contributing to the final model is interpretable. We begin by presenting the image representation chosen.

3.1. Image representation

Fig. 1. Image representation.

Most recent and effective recognition techniques [21,22,16] are based on classifiers learned on high-dimensional representations embedding large numbers of local visual features. A feature is a vector signature characterizing a local area of the image. From this point of view, an image is considered to be a very large vector $V$ of variables $(F_k)_{1 \le k \le M}$ (cf. Fig. 1). The distance between any feature $F_{ki}$ belonging to the image $I_i$ and any image $I_j$ of the training set is defined by

$$d(F_{ki}, I_j) = \min_{1 \le k' \le M_j} d(F_{ki}, F_{k'j})$$

For each $F_{ki}$, the images $I_j$ are then sorted by increasing distance and the best threshold $r_k$ on the distance is chosen according to the classification error. Image retrieval is then performed by means of range queries on the selected features of the model. A matching kernel is applied afterwards. Note that the vector $V$ may contain heterogeneous features, that is, features obtained with different types of description and/or with different dimensions, thus resulting in various feature spaces. Even so, the principle described above still applies.
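The feature-to-image distance and the subsequent ranking of training images can be sketched as follows. This is only an illustration under stated assumptions: the Euclidean metric (`math.dist`) and the function names `feature_to_image_distance` and `rank_images` are choices made for the example, since this section does not fix the underlying metric, and all features are assumed to share a single description type.

```python
import math

def feature_to_image_distance(feature, image_features):
    """Distance from a local feature to an image: distance to the image's closest feature."""
    return min(math.dist(feature, other) for other in image_features)

def rank_images(feature, images):
    """Return (image indices sorted by increasing distance, list of distances)."""
    distances = [feature_to_image_distance(feature, feats) for feats in images]
    order = sorted(range(len(images)), key=lambda j: distances[j])
    return order, distances

# Toy 2-D signatures; each image is a bag of local features (assumed data).
images = [
    [(0.0, 0.0), (5.0, 5.0)],  # image 0 contains the query feature itself
    [(1.0, 1.0)],              # image 1
    [(3.0, 4.0), (0.0, 1.0)],  # image 2
]
order, distances = rank_images((0.0, 0.0), images)
print(order)  # images from closest to farthest
```

A threshold on the sorted distances then separates the images considered to match the feature from the others.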

For the sake of interpretability, the image representation is important because it is the key to building the components of the model. Take for example a bag-of-features representation: it is hardly possible to visualize what is being learned. Indeed, the final model is composed of centers of clusters in a high dimensional space and we have no idea whether these centers are related to tangible parts of the objects (i.e. eye, tooth, finger, etc.) or if they are just a numerical representation of a random combination of some of these parts. On the other hand, the learning algorithm also plays a central role in allowing us to comprehend the object model. Methods based on kernels like SVM cannot be easily interpreted visually.

3.2. Training with BLasso

BLasso has the same advantages as boosting. In fact, it can deal with a large number of predictors and different loss functions. It can also perform variable selection given multiple image descriptors. The success of this algorithm comes from its capacity to converge to Lasso. The next paragraph briefly reviews the Lasso.

3.2.1. Lasso

Let $\beta = (\beta_1, \ldots, \beta_j, \ldots)^T$ be the vector of parameters to estimate (initially zero). We denote by $S_k = (I_k, l_k)$ a training image $I_k$ labeled with $l_k \in \{-1, 1\}$. $S = \{S_1, \ldots, S_N\}$ represents the set of all the training data. The Lasso loss function $\Gamma$ can be written as

$$\Gamma(\beta, \lambda) = \sum_{n=1}^{N} L(S_n, \beta) + \lambda \cdot \|\beta\|_1 \tag{1}$$

where $L(\cdot,\cdot)$ denotes an empirical loss function (originally, Lasso was introduced for regression using the L2 loss), $\|\beta\|_1 = \sum_j |\beta_j|$ is the L1 norm of the vector $\beta$ and $\lambda \ge 0$ is the parameter controlling the amount of regularization applied to the estimate. Using the L1 norm shrinks some coefficients and sets others exactly to zero, putting emphasis on the most important features. This choice of regularization (i.e. the L1 norm) is the origin of the success of the method. In fact, among the $L_p$ penalties, $p = 1$ is the smallest value for which the penalty is convex: it can lead to sparse solutions and guarantees, at the same time, that the optimization problem remains convex (assuming that the loss function $L(\cdot,\cdot)$ is convex). Furthermore, in order to get sparse solutions with an efficient shrinkage tradeoff, $\lambda$ usually takes a moderate value, since a large one may set all coefficients to exactly zero, leading to the null model. On the other hand, setting $\lambda$ to zero reduces the Lasso problem to minimizing the unregularized empirical loss. The general Lasso estimate $\hat{\beta}_\lambda$ is defined by

$$\hat{\beta}_\lambda = \arg\min_{\beta} \Gamma(\beta, \lambda) \tag{2}$$

Minimizing $\Gamma$ leads to sparse and interpretable models but it is prohibitive to solve given a very large or an infinite number of base learners.

3.2.2. BLasso

Since exact Lasso minimization is not tractable, BLasso tries to find the same solutions as Lasso with more cautious steps. It works with both forward and backward steps: forward steps are used to minimize the empirical loss, whereas backward steps minimize the regularization. At each iteration, a coordinate $\beta_j$ is selected and updated by a small step size $\pm\epsilon$ (with $\epsilon > 0$). It has been shown that it is preferable to choose a very small step size so that BLasso approximates the Lasso path closely. In practice, $\epsilon$ should always be less than 0.1. Algorithm 1 gives an overview of the BLasso mechanism. Our learning algorithm will be reviewed in more detail in Section 3.2.5. Note that a forward step consists in minimizing the current empirical loss. It changes one variable in the vector of parameters $\beta$ by adding a value $\omega = \pm\epsilon$. On the other hand, a backward step consists in finding the step (i.e. one of the previous forward steps) whose removal leads to the minimal empirical loss. That is, to each non-null coordinate $\beta_i$, add the value $-\operatorname{sign}(\omega)\,\omega$ while keeping all the other coordinates unchanged, then compute the empirical loss $\varphi_i$. After processing all non-null coordinates, find the variable that led to the minimal loss:

$$\hat{i} = \arg\min_{i} \varphi_i$$

Algorithm 1. BLasso

1. Initialization: $\beta = 0$.
   Make a forward step and initialize $\lambda$.

2. Backward and forward steps:
   Find the backward step that leads to the minimal empirical loss.
   If the step decreases the Lasso loss, then take it.
   Else make a forward step and relax $\lambda$ if necessary.

3. Repeat step 2 until $\lambda \le 0$.
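The control flow of Algorithm 1 can be sketched on a toy regression problem. The code below follows the generic BLasso scheme with a squared loss and signed steps; it is not the modified, classification-oriented variant used in this paper. The data, the function names, the tolerance `xi` and the iteration cap `max_iter` are assumptions made for the example. Coefficients are stored as integer multiples of `eps` so that a backward step exactly undoes a forward step.

```python
def squared_loss(X, y, beta):
    """Empirical loss: sum of squared residuals."""
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def blasso(X, y, eps=0.1, xi=1e-6, max_iter=10_000):
    p = len(X[0])
    steps = [0] * p  # beta_j = steps[j] * eps

    def loss_of(s):
        return squared_loss(X, y, [eps * k for k in s])

    def moved(s, j, delta):
        out = list(s)
        out[j] += delta
        return out

    def best_forward(s):
        # Try +eps and -eps on every coordinate, keep the lowest empirical loss.
        return min((moved(s, j, d) for j in range(p) for d in (1, -1)), key=loss_of)

    # Step 1: initial forward step and initial regularization parameter.
    nxt = best_forward(steps)
    lam = (loss_of(steps) - loss_of(nxt)) / eps
    steps = nxt

    for _ in range(max_iter):
        if lam <= 0:  # Step 3: stop once the regularization has vanished
            break
        current = loss_of(steps)
        # Step 2a: best backward step (one eps towards zero) among active coordinates.
        active = [j for j in range(p) if steps[j] != 0]
        back = min((moved(steps, j, -1 if steps[j] > 0 else 1) for j in active),
                   key=loss_of, default=None)
        # The L1 norm drops by eps, so the Lasso loss changes by (loss change - lam * eps).
        if back is not None and (loss_of(back) - current) - lam * eps <= -xi:
            steps = back
        else:
            # Step 2b: forced forward step, relaxing lambda if necessary.
            nxt = best_forward(steps)
            lam = min(lam, (current - loss_of(nxt) - xi) / eps)
            steps = nxt
    return [eps * k for k in steps], lam

# Orthogonal toy design in which only the first predictor matters (y = 2 * x0).
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2, -2, 2, -2]
beta, lam = blasso(X, y)
print(beta, lam)
```

On this toy problem the path climbs along the first coordinate in steps of `eps` and stops once a forward step no longer reduces the loss, so the returned vector is within one step of the least squares solution.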

3.2.3. Definition of a weak classifier

The final classifier (also called strong classifier) is a weighted sum of the weak classifiers learned during training. A weak hypothesis $h_k$ (i.e. weak classifier) represents a coordinate of base learners. Its weight $\beta_k$ is strictly positive if it was chosen at least once during the boosting process and remains zero if not. In the context of object categorization, the weak hypothesis $h_k$ is a local image signature $F_k$ with a particular description $\rho$ and an optimal radius $r_k$ [16]. $h_k$ is viewed as a hypersphere centered on the local image feature $F_k$ (cf. Fig. 1). For a test image $x$, $h_k$ outputs +1 if the distance between $x$ and $F_k$ is less than $r_k$ and −1 otherwise:

$$h_k(x) = \begin{cases} +1 & \text{if } d(x, F_k) = \min_{i} d(v_{i,\rho}, F_k) < r_k \\ -1 & \text{otherwise} \end{cases} \tag{3}$$

with $v_{i,\rho}$ being a vector signature with the description $\rho$ which belongs to the image $x$. Here, the choice of the minimal distance is arguable. On the one hand, it is a good measure because it takes into account objects detected at a very low scale (i.e. the object is only described by one or two image features). On the other hand, it is not robust in the presence of outliers, noisy data and/or when the descriptors used are not good enough. The choice of hyperspheres gives an easy interpretation to image features by comparison with a representation that uses a set of hyperplanes or decision stumps, for example. It also guarantees a fast classification since we just need to compare the distance $d(x, F_k)$ to $r_k$.
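Eq. (3) translates almost directly into code. The sketch below assumes a Euclidean distance and a single description type; `weak_hypothesis` is a name chosen for the illustration.

```python
import math

def weak_hypothesis(image_features, center, radius):
    """Eq. (3): +1 if the image's closest feature falls strictly inside the
    hypersphere (center, radius), -1 otherwise."""
    closest = min(math.dist(v, center) for v in image_features)
    return 1 if closest < radius else -1

# A test image described by two local signatures (toy 2-D data).
test_image = [(0.0, 0.0), (3.0, 4.0)]
print(weak_hypothesis(test_image, center=(3.0, 3.0), radius=1.5))
```

Only one feature of the image needs to fall inside the hypersphere for the hypothesis to fire, which is what makes the minimal distance sensitive to outliers.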

The radius $r_k$ of the hypersphere is computed such that the classification error of $F_k$ is as small as possible. In order to compute the optimal $r_k$, we need to precompute a minimal distance matrix between every image feature and the training images themselves. Let $\Omega = \{e_1, \ldots, e_E\}$ be the set of all types of descriptions used and $V_{i,\rho} = \{F_{i,\rho,k};\ 1 \le k \le M_{i,\rho}\}$ the set of the local signatures belonging to the image $I_i$ according to the description $e_\rho$. Now, for all $i \in \{1, \ldots, N\}$, $\rho \in \{1, \ldots, E\}$ and $k \in \{1, \ldots, M_{i,\rho}\}$:

Minimal distance matrix: for all $j \in \{1, \ldots, N\}$, compute the minimal distance $d_{i,\rho,k,j}$ between $F_{i,\rho,k}$ and the image $I_j$ (i.e. the closest distance to $I_j$)

$$d_{i,\rho,k,j} = \min_{1 \le k' \le M_{j,\rho}} d(F_{i,\rho,k}, F_{j,\rho,k'})$$

Sorting: let $\sigma$ be a permutation such that

$$d_{i,\rho,k,\sigma(1)} \le \cdots \le d_{i,\rho,k,\sigma(N)}$$

Consequently, images are sorted in increasing order of their distances to $F_{i,\rho,k}$.

Radius computation: select the index $\hat{\eta}$ where the sum of image labels is maximum

$$\hat{\eta} = \arg\max_{1 \le \eta \le N} \sum_{j=1}^{\eta} l_{\sigma(j)} \tag{4}$$

The hypothesis radius $r_{i,\rho,k}$ is then given by

$$r_{i,\rho,k} = \frac{d_{i,\rho,k,\sigma(\hat{\eta})} + d_{i,\rho,k,\sigma(\hat{\eta}+1)}}{2} \tag{5}$$
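The sorting and radius computation of Eqs. (4) and (5) can be sketched as follows, starting from one row of the minimal distance matrix. Two points are assumptions of this example rather than statements from the paper: ties in the cumulative sum are resolved in favor of the smallest index, and when the best index is the last image the radius falls back to the largest distance, since Eq. (5) has no next image to average with.

```python
from itertools import accumulate

def optimal_radius(distances, labels):
    """distances[j]: minimal distance from the feature to training image j.
    labels[j]: +1 or -1. Returns the radius of Eq. (5)."""
    order = sorted(range(len(distances)), key=lambda j: distances[j])  # permutation sigma
    sorted_d = [distances[j] for j in order]
    cumulative = list(accumulate(labels[j] for j in order))            # partial sums of Eq. (4)
    eta = max(range(len(cumulative)), key=lambda i: cumulative[i])     # 0-based, first maximum
    if eta + 1 < len(sorted_d):
        return (sorted_d[eta] + sorted_d[eta + 1]) / 2                 # midpoint, Eq. (5)
    return sorted_d[eta]  # assumption: no farther image to average with

# Sorted distances are 1, 2, 5, 9 with labels +1, +1, -1, +1.
print(optimal_radius([5.0, 1.0, 9.0, 2.0], [-1, 1, 1, 1]))
```

The radius lands halfway between the last image kept inside the hypersphere and the first one left outside.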

3.2.4. Building an object model with a boosting-like procedure

Since the minimization of the loss function with a very large number of base learners is hardly practical, boosting tries to find the solution with an iterative procedure. The maximum number of iterations is fixed a priori. At each iteration, a weak hypothesis is chosen such that the strong classifier converges to the optimal solution. For classification tasks, various convex loss functions have been used such as the exponential loss, logit loss, binomial deviance, etc. In our implementation, we used the exponential loss function as in AdaBoost. Let $T$ be the maximum number of iterations. At an iteration $t+1 \le T$, one minimizes

$$L_{AB}^{(t+1)}(\beta_j) = \sum_{n=1}^{N} \exp\left(-l_n \cdot F_{\beta_j}^{(t+1)}(I_n)\right) \tag{6}$$

where $F_{\beta_j}^{(t+1)}(I_n) = \beta_j \cdot h_{t+1}(I_n) + \sum_{k=1}^{t} \beta_k \cdot h_k(I_n)$ is the current ensemble of base learners. The vector of parameters is updated as

$$\beta^{(t+1)} = \beta^{(t)} + \beta_j \cdot 1_j \tag{7}$$

where $1_j$ is the $j$th standard basis vector with all 0's except for a 1 in the $j$th coordinate. After a boosting iteration, the training data are re-weighted. Initially, all image weights are set to $1/N$ ($N$ is the number of training images). Weights are updated so as to emphasize the misclassified images. The optimal solution to minimizing $L_{AB}$ is $\hat{\beta}_j$ such that

$$\hat{\beta}_j = \frac{1}{2} \log \frac{1 - E_{t+1}}{E_{t+1}} \tag{8}$$

where

$$E_{t+1} = \sum_{\substack{n=1 \\ l_n \ne h_{t+1}(I_n)}}^{N} w_n^{(t+1)} \tag{9}$$

is the weighted training error ($w_n$ is the weight of the image $I_n$). AdaBoost runs until it reaches $T$ iterations and stops earlier if $E = 0$ or $E \ge 0.5$.
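A small numerical sketch of Eqs. (8) and (9) may help. The function names are chosen for the illustration, and the guard on the error simply mirrors the early stopping conditions mentioned above.

```python
import math

def weighted_error(weights, labels, outputs):
    """Eq. (9): total weight of the images misclassified by the weak hypothesis."""
    return sum(w for w, l, h in zip(weights, labels, outputs) if l != h)

def adaboost_coefficient(error):
    """Eq. (8): steepest-descent coefficient for the exponential loss."""
    if not 0.0 < error < 0.5:
        raise ValueError("AdaBoost stops when the error is 0 or at least 0.5")
    return 0.5 * math.log((1.0 - error) / error)

weights = [0.25, 0.25, 0.25, 0.25]   # uniform initial weights, 1/N
labels = [1, 1, -1, -1]
outputs = [1, -1, -1, -1]            # the second image is misclassified
error = weighted_error(weights, labels, outputs)
print(error, adaboost_coefficient(error))
```

The smaller the weighted error, the larger the coefficient assigned to the hypothesis.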

Boosting has been interpreted as a gradient descent method. Eq. (8) gives the optimal $\beta_j$ that lets the algorithm converge as fast as possible (i.e. steepest descent). This formulation is used in AdaBoost. On the other hand, other varieties of forward stagewise additive modeling algorithms take more steps to converge but usually outperform the steepest descent in prediction. Forward Stagewise Fitting (FSF, also called $\epsilon$-boosting) is one example. It adds new coefficients to the previous set with an infinitesimal fixed step size $\epsilon > 0$. Yet it is unclear what criterion FSF optimizes. At each iteration, a coordinate is chosen and updated by $\pm\epsilon$. The fact that $\epsilon$ is very small imposes a local shrinkage on the variables. Hastie et al. [23] (Section 16.2) showed that forward stagewise can sometimes (but not always) approximate the effect of Lasso. Similar observations were made by Zhao and Yu [10]. Their simulations concluded that FSF local regularization does not converge to the Lasso path in general. They also added that FSF solutions are less sparse than Lasso solutions. To remedy this problem, they introduced the concept of backward steps which minimize the Lasso loss. In other words, they take into account the regularization term in Eq. (1). By allowing backward steps, the algorithm goes back and forth in order to optimize the trade-off between penalty and empirical loss.

For our purpose, forward steps always add a positive value. Consequently, backward steps are those that shrink the model by subtraction. This choice is justified by the fact that the distance matrix defined earlier is computed only for the features belonging to the positive examples. Therefore, a selected hypothesis represents a visual feature that contributes to building the model of the object category, and it must have a positive weight. The reason for not computing the distances between negative features and the training images is that it helps neither user interactivity (it is ambiguous to define what a negative feature is) nor genericity (a negative feature may be representative of one dataset but might not generalize to other databases). However, it is possible to learn a negative model against the object category. This will be discussed in Section 5.3.

3.2.5. The algorithm

BLasso adds a backward step to take into account the regularization term in Eq. (1). Forward steps, however, are chosen so as to minimize the empirical loss of the training samples. Both of these steps use the same loss function. In our implementation, and in order to minimize the empirical loss, we used a weighting scheme as in AdaBoost. Adopting this strategy is appealing for two reasons: first, it is much faster and, second, it benefits from the weight change. AdaBoost's resistance to overfitting has been widely observed in recent years. One of the reasons is that, at each iteration, AdaBoost gives more attention to the misclassified observations by increasing their respective weights. In order to take advantage of this principle, we decided to keep the same mechanism for selecting forward steps in BLasso. At an iteration $t$, we compute the score $s_{i,\rho,k}$ of the base learner $(F_{i,\rho,k}, r_{i,\rho,k})$ based on the image weights

$$s_{i,\rho,k} = \max_{m} \sum_{c=1}^{m} w_{\sigma(c)} \cdot l_{\sigma(c)}$$

Then, we select the base learner which obtains the highest score. In fact, when dealing with a very large or an infinite number of base learners, it is impractical or even impossible to try to minimize the loss function directly. In their seminal work, Opelt et al. [16] used a representation with infinite base learners. The radius $r_{i,\rho,k}$ of each weak hypothesis is not fixed a priori but takes into account the weight changes during AdaBoost iterations. It is computed after the algorithm selects the feature $F_{i,\rho,k}$. Eq. (4) then becomes

$$\hat{\eta} = \arg\max_{1 \le \eta \le N} \sum_{j=1}^{\eta} w_{\sigma(j)} \cdot l_{\sigma(j)} \tag{10}$$

In our implementation, since each feature $F_{i,\rho,k}$ contributes to the final model by at most one hypothesis, we averaged the different radii of a given feature which were computed during forward steps.
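The weighted score and the weighted index of Eq. (10) come from the same running sum, as the following sketch shows. The name `weighted_score` and the tie-breaking rule (smallest index wins) are assumptions of this example.

```python
from itertools import accumulate

def weighted_score(distances, labels, weights):
    """Return (score, eta): the maximum over m of the weighted label sum of the
    m closest images, and the 1-based index where it is reached (Eq. (10))."""
    order = sorted(range(len(distances)), key=lambda j: distances[j])  # permutation sigma
    partial = list(accumulate(weights[j] * labels[j] for j in order))
    best = max(range(len(partial)), key=lambda i: partial[i])          # first maximum
    return partial[best], best + 1

distances = [1.0, 2.0, 3.0, 4.0]
labels = [1, 1, -1, 1]

# Uniform weights: the sum peaks after the two closest images.
print(weighted_score(distances, labels, [0.25, 0.25, 0.25, 0.25]))
# After re-weighting towards the farthest positive image, the peak moves outwards.
print(weighted_score(distances, labels, [0.125, 0.125, 0.25, 0.5]))
```

This illustrates why the radius of a given feature can change from one forward step to the next: the image weights move the position of the maximum.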

Algorithm 2 summarizes our proposed feature selection mechanism. It has one input parameter $\xi > 0$ which is used as a tolerance level ($\xi$ should be set as small as possible).

Algorithm 2. The proposed learning algorithm

1. Initialization: set $\beta = 0$ and $t = 1$

   • Set $w_n^{(1)} = 1/N$ for $n \in \{1, \ldots, N\}$
   • Train the classifier and find the best hypothesis $h_k^{(1)}$ ($k$ is the index which corresponds to the $k$th entry of the vector $\beta$)
   • $\beta^{(1)} = \epsilon \cdot 1_k$
   • Calculate the initial regularization parameter

   $$\lambda_1 = \frac{1}{\epsilon} \left( \sum_{n=1}^{N} L(S_n, 0) - \sum_{n=1}^{N} L(S_n, \beta^{(1)}) \right) \tag{11}$$

   • Set the active index set $I_A^{(1)} = \{k\}$

2. Backward and forward steps

   Find the backward step that leads to the minimal empirical loss.

   $$\hat{j} = \arg\min_{j \in I_A^{(t)}} \sum_{n=1}^{N} L(S_n, \beta^{(t)} - \epsilon \cdot 1_j) \tag{12}$$

   This step is taken if it helps to decrease the Lasso loss. In other words, if $\Gamma(\beta^{(t)} - \epsilon \cdot 1_{\hat{j}}, \lambda_t) - \Gamma(\beta^{(t)}, \lambda_t) \le -\xi$ then

   $$\beta^{(t+1)} = \beta^{(t)} - \epsilon \cdot 1_{\hat{j}}; \quad \lambda_{t+1} = \lambda_t$$

   Otherwise, we force a forward step and relax $\lambda$ if necessary.

   • Update weights

   $$w_n^{(t+1)} = \frac{w_n^{(t)} \cdot \exp\left(-\epsilon \cdot l_n \cdot h_k^{(t)}(I_n)\right)}{\tau} \tag{13}$$

   where $\tau$ is a normalization constant such that $\sum_{n=1}^{N} w_n^{(t+1)} = 1$

   • Train the classifier and get the best hypothesis $h_k^{(t+1)}$
   • $\beta^{(t+1)} = \beta^{(t)} + \epsilon \cdot 1_k$
   • $\lambda_{t+1} = \min\left[\lambda_t, \frac{1}{\epsilon}\left(\sum_{n=1}^{N} L(S_n, \beta^{(t)}) - \sum_{n=1}^{N} L(S_n, \beta^{(t+1)}) - \xi\right)\right]$
   • $I_A^{(t+1)} = I_A^{(t)} \cup \{k\}$

3. Increase $t$ by one and repeat steps 2 and 3 until $\lambda_t \le 0$
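Two building blocks of the forward step in Algorithm 2 are easy to isolate: the weight update of Eq. (13) and the relaxation of the regularization parameter. The sketch below assumes that the weak hypothesis outputs and the empirical losses have already been computed; the function names are chosen for the illustration.

```python
import math

def reweight(weights, labels, outputs, eps):
    """Eq. (13): emphasize images misclassified by the selected hypothesis,
    then normalize so that the weights sum to one."""
    raw = [w * math.exp(-eps * l * h) for w, l, h in zip(weights, labels, outputs)]
    tau = sum(raw)  # normalization constant
    return [r / tau for r in raw]

def relax_lambda(lam, loss_before, loss_after, eps, xi):
    """Forward-step update of the regularization parameter in Algorithm 2."""
    return min(lam, (loss_before - loss_after - xi) / eps)

weights = reweight([0.25] * 4, labels=[1, 1, -1, -1], outputs=[1, -1, -1, -1], eps=0.1)
print(weights)  # the second (misclassified) image now weighs slightly more

# A forward step that lowers the loss by 1.0 with eps = 0.5 caps lambda at 2.0.
print(relax_lambda(5.0, loss_before=10.0, loss_after=9.0, eps=0.5, xi=0.0))
```

When a forward step no longer decreases the empirical loss by more than the tolerance, the relaxed value becomes non-positive and the stopping condition of step 3 is met.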

3.3. Prediction

The strong classifier may involve hypotheses originating from various types of descriptions $(e_{c_i})_{1 \le i \le G}$ with $G \le E$. It can be written as

$$H = \sum_{j} w_j h_{j,e_{c_1}} + \sum_{j} w_j h_{j,e_{c_2}} + \cdots + \sum_{j} w_j h_{j,e_{c_G}}$$

In this paper, we aim to compute the classification value of a given test image $x$ from a retrieval point of view. There are two different ways to proceed: either by means of range queries (which supposes that the database is indexed with an efficient multi-dimensional index structure) or by using an exhaustive search. In any case, we need to compute the output of every weak hypothesis $h_{j,e_{c_i}}$, which is represented by the couple $(F_{k,e_{c_i}}, r_{k,e_{c_i}})$.

In the exhaustive mode, we begin by computing the minimal distance between $x$ and $F_{k,e_{c_i}}$. Only the features that belong to $x$ and have the description $e_{c_i}$ are concerned.

$$d_j = \min_{F_{z,e_{c_i}} \in x} d(F_{z,e_{c_i}}, F_{k,e_{c_i}})$$

The output of $h_{j,e_{c_i}}$ is then given by the formulation in Eq. (3) and the classification value of $x$ is the weighted sum of all the outputs of the weak classifiers. The higher this value, the more likely the image is to contain an object.

On the other hand, when using an index structure, we proceed differently. Rather than predicting each image separately, all images in the database are predicted at the same time. We begin by initializing the prediction values of all the test images to zero. After that, for each weak hypothesis $h_{j,e_{c_i}}$, we perform a range query over the database. That is, we query the search engine to retrieve all the images that fall into the hypersphere defined by $h_{j,e_{c_i}}$. Consequently, these images are attributed the value +1 as an output of the weak classifier. Their respective prediction values, initially zero, are each incremented by the weight of this weak classifier. On the other hand, images that are not returned by the system are automatically attributed the value −1 and their prediction values are each decremented by the same weight. After looping through all the weak hypotheses, each test image ends up having a classification value.
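The index-based prediction loop can be sketched as follows. The brute-force `range_query` merely stands in for a real multi-dimensional index structure, which is not specified here; the Euclidean metric, the single description type and all names are assumptions of the example.

```python
import math

def range_query(database, center, radius):
    """Stand-in for an index: ids of the images having at least one feature
    strictly inside the hypersphere (center, radius)."""
    return {img_id for img_id, feats in database.items()
            if any(math.dist(f, center) < radius for f in feats)}

def predict_all(database, model):
    """model: list of (center, radius, weight) weak hypotheses.
    Returns the classification value of every image in the database."""
    scores = {img_id: 0.0 for img_id in database}
    for center, radius, weight in model:
        hits = range_query(database, center, radius)
        for img_id in scores:
            # Returned images receive +weight, all the others -weight.
            scores[img_id] += weight if img_id in hits else -weight
    return scores

database = {
    "a": [(0.0, 0.0)],
    "b": [(5.0, 5.0)],
    "c": [(0.0, 1.0), (5.0, 4.0)],
}
model = [((0.0, 0.0), 1.5, 0.5), ((5.0, 5.0), 1.5, 0.25)]
print(predict_all(database, model))
```

Image "c" matches both hypotheses and therefore obtains the highest value, exactly as an exhaustive evaluation of Eq. (3) on each image would give.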

4. Towards interpretable visual models

The usefulness of our method is to allow for user interactivity. After being constructed, a model can be visualized by users as small image patches. Each patch corresponds to a local image feature, more precisely the description region of an interest point. The visual representation takes into account the window size of the descriptor as well as the principal orientation of the interest point. However, one may omit rotating the patches to the orientation of interest points in order not to introduce another level of difficulty for users. Indeed, humans are not familiar with upside-down positions, and rotating images adds complexity and slows down the understanding of the objects' components. Our primary objective is to keep the local features as interpretable as possible. In addition to that, and for the representation size, we chose to normalize all the patches to a constant width. Two reasons justify this choice. First, from an ergonomic point of view, it is better for users when a software interface presents choices (i.e. object parts) with a certain regularity. Second, normalizing sizes has the same effect as a zoom-in or a zoom-out, therefore it preserves the knowledge of the scale of detection. This idea is illustrated in Fig. 2.

Object parts can be grouped together whenever they jointly form an understandable concept that can be easily described by a word or a phrase. From this perspective, an object model can have various interpretations. In other words, the components of the model are fixed, but every person may perceive them differently. This is subjective and depends on the interests of the human operator. Fig. 3 gives an example of a human-skeleton model. As we can see, the model can be viewed differently (cf. Fig. 4) according to the concepts chosen. This is what interpretability is: there are no standards fixed a priori. Moreover, the representation suggested here is beneficial for both researchers and end-users. Researchers can study what the model is composed of. Sometimes, scientists cannot explain their expertise. Take for example a biologist who is looking for the visual relations and similarities that characterize a given plant species. Given a descriptor robust enough for the task, and assuming that the opposite category in the training set is well defined, there is a good chance that this scientist will discover new, interesting visual clues. Computer-science researchers might be interested in improving a model of an object category by making it more generic for search engines or more specific for a particular purpose. End-users, on the other hand, may benefit from the model representation from another perspective. They can query the system by selecting only the most representative features, in whatever proportion they choose. For instance, assuming that there is a car model and no model for tires, a user who is looking for images containing tires can choose only the relevant patches from the car model (which is supposed to include some features representing tires). A second case is retrieving objects within a particular context. Here, users have to select some background patches to define the object's context, for example to retrieve a goose in the air, in water or on land.

Fig. 2. Normalizing the representation of the patches.

Fig. 3. Example of an object model: 112.human-skeleton category.

Fig. 4. Different interpretations of the same model. (Two charts titled "Composition of the skeleton model": one divides the model into Head, Upper body, Lower body and Background; the other into Head, Vertebral column, Ribcage, Pelvis, Limbs and Background.)

5. Experiments

We carried out our experiments on Caltech-256 [24]. It comprises a diversity of objects (natural/man-made, rigid/deformable). Some objects are photographed against a white background, others within their context. However, these objects are almost always centered in the images. Caltech-256 has the advantage of comprising many object categories. The inter-class variability is therefore very high, which makes this database well suited to our objectives: having a multitude of categories allows us to assess interpretability for various objects. Moreover, in object-based retrieval, the average precision obtained for any object category by randomly sorting all the images in the database is very low (i.e. low bias with a random prediction). We used the average precision as the criterion to evaluate our experiments. We constructed our dataset as follows: for each object category, half of the images are used for training and the other half for testing. This left 15 360 images for prediction. The split of the dataset is completely random. The training images belonging to the counter-class (negative samples) are also chosen randomly from the other categories. For each training class dataset, we kept the same number of positive and negative images. This decision is sometimes viewed as problematic since, for a small number of positive training images, the number of negative ones will not be sufficient to cover the opposite categories. In practice, however, it can be a good criterion for measuring the capacity of a given object class model to generalize and for judging its discriminating ability. In our experiments, we relied on two successful state-of-the-art image descriptors: SIFT and SURF. For each image and for each type of description, we kept a maximum of 400 image signatures, that is, a maximum of 800 signatures per image. The parameters of BLasso were tuned as follows:

$$\xi = 10^{-6} \quad \text{and} \quad \varepsilon = \frac{1}{80}$$

This choice is based on the experiments in [10]. For AdaBoost, we fixed the number of iterations to T = 500.
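To make the roles of the two BLasso parameters concrete, here is a minimal sketch of the forward/backward loop described by Zhao and Yu [10], with the step size set to 1/80 and the tolerance to 10^-6 as above. Several things are assumptions made for illustration only: the squared loss, the toy linear data, the function name `blasso`, and the choice to stop just before a forward step that would drive the regularization level to zero or below. The loss and weak learners actually used in this paper are defined in an earlier section and are not reproduced here.

```python
import numpy as np


def blasso(X, y, eps=1.0 / 80, xi=1e-6, max_iter=10000):
    """Sketch of BLasso with squared loss; returns the final coefficients."""
    n, p = X.shape
    steps = np.zeros(p, dtype=int)  # coefficients stored as multiples of eps

    def loss(k):
        residual = y - X @ (eps * k)
        return float(residual @ residual) / n

    def best_forward(k):
        # Try +eps and -eps on every coordinate; keep the lowest loss.
        best = None
        for j in range(p):
            for s in (1, -1):
                cand = k.copy()
                cand[j] += s
                cand_loss = loss(cand)
                if best is None or cand_loss < best[0]:
                    best = (cand_loss, cand)
        return best

    # The initial forward step fixes the starting regularization level.
    current = loss(steps)
    new_loss, cand = best_forward(steps)
    lam = (current - new_loss) / eps
    if lam <= 0:
        return eps * steps
    steps, current = cand, new_loss

    for _ in range(max_iter):
        # Backward candidate: shrink one active coefficient by eps.
        back = None
        for j in np.flatnonzero(steps):
            cand = steps.copy()
            cand[j] -= np.sign(steps[j])
            cand_loss = loss(cand)
            if back is None or cand_loss < back[0]:
                back = (cand_loss, cand)
        # Accept it only if the L1-penalized loss drops by more than xi.
        if back is not None and back[0] - current - lam * eps < -xi:
            current, steps = back
            continue
        # Otherwise force a forward step and relax lambda if needed.
        new_loss, cand = best_forward(steps)
        new_lam = min(lam, (current - new_loss - xi) / eps)
        if new_lam <= 0:
            break
        lam, steps, current = new_lam, cand, new_loss
    return eps * steps


# Toy data (assumption): only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3]
beta = blasso(X, y)
```

Storing the coefficients as integer multiples of the step size keeps the backward step exact: a coefficient that is shrunk back to zero really leaves the active set, which is what produces the sparse models discussed in this paper.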

5.1. BLasso versus AdaBoost

In order to evaluate the L1-regularization and to test whether it generates sparser models than standard boosting procedures, we compared our method to AdaBoost. We carried out two experiments. The first compared the global performance of each algorithm in terms of average precision and prediction time. The second examined how each algorithm behaves when the total number of features selected in the model is reduced. These results were also compared to a random selector, which builds a model from 500 hypotheses for each category. The weights of these hypotheses are all equal to 1/500 and their respective radii are defined by Eq. (5). Note that the results of the random selection are averaged over three runs.

Table 1. Comparing BLasso and AdaBoost. A random selector is given as a reference (500 hypotheses per category).

Category | AP Random | AP BLasso | AP AdaBoost | Features BLasso | Features AdaBoost
Group 1
005.baseball-glove | 0.0833 | 0.2263 | 0.1692 | 37 | 40
008.bathtub | 0.0313 | 0.0438 | 0.0373 | 23 | 500
077.french-horn | 0.1407 | 0.1628 | 0.1037 | 20 | 31
137.mars | 0.2509 | 0.3270 | 0.2520 | 12 | 108
145.motorbikes-101 | 0.5296 | 0.8068 | 0.7749 | 76 | 93
194.socks | 0.0116 | 0.0160 | 0.0088 | 20 | 134
234.tweezer | 0.0675 | 0.0808 | 0.0732 | 13 | 77
235.umbrella-101 | 0.0225 | 0.0530 | 0.0335 | 25 | 123
238.video-projector | 0.0054 | 0.0159 | 0.0071 | 4 | 106
255.tennis-shoes | 0.0083 | 0.0123 | 0.0065 | 20 | 92
Average over 93 categories | 0.0218 | 0.0400 | 0.0308 | 18.6 | 98.5
Group 2
025.cactus | 0.0499 | 0.0247 | 0.0187 | 21 | 18
037.chess-board | 0.4520 | 0.7293 | 0.4832 | 37 | 7
053.desk-globe | 0.0316 | 0.0882 | 0.0611 | 27 | 15
130.license-plate | 0.0974 | 0.4111 | 0.0940 | 27 | 7
200.stained-glass | 0.0164 | 0.1025 | 0.0980 | 39 | 13
218.tennis-racket | 0.1269 | 0.2190 | 0.0758 | 25 | 18
225.tower-pisa | 0.0396 | 0.2561 | 0.1968 | 36 | 21
237.vcr | 0.0476 | 0.0434 | 0.0374 | 29 | 24
250.zebra | 0.0438 | 0.2868 | 0.1912 | 36 | 5
252.car-side-101 | 0.0159 | 0.2751 | 0.2301 | 33 | 13
Average over 35 categories | 0.0558 | 0.1356 | 0.0922 | 30.3 | 15.2
Group 3
003.backpack | 0.0508 | 0.1294 | 0.1341 | 18 | 33
011.billiards | 0.0131 | 0.0646 | 0.0968 | 40 | 284
201.starfish-101 | 0.0138 | 0.0140 | 0.0248 | 4 | 23
232.t-shirt | 0.0767 | 0.1490 | 0.1523 | 98 | 356
Average over 113 categories | 0.0199 | 0.0317 | 0.0499 | 16.5 | 102.2
Group 4
043.coin | 0.0080 | 0.0347 | 0.0486 | 38 | 12
177.saturn | 0.5623 | 0.4741 | 0.5252 | 39 | 25
Average over 15 categories | 0.0836 | 0.0808 | 0.0989 | 27.7 | 17.9
All
Average over the 256 categories | 0.0292 | 0.0518 | 0.0516 | 19.8 | 84

5.1.1. Average precision

Table 1 gives the results obtained for 26 object categories. It also shows the average performance measured over the categories of each group as well as the average over all the object categories in the database. On the right side of the table, we see the number of hypotheses used in each object model for both algorithms. The results are divided into four groups. The first and second groups are composed of object categories where BLasso outperforms AdaBoost in AP, whereas the third and fourth groups contain object categories where AdaBoost is better. What the first and third groups have in common is that BLasso selects fewer features than AdaBoost, which is not the case for the second and fourth groups. On average, BLasso outperforms AdaBoost in AP and selects fewer features to build the model.
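For reference, the AP values in Table 1 are computed from ranked lists of test images. The sketch below uses one common definition of average precision (the mean of the precision values measured at the rank of each relevant image); the paper does not state which variant was used, so this choice and the function name `average_precision` are assumptions.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Mean of the precision values measured at the rank of each relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    # By convention here, a list without any relevant item scores 0.
    return precision_sum / hits if hits else 0.0


# Relevant images at ranks 1 and 3: AP = (1/1 + 2/3) / 2
ap = average_precision([True, False, True, False])
```

With this definition, a category whose positives are a small fraction of the test set gets a low AP under random sorting, which is why the "Random" column of Table 1 stays low for most categories.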

The behavior we seek from BLasso is illustrated by the first group: higher precision with fewer features. The second and third groups are rather ordinary, because we usually expect higher performance when the model involves a larger number of features. The first two groups also reflect the stop condition problem of AdaBoost. AdaBoost sometimes stops prematurely because of one hypothesis that fits the data well (i.e. the classification error is zero). On the other hand, it is sometimes forced to stop after reaching the limit of 500 hypotheses. The prediction time results of the third group are presented in Table 2. They highlight our gain in time compared to the loss in precision shown in Table 1. Note that the prediction method used here is exhaustive. This choice does not favor real-time interactive search applications, but it was made in order to compare the real performance of both algorithms without introducing any bias from index structures and range queries. It is worth pointing out that the prediction time is linear with respect to the model size. For the random selector mentioned earlier, it is equal to 823.7 s per category. Finally, for the fourth group (e.g. 043.coin and 177.saturn), BLasso behaves very badly compared to AdaBoost, with more features selected and a lower AP. This case is rare (only 15 categories), which explains why, on average, BLasso is better in both average precision and prediction time. Yet the precision–recall curve in Fig. 5 shows that AdaBoost has a slightly better precision than BLasso for the first images returned, after which BLasso becomes better.

Table 2. Comparing BLasso versus AdaBoost according to the prediction time.

Category | Prediction time (s), BLasso | Prediction time (s), AdaBoost
003.backpack | 25.7 | 43.4
011.billiards | 32.7 | 167.1
201.starfish-101 | 10.8 | 42.3
232.t-shirt | 159.8 | 574.0
Average | 68.4 | 282.3
Average over the 256 categories | 32.6 | 132.7

Fig. 5. Precision–recall curve (average over the 256 object categories). (Plot of precision versus recall for the random selector, AdaBoost and BLasso.)

Fig. 6. Increasing the model size. (Plot of average precision versus the average number of features per category in the model, averaged over the 256 categories, for the random selector, AdaBoost and BLasso.)

Fig. 7. Adding a counter-class: precision–recall curve. (Plot of precision versus recall, averaged over the 256 categories, with and without counter-class models.)

Table 3. AP measured in three cases: SIFT only, SURF only and the combination of both.

Category | AP SIFT | AP SURF | AP SIFT+SURF | Features SIFT | Features SURF | Features SIFT+SURF
Increase
145.motorbikes-101 | 0.6349 | 0.7061 | 0.8068 | 41 | 75 | 76
177.saturn | 0.289 | 0.0278 | 0.4741 | 34 | 1 | 39
218.tennis-racket | 0.0722 | 0.0675 | 0.219 | 9 | 24 | 25
232.t-shirt | 0.096 | 0.0899 | 0.149 | 52 | 48 | 98
251.airplanes-101 | 0.5368 | 0.5751 | 0.7218 | 35 | 45 | 56
253.faces-easy-101 | 0.8515 | 0.7857 | 0.8641 | 78 | 61 | 58
Average over 119 categories | 0.0590 | 0.0464 | 0.0816 | 16.4 | 16.1 | 22.7
Decrease
146.mountain-bike | 0.1243 | 0.0417 | 0.0511 | 32 | 29 | 15
184.sheet-music | 0.191 | 0.1776 | 0.1746 | 29 | 55 | 27
204.sunflower-101 | 0.4251 | 0.0781 | 0.2149 | 15 | 28 | 5
224.touring-bike | 0.0809 | 0.1286 | 0.0892 | 17 | 22 | 27
234.tweezer | 0.0406 | 0.101 | 0.0808 | 4 | 82 | 13
235.umbrella-101 | 0.0775 | 0.0166 | 0.053 | 19 | 11 | 25
Average over 137 categories | 0.0285 | 0.0247 | 0.0258 | 15.9 | 15.9 | 17.3
All
Average over the 256 categories | 0.0426 | 0.0348 | 0.0518 | 16.1 | 16.0 | 19.8

Table 4. Adding a counter-class model (using BLasso).

Category | Object | Object + Counter-class
Increase
044.comet | 0.049 | 0.3864
067.eyeglasses | 0.0417 | 0.2801
129.leopards-101 | 0.2464 | 0.7867
177.saturn | 0.4741 | 0.8226
182.self-propelled-lawn-mower | 0.0549 | 0.3089
234.tweezer | 0.0808 | 0.3311
Average over 181 categories | 0.0525 | 0.0877
Decrease
037.chess-board | 0.7293 | 0.3814
086.golden-gate-bridge | 0.1185 | 0.0873
154.palm-tree | 0.0736 | 0.0224
230.trilobite-101 | 0.4152 | 0.3699
253.faces-easy-101 | 0.8641 | 0.7697
256.toad | 0.0572 | 0.0242
Average over 75 categories | 0.0500 | 0.0370
All
Average over the 256 categories | 0.0518 | 0.0728


5.1.2. The effect of increasing the model size

In this experiment, we computed the performance of BLasso, AdaBoost and the random selector for different model sizes. We made several prediction runs, each with a limited number of features in the class model. At a given run k, we kept only the k hypotheses with the strongest weights. We varied k from 1 to 100 and recorded the AP results as well as the number of features $m_{ki}$ used by category i ($m_{ki} \le k$ and $1 \le i \le 256$). Note that $m_{ki}$ is equal to k when the model comprises at least k hypotheses and is strictly smaller otherwise. Then, for each k, we computed the average number of hypotheses used:

$$q_k = \frac{1}{256} \sum_{i=1}^{256} m_{ki}$$
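The truncation procedure and the average model size defined above can be sketched as follows. The layout of a hypothesis as a `(feature id, weight)` pair, the function names and the two toy models are assumptions made for this example; ranking by raw weight assumes the weights are non-negative.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, float]  # (feature id, weight); hypothetical layout


def truncate_model(model: List[Hypothesis], k: int) -> List[Hypothesis]:
    """Keep the k hypotheses with the largest weights."""
    return sorted(model, key=lambda h: h[1], reverse=True)[:k]


def average_model_size(models: Dict[str, List[Hypothesis]], k: int) -> float:
    """q_k: mean over categories of m_ki = min(k, model size of category i)."""
    sizes = [len(truncate_model(model, k)) for model in models.values()]
    return sum(sizes) / len(sizes)


# Two toy categories instead of the 256 used in the experiment.
models = {
    "cat_a": [("f1", 0.2), ("f2", 0.7), ("f3", 0.1)],
    "cat_b": [("g1", 0.4)],
}
top2 = truncate_model(models["cat_a"], 2)
q2 = average_model_size(models, 2)
```

Because sparse models run out of hypotheses early, the average size stops growing with k once k exceeds the size of most models, which is why it is used as the horizontal axis rather than k itself.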

The results are shown in Fig. 6, which gives a global overview of how each algorithm behaves. This experiment, while not tracing precisely the optimization path followed by BLasso, demonstrates the good behavior of the algorithm. We notice that at the point where the AP curve of AdaBoost becomes nearly flat after a certain number of hypotheses, BLasso stops, and with a slightly higher precision. This highlights the fact that the algorithm judges the overall model consistency after each single step: any hypothesis to be added must contribute significantly to increasing the performance.

Fig. 8. Illustration of the stop condition problem in AdaBoost with two object categories. (A-1) Model learned with BLasso. (A-2) Model learned with AdaBoost. (A) Visual patches of 137.mars category. (B-1) Model learned with BLasso. (B-2) Model learned with AdaBoost. (B) Visual patches of 037.chess-board category.

5.2. Combining multiple visual features

Using various description types is a major key to improving the prediction accuracy. Our experiments were carried out using only two types of descriptors. The primary objective here was not to reach high scores but rather to reveal the collaboration between the description entities. In this experiment, we used only BLasso. AP results are shown in Table 3. We clustered the results into two groups. The first group comprises the object classes where both descriptors combined achieve a higher AP than each single descriptor, whereas the second group is composed of classes where the combination performs worse than at least one of the descriptors. It is worth mentioning that for some object classes, SIFT performs far better than SURF and vice versa. Table 3 presents eight categories where SIFT is better and four other categories where SURF is. On average, SIFT is better than SURF and the combination of both is even better. One may notice that, on average, the total number of features selected when using the two descriptors is greater than when each descriptor operates individually. However, we consider these results satisfactory since the increase in AP (i.e. 30%) is higher than the increase in the number of features (i.e. 26%).

5.3. Adding a counter-class model

Learning what is not the object may turn out to be a useful clue to know what an object is. From this perspective, in addition to the object class model, we added a counter-class model. That is, we learned the negative class against the object class. The final classifier is the sum of both models: $H(x) = H_1(x) + H_2(x)$, where $H_1$ is the strong classifier of the object category and $H_2$ is the strong classifier of the opposite model.

$$H_1(x) = \sum_i \alpha_i h_i(x) \quad \text{and} \quad H_2(x) = \sum_j \beta_j h'_j(x)$$

Take a weak classifier $h'$ belonging to the counter-class and suppose that it is associated with the feature $F'$ and the discriminative radius $R'$. This classifier will output

$$h'(x) = \begin{cases} -1 & \text{if } d(F', x) \le R' \\ +1 & \text{otherwise} \end{cases}$$
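The combined classifier can be sketched as below. Two points are assumptions made for illustration: the distance between a feature and an image is taken to be the smallest Euclidean distance between the feature and the local signatures of the image (the paper defines this distance in an earlier section), and each weak classifier is stored as a `(weight, feature, radius)` triple. Weak classifiers of the object model vote +1 inside their radius, as in the prediction step of Section 3, while those of the counter-class model vote −1 inside theirs.

```python
import math
from typing import List, Sequence, Tuple

Signature = Sequence[float]
Weak = Tuple[float, Signature, float]  # (weight, feature, radius); hypothetical


def min_distance(feature: Signature, image: List[Signature]) -> float:
    """Assumed image-to-feature distance: closest local signature."""
    return min(math.dist(feature, signature) for signature in image)


def strong_score(image: List[Signature],
                 object_model: List[Weak],
                 counter_model: List[Weak]) -> float:
    """H(x) = H1(x) + H2(x) for one image."""
    score = 0.0
    for weight, feature, radius in object_model:
        inside = min_distance(feature, image) <= radius
        score += weight if inside else -weight   # h: +1 inside, -1 outside
    for weight, feature, radius in counter_model:
        inside = min_distance(feature, image) <= radius
        score += -weight if inside else weight   # h': -1 inside, +1 outside
    return score


object_model = [(0.5, (0.0, 1.0), 2.0)]
counter_model = [(0.25, (10.0, 10.0), 1.0)]
positive_image = [(0.0, 0.0), (3.0, 4.0)]
negative_image = [(10.0, 10.5)]
pos_score = strong_score(positive_image, object_model, counter_model)
neg_score = strong_score(negative_image, object_model, counter_model)
```

An image that contains both the object and a counter-class match would receive opposing votes, which is the situation analyzed in the four scenarios discussed in this section.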

The experiment covered all the object categories. The precision–recall curve in Fig. 7 shows a performance increase. For some object classes, the AP significantly increased or decreased. For others, it remained nearly constant. Table 4 illustrates some of the results where large changes occurred.

The changes recorded in AP may happen for various reasons. One of these is that the object does not always cover an entire test image. Sometimes it constitutes only a small region. In this case, when a counter-class model is applied to an image that contains one or more negative objects in addition to the positive object we are looking for, the weak classifiers of the counter-class model (i.e. $h'_j(\cdot)$) will output −1 because they find objects inside their discriminant radii. In other words, the presence of negative objects in the image does not necessarily mean that there are no positive ones. This behavior, although correct, raises a serious problem by neglecting the object we are looking for. Given a test image $I_t$, four possible scenarios can occur:

(a) $I_t$ contains at least one object:
    ◦ if $d(F', I_t) \le R'$ ⇒ bad classification;
    ◦ if $d(F', I_t) > R'$ ⇒ good classification;
(b) $I_t$ does not contain any objects:
    ◦ if $d(F', I_t) \le R'$ ⇒ good classification;
    ◦ if $d(F', I_t) > R'$ ⇒ bad classification.

Other reasons are discussed in the next section.

5.4. Discussion

In this section, we present some of the visual patches representing the models learned. These patches are regions extracted around the selected interest points, with a representation that takes into account the window size used in the description. Furthermore, the patches are presented in descending order of their prediction weights. We first compare the feature selection of AdaBoost and BLasso. Fig. 8A shows both models of the 137.mars class. We notice that among the large number of features selected by AdaBoost, many are repeated. Each feature nevertheless corresponds to an independent hypothesis with its own discriminant radius. This does not change the fact that these components are highly correlated, which results in a poorer prediction than the 12 features output by BLasso. In addition, this case demonstrates the failure of AdaBoost to stop earlier and prevent overfitting. On the other hand, with the 037.chess-board category (cf. Fig. 8B), AdaBoost fails to generalize because it stopped very quickly. As a result, the AdaBoost prediction performs poorly compared to BLasso. We can state that in many cases BLasso captures the most interesting and reliable features of the object class. Although some features selected by BLasso look very similar to us, they differ slightly by a geometric transformation. The question raised here thus relates to the strength of the descriptor rather than to the learning algorithm.

Fig. 9. Visual patches of 224.touring-bike category. (a) Model learned with SIFT descriptor. (b) Model learned with SURF descriptor. (c) Model learned with SIFT and SURF descriptors.

Learning object categories and image features is definitely descriptor-dependent. Theoretically, adding more descriptions should give better results. This is true in general, as revealed by the average over all the object categories (cf. Table 3). However, this was not the case for all the object classes. Some classes underwent a large reduction in model size (e.g. 146.mountain-bike and 204.sunflower-101), which corresponds to a 20% loss in AP versus a 61% gain in reducing the number of features. For others, there is no clear answer: quantitatively, the numbers do not reveal any improvement, and qualitatively, interpreting an increase or a decrease in descriptor performance by relying on visual patches is not obvious, especially when these patches are parts of the object category (cf. Fig. 9). Expanding the model visually may however turn out to be a useful clue for interpreting the effect of the counter-class features on prediction.

Fig. 10. Illustration of the case where adding a counter-class improves prediction. Visual patches of 067.eyeglasses and 177.saturn categories presented with their respective counter-classes. (A-1) Positive model. (A-2) Counter-class model. (A) 067.eyeglasses category. (B-1) Positive model. (B-2) Counter-class model. (B) 177.saturn category.

Defining the opposite category for learning a visual concept is by no means trivial. We may wonder about the opposite of a motorbike or the opposite of the American flag. The choice of negative examples must certainly take into account the composition of the test database. In addition, this choice must be as varied as possible in order to cover all possibilities. Yet this is statistical learning, and the distribution of the training data may differ from that of the data to be predicted. To illustrate the idea, we selected four categories, two of which showed a remarkable improvement in prediction results. Fig. 10 presents the two successful categories: 067.eyeglasses and 177.saturn. For these categories, we notice a high contrast between the background and the parts of the object. This creates a high edge response. It follows that the corresponding signatures are very distinctive compared with the signatures of the counter-class. In addition, the shapes included in the opposite model are easily distinguishable from those belonging to the category, and we believe that this explains the increase in performance. On the other hand, the positive and negative models of the 086.golden-gate-bridge category have some similarity (cf. Fig. 11). Some patches indeed look homogeneous and almost uniform (i.e. the gradient norms are very small). This made it more difficult to discern between what the object is and what it is not. Finally, the last example we present is the model of the 150.octopus category. Here again we notice that negatives bear a resemblance to positives, since the descriptors used are robust to the negative transformation, i.e. color inversion (cf. Fig. 11 B-3). This example of 150.octopus is a failure case where the negative model overfitted the training data. That is, the negative images, which were randomly chosen, formed a small set (not large enough to cover variability) and were in a sense "similar" to the object category. Thus, the learning algorithm concentrated on discerning between these examples rather than on generalizing.

Fig. 11. Illustration of the case where adding a counter-class decreases results. Visual patches of 086.golden-gate-bridge and 150.octopus categories presented with their respective counter-classes. (A-1) Positive model. (A-2) Counter-class model. (A) 086.golden-gate-bridge category. (B-1) Positive model. (B-2) Counter-class model. (B-3) Inverted colors of some patches belonging to the counter-class model. (B) 150.octopus category.

6. Conclusions

We presented a new supervised learning technique for object categorization and retrieval. The algorithm is a boosting-like procedure which includes more cautious steps. Unlike standard boosting, which is considered to be oblivious, this method makes a backward step after each forward iteration. This constrains the loss function and outputs sparser object models. Sparsity is appreciated in computer vision, particularly for generic object categories. It highlights the important characteristics of an object by attributing higher weights to the corresponding features and setting the other features to zero. Thus, sparsity allows us to better understand what exactly the computer learns. It is a major key to interpretability. This paper discussed the use of the BLasso algorithm for feature selection. In our experiments, we compared the performance of BLasso against AdaBoost on the Caltech-256 dataset. Results show that, on average, BLasso outperforms AdaBoost. The reason is that the stop condition for AdaBoost is unclear and problematic. Choosing a large number of hypotheses normally leads to a better precision score, but this is not always the case. Not only does prediction become prohibitive in terms of computation time, but precision may also decrease due to overfitting. BLasso resolved this trade-off by adding an extra constraint to stop efficiently after learning the variability of the class, leading to more interpretable models. Experimental results clearly demonstrated this fact and showed that it is possible to explain results by visualizing models.

In future work, we will carry out more in-depth tests on the models generated by BLasso. We will use various descriptions and make use of sparsity to give interpretations and improve the overall efficiency. We also intend to encourage user interactivity by creating a computer interface which allows users to build their own models from the visual patches displayed. These patches will faithfully reproduce their related type of description. Furthermore, other regularization methods, such as the group Lasso, may be used as alternatives to BLasso to test their performance. This will allow us to focus on some image regions more than others, but it requires additional annotations.

References

[1] L.G. Roberts, Machine Perception of Three-Dimensional Solids, Outstanding Dissertations in the Computer Sciences, Garland Publishing, New York, 1963.

[2] K. Grauman, T. Darrell, The pyramid match kernel: discriminative classification with sets of image features, in: International Conference on Computer Vision, 2005, pp. 1458–1465.

[3] G. Wang, Y. Zhang, L. Fei-Fei, Using dependent regions for object categorization in a generative framework, in: CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA, 2006, pp. 1597–1604. doi:http://dx.doi.org/10.1109/CVPR.2006.324.

[4] G.J. Agin, T.O. Binford, Computer description of curved objects, in: IJCAI '73: Proceedings of the 3rd International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1973, pp. 629–640.

[5] M.A. Fischler, R.A. Elschlager, The representation and matching of pictorial structures, IEEE Transactions on Computers 22 (1) (1973) 67–92. doi:http://dx.doi.org/10.1109/T-C.1973.223602.

[6] T.A. Cass, Polynomial-time geometric matching for object recognition, International Journal of Computer Vision 21 (1–2) (1997) 37–61. doi:http://dx.doi.org/10.1023/A:1007971405872.

[7] M.J. Tarr, H.H. Bülthoff, Is human object recognition better described by geon-structural-descriptions or by multiple-views? 1995.

[8] J. Ponce, T.L. Berg, M. Everingham, D.A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B.C. Russell, A. Torralba, C.K.I. Williams, J. Zhang, A. Zisserman, Dataset issues in object recognition, in: Toward Category-Level Object Recognition, Lecture Notes in Computer Science, vol. 4170, Springer, 2006, pp. 29–48.

[9] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society Series B 58 (1) (1996) 267–288. URL: http://www.ams.org/mathscinet-getitem?mr=1379242.

[10] P. Zhao, B. Yu, Stagewise Lasso, Journal of Machine Learning Research 8 (2007) 2701–2726.

[11] R.E. Schapire, The strength of weak learnability, Machine Learning 5 (2) (1990) 197–227. doi:http://dx.doi.org/10.1023/A:1022648800760.

[12] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119–139.

[13] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Annals of Statistics 28. URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.3515.

[14] Y. Singer, Leveraged vector machines, in: Advances in Neural Information Processing Systems, vol. 12, MIT Press, 2000, pp. 610–616.

[15] K. Tieu, P. Viola, Boosting image retrieval, International Journal of Computer Vision 56 (1–2) (2004) 17–36.

[16] A. Opelt, A. Pinz, M. Fussenegger, P. Auer, Generic object recognition with boosting, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (3) (2006) 416–431. doi:http://dx.doi.org/10.1109/TPAMI.2006.54.

[17] G. Rätsch, T. Onoda, K.-R. Müller, Soft margins for AdaBoost, Machine Learning 42 (3) (2001) 287–320. doi:http://dx.doi.org/10.1023/A:1007618119488.

[18] A.J. Grove, D. Schuurmans, Boosting in the limit: maximizing the margin of learned ensembles, in: Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998, pp. 692–699.

[19] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Annals of Statistics 32 (2004) 407–499.

[20] M.R. Osborne, B. Presnell, B.A. Turlach, On the Lasso and its dual, Journal of Computational and Graphical Statistics 9 (2) (2000) 319–337. URL: http://www.jstor.org/stable/1390657.

[21] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA, 2006, pp. 2169–2178. doi:http://dx.doi.org/10.1109/CVPR.2006.68.

[22] K. Grauman, T. Darrell, The pyramid match kernel: efficient learning with sets of features, Journal of Machine Learning Research 8 (2007) 725–760.

[23] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed., Springer, 2009. URL: http://www-stat.stanford.edu/~tibs/ElemStatLearn/.

[24] G. Griffin, A. Holub, P. Perona, Caltech-256 Object Category Dataset, Technical Report 7694, California Institute of Technology, 2007.

Ahmed Rebai received his PhD in computer science from the University of Paris-Sud 11, France, in 2011. He previously obtained a master's degree and an engineering degree in telecommunications from the Higher School of Communication of Tunis, Tunisia, in 2007. His research interests include image analysis, content-based image retrieval and object recognition.

Alexis Joly is a permanent research scientist at INRIA Rocquencourt in France. His topics of interest include content-based image and video retrieval, visual object mining and large-scale similarity search. He received his engineering degree in telecommunications from the National Institute of Applied Sciences (INSA Lyon, France) in 2001 and his PhD degree in computer science from the University of La Rochelle (France) in 2005. During his PhD, he collaborated with the French National Audiovisual Institute (INA) and developed a patented TV monitoring system working on huge datasets. In 2005, he worked as a visiting researcher at the National Institute of Informatics in Tokyo and then joined the IMEDIA team at INRIA Rocquencourt. He was involved in numerous European initiatives (MUSCLE NoE, VITALAS IP, TRENDS STREP and CHORUS CA) as well as national projects covering different application areas such as audio-visual archives, photo stock agencies and biodiversity. In 2007 and 2008, he co-organized the CIVR and TRECVID video copy detection evaluation campaigns, which were the first international events related to this topic.

Nozha Boujemaa is Director of the INRIA Saclay Ile-de-France research center. She obtained her PhD degree in Mathematics and Computer Science in 1993 (Paris V) and her "Habilitation à Diriger des Recherches" in Computer Science in 2000 (University of Versailles). She previously graduated with a master's degree with honors from the University of Tunis. Her topics of interest include multimedia content search, image analysis, pattern recognition and machine learning. Her research activities are leading to the next generation of multimedia search engines and affect several application domains such as audio-visual archives, internet, security, biodiversity, etc. Prof. Boujemaa has authored more than 100 international journal and conference papers. She has served on numerous scientific program committees of international conferences (WWW Multimedia Track, ACM Multimedia/MIR, ICPR, IEEE ICIP, IEEE Fuzzy Systems, IEEE ICME, CIVR, CBMI, RIAO, etc.) in the area of visual information retrieval and pattern recognition.

