
Adaptable Similarity Search using Non-Relevant Information

Ashwin T.V. [email protected]

Rahul Gupta [email protected]

Sugata Ghosal [email protected]

IBM India Research Lab, Hauz Khas, New Delhi 110016, India

Abstract

Many modern database applications require content-based similarity search capability in numeric attribute space. Further, users' notion of similarity varies between search sessions. Therefore online techniques for adaptively refining the similarity metric based on relevance feedback from the user are necessary. Existing methods use retrieved items marked relevant by the user to refine the similarity metric, without taking into account the information about non-relevant (or unsatisfactory) items. Consequently, items in the database close to non-relevant ones continue to be retrieved in further iterations. In this paper a robust technique is proposed to incorporate non-relevant information to efficiently discover the feasible search region. A decision surface is determined to split the attribute space into relevant and non-relevant regions. The decision surface is composed of hyperplanes, each of which is normal to the minimum distance vector from a non-relevant point to the convex hull of the relevant points. A similarity metric, estimated using the relevant objects, is used to rank and retrieve database objects in the relevant region. Experiments on simulated and benchmark datasets demonstrate robustness and superior performance of the proposed technique over existing adaptive similarity search techniques.

Keywords: Information retrieval, Database browsing, Database navigation, Ellipsoid query processing, Relevance feedback, Non-relevant judgement

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002

1 Introduction

In many modern database applications, it is necessary to be able to pose queries in terms of similarity of data objects rather than relational operations like equality or inequality. Examples include finding sets of stocks that behave in approximately the same (or for that matter opposite) way in a temporal database [16]; searching for structurally similar proteins from a spatial database which can cause a particular medical condition [13]; retrieval of 3D objects from a CAD database [2]; and finding similar objects based on content from multimedia databases containing audio, image or video [11]. Also, several approximation schemes have been developed to efficiently process similarity queries using multidimensional indexing structures [19, 1, 3]. In this paper, we focus on accuracy improvement of similarity based retrieval of database objects using relevance feedback.

To support similarity based modern database applications, multidimensional attribute or feature vectors are extracted from the original object and stored in the database. Given a query object (associated with an attribute vector), database objects whose attribute vectors are most similar to the query vector are retrieved for the user. Usually the k top matches are retrieved. For text applications, the vector space model with the cosine similarity metric is widely used [17], [4]. The cosine similarity is defined as

$$S(\vec{x}, \vec{q}) = \frac{\vec{x} \cdot \vec{q}}{\|\vec{x}\|\,\|\vec{q}\|},$$

where $\vec{u} \cdot \vec{v}$ stands for the inner product of $\vec{u}$ and $\vec{v}$, and $\|\vec{v}\|$ denotes the magnitude of $\vec{v}$. On the other hand, in the metric space model of information retrieval, second-order (L2) distance metrics are typically used. The second-order L2 norm is a quadratic distance metric and is defined as

$$D(\vec{x}, \vec{q}, Q) = (\vec{x} - \vec{q})^T Q\,(\vec{x} - \vec{q}).$$

Imposing constraints on the structure of the matrix Q, metrics like Euclidean (Q is the identity matrix), weighted Euclidean (diagonal Q) and generalized Euclidean (symmetric Q with off-diagonal entries) are obtained. Examples of application of quadratic distance metrics include the Euclidean metric in discrete wavelet transform (DWT) based attribute space for time-series stock data [16], color histogram space for color image databases [11], and Fourier descriptors (FD) for shape databases [2]; the weighted Euclidean metric for multimedia object retrieval [18]; and the generalized Euclidean distance metric for spatial databases [10]. Applicability of L2 norms requires that the users' information need be modeled by a compact convex set in the feature space.
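As an illustration of how the structure of Q selects the metric family, here is a minimal Python sketch (ours, not from the paper; the vectors and matrices are made-up 3-D examples):

```python
import numpy as np

def quadratic_distance(x, q, Q):
    """D(x, q, Q) = (x - q)^T Q (x - q)."""
    diff = x - q
    return diff @ Q @ diff

x = np.array([1.0, 2.0, 3.0])
q = np.array([0.0, 2.0, 1.0])
Q_euclidean = np.eye(3)                      # Euclidean: Q is the identity
Q_weighted  = np.diag([2.0, 0.5, 1.0])       # weighted Euclidean: diagonal Q
Q_general   = np.array([[2.0, 0.3, 0.0],     # generalized Euclidean: symmetric Q
                        [0.3, 1.0, 0.1],     # with off-diagonal entries
                        [0.0, 0.1, 0.5]])
for Q in (Q_euclidean, Q_weighted, Q_general):
    print(quadratic_distance(x, q, Q))
```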

The query by example paradigm is typically used for similarity based search and retrieval from databases. In many emerging database applications, the notion of similarity cannot be predetermined, and needs to vary across search sessions to satisfy the information need of the user (it is not likely that a user would be able or willing to supply the similarity metric). For example, in an online car shopping scenario, a buyer starting with an initial query (e.g. Jeep Cherokee) may be interested in a "weekend getaway vehicle" (based on cargo capacity, wheelbase, and torque considerations), whereas another buyer starting with the same initial query may instead be looking for an inexpensive family vehicle (based on engine size, weight, and price considerations). In these scenarios the retrieval process proceeds as follows. The system presents a set of objects from the database to the user based on an initial similarity metric. The user expresses his liking or disliking of the retrieved set of objects. Based on this relevance feedback from the user, the similarity metric is adapted, and a new set of objects that are likely to be more relevant to the user is retrieved. This process continues until the user is satisfied. Most of the existing systems [18, 10] use only relevant objects to refine the similarity metric. Since information about the non-relevant objects is not used, other database objects close to the non-relevant objects are also typically retrieved.

In this paper we propose a novel means of incorporating non-relevant judgements to improve the performance of similarity based retrieval. Non-relevant information is not used for ranking similar items in our approach. We instead use non-relevant information to define a feasible region in the feature space. Relevant objects are used to estimate the parameters of the similarity metric. Similar objects are ranked and retrieved from within the feasible region only.

The remainder of this paper is organized as follows. Existing work that attempted to incorporate non-relevant judgements in similarity retrieval is presented in Section 1.1. A mathematical formulation of our approach is presented in Section 2. The proposed solution and retrieval algorithm are presented in Section 3. The performance of the proposed approach is experimentally demonstrated using simulated and benchmark datasets in Section 4. Finally, we summarize our contributions and mention some future directions for research in Section 5.

1.1 Related Work

We now analyze approaches considered in the past to incorporate non-relevant information. Relevance feedback was first introduced by Rocchio in the context of incremental (iterative) text categorization using keywords [17]. Each document is associated with an attribute vector, each component (term) of which represents the normalized frequency of occurrence of the corresponding keyword. Rocchio presented the following formula to compute the new query:

$$\vec{q}_{new} = \alpha\,\vec{q}_{curr} + \frac{\beta}{|G|} \sum_{\vec{x}_i \in G} \vec{x}_i - \frac{\gamma}{|B|} \sum_{\vec{y}_i \in B} \vec{y}_i \qquad (1)$$

where $\vec{q}_{curr}$ is the current query vector, G is the set of relevant objects and B the set of non-relevant objects. Values for α, β and γ are empirically chosen for the document set. Documents in the database are ranked based on their cosine distances from the estimated query $\vec{q}_{new}$. Finding parameter values that give good overall performance for a document set requires careful fine tuning. Even in the case of a 2-D feature space, it is easy to construct cases where, for a fixed set of parameters, a large number of non-relevant judgements move the query vector toward the non-relevant region. Singhal et al. [5] propose a dynamic query zoning scheme for learning queries in Rocchio's framework by using a restricted set of non-relevant documents in place of the entire set of non-relevant documents. In a recent study [6], Dunlop concludes that with Rocchio's formula using non-relevance feedback, the results show behaviors that can hardly be justified and vary widely. Hence recent approaches have chosen to ignore non-relevant documents [12]. Even though Rocchio's formula is primarily designed for use in the context of vector space models, it is straightforward to extend the formula to metric space models [10]. Problems with unpredictable performance similar to the vector space model still persist.
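For concreteness, a minimal sketch of the update in Equation 1 (our illustration; the default weight values shown are arbitrary and, as noted above, must in practice be tuned per document set):

```python
import numpy as np

def rocchio_update(q_curr, G, B, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio's query update (Equation 1). G and B are arrays whose rows are
    the relevant and non-relevant document vectors."""
    q_new = alpha * q_curr
    if len(G):
        q_new = q_new + beta * G.mean(axis=0)    # (beta / |G|) * sum over G
    if len(B):
        q_new = q_new - gamma * B.mean(axis=0)   # (gamma / |B|) * sum over B
    return q_new
```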

Nastar et al. [15] use two non-parametric density estimates to model the probability distribution for the relevant and non-relevant classes. The relevance score of a given database object is defined as

$$RelScore(\vec{x}) = \frac{Prob(\vec{x} \in relevant)}{Prob(\vec{x} \in nonrelevant)}. \qquad (2)$$

The intuition behind this formula is that points in feature space either having a large probability of being relevant or a small probability of being non-relevant should receive high relevance scores. It is easy to see that even though the probabilities obtained through maximum likelihood estimates are well behaved, the relevance score, being a ratio, is not guaranteed to give consistent results. For example, consider a point in the feature space with a small estimated probability of being relevant and with nearly zero estimated probability of being non-relevant; this point will receive a very high relevance score which may far exceed that of a point whose probability of being relevant is high (≈ 1) and probability of being non-relevant is a small non-zero value.

A similar heuristic has also been proposed by Brunelli et al. [7], who suggest the following distance function:

$$D(\vec{x}, G, B) = D(\vec{x}, G)\left[\frac{D(\vec{x}, G)}{D(\vec{x}, B)}\right] \qquad (3)$$

$$D(\vec{x}, S) = \sum_{\vec{y} \in S} \|\vec{x} - \vec{y}\|^2. \qquad (4)$$

The database points that are either close to relevant objects or far from non-relevant objects should have a small net distance D. Since the distances D are obtained through summation over all relevant or non-relevant objects, the net distance is sensitive to the sizes of the relevant and non-relevant sets. Also, the resulting distance function D is not well behaved and can lead to unpredictable results.

Our key contribution in this paper is the use of non-relevant judgements to delineate the relevant region in the feature space, ensuring that the restricted search space does not contain any non-relevant objects. Relevant judgements are used to estimate a similarity metric which is then used to rank and retrieve database objects in the relevant region. In the machine learning literature, decision trees [8] are routinely used to achieve a partitioning of the feature space. Inducing a decision tree on the relevant and non-relevant objects may result in multiple disconnected relevant regions. Disconnected relevant regions are clearly incompatible with ranking using a single quadratic distance metric. One can envision a technique wherein a similarity metric is estimated independently for each relevant region. However, it would be difficult to obtain robust estimates for the similarity metrics given the small number of relevant judgements in each of the relevant regions. Decision trees are also known to perform poorly with small training sets.

Support Vector Machines (SVMs) are popularly used to solve 2-class classification problems [9]. SVMs transform the original feature space using a kernel (usually a Gaussian kernel) and estimate an optimal hyperplane to separate the two classes in the new feature space. The distance from the hyperplane is used as a measure of belonging to a class. The local nature of the mapping results in the ranking function (in the original feature space) having bumps at relevant objects and being relatively flat elsewhere; to attenuate this effect a large and representative training set is essential.

To overcome these issues, the partitioning of the feature space in the proposed algorithm is achieved by using a piecewise linear decision surface that separates the relevant and non-relevant objects. Each of the hyperplanes constituting the decision surface is normal to the minimum distance vector from a non-relevant point to the convex hull of the relevant points. Our algorithm robustly estimates the hyperplanes that constitute the decision surface even when the size of feedback is small. The relevant region is obtained as the result of intersection of half spaces and hence forms a convex subset of the feature space. This ensures that we can use any similarity metric to rank and retrieve database objects inside the relevant region. Since the estimated relevant region is convex and the quadratic distance metric is a convex function on the feature space, we are ensured that there are no 'pockets' in the feature space where unpredictable relevance scores are possible.

Table 1: Symbol usage

  Symbol               Definition
  D                    The database
  n_d                  Dimensionality of the database
  G                    Relevant set retrieved in the current iteration
  ~v                   Goodness scores assigned by user to relevant objects
                       (1 if user only identifies relevant objects)
  B                    Non-relevant set retrieved in the current iteration
  k                    Number of data objects retrieved per iteration
  ~q_curr, (~q_opt)    Query center for the current iteration, optimal
                       solution with current set of feedback
  Q_curr, (Q_opt)      Inverse covariance matrix for the current iteration,
                       optimal solution with current set of feedback
  d(., ., .)           Generalized Euclidean distance function

We experimentally demonstrate the effectiveness of the proposed algorithm in improving precision and recall. In over 50% of the experiments there is a significant performance improvement over using only relevant objects. In the rest of the experiments the improvement was not visible because the relevant sets were compact compared to the distribution of non-relevant points in their neighborhood. In a small fraction of the experiments, there is an inconsequential (around 0.05) performance degradation due to the approximation of the feasible region using piecewise linear surfaces.

2 Problem formulation

At each iteration of relevance feedback the judgements provided by the user constitute a relevant set G and a non-relevant set B. If the user provides different degrees of desirability for the relevant objects, then this information is available as a vector of goodness scores ~v. If the user only marks relevant objects then the goodness scores for all relevant objects are set to 1. All objects seen by the user and marked neither relevant nor non-relevant are not considered for further computation.

Let $\vec{x}$ and $\vec{q}$ represent the feature vectors corresponding to a database object and the estimated center of the query (the symbols used are defined in Table 1). Then a quadratic distance function to measure the distance of $\vec{x}$ from $\vec{q}$ is defined as

$$d(\vec{x}, \vec{q}, Q) = (\vec{x} - \vec{q})^T Q\,(\vec{x} - \vec{q}) \qquad (5)$$

MindReader [10] estimates the parameters $(\vec{q}, Q)$ to minimize the total distance of objects in the relevant set G. This can be written as

$$\min_{\vec{q}, Q} \sum_{\vec{x}_i \in G} v_i\, d(\vec{x}_i, \vec{q}, Q) \qquad (6)$$

$$\text{Subject to: } \det(Q) = 1 \qquad (7)$$

Consider the ellipsoid E defined by the estimated parameters $(\vec{q}_{opt}, Q_{opt})$ and radius equal to the largest value of the $d_{opt}$ distance for relevant objects. k items now need to be presented to the user to obtain the next set of relevance judgements. The items to be presented are obtained by expanding E till k database items are enclosed inside the ellipsoid. However, during this expansion, the exclusion of non-relevant objects is not guaranteed.

To ensure that non-relevant objects are not retrieved, we formulate a new optimization problem with additional constraints as follows (abbreviating $d(\vec{x}, \vec{q}, Q)$ as $d(\vec{x})$):

$$\min_{\vec{q}, Q, c} \sum_{\vec{x}_i \in G} v_i\, d(\vec{x}_i) \qquad (8)$$

Subject to:

$$\forall \vec{x} \in G,\; d(\vec{x}) \le c \qquad (9)$$
$$\forall \vec{x} \in B,\; d(\vec{x}) > c \qquad (10)$$
$$|\{\vec{x} : \vec{x} \in D,\; d(\vec{x}) \le c\}| \ge k \qquad (11)$$
$$c > 0 \qquad (12)$$
$$\det(Q) = 1 \qquad (13)$$

Let the optimal solution be $(\vec{q}_{opt}, Q_{opt})$ and the optimal distance function be $d_{opt}$. Consider the ellipsoid E defined by $(\vec{q}_{opt}, Q_{opt})$ with a radius corresponding to the largest value of $d_{opt}$ for relevant objects (we can also use $c_{opt}$ as the radius for E). Equations 9 and 10 ensure that E partitions the feature space with all relevant points inside and non-relevant points outside. Equation 11 ensures that E captures enough objects to present to the user, and along with Equation 10 this also ensures that no non-relevant objects are shown to the user. The minimization of distances of relevant objects ensures that relevant objects are ranked higher than other items in the database.

Whereas the above formulation is sufficient to utilize non-relevant information effectively, a straightforward solution of Equation 8 is difficult to obtain. The formulation is a quadratic optimization problem with quadratic constraints. Considering the quadratic nature of the constraints, it is likely that the feasible region for the parameters $(\vec{q}_{opt}, Q_{opt})$ is not convex, leading to numerous local minima. Also, constraint Equation 11 involves items from the database. When this constraint is expanded, we get an additional constraint for each database item. This makes the problem very expensive to solve in the current form.

To decrease the computation required, we simplify the above formulation as follows. The problem is first split into two independent subproblems:

Subproblem 1: Find a decision boundary that separates the relevant from non-relevant objects. The boundary should be sufficiently close to the non-relevant objects to maximize the size of the relevant region.

Subproblem 2: Find a distance function that minimizes the total distance of relevant objects.

We approximate the decision boundary by a piecewise linear surface. This reduces computation time and also allows the convexity constraint for the relevant region to be easily incorporated. The second subproblem is the same as MindReader's formulation and hence their results hold in this case.

It is easy to see that the convex hull (CH) of the relevant points is one of the many possible piecewise linear surfaces that satisfy Equations 9 and 10. To satisfy Equation 11 we need to "expand" the convex hull so as to obtain k database items inside. Note that during expansion we must also ensure that Equations 9 and 10 hold for the new surface. This process is not efficient since items from the database need to be accessed in each expansion step.

2.1 Proposed solution

Rather than using an incremental scheme to refine the decision surface, we create a decision surface that maximizes the size of the relevant region. For each non-relevant point $\vec{b}$ we create a hyperplane H normal to the shortest distance vector from $\vec{b}$ to CH. H is positioned a small distance (ε > 0) from $\vec{b}$. Figure 1 illustrates an example. The positive halfspace of each such H contains the relevant objects and the negative halfspace contains the non-relevant example. The relevant region R is obtained as the intersection of the positive halfspaces obtained for all non-relevant objects. A distance metric $d_{new}$ estimated using only the relevant objects is then used to present the top k of those database items that belong to R. Actual computation of CH is not necessary; the point in CH closest to $\vec{b}$ is obtained by solving an optimization problem (Equation 14).

[Figure 1: Illustrates feature space partitioning using one non-relevant object: a separating hyperplane, placed a distance ε > 0 from the non-relevant object and normal to its shortest distance vector to the convex hull of the relevant objects, splits the space into relevant and non-relevant regions. See Section 4.3.1 and Figure 8 for a realistic example.]

Note that in some cases R may not satisfy Equation 11, i.e., there are fewer than k database items in R. Since this result is the best possible with a piecewise linear surface, a set of size less than k can be presented to the user; in our experiments we obtain the remaining objects by picking top ranked objects from the non-relevant partition.

3 Proposed Algorithm

In this section we describe the proposed method. As described in Section 2.1, the proposed algorithm has two independent steps. The first step is to obtain a surface that separates the relevant from non-relevant examples, thereby partitioning the feature space into relevant and non-relevant regions; this step is detailed in Section 3.1. The second step involves using relevant examples to estimate a distance metric; this step is detailed in Section 3.2. The final step involves ranking and retrieval of items from the database to present to the user and is detailed in Section 3.3.

3.1 Partitioning the feature space

The proposed method separates the relevant and non-relevant objects with a piecewise linear surface. Each of the non-relevant objects is used to create a hyperplane as described in Section 2 and illustrated in Figure 1. For each non-relevant point $\vec{b}_i \in B$, the closest point $\vec{p}_i$ in the convex hull CH of the relevant points is computed as follows. Let $G = [\vec{g}_1, \ldots, \vec{g}_{|G|}]$, where the columns of G represent the feature vectors of the relevant objects in G. The vector $\vec{p}_i$ can be written as a linear combination of the relevant points as $\vec{p}_i = G\vec{\lambda}$, where $\vec{\lambda} = [\lambda_1, \ldots, \lambda_{|G|}]^T$ and $\sum_j \lambda_j = 1$. The computation of $\vec{p}_i$ can hence be formulated as

$$\min_{\vec{\lambda}} |G\vec{\lambda} - \vec{b}_i|^2 \qquad (14)$$

subject to

$$\sum_{j=1}^{|G|} \lambda_j = 1 \qquad (15)$$
$$\forall j,\; \lambda_j \ge 0 \qquad (16)$$

Equation 14 is a convex quadratic problem with linear constraints. We use the reduced gradient method outlined in Appendix A to obtain the optimal value of $\vec{\lambda}$. The closest point on the hull $\vec{p}_i$ can now be obtained. The corresponding hyperplane $H_i$ can be represented as in Equation 17.

$$H_i = \{\, \vec{x} : (\vec{p}_i - \vec{b}_i) \cdot (\vec{x} - \vec{b}_i) = \varepsilon \,\} \qquad (17)$$

$$\text{where } \vec{a} \cdot \vec{b} = \frac{\vec{a}^T \vec{b}}{|\vec{a}|\,|\vec{b}|} \qquad (18)$$

$$H_i^+ = \{\, \vec{x} : (\vec{p}_i - \vec{b}_i) \cdot (\vec{x} - \vec{b}_i) > \varepsilon \,\} \qquad (19)$$

$$H_i^- = \{\, \vec{x} : (\vec{p}_i - \vec{b}_i) \cdot (\vec{x} - \vec{b}_i) < \varepsilon \,\} \qquad (20)$$

Here, ε is a small positive constant. In cases where the non-relevant point $\vec{b}_i$ lies inside CH, the closest point $\vec{p}_i$ will equal $\vec{b}_i$; no hyperplanes are constructed for such cases. Each hyperplane $H_i$ partitions the feature space into a positive halfspace ($H_i^+$, Equation 19) containing the relevant objects and a negative halfspace ($H_i^-$, Equation 20) containing the non-relevant point $\vec{b}_i$. The intersection of the positive halfspaces $H_i^+$ defines the relevant region and the union of the negative halfspaces $H_i^-$ defines the non-relevant region.
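A minimal Python sketch of this partitioning step follows. It is not the authors' implementation: it solves the quadratic program (14)-(16) with SciPy's generic SLSQP solver instead of the reduced gradient method of Appendix A, and the function names (`closest_point_on_hull`, `build_hyperplanes`) are ours.

```python
import numpy as np
from scipy.optimize import minimize

def closest_point_on_hull(G, b):
    """Closest point to b in the convex hull of the columns of G (Eqns. 14-16)."""
    m = G.shape[1]
    lam0 = np.full(m, 1.0 / m)                     # centroid: a feasible start
    res = minimize(lambda lam: np.sum((G @ lam - b) ** 2), lam0,
                   method="SLSQP",
                   bounds=[(0.0, None)] * m,        # lambda_j >= 0     (Eqn. 16)
                   constraints=[{"type": "eq",
                                 "fun": lambda lam: lam.sum() - 1.0}])  # (Eqn. 15)
    return G @ res.x

def build_hyperplanes(G, B_points, tol=1e-9):
    """One (p_i, b_i) pair per non-relevant point b_i; skipped when b_i lies
    inside the hull (p_i equals b_i), as the text above prescribes."""
    planes = []
    for b in B_points:
        p = closest_point_on_hull(G, b)
        if np.linalg.norm(p - b) > tol:
            planes.append((p, b))
    return planes
```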

3.2 Estimating the similarity metric

The parameters of the distance metric in Equation 5 are estimated using only the relevant objects (G) along with their associated goodness scores (~v). The parameters $(\vec{q}, Q)$ are estimated independently of the feature space partitions. This permits the use of any scheme, like MindReader [10] or MARS [18], to estimate the new set of parameters.

The estimates used by MindReader [10] are as follows:

$$\vec{q}_{new} = \frac{\sum_{\vec{x}_i \in G} v_i \vec{x}_i}{\sum_{\vec{x}_i \in G} v_i} \qquad (21)$$

$$Q_{new} = \det(C)^{\frac{1}{n_d}}\, C^{-1} \qquad (22)$$

C is the weighted covariance matrix of the relevant objects, given by

$$C = [c_{jk}], \quad c_{jk} = \sum_{\vec{x}_i \in G} v_i (x_{ij} - q_j)(x_{ik} - q_k) \qquad (23)$$

MARS [18] is a special case of MindReader, where

$$Q_{new} = [Q_{jj}], \quad Q_{jj} \propto \frac{1}{\sigma_j^2} \qquad (24)$$

with $\sigma_j^2$ being the variance of the jth feature over all objects in G.
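A sketch of these estimates in Python (ours; it assumes C is non-singular, i.e. enough relevant objects relative to n_d):

```python
import numpy as np

def mindreader_estimates(G, v):
    """Equations 21-23. G: |G| x n_d array of relevant vectors, v: goodness scores."""
    q = (v[:, None] * G).sum(axis=0) / v.sum()                # Eqn. 21
    diffs = G - q
    C = np.einsum("i,ij,ik->jk", v, diffs, diffs)             # Eqn. 23
    nd = G.shape[1]
    Q = np.linalg.det(C) ** (1.0 / nd) * np.linalg.inv(C)     # Eqn. 22
    return q, Q

def mars_estimates(G, v):
    """The diagonal special case (Equation 24): Q_jj proportional to 1/sigma_j^2."""
    q = (v[:, None] * G).sum(axis=0) / v.sum()
    return q, np.diag(1.0 / G.var(axis=0))
```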

Input: G, B — the sets of relevant and non-relevant objects.
Output: the next set of k data objects.

∀ b_i ∈ B {
    Find p_i by solving (14)
    H_i defined as (p_i − b_i) · (x − b_i) = ε  (Eqn. 17)
}
Compute (q_new, Q_new) using Equations 21 and 22.
∀ x_i ∈ D {
    if ( (p_j − b_j) · (x_i − b_j) > ε, ∀ b_j ∈ B ) {
        /* x_i lies in the relevant region */
        Dist_i = d(x_i, q_new, Q_new)  (Eqn. 5)
    } else {
        /* x_i lies in the non-relevant region */
        Dist_i = ∞
    }
}
Return the top k objects in D in increasing order of their distances (Dist).

Figure 2: Algorithm to retrieve the top k relevant database items.


3.3 Ranking and retrieval

Figure 2 illustrates one iteration of the retrieval process. Given a vector $\vec{x}$ representing an item in the database, it is determined whether $\vec{x}$ belongs to the relevant region by checking if $\vec{x}$ lies in the positive halfspace of each hyperplane $H_i$. Distances for items in the relevant region are computed using the estimated distance function $d_{new}$. The k items having the smallest distances are presented to the user for further feedback.
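The ranking step of Figure 2 can be sketched as follows (our illustration; `hyperplanes` holds the (p_i, b_i) pairs from the partitioning sketch in Section 3.1, and the normalized dot product follows Equation 18):

```python
import numpy as np

def retrieve_top_k(X, q, Q, hyperplanes, k, eps=1e-3):
    """Rank database items X (n x n_d) in the relevant region by d(x, q, Q)
    and return the indices of the top k; items outside get infinite distance."""
    diffs = X - q
    dists = np.einsum("ij,jk,ik->i", diffs, Q, diffs)      # d(x_i, q, Q), Eqn. 5
    for p, b in hyperplanes:
        w = p - b
        side = (X - b) @ w / (np.linalg.norm(w) *
                              np.linalg.norm(X - b, axis=1) + 1e-12)
        dists[side <= eps] = np.inf                        # negative halfspace of H_i
    return np.argsort(dists)[:k]
```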

4 Experimental Setup

We tested our relevance feedback algorithm incorporating non-relevant objects on synthetic and real datasets. The experiments demonstrate that our algorithm effectively uses the non-relevant objects to restrict the search space. Many non-relevant objects that would be retrieved by conventional algorithms are rejected, improving the retrieval accuracy.

Four datasets were used for the experiments; one dataset was synthetically generated and the other three were real. In our experiments, feedback is provided by labeling a retrieved object as either relevant or non-relevant. To enable simulation of real users, the following assumptions are made. Each object in a database is associated with a class. Note that these class labels are used for the purpose of simulation only; in a real world setting each user has a distinct notion of the set of objects matching her requirements and hence assigning class labels is not useful. For each simulation run, an object class is chosen to be the target class. An object from the target class is used to start the relevance feedback loop. The user's feedback is simulated by labelling objects from the target class as relevant and labelling objects from other classes as non-relevant. We also assume that the user provides feedback on a fixed number of retrieved objects, with the objects to be labelled picked randomly from the retrieved set.

As discussed in Section 1, by changing the structure of the Q matrix of the quadratic distance function, we obtain different distance metrics. In practice there is a tradeoff between increased flexibility and the robustness with which parameters can be estimated. We chose MARS over the MindReader metric as the number of parameters to be estimated is an order smaller and hence the parameters can be more robustly estimated when there are very few (< n_d) relevant objects. To demonstrate that it is still feasible to use MindReader for small dimensions, we use the generalized ellipsoid distance metric for experiments with the synthetic 2D dataset.

4.1 Datasets

A brief description of the datasets used for our experiments follows.

• Synthetic 2D dataset: To make analysis and visualization easier, we synthesized a two-dimensional dataset. This dataset is plotted in Figure 3. There are 400 relevant points and 2500 non-relevant points.

• Digits dataset: The public domain PEN dataset² for pen-based recognition of handwritten digits has 10 classes; each class represents a digit. Digits are represented by 16-dimensional attribute vectors. We use 100 objects from each class to create our dataset.

• Letter recognition: The Letter Recognition dataset² has 26 classes corresponding to the letters of the English alphabet. Letters are represented by 16 attributes; we choose 100 objects from each class to constitute the dataset.

• CAR dataset: This dataset consists of automobile specifications extracted from online car stores. A car is represented by 24 numerical attributes; price, torque, and weight are examples of the attributes present. We do not use nominal attributes such as transmission (which takes the values manual or automatic). We use the vehicle type (e.g. Midsize Sedan, Fullsize SUV, Sport Hatchback) as the class label for a car. We choose 21 vehicle types, each having at least 25 distinct cars, to constitute a dataset of 1270 cars.

² UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html

[Figure 3: Synthetic 2D dataset (relevant region, center, relevant and non-relevant points).]

[Figure 4: Precision at 0.8 recall in successive iterations of relevance feedback (Proposed vs. MindReader).]

[Figure 5: Euclidean distance between centers of estimated and target similarity metrics at successive iterations of relevance feedback (Proposed vs. MindReader).]

[Figure 6: Dot product of major axis of target and estimated similarity metrics at successive iterations of relevance feedback (Proposed vs. MindReader).]

[Figure 7: Ellipses represent retrieved regions necessary to achieve 80% recall at different iterations of relevance feedback (MindReader algorithm vs. proposed method; iterations 1, 3, 5, 7). For the proposed algorithm the actual retrieved region is obtained by intersecting the ellipse with halfspaces; see Section 4.3.1 and Figure 8.]

4.2 Measuring retrieval effectiveness

Precision and recall are standard metrics used for measuring retrieval effectiveness. Recall and precision are defined as

$$recall = \frac{|relevant \cap retrieved|}{|relevant|} \qquad (25)$$

$$precision = \frac{|relevant \cap retrieved|}{|retrieved|} \qquad (26)$$
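On sets of object identifiers, Equations 25 and 26 amount to the following (a trivial sketch of ours):

```python
def precision_recall(relevant: set, retrieved: set) -> tuple:
    """Returns (precision, recall) per Equations 26 and 25."""
    hits = len(relevant & retrieved)
    return hits / len(retrieved), hits / len(relevant)
```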

4.3 Results and Discussion

4.3.1 Results for Synthetic 2D dataset

In this section we compare the performance of the proposed retrieval algorithm using the generalized ellipsoid distance metric with MindReader for the Synthetic 2D dataset. In Figure 4, the precision obtained at 0.8 recall in successive iterations is compared. Using non-relevant objects to restrict the search space significantly improves precision. Figure 7 provides a visual representation of the distance metrics for the two experiments. An ellipse represents the region to be retrieved to achieve 0.8 recall. The target ellipse represents all objects of the target class.

In the case of the proposed algorithm, the actual retrieved region is obtained by intersecting the ellipse with the estimated relevant region. A set of hyperplanes, each corresponding to a non-relevant point, is determined as detailed in Section 3. The relevant region is then obtained as the intersection of the positive half spaces of each of the hyperplanes (Figure 8). The parameters of the MindReader metric estimated using the relevant objects are used to rank objects in the relevant region. Referring to Figure 8, each separating plane corresponds to a non-relevant object. The unshaded area represents the relevant region obtained as the intersection of the positive half-spaces. The shaded area forms the non-relevant region. The solid ellipse represents the distance metric to achieve 0.8 recall and corresponds to the ellipses drawn in Figure 7. The intersection of the interior of the solid ellipse and the unshaded area represents the retrieved region to achieve the specified recall.

We now compare the convergence of the estimated distance metrics to the target metric. We chose two parameters to evaluate convergence.

1. Query point movement: This parameter captures how quickly the centers of the estimated distance metric converge to the target center. Figure 5 plots the Euclidean distance between the estimated centers and the target center for the two algorithms. The faster convergence of the proposed algorithm is evident.

2. Alignment: This parameter is used to compare the speed of convergence of the matrix parameter Q (of the distance metric) to the target Q. The dot product of the major axis of the estimated ellipse and the major axis of the target ellipse indicates the degree of alignment of the two metrics. The dot products for the two algorithms are compared in Figure 6. Here again the proposed algorithm shows faster convergence. Note that even though the difference in dot products is quite small, the larger difference in angle is manifested in Figure 7.

[Figure 8: Restricted search space and similarity metric at different iterations of relevance feedback for the 2D dataset (panels (a)-(f): iterations 1, 3, 4, 5, 6, 7; each panel shows the target, the 0.8 recall ellipse, separating planes, and relevant/non-relevant points).]

4.3.2 Real Datasets

We now demonstrate the improvement in performance of the proposed algorithm using the weighted Euclidean distance metric over the MARS algorithm on three real datasets.

4.3.3 Improvement in average precision

Figures 9(a), 9(b), and 9(c) plot the average precision for the three datasets. The size of relevance feedback per iteration is 15; the feedback provided is accumulated over successive iterations. Precision is computed from the 100 top ranked objects. The precision statistics for a dataset are computed using trials conducted with each class in the dataset as the target class. For a particular target class, trials are repeated using each object in the target class as the starting point for the relevance feedback loop. For the same set of experiments, Figures 9(d), 9(e), and 9(f) plot the average precision after 6 relevance feedback iterations with increasing number of retrieved objects. This is equivalent to computing precision at increasing recall values. The proposed algorithm shows a consistent improvement in precision for all datasets, both across relevance feedback iterations and with a large number of retrieved objects.

In Figure 10, we plot the average size of the relevant partition computed by the proposed algorithm. In the Digits dataset with 10 classes, each class constitutes 10% of the dataset, whereas in the Letter dataset each class constitutes ≈ 4% of the whole dataset. The average size of a class in the CAR dataset is ≈ 5%. These variations in the size of the target class as a fraction of the dataset lead to differences in the size of the estimated relevant regions in the above plot. The accuracy of the estimated relevant region is shown in Figure 11. The fraction of objects from the target class that lie in the estimated relevant region serves as a measure of accuracy of the partitioning scheme. It is clear from the above plots that the proposed algorithm effectively utilizes non-relevant objects to accurately constrain the search space. The accuracy of the estimated relevant region improves with relevance feedback iterations, demonstrating that the proposed algorithm is able to refine the relevant region to match the true distribution. The relatively poor accuracy in the case of the Letter dataset is due to the multimodal distribution of the objects in a class, i.e. the objects in a class do not constitute a convex set.

[Figure 9: Precision for MARS and the proposed algorithm (Section 4.3.3), with mean +/- standard deviation: (a) Letter dataset, 2600 trials; (b) Digits dataset, 1000 trials; (c) CAR dataset, 1270 trials. (a)-(c) plot precision in the top 100 retrieved objects at successive relevance feedback iterations; (d)-(f) compare precision measured with increasing number of retrieved objects at the 6th relevance feedback iteration.]

4.3.4 Improvement in precision (0.4 recall, 50 feedback per iteration)

Here, the parameter of interest is the improvement I in precision achieved by the proposed algorithm over MARS after the same number of relevance feedback iterations with the same starting point:

$$I = Precision_{Proposed} - Precision_{MARS}. \qquad (27)$$

The range of I over all experiments is split into a fixed number of bins. Each experiment is assigned to a bin based on the value of I for the experiment.

In order to accurately quantify the improvement, we plot in Figures 13(a), 13(b), and 13(c) the improvement in precision I against the precision achieved by MARS ($Precision_{MARS}$) for the same test. In Figure 13(a) we see that the percentage of cases where I > 0.3 reduces as $Precision_{MARS}$ increases. This is expected since the scope for improvement decreases at higher precision values; such cases are shifted to smaller bins, resulting in an increase in the % of tests showing smaller improvements (0-0.3) at larger values of precision.

4.3.5 Improvement in precision at different recall values across multiple tests

These experiments are similar to those of Section 4.3.4 except that the experiments are repeated for different values of recall. The improvement in precision I (Equation 27) is binned to create a histogram for each value of recall. Consider Figure 14(c): at larger values of recall there is an increase in the percentage of cases where the proposed algorithm gives lower precision (I < 0). Suppose that the objects in the target class are not contained in a convex set, i.e. objects from other classes overlap with the target class. A partition estimated by the proposed algorithm in this case will incorrectly prune some relevant objects, causing lower precision than MARS (which is not prevented from retrieving those examples at large recall). The percentage of cases with I = 0 (no improvement) is larger at small recall values (Figures 14(b), 14(c)). In most of these cases MARS achieves a precision of 1 and hence no further improvement is possible.

4.3.6 Improvement in precision with different sizes of feedback (fixed recall)

In real systems users typically provide only a small number of relevance judgements at each iteration. Hence a retrieval algorithm must perform consistently with very small sizes of feedback. We performed experiments with the number of relevance judgements varying from 5% to 50% of the size of the target class. Note that both relevant and non-relevant feedback count as relevance judgements. The plots in Figure 15 show the percentage of cases with I > 0 and the percentage of cases with I < 0.

The improvement achieved by the proposed algorithm at all feedback sizes is clear. Larger feedback sizes lead to more cases with I > 0. This demonstrates that the proposed algorithm has successfully utilized additional non-relevant information to accurately refine the search space. The percentage of cases with I > 0 increases with iterations of relevance feedback, demonstrating the effectiveness of the proposed algorithm.

[Figure 10: Average size of the relevant region for the proposed algorithm with 15 feedback provided per iteration (Section 4.3.3), for the CAR (1270 trials), Letter (2600 trials) and Digits (1000 trials) datasets. The fraction of database objects in the relevant partition is used as an estimate for the size of the relevant region.]

[Figure 11: Average accuracy of the relevant region estimated by the proposed algorithm (Section 4.3.3) for the same datasets. The fraction of objects from the target class that lie in the estimated relevant region determines the accuracy.]

[Figure 12: Comparison of average running times for MARS and the proposed algorithm (Section 4.3.7) on the Letter dataset with 10 feedback per iteration. Training time represents the execution time for parameter estimation; ranking time is the time required to compute and sort similarity scores for all database objects.]

[Figure 13: Histograms of I for different values of the precision of MARS (Section 4.3.4): (a) Letter dataset; (b) Digits dataset; (c) CAR dataset.]

4.3.7 Execution time of the proposed algorithm

Figure 12 plots the time required for parameter estimation and for ranking and retrieval using the Letter dataset with 10 feedback per iteration on a 1000 MHz Intel PIII processor. Since feedback is accumulated over iterations, the size of the training set increases with relevance feedback iterations. The MARS algorithm has the least processing requirement and shows no visible performance degradation for either parameter estimation or ranking with larger training sets. The training time for the proposed algorithm grows rapidly with the number of relevant objects in the training set. This is due to the complexity of the constrained optimization procedure. Since an object in the database is evaluated against hyperplanes corresponding to each of the non-relevant objects, the retrieval time grows linearly with the number of non-relevant objects in the training set.

5 Conclusion

We have proposed a novel technique for improving the accuracy of adaptable similarity based retrieval by incorporating negative relevance judgements, and demonstrated excellent performance and robustness of the proposed scheme with a large number of experiments. The experiments also demonstrate that the proposed scheme improves performance when the size of feedback is small. Ad-hoc techniques have been proposed and studied in the past for using both relevant and non-relevant judgements during similarity based retrieval. Past studies have reported that such methods frequently lead to unfavorable performance, because incompatible information conveyed by the relevance and non-relevance judgements is combined to derive the ranking function. Instead, we have proposed a two-step approach, where non-relevant objects in conjunction with relevant objects are used to define the feasible search space. The ranking function, estimated using only the relevant objects, is used to retrieve the top k matches from inside the feasible search region. This enables the search to explicitly move away from the non-relevant region, while keeping close to the relevant region. Note that our proposed method does not depend on database-specific parameter tuning. Moreover, it is usable on top of existing schemes, e.g., MindReader and MARS.

[Figure 14: Histograms of I for different recall values at relevance feedback iteration 5: (a) Letter dataset; (b) Digits dataset; (c) CAR dataset. Each bar represents the % of tests where the difference in precision (I) falls in a specified range; bars corresponding to negative differences are shown inverted for clarity.]

[Figure 15: Effect of size of feedback on precision, at 0.4 recall. Plots the % of cases where the difference Precision_Proposed − Precision_MARS is positive (improvement) or negative (deficient) for different sizes of feedback: (a)-(c) Letter, Digits, and CAR datasets at iteration 1; (d)-(f) Letter (iteration 3), Digits (iteration 3), and CAR (iteration 4).]

Implementation of the proposed ellipsoid query processing with search space pruning on multidimensional indexing structures is of further interest to us for improving the processing speed. Our ongoing work includes dimensionality reduction to address an inadequate number of relevance judgements in high-dimensional feature space during a typical search session, and query expansion with non-relevant information so that multimodal (multiple disjoint ellipsoids) information needs of a user can be supported.

References

[1] M. Ankerst, B. Braunmuller, H.-P. Kriegel, and T. Seidl. Improving adaptable similarity query processing by using approximations. In Proc. of VLDB, New York, NY, 1998.

[2] S. Berchtold and H.-P. Kriegel. S3: Similarity search in CAD database systems. In Proc. of SIGMOD, pages 564-567, Phoenix, AZ, 1997.

[3] C. Bohm, B. Braunmuller, F. Krebs, and H.-P. Kriegel. Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data. In Proc. of SIGMOD, pages 379-388, Santa Barbara, CA, 2001.

[4] C. Buckley and G. Salton. Optimization of relevance feedback weights. In Proc. of SIGIR, pages 351-357, Seattle, WA, 1995.

[5] A. Singhal, M. Mitra, and C. Buckley. Learning routing queries in a query zone. In Proc. of SIGIR, Philadelphia, PA, 1997.

[6] M. D. Dunlop. The effect of accessing non-matching documents on relevance feedback. ACM Trans. on Information Systems, 1997.

[7] R. Brunelli and O. Mich. Image retrieval by examples. IEEE Trans. on Multimedia, volume 2, number 3, pages 164-171, 2000.

[8] J. Quinlan. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.

[9] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, volume 2, pages 121-167, 1998.

[10] Y. Ishikawa, R. Subramanya, and C. Faloutsos. MindReader: Querying databases through multiple examples. In Proc. of VLDB, 1998.

[11] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, et al. Query by image and video content: The QBIC system. IEEE Computer, volume 28, number 9, pages 23-32, 1994.

[12] M. Iwayama. Relevance feedback with a small number of relevance judgements: Incremental relevance feedback vs. document clustering. In Proc. of SIGIR, pages 10-16, Athens, Greece, 2000.

[13] H.-P. Kriegel and T. Seidl. Approximation-based similarity search for 3-D surface segments. GeoInformatica Journal, Kluwer Academic Publishers, 1998.

[14] D. G. Luenberger. Introduction to Linear and Nonlinear Programming. Addison-Wesley, 1973.

[15] C. Meilhac and C. Nastar. Relevance feedback and category search in image databases. In Proc. of IEEE Conf. on Multimedia Computing and Systems, Florence, Italy, 1999.

[16] D. Rafiei and A. Mendelzon. Similarity-based queries for time-series data. In Proc. of SIGMOD, pages 13-25, Phoenix, AZ, 1997.

[17] J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323, 1971.

[18] Y. Rui, T. Huang, and S. Mehrotra. Content-based image retrieval with relevance feedback in MARS. In Proc. of ICIP, 1997.

[19] Y. Sakurai, M. Yoshikawa, R. Kataoka, and S. Uemura. Similarity search for adaptive ellipsoid queries using spatial transformation. In Proc. of VLDB, Rome, Italy, 2001.

A Reduced gradient algorithm

Consider the minimization problem:

$$\min_{\vec{\lambda}} (G\vec{\lambda} - \vec{b})^T (G\vec{\lambda} - \vec{b}) \qquad (28)$$

$$\text{Subject to: } \vec{e}^{\,T} \vec{\lambda} = 1 \text{ and } \lambda_i \ge 0,\; \forall i \qquad (29)$$

where $\vec{e} = [1 \ldots 1]^T$. The problem can be rewritten as

$$\min_{\vec{\lambda}} \frac{1}{2} \vec{\lambda}^T D \vec{\lambda} - H^T \vec{\lambda} \qquad (30)$$

$$\text{Subject to: } \vec{e}^{\,T} \vec{\lambda} = 1 \text{ and } \lambda_i \ge 0,\; \forall i \qquad (31)$$

where $D = 2G^T G$ and $H = 2G^T \vec{b}$.

Let n = |G|. Specifying any n − 1 of the $\lambda_i$ uniquely determines the value of the nth λ (using Equation 31). Hence we split $\vec{\lambda}$ into $(\lambda_y, \vec{\lambda}_z)$, where $\vec{\lambda}_z$ is an independent (n − 1)-sized vector and $\lambda_y$ is a scalar dependent on $\vec{\lambda}_z$. $\lambda_y$ is chosen to be one of the strictly positive components of $\vec{\lambda}$. See [14] for a detailed description of the algorithm. The problem now reduces to:

$$\min_{\lambda_y, \vec{\lambda}_z} \frac{1}{2} \vec{\lambda}^T D \vec{\lambda} - H^T \vec{\lambda} \qquad (32)$$

$$\text{Subject to: } \lambda_y + \vec{e}^{\,T} \vec{\lambda}_z = 1,\; \lambda_y \ge 0,\; \vec{\lambda}_z \ge 0 \qquad (33)$$

We now use a modified steepest descent method using the reduced gradient. The reduced gradient at a point $\vec{\lambda} = (\lambda_y, \vec{\lambda}_z)$ is obtained as

$$r = \nabla_{\vec{\lambda}_z} f(\lambda_y, \vec{\lambda}_z) - \nabla_{\lambda_y} f(\lambda_y, \vec{\lambda}_z)\, B^{-1} C \qquad (34)$$

where B and C are the coefficients of $\lambda_y$ and $\vec{\lambda}_z$ in the equality constraint (33); here B = 1 and $C = \vec{e}^{\,T}$. The centroid of the relevant points, whose corresponding $\vec{\lambda}$ is given by $\vec{\lambda}_{initial} = [\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}]^T$, can be used as a feasible starting point for the iterative process. One iteration of the reduced gradient method is as follows:

1. Compute $r(\vec{\lambda})$ using Equation 34.

2. Let
$$\Delta\lambda_{z_i} = \begin{cases} -r_{z_i} & \text{if } r_{z_i} < 0 \text{ or } \lambda_{z_i} > 0 \\ 0 & \text{otherwise} \end{cases}$$
If $\Delta\vec{\lambda}_z = 0$, then return the current $\vec{\lambda}$ as the solution; else find $\Delta\lambda_y = -B^{-1} C \Delta\vec{\lambda}_z$.

3. Find $\alpha_1, \alpha_2, \alpha_3$ so that
$$\alpha_1 = \max\{\alpha : \lambda_y + \alpha\Delta\lambda_y \ge 0\}$$
$$\alpha_2 = \max\{\alpha : \vec{\lambda}_z + \alpha\Delta\vec{\lambda}_z \ge 0\}$$
$$\alpha_3 = \min\{\alpha_1, \alpha_2, \alpha'\}$$
where $\alpha' = -\frac{\Delta\vec{\lambda}^T (D\vec{\lambda} - H)}{\Delta\vec{\lambda}^T D \Delta\vec{\lambda}}$. Set $\vec{\lambda} = \vec{\lambda} + \alpha_3 \Delta\vec{\lambda}$.

4. If $\alpha_3 < \alpha_1$ then go to step 1; else incorporate $\lambda_y$ into $\vec{\lambda}_z$ and mark one of the strictly positive components of $\vec{\lambda}_z$ as $\lambda_y$.
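For reference, a compact Python sketch of this iteration follows. It is our reading of the appendix, written for clarity over speed: the basic variable is tracked by an index rather than by reordering the vector, and the exact line search for the quadratic is clipped to the feasible region as in step 3.

```python
import numpy as np

def reduced_gradient_qp(G, b, tol=1e-8, max_iter=500):
    """Sketch of the Appendix A iteration for
    min (1/2) lam^T D lam - H^T lam  s.t.  sum(lam) = 1, lam >= 0,
    with D = 2 G^T G and H = 2 G^T b."""
    n = G.shape[1]
    D = 2.0 * G.T @ G
    H = 2.0 * G.T @ b
    lam = np.full(n, 1.0 / n)            # centroid: feasible starting point
    y = 0                                # index of the basic variable lambda_y
    for _ in range(max_iter):
        grad = D @ lam - H               # gradient of the quadratic objective
        r = grad - grad[y]               # reduced gradient (B = 1, C = e^T)
        dlam = np.where((r < 0) | (lam > 0), -r, 0.0)   # step 2 rule
        dlam[y] = 0.0
        dlam[y] = -dlam.sum()            # delta lambda_y keeps sum(lam) = 1
        if np.abs(dlam).max() < tol:
            return lam                   # step 2: current lam is the solution
        denom = dlam @ D @ dlam          # step 3: exact line search, clipped
        a_prime = -(dlam @ grad) / denom if denom > tol else np.inf
        neg = dlam < 0
        a_feas = (-lam[neg] / dlam[neg]).min() if neg.any() else np.inf
        lam = np.maximum(lam + min(a_prime, a_feas) * dlam, 0.0)
        if lam[y] <= tol:                # step 4: basic variable hit its bound,
            y = int(np.argmax(lam))      # re-mark a strictly positive component
    return lam
```

The returned `lam` holds the convex combination weights; the closest point on the hull, as used in Equation 14, is then `G @ lam`.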

