MASARYK UNIVERSITY

FACULTY OF INFORMATICS

Scalable and Distributed

Similarity Search

PH.D. THESIS

Michal Batko

Brno, May 2006

Acknowledgement

I would like to thank my supervisor Pavel Zezula for guidance, insight and patience during this research.

Abstract

This Ph.D. thesis concerns the problem of distributed indexing techniques for similarity search in metric spaces. Solutions for efficient evaluation of similarity queries, such as range or nearest neighbor queries, existed only for centralized systems. However, the amount of data produced in digital form grows exponentially every year, and the traditional paradigm of one huge database system holding all the data seems to be insufficient. The distributed computing paradigm – especially peer-to-peer data networks and the GRID infrastructure – is a promising solution to the problem, since it makes it possible to employ a virtually unlimited pool of computational and storage resources.

Nevertheless, centralized similarity-search index structures cannot be directly used in the distributed environment, and some adjustments and design modifications are needed. In this thesis, we describe a distributed metric-space-based index structure which was, as far as we know, the very first distributed solution in this area. It adopts the peer-to-peer data network paradigm and implements the two basic similarity queries – the range query and the k-nearest-neighbors query. The technique is fully scalable and can easily grow over a practically unlimited number of computers. It is also strictly decentralized: there is no “global” centralized component, so the emergence of hot-spots is minimized. The properties of the structure are verified experimentally, and we also provide a comprehensive comparison of this method with three other distributed metric-space indexing techniques proposed so far.

Supervisor: prof. Ing. Pavel Zezula, CSc.

Keywords

index structures

distributed computing

scalability

peer-to-peer networks

metric space

similarity search

range query

nearest neighbor query

Contents

1 Introduction
2 The Similarity Search Problem
   2.1 The Metric Space
   2.2 Distance Measures
      2.2.1 Minkowski Distances
      2.2.2 Quadratic Form Distance
      2.2.3 Edit Distance
      2.2.4 Tree Edit Distance
      2.2.5 Jaccard's Coefficient
      2.2.6 Hausdorff Distance
      2.2.7 Time Complexity
   2.3 Similarity Queries
      2.3.1 Range Query
      2.3.2 Nearest-Neighbor Query
      2.3.3 Reverse Nearest-Neighbor Query
      2.3.4 Similarity Join
      2.3.5 Combinations of Queries
      2.3.6 Complex Similarity Queries
3 The Scalability Challenge
   3.1 Basic Partitioning Principles
      3.1.1 Ball Partitioning
      3.1.2 Generalized Hyperplane Partitioning
      3.1.3 Excluded Middle Partitioning
   3.2 Avoiding Distance Computations
      3.2.1 Object-Pivot Distance Constraint
      3.2.2 Range-Pivot Distance Constraint
      3.2.3 Pivot-Pivot Distance Constraint
      3.2.4 Double-Pivot Distance Constraint
      3.2.5 Pivot Filtering
   3.3 Dynamic Index Structures
      3.3.1 M-tree
      3.3.2 D-index
      3.3.3 Scalability Experiments
   3.4 Research Objective
4 Distributed Index Structures
   4.1 Scalable and Distributed Data Structures
      4.1.1 Distributed Linear Hashing
      4.1.2 Distributed Random Tree
      4.1.3 Distributed Dynamic Hashing
      4.1.4 Scalable Distributed Order Preserving Data Structures
   4.2 Unstructured Peer-to-Peer Networks
      4.2.1 Napster
      4.2.2 Gnutella
   4.3 Distributed Hash Table Peer-to-Peer Networks
      4.3.1 Content-Addressable Network
      4.3.2 Chord
   4.4 Tree-based Peer-to-Peer Networks
      4.4.1 P-Grid
      4.4.2 P-Tree
   4.5 Multi-dimensional Range Queries
      4.5.1 Space-filling Curves with Range Partitioning
      4.5.2 Multidimensional Rectangulation with kd-trees
   4.6 Nearest Neighbors Queries
      4.6.1 pSearch
      4.6.2 Small-World Access Methods
5 GHT* – Native Peer-to-Peer Similarity Search Structure
   5.1 Architecture
   5.2 Address Search Tree
   5.3 Storage Management
      5.3.1 Bucket Splitting
      5.3.2 Choosing Pivots
   5.4 Insertion of Objects
   5.5 Range Search
   5.6 Nearest-Neighbors Search
   5.7 Deletions and Updates of Objects
   5.8 Image Adjustment
   5.9 Logarithmic Replication Strategy
   5.10 Joining the Peer-to-Peer Network
   5.11 Leaving the Peer-to-Peer Network
6 GHT* Evaluation
   6.1 Datasets and Computing Infrastructure
   6.2 Performance of Storing Data
   6.3 Performance of Similarity Queries
      6.3.1 Global Costs
      6.3.2 Parallel Costs
      6.3.3 Comparison of Range and Nearest-Neighbors Search Algorithms
   6.4 Data Volume Scalability
7 Scalability Comparison of Distributed Similarity Search
   7.1 MCAN
   7.2 M-Chord
   7.3 VPT*
   7.4 Comparison Background
   7.5 Scalability with Respect to the Size of the Query
   7.6 Scalability with Respect to the Size of Datasets
   7.7 Number of Simultaneous Queries
   7.8 Comparison Summary
8 Conclusions
   8.1 Summary
   8.2 Research Directions
Bibliography

Chapter 1

Introduction

The search problem is constrained in general by the type of data stored in the underlying database, the method of comparing individual data instances, and the specification of the query by which users express their information needs. The traditional approach, typical for common database systems, applies the search operation to structured (attribute-type) data. Then, when a query is given, the records exactly matching the query are returned. Complex data types – such as images, videos, time series, text documents, DNA sequences, etc. – are becoming increasingly important in modern data processing applications. A common type of searching in such applications is based on gradual rather than exact relevance, so it is called similarity or content-based retrieval. What similarity means in particular may vary greatly between different data domains, applications, or even from one user to another.

Treating data collections as metric objects brings a great advantage in generality, because many data classes and information-seeking strategies conform to the metric view. Accordingly, a single metric indexing technique can be applied to many specific search problems quite different in nature. In this way, the important extensibility property of indexing structures is automatically satisfied. An indexing scheme that allows various forms of queries, or which can be modified to provide additional functionality, is of more value than an indexing scheme otherwise equivalent in power, or even better in certain respects, but which cannot be extended.

Because of the solid mathematical foundations underlying the notion of metric space, straightforward but precise partitioning and pruning rules can be constructed. This is very important for developing index structures, especially in cases where query execution costs are not only I/O-bound but also CPU-bound. Many metric index structures have been proposed, and their results demonstrate significant speed-up (both in terms of distance computations and disk-page reads) in comparison with the sequential search. Unfortunately, the search costs also increase linearly with the size of the dataset. This means that when the data file grows, sooner or later the response time becomes intolerable.

On the other hand, it is estimated that 93% of the data now produced is in digital form. The amount of data added each year exceeds one exabyte (i.e., 10^18 bytes) and is estimated to grow exponentially. In order to manage similarity search in multimedia data types such as plain text, music, images, and video, this trend calls for putting equally scalable infrastructures in motion. In this respect, the GRID infrastructures and the peer-to-peer communication paradigm are quickly gaining in popularity due to their scalability and self-organizing nature, forming bases for building large-scale similarity search indexes at low cost.

Most of the numerous peer-to-peer search techniques proposed in recent years have focused on single-key retrieval. Since the retrieved records either exist (and then they are retrieved) or they do not, there are no problems with query relevance or degree of matching. Although techniques for range or nearest neighbor queries have been proposed very recently, they are usually limited to a specialized domain of objects or efficient only for a specific application, and they lack the extensibility of the metric space approach. Thus, the objective of our work is to combine the extensibility of the metric space similarity search with the power and high scalability of peer-to-peer data networks, employing their virtually unlimited storage and computational resources.

This thesis is organized as follows. Chapter 2 covers the background and definitions of the similarity search problem in metric spaces. In Chapter 3 we discuss the problem of scalability, i.e. the ability to maintain a reasonable search response while the stored dataset grows. Chapter 4 provides a survey of existing distributed index-searching structures. We focus mainly on the scalable and distributed index structures paradigm, which subsequently evolved into the peer-to-peer data networks. In Chapter 5, we propose a novel peer-to-peer based technique for similarity search in metric spaces that scales up with (nearly) constant search response time. Exhaustive experimental confirmation of its properties is provided in Chapter 6. Finally, we describe another variant of our structure along with two other recently published distributed similarity search structures in Chapter 7 and show the results of a comprehensive experimental comparison between them. We conclude in Chapter 8 and outline future research directions.

Chapter 2

The Similarity Search Problem

Searching has always been one of the most prominent data processing operations. However, the exact-match retrieval, typical for traditional databases, is neither feasible nor meaningful for data types in the present digital age. The reason is that the constantly expanding data of modern digital collections lacks structure and precision. Because of this, what constitutes a match to a request is often different from that implied in more traditional, well-established areas.

A very useful, if not necessary, search paradigm is to quantify the proximity, similarity, or dissimilarity of a query object versus the objects stored in a database to be searched. Roughly speaking, objects that are near a given query object form the query response set. A useful abstraction for nearness is provided by the mathematical notion of metric space [42]. We consider the problem of organizing and searching large datasets from the perspective of generic or arbitrary metric spaces, sometimes conveniently labelled distance spaces. In general, the search problem can be described as follows:

Problem 2.0.1 Let D be a domain, d a distance measure on D, and (D, d) a metric space. Given a set X ⊆ D of n elements, preprocess or structure the data so that proximity queries are answered efficiently.

From a practical point of view, X can be seen as a file (a dataset or a collection) of objects that takes values from the domain D, with d as the proximity measure, i.e. the distance function defined for an arbitrary pair of objects from D. Though several types of similarity queries exist and others are expected to appear in the future, the basic types are known as the similarity range and the nearest neighbor(s) queries.

In a distance space, the only possible operation on data objects is the computation of a distance function on pairs of objects which satisfies the triangle inequality. In contrast, objects in a coordinate space – coordinate space being a special case of metric space – can be seen as vectors. Such spaces satisfy some additional properties that can be exploited in storage (index) structure designs. Naturally, the distance between vectors can be computed, but each vector can also be uniquely located in coordinate space. Further, vector representation allows us to perform operations like vector addition and subtraction. Thus, new vectors can be constructed from prior vectors. For more information, see e.g. [26, 7] for surveys of techniques that exploit the properties of coordinate space.

Since many data domains in use are represented by vectors, there might seem to be little point in hunting for efficient index structures in pure metric spaces, where the number of possible geometric properties would seem limited. The following discussion should clarify the issue and provide sufficient evidence of the importance of the distance searching problem.

Applications managing non-vector data like character strings (natural language words, DNA sequences, etc.) do exist, and their number is growing. But even when the objects processed are vectors, the properties of the underlying coordinate space cannot always be easily exploited. If the individual vectors are correlated, i.e. there is cross-talk between them, the neighborhood of the vectors seen through the lens of the distance measure between them will not map directly to their coordinate space, and vice versa. Distance functions which allow user-defined weights to be specified better reflect the user's perception of the problem and are therefore preferable. This occurs, for instance, when searching images using color similarity, where cross-talk between color components is a factor that must be taken into account.

Existing solutions for searching the coordinate space suffer from the so-called dimensionality curse – such structures either become slower than naive algorithms with linear search times, or they use too much space. Though the structure of indexed data may be intrinsically much simpler (the data may, e.g., lie in a lower-dimensional hyperplane), this is typically difficult to ascertain. Moreover, some spaces have coordinates restricted to small sets of possible values (perhaps even binary), so that the use of such coordinates is not necessarily helpful.

Depending on the data objects, the distance measure and the dimensionality of a given space, we agree that the use of coordinates can be advantageous in special cases, resulting in non-extensible solutions. But we also agree with [15] that

to strip the problem down to its essentials by only considering distances, it is reasonable to find the minimal properties needed for fast algorithms.

In summary, the primary reasons for looking at the distance data search problem seriously are the following:

1. There are numerous applications where the proximity criteria offer no special properties but distance, so a metric search becomes the sole option.

2. Many specialized solutions for proximity search perform no better than indexing techniques based on distances. Metric search thus forms a viable alternative.

3. If a good solution utilizing a generic metric space can be found, it will provide high extensibility. It has the potential to work for a large number of existing proximity measures, as well as many others to be defined in the future.

2.1 The Metric Space

A similarity search can be seen as a process of obtaining data objects in order of their distance or dissimilarity from a given query object. It is a kind of sorting, ordering, or ranking of objects with respect to the query object, where the ranking criterion is the distance measure. Though this principle works for any distance measure, we restrict the possible set of measures by the metric postulates.

Suppose a metric space M = (D, d) defined for a domain of objects (or the objects' keys or indexed features) D and a total (distance) function d. In this metric space, the properties of the function d : D × D → R, sometimes called the metric space postulates, are typically characterized as:

∀x, y ∈ D, d(x, y) ≥ 0 non-negativity,

∀x, y ∈ D, d(x, y) = d(y, x) symmetry,

∀x, y ∈ D, x = y ⇔ d(x, y) = 0 identity,

∀x, y, z ∈ D, d(x, z) ≤ d(x, y) + d(y, z) triangle inequality.

For brevity, some authors call the metric function simply the metric. There are also several variations of metric spaces. In order to specify them more easily, we first transform the metric space postulates above into an equivalent form in which the identity postulate is decomposed into (p3) and (p4):

(p1) ∀x, y ∈ D, d(x, y) ≥ 0 non-negativity,

(p2) ∀x, y ∈ D, d(x, y) = d(y, x) symmetry,

(p3) ∀x ∈ D, d(x, x) = 0 reflexivity,

(p4) ∀x, y ∈ D, x ≠ y ⇒ d(x, y) > 0 positiveness,

(p5) ∀x, y, z ∈ D, d(x, z) ≤ d(x, y) + d(y, z) triangle inequality.
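
These postulates can also be checked empirically. The following sketch is not part of the thesis: the helper name check_metric_postulates and the sample data are ours, and a successful run only shows that no violation of (p1)–(p5) occurs on the tested sample.

```python
from itertools import product

def check_metric_postulates(d, sample):
    """Test (p1)-(p5) on every pair and triple drawn from a finite sample.

    A failure disproves the metric postulates on this sample; success
    proves nothing beyond the sample.
    """
    for x, y in product(sample, repeat=2):
        if d(x, y) < 0:                  # (p1) non-negativity
            return False
        if d(x, y) != d(y, x):           # (p2) symmetry
            return False
        if x == y and d(x, y) != 0:      # (p3) reflexivity
            return False
        if x != y and d(x, y) <= 0:      # (p4) positiveness
            return False
    for x, y, z in product(sample, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):  # (p5) triangle inequality
            return False
    return True

# The absolute difference of real numbers satisfies all five postulates.
print(check_metric_postulates(lambda a, b: abs(a - b), [0.0, 1.5, 2.0, 7.25]))  # True
```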

If the distance function does not satisfy the positiveness property (p4), it is called a pseudo-metric. In this thesis, we do not consider pseudo-metric functions separately, because such functions can be transformed to the standard metric by regarding any pair of objects with zero distance as a single object. Such a transformation is correct: if the triangle inequality (p5) holds, we can prove that d(x, y) = 0 ⇒ ∀z ∈ D, d(x, z) = d(y, z). Specifically, by combining the triangle inequalities

d(x, z) ≤ d(x, y) + d(y, z)

and

d(y, z) ≤ d(x, y) + d(x, z),

we get d(x, z) = d(y, z) whenever d(x, y) = 0.

If, on the other hand, the symmetry property (p2) does not hold, we talk about a quasi-metric. For example, let the objects be different locations within a city, and the distance function the physical distance a car must travel between them. The existence of one-way streets implies the function must be asymmetric. There are techniques to transform asymmetric distances into a symmetric form, for example:

dsym(x, y) = dasym(x, y) + dasym(y, x).

To round out our list of possible metric distance function variants, we conclude this section with a version which satisfies a stronger constraint on the triangle inequality. It is called the super-metric or the ultra-metric. Such a function satisfies the following tightened triangle inequality:

∀x, y, z ∈ D, d(x, z) ≤ max{d(x, y), d(y, z)}.

The geometric characterization of the super-metric requires every triangle to have at least two sides of equal length, i.e. to be isosceles, which implies that the third side can be no longer than the other two. Ultra-metrics are widely used in the field of biology, particularly in evolutionary biology. By comparing the DNA sequences of pairs of species, evolutionary biologists obtain an estimate of the time which has elapsed since the species separated. From these distances, an evolutionary tree (sometimes called a phylogenetic tree) can be reconstructed, where the weights of the tree edges are determined by the time elapsed between two speciation events [60, 61]. Given a set of extant species, the evolutionary tree forms an ultra-metric tree with all the species stored in leaves and an identical distance from root to leaves. The ultra-metric tree is a model of the underlying ultra-metric distance function.

2.2 Distance Measures

The distance functions of metric spaces represent a way of quantifying the closeness of objects in a given domain. In the following, we present examples of distance functions used in practice on various types of data. Distance functions are often tailored to specific applications or a class of possible applications. In practice, distance functions are specified by domain experts; however, no distance function restricts the range of queries that can be asked with this metric.

Depending on the character of the values returned, distance measures can be divided into two groups:

• discrete – distance functions which return only a small (predefined) set of values, and

• continuous – distance functions in which the cardinality of the set of values returned is very large or infinite.

An example of a continuous function is the Euclidean distance between vectors, while the edit distance on strings represents a discrete function. Some metric structures are applicable only in the area of discrete metric functions. In the following, we mainly survey metric functions used for complex data types like multidimensional vectors, strings or sets. However, even domains as simple as the real numbers (D = R) can be seen in terms of metric data, by defining the distance function as d(oi, oj) = |oi − oj|, that is, as the absolute value of the difference of any pair of numbers (oi, oj) from D.

2.2.1 Minkowski Distances

The Minkowski distance functions form a whole family of metric functions, designated as the Lp metrics, because the individual cases depend on the numeric parameter p. These functions are defined on n-dimensional vectors of real numbers as:

Lp[(x1, . . . , xn), (y1, . . . , yn)] = (∑i=1..n |xi − yi|^p)^(1/p),

where the L1 metric is known as the Manhattan distance (also the City-Block distance), the L2 distance denotes the well-known Euclidean distance, and L∞ = max i=1..n |xi − yi| is called the maximum distance, the infinite distance, or the chessboard distance. Figure 2.1 illustrates some members of the Lp family, where the shapes denote points of a 2-dimensional vector space that are at the same distance from the central point. The Lp metrics find use in a number of cases where numerical vectors have independent coordinates, e.g., in measurements of scientific experiments, environmental observations, or the study of different aspects of the business process.

Figure 2.1: The sets of points at a constant distance from the central point for different Lp distance functions.
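
As an illustration, here is a minimal sketch of the Lp family in Python; the function name minkowski_distance is ours, and p = float("inf") selects the maximum (chessboard) distance.

```python
def minkowski_distance(x, y, p=2.0):
    """Lp distance between two equally long sequences of real numbers."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    if p == float("inf"):
        # L-infinity: the maximum coordinate-wise difference.
        return max(abs(xi - yi) for xi, yi in zip(x, y))
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

u, v = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
print(minkowski_distance(u, v, 1))             # L1 (Manhattan): 5.0
print(minkowski_distance(u, v, 2))             # L2 (Euclidean): ~3.606
print(minkowski_distance(u, v, float("inf")))  # L-infinity (chessboard): 3.0
```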

2.2.2 Quadratic Form Distance

Several applications using vector data have individual components, i.e. feature dimensions, correlated, so a kind of cross-talk exists between individual dimensions. Consider, for example, color histograms of images, where each dimension represents a specific color. To compute a distance, the red component, for example, must be compared not only with the dimension representing the red color, but also with the pink and orange, because these colors are similar. The Euclidean distance L2 does not reflect any correlation of features of color histograms. A distance model that has been successfully applied to image databases in [25], and that has the power to model dependencies between different components of features, is provided by the quadratic form distance functions in [32, 64]. In this approach, the distance measure of two n-dimensional vectors is based on an n × n positive semi-definite matrix M = [mi,j], where the weights mi,j denote how strong the connection between the two components i and j of vectors x and y is. These weights are usually normalized so that 0 ≤ mi,j ≤ 1, with the diagonal elements mi,i = 1. The following expression represents a generalized quadratic distance measure dM, where the superscript T denotes vector transposition:

dM(x, y) = √((x − y)^T · M · (x − y)).

Observe that this definition of distance also subsumes the Euclidean distance when the matrix M is equal to the identity matrix. Also the weighted Euclidean distance measure can be expressed using the matrix with non-zero elements on the diagonal representing weights of the individual dimensions, i.e. M = diag(w1, . . . , wn). Applying such a matrix, the quadratic form distance formula turns out to be as follows, yielding the general formula for the weighted Euclidean distance:

dM(x, y) = √(∑i=1..n wi(xi − yi)^2).

As an example, consider simplified color histograms with three different colors (blue, red, orange) represented as 3-D vectors. Assuming three normalized histograms of a pure red image x = (0, 1, 0), a pure orange image y = (0, 0, 1) and a pure blue image z = (1, 0, 0), the Euclidean distance evaluates to L2(x, y) = √2 and L2(x, z) = √2. This implies that the orange and the blue images are equidistant from the red. However, human color perception is quite different and perceives red and orange to be more alike than red and blue. This can be modeled with the matrix M shown below, yielding an x, y distance equal to √0.2, while the x, z distance evaluates to √2.

M = ( 1.0 0.0 0.0
      0.0 1.0 0.9
      0.0 0.9 1.0 )

The quadratic form distance measure may be computationally expensive, depending upon the dimensionality of the vectors. Color image histograms are typically high-dimensional vectors consisting of 64 or 256 distinct colors (vector dimensions).
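
The following sketch evaluates the quadratic form distance directly from its definition; the function name quadratic_form_distance is ours, and it reproduces the color-histogram example above under the assumed dimension order (blue, red, orange).

```python
import math

def quadratic_form_distance(x, y, M):
    """dM(x, y) = sqrt((x - y)^T . M . (x - y)) for an n x n weight matrix M."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    n = len(diff)
    # Expand the bilinear form (x - y)^T . M . (x - y) as a double sum.
    form = sum(diff[i] * M[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(form)

# The color-histogram example from the text, dimensions ordered (blue, red, orange).
M = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.9],
     [0.0, 0.9, 1.0]]
red, orange, blue = (0, 1, 0), (0, 0, 1), (1, 0, 0)
print(quadratic_form_distance(red, orange, M))  # sqrt(0.2) ~ 0.447
print(quadratic_form_distance(red, blue, M))    # sqrt(2)   ~ 1.414
```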

2.2.3 Edit Distance

The closeness of sequences of symbols (strings) can be effectively measured by the edit distance, also called the Levenshtein distance, presented in [50]. The distance between two strings x = x1 · · · xn and y = y1 · · · ym is defined as the minimum number of atomic edit operations (insert, delete, and replace) needed to transform string x into string y. The atomic operations are defined formally as follows:

• insert the character c into the string x at position i:
  ins(x, i, c) = x1x2 · · · xi c xi+1 · · · xn;

• delete the character at position i from the string x:
  del(x, i) = x1x2 · · · xi−1 xi+1 · · · xn;

• replace the character at position i in x with the new character c:
  replace(x, i, c) = x1x2 · · · xi−1 c xi+1 · · · xn.

The generalized edit distance function assigns weights (positive real numbers) to the individual atomic operations. Hence, the distance between strings x and y is the minimum value of the sum of weighted atomic operations needed to transform x into y. If the weights of the insert and delete operations differ, the edit distance is not symmetric (violating property (p2) defined in Section 2.1) and therefore not a metric function. To see why, consider the following example, where the weights of atomic operations are set as wins = 2, wdel = 1, wreplace = 1:

dedit(“combine”, “combination”) = 9 – replacement e → a, insertion of t, i, o, n

dedit(“combination”, “combine”) = 5 – replacement a → e, deletion of t, i, o, n

Within this thesis, we only assume metric functions; thus the weights of the insert and delete operations must be the same. However, the weight of the replace operation can differ. Usually, the edit distance is defined with all weights equal to one. An excellent survey on string matching can be found in [55].

Using weighting functions, we can define a most generic edit distance which assigns different costs even to operations on individual characters. For example, the replacement a → b can be assigned a different weight than a → c. To retain the metric postulates, some additional limits must be placed on the weight functions, e.g. symmetry of substitutions – the cost of a → b must be the same as the cost of b → a.
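
A dynamic-programming sketch of the (weighted) edit distance follows; the function name and the parameters w_ins, w_del, w_rep are ours, chosen so that the asymmetric example above can be reproduced.

```python
def edit_distance(x, y, w_ins=1, w_del=1, w_rep=1):
    """Weighted edit (Levenshtein) distance via dynamic programming.

    With w_ins == w_del the function is symmetric, the setting assumed
    for metric functions in this thesis; the defaults give the usual
    unit-cost edit distance.
    """
    n, m = len(x), len(y)
    # dist[i][j] = minimum cost of transforming x[:i] into y[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * w_del
    for j in range(1, m + 1):
        dist[0][j] = j * w_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            rep = 0 if x[i - 1] == y[j - 1] else w_rep
            dist[i][j] = min(dist[i - 1][j] + w_del,    # delete from x
                             dist[i][j - 1] + w_ins,    # insert into x
                             dist[i - 1][j - 1] + rep)  # replace (or keep)
    return dist[n][m]

# The asymmetric example from the text (wins = 2, wdel = 1, wreplace = 1):
print(edit_distance("combine", "combination", w_ins=2, w_del=1, w_rep=1))  # 9
print(edit_distance("combination", "combine", w_ins=2, w_del=1, w_rep=1))  # 5
```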

2.2.4 Tree Edit Distance

The tree edit distance is a well-known proximity measure for trees, extensively studied in [30, 16]. The tree edit distance function defines a distance between two tree structures as the minimum cost needed to convert the source tree to the target tree using a predefined set of tree edit operations, such as the insertion or deletion of a node. In fact, the problem of computing the distance between two trees is a generalization of the edit distance to labeled trees. The individual cost of edit operations (atomic operations) may be constant for the whole tree, or may vary with the level in the tree at which the operation is carried out. The reason for having different weights for tree levels is that the insertion of a single node near the root may be more significant than adding a new leaf node. This will, of course, depend on the application domain. Several strategies for setting costs and computing the tree edit distance are described in the doctoral thesis by Lee [48]. Since XML documents are typically modeled as rooted labeled trees, the tree edit distance can also be used to measure the structural dissimilarity of XML documents.

2.2.5 Jaccard's Coefficient

Let us now focus on a different type of data and present a similarity measure that is applicable to sets. Assuming two sets A and B, Jaccard's coefficient is defined as follows:

d(A, B) = 1 − |A ∩ B| / |A ∪ B|.

This distance function is simply based on the ratio between the cardinalities of the intersection and union of the compared sets. As an example of an application that deals with sets, suppose we have access to a log file of web addresses (URLs) accessed by visitors to an Internet Cafe. Along with the addresses, visitor identifications are also stored in the log. The behavior of a user browsing the Internet can be expressed as the set of visited network sites, and Jaccard's coefficient can be applied to assess the similarity (or dissimilarity) of individual users' search interests.

An application of this metric to vector data is called the Tanimoto similarity measure dTS (see for example [43]), defined as:

dTS(x, y) = (x · y) / (‖x‖^2 + ‖y‖^2 − x · y),

where x · y is the scalar product of x and y, and ‖x‖ is the Euclidean norm of x.
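
A minimal sketch of both measures follows; the function names are ours, tanimoto computes exactly the expression dTS given above, and jaccard_distance treats two empty sets as identical, a boundary case the definition leaves open.

```python
def jaccard_distance(A, B):
    """Jaccard's coefficient as a distance: 1 - |A ∩ B| / |A ∪ B|."""
    union = A | B
    if not union:          # two empty sets: treat them as identical
        return 0.0
    return 1.0 - len(A & B) / len(union)

def tanimoto(x, y):
    """The expression dTS from the text: (x · y) / (‖x‖^2 + ‖y‖^2 − x · y)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx2 = sum(xi * xi for xi in x)
    ny2 = sum(yi * yi for yi in y)
    return dot / (nx2 + ny2 - dot)

# Two visitors described by the sets of web sites they accessed.
a = {"example.org", "wiki.example.org", "news.example.com"}
b = {"example.org", "shop.example.net"}
print(jaccard_distance(a, b))                      # 1 - 1/4 = 0.75
print(tanimoto((1.0, 0.0, 1.0), (1.0, 1.0, 0.0)))  # 1/3
```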

2.2.6 Hausdorff Distance

An even more complicated distance measure defined on sets is the Hausdorff distance [37]. In contrast to Jaccard's coefficient, where any two elements of the sets must be either equal or completely distinct, the Hausdorff distance matches elements based upon a distance function de. Specifically, the Hausdorff distance is defined as follows. Assume:

dp(x, B) = inf{de(x, y) : y ∈ B},
dp(A, y) = inf{de(x, y) : x ∈ A},
ds(A, B) = sup{dp(x, B) : x ∈ A},
ds(B, A) = sup{dp(A, y) : y ∈ B}.

Then the Hausdorff distance over sets A, B is:

d(A, B) = max{ds(A, B), ds(B, A)}.

The distance de(x, y) between two elements of sets A and B can be arbitrary, e.g. the Euclidean distance, and is application-specific. Succinctly put, the Hausdorff distance measures the extent to which each point of the “model” set A lies near some point of the “image” set B and vice versa. A typical application is the comparison of shapes in image processing, where each shape is defined by a set of points in a 2-dimensional space.
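
For finite sets the infima and suprema reduce to minima and maxima, so the definition can be evaluated by brute force. A minimal sketch follows; the function names are ours, and the Euclidean distance is used as the element-level distance de.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def hausdorff_distance(A, B, de=euclidean):
    """Brute-force Hausdorff distance between two non-empty finite sets.

    For finite sets, dp becomes a minimum and ds a maximum, so
    d(A, B) = max(ds(A, B), ds(B, A)) needs O(|A|*|B|) evaluations of de.
    """
    def ds(S, T):
        return max(min(de(x, y) for y in T) for x in S)
    return max(ds(A, B), ds(B, A))

# Two small "shapes" given as sets of 2-D points.
shape_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shape_b = [(0.0, 0.0), (2.0, 0.0)]
print(hausdorff_distance(shape_a, shape_b))  # 1.0: the point (0, 1) is 1.0 from B
```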

2.2.7 Time Complexity

In general, computing a distance is a nontrivial process which will certainly be much more computationally intensive than a keyword comparison as used in traditional search structures. For example, the Lp norms (metrics) are computed in linear time dependent on the dimensionality n of the space. However, the quadratic form distance is much more expensive because it involves multiplications by a matrix M. Thus, the time complexity in principle is O(n^2 + n). Existing dynamic programming algorithms which evaluate the edit distance on two strings of length n and m have time complexity O(nm). Tree edit distance is even more demanding and has a worst-case time complexity of O(n^4), where n refers to the number of tree nodes. For more details see for example [48]. Similarity metrics between sets are also very time-intensive to evaluate. The Hausdorff distance has a time complexity of O(nm) for sets of size n and m. A more sophisticated algorithm by [3] can reduce its complexity to O((n + m) log(n + m)).

In summary, the high computational complexity of metric distance functions gives rise to an important objective for metric index structures, namely minimizing the number of distance evaluations.

2.3 Similarity Queries

A similarity query is defined explicitly or implicitly by a query object q and a constraint on the form and extent of proximity required, typically expressed as a distance. The response to a query returns all objects which satisfy the selection conditions, presumed to be those objects close to the given query object. In the following, we first define elementary types of similarity queries, and then discuss possibilities for combining them.

2.3.1 Range Query

Probably the most common type of similarity query is the similarity range query R(q, r). The query is specified by a query object q ∈ D, with some query radius r as the distance constraint. The query retrieves all objects found within distance r of q, formally:

R(q, r) = {o ∈ X, d(o, q) ≤ r}.

If needed, individual objects in the response set can be ranked according to their distance with respect to q. Observe that the query object q need not exist in the collection X ⊆ D to be searched; the only restriction on q is that it belongs to the metric domain D.

Figure 2.2: (a) Range query R(q, r) and (b) nearest neighbor query 3NN(q).

For convenience, Figure 2.2a shows an example of a range query. In a geographic application, a range query can formulate the requirement: Give me all museums within a distance of two kilometers from my hotel.

When the search radius is zero, the range query R(q, 0) is called a point query or exact match. In this case, we are looking for an identical copy (or copies) of the query object q. The most common use of this type of query is in delete algorithms, when we want to locate an object to remove from the database.

2.3.2 Nearest-Neighbor Query

Whenever we want to search for similar objects using a range search, we must specify a maximal distance for objects to qualify. But it can be difficult to specify the radius without some knowledge of the data and the distance function. For example, the range r = 3 of the edit distance metric represents less than four edit operations between the compared strings. This has a clear semantic meaning. However, a distance of two color-histogram vectors of images is a real number whose quantification cannot be so easily interpreted. If too small a query radius is specified, the empty set may be returned and a new search with a larger radius will be needed to get any result. On the other hand, if query radii are too large, the query may be computationally expensive and the response sets contain many nonsignificant objects.

An alternative way to search for similar objects is to use nearest neighbor queries. The elementary version of this query finds the closest object to the given query object, that is, the nearest neighbor of q. The concept can be generalized to the case where we look for the k nearest neighbors. Specifically, a kNN(q) query retrieves the k nearest neighbors of the object q. If the collection to be searched consists of fewer than k objects, the query returns the whole database. Formally, the response set can be defined as follows:

kNN(q) = {R ⊆ X, |R| = k ∧ ∀x ∈ R, y ∈ X − R : d(q, x) ≤ d(q, y)}.

When several objects lie at the same distance from the k-th nearest neighbor, the ties are resolved arbitrarily. Figure 2.2b illustrates the situation for a 3NN(q) query. Here the objects o1 and o3 are both at distance 3.3, and the object o1 is chosen as the third nearest neighbor (at random) instead of o3. If we continue with our geographic application, we can pose a query: Tell me which three museums are the closest to my hotel.
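
A naive evaluation of kNN(q) simply scans the whole collection and keeps the k objects closest to q. The sketch below is ours (not the thesis's algorithm); it breaks ties by scan order, one of the arbitrary choices permitted by the definition.

```python
import heapq

def knn(q, dataset, k, d):
    """Naive kNN(q): scan the whole dataset and keep the k objects closest to q.

    Ties at the k-th distance are broken arbitrarily (here: by scan order);
    fewer than k objects come back only if the dataset itself is smaller.
    """
    return heapq.nsmallest(k, dataset, key=lambda o: d(q, o))

# 3NN(q) over real numbers with d(x, y) = |x - y|.
data = [1.0, 4.0, 4.5, 9.0, 10.0]
print(knn(5.0, data, 3, lambda a, b: abs(a - b)))  # [4.5, 4.0, 1.0] (1.0 vs 9.0 tie)
```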

2.3.3 Reverse Nearest-Neighbor Query

In many situations, it is interesting to know how a specific object is perceived or ranked in terms of distance by other objects in the dataset, i.e., which objects view the query object q as their nearest neighbor. This is known as a reverse nearest neighbor search. The generic version, conveniently designated kRNN(q), returns all objects with q among their k nearest neighbors. An example is illustrated in Figure 2.3a, where the dotted circles denote the distance to the second nearest neighbor of objects oi. Objects satisfying the 2RNN(q) query, that is, those objects with q among their two nearest neighbors, are represented by black points.

Recent work, such as [45, 67, 72, 66, 44], has highlighted the importance of reverse nearest-neighbor queries in decision support systems, profile-based marketing, document repositories, and management of mobile devices. The response set of the general kRNN(q) query may be defined as follows:

kRNN(q) = {R ⊆ X, ∀x ∈ R : q ∈ kNN(x) ∧ ∀x ∈ X − R : q ∉ kNN(x)}.

Figure 2.3: (a) A reverse nearest neighbor query 2RNN(q) and (b) a similarity self join query SJ(2.5). Qualifying objects are filled.

Observe that even an object located far from the query object q can belong to the kRNN(q) response set. At the same time, an object near q need not necessarily be a member of the kRNN(q) result. This characteristic of the reverse nearest neighbor search is called the non-locality property. A specific query can ask for: all hotels with a specific museum as the nearest cultural heritage site.

2.3.4 Similarity Join

The development of Internet services often requires the integration of heterogeneous sources of data. Such sources are typically unstructured, whereas the intended services often require structured data. An important challenge here is to provide consistent and error-free data, which entails some kind of data cleaning or integration, typically implemented by a process called a similarity join. The similarity join between two datasets X ⊆ D and Y ⊆ D retrieves all pairs of objects (x ∈ X, y ∈ Y) whose distance does not exceed a given threshold µ ≥ 0. Specifically, the result of the similarity join J(X, Y, µ) is defined as:

J(X, Y, µ) = {(x, y) ∈ X × Y : d(x, y) ≤ µ}.

If µ = 0, we get the traditional natural join. If the datasets X and Y coincide, i.e. X = Y, we talk about the similarity self join and denote it as SJ(µ) = J(X, X, µ), where X is the searched dataset. Figure 2.3b presents an example of a similarity self join SJ(2.5). For illustration, consider a bibliographic database obtained from diverse resources. In order to clean the data, a similarity join request might identify all document titles with an edit distance smaller than two. Another application might maintain a collection of hotels and a collection of museums. The user might wish to find all pairs of hotels and museums which are a five-minute walk apart.
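
A brute-force sketch of the similarity join follows; the function names are ours, and the self-join variant reports each qualifying pair once and omits the trivial (x, x) pairs, which the formal definition would include.

```python
def similarity_join(X, Y, mu, d):
    """Brute-force J(X, Y, mu): all cross pairs whose distance does not exceed mu."""
    return [(x, y) for x in X for y in Y if d(x, y) <= mu]

def similarity_self_join(X, mu, d):
    """SJ(mu) = J(X, X, mu), reported here without the trivial (x, x) pairs."""
    return [(x, y) for i, x in enumerate(X) for y in X[i + 1:] if d(x, y) <= mu]

# Joining numbers on absolute difference with threshold mu = 2.5.
X = [1.0, 3.0, 10.0]
Y = [2.0, 8.5]
d = lambda a, b: abs(a - b)
print(similarity_join(X, Y, 2.5, d))    # [(1.0, 2.0), (3.0, 2.0), (10.0, 8.5)]
print(similarity_self_join(X, 2.5, d))  # [(1.0, 3.0)]
```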

2.3.5 Combinations of Queries

As an extension of the query types defined above, we can define additional types of queries as combinations of the previous ones. For example, we might combine a range query with a nearest-neighbor query to get kNN(q, r) with the response set:

kNN(q, r) = {R ⊆ X, |R| ≤ k ∧ ∀x ∈ R, y ∈ X − R : d(q, x) ≤ d(q, y) ∧ d(q, x) ≤ r}.

In fact, we have constrained the result from two sides. First, all objects in the result set should lie at a distance not greater than r, and if there are more than k of them, just the first (i.e., the nearest) k are returned. By analogy, we can combine a similarity self join and a nearest neighbor search. In such queries, we limit the number of pairs returned for a specific object to the value k.

2.3.6 Complex Similarity Queries

Efficient processing of queries consisting of more than one similarity predicate, i.e., complex similarity queries, differs substantially from traditional (Boolean) query processing. The problem was studied first by [22, 23]. The basic lesson learned is that the similarity score (or grade) a retrieved object receives as a whole depends not only on the scores it gets for individual predicates, but also on how such scores are combined. In order to understand the problem, consider a query for circular shapes of red color. In order to find the best match, it is not enough to retrieve the best matches for the color features and the shapes. Naturally, the best match for the whole query need not be the best match for a single (color or shape) predicate.

To this aim, [22] has proposed the so-called A0 algorithm which solves the problem. This algorithm assumes that for each query predicate we have an index structure able to return objects in decreasing order of similarity. For every predicate i, the algorithm successively creates a set Xi containing objects which best match the query predicate. This building phase continues until all sets Xi contain at least k common objects, i.e. |∩i Xi| ≥ k. This implies that the cardinalities of the sets Xi are not known in advance, so a rather complicated incremental similarity search is needed (please refer to [34] for details). For all objects o ∈ ∪i Xi, the algorithm evaluates all query predicates and establishes their final ranks. Then the first k objects are returned as a result. This algorithm is correct, but its performance is far from optimal and the expected query execution costs can be quite high.

[14] have concentrated on complex similarity queries expressed through a generic language. On the other hand, they assume that query predicates are from a single feature domain, i.e. from the same metric space. Contrary to the language level that deals with similarity scores, the proposed evaluation process is based on distances between feature values, because metric indexes can use just distances to evaluate predicates. The proposed solution suggests that the index should process complex queries as a whole, evaluating multiple similarity predicates at a time. The flexibility of this approach is demonstrated by considering three different similarity languages: fuzzy standard, fuzzy algebraic and weighted sum. The possibility to implement such an approach is demonstrated through an extension of the M-tree [13]. Experimental results show that the performance of the extended M-tree is consistently better than the A0 algorithm. The main drawback of this approach is that even though it is able to employ more features during the search, these features are compared using a single distance function. An extension of the M-tree [12] which goes further is able to compare different features with arbitrary distance functions. This index structure outperforms the A0 algorithm as well.

A similarity algebra with weights has been introduced in [10]. This is a generalization of relational algebra to allow the formulation of complex similarity queries over multimedia databases. The main contribution of this work is that it combines within a single framework several relevant aspects of the similarity search, such as new operators (Top and Cut), weights to express user preferences, and scores to rank search results.

Chapter 3

The Scalability Challenge

In the previous chapter, we explained the metric space background, which gives us the notion of what similarity search is. However, we have not yet said how to actually evaluate similarity queries.

A naive approach to solving a basic similarity query such as a range search R(q, r) or a nearest neighbor search NN(q) is to compare the query object q with all the stored objects, evaluating the distance between the query object and every object o in the database. Specifically, for a particular range search R(q, r) we report every stored object o with d(q, o) ≤ r. The nearest neighbor is also easily obtained if we maintain the so-far q-closest object on as we scan through all the objects in the database, i.e. we assign on = o whenever d(o, q) < d(on, q). After scanning the whole dataset, on is the nearest neighbor of the query object q.
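
The naive evaluation just described can be written down directly. The following sketch is ours; it uses real numbers with d(x, y) = |x − y| as a stand-in for an arbitrary metric.

```python
def range_search(q, dataset, r, d):
    """Naive R(q, r): compare q with every stored object, keep those within r."""
    return [o for o in dataset if d(q, o) <= r]

def nearest_neighbor(q, dataset, d):
    """Naive NN(q): remember the q-closest object seen so far during one full scan."""
    best, best_dist = None, float("inf")
    for o in dataset:
        dist = d(q, o)
        if dist < best_dist:
            best, best_dist = o, dist
    return best

# Real numbers under d(x, y) = |x - y|; both queries cost one distance per object.
data = [1.0, 4.0, 4.5, 9.0, 10.0]
d = lambda a, b: abs(a - b)
print(range_search(5.0, data, 1.0, d))  # [4.0, 4.5]
print(nearest_neighbor(5.0, data, d))   # 4.5
```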

It is obvious that this approach is far from efficient, since we need to compute an expensive distance for each stored object, and the number of those evaluations is linearly proportional to the size of the database. Thus, the goal is to design and maintain some additional information allowing an enhanced performance of the search. The scalability of the similarity search is also a very important problem.

Definition 3.0.1 Scalability is the ability to support larger volumes of data and more users with minimal impact on the effectiveness of the search evaluation. We say that the scalability of a problem is linear if the costs grow linearly with the size of the problem, e.g., a problem twice as big has a response twice as big as the original small problem.

From this point of view, we can employ two general techniques to lower the expenses of the index search computations. The first one is the partitioning of the metric space, i.e. we can limit the number of objects that must be accessed while evaluating a particular index operation. In fact, this method improves the I/O scalability, and a general overview is given in Section 3.1. The second group of techniques allows us to avoid the usually expensive distance computations by using some previously computed distances, and thus affects the CPU scalability. Their description can be found in Section 3.2. Both groups are practically applicable to any generic metric space index, since they are based only on the triangle inequality of the metric function.

With these basic principles, a more effective similarity query evaluation is possible. In the following, we provide a brief explanation of two advanced dynamic metric index structures – the M-Tree (see Section 3.3.1) and the D-Index (see Section 3.3.2). We also provide the scalability experiment results for these two structures, adopted from [74], compared to the aforementioned sequential scan. From these we can deduce that the centralized indexes indeed improve the similarity search a lot with respect to the naive algorithm. However, even these advanced techniques scale linearly with the size of the problem. Thus, the response of the search becomes unacceptable at a certain point. In this respect, we lay down our scalability challenge in Section 3.4 – is it possible to achieve logarithmic or even constant scalability for a metric space indexing technique?

3.1 Basic Partitioning Principles

Partitioning, in general, is one of the most fundamental principles of any storage structure, aiming at dividing the search space into sub-groups, so that once a query is given, only some of these groups are searched. Given a set S ⊆ D of objects in metric space M = (D, d), [70] defines ball partitioning and generalized hyperplane partitioning, while [73] suggests excluded middle partitioning. In the following, we briefly characterize these techniques.

Figure 3.1: Examples of partitioning: (a) the ball partitioning, (b) the generalized hyperplane partitioning, and (c) the excluded middle partitioning.

3.1.1 Ball Partitioning

Ball partitioning breaks the set S into subsets S1 and S2 using a spherical cut with respect to p ∈ D, where p is the pivot, chosen arbitrarily. Let dm be the median of {d(oi, p), ∀oi ∈ S}. Then all oj ∈ S are distributed to S1 or S2 according to the following rules:

• S1 ← {oj | d(oj, p) ≤ dm}

• S2 ← {oj | d(oj, p) ≥ dm}

The redundant conditions ≤ and ≥ assure balance when the median value is not unique. This is accomplished by assigning each element at the median distance to one of the subsets in an arbitrary, but balanced, fashion. An example of a data space containing twenty-three objects is depicted in Figure 3.1a. The selected pivot p and the median distance dm establish the ball partitioning.
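As an illustration, a possible implementation of this rule might look as follows (illustrative Python sketch; ties at the median distance are assigned to S1 here, whereas a balanced implementation would distribute them evenly):

```python
import statistics

def ball_partition(S, p, d):
    """Ball partitioning: split S by the median distance d_m from the pivot p.
    Objects exactly at distance d_m all go to S1 in this simplified sketch."""
    dm = statistics.median(d(o, p) for o in S)
    S1 = [o for o in S if d(o, p) <= dm]
    S2 = [o for o in S if d(o, p) > dm]
    return S1, S2, dm
```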

3.1.2 Generalized Hyperplane Partitioning

Generalized hyperplane partitioning can be considered as an orthogonal principle to ball partitioning. This partitioning also breaks the set S into subsets S1 and S2. This time, though, two reference objects (pivots) p1, p2 ∈ D are arbitrarily chosen. All other objects oj ∈ S are assigned to S1 or S2 depending upon their distances from the selected pivots as follows:


• S1 ← {oj | d(p1, oj) ≤ d(p2, oj)}

• S2 ← {oj | d(p1, oj) ≥ d(p2, oj)}

In contrast to ball partitioning, the generalized hyperplane does not guarantee a balanced split, and a suitable choice of reference points to achieve this objective is an interesting challenge. An example of a balanced split of a hypothetical dataset is given in Figure 3.1b.
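A corresponding sketch (same illustrative Python conventions as above) assigns every object to the subset of its closer pivot:

```python
def hyperplane_partition(S, p1, p2, d):
    """Generalized hyperplane partitioning: each object goes to the subset
    of its nearer pivot; ties are broken in favour of S1."""
    S1 = [o for o in S if d(p1, o) <= d(p2, o)]
    S2 = [o for o in S if d(p1, o) > d(p2, o)]
    return S1, S2
```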

3.1.3 Excluded Middle Partitioning

Excluded middle partitioning [73] divides S into three subsets S1, S2 and S3. In principle, it is an extension of ball partitioning which has been motivated by the following fact: Though similarity queries search for objects lying within a small vicinity of the query object, whenever a query object appears near the partitioning threshold, the search process typically requires accessing both of the ball partitioned subsets. The central idea of excluded middle partitioning is therefore to leave out points near the threshold dm in defining the two subsets S1 and S2. The excluded points form a third subset S3. An illustration of excluded middle partitioning can be seen in Figure 3.1c, where the dark objects fall into the exclusion zone. With such an arrangement, the search for similar objects always ignores at least one of the subsets S1 or S2, provided that the search selectivity is smaller than the thickness of the exclusion zone. Naturally, the excluded points cannot be lost, so they can either be considered to form a third subset or, if the set is large, the basis of a new partitioning process. Given the thickness of the exclusion zone 2ρ, the partitioning can be defined as follows:

• S1 ← {oj | d(oj, p) ≤ dm − ρ}

• S2 ← {oj | d(oj, p) > dm + ρ}

• S3 ← otherwise.

Figure 3.1c also depicts a situation where the split is balanced, i.e. the cardinalities of S1 and S2 are the same. However, this is not always guaranteed.
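The three rules above translate directly into code; the following sketch (illustrative Python, with dm and ρ given as parameters) returns the two separable sets and the exclusion set:

```python
def excluded_middle_partition(S, p, dm, rho, d):
    """Excluded middle partitioning: an exclusion zone of thickness 2*rho
    around the distance dm from pivot p keeps S1 and S2 more than 2*rho apart."""
    S1 = [o for o in S if d(o, p) <= dm - rho]
    S2 = [o for o in S if d(o, p) > dm + rho]
    S3 = [o for o in S if dm - rho < d(o, p) <= dm + rho]
    return S1, S2, S3
```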


3.2 Avoiding Distance Computations

Since the performance of similarity search in metric spaces is not only I/O-bound, but also CPU-bound, it is very important to limit the number of distance computations as much as possible. To this aim, pruning conditions must be applied not only to avoid accessing irrelevant sets of objects, but also to minimize the number of distances computed. The rationale behind such strategies is to use already-evaluated distances between some objects, while properly applying the metric space postulates – namely the triangle inequality, symmetry, and non-negativity – to determine bounds on distances between other objects.

In this section, we describe several bounding strategies, originally proposed in [34] and refined in [35]. These techniques represent general pruning rules that are employed, in a specific form, in practically all index structures for metric spaces. The following rules thus form the basic formal background. The individual techniques described differ as to the type of distance we have available, as well as what kind of distance computation we seek to avoid.

3.2.1 Object-Pivot Distance Constraint

The basic type of bounding constraint is the object-pivot distance constraint, so called because it is usually applied to leaf nodes containing the data, i.e. the metric objects of the searched collection. Figure 3.2 demonstrates a situation in which such a bounding constraint can be beneficial with respect to the trivial sequential scan computing distances to all objects. Assume a range query R(q, r) is issued (see Figure 3.2a) and the search algorithm has reached the left-most leaf node as illustrated in Figure 3.2b. At this stage, the sequential scan would examine all objects in the leaf, i.e. compute the distances d(q, o4), d(q, o6), d(q, o10), and decide qualifying objects. However, provided the distances d(p2, o4), d(p2, o6), d(p2, o10) are in memory (having been computed during insertion) and the distance from q to p2 is d(q, p2), some distance evaluations can be omitted.

Figure 3.3a shows a detail view of the situation. The dashed lines represent distances we do not know and the solid lines, known distances.


Figure 3.2: Range search for query R(q, r): (a) from the geometric point of view, (b) algorithm accessing the left-most leaf node.

Figure 3.3: Illustration of the object-pivot constraint: (a) our model situation, (b) the lower bound, and (c) the upper bound.


Suppose we need to estimate the distance between the query object q and the database object o10. Given only an object and the distance from it to another object, the object’s precise position in space cannot be determined. Knowledge of the distance alone is not enough. With respect to p2, for example, the object o10 could lie anywhere along the dotted circle representing all equidistant positions. This also implies the existence of two extreme possible positions for o10 with respect to the query object q, a closest and furthest possible position. The former is depicted in Figure 3.3b while the latter is shown in Figure 3.3c. Systematically, the lower bound is computed as the absolute value of the difference between d(q, p2) and d(p2, o10), while the sum of d(q, p2) and d(p2, o10) forms the upper bound on the distance d(q, o10).

In our example, the lower bound on distance d(q, o10) is greater than the query radius r, thus we are sure the object o10 cannot qualify the query and can skip it in the search process without actually computing the distance. If, on the contrary, we focus on the object o6, it can be seen from Figure 3.3c that the upper bound on d(q, o6) is less than r. As a result, o6 can be directly included in the query response set because the distance d(q, o6) cannot exceed r. In both cases described, one distance computation is omitted, speeding up the search process. Concerning the object o4, we discover that the lower bound is less than the radius r and the upper bound is greater than r. That means o4 must be compared directly against q using the distance function, i.e. d(q, o4) must be computed to decide whether o4 is relevant to the query or not. We formally summarize the ideas described in Lemma 3.2.1.

Lemma 3.2.1 Given a metric space M = (D, d) and three arbitrary objects q, p, o ∈ D, it is always guaranteed:

|d(q, p) − d(p, o)| ≤ d(q, o) ≤ d(q, p) + d(p, o).

Consequently, the distance d(q, o) can be bounded from below and above, provided the distances d(q, p) and d(p, o) are known.
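In a search algorithm, the lemma is typically applied as in the following sketch (illustrative Python, not the thesis code): every leaf entry stores the distance to the leaf pivot computed at insertion time, and the expensive distance d(q, o) is evaluated only when neither bound decides the entry.

```python
def filter_leaf(q, r, entries, d_q_p, d):
    """Range query R(q, r) over a leaf whose entries are pairs (o, d(p, o));
    d_q_p is the already known distance d(q, p) to the leaf pivot."""
    result = []
    for o, d_p_o in entries:
        lower = abs(d_q_p - d_p_o)          # lower bound of Lemma 3.2.1
        upper = d_q_p + d_p_o               # upper bound of Lemma 3.2.1
        if lower > r:
            continue                        # o cannot qualify, no distance computed
        if upper <= r:
            result.append(o)                # o certainly qualifies, no distance computed
        elif d(q, o) <= r:                  # only undecided entries pay for d(q, o)
            result.append(o)
    return result
```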

3.2.2 Range-Pivot Distance Constraint

The object-pivot distance constraint described above assumes that all distances between the database objects oi and the respective pivot p are known. However, some metric structures try to minimize the space needed to build the index, so storing such an amount of data is not acceptable. An alternative is to store only a range (a distance interval) in which the database objects occur with respect to p. Here, we can apply a weaker condition called the range-pivot distance constraint.

Figure 3.4: Illustration of the range-pivot constraint: (a) our model situation, (b) the lower bound, and (c) the upper bound.

Consider Figure 3.2 with the range query R(q, r) again and assume the search procedure is just about to enter the left-most leaf node of our sample tree. At this stage, a sophisticated search algorithm should decide if it is necessary to visit the leaf or not, i.e. whether any qualifying object can be found at this node. If we know the interval [rl, rh] in which distances from the pivot p2 to all objects o4, o6, o10 occur, it can be applied to solve the problem. A detail of such a situation is depicted in Figure 3.4a, where the dotted circles represent limits of the range and the known distance between the pivot and the query is emphasized by a solid line. The shortest distance from q to any object lying within the range is rl − d(q, p2) (see Figure 3.4b). Obviously, no object can be closer to q, because it would be nearer p2 than the threshold rl otherwise. By analogy, we can define the upper bound as rh + d(q, p2), see Figure 3.4c. In this way, we have two expressions which limit the distance between an object and the query q.

To reveal the usefulness of this, consider range queries again. If the lower bound is greater than the query radius r, we are sure that no qualifying object can be found and the node need not be accessed. On the other hand, if the upper bound is less than or equal to r, we can conclude that all objects qualify and directly include all descendant objects in the query response set – no further distance computations are needed at all. Note that in the model situation depicted in Figure 3.4, we can neither directly include nor prune the node, so the node must be accessed and its individual objects examined instance by instance.

Figure 3.5: Illustration of Lemma 3.2.2 with three different positions of the query object: (a) above, (b) below and (c) within the range [rl, rh].

Up to now, we have only examined one possible position for the query and range, and stated two rules concerning the search radius r. Before we give a formal definition of the range-pivot constraint, we illustrate three different query positions in Figure 3.5, namely: above the range [rl, rh] in (a), below the range in (b), and within the interval in (c). We can bound d(q, o), provided rl ≤ d(p, o) ≤ rh and the distance d(q, p) is known. The dotted and dashed line segments denote the lower and upper bounds, respectively. At a general level, the problem can be formalized as follows:

Lemma 3.2.2 Given a metric space M = (D, d) and objects o, p ∈ D such that rl ≤ d(o, p) ≤ rh, and given some q ∈ D and an associated distance d(q, p), the distance d(q, o) can be restricted by the range:

max{d(q, p)− rh, rl − d(q, p), 0} ≤ d(q, o) ≤ d(q, p) + rh.
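In an index, the lemma typically drives a three-way decision about a whole node, as in this sketch (illustrative Python): given d(q, p) and the stored range [rl, rh], a node is either pruned, accepted entirely, or accessed and examined.

```python
def range_pivot_decision(d_q_p, r, rl, rh):
    """Lemma 3.2.2: decide how to treat a node whose objects lie within
    [rl, rh] of its pivot p, given d(q, p) = d_q_p and query radius r."""
    lower = max(d_q_p - rh, rl - d_q_p, 0.0)
    upper = d_q_p + rh
    if lower > r:
        return "prune"       # no object of the node can qualify
    if upper <= r:
        return "take-all"    # every object qualifies without further tests
    return "visit"           # the node must be accessed and examined
```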


3.2.3 Pivot-Pivot Distance Constraint

We have just described two principles which lead to a performance boost in search algorithms. Now, we turn our attention to a third approach which, while weaker than the foregoing two, still provides some benefit. Consider a situation in which the range search algorithm has approached the internal node with pivot p1 (the root node of the structure) and the distance d(q, p1) has been evaluated. Here, the algorithm can apply Lemma 3.2.2 to decide which subtrees to visit. The careful reader may object that the range of distances with respect to the pivot p1 must be known separately for both left and right branches. But this is simple to achieve, because every object inserted into the structure must be compared with p1. Thus, we can assume that the correct intervals are known. The specifics of applying Lemma 3.2.2 are left to the reader as an easy exercise.

Without loss of generality, we assume the algorithm has followed the left branch, reaching the node with pivot p2. Now, the algorithm could compute the distance d(q, p2) and apply Lemma 3.2.2 again. But since we know the distance d(q, p1), then if we also know the distance between pivots p1 and p2, we can employ Lemma 3.2.1 to get an estimate of d(q, p2) without computing it, since d(q, p2) ∈ [r′l, r′h]. In fact, we have now an interval on d(q, p2) and an interval on d(p2, oi), where objects oi are descendants of p2. Specifically, we have d(q, p2) ∈ [r′l, r′h] and d(p2, oi) ∈ [rl, rh]. Figure 3.6 illustrates both intervals. Figure 3.6a depicts the range on d(q, p2) with the known distance d(q, p1) emphasized. In Figure 3.6b, the second interval on distances d(p2, oi) is given in addition to the first interval, indicated by two dotted circles around the pivot p2. The purpose of these ranges is to give bounds on distances between q and database objects oi, leading to a faster qualification process that does not require evaluating distances between q and oi, nor even computing d(q, p2). The figure shows both ranges intersect, which implies that the lower bound on d(q, oi) is zero. On the other hand, the sum rh + r′h obviously forms the upper bound on the distances d(q, oi).

Figure 3.6: (a) The lower r′l and upper r′h bounds on distance d(q, p2), (b) the range [rl, rh] on distances from p2 and database objects – the range from (a) is also included.

The example in Figure 3.6 only depicts the case when the ranges intersect. In Figure 3.7, we show what happens when the intervals do not coincide. In this case, the lower limit is equal to r′l − rh, which can be seen easily from Figure 3.7a. Figure 3.7b shows another view of the upper bound. The third possible position for the interval is opposite that depicted in Figure 3.7. This time, the intervals have been reversed, giving a lower limit of rl − r′h. The general formalization of this principle is as follows:

Lemma 3.2.3 Given a metric space M = (D, d) and objects o, p, q ∈ D such that rl ≤ d(p, o) ≤ rh and r′l ≤ d(q, p) ≤ r′h, the distance d(q, o) can be bounded by the range:

max{r′l − rh, rl − r′h, 0} ≤ d(q, o) ≤ rh + r′h.
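A direct transcription of the lemma (illustrative Python; r2l and r2h stand for r′l and r′h) yields bounds on d(q, o) without computing either d(q, p) or d(q, o):

```python
def pivot_pivot_bounds(rl, rh, r2l, r2h):
    """Lemma 3.2.3: bounds on d(q, o), given d(p, o) in [rl, rh]
    and d(q, p) in [r2l, r2h] (i.e. [r'_l, r'_h])."""
    lower = max(r2l - rh, rl - r2h, 0.0)
    upper = rh + r2h
    return lower, upper
```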

3.2.4 Double-Pivot Distance Constraint

The three previous approaches to speeding up the retrieval process in metric structures all use a single pivot, in keeping with the ball partitioning paradigm. Next we explore an alternate strategy based upon generalized hyperplane partitioning. As defined in Section 3.1, this technique employs two pivots to partition the metric space.

Figure 3.7: Illustration of Lemma 3.2.3: (a) the ranges [rl, rh] and [r′l, r′h] do not intersect, so the lower bound is r′l − rh; (b) the upper limit rh + r′h.

Figure 3.8: Illustration of Lemma 3.2.4: (a) the lower bound on d(q, o), (b) the equidistant positions of q with respect to the lower bound, and (c) shrinking the lower bound.

Figure 3.8a shows an example of generalized hyperplane partitioning in which pivots p1, p2 are used to divide the space into two subspaces – objects nearer p1 belonging to the left subspace and objects nearer to p2 to the right. The vertical dashed line represents points equidistant from both pivots. With this partitioning we cannot establish an upper bound on the distance from query object q to database objects oi, because the database objects may be arbitrarily far away from the pivots. Thus only lower limits can be defined.

First, let us examine the case in which objects o and q are in the same subspace, not considered in Figure 3.8. Obviously the lower bound will equal zero, since it is possible some objects may be identical. Next, we consider the situation in Figure 3.8a, where the lower bound (depicted by a dotted line) is equal to (d(q, p1) − d(q, p2))/2. In Figure 3.8b, the hyperbolic curve represents all possible positions of the query object q with a constant value of (d(q, p1) − d(q, p2))/2. If we move the query object q up vertically while maintaining the distance to the dashed line, the expression (d(q, p1) − d(q, p2))/2 decreases. For illustration, see Figure 3.8c, where q′ represents the new position of the query object. Consequently, the expression (d(q, p1) − d(q, p2))/2 is indeed the lower bound on d(q, o). The formal definition of this double-pivot distance constraint is given in Lemma 3.2.4.

Lemma 3.2.4 Assume a metric space M = (D, d) and objects o, p1, p2 ∈ D such that d(o, p1) ≤ d(o, p2). Given a query object q ∈ D and the distances d(q, p1) and d(q, p2), the distance d(q, o) is lower-bounded as follows:

max{(d(q, p1) − d(q, p2))/2, 0} ≤ d(q, o).

We should point out that this constraint does not employ any already-evaluated distance from a pivot to a database object. If we knew such distances to both pivots we would simply apply Lemma 3.2.1 twice, for each pivot separately. The concept of using known distances to more pivots is detailed in the following.
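For completeness, a one-line sketch (illustrative Python) of the bound and its typical use: when the returned value exceeds the query radius r, the whole subspace of objects closer to p1 can be skipped.

```python
def double_pivot_lower_bound(d_q_p1, d_q_p2):
    """Lemma 3.2.4: lower bound on d(q, o) for any object o
    satisfying d(o, p1) <= d(o, p2)."""
    return max((d_q_p1 - d_q_p2) / 2.0, 0.0)
```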

3.2.5 Pivot Filtering

Given a range query R(q, r), we can eliminate database objects by applying Lemma 3.2.1, provided we know the distance between p and all database objects. This situation is demonstrated in Figure 3.9a, where the white area contains objects that cannot be eliminated under such a distance criterion. After elimination, the search algorithm would proceed by inspecting all remaining objects and comparing them against the query object using the original distance function, i.e. for all non-discarded objects oi, verify the query condition d(q, oi) ≤ r.

To achieve a greater degree of pruning, several pivots can be combined into a single pivot filtering technique [20]. The underlying idea is shown in Figure 3.9b, where the reader can observe the improved filtering effect for two pivots. We formalize this concept in the following lemma.

Figure 3.9: Illustration of filtering technique: (a) using a single pivot, (b) using a combination of pivots.

Lemma 3.2.5 Assume a metric space M = (D, d) and a set of pivots P = {p1, . . . , pn}. We define a mapping function Ψ: (D, d) → (R^n, L∞) as follows:

Ψ(o) = (d(o, p1), d(o, p2), . . . , d(o, pn)).

Then, we can bound the distance d(q, o) from below:

L∞(Ψ(q), Ψ(o)) ≤ d(q, o).

The mapping function Ψ(·) returns a vector of distances from an object o to all pivots in P. For a database object, the vector actually contains the pre-computed distances to pivots. On the other hand, the application of Ψ(·) on a query object q requires computation of distances from the query object to all pivots in P. Once we have the vectors Ψ(q) and Ψ(o), the lower bound criterion can be applied to eliminate the object o if |d(q, pi) − d(o, pi)| > r for any pi ∈ P. The white area in Figure 3.9b represents the objects that cannot be eliminated from the search using two pivots. These objects will still have to be tested directly against the query object q with the original metric function d.

The mapping Ψ(·) is contractive, i.e. the distance L∞(Ψ(o1), Ψ(o2)) is never greater than the distance d(o1, o2) in the original metric space. As a consequence, the result of a range query performed in the projected space (R^n, L∞) may contain some spurious objects that do not qualify for the original query. To get the final result, the outcome has to be tested by the original distance function d. More details about metric space transformations can be found in the next section.
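Putting the lemma to work, a sketch of the filtering step might read as follows (illustrative Python; the hypothetical `precomputed` mapping holds the vector Ψ(o) stored for each database object at insertion time):

```python
def pivot_filter_range_query(q, r, pivots, database, precomputed, d):
    """Pivot filtering (Lemma 3.2.5): discard o whenever
    L_inf(Psi(q), Psi(o)) > r, and verify the survivors with d."""
    psi_q = [d(q, p) for p in pivots]              # n distance computations
    result = []
    for o in database:
        psi_o = precomputed[o]                     # stored vector Psi(o)
        l_inf = max(abs(a - b) for a, b in zip(psi_q, psi_o))
        if l_inf <= r and d(q, o) <= r:            # d(q, o) only for non-eliminated objects
            result.append(o)
    return result
```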


3.3 Dynamic Index Structures

In the previous sections, we described some general techniques to prune the search space and avoid some computations based on the properties of the metric function. Now, we have the tools necessary to design a more efficient index structure than the sequential scan. We provide the descriptions of only two major dynamic techniques; others can be found in the exhaustive surveys [9] or [35]. The first one is a multi-way tree based structure called the M-tree – see Section 3.3.1. The second one – the D-index in Section 3.3.2 – uses another paradigm known from traditional primary key indexing: hashing.

3.3.1 M-tree

The M-tree is, by nature, designed as a dynamic and balanced index structure capable of organizing data stored on a disk. By building the tree in a bottom-up fashion from its leaves to its root, the M-tree shares some similarities with R-trees [31] and B-trees [17]. This concept results in a balanced tree structure independent of the number of insertions or deletions and has a positive impact on query execution.

In general, the M-tree behaves like the R-tree. All objects are stored in (or referenced from) leaf nodes while internal nodes keep pointers to nodes at the next level, together with additional information about their subtrees. Recall that R-trees store minimum bounding rectangles in non-leaf nodes that cover their subtrees. In general metric spaces, we cannot define such bounding rectangles because a coordinate system is lacking. Thus M-trees use an object called a pivot, and a covering radius, to form a bounding ball region. In the M-tree, pivots play a role similar to that in the GNAT access structure, but unlike in GNAT, all objects are stored in leaves. Because pre-selected objects are used, the same object may be present several times in the M-tree – once in a leaf node, and once or several times in internal nodes as a pivot.

Each node in the M-tree consists of a specific number of entries, m. Two types of nodes are presented in Figure 3.10. An internal node entry is a tuple 〈p, rc, d(p, pp), ptr〉, where p is a pivot and rc is the corresponding covering radius around p. The parent pivot of p is denoted as pp and d(p, pp) is the distance from p to the parent pivot. As we shall soon see, storing distances to parent objects enhances the pruning effect of search processes. Finally, ptr is a pointer to a child node. All objects o in the subtree rooted through ptr are within the distance rc from p, i.e. d(o, p) ≤ rc. By analogy, a tuple 〈o, d(o, op)〉 forms one entry of a leaf node, where o is a database object (or its unique identifier) and d(o, op) is the distance between o and its parent object, i.e. the pivot in the parent node.

Figure 3.10: Graphical representation of the internal and leaf nodes of the M-tree.
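The entry layout just described can be summarized by the following data structures (an illustrative Python sketch, not the original implementation; the field names are chosen here for readability):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InternalEntry:
    pivot: object                     # routing object p
    covering_radius: float            # r_c: all objects below lie within it
    dist_to_parent: Optional[float]   # d(p, p_p); undefined (None) in the root
    child: "Node"                     # pointer to the covered subtree

@dataclass
class LeafEntry:
    obj: object                       # database object o (or its identifier)
    dist_to_parent: float             # d(o, o_p): distance to the parent pivot

@dataclass
class Node:
    is_leaf: bool
    entries: list                     # at most m InternalEntry or LeafEntry items
```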

Figure 3.11 depicts an M-tree with three levels, organizing a set of objects o1, . . . , o11. Observe that some covering radii are not necessarily minimum values for their corresponding subtrees. Look, e.g., at the root node, where neither the covering radius for object o1 nor that for o2 is optimal. (The minimum radii are represented by dotted circles.) Obviously, using minimum values of covering radii would reduce the overlap of individual bounding ball regions, resulting in a more efficient search. For example, the overlapping balls of the root node in Figure 3.11 become disjoint when the minimum covering radii are applied. The original M-tree does not consider such optimization, but [11] have proposed a bulk-load algorithm for building the tree which creates a structure that sets the covering radii to their minimum values.

Figure 3.11: Example of an M-tree consisting of three levels. Above, a 2-D representation of partitioning. Pivots are denoted by crosses and the circles around pivots correspond to values of covering radii. The dotted circles represent the minimum values of covering radii.

The M-tree is a dynamic structure, thus we can build the tree gradually as new data objects come in. The insertion algorithm looks for the best leaf node in which to insert a new object oN and stores the object there if enough space is available. The heuristics for finding the most suitable leaf node proceeds as follows: The algorithm descends down through a subtree for which no enlargement of the covering radius rc is needed, i.e. d(oN, p) ≤ rc. If multiple subtrees exist with this property, the one for which object oN is closest to its pivot is chosen. Such a heuristics supports the creation of compact subtrees and tries to minimize covering radii. Figure 3.11 depicts a situation in which object o11 could be inserted into the subtrees around pivots o7 and o2. Because o11 is closer to o7 than to the pivot o2, it is inserted into the subtree of o7. If there is no pivot for which zero enlargement is needed, the algorithm’s choice is to minimize the increase of the covering radius. In this way, we descend through the tree until we come to a leaf node where the new object is inserted. During the tree traversal phase, the covering radii of all affected nodes are adjusted.

Insertion into a leaf may cause the node to overflow. The overflow of a node N is resolved by allocating a new node N′ at the same level and by redistributing the m + 1 entries between the node subject to overflow and the one newly created. This node split requires two new pivots to be selected and the corresponding covering radii adjusted to reflect the current membership of the two new nodes.


Naturally, the overflow may propagate towards the root node and, if the root splits, a new root is created and the tree grows up one level. A number of alternative heuristics for splitting nodes is considered in [13]. Through experimental evaluation, a strategy called the minimum maximal radius (mMRAD2) has been found to be the most efficient. This strategy optimizes the selection of new pivots so that the corresponding covering radii are as small as possible. Specifically, two objects pN, pN′ are used as new pivots for nodes N, N′ if the maximum (i.e. the larger, max(rcN, rcN′)) of the corresponding radii is minimum. This process reduces overlap within node regions.

Starting at the root, the range search algorithm for R(q, r) traverses the tree in a depth-first manner. During the search, all the stored distances to parent objects are brought into play. Assuming the current node N is an internal node, we consider all non-empty entries 〈p, rc, d(p, pp), ptr〉 of N as follows:

• If |d(q, pp) − d(p, pp)| − rc > r, the subtree pointed to by ptr need not be visited and the entry is pruned. This pruning criterion is based on the fact that the expression |d(q, pp) − d(p, pp)| − rc forms the lower bound on the distance d(q, o), where o is any object in the subtree ptr. Thus, if the lower bound is greater than the query radius r, the subtree need not be visited because no object in the subtree can qualify the range query.

• If |d(q, pp) − d(p, pp)| − rc ≤ r holds, we cannot avoid computing the distance d(q, p). Having the value of d(q, p), we can still prune some branches via the criterion: d(q, p) − rc > r. This pruning criterion is a direct consequence of the lower bound in Lemma 3.2.2 with substitutions rl = 0 and rh = rc (i.e. the lower and upper bounds on the distance d(p, o)).

• All non-pruned entries are searched recursively.

Leaf nodes are similarly processed. Each entry 〈o, d(o, op)〉 is examined using the pruning condition |d(q, op) − d(o, op)| > r. If it holds, the entry can be safely ignored. This pruning criterion is the lower bound in Lemma 3.2.1. If the entry cannot be discarded, the distance d(q, o) is evaluated and the object o is reported if d(q, o) ≤ r. Note that in all three steps where pruning criteria hold, we discard some entries without computing distances to the corresponding objects. In this way, the search process is made faster and more efficient.
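The three pruning steps can be combined into a compact recursive procedure; the sketch below (illustrative Python, built on the node layout sketched earlier) follows the description above but omits all disk access details.

```python
def mtree_range_search(node, q, r, d, d_q_parent=None, result=None):
    """Range search R(q, r): d_q_parent is d(q, p_p), the distance from q to the
    pivot of the entry that led to this node (None at the root)."""
    if result is None:
        result = []
    for e in node.entries:
        slack = 0.0 if node.is_leaf else e.covering_radius
        # cheap pre-test with the stored parent distance (Lemmas 3.2.1/3.2.2)
        if d_q_parent is not None and e.dist_to_parent is not None:
            if abs(d_q_parent - e.dist_to_parent) - slack > r:
                continue                         # pruned without a distance computation
        if node.is_leaf:
            if d(q, e.obj) <= r:
                result.append(e.obj)
        else:
            d_q_p = d(q, e.pivot)
            if d_q_p - e.covering_radius <= r:   # Lemma 3.2.2 with rl = 0, rh = r_c
                mtree_range_search(e.child, q, r, d, d_q_p, result)
    return result
```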

The algorithm for k-nearest-neighbors queries is based on the range search algorithm, but instead of the query radius r the distance to the k-th current nearest neighbor is used – for details see [13].

From a theoretical point of view, the space complexity of the M-tree involves O(n + m · m_N) distances, where n is the number of distances stored in leaf nodes, m_N is the number of internal nodes, and each node has a capacity of m entries. The claimed construction complexity is O(n · m^2 · log_m n) distance computations.

A dynamic structure called the Metric Tree (M-tree) is proposed in [13]. It can handle data files that change size dynamically, which becomes an advantage when insertions and deletions of objects are frequent. In contrast to other metric trees, the M-tree is built bottom-up by splitting its fixed-size nodes. Each node is constrained by sphere-like (ball) regions of the metric space. A leaf node entry contains an identification of the data object, its feature value used as an argument for computing distances, and its distance from a routing object (pivot) that is kept in the parent node. Each internal node entry keeps a child node pointer, the covering radius of the ball region that bounds all objects indexed below, and its distance from the associated pivot. Obviously, the distance to the parent pivot has no meaning for the root. The pruning effect of search algorithms is achieved by using the covering radii and the distances from objects to their pivots in parent nodes.

Dynamic properties in storage structures are highly desirable but typically have a negative effect on performance. Furthermore, the insertion algorithm of the M-tree is not deterministic, i.e., inserting objects in different order results in different trees. That is why the bulk loading algorithm has been proposed in [11]. The basic idea of this algorithm works as follows: Given a set of objects, the initial clustering produces k sets of relatively close objects. This is done by choosing k distant objects from the set and making them representative samples. The remaining objects get assigned to the nearest sample. Then, the bulk-loading algorithm is invoked for each of these k sets, resulting in an unbalanced tree. Special refinement steps are applied to make the tree balanced.


Figure 3.12: (a) The bps split function and (b) the combination of two bps functions.

3.3.2 D-index

In tree-like indexing techniques, search algorithms traverse trees and visit nodes which reside within the query region. This represents logarithmic search costs in the best case. Indexes based on hashing, sometimes called key-to-address transformation paradigms, contrast by providing direct access to searched regions with no additional traversals of the underlying structure. In this section, we describe an interesting hash-based index structure that supports disk storage.

The Distance Index (D-index) is a multi-level metric structure, based on hashing objects into buckets which are search-separable on individual levels – see Section 3.1.3 for the concept of partitioning with exclusion. The structure supports easy insertion and bounded search costs because at most one bucket per level need be accessed for range queries with a search radius up to some predefined value ρ. At the same time, the use of a pivot-filtering strategy described in Section 3.2.5 significantly cuts the number of distance computations in the accessed buckets. In what follows, we provide an overview of the D-index, which is fully specified in [21]. A preliminary idea of this approach is available in [28].

Before presenting the structure, we provide more details on the partitioning principles employed by the technique, which are based on multiple definitions of a mapping function called the ρ-split function. An example of a ρ-split function named bps (ball-partitioning split) is illustrated in Figure 3.12a. With respect to the parameter (distance) ρ, this function uses one pivot p and the median distance dm to partition a dataset into three subsets. The result of the following bps function uniquely identifies the set to which an arbitrary object o ∈ D belongs:

bps^{1,ρ}(o) = 0 if d(o, p) ≤ dm − ρ,
               1 if d(o, p) > dm + ρ,
               − otherwise.    (3.1)

In principle, this split function uses the excluded middle partitioning strategy described in Section 3.1.3. To illustrate, consider Figure 3.12 again. The split function bps returns zero for the object o3, one for the object o1 (it lies in the outer region), and ‘−’ for the object o2. The subset of objects characterized by the symbol ‘−’ is called the exclusion set, while the subsets characterized by zero and one are separable sets. In Figure 3.12a, the separable sets are denoted by S^{1,ρ}_{[0]}(D), S^{1,ρ}_{[1]}(D) and the exclusion set by S^{1,ρ}_{[−]}(D). Recall that D is the domain of a given metric space. Because this split function produces two separable sets, we call it a binary bps function. Objects of the exclusion set are retained for further processing. To emphasize the binary behavior, Equation 3.1 uses the superscript 1 which denotes the order of the split function. The same notation is retained for the resulting sets.

Two sets are separable if any range query using a radius not greater than ρ fails to find qualifying objects in both sets. Specifically, for any pair of objects oi and oj such that bps^{1,ρ}(oi) = 0 and bps^{1,ρ}(oj) = 1, the distance between oi and oj is greater than 2ρ, i.e. d(oi, oj) > 2ρ. This is obvious from Figure 3.12a; however, it can also be easily proved using the definition of the bps function and applying the triangle inequality. We call such a property of ρ-split functions the separable property.

For most applications, partitioning into two separable sets is not sufficient, so we need split functions that are able to produce more separable sets. In the D-index, we compose higher order split functions by using several binary bps functions. An example of a system of two binary split functions is provided in Figure 3.12b. Observe that the resulting exclusion set is formed by the union of the exclusion sets of the original split functions. Furthermore, the new separable sets are obtained as the intersections of all possible pairs of the original separable sets. Formally, we have n binary bps^{1,ρ} split functions, each of them returning a single value from the set {0, 1, −}. The joint n-order split function is denoted as bps^{n,ρ} and the return value can be seen as a concatenated string of results of participating binary functions, that is, the string b = (b1, . . . , bn), where bi ∈ {0, 1, −}. In order to obtain an addressing scheme, which is essential for any hashing technique, we need another function that transforms the string b into an integer. The following function 〈b〉 returns an integer value in the range [0..2^n] for any string b ∈ {0, 1, −}^n:

〈b〉 = [b1, b2, . . . , bn]2 = Σ_{j=1}^{n} 2^{n−j} · bj,   if ∀j: bj ≠ −
〈b〉 = 2^n,   otherwise

When no string elements are equal to ‘−’, the function 〈b〉 simply treats b as a binary number, which is always smaller than 2^n. Otherwise, the function returns 2^n.
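The n-order split function and the 〈·〉 addressing operator may be sketched as follows (illustrative Python; the hypothetical `pivots` and `medians` lists hold the parameters of the n participating binary bps functions):

```python
def bps(o, p, dm, rho, d):
    """Binary bps split function: returns 0, 1, or '-' for the exclusion zone."""
    dist = d(o, p)
    if dist <= dm - rho:
        return 0
    if dist > dm + rho:
        return 1
    return '-'

def bucket_address(o, pivots, medians, rho, d):
    """n-order split combined with the <b> operator: separable objects get an
    index in [0, 2**n - 1], objects of the exclusion set get 2**n."""
    n = len(pivots)
    b = [bps(o, p, dm, rho, d) for p, dm in zip(pivots, medians)]
    if '-' in b:
        return 2 ** n
    return sum(bit << (n - 1 - j) for j, bit in enumerate(b))
```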

By means of ρ-split functions and the 〈·〉 operator, we assign an integer number i (0 ≤ i ≤ 2^n) to each object o ∈ D and, in this respect, group objects from D into 2^n + 1 disjoint subsets. Considering again the illustration in Figure 3.12b, the sets denoted as S^{2,ρ}_{[00]}, S^{2,ρ}_{[01]}, S^{2,ρ}_{[10]}, S^{2,ρ}_{[11]} are mapped to S^{2,ρ}_{[0]}, S^{2,ρ}_{[1]}, S^{2,ρ}_{[2]}, S^{2,ρ}_{[3]} (i.e. four separable sets). The remaining combinations S^{2,ρ}_{[0−]}, S^{2,ρ}_{[1−]}, S^{2,ρ}_{[−0]}, S^{2,ρ}_{[−1]}, S^{2,ρ}_{[−−]} are all interpreted as a single set S^{2,ρ}_{[4]} (i.e. the exclusion set). Once again, the first 2^n sets are called separable sets and the exclusion set is formed by the set of objects o for which 〈bps^{n,ρ}(o)〉 evaluates to 2^n.

The most important fact is that the combination of split functions also satisfies the separable property. We say that such a disjoint separation of subsets, or partitioning, is separable up to 2ρ. This property is used during retrieval, because a range query with radius r ≤ ρ never requires accessing more than one of the separable sets and, possibly, the exclusion set.

Naturally, the more separable sets we have, the larger the exclusion set is. For a large exclusion set, the D-index allows an additional level of splitting by applying a new set of split functions to the exclusion set of the previous level. This process is repeated until the exclusion set is conveniently small.

The storage architecture of the D-index is based on a two-dimensional array of buckets used for storing data objects. On the first level, a bps function is applied to the whole dataset and a list of separable sets is obtained. Each separable set forms a separable bucket. In this respect, a bucket represents a metric region and organizes all objects from the metric domain falling into it. Specifically, on the first level, we get a one-dimensional array of buckets. The exclusion set is partitioned further at the next level, where another bps function is applied. Finally, the exclusion set on the final level, which will not be further partitioned, forms the exclusion bucket of the whole multi-level structure. Formally, a list of h split functions (bps^{m1,ρ}_1, bps^{m2,ρ}_2, . . . , bps^{mh,ρ}_h) forms 1 + Σ_{i=1}^{h} 2^{m_i} buckets as follows:

B_{1,0}, B_{1,1}, . . . , B_{1,2^{m1}−1}
B_{2,0}, B_{2,1}, . . . , B_{2,2^{m2}−1}
...
B_{h,0}, B_{h,1}, . . . , B_{h,2^{mh}−1}, E_h

In the structure, objects from all separable buckets are included, but only the E_h exclusion bucket is present because exclusion buckets E_i (i < h) are recursively repartitioned on levels i + 1. The bps functions of individual levels should be different but must employ the same ρ. Moreover, by using a different order of split functions (generally decreasing with the level), the D-index structure can have a different number of buckets at individual levels. To deal with overflow problems and file growth, buckets are implemented as elastic buckets and consist of the necessary number of fixed-size blocks (pages) – basic disk access units.

In Figure 3.13, we present an example of the D-index structure with a varying number of separable buckets per level. The structure consists of three levels. Exclusion buckets which are recursively repartitioned are shown as dashed rectangles. Obviously, the exclusion bucket of the third level forms the exclusion bucket of the whole structure. Observe that the object o5 falls into the exclusion set several times and is finally accommodated in the global exclusion bucket. The object o4 has also fallen into the exclusion bucket on the first level, but it is accommodated in a separable bucket on the second level. Below the structural view, there is an example of the partitioning applied to the first level.


Figure 3.13: Example of D-index structure.

3.3.3 Scalability Experiments

Figure 3.14 presents scalability of range and nearest-neighbor queries in terms of distance computations and block accesses. In these experiments, the authors used 45-dimensional vectors of color image features (labeled VEC) compared via the quadratic form distance function and the amount of data grows from 100,000 up to 600,000 objects. Apart from the sequential (SEQ) organization, individual curves are labeled by a number indicating either the count of nearest neighbors or the search radius, and a letter, where ‘D’ stands for the D-index and ‘M’ for the M-tree. Query size is not provided for the results of SEQ because sequential organization has the same costs no matter the query. The results indicate that on the level of distance computations, the D-index is usually slightly better than the M-tree, but the differences are not significant – the D-index and M-tree can each save a considerable number of distance computations over the SEQ. To solve a query, the M-tree needs significantly more block reads than the D-index and for some queries (see the 2,000M curve) this number is even higher than for the SEQ. The reason for such behavior has been given earlier.

Figure 3.14: Scalability of range (left) and nearest-neighbor queries (right) for the VEC dataset. The plots show distance computations and page reads as a function of the data set size (×1,000).

In general, the D-index can be said to behave strictly linearly when the size of the dataset grows, i.e. search costs depend linearly upon the amount of data. In this regard, the M-tree came out slightly better, because execution costs for querying a file twice as large were not twice as high. This sublinear behavior should be attributed to the fact that the M-tree incrementally reorganizes its structure by splitting blocks and, in this way, improves data clustering. On the other hand, the D-index used a constant bucket structure, where only the number of blocks changed. However, the static hashing schema allows the D-index to have constant costs for exact match queries. The D-index required one block access and eighteen distance comparisons, independent of dataset size. This was in sharp contrast to the M-tree, which needed about 6,000 block reads and 20,000 distance computations to find the exact match in a set of 600,000 vectors. Moreover, the D-index has constant costs to insert one object, while the M-tree exhibits logarithmic behavior.

3.4 Research Objective

The basic lessons learned from the experiments are twofold:

• similarity search is expensive;

• the scalability of centralized indexes is linear.

Of course, there are differences in search costs among individual techniques, but the global outcome is that search costs grow linearly with dataset size. This property prohibits their applicability for huge data archives, because, after a certain point, centralized indexes become inefficient for users’ needs.

A solution may be obtained by sacrificing some precision in search results. This technique is called approximate similarity search. Though the approximate versions of similarity search structures, e.g. the approximate M-tree in [4], improve the query execution a lot, they are also not scalable for large volumes of data. On the other hand, a high percentage of data is now produced in digital form and its amount grows exponentially. In order to manage similarity search in multimedia data types such as plain text, music, images, and video, this trend calls for putting equally scalable infrastructures in motion.

Problem 3.4.1 Design a metric index structure that has significantly better scalability than just linear. In other words, the similarity search response should not be significantly affected by increases in the size of the database.

In this respect, distributed computing environments, like the ones provided by GRID infrastructures or Peer-to-Peer (P2P) networks, are quickly gaining in popularity due to their scalability and self-organizing nature. They form the base for building large-scale similarity search indexes at low costs. Thus, we will try to combine the parallel scalability of distributed computing with the power of centralized metric index structures to form a new distributed scalable metric index structure (see Chapter 5).


Chapter 4

Distributed Index Structures

The classical data storage and processing approach uses one centralized place (usually a powerful server computer), where all data are held. Every data operation - such as insertion of new objects or retrieving some relevant data subset - must be sent to and processed on this central server in order to get the desired results. Drawbacks of such an arrangement are evident. The server system is usually unable to process huge amounts of concurrent requests. In addition, the failure of this central part means unavailability of the whole system. The scalability of such a system is also limited - the amount of data that can be stored is bounded by the constant size of the storage. Finally, the price of one powerful server is usually high.

A possible solution to these problems is the use of a distributed environment to store and process data. In principle, a distributed system can exploit unused resources of computers connected by a high-speed network. For instance, workstations of a common organization have plenty of RAM and disk space and they do not use the full power of their processors all the time. Moreover, the workstations are connected in a local intranet. Another example can be computers linked in the Internet.

Therefore, it is a challenging task to combine such resources into a fully operational data storage system, that is, a system allowing the insertion and distribution of data among networked computers and an efficient retrieval of relevant data from any participant. In this chapter, we present a brief survey of the available data structures for distributed environments. As anticipated earlier, we focus on the indexing structures with emphasis on the problem of scalability. Thus we do not consider the extensions of search algorithms for parallel CPUs or the distributed solutions with some global coordinator, etc., since they always reach their limitations sooner or later. Instead, we are interested in structures that can spread over an unlimited number of resources and that can grow further without significant performance degradation.

4.1 Scalable and Distributed Data Structures

The paradigm of Scalable and Distributed Data Structures was originally proposed by [53] for simple search keys like numbers and strings. Data objects are stored in a distributed file on specialized network nodes called servers. More servers are employed as the file grows and additional storage capacity is required. The file is modified and queried by network nodes called clients through insert, delete, and search operations. The number of clients is unlimited and any client can request an operation at any time. To ensure high effectiveness, the following three properties should be built into the system:

1. Scalability - data expand to new network nodes gracefully, and only when the network nodes already used are efficiently loaded.

2. No hot-spot - there is no master site that must be accessed for resolving the address of the searched objects, e.g., there is no centralized directory.

3. Independence - the file access and maintenance primitives, e.g., search, insertion, split, etc., never require atomic updates to multiple nodes.

There are several practical reasons why the second property should be satisfied. In particular, if hot-spots such as centralized directories exist, they would sooner or later turn into bottlenecks as the files grow. Structures without hot-spots are also potentially more efficient in terms of the number and the distribution of messages sent over the network during the execution of an operation.

The third property is vital in a distributed environment, because informing other nodes may be either inefficient or even impossible in large-scale networks. Since they do not support techniques like multicast or broadcast, update operations cannot efficiently contact multiple servers with only one message. As an alternative, they would flood the network with multiple independent messages to all the respective nodes, which is certainly undesirable. Moreover, when several updates occur simultaneously on different servers, it may be difficult to maintain data consistency on individual nodes.

In this section, we will describe the most important distributed and scalable data structures proposed so far. We will outline a representative of the SDDS based on hashing (Section 4.1.1), the SDDS based on a binary search tree (Section 4.1.2), a hybrid algorithm, which uses hashing and a distributed tree (Section 4.1.3), and finally, the order preserving distributed structure that is able to solve range queries (Section 4.1.4).

4.1.1 Distributed Linear Hashing

The Distributed Linear Hashing (LH*) [53] is a purely distributed structure for primary key searching. It is based on linear hashing [51], where a sort of “evolving” hash function is used to address the stored data. The data are organized in buckets, which are held on servers (computers of a network that run specialized software). For simplicity, one server holds exactly one bucket. Other network nodes called clients insert new data and issue queries retrieving subsets of data. The structure is scalable and grows through adding more servers to get the needed capacity. Every node participating in this distributed structure has the knowledge of a dynamic hashing function h, which is used to address the data and is fixed for the whole structure. Moreover, the nodes also maintain two important features of dynamic hashing functions, i.e. the current level of the hashing i and the identification of the bucket with the split token n. These two properties represent an internal “view” of the current state of the structure, and can be more or less accurate.

When a node wants to search for an object identified by the key k, it uses its local knowledge of the hashing level and split token (i′, n′) to hash the key k using the hash function h. The result is the identification of the bucket (and thus also the server) which has to be accessed, and the query is sent to this server. The server confirms that the addressing by the client was accurate by computing the hash value on the key again, but with its own “view” of i and n, which is more accurate than the client’s (a proof is provided in [53]). If the result identifies the server itself, the data are on this server and they are sent back to the client. Otherwise, the result is the identification of a more correct server and the request is forwarded there. The query is then solved on that server and the result data are sent back to the client.

Figure 4.1: Example of LH∗ structure with multiple servers and clients.

To allow the client to update its “view”, the actual i and n values of the server are sent along with the response whenever a client makes an addressing error. Assuming uniform hashing functions, LH* needs two messages to do a search – one to send a query and one as a reply – in the best case. In the worst case, it needs four messages – one to send a query, two to do the forwarding at most, and one for a reply. The proof is provided in [53].
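The addressing just described can be sketched as follows (illustrative Python, simplified with respect to [53]; h_i(k) = k mod 2^i is assumed as the family of hash functions). The client computes a bucket number from its possibly outdated view (i′, n′); a server receiving the request recomputes the address with its own, more accurate view and forwards the request when the two differ.

```python
def lh_address(key, level, split_token):
    """LH*-style address computation with view (level, split_token):
    buckets below the split token have already been split once more."""
    a = key % (2 ** level)
    if a < split_token:
        a = key % (2 ** (level + 1))
    return a

# A client with view (i', n') = (2, 0) sends key k to bucket lh_address(k, 2, 0);
# a server with the more accurate view (i, n) recomputes lh_address(k, i, n)
# and forwards the request if the resulting address is not its own.
```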

4.1.2 Distributed Random Tree

Distributed Random Tree (DRT) [46] is a binary search tree with all data elements stored in leaves. Each leaf represents a block of data elements (bucket), while internal nodes store auxiliary information necessary for guiding the basic tree operations, i.e. search, insert and delete.

Initially, the structure consists of one bucket at server 1 (see Figure 4.2). Whenever the bucket at server 1 overflows, a new server (2) with a bucket is created. A new father node is assigned to both buckets (raising the depth of the tree by one). This node contains the information necessary to divide the data stored in the bucket at server 1 into two parts. The data are split using this information and part of them is moved to the bucket at server 2. The father node is known to both servers, but each server maintains only one bucket. The leaf node of the opposite branch is replaced by a reference to the other server (where the bucket is actually stored). As long as insertions keep coming, buckets split and new internal nodes are generated.

Figure 4.2: The evolution of the distributed search tree.

Clients store a part of the global tree. Actually, they are unaware of the changes occurring in the search tree until they issue a new request. Whenever a client c wants to insert or to search for a key k, it traverses its local view of the search tree using k in order to find the server s possibly accommodating the bucket holding k. Then it sends the request to the server s.

If the server s is pertinent for the operation (i.e. its bucket contains k), the server performs the requested operation. In the opposite case, s searches its own view of the search tree to figure out the possibly pertinent server s′ to which it forwards the request. The forwarding is done recursively until the correct server is found and the operation is processed. The client is then informed about the “fresh” search tree and has to update its own local tree.


4.1.3 Distributed Dynamic Hashing

The advantage of the Distributed Dynamic Hashing (DDH) [19], as opposed to LH*, is its greater splitting autonomy – overflowing buckets are split immediately. DDH uses Dynamic Hashing (DH) [47] as its kernel algorithm. In DH, the dynamic hashing function is a trie (a binary tree with bit-encoded branches). The level i of the hashing basically means that the rightmost i bits of the key, or of the hashed key (pseudo-key), are used as the logical address of the key.

The hashing logically consists of a traversal of the trie from the root (which always points to bucket “0”) to some leaf with level i (which points to the actual bucket where the key should be). The traversal is routed by the bits of the key. The first bit is used at the first level of the tree – zero means the left branch, one the right. The second bit is used for the decision at the second level, and so on, until the lowest level i is reached.
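The addressing rule can be illustrated by the following sketch (a hypothetical Java fragment, assuming that an ordinary hash code serves as the pseudo-key; all names are illustrative only):

    // DH/DDH addressing sketch: the logical bucket address of a key at hashing
    // level `level` consists of the rightmost `level` bits of its pseudo-key.
    final class DDHAddressing {
        static long pseudoKey(Object key) {
            return key.hashCode() & 0xffffffffL;    // any uniform hash may serve as a pseudo-key
        }

        static long bucketAddress(long pseudoKey, int level) {
            return pseudoKey & ((1L << level) - 1); // keep the rightmost `level` bits
        }
    }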

Unlike DH, though, the actual trie is not maintained in DDH. Instead, every client has an image trie that it builds through updates (image adjustments). A new client always sends a key to bucket 0, since its internal view is an empty trie with only the root node (0). A key incorrectly sent to a server is forwarded to the correct server and an update is sent back to the client, which adjusts its local trie. The process continues until the correct bucket is reached.

4.1.4 Scalable Distributed Order Preserving Data Structures

The RP* (Range Partitioning) family [52] differs from LH* in that it is a group of order-preserving data structures. Such data structures are best suited for sequential and range scanning of files. There are three data structures in the RP* family and they are differentiated based on the kind of network environment they operate in (with or without multicast/broadcast capabilities).

All communication between clients and servers is done by multicast messages in the simplest scheme (referred to as RP*N in the paper). Searching is rather straightforward – a client simply sends a multicast message and all buckets receive it. Only buckets that contain matching records reply. Range searching is done in a similar way: if a server contains entries in the range of the query, it replies to the client by sending the matching records. The search stops when the client merges all the ranges it received from the servers and their union covers the range of the query. General searches (i.e. with all buckets addressed) proceed in the same way. Instead of using the union to determine when all results have been received, a client can use a timer. Timed searches reduce the number of messages exchanged by requiring servers to send replies only if they have matching data. Completeness of the result is then probabilistic (i.e. the longer the timeout, the smaller the chance that the result is incomplete).

The objective of the more complicated versions is to reduce the usage of multicast (RP*C) or to avoid multicast completely (RP*S). This is accomplished by replicating a search B+-tree on every server and maintaining it under updates. More details can be found in [52].

4.2 Unstructured Peer-to-Peer Networks

Another distributed paradigm has led to the definition of the Peer-to-Peer (P2P) data network. In this environment, network nodes are called peers, equal in functionality and typically operating as part of a large-scale, potentially unreliable, network. Basically, a peer offers some computational resources, but can also use the resources of others [2]. In principle, the P2P network inherits the basic principles of SDDSs, adding new requirements to overcome the problems of unreliability in the underlying network. These can be summarized as follows:

1. peer – every node participating in the structure behaves as both client and server, i.e. the node can perform queries and at the same time store a part of the processed data file;

2. fault tolerance – the failure of a network node participating in the structure is not fatal. All defined operations can still be performed, but the affected part of the dataset is inaccessible;

3. redundancy – data components are replicated on multiple nodes to increase availability. Search algorithms must respect the possibility of multiple paths leading to specific instances.


Not every P2P structure proposed so far follows all the properties mentioned. However, these are rules that any P2P system should be aware of and which ensure the maximal scalability and effectiveness of the system.

4.2.1 Napster

The first P2P structures were developed for sharing data files between users on the Internet. We must mention the most famous one here, Napster [54], although it is not a classical decentralized peer-to-peer system. Nevertheless, this technology was the first to define peer-to-peer in the context of data distributed among a cooperating network of computers.

We usually call this technique a hybrid P2P system now, because searching is centralized at one server. Thus, no complex distributed search algorithms are necessary. A user who wants to join the network simply connects to the central server and inserts the names of all the files he wishes to contribute to the network. When looking for something, the user simply queries the server for the names of files. The server uses a classical wildcard match on the maintained list of files from currently joined users and responds with the files that match the query, along with the IP addresses of the users holding them. The requesting user then chooses from the received list of files and connects directly (using the provided IP address) to the other users in order to download the files. Thus, the peer-to-peer paradigm is used only during the transfer of the files between participants, while the search is still centralized.

4.2.2 Gnutella

Similarly to Napster, Gnutella [29] was also designed as a file sharing protocol. In contrast, however, the goal of Gnutella is to provide a purely distributed file sharing solution. The decentralized nature of Gnutella provides a level of anonymity for users but also introduces a degree of uncertainty. A user must first know the IP address of another Gnutella node in the network. He is then connected to the network through this node and maintains a list of known nodes (those that connect through him). When a user wishes to find a file, he issues a query to the other Gnutella users he knows. Those users may or may not respond with results, and they forward the query request to any other Gnutella nodes they know about. A query contains a Time-To-Live (TTL) field, which is decremented on every forward, and the query is forwarded until the TTL drops to zero (the usual default TTL of a query is 7).

However, the scalability of this search technique is quite limited, because the search “floods” all known nodes and the number of queried servers grows exponentially. The client then receives all the responses. Another problem is that the query may never reach the server which potentially holds the requested data. Moreover, the client also cannot determine whether the result gathered so far is complete or not. On the other hand, this approach can be easily adapted to metric-based similarity search. For example, we can flood the network with a range query (according to the technique described above). Every contacted node evaluates the received range query on its local data. If there are objects satisfying the query, the server returns them to the originator.

Other enhancements of primary-key-search peer-to-peer networks are available (for instance, see the survey [71]), adding for example a level of anonymity or fault tolerance. However, we would like to focus on the generally different paradigms in order to investigate the possibilities of distributed similarity search.

4.3 Distributed Hash Table Peer-to-Peer Networks

The main problem of the unstructured peer-to-peer networks is the inability to apply more effective navigation algorithms, simply because the data can be virtually anything and we have no knowledge of their structure (hence the term unstructured). In this section, we present techniques that assume some structure. Specifically, they assume that a hash function can be applied to every data object to obtain structured information, which can then be used to navigate the P2P network more efficiently.


4.3.1 Content-Addressable Network

Content-Addressable Network (CAN) [62] is a distributed hash-based infrastructure that provides fast lookup functionality on Internet-like scales. A hash function assigns a d-dimensional vector P = hash(K) to each data record K, which corresponds to a point in a d-dimensional space. In CAN, the point indicates the virtual position of that particular data record. The virtual space is partitioned into many small d-dimensional zones. Each peer (a physical computer) corresponds to one zone and stores the data that are mapped to this zone by the hash function. Each machine knows the zones and addresses of its neighbors (i.e. the machines that hold zones that abut along one dimension).

Figure 4.3: Example of a 2-D CAN space and routing before node 7 joins (a) and after node 7 joins (b).

The locating of the data is done as follows. For a given key, the virtual position is calculated. Then, starting from any physical machine, the query message is passed through the neighbors until it reaches the target machine. In a d-dimensional space, each node maintains 2d neighbors.
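A simplified sketch of this greedy navigation follows (Java; the types and fields are illustrative, not part of CAN, and for brevity the forwarding decision compares the target point to the neighbors’ zone centers rather than to the zones themselves):

    // CAN-style greedy routing sketch: if the target point falls into the local
    // zone, the lookup ends here; otherwise the message is forwarded to the
    // neighbor whose zone center is closest to the target point.
    final class CanNode {
        double[] zoneLow, zoneHigh;            // the hyper-rectangle owned by this node
        java.util.List<CanNode> neighbors;     // nodes owning abutting zones

        boolean owns(double[] p) {
            for (int i = 0; i < p.length; i++)
                if (p[i] < zoneLow[i] || p[i] >= zoneHigh[i]) return false;
            return true;
        }

        CanNode route(double[] target) {
            if (owns(target)) return this;
            CanNode best = null;
            double bestDist = Double.MAX_VALUE;
            for (CanNode n : neighbors) {
                double dist = 0;
                for (int i = 0; i < target.length; i++) {
                    double c = (n.zoneLow[i] + n.zoneHigh[i]) / 2;   // neighbor's zone center
                    dist += (c - target[i]) * (c - target[i]);
                }
                if (dist < bestDist) { bestDist = dist; best = n; }
            }
            return best.route(target);         // forward greedily towards the target
        }
    }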

To allow the CAN to grow incrementally, the zones of the servers can split. The splitting server retains one half of its zone (the zone is halved along one dimension), handing the other half over to the new node. Figure 4.3 shows the split of node 1.


4.3.2 Chord

Chord [68] is one of the most famous distributed hash table protocols. This structure also allows locating the peer responsible for a given search key. However, as opposed to CAN, the hash function maps the keys to a linearly sortable one-dimensional domain, specifically to non-negative integers. The structure is message driven and it is able to adapt as nodes join or leave the system.

Using consistent hashing, the protocol uniformly maps the domain of search keys into the Chord domain of keys [0, 2^m). Every Chord node Ni is assigned a key Ki ∈ [0, 2^m) from this domain. The identifiers are ordered in an identifier circle modulo 2^m, Ki < Kj ⇐⇒ i < j. Node Ni is “responsible” for all keys from the interval (Ki−1, Ki] (mod 2^m) – see Figure 4.4 for a visualization. All data objects that map to this interval are stored on the peer that is responsible for it.

Figure 4.4: Chord peers with the intervals they are responsible for.

Additionally, every node maintains the physical address of its successor on the identifier circle. This information is necessary during the joins and leaves of peers in the system. However, it is not sufficient for an effective search, so every peer also maintains a routing table called the finger table, where it stores “long distance” links to up to m other nodes. The search for a key is then forwarded to the nearest known link (either a long link or the successor link) from peer to peer until the query reaches the peer responsible for the interval into which the key is mapped. This peer then searches its data and responds back to the originating peer with the data belonging to the search key. Due to the uniformity of the Chord domain distribution, the protocol preserves (with high probability) a balanced load of the nodes and a logarithmic hop count for the key searching operation.
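The forwarding step can be sketched as follows (an illustrative Java fragment, not Chord’s actual API, assuming m-bit identifiers and an already populated finger table):

    // Chord-style routing sketch: forward a lookup to the farthest known finger
    // that still precedes the searched key on the identifier circle.
    final class ChordNode {
        final int m;              // number of bits of the identifier space
        final long id;            // this node's key on the circle [0, 2^m)
        ChordNode successor;
        ChordNode[] finger;       // finger[j] ~ successor of (id + 2^j) mod 2^m

        ChordNode(int m, long id) { this.m = m; this.id = id; this.finger = new ChordNode[m]; }

        // true iff x lies in the half-open circular interval (a, b]
        static boolean inInterval(long x, long a, long b) {
            return a < b ? (x > a && x <= b) : (x > a || x <= b);
        }

        ChordNode findResponsible(long key) {
            if (inInterval(key, id, successor.id)) return successor;   // successor owns (id, successor.id]
            for (int j = m - 1; j >= 0; j--) {                         // closest preceding finger
                ChordNode f = finger[j];
                if (f != null && f.id != key && inInterval(f.id, id, key))
                    return f.findResponsible(key);
            }
            return successor.findResponsible(key);
        }
    }

Because each such hop roughly halves the remaining distance on the circle, the number of forwards stays logarithmic in the number of peers with high probability.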

4.4 Tree-based Peer-to-Peer Networks

Another group of structured peer-to-peer networks resorts to tree-like structures to navigate queries between peers. Again, these techniques assume some knowledge of the stored data that can be exploited during the search. The structures presented in this section are also limited to exact-match queries only, i.e. we can locate data according to a specific key if they are present in the structure, but no similarity queries are possible.

4.4.1 P-Grid

P-Grid [1] is a peer-to-peer lookup system based on a virtual distributed search tree. Each peer only holds part of the overall tree, which comes into existence only through the cooperation of the individual peers. Searching in P-Grid is efficient and fast even for unbalanced trees – it achieves logarithmic search costs with respect to the number of peers.

Every participating peer’s position is determined by its path, that is, the binary bit string representing the subset of the tree’s overall information that the peer is responsible for. For example, the path of Peer 4 in Figure 4.5 is 10, so it stores all data items whose keys begin with 10. For fault tolerance, multiple peers can be responsible for the same path, for example, Peer 1 and Peer 6. The P-Grid query routing approach is simple but efficient. For each bit in its path, a peer stores a reference to at least one other peer that is responsible for the other side of the binary tree at that level. Thus, if a peer receives a binary query string it cannot satisfy, it forwards the query to a peer that is “closer” to the result.


Figure 4.5: An example of the P-Grid virtual binary tree; each peer stores the data items whose keys begin with its path and keeps routing references to peers responsible for the other side of the tree at each level.

In Figure 4.5, Peer 1 forwards queries starting with 1 to Peer 3, which is in Peer 1’s routing table and whose path starts with 1. Peer 3 can either satisfy the query or forward it to another peer, depending on the next bits of the query. If Peer 1 gets a query starting with 0 and the next bit of the query is also 0, it is responsible for the query. If the next bit is 1, however, Peer 1 will check its routing table and forward the query to Peer 2, whose path starts with 01.
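The routing rule can be summarized by the following hypothetical sketch (Java; the peer path, the per-level routing references and the query key as a bit string are simplifications, and the query key is assumed to be at least as long as the peer’s path):

    // P-Grid routing sketch: compare the query key with the local path; at the
    // first differing bit, forward the query to a reference stored for that level.
    final class PGridPeer {
        String path;                 // this peer's binary path, e.g. "10"
        PGridPeer[] refs;            // refs[level]: a peer on the other side of the tree at that level

        void route(String queryKey) {
            for (int level = 0; level < path.length(); level++) {
                if (queryKey.charAt(level) != path.charAt(level)) {
                    refs[level].route(queryKey);   // forward towards the other branch
                    return;
                }
            }
            searchLocalData(queryKey);             // the query prefix matches this peer's path
        }

        void searchLocalData(String queryKey) { /* look up the key in local storage */ }
    }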

4.4.2 P-Tree

The P-Tree index [18] is based on the idea of the B+-tree and inherits the range lookup algorithm from it. Unfortunately, it is primarily designed for one resource stored in each peer and, therefore, has no load-balancing mechanism.

The core of the navigation algorithm lies in a structure called the semi-independent B+-tree maintained by every peer. Let us first introduce fully independent trees for a set of peers p1, . . . , pn that are assigned keys k1, . . . , kn (in increasing order). Each peer pi then views the set of keys organized in a ring with ki as the smallest value and builds a full B+-tree over these keys. The B+-tree leaf nodes point to the peers with the respective keys.

Such an index is space consuming and has high management requirements. The semi-independent tree, i.e. the P-Tree, lets every peer maintain only the left-most root-to-leaf path of the corresponding fully independent B+-tree. The pointers to the tree nodes that are not stored locally lead to the corresponding peers. The locally stored semi-independent trees may overlap in keys between peers. This allows the peers to grow independently of the others, but it incurs some overhead at search time. The semi-independent tree can also be in a so-called inconsistent state, which means that changes on other peers have not yet propagated to this tree. However, a stabilization mechanism that makes the tree consistent again is introduced.

Searching is done using the local B+-tree and the queries are forwarded whenever a pointer to another peer is reached in a leaf node. Range search is evaluated similarly to the original B+-tree algorithm. First, the key at the beginning of the interval is found using the previous search algorithm. Because the order of keys is maintained and the links between leaf nodes are available, we can scan through the whole interval, retrieving objects as necessary, until the upper bound of the interval is reached.

4.5 Multi-dimensional Range Queries

The next step towards a full similarity search is supporting range queries over multiple attributes. Let us introduce several systems that move in this direction.

4.5.1 Space-filling Curves with Range Partitioning

The method called Space-filling Curves with Range Partitioning (SCRAP) [27] was designed to solve multi-dimensional attribute range queries. The idea of the method is simple: first transform the original vector space into a single-dimensional domain and then distribute ranges of this new space across a dynamic set of peers.

The transformation step uses space-filling curves in order to map an n-dimensional space into a single number. Several different space-filling curves are available, suitable for different types of vector data (see for example [58, 38]).
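One commonly used curve (though SCRAP does not prescribe a particular one) is the Z-order (Morton) curve, which simply interleaves the bits of the coordinates. A minimal sketch for two 16-bit coordinates:

    // Z-order (Morton) mapping sketch: interleave the bits of x and y into a
    // single one-dimensional key that roughly preserves spatial locality.
    final class ZOrder {
        static long interleave(int x, int y) {
            long key = 0;
            for (int bit = 15; bit >= 0; bit--) {
                key = (key << 1) | ((x >> bit) & 1);   // next bit of x
                key = (key << 1) | ((y >> bit) & 1);   // next bit of y
            }
            return key;
        }
    }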

In the transformed space, continuous ranges of the destination domain are assigned to peers in a peer-to-peer network. The technique is practically the same as for Chord described in Section 4.3.2. To resolve a multi-dimensional range query, the query is first mapped to several intervals of the space-filling domain. This is done according to the curve used for the mapping, for which range mapping algorithms are also defined. Then, the standard distributed hash table protocol of Chord is employed to route the query to all peers overlapping with these intervals. On these peers, the query is evaluated in the original vector space and the results are returned back to the requesting peer.

4.5.2 Multidimensional Rectangulation with kd-trees

Unlike the previous structure in this section, the Multidimensional Rectangulation with kd-trees (MURK) system [27] does not map the original space into one dimension, but routes directly in the multi-dimensional space. This system partitions the dataspace using the kd-tree approach: every node is responsible for a “rectangle” of the space (a hypercuboid in higher dimensions) and every new peer splits the region of an existing peer in order to divide its load equally.

The routing mechanism is based on the idea of greedy CAN navigation (Section 4.3.1). This simple routing is extended by skip pointers (long links to peers not necessarily neighboring the peer) in order to speed up the navigation. The skip pointers are either random or emulate the exponential distribution of pointers known from one-dimensional routing. This approach gives the desired logarithmic hop-count cost.

4.6 Nearest Neighbors Queries

The systems introduced in the previous section generalize the search queries towards retrieving data objects that are mutually similar by means of belonging to some range. The last step towards similarity search, as we perceive it, is searching for the data objects that are most similar to a given query object, i.e. we are looking for the nearest neighbors. The systems described in this section allow such querying for specific data domains.


4.6.1 pSearch

The pSearch [69] is an information retrieval system that supports searching for documents whose content best fits the query terms (words). In a classical solution, the documents are represented using a vector space model, i.e. each document is assigned a point in a high-dimensional vector space, where each element of the vector corresponds to the importance of a specific term in the document. The similarity between a given query, which is also represented as a term vector, and a specific document is measured as the cosine of the angle between the vectors of the query and the document. In order to further restrict the noise in term vectors and to reduce the dimensionality of the space, a technique called latent semantic indexing is introduced. It uses singular value decomposition to transform and truncate the matrix of term vectors computed in the previous step, and thus a refined lower-dimensional vector space is obtained.

The obtained space is then distributed in a peer-to-peer network using the CAN technique (see Section 4.3.1), which is designed to process vector data. Obviously, no additional hashing function is used; instead, the keys are the vectors received from the latent semantic indexing and the objects stored are the original documents. To resolve a query, its position in the space is first computed and the query is forwarded to the peer responsible for that position. The query is then flooded to peers within a determined similarity radius. The receiving peers do a local search and return the best matching documents.

4.6.2 Small-World Access Methods

The authors of the Small-World Access Methods (SWAM) [5] generally formalize the issue of similarity search in vector domains in P2P data networks. They define the similarity in terms of range and kNN queries on vector spaces with Lp metrics. This model can be considered general for all structures that constitute a graph of peers (nodes) and navigation links (edges), where nodes have edges of two types – to spatially abutting nodes and random edges to “distant” nodes. As is intuitively clear, this concept embraces some of the previously mentioned structures.

Furthermore, the authors of the model also describe the structure SWAM-V, a member of this family that partitions the dataspace in a Voronoi-like manner into neighboring cells. The authors also define three general metrics to measure the performance of similarity search using peer-to-peer data networks. The experimental part compares SWAM-V with CAN (Section 4.3.1) and with a baseline access method – a random graph that simply floods queries to all nodes (practically the Gnutella protocol).


Chapter 5

GHT* – Native Peer-to-Peer Similarity Search Structure

We have shown that the distributed paradigm of the peer-to-peer data networks truly allows shifting the scalability problem into a new dimension. The structures presented in Chapter 4 can easily employ a practically unlimited number of peers and thus overcome the problem of insufficient computing or storage power of centralized solutions. However, the presented structures are either not designed to solve similarity queries or they are limited to a specific data domain only. As stated in Chapter 2, the metric space abstraction is a suitable model for similarity searching and it is sufficiently general to cover a wide variety of different data domains.

In this chapter, we present the first distributed index that supports similarity search (namely the range queries and the k-nearest neighbors queries) in generic metric spaces. It is based on the idea of the Generalized Hyperplane Tree and it is called GHT∗ [6]. The structure allows storing datasets from any metric space and has many essential properties of the SDDS and P2P approaches. It is scalable, because every peer can perform an autonomous split and distribute the data over more peers at any time. It has no hotspot, and all peers use an addressing schema as precise as possible, while learning from misaddressing. Updates are performed locally and splitting never requires sending multiple messages to many peers. Finally, every peer can store data and perform similarity queries simultaneously. In what follows, we present the main characteristics of the GHT∗ index.


5.1 Architecture

In general, the GHT∗ exploits the Peer-to-Peer paradigm, i.e. it consists of network nodes (peers) that can insert, update and delete objects in the structure, and retrieve them using similarity queries.

In the GHT∗, the dataset is distributed among the peers participating in the network. Every peer holds sets of objects in its storage areas called buckets. A bucket is a limited space dedicated to storing objects. It may, for example, be a memory segment or a block on a disk. The number of buckets managed by a peer depends on its own capabilities – a peer can have multiple buckets, only one bucket, or no bucket at all. In the latter case, the peer is unable to hold objects, but can still issue similarity queries and insert or update objects.

Since the GHT∗ structure is dynamic and new objects can be inserted at any time, a bucket on a peer may reach its capacity limit. In this situation, a new bucket is created and some objects from the full bucket are moved to it. This new bucket may be located on a different peer than the original one. Thus, the GHT∗ structure grows as new data come in. The opposite operation – merging two buckets into one – is also possible, and may be used when objects are deleted from the GHT∗.

The core of the algorithm is a mechanism for locating the appropriate peers which hold requested objects. The part of the GHT∗ responsible for this navigation is called the Address Search Tree (AST). In order to avoid hotspots, which may be caused by the existence of a centralized node accessed by every request, an instance of the AST structure is present in every peer. Whenever a peer wants to access or modify data in the GHT∗ structure, it must first consult its own AST to obtain the locations, i.e. peers, where the data reside. Then, it contacts these peers via network communication to actually process the operation.

Since we are in a distributed environment, it is practically impossible to maintain a precise address for every object in every peer. Thus, the ASTs in the peers contain only limited navigation information which may be imprecise. The locating step is then repeated on contacted peers until the desired peers are reached. The algorithm guarantees that the destination peers are always found. The GHT∗ also provides a mechanism called image adjustment for updating the imprecise parts of the AST automatically.

In the following, we summarize the foregoing information and provide some necessary identifiers which will be employed in the remainder of this chapter:

• Each peer maintains data objects in a set of buckets. Within a peer, the Bucket IDentifier (BID) is used to address a bucket.

• Every object is stored in exactly one bucket.

• Each peer participating in the network has a unique Network Node IDentifier (NNID).

• A structure called an Address Search Tree (AST) is present in every peer.

• Subtrees of the AST are automatically updated during the evaluation of queries using an algorithm called image adjustment.

• Peers communicate through the message passing paradigm. For consistency reasons, each request message expects a confirmation by a proper acknowledgment message.

5.2 Address Search Tree

The AST is a binary search tree based on the Generalized Hyperplane Tree (GHT) [70], one of the centralized metric space indexing structures. Its inner nodes hold the routing information of the GHT, a pair of pivots each. Each leaf node represents a pointer to either a bucket (using a BID) or a peer (using an NNID) holding the data. Whenever the data is in a bucket on the local peer, the leaf node is a BID pointer. An NNID pointer is used if the data is on a remote peer. An example of the AST is depicted in Figure 5.1. The NNID and BID pointers in leaf nodes are denoted by BIDi and NNIDi symbols, while the pivots of inner nodes are designated as pi. Observe that every inner node has exactly two pivots. In order to recognize inconsistencies between ASTs on different peers, every inner node has a serial number. It is initially set to one and incremented whenever the particular part of the AST is modified. The serial numbers of inner nodes are shown above the inner nodes in Figure 5.1.


                            2
                        <p1, p2>
                       /        \
                  2                  3
              <p3, p4>           <p5, p6>
              /      \           /      \
           BID1     BID2      BID3     NNID1

Figure 5.1: An example of an Address Search Tree.

Figure 5.2 illustrates the instances of the AST structure in a network of three peers. The dashed arrows indicate the NNID pointers while the solid arrows represent the BID pointers. Observe that Peer 1 has no buckets, while the other two peers contain objects located only under specific leaves.

Figure 5.2: The GHT* network of three peers.

5.3 Storage Management

As we have already explained, the atomic storage unit of the GHT∗ is a bucket. The number of buckets and their capacity on a peer always have upper bounds, but these can be different for different peers. Since the bucket identifiers are only unique within a peer, a bucket in the global context is addressed by a pair (NNID, BID). To achieve scalability, the GHT∗ must be able to split buckets and allocate new storage and network resources. As is intuitively clear, splitting one bucket into two implies changes in the AST, i.e. the tree must grow. The complementary operation, merging two buckets into one, forces the AST to shrink.

5.3.1 Bucket Splitting

The bucket splitting operation is triggered by the insertion of an object into an already-full bucket. The procedure consists of the following three steps:

• A new bucket is allocated. If the capacity exists on the local peer, the bucket is created there. Otherwise, the bucket is allocated either on another peer with free capacity, or a new peer is used.

• A pair of pivots is chosen from the objects of the overflowing bucket as detailed in Section 5.3.2.

• Objects from the overflowing bucket closer to the second pivot than to the first one are moved to the new bucket.

Figure 5.3: Splitting of a bucket in GHT.

Figure 5.3 illustrates splitting one bucket into two. First, two objects are selected from the original bucket as pivots p1 and p2. Then, the distances between the pivots and every object in the original bucket are computed. All objects closer to the pivot p2 are moved into a new bucket BID2. A new inner node with the two pivots is added into the AST.

5.3.2 Choosing Pivots

The specific pivot-choosing mechanism directly impacts the performance of the GHT∗ structure. However, the selection can be a time-consuming operation, typically requiring many distance computations. To smooth this process, the authors use an incremental pivot selection algorithm which is based on the hypothesis that the GHT structure performs better if the distance between pivots is large.

First, the first two objects inserted into an empty bucket become pivot candidates. Then, distances to the candidates are computed for every other inserted object. If at least one of these distances is greater than the distance between the current candidates, the new object replaces one of the candidates, so that the distance between the new pair of candidates is greater. After a sufficient number of insertions, the distance between the candidates is large with respect to the bucket dataset. However, the technique need not choose the most distant pair of objects. When the bucket overflows, the candidates become pivots and the split is executed.
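A sketch of this incremental candidate maintenance follows (Java, written against a generic distance function; the Metric interface and all names are illustrative placeholders, not the GHT* prototype code):

    // Generic metric distance, d(a, b), used by the sketches in this chapter.
    interface Metric<T> { double d(T a, T b); }

    // Incremental pivot-candidate selection: keep a pair of objects whose
    // mutual distance only grows as new objects are inserted into the bucket.
    final class PivotCandidates<T> {
        private final Metric<T> metric;
        private T c1, c2;                 // current pivot candidates
        private double candidateDist;     // d(c1, c2)

        PivotCandidates(Metric<T> metric) { this.metric = metric; }

        void offer(T o) {
            if (c1 == null) { c1 = o; return; }
            if (c2 == null) { c2 = o; candidateDist = metric.d(c1, c2); return; }
            double d1 = metric.d(c1, o), d2 = metric.d(c2, o);
            if (d1 > candidateDist && d1 >= d2) {       // replace c2, keeping the larger pair distance
                c2 = o; candidateDist = d1;
            } else if (d2 > candidateDist) {            // replace c1
                c1 = o; candidateDist = d2;
            }
        }

        T firstPivot()  { return c1; }
        T secondPivot() { return c2; }
    }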

5.4 Insertion of Objects

Inserting an object oN starts at a peer by traversing its local AST from the root. For every inner node < p1, p2 >, the left branch is followed if d(p1, oN) ≤ d(p2, oN), otherwise the right branch is followed. Once a leaf node has been reached, a BID or NNID pointer is obtained. If it is a BID pointer, the inserted object is stored in the local bucket that the BID points to. Otherwise, the NNID pointer found is used to forward the request to the corresponding peer, where the insertion continues recursively until an AST leaf with a BID pointer is reached.
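The navigation rule can be sketched as follows (Java, reusing the illustrative Metric interface from the previous sketch; the node classes are simplified placeholders):

    // Insert navigation over the AST: descend left if the object is closer
    // (or equally close) to the first pivot of the inner node, otherwise right.
    abstract class AstNode<T> { }

    final class InnerNode<T> extends AstNode<T> {
        T p1, p2;                 // the two pivots
        int serial;               // serial number used to detect outdated views
        AstNode<T> left, right;
    }

    final class LeafNode<T> extends AstNode<T> {
        Integer bid;              // local bucket identifier, or null
        Integer nnid;             // remote peer identifier, or null
    }

    final class AstNavigator<T> {
        Metric<T> metric;         // see the Metric interface in the previous sketch

        LeafNode<T> locateLeaf(AstNode<T> root, T object) {
            AstNode<T> node = root;
            while (node instanceof InnerNode) {
                InnerNode<T> in = (InnerNode<T>) node;
                node = metric.d(in.p1, object) <= metric.d(in.p2, object) ? in.left : in.right;
            }
            return (LeafNode<T>) node;   // BID => store locally, NNID => forward the insert
        }
    }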

For an example, refer to Figure 5.1 again, where the AST is shown. To insert an object oN, the peer starts traversing the AST from the root. Assume that d(p1, oN) > d(p2, oN), so the right branch is taken, where the distances d1 = d(p5, oN) and d2 = d(p6, oN) are evaluated. If d1 ≤ d2, the left branch is taken, which is a leaf node with BID3. Therefore, the object oN is stored locally in the bucket denoted by BID3. In the opposite situation, i.e. d1 > d2, the right branch leading to a leaf with NNID1 is traversed. Reaching the leaf with an NNID, the insertion must be forwarded to the peer denoted by NNID1 and the insert operation continues there.

In order to avoid redundant distance computations when searching the AST on the other peer, the once-determined path in the original AST is forwarded as well. The path is encoded as a bit-string called BPATH, where each node is represented by one bit – “0” represents the left branch, “1” represents the right branch. Every bit in this path is also accompanied by the respective serial number of the inner node. This is used to recognize possible out-of-date entries and, if such entries are found, to update the AST with a more recent version (the mechanism is explained in Section 5.8).

When a BPATH is received by a peer, it helps to quickly traverse the AST, because the distance computations to pivots are not repeated. During this quick traversal, the only check is to see whether the serial number of the respective inner node equals the serial number stored in the BPATH. If not, the search resumes with the standard AST traversal, and the pivot distances are evaluated until the traversal is finished.

To clarify the concept, see Figure 5.1. A BPATH representing the traversal to the leaf node BID3 can be expressed as “1[2], 0[3]”. First, the right branch from the root (the first bit thus being one) is taken and the serial number of the root node is two (denoted by the number in brackets). Then, the left branch with serial number three (thus “0[3]” is the next item) is taken. Finally, reaching a leaf node, the traversal is finished.

5.5 Range Search

Range search for a query R(q, r) is processed as follows. By analogy to insertion, the evaluation of a range search operation in GHT∗ also starts by traversing the local AST of the peer which issued the query. However, a different traversal condition is used in every inner node < p1, p2 >, specifically:

d(p1, q) − r ≤ d(p2, q) + r     (5.1)

d(p1, q) + r ≥ d(p2, q) − r     (5.2)

The left subtree of the inner node is traversed if Condition 5.1 holds and the right subtree is traversed whenever Condition 5.2 holds. From the equations derived from Lemma 3.2.4 of Chapter 3, it is clear that both conditions may hold for a particular range search. Therefore, multiple paths may qualify and, finally, multiple leaf nodes may be reached.
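The traversal decision at an inner node can thus be expressed as two independent tests, both of which may succeed (a minimal helper sketch; the distances d(p1, q) and d(p2, q) are assumed to be computed beforehand):

    // Subtree selection for a range query R(q, r) at an inner node <p1, p2>.
    final class RangeTraversal {
        static boolean visitLeft(double dP1q, double dP2q, double r) {
            return dP1q - r <= dP2q + r;      // Condition 5.1
        }
        static boolean visitRight(double dP1q, double dP2q, double r) {
            return dP1q + r >= dP2q - r;      // Condition 5.2
        }
    }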

For all qualifying paths having an NNID pointer in their leaves, the query request is recursively forwarded (including the known BPATH) to the identified peers until a BID pointer is found in every leaf. If multiple paths point to the same peer, only one request with multiple BPATH attachments is sent. The range search condition is evaluated by the peers in every bucket determined by the BID pointers, together forming the response as a set of qualifying objects.

5.6 Nearest-Neighbors Search

In principle, there are two strategies for evaluating kNN queries. The first starts with a very large query radius, covering all the data in a given dataset, to identify the degree to which specific regions might contain searched neighbors. The information is stored in a priority stack (queue) so that the most promising regions are accessed first. As suitable objects are found, the search radius is reduced and the stack adjusted accordingly. Though this strategy never accesses regions which do not intersect the query region bounded by the distance from the query object to its k-th nearest neighbor, the processing of regions is strictly serial. On a single computer, the approach is optimal [33], but it is not convenient for distributed environments aiming at exploiting parallelism. The second strategy starts with a zero radius to locate the first region to explore and then extends the radius to locate other candidate regions if the result-set is still not complete. The nearest-neighbors search in the GHT∗ structure adopts the second approach.


The algorithm first searches for a bucket which has a high probability of containing nearest neighbors. In particular, it seeks the bucket in which the query object would be stored using an insert operation. The accessed bucket’s objects are sorted according to their distances with respect to the query object q. Assume there are at least k objects in the bucket, so that the first k objects, the objects with the shortest distances to q, are candidates for the result-set. However, there may be other objects in different buckets that are closer to the query object than some of the candidates. In order to check this, a range search is issued with the radius equal to the distance of the k-th candidate. In this way, a set of objects is obtained which always has cardinality greater than or equal to k. If all the retrieved objects are sorted and only the first k possessing the shortest distances are retained, the exact answer to the query is obtained.

If fewer than k objects are found during the search in the first bucket, another strategy must be applied, because the upper bound on the distance to the k-th nearest neighbor is unknown. The range search operation is once again executed, but the radius must be estimated. If enough objects are returned from the range query (at least k), the search is complete – the result is again the first k objects from the sorted result of the range search. Otherwise, the radius must be expanded and the search done again until enough objects are obtained. There are two possible strategies for estimating the radius: (1) the optimistic strategy, in which the number of distance computations is kept low but multiple incremental range searches might be performed in order to retrieve all necessary objects, and (2) the pessimistic strategy, which prefers bigger range radii at the expense of additional distance computations.

Optimistic strategy The objective is to minimize the costs, i.e. the number of buckets accessed and distance computations carried out, using a rather small radius, at the risk of more iterations being necessary if not enough objects are found. In the first iteration, the bounding radius of the candidates is used, i.e. the distance to the last candidate, even though there are fewer than k candidates. The optimistic strategy hopes that there will be enough objects in the other buckets within this radius. Let x be the number of objects returned from the last range query. If x ≥ k, the search is finished, because the result is guaranteed; otherwise, the radius is expanded by the factor 1 + (k − x)/k and the algorithm iterates again. The higher the number of missing objects, the more the radius is enlarged.
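The expansion loop can be sketched as follows (Java; rangeSearch is a placeholder standing for the distributed range query of Section 5.5 and is assumed to return the retrieved objects sorted by distance from q; a non-zero initial radius is assumed):

    // Optimistic kNN strategy: enlarge the range radius by the factor
    // 1 + (k - x)/k until at least k objects have been retrieved.
    abstract class OptimisticKnn<T> {
        /** Placeholder for the distributed range query R(q, r) of Section 5.5;
         *  expected to return the matching objects sorted by distance from q. */
        abstract java.util.List<T> rangeSearch(T q, double r);

        java.util.List<T> query(T q, int k, double initialRadius) {
            double radius = initialRadius;                // distance to the last local candidate
            java.util.List<T> result = rangeSearch(q, radius);
            while (result.size() < k) {
                int x = result.size();
                radius *= 1.0 + (double) (k - x) / k;     // enlarge more when more objects are missing
                result = rangeSearch(q, radius);
            }
            return result.subList(0, k);                  // the k nearest objects found
        }
    }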

Pessimistic strategy The estimated radius is chosen rather large, so that the probability of a next iteration is minimized, while risking excessive (though parallel) bucket accesses and distance computations. To estimate the radius, the distance between the pivots of inner nodes is used, because the algorithm presumes pivots are very distant. More specifically, the pessimistic strategy traverses the AST from the leaf up to the tree root, using the distance between the pivots of the current node as the range radius. Every iteration climbs up one level in the AST until the search terminates or the root is encountered. If there are still not enough objects retrieved, the maximum distance of the metric is used and all objects in the structure are examined.

5.7 Deletions and Updates of Objects

For simplicity, updates are not handled specifically. Instead, if the algorithm needs to update an object, it first deletes the previous instance of this object and inserts the new one.

The deletion of an object o takes place in two phases. First, a search is made for the particular peer and bucket containing the object being deleted. The insert traversal algorithm is used for this. More specifically, the algorithm searches for the leaf node in the AST containing the BID pointer b where object o would be inserted.

The bucket b is then searched to determine whether the object is really there. If not, the algorithm finishes, because the object is not present in the structure. Otherwise, the object is removed from the bucket b.

At this point, an object has been removed from the structure. However, if many objects are removed from buckets, the overall load factor of the GHT∗ structure would degrade. Many nearly-empty buckets would also worsen efficiency at the whole-system level. Therefore, an algorithm is provided to merge two buckets into one in order to increase the bucket load factor.

Figure 5.4: Removing a bucket pointer from the AST.

First, the algorithm must detect (after a deletion) that the bucket has become underfilled and needs to be merged. This can easily be implemented by, e.g., a minimal-load threshold for a bucket. Let Nb be the leaf node representing the pointer to the underfilled bucket b. A bucket to merge with the underfilled bucket must be found. The algorithm, as a rule, always merges the right bucket with the left one, because after a split the original bucket stays in the left and the new one goes to the right.

Let Np be the parent inner node of the node Nb. If the node Nb is a right sub-node of the node Np, then the algorithm reinserts all the objects from the underfilled bucket into the left subtree of node Np and removes node Np from the AST, shrinking the path from the root. Similarly, if Nb is a left sub-node, all the objects from the right branch are taken and reinserted into the left branch, and Np is removed from the AST. Possible bucket overflows are handled as usual. To allow other peers to detect changes in the AST, the serial numbers of all inner nodes in the subtree with root Np are incremented by one.

Figure 5.4 outlines the concept. We are removing the bucket BID3, so first we reinsert the data into the left subtree of its parent (the shaded node). For every object in BID3 we decide according to the pivots in the left subtree (specifically, the hatchmarked node) whether it goes to bucket BID1 or bucket BID2. Then we remove the leaf node with BID3 and, preserving the binary tree, we also remove the parent node. One can also see that the serial numbers of the affected nodes are incremented by one.


5.8 Image Adjustment

An important advantage of the GHT∗ structure is update independence. During object insertion, a peer can split an overflowing bucket without informing other nodes in the network. Similarly, deletions may merge buckets. Consequently, peers need not have their ASTs up-to-date with respect to the data, but the advantage is that the network is not flooded with many “adjustment” messages for every update. AST updates are thus postponed and actually done when the respective insertion, deletion, or search operations are executed.

An inconsistency in the ASTs is recognized on a peer that receives an operation request with the corresponding BPATH from another peer. In fact, if the BPATH derived from the AST of the current peer is longer than the received BPATH, this indicates that the sending peer has an out-of-date version of the AST and must be updated. The other possibility is an inconsistency between the serial numbers in the BPATH and the inner nodes of the AST. The current peer easily determines the subtree that is missing or outdated on the sending peer, because the root of this subtree is the last correct element of the received BPATH. Such a subtree is sent back to the peer through an Image Adjustment Message (IAM).

If multiple BPATHs are received by the current peer (which can occur in the case of range queries), several subtrees can be sent back through one IAM (covering all found inconsistencies). Naturally, the IAM process can also involve multiple peers. Whenever a peer finds an NNID in its AST leaf during the path expansion, the request must be forwarded to the located peer. This peer can also detect an inconsistency and respond with an IAM. This image adjustment message updates the ASTs of all previous peers, including the first peer starting the operation. This is a recursive procedure which guarantees that, for an insertion, deletion or a search operation, every involved peer is correctly updated.

An example of communication during a query execution is given in Figure 5.5. At the beginning, the peer with NNID1 starts to evaluate a query. According to its local AST, the query must be forwarded to peer NNID2. However, this peer detects that the BPATHs from the forwarded request are not complete – i.e. using the local AST of peer NNID2, the BPATHs are extended and new leaf nodes with NNID3, NNID4, and NNID5 are reached. Therefore, the request is forwarded to those peers and processed there. The peers were contacted by NNID2, so they send their query results back to peer NNID2. Finally, peer NNID2 passes the responses to peer NNID1 as the final result-set along with an image adjustment, which is represented by the respective subtrees of the local AST of peer NNID2.

Figure 5.5: Message passing during a query and image adjustments.

5.9 Logarithmic Replication Strategy

As explained previously, every inner node of the AST contains two pivots and the AST structure is present in a more or less accurate form on every peer. Therefore, the number of replicated pivots increases linearly with the number of peers used. In order to reduce replication, the authors propose a more economical strategy which achieves logarithmic replication among peers at the cost of a moderately increased number of forwarded requests.

Inspired by the lazy updates strategy of [40], the logarithmic replication scheme uses a slightly modified AST containing only the necessary number of inner nodes. More precisely, the AST on a specific peer stores only those nodes containing pointers to local buckets (i.e. leaf nodes with BID pointers) and all their ancestors. However, the resulting AST is still a binary tree, which substitutes every subtree leading exclusively to leaf nodes with NNID pointers by the leftmost leaf node of the subtree. The rationale for choosing the leftmost leaf node derives from the split strategy, which always retains the left node and adds the right one. Figure 5.6 illustrates this principle. In a way, the logarithmic AST can be seen as the minimum subtree of the fully updated AST. The search operation with the logarithmic replication scheme may require more forwarding (compared to the full replication scheme), but replication is significantly reduced.

Figure 5.6: Example of the logarithmic AST.

5.10 Joining the Peer-to-Peer Network

The GHT∗ scales up to process a large volume of data by utilizing more and more peers. In principle, such an extension can be handled in several ways. In the GRID infrastructure, for example, new peers are added by standard commands. In the prototype implementation, the authors use a pool of available peers known to every active peer. They do not use a centralized registering service. Instead, they exploit broadcast messaging to notify active peers about a new peer that has become available. When a new network node becomes available, the following actions occur:


• The new node with its NNID sends a broadcast message saying “I am here”. This message is received by each active peer in the network.

• The receiving peers add the announced NNID to their local pool of available peers.

When an active peer requires additional storage or computational resources, the structure is extended as follows:

• The active peer picks one item from its pool of available peers. An activation message is sent to the chosen peer.

• With another broadcast message, the chosen peer announces: “I am being used now”, so that other active peers can remove its NNID from their pools of available peers.

• The chosen peer initializes its own pool of available peers, creates a copy of the AST, and sends the caller the “Ready to serve” reply message.

The algorithm is illustrated in Figure 5.7, where the numbers represent the messages sent in the order they appear. The darker computer is a new peer that has just joined the network. It announces its presence by an “I am here” message (1) delivered to all other active peers. Then an active peer (at top left) needs a new peer. It contacts the shaded peer in order to activate it (2). The shaded peer broadcasts “I am being used now” to all others (3) and responds with “Ready to serve” to the first peer (4).

If a peer does not want to be activated, it might respond to the first message immediately, saying that it is not available any more. The requesting peer then removes it from its pool and continues with the next one.
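The whole exchange can be summarized by the following condensed Java sketch; the message names, the Network facade and the willingness flag are hypothetical simplifications, and only the order of the messages follows the protocol described above.

    enum MsgType { I_AM_HERE, ACTIVATE, I_AM_USED, READY_TO_SERVE, NOT_AVAILABLE }

    interface Network {
        void broadcast(MsgType type, int senderNnid);
        void send(int targetNnid, MsgType type, int senderNnid);
    }

    class Peer {
        final java.util.Set<Integer> pool = new java.util.HashSet<>();  // NNIDs of available peers
        final int myNnid;
        boolean willingToServe = true;
        Peer(int nnid) { this.myNnid = nnid; }

        void join(Network net) {                          // executed by a freshly started peer
            net.broadcast(MsgType.I_AM_HERE, myNnid);
        }
        void activateNewPeer(Network net) {               // executed by an overloaded active peer
            for (Integer nnid : pool) { net.send(nnid, MsgType.ACTIVATE, myNnid); return; }
        }
        void onMessage(Network net, MsgType type, int senderNnid) {
            switch (type) {
                case I_AM_HERE:     pool.add(senderNnid); break;     // remember the newcomer
                case I_AM_USED:     pool.remove(senderNnid); break;  // the peer became active elsewhere
                case NOT_AVAILABLE: pool.remove(senderNnid); break;
                case ACTIVATE:                                       // we were chosen by an active peer
                    if (!willingToServe) { net.send(senderNnid, MsgType.NOT_AVAILABLE, myNnid); break; }
                    net.broadcast(MsgType.I_AM_USED, myNnid);        // initialize the pool, copy the AST, ...
                    net.send(senderNnid, MsgType.READY_TO_SERVE, myNnid);
                    break;
                default: break;
            }
        }
    }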

Figure 5.7: New peer allocation using broadcast messages.

5.11 Leaving the Peer-to-Peer Network

As stated earlier, peers may want to leave the network. The proposed technique does not deal with the unexpected exit of peers, which may occur due to the unreliability of the network, operating system crashes among peers, etc. To recover from such situations, replication and fault tolerance mechanisms are required to preserve data even if part of the system goes down unexpectedly. However, this is stated by the authors of the GHT∗ to be a future research challenge and has not yet been addressed. Therefore, if a peer wants to leave the network, it must perform a clean-up first.

There are two kinds of peers – peers which store some data in their local buckets and peers which do not. Those which do not provide use of their storage may leave the system safely without causing problems and need not inform the others. However, peers which hold data must first ensure that data is not lost. In general, such a peer uses the deletion mechanism and reinserts the data again, but without offering its storage capacity to the network any longer. The peer thus gets rid of all its objects and does not receive new ones.
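A minimal sketch of this clean-up, assuming hypothetical GhtPeer and Bucket abstractions for the peer's local storage and the standard insert/delete operations, could look as follows.

    interface Bucket { Iterable<Object> objects(); }

    interface GhtPeer {
        Iterable<Bucket> localBuckets();
        void stopOfferingStorage();   // the peer no longer accepts new objects
        void delete(Object o);        // remove the object from the local bucket
        void insert(Object o);        // re-insert through the normal GHT* navigation
    }

    final class Departure {
        static void cleanUpAndLeave(GhtPeer self) {
            self.stopOfferingStorage();
            for (Bucket b : self.localBuckets())
                for (Object o : b.objects()) {
                    self.delete(o);
                    self.insert(o);   // the object is now stored on some other peer
                }
            // only now may the peer disconnect without losing any data
        }
    }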


Chapter 6

GHT* Evaluation

In this section, we report on our experience with the distributed index GHT∗ using our prototype implementation in Java. First, we describe our computing infrastructure and the datasets used for the evaluation – see Section 6.1. Section 6.2 presents the storage characteristics of the GHT∗ structure for different settings and replication strategies. The next section demonstrates the GHT∗ performance during evaluation of similarity queries. The global costs, which represent the expenses directly comparable to centralized structures, are observed in Section 6.3.1. In Section 6.3.2, we show the enhancement of distributed computing, i.e. the parallel costs, which represent the actual response time of the search system. In both sections, we show results of range and nearest-neighbors queries, which are then compared with each other in Section 6.3.3.

The final group of experiments concentrates on the scalability aspects of the GHT∗. The point we would most like to emphasize in this section is that, even with a huge and permanently growing dataset, the index distributed over a sufficient number of peers is able to maintain practically constant response times to similarity queries.

6.1 Datasets and Computing Infrastructure

We conducted our experiments using two real-life datasets. The first was a dataset of 45-dimensional vectors of color image features (labeled VEC) compared via the quadratic form distance function. The second dataset consisted of sentences from the Czech language corpus (labeled STR), with the edit distance function used to quantify sentence proximity. Both datasets contained 100,000 objects, vectors or sentences, respectively.


We used a local network of 100 workstations, which are publicly available for students. The computers are connected by a high-speed 100 Mbit switched network with access times of approximately 5 ms. Since the computers have enough memory, we used the simplest setting of the GHT∗ implementation, in which the buckets are implemented as unordered lists of objects stored in RAM. However, more advanced settings are possible, such as organizing the buckets by a centralized index, for example the M-tree or the D-index, and storing the data on disks. Such schemes would further improve the efficiency of the distributed index, but would also further complicate the evaluation of results.

The computers were not exclusively dedicated to our performance trials. In such an environment, it is practically impossible to maintain identical behavior for each participating computer, and the speed and actual response times of the computers may vary depending on their actual computational load. Therefore we do not report absolute response times but rather the number of distance computations to show CPU costs, the number of buckets accessed for I/O costs, and the number of messages used to indicate network communication costs.

6.2 Performance of Storing Data

In order to assess the properties of the GHT∗ storage, we have inserted all the 100,000 objects from both the datasets using the following four different allocation configurations, where nb denotes the number of buckets per peer and bs the bucket capacity:

nb    1      5      5      10
bs    4,000  1,000  2,000  1,000

In Figure 6.1, we report load factors for the increasing size of the STR dataset only, because experiments with the VEC dataset exhibit similar behavior. Observe that the load factor increases until a bucket split occurs, then it goes sharply down. Naturally, the effects of a new bucket activation become less significant as the size of the dataset and the number of buckets increase.

In sufficiently large datasets, the load factor was around 35% for the STR and 53% for the VEC dataset. The lower load for the STR data is mainly attributed to the variable size of the STR sentences – each vector in the VEC dataset was of the same length. In general, the overall load factor depends on the pivot selection method and on the distribution of objects in the metric space. It can also be observed from Figure 6.1 that the peer capacity settings, i.e. the number of buckets and the bucket size, do not significantly affect the overall bucket load.

Figure 6.1: Load factor for the STR dataset (load factor as a function of the dataset size for the four allocation configurations).

The second set of experiments evaluates the replication factor of the GHT∗ separately for the logarithmic and the full replication strategies. Figures 6.2a and 6.2b show the replication factors for increasing dataset sizes using different allocation configurations. We again report only results for the STR dataset because the experiments with the VEC dataset did not reveal any different dependencies.

Figure 6.2a concerns the full replication strategy. Though the observed replication is quite low, it grows in principle linearly with the increasing dataset size, because the complete AST is replicated on each computer node. In particular, the replication factor rises whenever a new peer is employed (after splitting) and then it goes down slowly as new objects are inserted, until another peer is activated. The increasing number of used (active) peers can be seen in Figure 6.3.

Figure 6.2: Replication factor for the STR dataset: (a) full replication scheme, (b) logarithmic replication scheme (replication factor as a function of the dataset size for the four allocation configurations).

Figure 6.2b reflects the same experiments, but using the logarithmic replication scheme. The replication factor is more than ten times lower than for the full replication strategy. Moreover, we can see the logarithmic tendency of the graphs for all the allocation configurations. Using the same allocation configuration (for example nb = 5, bs = 2,000), the replication factor after 100,000 inserted objects is 0.118 for the full replication scheme and only 0.005 for the logarithmic replication scheme. In this case, the replication using the logarithmic scheme is more than twenty times lower than for the full replication scheme.

Figure 6.3: Number of peers used to store the STR dataset (number of active peers as a function of the dataset size for the four allocation configurations).

Though the trends of the graphs are very similar, the specific instances depend on the applied allocation configuration. In particular, the more objects stored on one peer, the lower the replication factor. For example, the configuration nb = 5, bs = 1,000 (i.e. maximally 5,000 objects per peer) results in a higher replication than the configuration nb = 5, bs = 2,000 with maximally 10,000 objects per peer. Moreover, we can see that the configuration with one bucket per peer is significantly worse than the similar setting with 5 buckets per peer and 1,000 objects in one bucket. On the other hand, the other two settings with 10,000 objects per peer (either as 10 buckets with 1,000 objects or 5 buckets with 2,000 objects) achieve almost the same replication.

6.3 Performance of Similarity Queries

In order to study the performance of the GHT∗ for changing queries, we have measured the costs of range and nearest-neighbors queries for different sizes of query radii and different numbers of neighbors, respectively.

To achieve deterministic and reliable experimental results, we used the logarithmic replication scheme for all participating peers. We also used a constant number of buckets per peer and a constant bucket capacity. Specifically, every peer was capable of holding up to five buckets with a maximum of 1,000 objects per bucket.

All inputs for the graphs in this section were obtained by averaging the results of fifty queries with different (randomly chosen) query objects and the given search radius or number of neighbors, respectively.

6.3.1 Global Costs

A distributed structure uses the power of networked computers to speed up query evaluation by parallel execution. However, every participating peer must employ its resources and that naturally incurs some costs. In this section, we provide the total cost needed over all trials, i.e. the sum of costs for each peer employed during query evaluation.

In general, total costs are directly comparable to those of centralized indexes, because these represent the costs the distributed structure would need if run on a single computer. Of course, there are some additional costs due to the distributed nature of the algorithms. In particular, a centralized structure incurs no network communication costs.

Buckets Accessed (I/O costs) The first trial focused on relationships between the query size and the total number of buckets and peers accessed. Figure 6.4 reports these results separately for the VEC and STR datasets, for different radii of range queries together with the number of retrieved objects (divided by 100 to make the graphs easier to evaluate). If the radius increases, the number of peers accessed grows practically linearly, and the number of accessed buckets a bit faster. However, the number of retrieved objects satisfying the query, i.e., the result-set size, may grow exponentially. This is in accordance with the I/O behavior of centralized metric indexes such as the M-tree or the D-index on the global (not distributed) scale.

Figure 6.4: Average number of buckets, peers, and objects retrieved as a function of the radius (VEC and STR datasets; the result-set size is divided by 100).

We have also measured these characteristics for kNN queries, and the results are shown in Figure 6.5. We again report the number of buckets and peers accessed with respect to the increasing value of k. As should be clear, the value k also represents the number of objects retrieved. These trials once again reveal a behavior similar to centralized indexes – total costs are low for small values of k, but grow very rapidly as the number of neighbors increases.

Distance Computations (CPU costs) In the following experiments, we have concentrated on the total cost of the similarity queries measured by the number of distance computations. Specifically, Figure 6.6 shows the results for increasing radii in range queries. The total cost is the sum of all distance computations performed by every accessed peer in accessed buckets plus the “navigation” costs. The navigation cost is measured in terms of distance computations in the AST (shown separately in the graph). Since these costs are well below 1%, they can be neglected for practical purposes. Observe that total costs have once again been divided by 100.


Figure 6.5: Average number of buckets, peers, and objects retrieved as a function of k (VEC and STR datasets).

Figure 6.6: Total and AST distance computations as a function of the radius (VEC and STR datasets; total costs divided by 100).

In Figure 6.7, we show the total distance computation costs of kNN queries for different values of k. The results were obtained in the same way as for range queries, and for convenience we provide the AST computations as well. It can be seen that, even for the computationally more expensive nearest-neighbors queries, AST navigation costs are only marginal and can be neglected.

Compared to centralized indexes, the GHT∗ performs better than the sequential scan, but the M-tree and D-index achieve better results (see Section 3.3.3). However, the GHT∗ can perform distance computations in parallel, which is the main advantage of the distributed index. We elaborate on this issue in the next section.


Figure 6.7: Total and AST distance computations as a function of k (VEC and STR datasets; total costs divided by 100).

Messages Sent (communication cost) Algorithms for the evaluation of similarity queries in GHT∗ send messages whenever they need to access other peers. More specifically, if an NNID pointer for a peer is encountered in a leaf node during evaluation, a message is sent to that peer. These are termed request messages. Messages destined for the same peer (i.e. multiply-reached leaf nodes containing the same NNID) are clustered and sent together within one message. Figures 6.8 and 6.9 depict the total number of request messages sent by peers involved in a range and kNN search, respectively. We have also measured the number of messages that had to be forwarded because of improper addressing. This situation occurs when a request message arrives at a peer that does not evaluate the query in its local buckets and only passes (forwards) the message to a more appropriate peer. This cost is a little higher for the kNN algorithm, because its first phase needs to navigate to one particular bucket first.

Intuitively, the total number of (request) messages is strictly related to the number of peers accessed. This fact is confirmed by trials using both range and nearest-neighbors queries. We have also observed that, even with the logarithmic replication strategy, the average number of messages forwarded is below 15% of the total number of messages sent during query execution.

Figure 6.8: Average number of request and forward messages as a function of the radius (VEC and STR datasets; request counts divided by 10).

Figure 6.9: Average number of request and forward messages as a function of k (VEC and STR datasets).

Messages are specific to a distributed environment and therefore have no adequate counterpart in centralized structures. In general, however, the trials confirmed the fact that similarity queries are expensive, that is, the total execution costs are high.

6.3.2 Parallel Costs

The objective of this section is to report results using the distributed structure GHT∗, with an emphasis on parallel costs. As opposed to the total costs, these correspond to the actual response time of the GHT∗ index structure to execute similarity queries.

For our purposes, we define the parallel cost as the maximum of the serial costs over all accessed peers. For example, to measure the parallel distance computation cost during a range query, we gather the number of distance computations on each peer accessed during the query. The maximum of those values is the query's parallel cost, since the range query evaluation has practically no serial component (except for the search in the AST on the first peer, which is very low-cost and so can be neglected).

A different situation occurs during the execution of kNN queries, because the kNN search algorithm consists of two phases, which cannot be performed simultaneously. The parallel cost is therefore the sum of the parallel costs of the respective phases. As explained in Section 5.6, the first phase navigates to a single bucket seeking candidates for neighbors. The second phase consists of a range query, for which we have already defined the parallel cost. However, the second phase can be repeated when the number of objects retrieved is still smaller than k. Finally, the parallel cost of a nearest-neighbors query is the sum of the cost of the first phase plus the parallel cost of every needed second phase.
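For illustration, the following Java fragment sketches how these parallel costs can be computed from the per-peer measurements; the helper types are hypothetical and only the decomposition into phases follows the description above.

    import java.util.Collections;
    import java.util.List;

    final class ParallelCost {
        /** Range query: the maximum of the serial costs measured on the accessed peers. */
        static long range(List<Long> perPeerCosts) {
            return perPeerCosts.isEmpty() ? 0 : Collections.max(perPeerCosts);
        }
        /** kNN query: the cost of the first (navigation) phase plus the parallel cost
         *  of every subsequent range-like iteration, executed one after another. */
        static long kNN(long firstPhaseCost, List<List<Long>> secondPhaseIterations) {
            long cost = firstPhaseCost;
            for (List<Long> perPeerCosts : secondPhaseIterations)
                cost += range(perPeerCosts);
            return cost;
        }
    }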

Buckets Accessed (I/O costs) The parallel costs for range queries, measured as the maximal number of accessed buckets per peer, are summarized in Figure 6.10. Since the number of buckets per peer is bounded – our trials employed a maximum of five buckets per peer – the parallel cost remains stable at around 4.3 buckets per peer. For this reason, the parallel range query cost scales well with increasing query radius.

Figure 6.10: Parallel cost in accessed buckets as a function of the radius (VEC and STR datasets).

A nearest-neighbors query always requires one bucket access for the first phase. Then multiple second phases may be required and their costs are added to the resulting parallel cost. Figure 6.11 shows these results, with the number of iterations in the second phase of the algorithm represented by the lower curve. It can be seen that, for small values of k, only one iteration is needed and the cost is somewhere around the value 5.4, consisting of 1.0 for the initial bucket and 4.4 for the range query. As the value of k grows above the number of objects in one bucket, more iterations are needed. Obviously, each additional iteration represents a serial step of query execution, so the cost slightly increases, but it does not double, because the algorithm never accesses buckets which have already been processed. In any case, the number of iterations is not high and in our experiments at most two iterations were always sufficient.

Figure 6.11: Parallel cost in accessed buckets, together with the number of iterations, as a function of k (VEC and STR datasets).

Distance Computations (CPU costs) Parallel distance computations represented the major query execution cost in our trials, and can be considered an accurate approximation of the actual query response time. This is mainly thanks to the fact that the time to access buckets and send messages is practically negligible compared to the evaluation of the distance functions. This proved to be especially true because the computations of the edit distance and quadratic form metric functions were very time demanding – accessing a bucket in local memory costs microseconds, while a network communication takes tens of milliseconds.


We have applied a standard methodology: we have measured the number of distance computations evaluated by each peer, and taken the maximum of these values as the reported cost. Figure 6.12 shows results averaged for the same set of fifty randomly chosen query objects and a specific radius. Since the number of objects stored per peer is bounded (a maximum of five buckets per peer and 1,000 objects per bucket), the cost can never exceed this bound of 5,000 distance computations. Recall that we do not consider AST costs, which are of no practical significance. Thus the structure retains an essentially constant response time for a query radius of any size.

Figure 6.12: Parallel cost in distance computations as a function of the radius (VEC and STR datasets).

The situation is similar for kNN queries, but the sequential components of the search algorithm must be properly considered. The results are shown in Figure 6.13 and represent the parallel costs for different numbers of neighbors k, measured in terms of distance computations. It can be seen that the costs grow very quickly to a value of around 5,000 distance computations. This value represents the parallel cost of the range query plus the initial search for the first bucket. Some increase in distance computations with k around 800 can also be seen. This is caused by the added sequential phase of the algorithm, i.e., the next iteration. The increase is not dramatic, since only some additional buckets are searched to complete the result set to k objects. This is in accordance with the buckets accessed in parallel shown in Figure 6.11. It can be seen there that only one additional “parallel” bucket was searched during the second iteration, thus the increase in parallel distance computations may be at most 1,000 (the capacity of a bucket).

Figure 6.13: Parallel cost in distance computations as a function of k (VEC and STR datasets).

Messages Sent (communication cost) The parallel communication cost is a bit different from the previous cases, since we cannot compute it “per peer”. During the evaluation of a query, every peer can send messages to several other peers, but we can consider the cost of sending several messages to different peers equal to the cost of sending only one message to a specific peer, since a peer sends them all at once. Thus, the parallel communication cost consists of a chain of forwarded messages, the sequential passing of the request to other peers. The number of peers sequentially contacted during a search is usually called the hop count. In the GHT∗ algorithm, there can be several different “hop” paths. For our purposes, we have taken the longest hop path, i.e., the path with the maximal hop count, as the parallel communication cost.
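Assuming the forwarding relation recorded during a query forms a tree rooted at the initiating peer, the maximal hop count can be computed by a simple recursion such as the following Java sketch (the representation of the relation is an assumption of this example).

    import java.util.List;
    import java.util.Map;

    final class HopCount {
        /** Length of the longest chain of sequentially forwarded messages.
         *  forwardedTo maps a peer to the peers it sent (or forwarded) the request to. */
        static int maxHops(int peer, Map<Integer, List<Integer>> forwardedTo) {
            int max = 0;
            for (int target : forwardedTo.getOrDefault(peer, List.of()))
                max = Math.max(max, 1 + maxHops(target, forwardedTo));
            return max;
        }
    }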

Figures 6.14 and 6.15 present the number of hops during a range and kNN search, respectively. Our experimental trials show that parallel communication is essentially logarithmically proportional to the number of peers accessed (see Figure 6.4), a desirable property in any distributed structure. The time spent communicating can also be deduced from these graphs. However, it is hard to see the contribution of this cost to the overall response time of a query, since each peer first traverses its AST and forwards messages to the respective peers (if needed), and only then it begins to compute distances inside its buckets. So the communication time is only added to the time spent computing the distances in peers contacted subsequently, which can have only a few objects in their buckets. In this case, the overall response time is practically unaffected by communication costs.

Figure 6.14: Number of parallel messages (hop count) as a function of the radius (VEC and STR datasets).

Figure 6.15: Number of parallel messages (hop count) as a function of k (VEC and STR datasets).

6.3.3 Comparison of Range and Nearest-Neighbors Search Algorithms

In principle, the nearest-neighbors search can be solved by a range query, provided a specific radius is known. After a kNN query has been solved, it becomes trivial to execute the corresponding range query with a precisely measured radius, i.e., using the distance from the query object to the k-th retrieved object. However, such a radius will generally be unknown in advance, so kNN queries are typically more expensive. We have compared the costs, in terms of distance computations, of the original nearest-neighbors query execution with the costs of the respective range query with the exact radius. In what follows, we provide both the parallel and total costs measured according to the methodology used throughout this section.

Figure 6.16: Comparison of a kNN query and a range query returning k objects, as a function of k (VEC and STR datasets; total costs divided by 10).

The trials show that kNN query execution costs are always slightly higher than those of a comparable range query execution. In particular, the total costs are practically equal to those of the range query, mainly because the kNN algorithm never accesses the same bucket twice. The difference is caused by the fact that the estimated radius need not be optimal. A different situation can be observed for parallel costs, since the kNN search needs some sequential execution steps, thus diminishing the possibility for parallel execution. In Figure 6.16, the effects of accessing the first bucket during the first phase of the kNN algorithm can be clearly seen in the difference between the range and kNN parallel cost lines in the graphs. The costs of the second iteration become visible after k > 800, which further worsens the parallel response time of the nearest-neighbors query. However, the parallel response time is still comparable to that of the range query. It is practically stable and does not grow markedly.


6.4 Data Volume Scalability

In this section, we detail our tests of scalability of the GHT∗, i.e., the ability to adapt to the expanding dataset. To measure this experimentally, we have fixed the query parameters by choosing two distinct query radii and three different values for the number of nearest neighbors k. The same set of fifty randomly chosen query objects was employed during the experiment, with the graphs depicting average values. Moreover, we have gradually expanded the original dataset to 1,000,000 objects. The following results were obtained as measurements at particular stages of the incremental insertion. More specifically, we have measured intraquery parallelism costs after every block of 2,000 inserted objects.

We quantify the intraquery parallelism cost as the parallel response of a query measured in terms of distance computations. This is defined to be the maximum of the costs incurred on peers involved in the query, including navigation costs in the AST. Specifically, each accessed peer computes its internal cost as the sum of the computations in its local AST and the computations in its buckets visited during the evaluation. The intraquery cost is then computed as the maximum of the internal costs of all peers accessed during the evaluation. Note also that the intraquery parallelism is proportional to the response time of a query.

Figure 6.17: Parallel cost as a function of the dataset size for two different radii (VEC: radii 600 and 1,500; STR: radii 5 and 15).

The results summarized in Figure 6.17 show that intraquery parallelism remains very stable, independently of the dataset size. Thus the parallel search time, which is proportional to this cost, remains practically constant, which is to be expected from the fact that storage and computing resources are added gradually as the size of the dataset grows. Of course, the number of distance computations needed for traversing the AST grows with the size of the dataset. However, this contribution is not visible. The reason is that the AST growth is logarithmic, while the peer expansion is linear.

Figure 6.18: Parallel cost as a function of the dataset size for three different k (curves for 3 NN, 100 NN, and the corresponding range queries).

The nearest-neighbors results shown in Figure 6.18 exhibit similar behavior, only the absolute cost is a bit higher. This is incurred by the sequential steps of the nearest-neighbors search algorithm, consisting of locating the first bucket, followed by possibly multiple sequential iterations. However, the cost is still nearly constant, thus the query response remains unchanged even as the dataset grows in size.


Chapter 7

Scalability Comparison of Distributed Similarity Search

Very recently, two other distributed similarity searching structures based on the metric space approach were published. Contrary to the GHT∗ structure, they apply transformation strategies – the metric similarity search problem is transformed into a series of range queries executed on existing distributed keyword structures, namely the CAN [62] and the Chord [68] (see Section 4.3). By analogy, they are called the MCAN [24] and the M-Chord [57]. Each of the structures is able to execute similarity queries for any metric dataset, and they both exploit parallelism for query execution.

We have also modified the GHT∗ algorithms by changing the underlying partitioning principle from the generalized hyperplane to the ball partitioning (see Section 3.1). We refer to this technique as the VPT∗ and it is a fourth distributed metric-based index structure.

Since all the structures were implemented using the very same Java framework, an interesting question arises: What are the advantages and disadvantages of the individual approaches in terms of search costs and scalability for different real-life search problems?

In this chapter, we first provide a brief specification of the MCAN and M-Chord techniques in Sections 7.1 and 7.2, respectively. Since the VPT∗ technique is a modification of the GHT∗ structure, we only briefly report on its differences in Section 7.3. To make the GHT∗ better comparable with the other two techniques, we enhanced its algorithms to avoid some distance computations by using the pivot filtering technique explained in Section 3.2.5. The rest of this chapter then presents the results of the extensive experimental comparison of all the four aforementioned distributed metric index structures.


7.1 MCAN

In order to manage metric data, the MCAN [24] uses a pivot-based technique that maps data objects x ∈ D to an N-dimensional vector space R^N. Then, the CAN [62] Peer-to-Peer protocol is used for partitioning the space and for the internal navigation – see Section 4.3.1. Having a set of N pivots p1, . . . , pN selected from D, MCAN maps an object x ∈ D to the vector space by means of the following function F : D → R^N:

F(x) = (d(x, p1), d(x, p2), . . . , d(x, pN)).    (7.1)

The virtual vector space coordinates designate the object x placement within the MCAN structure. The CAN protocol divides the vector space into regions and assigns them to the participating peers. The object x is stored by the peer whose region contains F(x). Using L∞ as the distance function in the vector space, the mapping F is contractive, i.e. L∞(F(x), F(y)) ≤ d(x, y), which can be proved using the triangle inequality of the metric function d [24]. Thus, the algorithm for a Range(q, r) query involves only the regions that cover objects x for which L∞(F(x), F(q)) ≤ r. In other words, it accesses the regions that intersect the hypercube with side 2r centered at F(q) (see Figure 7.1).
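The mapping and the consequent region test can be illustrated by the following Java sketch; the Metric interface and the box representation of a CAN region are assumptions of this example and do not reflect the actual MCAN implementation.

    interface Metric<T> { double distance(T a, T b); }

    final class Mcan<T> {
        private final Metric<T> d;
        private final T[] pivots;                       // p_1, ..., p_N
        Mcan(Metric<T> d, T[] pivots) { this.d = d; this.pivots = pivots; }

        /** F(x) = (d(x, p_1), ..., d(x, p_N)) -- Equation 7.1 */
        double[] map(T x) {
            double[] f = new double[pivots.length];
            for (int i = 0; i < pivots.length; i++)
                f[i] = d.distance(x, pivots[i]);
            return f;
        }
        /** A region [lo, hi] has to be visited by Range(q, r) iff it intersects
         *  the hypercube of side 2r centered at F(q). */
        boolean mustVisit(double[] lo, double[] hi, T q, double r) {
            double[] fq = map(q);
            for (int i = 0; i < fq.length; i++)
                if (fq[i] + r < lo[i] || fq[i] - r > hi[i])
                    return false;                       // the cube misses the region in dimension i
            return true;
        }
    }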

In order to further reduce the number of evaluated distances, MCAN uses the additional pivot-based filtering according to Section 3.2.5. All peers use the same set of pivots: the N pivots from the mapping function F (Equation 7.1), plus additional pivots, since N is typically low. All the pivots are selected from a sample dataset using the incremental selection technique [8].

Routing in MCAN works in the same way as in the original CAN. Every peer maintains a coordinate-based routing table containing the network identifiers and the coordinates of its neighboring peers in the virtual R^N space. In every step, the routing algorithm passes the query to the neighboring peer that is geometrically the closest to the target point in the space. Given a dataset, the average number of neighbors per peer is proportional to the dimensionality N, while the average number of hops needed to reach a peer is inversely proportional to this value [62].


Figure 7.1: Example of MCAN range query

The insert operation

When inserting an object x ∈ D into MCAN, the initiating peer computes the distances between x and all pivots. These values are used for mapping x into R^N by Equation 7.1 and then the insertion request is forwarded (using the CAN navigation) to the peer that covers the value F(x). The receiving peer stores x and, if it reaches its storage capacity limit (or another defined condition), it executes a split. The peer's region is split into two parts, trying to divide the storage equally. One of the new regions is assigned to the newly activated peer and the other one replaces the original region.

Range search algorithm

The peer that initiates a Range(q, r) query first computes the distances between q and all the pivots. The CAN protocol is then employed in order to reach the region which covers F(q). If a peer visited during the routing process intersects the query area, the request is spread to all other involved peers using a multicast algorithm described in detail in [41, 63]. Every affected peer searches its data storage employing the pivot filtering mechanism and returns the answer directly to the initiator.

7.2 M-Chord

Similarly to the previous MCAN method, the M-Chord [57] approach also transforms the original metric space. The core idea is to map the dataspace into a one-dimensional domain and use this domain together with the Chord routing protocol [68].

In particular, this approach exploits the idea of the vector index method iDistance [39], which partitions the data space into clusters (Ci), identifies reference points (pi) within the clusters, and defines a one-dimensional mapping of the data objects according to their distances from the cluster reference point. Having a separation constant c, the iDistance key for an object x ∈ Ci is

idist(x) = d(pi, x) + i · c.

Figure 7.2a visualizes the mapping schema. Handling a Range(q, r) query, the space to be searched is specified by iDistance intervals for such clusters that intersect the query sphere – see an example in Figure 7.2b.

Figure 7.2: The iDistance principles: (a) the mapping of clusters C0, C1, C2 (with reference points p0, p1, p2) onto the iDistance domain; (b) the intervals that must be searched for a Range(q, r) query.


This method is generalized to metric spaces in the M-Chord. No coordinate system can be used to partition a general metric space; therefore, a set of n pivots p0, . . . , pn−1 is selected from a sample dataset and the space is partitioned according to these pivots. The partitioning is done in a Voronoi-like manner [36] (every object is assigned to its closest pivot).

Because the iDistance domain is to be used as the key space for the Chord protocol, the domain is transformed by a uniform order-preserving hash function h into the M-Chord domain of size 2^m. Thus, for an object x ∈ Ci, 0 ≤ i < n, the M-Chord key-assignment formula becomes:

m-chord(x) = h(d(pi, x) + i · c)    (7.2)
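A sketch of the key assignment in Java follows; it reuses the Metric interface from the MCAN sketch above, and the order-preserving hash h is a mere placeholder, since the real function depends on the data distribution and on the domain size 2^m.

    final class MChordKeys<T> {
        private final Metric<T> d;
        private final T[] pivots;           // cluster reference points p_0, ..., p_{n-1}
        private final double c;             // separation constant
        MChordKeys(Metric<T> d, T[] pivots, double c) { this.d = d; this.pivots = pivots; this.c = c; }

        /** m-chord(x) = h(d(p_i, x) + i*c), where C_i is the cluster of the closest pivot. */
        long key(T x) {
            int i = closestPivot(x);
            return h(d.distance(pivots[i], x) + i * c);
        }
        private int closestPivot(T x) {     // Voronoi-like cluster assignment
            int best = 0;
            for (int i = 1; i < pivots.length; i++)
                if (d.distance(x, pivots[i]) < d.distance(x, pivots[best])) best = i;
            return best;
        }
        private long h(double idist) {      // placeholder for the uniform order-preserving hash
            return Math.round(idist * 1000);
        }
    }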

Insert Algorithm

Having the dataspace mapped into the one-dimensional M-Chord domain, every active node of the system takes over responsibility for an interval of keys. The structure of the system is formed by the Chord circle [68]. This Peer-to-Peer protocol provides an efficient localization of the node responsible for a given search key – see the explanation in Section 4.3.2.

When inserting an object x ∈ D into the structure, the initiating node Nins computes the m-chord(x) key through Formula 7.2 and employs the Chord to forward a store request to the node responsible for the computed key (see Figure 7.3a).

The nodes store the data in a B+-tree storage according to their M-Chord keys. When a node reaches its storage capacity limit (or another defined condition), it executes a split. A new node is placed on the M-Chord circle, so that the requester's storage can be split evenly.

Range Search Algorithm

The node Nq that initiates the Range(q, r) query uses the iDistance pruning idea to determine several M-Chord intervals to be examined. The Chord protocol is then employed to reach the nodes responsible for the middle points of the intervals. The request is then spread to all nodes covering the particular interval (see Figure 7.3b).

Figure 7.3: The insert (a) and Range (b) schema of the M-Chord.

The iDistance pruning technique filters out all objects x ∈ Ci that fulfil |d(x, pi) − d(q, pi)| > r. When inserting an object x into M-Chord, the distances d(x, pi) are computed for all i, 0 ≤ i < n. These values are stored together with the object x, and the general metric filtering criterion (see Section 3.2.5) improves the pruning of the search space.
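This filtering criterion, which is shared by all four structures, boils down to a simple test over the precomputed distances; the following sketch is a minimal Java formulation of it (argument names are illustrative only).

    final class PivotFilter {
        /** Returns true when x can be safely discarded from the Range(q, r) answer
         *  without computing d(q, x): some pivot gives a lower bound larger than r. */
        static boolean canDiscard(double[] distXToPivots, double[] distQToPivots, double r) {
            for (int i = 0; i < distXToPivots.length; i++)
                if (Math.abs(distXToPivots[i] - distQToPivots[i]) > r)
                    return true;
            return false;      // filtering failed; d(q, x) must be evaluated
        }
    }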

7.3 VPT*

This structure is similar to the GHT∗ technique described in Chapter 5, only the underlying partitioning scheme is based on the ball partitioning of the metric space. The structure, similarly to the GHT∗, stores metric objects in buckets that are held by peers in a peer-to-peer network. Peers communicate by interchanging messages, pose queries and evaluate results. The core of the algorithm lies in the modified Address Search Tree (AST), which adopts the Vantage Point Tree [70].

The ball partitioning, which is used by the VPT∗ structure, can be seen in Figure 7.4. In general, this principle also allows us to divide a set I into two partitions I1 and I2. However, only one pivot o11 is selected from the set and the objects are divided by a radius r1. More specifically, if the distance between the pivot o11 and an object o ∈ I is smaller than or equal to the specified radius r1, i.e. if d(o11, o) ≤ r1, then the object belongs to partition I1. The object is assigned to I2 otherwise. This algorithm is used recursively to build a binary tree. Additionally, for every inserted object o, we store all the computed distances to pivots, e.g. for the first traversal step we store the distance d(o11, o), etc. These precomputed distances are then used for pivot filtering (see Section 3.2.5) to avoid some distance computations. The leaf nodes of the AST follow the same schema for addressing local buckets and remote peers as the original GHT∗.

Figure 7.4: Address Search Tree with the ball partitioning (the partitioned space with radii r1, r2, r3 and the corresponding tree with leaves BID 1, BID 2, BID 3 and an NNID pointer).

Range search

The search for a Range(q, r) query proceeds similarly to the GHT∗; only the traversal condition is different. For every inner node of the tree with pivot p and radius rp, we evaluate the following conditions:

d(p, q) − r ≤ rp    (7.3)

d(p, q) + r > rp    (7.4)

The subtree covering partition I1 (the objects inside the ball around p) must be traversed if Condition 7.3 holds, and the subtree covering partition I2 must be traversed whenever Condition 7.4 holds. It is clear that both conditions may qualify at the same time for a particular range search. Therefore, multiple paths may be followed and, finally, multiple leaf nodes may be reached.
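One inner node of the VPT∗ AST can thus be sketched in Java as follows, combining the insertion rule with the range traversal decisions of Conditions 7.3 and 7.4; the Metric interface is reused from the sketches above and the class is only illustrative.

    final class BallNode<T> {
        final T pivot;            // the single pivot of the inner node
        final double radius;      // the partitioning radius r_p
        BallNode(T pivot, double radius) { this.pivot = pivot; this.radius = radius; }

        /** Insertion: does the object belong to partition I1 (inside the ball)? */
        boolean belongsToI1(Metric<T> d, T o) {
            return d.distance(pivot, o) <= radius;
        }
        /** Range(q, r): Condition 7.3 -- the subtree covering I1 must be visited. */
        boolean visitI1(Metric<T> d, T q, double r) {
            return d.distance(pivot, q) - r <= radius;
        }
        /** Range(q, r): Condition 7.4 -- the subtree covering I2 must be visited. */
        boolean visitI2(Metric<T> d, T q, double r) {
            return d.distance(pivot, q) + r > radius;
        }
    }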

Similarly to the GHT∗, for all qualifying paths having an NNID pointer in their leaves, the query request is recursively forwarded to the identified peers until a BID pointer is found in every leaf. The range search condition is then evaluated by the peers in every bucket determined by the BID pointers. Finally, all qualifying objects form the query response set.

7.4 Comparison Background

All the compared systems are dynamic. Each structure maintains a set of available inactive nodes and employs these to split the overloaded nodes, although other splitting scenarios are possible as well. For the experiments, the systems consisted of up to 300 active network nodes. In fact, we have used the same pool of computers as for the evaluation of the GHT∗ structure in Chapter 6, only a bigger set was utilized. Each of the GHT∗ and VPT∗ peers maintained five buckets with a capacity of 1,000 objects each, and the MCAN and M-Chord peers had a storage capacity of 5,000 objects.

We selected the following significantly different real-life datasets to conduct the experiments on:

VEC 45-dimensional vectors of extracted color image features. The similarity function d for comparing the vectors is a quadratic-form distance [65]. The distribution of the dataset is quite uniform and such a high-dimensional data space is extremely sparse. Unfortunately, this dataset is different from the one used in Chapter 6, since we needed up to one million objects for the experiments in this section. However, the distance distributions are similar in both datasets.

TTL titles and subtitles of Czech books and periodicals collected from several academic libraries. These strings are of lengths from 3 to 200 characters and are compared by the edit distance [49] on the level of individual characters. The distance distribution of this dataset is skewed.

DNA protein symbol sequences of length sixteen. The sequences are compared by a weighted edit distance according to the Needleman-Wunsch algorithm [56]. This distance function has a very limited domain of possible values – the returned values are integers between 0 and 100.


Observe that none of these datasets can be efficiently indexed and searched by a standard vector data structure. If not stated otherwise, the stored data volume is 500,000 objects. When considering the scalability with respect to the growing dataset size, the whole datasets of 1,000,000 objects are used (900,000 for TTL). As for other settings that are specific to particular data structures, the MCAN uses 4 pivots to build the routing vector space and 40 pivots for filtering. The M-Chord uses 40 pivots as well. The GHT∗ and VPT∗ structures use a variable number of pivots according to the depth of the AST.

All the presented performance characteristics of query processing have been taken as an average over 100 queries with randomly chosen query objects.

7.5 Scalability with Respect to the Size of the Query

In the first set of experiments, we have focused on the systems' scalability with respect to the size of the processed query. Namely, we let the structures handle a set of Range(q, r) queries with growing radii r. The size of the stored data was 500,000 objects. The average load ratio of nodes for all the structures was 60–70%, which resulted in approximately 150 active nodes for each tested system.

We present the results of these experiments for all the three datasets. All graphs in this section represent the dependency of various measurements (vertical axis) on the range query radius r (horizontal axis), and the titles of the graphs identify the used dataset. For the VEC dataset, we varied the radii r from 200 to 2,000, and for the TTL and DNA datasets from 2 to 20.

In the first group of graphs, shown in Figure 7.5, we report on the relation between the query radius and the number of retrieved objects. As is intuitively clear, the bigger the radius, the higher the number of objects satisfying the query. Since we have used the same datasets, query objects and radii, all the structures return the same number of objects. We can see that the number of results grows exponentially with respect to the query radius for all the three datasets. Note that, for example, the biggest radius 2,000 in the VEC dataset selects almost 10,000 objects (2% of the whole database), and for the TTL dataset the biggest radius retrieves even more objects. Obviously, such big radii are usually not reasonable for applications (e.g., two titles with edit distance 20 differ a lot), but we provide the results in order to study the behavior of the structures also in these cases. On the other hand, smaller radii return reasonable numbers of objects; for instance, radius 6 results in approximately 30 objects in the DNA dataset, which is not clearly readable from the graphs.

Figure 7.5: Number of retrieved objects as a function of the range query radius (VEC, TTL and DNA datasets; identical for all structures).

The number of visited nodes is reported in Figure 7.6. More specifically, the graphs show the ratio of the number of nodes that are involved in a particular range query evaluation to the total number of active peers forming the structure. As mentioned earlier, the total number of active peers participating in the network was around 150; thus, a value of 20% in the graph means that approximately 30 peers were used to complete the results. We can see that the number of employed peers grows practically linearly with the size of the radius. The only exception is the GHT∗ algorithm, which visits all the participating nodes very soon as the radius grows. This is induced by the fact that the generalized hyperplane partitioning does not guarantee a balanced split, as opposed to the other three methods. Moreover, because we count all the nodes that evaluate distances as visited, the VPT∗ and the GHT∗ algorithms are a little bit handicapped. Recall that they need to compute distances to pivots during the navigation, and thus the nodes that only forward the query are also counted as visited.

Figure 7.6: Percentage of visited nodes as a function of the range query radius (VEC, TTL and DNA datasets; GHT∗, VPT∗, MCAN, M-Chord).

Note that the used dataset influences the number of visited nodes. For instance, the DNA metric function has a very limited set of discrete distance values; thus, both the native and transformation methods are not as efficient as for the VEC dataset and more peers have to be accessed. From this point of view, the M-Chord structure performs best for the VEC dataset and also for smaller radii in the DNA dataset, but it is outperformed by the MCAN algorithm for the TTL dataset.

The next group of experiments, depicted in Figure 7.7, shows the computational costs with respect to the query radius. We provide a pair of graphs for every dataset. The graphs on the left (a) report the total number of distance computations needed to evaluate a range query. This measure can be considered to be the query cost in centralized index structures. The graphs on the right (b) illustrate the parallel number of distance computations, i.e. the cost of a query in the distributed environment.

Since the distance computations are the most time consuming op-

117

Page 126: Scalable and Distributed Similarity Searchgtsat/collection/p2p proximity... · metric space similarity search with the power and high scalability of the peer-to-peer data networks

7. SCALABILITY COMPARISON OF DISTRIBUTED SIMILARITY SEARCH

0 20000 40000 60000 80000

100000 120000 140000 160000 180000 200000

0 500 1000 1500 2000range query radius

GHT*VPT*MCANM−Chord

tota

l dis

tanc

e co

mp. VEC

0 500

1000 1500 2000 2500 3000 3500 4000 4500

0 500 1000 1500 2000range query radius

VEC

GHT*VPT*MCANM−Chord

para

llel d

ista

nce

com

p.

0 20000 40000 60000 80000

100000 120000 140000 160000 180000

0 5 10 15 20range query radius

TTL

GHT*VPT*MCANM−Chord

tota

l dis

tanc

e co

mp.

0 500

1000 1500 2000 2500 3000 3500 4000 4500 5000

0 5 10 15 20range query radius

TTL

GHT*VPT*MCANM−Chord

para

llel d

ista

nce

com

p.

(a)

0 50000

100000 150000 200000 250000 300000 350000 400000

0 5 10 15 20range query radius

DNA

GHT*VPT*MCANM−Chord

tota

l dis

tanc

e co

mp.

(b)

0 1000 2000 3000 4000 5000 6000 7000 8000

0 5 10 15 20range query radius

DNA

GHT*VPT*MCANM−Chord

para

llel d

ista

nce

com

p.

Figure 7.7: The total (a) and parallel (b) number of distance compu-tations

118

Page 127: Scalable and Distributed Similarity Searchgtsat/collection/p2p proximity... · metric space similarity search with the power and high scalability of the peer-to-peer data networks

7. SCALABILITY COMPARISON OF DISTRIBUTED SIMILARITY SEARCH

As explained, the number of pivots used for filtering strongly affects its effectiveness: the more pivots we have, the more effective the filtering is and the fewer distances need to be computed. The MCAN and M-Chord structures use a fixed set of 40 pivots for filtering, as opposed to the GHT∗ and VPT∗, which use the pivots stored in the AST. Thus, objects in buckets at lower levels of the AST have more pivots available for filtering, and vice versa. Also, the GHT∗ partitioning implies two pivots per inner tree node, whereas a VPT∗ node contains only one pivot, resulting in half the number of pivots compared to the GHT∗. In particular, the GHT∗ has used 48 pivots in its longest branch and only 10 in the shortest one, while the VPT∗ has filtered using at most 18 and at least 5 pivots.
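To make the filtering step concrete, the following sketch shows the standard pivot filtering test based on the triangle inequality; the class and method names are our own illustration, and we assume that the distances d(o, p_i) of each object to the filtering pivots were stored when the object was inserted.

public final class PivotFilter {
    /**
     * Decides whether an object o can be discarded without computing d(q, o).
     * queryToPivot[i]  holds the precomputed distance d(q, p_i),
     * objectToPivot[i] holds the stored distance d(o, p_i),
     * radius is the range query radius r.
     */
    public static boolean canDiscard(float[] queryToPivot, float[] objectToPivot, float radius) {
        for (int i = 0; i < queryToPivot.length; i++) {
            // Triangle inequality: |d(q, p_i) - d(o, p_i)| <= d(q, o),
            // so if this lower bound exceeds r, the object cannot qualify.
            if (Math.abs(queryToPivot[i] - objectToPivot[i]) > radius) {
                return true;
            }
        }
        return false; // d(q, o) must be computed explicitly
    }
}

The more pivots are available, the larger the chance that at least one of them yields a lower bound above r, which is why the filtering power grows with the number of pivots.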

Observe the effects of filtering on the total distance computation graphs in Figure 7.7a. We can see that the M-Chord and MCAN structures achieve practically the same filtering for all the datasets. On the other hand, the VPT∗ index is always the worst, since it has the lowest number of pivots used for filtering. We can also see that the filtering was rather ineffective for the DNA dataset, where the structures have computed distances to up to twice as many objects as for the TTL and VEC datasets. Moreover, queries with bigger radii in the DNA dataset have to access about 60% of the whole database, which would be very slow in a centralized index.

Figure 7.7b illustrates the parallel computational costs of the query processing. We can see that the number of necessary distance computations is significantly reduced, which follows from the fact that the computational load is divided among the participating peers running in parallel. The GHT∗ structure has the lowest parallel distance computation cost and seems to be unaffected by the dataset used. However, its low parallel cost is counterbalanced by the high percentage of visited nodes (shown in Figure 7.6), which is in fact strictly correlated with the parallel distance computation cost for all the structures.

Note also that the increase of the parallel cost is bounded by the value of 5,000 distance computations – this is best visible for the TTL dataset. This is a straightforward implication of the fact that every node has only a limited storage capacity, i.e. if a peer holds at most 5,000 objects, it cannot evaluate more distance computations between the query and its objects. This seems to be in contradiction with the M-Chord graph for the DNA dataset, for which the following problem has arisen. Due to the small number of possible distance values of the DNA dataset, the M-Chord transformation resulted in the formation of “clusters” of objects mapped onto the same M-Chord key. Those objects had to be kept on one peer only and, thus, the capacity limit of 5,000 objects could be exceeded.

[Figure 7.8: The total number of messages (a) and the maximal hop count (b) as a function of the range query radius for the VEC, TTL and DNA datasets; series: GHT∗, VPT∗, MCAN, M-Chord.]

The last group of measurements in this section, depicted in Figure 7.8, reports on the communication costs, i.e. the traffic load of the underlying computer network. Since all the structures exploit the message passing paradigm and the amount of interchanged data is small (usually fitting in a few packets), we have measured the number of messages needed to solve a range query as the communication cost. By analogy, we show the total message cost, which can be interpreted as the overall load of the underlying network infrastructure, and the parallel cost represented by the maximal “hop count”. Recall that the hop count is the number of messages sent in a serial manner, i.e. the length of a sequence of messages forwarded from one node to another.

Since the GHT∗ and the VPT∗ count all nodes involved in navigation as visited (as explained earlier), the percentage of visited nodes (Figure 7.6) and the total number of messages (Figure 7.8a) are strictly correlated for these structures. The MCAN structure needs the lowest number of messages for small radii, but as the radius grows, the number of messages increases quickly. This comes from the fact that the MCAN range search algorithm uses multicast to spread the query and, thus, one peer may be contacted with a specific query request several times. However, every peer evaluates a particular request only once. For the M-Chord structure, we can see that the total cost is considerably high even for small radii, but it grows very slowly as the radius increases. In fact, the M-Chord needs to access at least one peer for every M-Chord cluster even for small range queries, see Section 7.2. Then, as the radius increases, adjacent peers within some of the clusters need to be contacted, which increases the total message cost.
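The rule that a peer evaluates a particular request only once can be realized, for example, by remembering the identifiers of queries that have already been processed locally; the following sketch is only an illustration of this idea with hypothetical names, not the actual MCAN code.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RangeQueryHandler {
    // Identifiers of range queries this peer has already evaluated on its local data.
    private final Set<Long> processedQueries = ConcurrentHashMap.newKeySet();

    /** Returns true if the local data should be searched for this query,
        false if the request is a duplicate and at most needs to be forwarded further. */
    public boolean shouldEvaluateLocally(long queryId) {
        // add() returns false when the identifier was already present in the set.
        return processedQueries.add(queryId);
    }
}

With such a check, repeated multicast deliveries of the same query add to the message count but not to the computational cost of the peer.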

The parallel costs, i.e. the maximal hop count, are practically constant for different radii for all the structures except the M-Chord, for which the maximal hop count grows. The increase is caused by the serial nature of the current algorithm for contacting the adjacent peers in particular clusters.

In summary, we can say that all the structures scale well with respect to the size of the radius. In fact, the parallel distance computation costs grow sub-linearly and they are bounded by the capacity limits of the peers. The parallel communication costs remain practically constant for the GHT∗, VPT∗ and MCAN structures and grow linearly for the M-Chord.


7.6 Scalability with Respect to the Size of Datasets

Let us now consider the systems’ scalability with respect to the growing volume of data stored in the structures. We have observed the performance of Range(q, r) queries on systems storing from 50,000 to 1,000,000 objects. We conducted these experiments on all datasets for the following radii: 500, 1,000 and 1,500 for the VEC dataset, and radii 5, 10 and 15 for the TTL and DNA datasets.

We include only one graph for each type of measurement if the other graphs exhibit the same trend, because presenting all the collected results would occupy too much space with only a limited contribution. The title of each graph in this section specifies the dataset used and the search radius r.

The number of retrieved objects – see, e.g., radius 10 for the TTL dataset in Figure 7.9 – grows precisely linearly because the data were inserted into the structures in random order.

[Figure 7.9: Number of retrieved objects as a function of the dataset size (×1000), TTL dataset for r = 10; identical for all structures.]

[Figure 7.10: Percentage of visited nodes as a function of the dataset size (×1000), VEC dataset for r = 1000; series: GHT∗, VPT∗, MCAN, M-Chord.]

Figure 7.10 depicts the percentage of nodes affected by the range query processing. For all the structures but the GHT∗ this value decreases because the data space becomes denser and, thus, the nodes cover smaller regions of the space. Therefore, the space covered by the involved nodes comes closer to the exact portion of the space covered by the query itself. As mentioned in Section 7.5, the GHT∗ partitioning is not balanced; therefore, the query processing is spread over a larger number of participating nodes.

Figure 7.11 presents the computational costs in terms of both the total and the parallel number of distance computations. As expected, the total costs (a) increase linearly with the stored data volume. This well-known trend, which corresponds to the costs of centralized solutions, is the main motivation for designing distributed structures. The graph exhibits practically the same trend for the M-Chord and MCAN structures since they both use a filtering mechanism based on a fixed set of pivots, as explained in Section 7.5. The total costs for the GHT∗ and the VPT∗ are slightly higher due to the dynamic sets of filter pivots.

[Figure 7.11: The total (a) and parallel (b) number of distance computations as a function of the dataset size (×1000), VEC dataset for r = 1000; series: GHT∗, VPT∗, MCAN, M-Chord.]

The parallel number of distance computations (Figure 7.11b) grows very slowly. For instance, the parallel costs for the GHT∗ increase by 50% while the dataset grows 10 times, and the M-Chord exhibits a 10% increase when the dataset size doubles from 500,000 to 1,000,000. The increase is caused by the fact that the involved nodes contain more of the relevant objects as the data space becomes denser. This corresponds with the observable correlation between this graph and the graph in Figure 7.10 – the fewer nodes the structure involves, the higher the parallel costs it shows. The transformation techniques, the MCAN and the M-Chord, obviously concentrate the relevant data on fewer nodes and thus have higher parallel costs. The noticeable graph fluctuations are caused by rather regular splits of overloaded nodes.

Figure 7.12 presents the same results for the DNA dataset. The pivot-based filtering performs less effectively for higher radii (the total costs are quite high) and it is more sensitive to the number of pivots. The distance function is discrete, with a relatively small variety of possible values. As mentioned in Section 7.5, for this dataset, the M-Chord mapping collisions may result in overloaded nodes that cannot be split. Then, the parallel costs in Figure 7.12b may exceed the split limit of 5,000 objects.

[Figure 7.12: The total (a) and parallel (b) number of distance computations as a function of the dataset size (×1000), DNA dataset for r = 15; series: GHT∗, VPT∗, MCAN, M-Chord.]

Figure 7.13 shows the communication costs in terms of the total number of messages (a) and the maximal hop count (b). The total message costs for the GHT∗ grow faster because it contacts a higher percentage of nodes. The M-Chord graphs indicate that the total message costs grow slowly, but the additional messages are mostly sent in a sequential manner, which negatively influences the hop count.

[Figure 7.13: The total number of messages (a) and the maximal hop count (b) as a function of the dataset size (×1000), TTL dataset for r = 10; series: GHT∗, VPT∗, MCAN, M-Chord.]

7.7 Number of Simultaneous Queries

In this section, we focus on the scalability of the systems with respect to the number of queries executed simultaneously. In other words, we measure the interquery parallelism [59] of the query processing.


In the conducted experiments, we have simultaneously executed groups of 10–100 queries, each issued from a different node. We have measured the overall parallel cost of the set of queries as the maximal number of distance computations performed on a single node of the system. Since the inter-node communication time costs are lower than the computational costs, this value can be considered a characterization of the overall response time. We have run these experiments for all datasets and have used the same query radii as in Section 7.6.

In order to establish a baseline, we have measured the sum of the parallel costs of the individual queries. The ratio of this value to the overall parallel cost characterizes the improvement achieved by the interquery parallelism, and we refer to this value as the interquery improvement ratio. This value can also be interpreted as the number of queries that can be handled by the systems simultaneously without slowing them down.
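Expressed as a formula (in our own notation, introduced only for this discussion), for a batch of simultaneously executed queries Q_1, ..., Q_n the interquery improvement ratio is

\[
  \mathit{improvement\ ratio} \;=\; \frac{\sum_{i=1}^{n} \mathrm{parallel\ cost}(Q_i)}{\mathrm{overall\ parallel\ cost}(Q_1, \ldots, Q_n)},
\]

where the overall parallel cost of the batch is the maximal number of distance computations performed on a single node while all n queries are being processed.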

Looking at Figure 7.14a, we can see the overall parallel costs for all the datasets and selected radii. The trend is identical for all the structures and, surprisingly, the actual values are very similar.

Therefore, the difference between the respective interquery improvement ratios, shown in the (b) graphs, is introduced mainly by the difference in the single-query parallel costs. The M-Chord and the MCAN handle multiple queries slightly better than the VPT∗ and better than the GHT∗.

The actual values of the improvement ratio for specific datasets are strongly influenced by the total number of distance computations spread over the nodes (see Figure 7.7a); therefore, the improvement is lower for DNA than for VEC.

7.8 Comparison Summary

Though all of the considered approaches have demonstrated strictly sub-linear scalability in all important aspects of similarity search for complex metric functions, the most essential lessons we have learned from the experiments can be summarized in the following table.


[Figure 7.14: The overall parallel costs (a) and the interquery improvement ratio (b) as a function of the number of simultaneous queries, for VEC (r = 1000), TTL (r = 10) and DNA (r = 15); series: GHT∗, VPT∗, MCAN, M-Chord.]


structure    single query    multiple queries
GHT∗         excellent       poor
VPT∗         good            satisfactory
MCAN         satisfactory    good
M-Chord      satisfactory    very good

In the table, the single query column expresses the ability of the corresponding structure to speed up the execution of one query. This is especially useful when the probability of concurrent query requests is very low (preferably zero), so only one query is executed at a time and the maximum amount of computational resources can be exploited. On the other hand, the multiple queries column expresses the ability of the structures to serve several queries simultaneously without degrading the performance by waiting.

We can see that there is no clear winner when considering both the single and the multiple query performance evaluation. In general, none of the considered structures performs poorly for single query execution, but the GHT∗ is certainly the most suitable for this purpose. However, it is also the least suitable structure for concurrent query execution – queries solved by the GHT∗ are practically executed one after the other. The M-Chord structure has the opposite behavior. It can serve several queries of different users in parallel with the least performance degradation, but it takes more time to evaluate a single query.


Chapter 8

Conclusions

Traditionally, search has been applied to structured (attribute-type) data, yielding records that exactly match the query. A more modern type of search, similarity search, is used in content-based retrieval for queries involving complex data types such as images, videos, time series, text documents or DNA sequences. Similarity search is based on approximate rather than exact relevance, using a distance metric that, together with the database, forms a mathematical metric space. The obvious advantage of similarity search is that the results can be ranked according to their relevance.

Many interesting metric space techniques improving the effectiveness of similarity query evaluation have been published in the literature. The problem of similarity search has also been thoroughly discussed in the recent book [74]. However, the currently prevalent centralized similarity search mechanisms are time-consuming and not scalable, and thus only suitable for relatively small data collections.

This dissertation thesis investigated the possibility of combining similarity indexing techniques with the power of distributed computing infrastructures. Scalable distributed paradigms, such as the GRID or peer-to-peer systems, offer practically unlimited computational and storage resources. However, the existing techniques cannot simply be run in such an environment; instead, they must be carefully redesigned and adjusted to cope with the requirements of efficient distributed processing. This thesis focused on scalable decentralized peer-to-peer techniques with an emphasis on metric-based similarity search.


8.1 Summary

In the first stage, we provide the necessary background for understanding metric spaces. We give examples of distance measures to illustrate the wide applicability of structures based on metric spaces and define several similarity queries. In the next chapter, we focus on the problem of scalability. We first show the techniques used to partition the metric space and we describe two advanced disk-memory based similarity index structures. We also show, using a few experimental results, that the scalability of the centralized structures is rather limited. Thus, we define our scalability challenge: “Design a metric index structure that has significantly better scalability than just linear.” Promising techniques that are able to scale better than linearly are based on distributed computing, specifically the peer-to-peer data networks. The next chapter thus surveys important existing techniques in this area.

The remaining parts present the contributions of this dissertation thesis. In Chapter 5, we propose a novel scalable and distributed similarity search structure based on metric spaces. It is fully scalable and can distribute the data over a practically unlimited number of independent peers, growing or shrinking as necessary. It has no centralized component and thus hot-spots are avoided. In addition, the parallel search time becomes practically constant for arbitrary data volumes, and thus we meet the requirement of our scalability challenge.

Chapter 6 presents experimental results for our novel structure on different real-life datasets. We study not only the performance of range and k-nearest neighbors queries, but also the scalability of the whole system. The last chapter shows some improvements made to the original structure and provides an extensive comparative study of our structures and two other recently published distributed similarity search techniques.

8.2 Research Directions

We have many ideas for improvements and challenges that we plan to address in the future.


• Tree imbalance is the main problem of the Address Search Tree structure, which is the core of inter-peer navigation. Thus, we would like to design a balanced version of the tree and, possibly, modify it into a multi-way search tree.

• During the experiments, we have observed that some peers are accessed more often and evaluate more distance computations than others. As expected, the load is not divided equally among the participating peers. We would like to adapt load-balancing techniques known from peer-to-peer networks to our similarity search structures.

• We would also like to extend the set of supported similarity queries. For example, we would like to process similarity joins in a distributed environment.

• Another goal is to verify the properties of the proposed structure in real, functional applications. We would like to develop a system for image retrieval that supports similarity queries.


Bibliography

[1] Karl Aberer, Philippe Cudré-Mauroux, Anwitaman Datta, Zoran Despotovic, Manfred Hauswirth, Magdalena Punceva, and Roman Schmidt. P-Grid: a self-organizing structured P2P system. SIGMOD Record, 32(3):29–33, September 2003.

[2] Karl Aberer and Manfred Hauswirth. An overview of peer-to-peer information systems. In Witold Litwin and Gerard Levy, editors, Distributed Data & Structures 4, Records of the 4th International Meeting (WDAS 2002), Paris, France, March 20-23, 2002, volume 14 of Proceedings in Informatics, pages 171–188. Carleton Scientific, 2002.

[3] Helmut Alt, Bernd Behrends, and Johannes Blömer. Approximate matching of polygonal shapes (extended abstract). In Proceedings of the Seventh Annual Symposium on Computational Geometry, pages 186–193. ACM Press, 1991.

[4] Giuseppe Amato, Fausto Rabitti, Pasquale Savino, and Pavel Zezula. Region proximity in metric spaces and its use for approximate similarity search. ACM Trans. Inf. Syst., 21(2):192–227, 2003.

[5] Farnoush Banaei-Kashani and Cyrus Shahabi. SWAM: a family of access methods for similarity-search in peer-to-peer data networks. In CIKM ’04: Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, pages 304–313. ACM Press, 2004.

[6] Michal Batko, Claudio Gennaro, Pasquale Savino, and Pavel Zezula. Scalable similarity search in metric spaces. In Proceedings of the Sixth Thematic Workshop of the EU Network of Excellence DELOS on Digital Library Architectures. To appear in LNCS, Springer, June 2004.

[7] Christian Böhm, Stefan Berchtold, and Daniel A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322–373, September 2001.

[8] Benjamin Bustos, Gonzalo Navarro, and Edgar Chávez. Pivot selection techniques for proximity searching in metric spaces. In Proc. of SCCC01, pages 33–40, 2001.

[9] Edgar Chávez, Gonzalo Navarro, Ricardo A. Baeza-Yates, and José Luis Marroquín. Searching in metric spaces. ACM Computing Surveys (CSUR), 33(3):273–321, September 2001.

[10] Paolo Ciaccia, Danilo Montesi, Wilma Penzo, and Alberto Trombetta. Imprecision and user preferences in multimedia queries: A generic algebraic approach. In Proceedings of the First International Symposium on Foundations of Information and Knowledge Systems, volume 1762 of Lecture Notes in Computer Science, pages 50–71. Springer-Verlag, 2000.

[11] Paolo Ciaccia and Marco Patella. Bulk loading the M-tree. In Proceedings of the 9th Australasian Database Conference (ADC’98), Perth, Australia, pages 15–26, 1998.

[12] Paolo Ciaccia and Marco Patella. The M2-tree: Processing complex multi-feature queries with just one index. In DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries, 2000.

[13] Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Matthias Jarke, Michael J. Carey, Klaus R. Dittrich, Frederick H. Lochovsky, Pericles Loucopoulos, and Manfred A. Jeusfeld, editors, VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 426–435. Morgan Kaufmann, 1997.


[14] Paolo Ciaccia, Marco Patella, and Pavel Zezula. Processing complex similarity queries with distance-based access methods. In Hans-Jörg Schek, Felix Saltor, Isidro Ramos, and Gustavo Alonso, editors, Advances in Database Technology - EDBT’98, 6th International Conference on Extending Database Technology, Valencia, Spain, March 23-27, 1998, Proceedings, volume 1377 of Lecture Notes in Computer Science, pages 9–23. Springer, 1998.

[15] Kenneth L. Clarkson. Nearest neighbor queries in metric spaces. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 609–617. ACM Press, 1997.

[16] Grégory Cobéna, Serge Abiteboul, and Amélie Marian. Detecting changes in XML documents. In IEEE International Conference on Data Engineering, San Jose, California, USA (ICDE’02), pages 41–52. IEEE Computer Society, 2002.

[17] Douglas Comer. The ubiquitous B-Tree. ACM Computing Surveys (CSUR), 11(2):121–137, 1979.

[18] Adina Crainiceanu, Prakash Linga, Johannes Gehrke, and Jayavel Shanmugasundaram. P-tree: a P2P index for resource discovery applications. In WWW Alt. ’04: Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pages 390–391, New York, NY, USA, 2004. ACM Press.

[19] Robert Devine. Design and implementation of DDH: A distributed dynamic hashing algorithm. In Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO), volume 730, pages 101–114, Chicago, 1993.

[20] Vlastislav Dohnal. Indexing Structures for Searching in Metric Spaces. PhD thesis, Faculty of Informatics, Masaryk University in Brno, Czech Republic, May 2004.

[21] Vlastislav Dohnal, Claudio Gennaro, Pasquale Savino, and Pavel Zezula. D-Index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1):9–33, 2003.


[22] Ronald Fagin. Combining fuzzy information from multiple systems. In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, 1996, Montreal, Canada, pages 216–226. ACM Press, 1996.

[23] Ronald Fagin. Fuzzy queries in multimedia database systems. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 1-3, 1998, Seattle, Washington, pages 1–10. ACM Press, 1998.

[24] Fabrizio Falchi, Claudio Gennaro, and Pavel Zezula. A content-addressable network for similarity search in metric spaces. In Proceedings of DBISP2P, pages 126–137, 2005.

[25] Christos Faloutsos, Ron Barber, Myron Flickner, Jim Hafner, Wayne Niblack, Dragutin Petkovic, and William Equitz. Efficient and effective querying by image content. Journal of Intelligent Information Systems (JIIS), 3(3/4):231–262, 1994.

[26] Volker Gaede and Oliver Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170–231, 1998.

[27] Prasanna Ganesan, Beverly Yang, and Hector Garcia-Molina. One torus to rule them all: multi-dimensional queries in P2P systems. In WebDB ’04: Proceedings of the 7th International Workshop on the Web and Databases, pages 19–24, New York, NY, USA, 2004. ACM Press.

[28] Claudio Gennaro, Pasquale Savino, and Pavel Zezula. Similarity search in metric databases through hashing. In Proceedings of the ACM Multimedia 2001 Workshops, pages 1–5. ACM Press, 2001.

[29] Gnutella home page. http://www.gnutella.com/.

[30] Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Michael J. Franklin, Bongki Moon, and Anastassia Ailamaki, editors, ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 3-6, 2002, pages 287–298. ACM, 2002.


[31] Antonin Guttman. R-Trees: A dynamic index structure for spatial searching. In Beatrice Yormark, editor, Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, pages 47–57. ACM Press, 1984.

[32] James L. Hafner, Harpreet S. Sawhney, William Equitz, Myron Flickner, and Wayne Niblack. Efficient color histogram indexing for quadratic form distance functions. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 17(7):729–736, July 1995.

[33] Gísli R. Hjaltason and Hanan Samet. Ranking in spatial databases. In Max J. Egenhofer and John R. Herring, editors, Advances in Spatial Databases, 4th International Symposium, SSD’95, Portland, Maine, USA, August 6-9, 1995, Proceedings, volume 951 of Lecture Notes in Computer Science, pages 83–95. Springer Verlag, 1995.

[34] Gísli R. Hjaltason and Hanan Samet. Incremental similarity search in multimedia databases. Technical Report CS-TR-4199, Computer Science Department, University of Maryland, College Park, November 2000.

[35] Gísli R. Hjaltason and Hanan Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems (TODS’03), 28(4):517–580, 2003.

[36] Gísli R. Hjaltason and Hanan Samet. Index-driven similarity search in metric spaces. ACM Trans. Database Syst., 28(4):517–580, 2003.

[37] Daniel P. Huttenlocher, Gregory A. Klanderman, and William J. Rucklidge. Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI’93), 15(9):850–863, 1993.

[38] H. V. Jagadish. Linear clustering of objects with multiple attributes. In SIGMOD ’90: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 332–342, New York, NY, USA, 1990. ACM Press.


[39] H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, and Rui Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst., 30(2):364–397, 2005.

[40] Theodore Johnson and Padmashree Krishna. Lazy updates for distributed search structure. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’93), volume 22(2), pages 337–346, 1993.

[41] Michael B. Jones, Marvin Theimer, Helen Wang, and Alec Wolman. Unexpected complexity: Experiences tuning and extending CAN. Technical Report MSR-TR-2002-118, Microsoft Research, December 2002.

[42] John L. Kelley. General Topology. D. Van Nostrand, New York, 1955.

[43] Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, 1984.

[44] George Kollios, Dimitrios Gunopulos, and Vassilis J. Tsotras. Nearest neighbor queries in a mobile environment. In Proceedings of the International Workshop on Spatio-Temporal Database Management, Edinburgh, Scotland, September 10-11, 1999 (STDBM’99), volume 1678 of Lecture Notes in Computer Science, pages 119–134. Springer, 1999.

[45] Flip Korn and S. Muthu Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In Proceedings of the 2000 ACM International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA (SIGMOD’00), pages 201–212. ACM, 2000.

[46] Brigitte Kröll and Peter Widmayer. Distributing a search tree among a growing number of processors. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data / SIGMOD ’94, Minneapolis, Minnesota, pages 265–276, 1994.

[47] Per-Åke Larson. Dynamic hashing. BIT (Nordisk tidskrift for informationsbehandling), 18(2):184–201, 1978.


[48] Dongwon Lee. Query Relaxation for XML Model. PhD thesis, University of California, Los Angeles, 2002. One chapter about metrics on XML documents (XML data trees).

[49] V. I. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8–17, 1965.

[50] Vladimir I. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8–17, 1965.

[51] Witold Litwin. Linear hashing: A new tool for file and table addressing. In Sixth International Conference on Very Large Data Bases, October 1-3, 1980, Montreal, Quebec, Canada, Proceedings, pages 212–223. IEEE Computer Society, 1980.

[52] Witold Litwin, Marie-Anne Neimat, and Donovan A. Schneider. RP*: A family of order preserving scalable distributed data structures. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile, pages 342–353, 1994.

[53] Witold Litwin, Marie-Anne Neimat, and Donovan A. Schneider. LH* - a scalable, distributed data structure. ACM Transactions on Database Systems (TODS’96), 21(4):480–525, 1996.

[54] Napster home page. http://www.napster.com/.

[55] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88, 2001.

[56] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.

[57] David Novak and Pavel Zezula. M-Chord: A scalable distributed similarity search structure. In Proceedings of the First International Conference on Scalable Information Systems (INFOSCALE 2006), Hong Kong, May 30 - June 1. IEEE Computer Society, 2006.


[58] J. A. Orenstein and T. H. Merrett. A class of data structures for associative searching. In PODS ’84: Proceedings of the 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 181–190, New York, NY, USA, 1984. ACM Press.

[59] M. Tamer Özsu and Patrick Valduriez. Distributed and parallel database systems. ACM Comput. Surv., 28(1):125–128, 1996.

[60] Michal Parnas and Dana Ron. Testing metric properties. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing (STOC’01), pages 276–285. ACM Press, 2001.

[61] R. Rammal, G. Toulouse, and M. A. Virasoro. Ultrametricity for physicists. Reviews of Modern Physics, 58(3):765–788, 1986.

[62] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content addressable network. In Proc. of ACM SIGCOMM 2001, pages 161–172, 2001.

[63] Sylvia Ratnasamy, Mark Handley, Richard Karp, and Scott Shenker. Application-level multicast using content-addressable networks. Lecture Notes in Computer Science, 2001.

[64] Thomas Seidl and Hans-Peter Kriegel. Efficient user-adaptable similarity search in large multimedia databases. In The VLDB Journal, pages 506–515. Morgan Kaufmann, 1997.

[65] Thomas Seidl and Hans-Peter Kriegel. Efficient user-adaptable similarity search in large multimedia databases. In The VLDB Journal, pages 506–515, 1997.

[66] Ioana Stanoi, Divyakant Agrawal, and Amr El Abbadi. Reverse nearest neighbor queries for dynamic databases. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 44–53, 2000.

[67] Ioana Stanoi, Mirek Riedewald, Divyakant Agrawal, and Amr El Abbadi. Discovery of influence sets in frequently updated databases. In The VLDB Journal, pages 99–108. Morgan Kaufmann, 2001.


[68] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of ACM SIGCOMM, pages 149–160. ACM Press, 2001.

[69] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks, 2002.

[70] Jeffrey K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4):175–179, 1991.

[71] Beverly Yang and Hector Garcia-Molina. Comparing hybrid peer-to-peer systems. In Proceedings of the Twenty-Seventh International Conference on Very Large Databases, pages 561–570, 2001.

[72] Congjun Yang and King-Ip Lin. An index structure for efficient reverse nearest neighbor queries. In Proceedings of the 17th International Conference on Data Engineering, April 2-6, 2001, Heidelberg, Germany (ICDE’01), pages 485–492. IEEE Computer Society, 2001.

[73] Peter N. Yianilos. Excluded middle vantage point forests for nearest neighbor search. In 6th DIMACS Implementation Challenge, ALENEX’99, Baltimore, MD, 1999.

[74] Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity Search: The Metric Space Approach, volume 32 of Advances in Database Systems. Springer-Verlag, 2005.
