
Searching in High-Dimensional Spaces—Index Structures for Improving the Performance of Multimedia Databases

CHRISTIAN BÖHM, University of Munich, Germany

STEFAN BERCHTOLD, stb ag, Germany

AND

DANIEL A. KEIM, AT&T Research Labs and University of Constance, Germany

During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography, and molecular biology. An important research issue in the field of multimedia databases is the content-based retrieval of similar multimedia objects such as images, text, and videos. However, in contrast to searching data in a relational database, a content-based retrieval requires the search of similar objects as a basic functionality of the database system. Most of the approaches addressing similarity search use a so-called feature transformation that transforms important properties of the multimedia objects into high-dimensional points (feature vectors). Thus, the similarity search is transformed into a search of points in the feature space that are close to a given query point in the high-dimensional feature space. Query processing in high-dimensional spaces has therefore been a very active research area over the last few years. A number of new index structures and algorithms have been proposed. It has been shown that the new index structures considerably improve the performance in querying large multimedia databases. Based on recent tutorials [Berchtold and Keim 1998], in this survey we provide an overview of the current state of the art in querying multimedia databases, describing the index structures and algorithms for an efficient query processing in high-dimensional spaces. We identify the problems of processing queries in high-dimensional space, and we provide an overview of the proposed approaches to overcome these problems.

Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; E.1 [Data]: Data Structures; F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity; G.1 [Mathematics of Computing]: Numerical Analysis; G.2 [Mathematics of Computing]: Discrete Mathematics; H.2 [Information Systems]: Database Management; H.3 [Information Systems]: Information Storage and Retrieval; H.4 [Information Systems]: Information Systems Applications

General Terms: Algorithms, Design, Measurement, Performance, Theory

Additional Key Words and Phrases: Index structures, indexing high-dimensional data, multimedia databases, similarity search

Authors' addresses: C. Böhm, University of Munich, Institute for Computer Science, Oettingenstr. 67, 80538 München, Germany; email: [email protected]; S. Berchtold, stb ag, Moritzplatz 6, 86150 Augsburg, Germany; email: [email protected]; D. A. Keim, University of Constance, Department of Computer & Information Science, Box: D 78, Universitätsstr. 10, 78457 Konstanz, Germany; email: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2001 ACM 0360-0300/01/0900-0322 $5.00

ACM Computing Surveys, Vol. 33, No. 3, September 2001, pp. 322–373.

1. INDEXING MULTIMEDIA DATABASES

Multimedia databases are of high importance in many application areas such as geography, CAD, medicine, and molecular biology. Depending on the application, multimedia databases need to have different properties and need to support different types of queries. In contrast to traditional database applications, where point, range, and partial match queries are very important, multimedia databases require a search for all objects in the database that are similar (or complementary) to a given search object. In the following, we describe the notion of similarity queries and the feature-based approach to process those queries in multimedia databases in more detail.

1.1. Feature-Based Processing of Similarity Queries

An important aspect of similarity queries is the similarity measure. There is no general definition of the similarity measure since it depends on the needs of the application and is therefore highly application-dependent. Any similarity measure, however, takes two objects as input parameters and determines a positive real number denoting the similarity of the two objects. A similarity measure is therefore a function of the form

δ : Obj × Obj → ℝ₀⁺.

In defining similarity queries, we have to distinguish between two different tasks, which are both important in multimedia database applications: ε-similarity means that we are interested in all objects whose similarity to a given search object is below a given threshold ε, and NN-similarity (nearest neighbor) means that we are only interested in the objects that are the most similar ones with respect to the search object.

Definition 1 (ε-Similarity, Identity). Two objects obj1 and obj2 are called ε-similar if and only if δ(obj2, obj1) < ε. (For ε = 0, the objects are called identical.)

Note that this definition is independent of database applications and just describes a way to measure the similarity of two objects.

Definition 2 (NN-Similarity). Two objects obj1 and obj2 are called NN-similar with respect to a database of objects DB if and only if ∀obj ∈ DB, obj ≠ obj1 : δ(obj2, obj1) ≤ δ(obj2, obj).

We are now able to formally define the ε-similarity query and the NN-similarity query.

Definition 3 (ε-Similarity Query, NN-Similarity Query). Given a query object objs, find all objects obj from the database of objects DB that are ε-similar (identical for ε = 0) to objs; that is, determine

{obj ∈ DB | δ(objs, obj) < ε}.

Given a query object objs, find the object(s) obj from the database of objects DB that are NN-similar to objs; that is, determine

{obj ∈ DB | ∀obj′ ∈ DB, obj′ ≠ obj : δ(objs, obj) ≤ δ(objs, obj′)}.
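The two query types of Definition 3 can be sketched directly over a plain list of feature vectors. This is a minimal illustration, not the paper's algorithm: δ is assumed here to be the Euclidean distance (the measure itself is application-dependent), and the names eps_query and nn_query are our own.

```python
import math

# Hypothetical similarity measure: Euclidean distance on feature vectors.
# (The survey leaves delta application-dependent; this is one common choice.)
def delta(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def eps_query(db, q, eps):
    # All objects whose distance to q is below the threshold eps.
    return [p for p in db if delta(q, p) < eps]

def nn_query(db, q):
    # All objects with minimal distance to q (ties allowed, as in
    # the first part of Definition 3's NN variant).
    best = min(delta(q, p) for p in db)
    return [p for p in db if delta(q, p) == best]

db = [(0.0, 0.0), (1.0, 0.0), (0.2, 0.1)]
print(eps_query(db, (0.0, 0.0), 0.5))  # -> [(0.0, 0.0), (0.2, 0.1)]
print(nn_query(db, (0.9, 0.0)))        # -> [(1.0, 0.0)]
```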

Fig. 1. Basic idea of feature-based similarity search.

The solutions currently used to solve similarity search problems are mostly feature-based. The basic idea of feature-based similarity search is to extract important features from the multimedia objects, map the features into high-dimensional feature vectors, and search the database of feature vectors for objects with similar feature vectors (cf. Figure 1). The feature transformation F is defined as the mapping of the multimedia object (obj) into a d-dimensional feature vector

F : Obj → ℝ^d.

The similarity of two objects obj1 and obj2 can now be determined as

δ(obj1, obj2) = δEuclid(F(obj1), F(obj2)).
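As a toy illustration of this pipeline (our own sketch, not from the survey): F below is a hypothetical 3-bin character histogram standing in for a real feature extractor such as a color histogram or Fourier vector, and object similarity is reduced to Euclidean distance in feature space.

```python
import math

# Hypothetical feature transformation F: Obj -> R^d.  Here an "object" is a
# string and F is a 3-bin character histogram -- a stand-in for real feature
# extractors such as color histograms or shape descriptors.
def F(obj, d=3):
    vec = [0.0] * d
    for ch in obj:
        vec[ord(ch) % d] += 1.0
    return vec

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Object similarity reduces to vector distance in feature space:
def delta(obj1, obj2):
    return euclid(F(obj1), F(obj2))

print(delta("abc", "abc"))  # -> 0.0 (identical feature vectors)
```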

Feature-based approaches are used in many application areas including molecular biology (for molecule docking) [Shoichet et al. 1992], information retrieval (for text matching) [Altschul et al. 1990], multimedia databases (for image retrieval) [Faloutsos et al. 1994; Seidl and Kriegel 1997], sequence databases (for subsequence matching) [Agrawal et al. 1993, 1995; Faloutsos et al. 1994], geometric databases (for shape matching) [Mehrotra and Gary 1993, 1995; Korn et al. 1996], and so on. Examples of feature vectors are color histograms [Shawney and Hafner 1994], shape descriptors [Mumford 1987; Jagadish 1991; Mehrotra and Gary 1995], Fourier vectors [Wallace and Wintz 1980], text descriptors [Kukich 1992], and so on. The result of the feature transformation is a set of high-dimensional feature vectors. The similarity search now becomes an ε-query or a nearest-neighbor query on the feature vectors in the high-dimensional feature space, which can be handled much more efficiently on large amounts of data than the time-consuming comparison of the search object to all complex multimedia objects in the database. Since the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions, it is essential to use appropriate multidimensional indexing techniques to achieve an efficient search of the data. Note that the feature transformation often also involves complex transformations of the multimedia objects such as feature extraction, normalization, or Fourier transformation. Depending on the application, these operations may be necessary to achieve, for example, invariance with respect to a scaling or rotation of the objects. The details of the feature transformations are beyond the scope of this survey. For further reading on feature transformations, the interested reader is referred to the literature [Wallace and Wintz 1980; Mumford 1987; Jagadish 1991; Kukich 1992; Shawney and Hafner 1994; Mehrotra and Gary 1995].

For an efficient similarity search it is necessary to store the feature vectors in a high-dimensional index structure and use the index structure to efficiently evaluate the distance metric. The high-dimensional index structure used must efficiently support

— point queries for processing identity queries on the multimedia objects;

— range queries for processing ε-similarity queries; and

— nearest-neighbor queries for processing NN-similarity queries.


Note that instead of using a feature transformation into a vector space, the data can also be directly processed using a metric space index structure. In this case, the user has to provide a metric that corresponds to the properties of the similarity measure. The basic idea of metric indexes is to use the given metric properties to build a tree that can then be used to prune branches in processing the queries. The basic idea of metric index structures is discussed in Section 5. A problem of metric indexes is that they use less information about the data than vector space index structures, which results in poorer pruning and thus poorer performance. One possibility to improve this situation is the FASTMAP algorithm [Faloutsos and Lin 1995], which maps the metric data into a lower-dimensional vector space and uses a vector space index structure for efficient access to the transformed data.

Due to their practical importance, in this survey we restrict ourselves to vector space index structures. We assume we have some given application-dependent feature transformation that provides a mapping of the multimedia objects into some high-dimensional space. A large number of index structures have been developed for efficient query processing in multidimensional space. In general, the index structures can be classified into two groups: data-organizing structures such as R-trees [Guttman 1984; Beckmann et al. 1990] and space-organizing structures such as multidimensional hashing [Otoo 1984; Kriegel and Seeger 1986, 1987, 1988; Seeger and Kriegel 1990], GRID-Files [Nievergelt et al. 1984; Hinrichs 1985; Krishnamurthy and Whang 1985; Ouksel 1985; Freeston 1987; Hutflesz et al. 1988b; Kriegel and Seeger 1988], and kd-tree-based methods (the kd-B-tree [Robinson 1981], the hB-tree [Lomet and Salzberg 1989, 1990; Evangelidis 1994], and the LSDh-tree [Henrich 1998]).

For a comprehensive description of most multidimensional access methods, primarily concentrating on low-dimensional indexing problems, the interested reader is referred to a recent survey presented in Gaede and Günther [1998]. That survey, however, does not tackle the problem of indexing multimedia databases, which requires an efficient processing of nearest-neighbor queries in high-dimensional feature spaces; therefore, it does not deal with nearest-neighbor queries and the problems of indexing high-dimensional spaces. In our survey, we focus on the index structures that have been specifically designed to cope with the effects occurring in high-dimensional space. Since hashing- and GRID-File-based methods do not play an important role in high-dimensional indexing, we do not cover them in this survey.1 The reason why hashing techniques are not used in high-dimensional spaces are the problems that arise in such spaces. To be able to understand these problems in more detail, in the following we discuss some general effects that occur in high-dimensional spaces.

1.2. Effects in High-Dimensional Space

A broad variety of mathematical effects can be observed when one increases the dimensionality of the data space. Interestingly, some of these effects are not of a quantitative but of a qualitative nature. In other words, one cannot grasp these effects by simply extending two- or three-dimensional experiences; rather, one has to think, for example, in at least 10 dimensions to even see the effect occurring. Furthermore, some of the effects are quite nonintuitive. A few of the effects are of pure mathematical interest, whereas others have severe implications for the performance of multidimensional index structures. Therefore, in the database world, these effects are subsumed under the term "curse of dimensionality." Generally speaking, the problem is that important parameters such as volume and area depend exponentially on the number of dimensions of the data space. Therefore, most index structures proposed so far operate efficiently only if the number of dimensions is fairly small. The effects are nonintuitive because we are used to dealing with three-dimensional spaces in the real world, but these effects do not occur in low-dimensional spaces. Many people even have trouble understanding spatial relations in three-dimensional spaces; certainly, no one can "imagine" an eight-dimensional space. Rather, we always try to find a low-dimensional analogy when dealing with such spaces. Note that there actually is no formal notion of a "high"-dimensional space. Nevertheless, when people speak about high-dimensional, they usually mean a dimension of about 10 to 16, or at least 5 or 6.

1 The only exception is a technique for searching approximate nearest neighbors in high-dimensional spaces that has been proposed in Gionis et al. [1999] and Ouksel et al. [1992].

Next, we list the most relevant effects and try to classify them:

—pure geometric effects concerning the surface and volume of (hyper)cubes and (hyper)spheres:
—the volume of a cube grows exponentially with increasing dimension (and constant edge length),
—the volume of a sphere grows exponentially with increasing dimension, and
—most of the volume of a cube is very close to the (d − 1)-dimensional surface of the cube;

—effects concerning the shape and location of index partitions:
—a typical index partition in high-dimensional spaces will span the majority of the data space in most dimensions and only be split in a few dimensions,
—a typical index partition will not be cubic; rather it will "look" like a rectangle,
—a typical index partition touches the boundary of the data space in most dimensions, and
—the partitioning of space gets coarser the higher the dimension;

—effects arising in a database environment (e.g., selectivity of queries):
—assuming uniformity, a reasonably selective range query corresponds to a hypercube having a huge extension in each dimension, and
—assuming uniformity, a reasonably selective nearest-neighbor query corresponds to a hypersphere having a huge radius in each dimension; usually this radius is even larger than the extension of the data space in each dimension.

Fig. 2. Spheres in high-dimensional spaces.

To be more precise, we present some of the listed effects in more depth and detail in the rest of this section.

To demonstrate how much we stick to our understanding of low-dimensional spaces, consider the following lemma. Consider a cubic-shaped d-dimensional data space of extension [0, 1]^d. We define the centerpoint c of the data space as the point (0.5, . . . , 0.5). The lemma, "Every d-dimensional sphere touching (or intersecting) the (d − 1)-dimensional boundaries of the data space also contains c," is obviously true for d = 2, as one can see from Figure 2. Spending some more effort and thought, we are able to prove the lemma for d = 3 as well. However, the lemma is definitely false for d = 16, as the following counterexample shows. Define a sphere around the point p = (0.3, . . . , 0.3). This point p has a Euclidean distance of √(d · 0.2²) = 0.8 from the centerpoint. If we define the sphere around p with a radius of 0.7, the sphere will touch (or intersect) all 15-dimensional surfaces of the space. However, the centerpoint is not included in the sphere. We have to be aware that effects like this are not only nice mathematical properties but also lead to severe conclusions for the performance of index structures.
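The counterexample can be checked numerically. The following sketch (our own) verifies that for d = 16 the sphere of radius 0.7 around p = (0.3, . . . , 0.3) reaches every boundary of the data space while excluding the centerpoint:

```python
import math

d = 16
c = [0.5] * d      # centerpoint of the data space [0, 1]^16
p = [0.3] * d      # center of the counterexample sphere
r = 0.7

# Distance from p to the centerpoint: sqrt(16 * 0.2^2) = 0.8 > r,
# so c lies outside the sphere.
dist_pc = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))

# The boundary x_i = 0 is 0.3 away from p and the boundary x_i = 1 is
# 0.7 away, so a radius of 0.7 touches or intersects every
# (d-1)-dimensional boundary of the space.
assert abs(dist_pc - 0.8) < 1e-12
assert dist_pc > r
assert all(min(p[i], 1.0 - p[i]) <= r for i in range(d))
```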

ACM Computing Surveys, Vol. 33, No. 3, September 2001.

Page 6: Searching in high dimensional spaces index structures for improving the performance of multimedia databases

Searching in High-Dimensional Spaces 327

Fig. 3. Space partitioning in high-dimensional spaces.

The most basic effect is the exponential growth of volume. The volume of a cube in a d-dimensional space is given by vol = e^d, where d is the dimension of the data space and e is the edge length of the cube. Now if the edge length is a number between 0 and 1, the volume of the cube will decrease exponentially with increasing dimension. Viewing the problem from the opposite side, if we want to define a cube of constant volume for increasing dimensions, the appropriate edge length will quickly approach 1. For example, in a 2-dimensional space of extension [0, 1]^d, a cube of volume 0.25 has an edge length of 0.5, whereas in a 16-dimensional space the edge length has to be 0.25^(1/16) ≈ 0.917.
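The edge-length calculation can be reproduced directly (edge_length is our own helper name):

```python
# Edge length e of a cube with volume v in dimension d: e = v**(1/d).
def edge_length(v, d):
    return v ** (1.0 / d)

print(edge_length(0.25, 2))   # -> 0.5
print(edge_length(0.25, 16))  # -> roughly 0.917
```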

The exponential growth of the volume has a serious impact on conventional index structures. Space-organizing index structures, for example, suffer from the "dead space" indexing problem. Since space-organizing techniques index the whole domain space, a query window may overlap part of the space belonging to a page that actually contains no points at all.

Another important issue is the space partitioning one can expect in high-dimensional spaces. Usually, index structures split the data space using (d − 1)-dimensional hyperplanes; for example, in order to perform a split, the index structure selects a dimension (the split dimension) and a value in this dimension (the split value). All data items having a value in the split dimension smaller than the split value are assigned to the first partition, whereas the other data items form the second partition. This process of splitting the data space continues recursively until the number of data items in a partition is below a certain threshold and the data items of this partition are stored in a data page. Thus, the whole process can be described by a binary tree, the split tree. As the tree is a binary tree, the height h of the split tree usually depends logarithmically on the number of leaf nodes, that is, data pages. On the other hand, the number d′ of splits for a single data page is on average

d′ = log2(N/Ceff(d)),

where N is the number of data items and Ceff(d) is the capacity of a single data page.2 Thus, we can conclude that if all dimensions are equally used as split dimensions, a data page has been split at most once or twice in each dimension and therefore spans a range between 0.25 and 0.5 in each of the dimensions (for uniformly distributed data). From that, we may conclude that the majority of the data pages are located at the surface of the data space rather than in the interior. In addition, this obviously leads to a coarse data space partitioning in single dimensions. However, from our understanding of index structures such as the R*-tree that had been designed for geographic applications, we are used to very fine partitions where the majority of the data pages are in the interior of the space, and we have to be careful not to apply this understanding to high-dimensional spaces. Figure 3 depicts the different configurations. Note that this effect applies to almost any index structure proposed so far because we only made assumptions about the split algorithm.

2 For most index structures, the capacity of a single data page depends on the dimensionality since the number of entries decreases with increasing dimension due to the larger size of the entries.
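To get a feeling for the magnitudes in the formula for d′, consider the following sketch (the values N = 1,000,000 and Ceff = 30 are our own illustrative assumptions):

```python
import math

# d' = log2(N / Ceff(d)): average number of splits per data page.
def splits_per_page(n_items, c_eff):
    return math.log2(n_items / c_eff)

# With 1,000,000 points and an effective page capacity of 30 entries,
# each data page has been split only about 15 times in total -- so in a
# 16-dimensional space most dimensions are split at most once.
print(splits_per_page(1_000_000, 30))  # -> roughly 15.0
```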

Additionally, not only do index structures show strange behavior in high-dimensional spaces, but the expected distribution of the queries is also affected by the dimensionality of the data space. If we assume a uniform data distribution, the selectivity of a query (the fraction of data items contained in the query) directly depends on the volume of the query. In the case of nearest-neighbor queries, the query affects a sphere around the query point that contains exactly one data item, the NN-sphere. According to Berchtold et al. [1997b], the radius of the NN-sphere increases rapidly with increasing dimension. In a data space of extension [0, 1]^d, it quickly reaches a value larger than 1 when increasing d. This is a consequence of the above-mentioned exponential relation between extension and volume in high-dimensional spaces.

Considering all these effects, we can conclude that if one builds an index structure using a state-of-the-art split algorithm, the performance will deteriorate rapidly when increasing the dimensionality of the data space. This has been realized not only in the context of multimedia systems, where nearest-neighbor queries are most relevant, but also in the context of data warehouses, where range queries are the most frequent type of query [Berchtold et al. 1998a, b]. Theoretical results based on cost models for index-based nearest-neighbor and range queries also confirm the degeneration of the query performance [Yao and Yao 1985; Berchtold et al. 1997b, 2000b; Beyer et al. 1999]. Other relevant cost models proposed before include Friedman et al. [1997], Cleary [1979], Eastman [1981], Sproull [1991], Pagel et al. [1993], Arya et al. [1995], Arya [1995], Theodoridis and Sellis [1996], and Papadopoulos and Manolopoulos [1997b].

1.3. Basic Definitions

Before we proceed, we need to introduce some notions and formalize our problem description. In this section we define our notion of the database and develop a twofold orthogonal classification for various neighborhood queries. Neighborhood queries can be classified either according to the metric that is applied to determine distances between points or according to the query type. Any combination of metrics and query types is possible.

1.3.1. Database. We assume that in our similarity search application, objects are feature-transformed into points of a vector space with fixed finite dimension d. Therefore, a database DB is a set of points in a d-dimensional data space DS. The data space DS is a subset of ℝ^d. Usually, analytical considerations are simplified if the data space is restricted to the unit hypercube DS = [0..1]^d.

Our database is completely dynamic. That means insertions of new points and deletions of points are possible and should be handled efficiently. The number of point objects currently stored in our database is abbreviated as n. We note that the notion of a point is ambiguous. Sometimes, we mean a point object (i.e., a point stored in the database). In other cases, we mean a point in the data space (i.e., a position), which is not necessarily stored in DB. The most common example for the latter is the query point. From the context, the intended meaning of the notion point is always obvious.

Definition 4 (Database). A database DB is a set of n points in a d-dimensional data space DS,

DB = {P0, . . . , Pn−1}, Pi ∈ DS, i = 0..n − 1, DS ⊆ ℝ^d.

1.3.2. Vector Space Metrics. All neighborhood queries are based on the notion of the distance between two points P and Q in the data space. Depending on the application to be supported, several metrics to define distances are applied. Most common is the Euclidean metric L2, defining the usual Euclidean distance function:

δEuclid(P, Q) = √(∑i=0…d−1 (Qi − Pi)²).


Fig. 4. Metrics for data spaces.

But other Lp metrics such as the Manhattan metric (L1) or the maximum metric (L∞) are also widely applied:

δManhattan(P, Q) = ∑i=0…d−1 |Qi − Pi|,
δMax(P, Q) = max{|Qi − Pi|}.

Queries using the L2 metric are (hyper)sphere shaped. Queries using the maximum metric or Manhattan metric are hypercubes and rhomboids, respectively (cf. Figure 4). If additional weights w0, . . . , wd−1 are assigned to the dimensions, then we define weighted Euclidean or weighted maximum metrics that correspond to axis-parallel ellipsoids and axis-parallel hyperrectangles:

δW.Euclid(P, Q) = √(∑i=0…d−1 wi · (Qi − Pi)²),
δW.Max(P, Q) = max{wi · |Qi − Pi|}.

Arbitrarily rotated ellipsoids can be defined using a positive definite similarity matrix W. This concept is used for adaptable similarity search [Seidl 1997]:

δ²ellipsoid(P, Q) = (P − Q)^T · W · (P − Q).
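The metrics above can be written down in a few lines. This is a minimal sketch with our own function names, not a library API:

```python
import math

def l2(p, q):                   # Euclidean metric
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

def l1(p, q):                   # Manhattan metric
    return sum(abs(b - a) for a, b in zip(p, q))

def lmax(p, q):                 # maximum metric
    return max(abs(b - a) for a, b in zip(p, q))

def weighted_lmax(p, q, w):     # axis-parallel hyperrectangle queries
    return max(wi * abs(b - a) for wi, a, b in zip(w, p, q))

p, q = (0.0, 0.0), (3.0, 4.0)
print(l2(p, q), l1(p, q), lmax(p, q))  # -> 5.0 7.0 4.0
```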

1.3.3. Query Types. The first classification of queries is according to the vector space metric defined on the feature space. An orthogonal classification is based on the question of whether the user defines a region of the data space or an intended size of the result set.

Point Query. The simplest query type is the point query. It specifies a point in the data space and retrieves all point objects in the database with identical coordinates:

PointQuery(DB, Q) = {P ∈ DB | P = Q}.

A simplified version of the point query determines only the Boolean answer of whether the database contains an identical point or not.

Range Query. In a range query, a query point Q, a distance r, and a metric M are specified. The result set comprises all points P from the database that have a distance smaller than or equal to r from Q according to metric M:

RangeQuery(DB, Q, r, M) = {P ∈ DB | δM(P, Q) ≤ r}.

Point queries can also be considered as range queries with a radius r = 0 and an arbitrary metric M. If M is the Euclidean metric, then the range query defines a hypersphere in the data space, from which all points in the database are retrieved. Analogously, the maximum metric defines a hypercube.

Window Query. A window query specifies a rectangular region in data space, from which all points in the database are selected. The specified hyperrectangle is always parallel to the axes ("window"). We regard the window query as a range query around the centerpoint of the window using a weighted maximum metric, where the weights wi represent the inverse of the side lengths of the window.
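The reduction of a window query to a weighted maximum-metric range query can be sketched as follows (our own helper names; the radius is 0.5 because each weight is the inverse side length):

```python
def weighted_max(p, q, w):
    return max(wi * abs(b - a) for wi, a, b in zip(w, p, q))

# A window [lo_i, hi_i] in each dimension becomes a range query around the
# window's centerpoint with weights w_i = 1 / side length and radius 0.5,
# since |x_i - c_i| <= side_i / 2  <=>  |x_i - c_i| / side_i <= 0.5.
def window_query(db, lo, hi):
    center = [(l + h) / 2.0 for l, h in zip(lo, hi)]
    w = [1.0 / (h - l) for l, h in zip(lo, hi)]
    return [p for p in db if weighted_max(p, center, w) <= 0.5]

db = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.2)]
print(window_query(db, (0.0, 0.0), (0.6, 0.6)))  # -> [(0.1, 0.1), (0.5, 0.5)]
```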

Nearest-Neighbor Query. The range query and its special cases (point query and window query) have the disadvantage that the size of the result set is not known in advance. A user specifying the radius r may have no idea how many results the query may produce. Therefore, it is likely that he falls into one of two extremes: either he gets no answers at all, or he gets almost all database objects as answers. To overcome this drawback, it is common to define similarity queries with a defined result set size, the nearest-neighbor queries.

The classical nearest-neighbor query returns exactly one point object as result: the object with the lowest distance to the query point among all points stored in the database.3 The only exception to this one-answer rule is due to tie effects. If several points in the database have the same (minimal) distance, then our first definition allows more than one answer:

NNQueryDeterm(DB, Q, M) = {P ∈ DB | ∀P′ ∈ DB : δM(P, Q) ≤ δM(P′, Q)}.

A common solution that avoids the exception to the one-answer rule uses nondeterminism. If several points in the database have minimal distance from the query point Q, an arbitrary point from the result set is chosen and reported as the answer. We follow this approach:

NNQuery(DB, Q, M) = SOME{P ∈ DB | ∀P′ ∈ DB : δM(P, Q) ≤ δM(P′, Q)}.

K-Nearest-Neighbor Query. If a user wants not just one closest point as the answer to her query, but rather a natural number k of closest points, she will perform a k-nearest-neighbor query. Analogously to the nearest-neighbor query, the k-nearest-neighbor query selects k points from the database such that no point among the remaining points in the database is closer to the query point than any of the selected points. Again, we have the problem of ties,

3 A recent extension of nearest-neighbor queries are closest-pair queries, which are also called distance joins [Hjaltason and Samet 1998; Corral et al. 2000]. This query type is mainly important in the area of spatial databases and, therefore, closest-pair queries are beyond the scope of this survey.

which can be solved either by nondeterminism or by allowing more than k answers in this special case:

kNNQuery(DB, Q, k, M) = {P0, . . . , Pk−1 ∈ DB | ¬∃P′ ∈ DB \ {P0, . . . , Pk−1} ∧ ¬∃i, 0 ≤ i < k : δM(Pi, Q) > δM(P′, Q)}.
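As a baseline for the index-based algorithms discussed later, a k-nearest-neighbor query can be answered naively by ranking the whole database by distance (a Python sketch of our own; ties at the cutoff are broken arbitrarily by the sort order, i.e., nondeterministically):

```python
import math

def knn_query(db, q, k):
    # rank all points by distance to q and keep the first k; no remaining
    # point is then closer to q than any of the selected points
    return sorted(db, key=lambda p: math.dist(p, q))[:k]
```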

A variant of k-nearest-neighbor queries are ranking queries, which require the user to specify neither a range in the data space nor a result set size. The first answer of a ranking query is always the nearest neighbor. The user then has the possibility of asking for further answers. Upon this request, the second nearest neighbor is reported, then the third, and so on. After examining an answer, the user decides whether further answers are needed. Ranking queries can be especially useful in the filter step of a multistep query processing environment. Here, the refinement step usually decides whether the filter step has to produce further answers.

Approximate Nearest-Neighbor Query. In approximate nearest-neighbor queries and approximate k-nearest-neighbor queries, the user also specifies a query point and a number k of answers to be reported. In contrast to exact nearest-neighbor queries, the user is not interested in exactly the closest points, but accepts points that are not much farther away from the query point than the exact nearest neighbor. The degree of inexactness can be specified by an upper bound on how much farther away the reported answers may be, compared to the exact nearest neighbors. This inexactness can be exploited to improve the efficiency of query processing.

1.4. Query Evaluation Without Index

All query types introduced in the previous section can be evaluated by a single scan of the database. As we assume that the database is stored densely in a contiguous block on secondary storage, all queries can be evaluated using a so-called sequential scan, which is faster than the access of small blocks spread over wide parts of secondary storage.


The sequential scan works as follows: the database is read in very large blocks, whose size is determined by the amount of main memory available to query processing. After a block has been read from disk, the CPU processes it and extracts the required information; then the next block is read in. Note that we assume that there is no parallelism between CPU and disk I/O for any query processing technique presented in this article.

Furthermore, we do not assume any additional information to be stored in the database. Therefore, the database has the following size in bytes:

sizeof(DB) = d · n · sizeof(float).

The cost of query processing based on the sequential scan is proportional to the size of the database in bytes.
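For concreteness, the following back-of-envelope computation (our own sketch; the transfer rate is an assumed illustrative value, not a figure from this survey) estimates the sequential scan cost directly from the database size:

```python
def scan_cost_seconds(n, d, mb_per_second=50.0, sizeof_float=4):
    # sizeof(DB) = d * n * sizeof(float); the scan cost is proportional to it
    size_bytes = d * n * sizeof_float
    return size_bytes / (mb_per_second * 1024 * 1024)

# e.g. one million 16-dimensional float vectors:
cost = scan_cost_seconds(1_000_000, 16)
```

Doubling either n or d doubles the cost, reflecting the linear dependence on sizeof(DB).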

1.5. Overview

The rest of the survey is organized as follows. We start by describing the common principles of multidimensional index structures and the algorithms used to build the indexes and process the different query types. Then we provide a systematic overview of the querying and indexing techniques that have been proposed for high-dimensional data spaces, describing them in a uniform way and discussing their advantages and drawbacks. Rather than describing the details of all the different approaches, we focus on the basic concepts and algorithms used. We also cover a number of recently proposed techniques dealing with optimization and parallelization issues. In concluding the survey, we try to stimulate further research activities by presenting a number of interesting research problems.

2. COMMON PRINCIPLES OF HIGH-DIMENSIONAL INDEXING METHODS

2.1. Structure

High-dimensional indexing methods are based on the principle of hierarchical clustering of the data space. Structurally,

they resemble the B+-tree [Bayer and McCreight 1977; Comer 1979]: the data vectors are stored in data nodes such that spatially adjacent vectors are likely to reside in the same node. Each data vector is stored in exactly one data node; that is, there is no object duplication among data nodes. The data nodes are organized in a hierarchically structured directory. Each directory node points to a set of subtrees. Usually, the structure of the information stored in data nodes is completely different from that of the directory nodes. In contrast, the directory nodes are uniformly structured among all levels of the index and consist of (key, pointer)-tuples. The key information differs between index structures: for B-trees, for example, the keys are ranges of numbers, and for an R-tree the keys are bounding boxes. There is a single directory node, called the root node, which serves as the entry point for query and update processing. The index structures are height-balanced; that is, the lengths of the paths between the root and all data pages are identical, but may change after insert or delete operations. The length of a path from the root to a data page is called the height of the index structure. The length of the path from a random node to a data page is called the level of the node. Data pages are on level zero. See Figure 5.

The uniform (key, pointer)-structure of the directory nodes also allows a wide variety of index structures to be implemented as extensions of a generic index structure, as done in the generalized search tree [Hellerstein et al. 1995]. The generalized search tree (GiST) provides a nice framework for a fast and reliable implementation of search trees. The main requirement for defining a new index structure in GiST is to define the keys and provide an implementation of four basic methods needed for building and searching the tree (cf. Section 3). Additional methods may be defined to enhance the performance of the index, which is especially relevant for similarity or nearest-neighbor queries [Aoki 1998]. An advantage of GiST is that the basic data structures and algorithms as well as main


Fig. 5. Hierarchical index structures.

portions of the concurrency and recovery code can be reused. It is also useful as a basis for theoretical analysis of indexing schemes [Hellerstein et al. 1997]. A recent implementation in a commercial object-relational system shows that GiST-based implementations of index structures can provide competitive performance while considerably reducing the implementation effort [Kornacker 1999].

2.2. Management

The high-dimensional access methods are designed primarily for secondary storage. Data pages have a data page capacity Cmax,data, defining how many data vectors can be stored in a data page at most. Analogously, the directory page capacity Cmax,dir gives an upper limit on the number of subnodes in each directory node. The original idea was to choose Cmax,data and Cmax,dir such that data and directory nodes fit exactly into the pages of secondary storage. However, in modern operating systems, the page size of a disk drive is considered a hardware detail hidden from programmers and users. Despite that, consecutive reading of contiguous data on disk is by orders of magnitude less expensive than reading at random positions. It is a good compromise to read data contiguously from disk in portions between a few kilobytes and a few hundred kilobytes. This is a kind of artificial paging with a user-defined logical page size. How to choose this logical page size properly is investigated in Sections 3 and 4. The logical page sizes for data and directory nodes are constant for most of the index structures presented in this section. The only exceptions are the X-tree and the DABS-tree. The X-tree defines a basic page size and allows directory pages to extend over multiples of the basic page size. This concept is called a supernode (cf. Section 6.2). The DABS-tree is an indexing structure giving up the requirement of a constant block size. Instead, an optimal block size is determined individually for each page during creation of the index. This dynamic adaptation of the block size gives the DABS-tree [Bohm 1998] its name.

All index structures presented here are dynamic: they allow insert and delete operations in O(log n) time. To cope with dynamic insertions, updates, and deletions, the index structures allow data and directory nodes to be filled below their capacity Cmax. In most index structures, the rule is applied that all nodes up to the root node must be filled to at least about 40%. This threshold is called the minimum storage utilization sumin. For obvious reasons, the root is generally exempt from this rule.

For B-trees, it is possible to analytically derive an average storage utilization, referred to in the following as the effective storage utilization sueff. In contrast, for high-dimensional index structures, the effective storage utilization is influenced by the specific heuristics applied in insert and delete processing. Since these indexing methods are not amenable to an analytical derivation of the effective storage utilization, it usually has to be determined experimentally.4

For convenience, we denote the product of the capacity and the effective storage

4 For the hB-tree, it has been shown in Lomet and Salzberg [1990] that under certain assumptions the average storage utilization is 67%.


Fig. 6. Corresponding page regions of an indexing structure.

utilization as the effective capacity Ceff of a page:

Ceff,data = sueff,data · Cmax,data

Ceff,dir = sueff,dir · Cmax,dir.

2.3. Regions

For efficient query processing it is important that the data are well clustered into the pages; that is, that data objects which are close to each other are likely to be stored in the same data page. Assigned to each page is a so-called page region, which is a subset of the data space (see Figure 6). The page region can be a hypersphere, a hypercube, a multidimensional cuboid, a multidimensional cylinder, or a set-theoretical combination (union, intersection) of several of the above. For most, but not all, high-dimensional index structures, the page region is a contiguous, solid, convex subset of the data space without holes. For most index structures, regions of pages in different branches of the tree may overlap, although overlaps lead to bad performance behavior and are avoided if possible, or at least minimized.

The regions of hierarchically organized pages must always be completely contained in the region of their parent. Analogously, all data objects stored in a subtree are always contained in the page region of the root page of the subtree. The page region is always a conservative approximation of the data objects and the other page regions stored in a subtree.

In query processing, the page region is used to cut branches of the tree from

further processing. For example, in the case of range queries, if a page region does not intersect with the query range, it is impossible for any region of a hierarchically subordinate page to intersect with the query range. Nor is it possible for any data object stored in this subtree to intersect with the query range. Only pages whose corresponding page region intersects with the query have to be investigated further. Therefore, a suitable algorithm for range query processing can guarantee that no false drops occur.

For nearest-neighbor queries a related but slightly different property of conservative approximations is important. Here, distances to a query point have to be determined or estimated. It is important that distances to approximations of point sets are never greater than the distances to the regions of subordinate pages and never greater than the distances to the points stored in the corresponding subtree. This is commonly referred to as the lower bounding property.

Page regions always have a representation, that is, an invertible mapping between the geometry of the region and a set of values storable in the index. For example, spherical regions can be represented as center point and radius using d + 1 floating point values, if d is the dimension of the data space. For efficient query processing it is necessary that the test for intersection with a query region, and the distance computation to the query point in the case of nearest-neighbor queries, can be performed efficiently.
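For spherical page regions, both required operations are cheap, as the following Python sketch illustrates (our own example; a region is represented by its center point and radius, i.e., d + 1 floating point values):

```python
import math

def sphere_intersects_range(center, radius, q, r):
    # the region intersects the query sphere iff the distance between
    # the two centers does not exceed the sum of the radii
    return math.dist(center, q) <= radius + r

def sphere_mindist(center, radius, q):
    # lower-bounding distance from q to any point of the region:
    # subtract the radius from the center distance (but never below 0)
    return max(0.0, math.dist(center, q) - radius)
```

The `max(0.0, ...)` clamp guarantees the lower bounding property also when q lies inside the region.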

Both the geometry and the representation of the page regions must be optimized. If the geometry of the page region is suboptimal, the probability increases that the corresponding page has to be accessed more frequently. If the representation of the region is unnecessarily large, the index itself gets larger, yielding worse efficiency in query processing, as we show later.

3. BASIC ALGORITHMS

In this section, we present some basic algorithms on high-dimensional index


structures for index construction and maintenance in a dynamic environment, as well as for query processing. Although some of the algorithms were published for a specific indexing structure, they are presented here in a more general way.

3.1. Insert, Delete, and Update

Insert, delete, and update are the operations that are most specific to the corresponding index structures. Despite that, there are basic algorithms capturing all actions common to all index structures. In the GiST framework [Hellerstein et al. 1995], the buildup of the tree via the insert operation is handled using three basic operations: Union, Penalty, and PickSplit. The Union operation consolidates information in the tree and returns a new key that is true for all data items in the considered subtree. The Penalty operation is used to find the best path for inserting a new data item into the tree by providing a number representing how bad an insertion into that path would be. The PickSplit operation is used to split a data page in case of an overflow.

The insert and delete operations of tree structures are usually the most critical operations, heavily determining the structure of the resulting index and the achievable performance. Some index structures require for a simple insert the propagation of changes towards the root or down the children, as, for example, in the cases of the R-tree and the kd-B-tree, and some do not, as, for example, the hB-tree. In the latter case, the insert/delete operations are called local operations, whereas in the first case, they are called nonlocal operations. Inserts are generally handled as follows.

—Search a suitable data page dp for the data object do.

—Insert do into dp.

—If the number of objects stored in dp exceeds Cmax,data, then split dp into two data pages.

—Replace the old description (the representation of the region and the background storage address) of dp in the parent node of dp by the descriptions of the new pages.

—If the number of subtrees stored in the parent exceeds Cmax,dir, split the parent and proceed similarly with the parent. It is possible that all pages on the path from dp to the root have to be split.

—If the root node has to be split, let the height of the tree grow by one. In this case, a new root node is created pointing to the two subtrees resulting from the split of the original root.

Heuristics individual to the specific indexing structure are applied for the following subtasks.

—The search for a suitable data page (commonly referred to as the PickBranch procedure): due to overlap between regions, and as the data space is not necessarily completely covered by page regions, there are generally multiple alternatives for the choice of a data page in most multidimensional index structures.

—The choice of the split (i.e., which of the data objects/subtrees are aggregated into which of the newly created nodes).

Some index structures try to avoid splits by a concept named forced reinsert. Some data objects are deleted from a node having an overflow condition and reinserted into the index. The details are presented later.

The choice of heuristics in insert processing may affect the effective storage utilization. For example, if a volume-minimizing algorithm allows unbalanced splitting in a 30:70 proportion, then the storage utilization of the index is decreased and the search performance is usually negatively affected.5 On the other hand, the presence of forced reinsert operations increases the storage utilization and the search performance.

5 For the hB-tree, it has been shown in Lomet and Salzberg [1990] that under certain assumptions even a 33:67 splitting proportion yields an average storage utilization of 64%.


ALGORITHM 1. (Algorithm for Exact Match Queries)

bool ExactMatchQuery(Point q, PageAdr pa) {
  int i;
  Page p = LoadPage(pa);
  if (IsDatapage(p))
    for (i = 0; i < p.num_objects; i++)
      if (q == p.object[i])
        return true;
  if (IsDirectoryPage(p))
    for (i = 0; i < p.num_objects; i++)
      if (IsPointInRegion(q, p.region[i]))
        if (ExactMatchQuery(q, p.sonpage[i]))
          return true;
  return false;
}

Some work has been undertaken on handling deletions from multidimensional index structures. Underflow conditions can generally be handled by three different actions:

—balancing pages by moving objects from one page to another,

—merging pages, and

—deleting the page and reinserting all objects into the index.

For most index structures it is a difficult task to find a suitable mate for balancing or merging actions. The only exceptions are the LSDh-tree [Henrich 1998] and the space-filling curves [Morton 1966; Finkel and Bentley 1974; Abel and Smith 1983; Orenstein and Merret 1984; Faloutsos 1985, 1988; Faloutsos and Roseman 1989; Jagadish 1990] (cf. Sections 6.3 and 6.7). All other authors either suggest reinserting or do not provide a deletion algorithm at all. An alternative approach might be to permit underfilled pages and to maintain them until they are completely empty. The presence of delete operations and the choice of underflow treatment can affect sueff,data and sueff,dir positively as well as negatively.

An update operation is viewed as a sequence of a delete operation followed by an insert operation. No special procedure has been suggested yet.

3.2. Exact Match Query

Exact match queries are defined as follows: given a query point q, determine whether q is contained in the database. Query processing starts with the root node, which is loaded into main memory. For all regions containing the point q, the function ExactMatchQuery() is called recursively. As overlap between page regions is allowed in most index structures presented in this survey, it is possible that several branches of the indexing structure have to be examined when processing an exact match query. In the GiST framework [Hellerstein et al. 1995], this situation is handled using the Consistent operation, which is the generic operation that needs to be reimplemented for different instantiations of the generalized search tree. The result of ExactMatchQuery is true if any of the recursive calls returns true. For data pages, the result is true if one of the points stored on the data page fits. If no point fits, the result is false. Algorithm 1 contains the pseudocode for processing exact match queries.

3.3. Range Query

The algorithm for range query processing returns a set of points contained in the query range as the result of the calling function. The size of the result set


ALGORITHM 2. (Algorithm for Range Queries)

PointSet RangeQuery(Point q, float r, PageAdr pa) {
  int i;
  PointSet result = EmptyPointSet;
  Page p = LoadPage(pa);
  if (IsDatapage(p))
    for (i = 0; i < p.num_objects; i++)
      if (IsPointInRange(q, p.object[i], r))
        AddToPointSet(result, p.object[i]);
  if (IsDirectoryPage(p))
    for (i = 0; i < p.num_objects; i++)
      if (RangeIntersectRegion(q, p.region[i], r))
        PointSetUnion(result, RangeQuery(q, r, p.childpage[i]));
  return result;
}

is not known in advance and may reach the size of the entire database. The algorithm is formulated independently of the applied metric. Any Lp metric, including metrics with weighted dimensions (ellipsoid queries [Seidl 1997; Seidl and Kriegel 1997]), can be applied if there exists an effective and efficient test for the predicates IsPointInRange and RangeIntersectRegion. Partial range queries (i.e., range queries where only a subset of the attributes is specified) can also be considered as regular range queries with weights (the unspecified attributes are weighted with zero). Window queries can be transformed into range queries using a weighted Lmax metric.

The algorithm (cf. Algorithm 2) performs a recursive self-call for all child pages whose corresponding page regions intersect with the query. The union of the results of all recursive calls is built and passed to the caller.

3.4. Nearest-Neighbor Query

There are two different approaches to processing nearest-neighbor queries on multidimensional index structures. One was published by Roussopoulos et al. [1995] and is referred to in the following as the RKV algorithm. The other, called the HS algorithm, was published in

Henrich [1994] and Hjaltason and Samet [1995]. Due to their importance for our further presentation, these algorithms are presented in detail and their strengths and weaknesses are discussed.

We start with the description of the RKV algorithm because it is more similar to the algorithm for range query processing, in the sense that a depth-first traversal through the indexing structure is performed. RKV is an algorithm of the "branch and bound" type. In contrast, the HS algorithm loads pages from different branches and different levels of the index in an order induced by their closeness to the query point.

Unlike range query processing, there is no fixed criterion, known a priori, for excluding branches of the indexing structure from processing in nearest-neighbor algorithms. Actually, the criterion is the nearest neighbor distance, but the nearest neighbor distance is not known until the algorithm has terminated. To cut branches, nearest neighbor algorithms have to use pessimistic (conservative) estimations of the nearest neighbor distance, which change during the run of the algorithm and approach the nearest neighbor distance. A suitable pessimistic estimation is the distance to the closest point among all points visited at the current state of execution (the so-called closest point


Fig. 7. MINDIST and MAXDIST.

candidate cpc). If no point has been visited yet, it is also possible to derive pessimistic estimations from the page regions visited so far.

3.4.1. The RKV Algorithm. The authors of the RKV algorithm define two important distance functions, MINDIST and MINMAXDIST. MINDIST is the actual distance between the query point and a page region in the geometrical sense, that is, the nearest possible distance of any point inside the region to the query point. The definition in the original proposal [Roussopoulos et al. 1995] is limited to R-tree-like structures, where regions are provided as multidimensional intervals (i.e., minimum bounding rectangles, MBRs) I with

I = [lb_0, ub_0] × · · · × [lb_d−1, ub_d−1].

Then, MINDIST is defined as follows.

Definition 5 (MINDIST). The distance of a point q to the region I, denoted MINDIST(q, I), is

MINDIST²(q, I) = Σ_{i=0}^{d−1} a_i², where

a_i = lb_i − q_i   if q_i < lb_i,
a_i = 0            if lb_i ≤ q_i ≤ ub_i,
a_i = q_i − ub_i   if ub_i < q_i.

An example of MINDIST is presented on the left side of Figure 7. In page regions pr1 and pr3, the edges of the rectangles define the MINDIST. In page region pr4, the corner defines MINDIST. As the query point lies in pr2, the corresponding MINDIST is 0. A similar definition can also be provided for differently shaped page regions, such as spheres (subtract the radius from the distance between center and q) or combinations. A similar definition can be given for the L1 and Lmax metrics, respectively. For a pessimistic estimation, some specific knowledge about the underlying indexing structure is required. One assumption, which is true for all known index structures, is that every page must contain at least one point. Therefore, we can define the following MAXDIST function determining the distance to the farthest possible point inside a region:

MAXDIST²(q, I) = Σ_{i=0}^{d−1} b_i², where

b_i = |lb_i − q_i|   if |lb_i − q_i| > |q_i − ub_i|,
b_i = |q_i − ub_i|   otherwise.

MAXDIST is not defined in the original paper, as it is not needed in R-tree-like structures. An example is shown on the right side of Figure 7. Being the greatest possible distance from the query point to a point in a page region, MAXDIST is not equal to 0 even if the query point is located inside the page region pr2.
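Both functions translate directly into code. The following Python sketch (our own illustration; the MBR is given as lower-bound and upper-bound vectors) returns the squared distances, as in the definitions above:

```python
def mindist2(q, lb, ub):
    # squared MINDIST of q to the MBR [lb_0, ub_0] x ... x [lb_d-1, ub_d-1]
    s = 0.0
    for qi, li, ui in zip(q, lb, ub):
        if qi < li:
            s += (li - qi) ** 2
        elif qi > ui:
            s += (qi - ui) ** 2
        # a dimension with lb_i <= q_i <= ub_i contributes 0
    return s

def maxdist2(q, lb, ub):
    # squared distance from q to the farthest corner of the MBR
    return sum(max(abs(li - qi), abs(qi - ui)) ** 2
               for qi, li, ui in zip(q, lb, ub))
```

For example, for q = (0, 0) and the MBR [1, 2] × [1, 2], MINDIST² = 2 and MAXDIST² = 8; for a query point inside the region, MINDIST² = 0 while MAXDIST² stays positive.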

In R-trees, the page regions are minimum bounding rectangles (MBRs), that is, rectangular regions where each surface hyperplane contains at least one data point. The following MINMAXDIST function provides a better (i.e., lower) but


Fig. 8. MINMAXDIST.

still conservative estimation of the nearest neighbor distance.

MINMAXDIST²(q, I) = min_{0≤k<d} ( |q_k − rm_k|² + Σ_{i≠k, 0≤i<d} |q_i − rM_i|² ),

where

rm_k = lb_k if q_k ≤ (lb_k + ub_k)/2, and ub_k otherwise;

rM_i = lb_i if q_i ≥ (lb_i + ub_i)/2, and ub_i otherwise.

The general idea is that every surface hyperarea must contain a point. The farthest point on every surface is determined, and among those the minimum is taken. For each pair of opposite surfaces, only the nearer surface can contain the minimum. Thus, it is guaranteed that a data object can be found in the region at a distance less than or equal to MINMAXDIST(q, I). MINMAXDIST(q, I) is the smallest distance providing this guarantee. The example in Figure 8 shows on the left side the considered edges. Among each pair of opposite edges of an MBR, only the edge closer to the query point is considered. The point yielding the maximum distance on each considered edge is marked with a circle. The minimum over all marked points of each page region defines the MINMAXDIST, as shown on the right side of Figure 8.

This pessimistic estimation cannot be used for spherical or combined regions, as these in general do not fulfill a property similar to the MBR property. In this case, MAXDIST(q, I), which is a worse estimation than MINMAXDIST, has to be used. All definitions presented using the L2 metric in the original paper [Roussopoulos et al. 1995] can easily be adapted to the L1 or Lmax metrics, as well as to weighted metrics.
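A direct Python transcription of the MINMAXDIST definition (our own sketch, using squared distances as in the formulas above) reads:

```python
def minmaxdist2(q, lb, ub):
    # for each dimension k, combine the nearer face in dimension k (rm_k)
    # with the farther corner coordinate in all other dimensions (rM_i)
    d = len(q)
    rm = [lb[k] if q[k] <= (lb[k] + ub[k]) / 2 else ub[k] for k in range(d)]
    rM = [lb[i] if q[i] >= (lb[i] + ub[i]) / 2 else ub[i] for i in range(d)]
    far2 = [(q[i] - rM[i]) ** 2 for i in range(d)]
    total = sum(far2)
    return min((q[k] - rm[k]) ** 2 + (total - far2[k]) for k in range(d))
```

For q = (0, 0) and the MBR [1, 2] × [1, 2], this yields MINMAXDIST² = 5, which lies between MINDIST² = 2 and MAXDIST² = 8, as required of a conservative but tighter estimate.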

The algorithm (cf. Algorithm 3) performs accesses to the pages of an index in a depth-first order ("branch and bound"). A branch of the index is always completely processed before the next branch is begun. Before child nodes are loaded and recursively processed, they are heuristically sorted according to their probability of containing the nearest neighbor. For the sorting order, the optimistic or the pessimistic estimation, or a combination thereof, may be chosen. The quality of the sorting is critical for the efficiency of the algorithm because, for different processing sequences, the estimation of the nearest neighbor distance may approach the actual nearest neighbor distance more or less quickly. Roussopoulos et al. [1995] report advantages for the optimistic estimation. The list of child nodes is pruned whenever the pessimistic estimation of the nearest neighbor distance changes. Pruning means discarding all child nodes having a MINDIST larger than the


ALGORITHM 3. (The RKV Algorithm for Finding the Nearest Neighbor)

float pruning_dist = INFINITE; /* the current distance for pruning branches;
                                  initialized before the start of the algorithm */
Point cpc;                     /* the closest point candidate; contains the
                                  nearest neighbor after the algorithm completes */

void RKV_algorithm(Point q, PageAdr pa) {
  int i; float h;
  Page p = LoadPage(pa);
  if (IsDatapage(p))
    for (i = 0; i < p.num_objects; i++) {
      h = PointToPointDist(q, p.object[i]);
      if (pruning_dist >= h) {
        pruning_dist = h;
        cpc = p.object[i];
      }
    }
  if (IsDirectoryPage(p)) {
    sort(p, CRITERION); /* CRITERION is MINDIST or MINMAXDIST */
    for (i = 0; i < p.num_objects; i++) {
      if (MINDIST(q, p.region[i]) <= pruning_dist)
        RKV_algorithm(q, p.childpage[i]);
      h = MINMAXDIST(q, p.region[i]);
      if (pruning_dist >= h)
        pruning_dist = h;
    }
  }
}

pessimistic estimation of the nearest neighbor distance. These pages are guaranteed not to contain the nearest neighbor because even the closest point in these pages is farther away than an already found point (lower bounding property). The pessimistic estimation is the lowest among all distances to points processed thus far and all results of the MINMAXDIST(q, I) function for all page regions processed thus far.

In Cheung and Fu [1998], several heuristics for the RKV algorithm with and without the MINMAXDIST function are discussed. The authors prove that any page which can be pruned by exploiting MINMAXDIST can also be pruned without that concept. Their conclusion is that the determination of MINMAXDIST should be avoided, as it causes an additional computational overhead.

Extending the algorithm to k-nearest-neighbor processing is a difficult task. Unfortunately, the authors make it easy by discarding the MINMAXDIST from pruning, sacrificing the performance gains obtainable from MINMAXDIST pruning. The kth lowest among all distances to points found thus far must be used. Additionally required is a buffer for k points (the k closest point candidate list, cpcl) which allows an efficient deletion of the point with the highest distance and an efficient insertion of a random point. A suitable data structure for the closest point candidate list is a priority queue (also known as a semisorted heap [Knuth 1975]).
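The maintenance of such a candidate list can be sketched in Python with the standard heapq module as a stand-in for the semisorted heap (our own illustration over a linear scan; heapq provides a min-heap, so distances are negated to obtain the required max-heap whose root plays the role of the pruning element; the arbitrary-element deletions discussed next are not covered by this sketch):

```python
import heapq
import math

def knn_candidates(db, q, k):
    # cpcl: max-heap of at most k candidates, stored as (-distance, point)
    heap = []
    for p in db:
        d = math.dist(p, q)
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:  # closer than the current pruning element
            heapq.heapreplace(heap, (-d, p))  # drop farthest, insert p
    # return the k candidates ordered by increasing distance
    return [p for _, p in sorted(heap, key=lambda e: -e[0])]
```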

Considering the MINMAXDIST imposes some difficulties, as the algorithm has to ensure that k points are closer to the query than a given region. For each region, we know that at least one point must have a distance less than or equal to MINMAXDIST. If the k-nearest-neighbor algorithm pruned a branch according to MINMAXDIST, it would assume that k points are positioned on the nearest surface hyperplane of the page region, whereas the MBR property only guarantees one such point. We further know that m points must have

ACM Computing Surveys, Vol. 33, No. 3, September 2001.


340 C. Bohm et al.

Fig. 9. The HS algorithm for finding the nearest neighbor.

a distance less than or equal to MAXDIST, where m is the number of points stored in the corresponding subtree. The number m could, for example, be stored in the directory nodes, or could be estimated pessimistically by assuming minimal storage utilization if the indexing structure provides storage utilization guarantees. A suitable extension of the RKV algorithm could use a semisorted heap with k entries. Each entry is a cpc, a MAXDIST estimation, or a MINMAXDIST estimation. The heap entry with the greatest distance to the query point q is used for branch pruning and is called the pruning element. Whenever new points or estimations are encountered, they are inserted into the heap if they are closer to the query point than the pruning element. Whenever a new page is processed, all estimations based on the appropriate page region have to be deleted from the heap. They are replaced by the estimations based on the regions of the child pages (or the contained points, if it is a data page). This additional deletion implies additional complexity because a priority queue does not efficiently support the deletion of elements other than the pruning element. All these difficulties are neglected in the original paper [Roussopoulos et al. 1995].

3.4.2. The HS Algorithm. The problems arising from the need to estimate the nearest-neighbor distance are elegantly avoided in the HS algorithm [Hjaltason and Samet 1995]. The HS algorithm does not access the pages in an order induced by the hierarchy of the indexing structure, such as depth-first or breadth-first. Rather, all pages of the index are accessed in the order of increasing distance to the query point. The algorithm is allowed to jump between branches and levels for processing pages. See Figure 9.

The algorithm manages an active page list (APL). A page is called active if its parent has been processed but not the page itself. Since the parent of an active page has been loaded, the corresponding region of every active page is known and the distance between region and query point can be determined. The APL stores the background storage address of the page as well as its distance to the query point. The representation of the page region is not needed in the APL. A processing step of the HS algorithm comprises the following actions.

—Select the page p with the lowest distance to the query point from the APL.

—Load p into main memory.

—Delete p from the APL.

—If p is a data page, determine whether one of the points contained in this page is closer to the query point than the closest point found so far (called the closest point candidate cpc).

—Otherwise: determine the distances to the query point for the regions of all child pages of p and insert all child


pages and the corresponding distances into the APL.

The processing step is repeated until the closest point candidate is closer to the query point than the nearest active page. In this case, no active page can contain a point closer to q than cpc, due to the lower bounding property; nor can any subtree of an active page contain such a point. As all other pages have already been examined, processing can stop. Again, the priority queue is the suitable data structure for the APL.
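The processing loop just described can be sketched as follows. The two-level tree, the dict-based node layout, and all names are our own simplifications, assuming Euclidean distance and MBR page regions:

```python
import heapq, math

def mindist(q, mbr):
    """MINDIST: distance from query point q to the nearest point of an
    MBR given as (lower, upper) bound tuples per dimension."""
    s = 0.0
    for qi, lo, hi in zip(q, mbr[0], mbr[1]):
        d = max(lo - qi, 0.0, qi - hi)
        s += d * d
    return math.sqrt(s)

def hs_nearest(q, root):
    """Best-first nearest-neighbor search over a tiny tree of dicts.
    Directory nodes: {'mbr': ..., 'children': [...]}; data pages:
    {'mbr': ..., 'points': [...]} (this node format is ours)."""
    apl = [(0.0, 0, root)]                # entries: (distance, tiebreak, page)
    tiebreak = 1
    cpc, cpc_dist = None, float("inf")
    # Stop as soon as the nearest active page is farther than the cpc.
    while apl and apl[0][0] < cpc_dist:
        dist, _, page = heapq.heappop(apl)
        if "points" in page:              # data page: update the cpc
            for p in page["points"]:
                d = math.dist(q, p)
                if d < cpc_dist:
                    cpc, cpc_dist = p, d
        else:                             # directory page: activate children
            for child in page["children"]:
                heapq.heappush(apl, (mindist(q, child["mbr"]), tiebreak, child))
                tiebreak += 1
    return cpc, cpc_dist

tree = {"mbr": ((0, 0), (10, 10)), "children": [
    {"mbr": ((0, 0), (4, 4)), "points": [(1, 1), (3, 2)]},
    {"mbr": ((6, 6), (10, 10)), "points": [(7, 8), (9, 9)]},
]}
print(hs_nearest((5, 4), tree))   # nearest point is (3, 2)
```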

For k-nearest-neighbor processing, a second priority queue with fixed length k is required for the closest point candidate list.

3.4.3. Discussion. We now compare the two algorithms in terms of their space and time complexity. In the context of space complexity, we regard the available main memory as the most important system limitation. We assume that the stack for recursion management and all priority queues are held in main memory, although one could also provide an implementation of the priority queue data structure suitable for secondary storage.

LEMMA 1 (Worst-Case Space Complexity of the RKV Algorithm). The RKV algorithm has a worst-case space complexity of O(log n).

For the proof see Appendix A.

As the RKV algorithm performs a depth-first pass through the indexing structure and no additional dynamic memory is required, the space complexity is O(log n). Lemma 1 is also valid for the k-nearest-neighbor search, if allowance is made for the additional space requirement of the closest point candidate list with a space complexity of O(k).

LEMMA 2 (Worst-Case Space Complexity of the HS Algorithm). The HS algorithm has a space complexity of O(n) in the worst case.

For the proof see Appendix B.

In spite of the order O(n), the size of the APL is only a very small fraction of the size of the data set, because the APL contains only the page address and the distance between the page region and the query point q. If the size of the data set in bytes is DSS, then we have a number DP of data pages with

DP = DSS / (su_eff,data · sizeof(DataPage)).

Then the size of the APL is f times the data set size:

sizeof(APL) = f · DSS = ((sizeof(float) + sizeof(address)) / (su_eff,data · sizeof(DataPage))) · DSS,

where a typical factor for a page size of 4 Kbytes is f = 0.3%, which even shrinks with a growing data page size. Thus, although theoretically unattractive, it should be no practical problem to hold 0.3% of a database in main memory.
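As a plausibility check, the factor f can be computed for assumed parameter values (4-byte floats and page addresses, and an assumed effective storage utilization of two thirds; these concrete values are ours):

```python
# Back-of-the-envelope check of the APL size factor f = (sizeof(float)
# + sizeof(address)) / (su_eff,data * sizeof(DataPage)) for 4-Kbyte pages.
page_size = 4096        # sizeof(DataPage) in bytes
su_eff = 2 / 3          # assumed effective data-page storage utilization
f = (4 + 4) / (su_eff * page_size)
print(f"f = {f:.4%}")   # roughly 0.3% of the data set size
```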

To compare the two algorithms, we prove the optimality of the HS algorithm in the sense that it accesses as few pages as theoretically possible for a given index. We further show, using counterexamples, that the RKV algorithm does not generally reach this optimum.

LEMMA 3 (Page Regions Intersecting the Nearest-Neighbor Sphere). Let nndist be the distance between the query point and its nearest neighbor. All pages that intersect a sphere around the query point having a radius equal to nndist (the so-called nearest-neighbor sphere) must be accessed for query processing. This condition is necessary and sufficient.

For the proof see Appendix C.

LEMMA 4 (Schedule of the HS Algorithm). The HS algorithm accesses pages in the order of increasing distance to the query point.


Fig. 10. Schedules of the RKV and HS algorithms.

For the proof see Appendix D.

LEMMA 5 (Optimality of the HS Algorithm). The HS algorithm is optimal in terms of the number of page accesses.

For the proof see Appendix E.

Now we demonstrate by an example that the RKV algorithm does not always yield an optimal number of page accesses. The main reason is that once a branch of the index has been selected, it has to be completely processed before a new branch can be begun. In the example of Figure 10, both algorithms choose pr1 to load first. Some important MINDIST and MINMAXDIST values are marked in the figure with solid and dotted arrows, respectively. Although the HS algorithm loads pr2 and pr21, the RKV algorithm first has to load pr11 and pr12, because no MINMAXDIST estimate can prune the corresponding branches. If pr11 and pr12 are not data pages but represent further subtrees of larger height, many of the pages in these subtrees have to be accessed.

In summary, the HS algorithm for nearest-neighbor search is superior to the RKV algorithm when counting page accesses. On the other hand, it has the disadvantage of dynamically allocating main memory of the order O(n), although with a very small factor of less than 1% of the database size. In addition, the extension of the RKV algorithm to k-nearest-neighbor search is difficult to implement.

An open question is whether minimizing the number of page accesses also minimizes the time needed for these accesses. We show later that statically constructed indexes yield an interpage clustering, meaning that all pages in a branch of the index are laid out contiguously on background storage. Therefore, the depth-first search of the RKV algorithm could yield fewer disk-head movements than the distance-driven search of the HS algorithm. A new challenge could be to develop an algorithm for nearest-neighbor search that directly optimizes the processing time rather than the number of page accesses.

3.5. Ranking Query

Ranking queries can be seen as generalized k-nearest-neighbor queries with a previously unknown result set size k. A typical application of a ranking query requests the nearest neighbor first, then the second closest point, the third, and so on. The requests stop according to a criterion that is external to the index-based query processing. Therefore, neither a limited query range nor a limited result set size can be assumed before the application terminates the ranking query.

In contrast to the k-nearest-neighbor algorithm, a ranking query algorithm needs an unlimited priority queue for the candidate list of closest points (cpcl). A further difference is that each request of the next closest point is regarded as a phase that is ended by reporting the next resulting point. The phases are optimized independently. In contrast, the k-nearest-neighbor algorithm searches all k points in a single phase and reports the complete set.


In each phase of a ranking query algorithm, all points encountered during the data page accesses are stored in the cpcl. The phase ends if it is guaranteed that unprocessed index pages cannot contain a point closer than the first point in the cpcl (the corresponding criterion of the k-nearest-neighbor algorithm is based on the last element of the cpcl). Before beginning the next phase, the leading element is deleted from the cpcl.

It does not appear very attractive to extend the RKV algorithm for processing ranking queries, because effective branch pruning can be performed neither on the basis of MINMAXDIST or MAXDIST estimates nor on the basis of the points encountered during data page accesses.

In contrast, the HS algorithm for nearest-neighbor processing needs only the modifications described above to be applied as a ranking query algorithm. The original proposal [Hjaltason and Samet 1995] contains these extensions.
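One common way to realize this phase structure is a one-queue formulation of the incremental algorithm: data points are pushed into the same priority queue as pages, and a phase ends exactly when a point reaches the queue head. The flat one-level page layout and all names below are our simplifications:

```python
import heapq, math

def ranking(q, pages):
    """Incrementally yield database points in order of increasing
    distance to q. 'pages' is a list of (mbr_lo, mbr_hi, points);
    this flat one-level layout is a simplification of ours."""
    def mindist(lo, hi):
        return math.sqrt(sum(max(l - c, 0.0, c - h) ** 2
                             for c, l, h in zip(q, lo, hi)))
    heap, tiebreak = [], 0
    for lo, hi, pts in pages:
        heap.append((mindist(lo, hi), tiebreak, ("page", pts)))
        tiebreak += 1
    heapq.heapify(heap)
    while heap:
        dist, _, (kind, obj) = heapq.heappop(heap)
        if kind == "point":
            yield obj, dist              # ends one ranking phase
        else:
            for p in obj:                # expand the page into candidates
                heapq.heappush(heap, (math.dist(q, p), tiebreak, ("point", p)))
                tiebreak += 1

pages = [((0, 0), (4, 4), [(1, 1), (3, 2)]),
         ((6, 6), (10, 10), [(7, 8), (9, 9)])]
r = ranking((5, 4), pages)
print(next(r)[0])    # (3, 2) -- the nearest neighbor
print(next(r)[0])    # (7, 8) -- the second closest point
```

By the lower bounding property, a point at the queue head cannot be beaten by any point hidden inside an unexpanded page, so it is safe to report.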

The major limitation of the HS algorithm for ranking queries is the cpcl. It can be proven, similarly to Lemma 2, that the length of the cpcl is of the order O(n). In contrast to the APL, the cpcl contains the full information of possibly all data objects stored in the index. Thus its size is bounded only by the database size, which questions its applicability not only theoretically but also practically. From our point of view, a priority queue implementation suitable for background storage is required for this purpose.

3.6. Reverse Nearest-Neighbor Queries

In Korn and Muthukrishnan [2000], the authors introduce the operation of reverse nearest-neighbor queries. Given an arbitrary query point q, this operation retrieves all points of the database to which q is the nearest neighbor, that is, the set of reverse nearest neighbors. Note that the nearest-neighbor relation is not symmetric: if some point p1 is the nearest neighbor of p2, then p2 is not necessarily the nearest neighbor of p1. Therefore the result set of the rnn-operation can be empty

Fig. 11. Indexing for the reverse nearest-neighbor search.

or may contain an arbitrary number ofpoints.

A database point p is in the result set of the rnn-operation for query point q unless another database point p′ is closer to p than q is. Thus, p is in the result set if q is enclosed by the sphere centered at p and touching the nearest neighbor of p (the nearest-neighbor sphere of p). Therefore, in Korn and Muthukrishnan [2000] the problem is solved by a specialized index structure for sphere objects that stores the nearest-neighbor spheres rather than the database points. An rnn-query corresponds to a point query in that index structure. For an insert operation, the set of reverse nearest neighbors of the new point must be determined. The corresponding nearest-neighbor spheres of all result points must be reinserted into the index.
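The sphere criterion can be illustrated with a naive linear scan (this is only an illustration of the definition, not the sphere index of Korn and Muthukrishnan; all names are ours):

```python
import math

def nn_dist(p, data):
    """Nearest-neighbor distance of p among the other database points."""
    return min(math.dist(p, o) for o in data if o != p)

def rnn(q, data):
    """Reverse nearest neighbors of q: all points p whose
    nearest-neighbor sphere encloses q (linear-scan sketch)."""
    return [p for p in data if math.dist(p, q) <= nn_dist(p, data)]

data = [(0, 0), (1, 0), (4, 0)]
# q lies inside the NN sphere of (4, 0) (radius 3, touching (1, 0)),
# but outside the NN spheres of (0, 0) and (1, 0).
print(rnn((2.5, 0), data))   # [(4, 0)]
```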

The two most important drawbacks of this solution are the high cost of the insert operation and the use of a highly specialized index. For instance, if the rnn has to be determined only for a subset of the dimensions, a completely new index must be constructed. Therefore, in Stanoi et al. [2000] the authors propose a solution for point index structures. This solution, however, is limited to the two-dimensional case. See Figure 11.

4. COST MODELS FOR HIGH-DIMENSIONAL INDEX STRUCTURES

Due to the high practical relevance of multidimensional indexing, cost models for estimating the number of necessary page


accesses were proposed several years ago. The first approach is the well-known cost model proposed by Friedman et al. [1977] for nearest-neighbor query processing using the maximum metric. The original model estimates leaf accesses in a kd-tree, but can easily be extended to estimate data page accesses of R-trees and related index structures. This extension was presented in Faloutsos et al. [1987] and with slightly different aspects in Aref and Samet [1991], Pagel et al. [1993], and Theodoridis and Sellis [1996]. The expected number of data page accesses in an R-tree is

A_nn,mm,FBF = ((1/C_eff)^(1/d) + 1)^d.

This formula is motivated as follows. The query evaluation algorithm is assumed to access an area of the data space which is a hypercube of volume V1 = 1/N, where N is the number of objects stored in the database. Analogously, the page region is approximated by a hypercube with volume V2 = C_eff/N. In each dimension, the probability that the projections of V1 and V2 intersect corresponds to V1^(1/d) + V2^(1/d) if N → ∞. To obtain the probability that V1 and V2 intersect in all dimensions, this term must be raised to the power of d. Multiplying this result by the number of data pages N/C_eff yields the expected number of page accesses A_nn,mm,FBF. The assumptions of the model, however, are unrealistic for nearest-neighbor queries on high-dimensional data for several reasons. First, the number N of objects in the database is assumed to approach infinity. Second, effects of high-dimensional data spaces and correlations are not considered by the model. In Cleary [1979] the model presented in Friedman et al. [1977] is extended by allowing nonrectangular page regions, but boundary effects and correlations are still not considered. In Eastman [1981] the existing models are used for optimizing the bucket size of the kd-tree. In Sproull [1991] the author

Fig. 12. Evaluation of the model of Friedman et al. [1977].

shows that the number of data points must be exponential in the number of dimensions for the models to provide accurate estimations. According to Sproull, boundary effects significantly contribute to the costs unless the following condition holds:

N ≫ C_eff · ((1/(C_eff · V_S(1/2)))^(1/d) + 1)^d,

where V_S(r) is the volume of a hypersphere with radius r, which can be computed as

V_S(r) = (√(π^d) / Γ(d/2 + 1)) · r^d

with the gamma function Γ(x), which extends the factorial operator x! = Γ(x + 1) into the domain of real numbers: Γ(x + 1) = x · Γ(x), Γ(1) = 1, and Γ(1/2) = √π.

For example, in a 20-dimensional data space with C_eff = 20, Sproull's formula evaluates to N ≫ 1.1 · 10^11. We show later (cf. Figure 12) how bad the cost estimations of the FBF model are if substantially fewer than a hundred billion points are stored in the database. Unfortunately, Sproull still assumes uniformity and independence in the distribution of data points and queries for his analysis; that is, both the data points and the center points of the queries are chosen from a uniform distribution, whereas the selectivity of the queries (1/N) is considered fixed. The above formulas are also generalized to k-nearest-neighbor queries, where k is a user-given parameter.
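Both the FBF estimate and Sproull's condition are easy to evaluate numerically; a sketch (the function names are ours):

```python
import math

def a_nn_mm_fbf(d, c_eff):
    """Expected data-page accesses per the Friedman/Bentley/Finkel model."""
    return ((1 / c_eff) ** (1 / d) + 1) ** d

def v_sphere(r, d):
    """Volume of a d-dimensional hypersphere of radius r."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

def sproull_bound(d, c_eff):
    """Sproull's lower bound on N for the FBF model to be accurate."""
    return c_eff * ((1 / (c_eff * v_sphere(0.5, d))) ** (1 / d) + 1) ** d

# Reproduces the example from the text: d = 20, C_eff = 20.
print(f"{sproull_bound(20, 20):.2g}")   # about 1.1e+11
```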


The assumptions made in the existing models do not hold in the high-dimensional case. The main reason for the problems of the existing models is that they do not consider boundary effects. "Boundary effects" stands for an exceptional performance behavior when the query reaches the boundary of the data space. Boundary effects occur frequently in high-dimensional data spaces and lead to the pruning of major amounts of empty search space, which is not considered by the existing models. To examine these effects, we performed experiments comparing the necessary page accesses with the model estimations. Figure 12 shows the actual page accesses for uniformly distributed point data versus the estimations of the Friedman et al. model. For high-dimensional data, the model completely fails to estimate the number of page accesses.

The basic model of Friedman et al. [1977] has been extended in different directions. The first is to take correlation effects into account by using the concept of the fractal dimension [Mandelbrot 1977; Schroder 1991]. There are various definitions of the fractal dimension which all capture the relevant aspect (the correlation), but differ in the details of how the correlation is measured.

In Faloutsos and Kamel [1994] the authors used the box-counting fractal dimension (also known as the Hausdorff fractal dimension) for modeling the performance of R-trees when processing range queries using the maximum metric. In their model they assume a correlation in the points stored in the database. For the queries, they still assume a uniform and independent distribution. The analysis does not take into account effects of high-dimensional spaces, and the evaluation is limited to data spaces with dimensions less than or equal to three. In Belussi and Faloutsos [1995] the authors used the fractal dimension with a different definition (the correlation fractal dimension) for the selectivity estimation of spatial queries. In this paper, range queries in low-dimensional data spaces using the Manhattan, Euclidean, and

maximum metrics were modeled. Unfortunately, the model only allows the estimation of selectivities. It is not possible to extend the model in a straightforward way to determine expectations of page accesses.

Papadopoulos and Manolopoulos [1997b] used the results of Faloutsos and Kamel and of Belussi and Faloutsos for a new model published in a recent paper. Their model is capable of estimating data page accesses of R-trees when processing nearest-neighbor queries in a Euclidean space. They estimate the distance of the nearest neighbor by using the selectivity estimation presented in Belussi and Faloutsos [1995] in the reverse way. As it is difficult to determine accesses to pages with rectangular regions for spherical queries, they approximate query spheres by minimum bounding and maximum enclosed cubes and determine upper and lower bounds on the number of page accesses in this way. This approach makes the model inoperative for high-dimensional data spaces, because the approximation error grows exponentially with increasing dimension. Note that in a 20-dimensional data space, the volume of the minimum bounding cube of a sphere is by a factor of 1/V_S(1/2) = 4.1 · 10^7 larger than the volume of the sphere. The sphere volume, in turn, is V_S(√d/2) = 27,000 times larger than that of the greatest enclosed cube. An asset of Papadopoulos and Manolopoulos' model is that queries are no longer assumed to be taken from a uniform and independent distribution. Instead, the authors assume that the query distribution follows the data distribution.

The concept of fractal dimension is also widely used in the domain of spatial databases, where the complexity of stored polygons is modeled [Gaede 1995; Faloutsos and Gaede 1996]. These approaches are of minor importance for point databases.

The second direction in which the basic model of Friedman et al. [1977] needs extension concerns the boundary effects occurring when indexing data spaces of higher dimensionality.

Arya [1995] and Arya et al. [1995] presented a new cost model for processing


Fig. 13. The Minkowski sum.

nearest-neighbor queries in the context of the application domain of vector quantization. Arya et al. restricted their model to the maximum metric and neglected correlation effects. Unfortunately, they still assumed that the number of points is exponential with the dimension of the data space. This assumption is justified in their application domain, but it is unrealistic for database applications.

Berchtold et al. [1997b] presented a cost model for query processing in high-dimensional data spaces, the so-called BBKK model. The basic concept of the BBKK model is the Minkowski sum (cf. Figure 13), a concept from robot motion planning that was introduced by the BBKK model for the first time for cost estimations. The general idea is to transform a query having a spatial extension (such as a range query or nearest-neighbor query) equivalently into a point query by enlarging the page region. In Figure 13, the page region has been enlarged such that a point query lies in the enlarged region if (and only if) the original query intersects the original region. Together with concepts to estimate the size of page regions and query regions, the model provides accurate estimations for nearest-neighbor and range queries using the Euclidean metric and considers boundary effects. To cope with correlation, the authors propose using the fractal dimension without presenting the details. The main limitations of the model are (1) that no estimation for the maximum metric is presented, (2) that the number of data pages is assumed to be a power of two, and (3) that a complete, overlap-free coverage of the data space with data pages is assumed. Weber et al. [1998] use the

cost model by Berchtold et al. without the extension for correlated data to show the superiority of the sequential scan in sufficiently high dimensions. They present the VA-file, an improvement of the sequential scan. Ciaccia et al. [1998] adapt the cost model [Berchtold et al. 1997b] to estimate the page accesses of the M-tree, an index structure for data spaces that are metric spaces but not vector spaces (i.e., only the distances between the objects are known, but no explicit positions). In Papadopoulos and Manolopoulos [1998] the authors apply the cost model to the declustering of data in a disk array. Two papers by Agrawal et al. [1998] and Riedel et al. [1998] present applications in the data mining domain.

A recent paper [Bohm 2000] is based on the BBKK cost model, which is presented there in a comprehensive way and extended in many aspects. The extensions not yet covered by the BBKK model include all estimations for the maximum metric, which are developed throughout the whole paper. The restriction of the BBKK model to numbers of data pages that are a power of two is overcome. A further extension of the model regards k-nearest-neighbor queries (the BBKK model is restricted to one-nearest-neighbor queries). The numerical methods for integral approximation and for the estimation of the boundary effects were to a large extent beyond the scope of Berchtold et al. [1997b]. Finally, the concept of the fractal dimension, which was used in the BBKK model only in a simplified way (the data space dimension is simply replaced by the fractal dimension), is well established in this paper by the consequent application of the fractal power laws.
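The core Minkowski-sum transformation can be illustrated in a few lines: for a spherical query of radius r, membership of the query point in the enlarged page region reduces to a MINDIST test (a sketch with our own names, not the BBKK implementation):

```python
def intersects_enlarged(q, lo, hi, r):
    """Point-in-Minkowski-sum test: q lies in the page region (lo, hi)
    enlarged by a sphere of radius r iff MINDIST(q, region) <= r,
    i.e. iff the original sphere query intersects the original region."""
    mindist_sq = sum(max(l - c, 0.0, c - h) ** 2
                     for c, l, h in zip(q, lo, hi))
    return mindist_sq <= r * r

# Region [2,4] x [2,4]; a query sphere around (5, 3) with radius 1.2
# reaches the region, one with radius 0.5 does not.
print(intersects_enlarged((5, 3), (2, 2), (4, 4), 1.2))  # True
print(intersects_enlarged((5, 3), (2, 2), (4, 4), 0.5))  # False
```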

5. INDEXING IN METRIC SPACES

In some applications, objects cannot be mapped into feature vectors. However, there still exists some notion of similarity between objects, which can be expressed as a metric distance between the objects; that is, the objects are embedded in a metric space. The object distances can be used directly for query evaluation.


Fig. 14. Example Burkhard–Keller tree (D: data points, v: values of the discrete distance function).

Several index structures for pure metric spaces have been proposed in the literature. Probably the oldest reference is the so-called Burkhard–Keller [1973] tree. Burkhard–Keller trees use a distance function that returns a small number (i) of discrete values. An arbitrary object is chosen as the root of the tree and the distance function is used to partition the remaining data objects into i subsets which are the i branches of the tree. The same procedure is repeated for each nonempty subset to build up the tree (cf. Figure 14). More recently, a number of variants of the Burkhard–Keller tree have been proposed [Baeza-Yates et al. 1994]. In the fixed queries tree, for example, the data objects used as pivots are confined to be the same on the same level of the tree [Baeza-Yates et al. 1994].
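A minimal Burkhard–Keller tree can be sketched as follows; for brevity we use the absolute difference of string lengths as a toy discrete distance (a pseudometric chosen only for illustration, not a realistic distance function):

```python
class BKTree:
    """Burkhard-Keller tree over a discrete distance function: the
    children of a node are indexed by the distance value to its pivot."""
    def __init__(self, dist):
        self.dist = dist
        self.root = None                 # node: (pivot, {distance: child})

    def insert(self, obj):
        if self.root is None:
            self.root = (obj, {})
            return
        node = self.root
        while True:
            pivot, children = node
            d = self.dist(obj, pivot)
            if d in children:
                node = children[d]       # descend into the matching branch
            else:
                children[d] = (obj, {})
                return

    def range_query(self, q, radius):
        """All objects within 'radius' of q; the triangle inequality
        restricts the search to branches with d - radius <= v <= d + radius."""
        result, stack = [], [self.root] if self.root else []
        while stack:
            pivot, children = stack.pop()
            d = self.dist(q, pivot)
            if d <= radius:
                result.append(pivot)
            for v, child in children.items():
                if d - radius <= v <= d + radius:
                    stack.append(child)
        return result

# Toy discrete distance: absolute difference of word lengths.
tree = BKTree(lambda a, b: abs(len(a) - len(b)))
for w in ["a", "ab", "abcd", "abcdefg"]:
    tree.insert(w)
print(sorted(tree.range_query("xyz", 1)))   # ['ab', 'abcd']
```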

In most applications, a continuous distance function is used. Examples of index structures based on a continuous distance function are the vantage-point tree (VPT), the generalized hyperplane tree (GHT), and the M-tree. The VPT [Uhlmann 1991; Yianilos 1993] is a binary tree that uses some pivot element as the root and partitions the remaining data elements into two subsets based on their distance to the pivot element. The same is repeated recursively for the subsets (cf. Figure 15). Variants of the VPT are the optimized VP-tree [Chiueh 1994], the

Multiple VP-tree [Bozkaya and Ozsoyoglu1997], and the VP-Forest [Yianilos 1999].
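The VPT partitioning and the triangle-inequality pruning it enables can be sketched as follows (the pivot choice, the node layout, and the range-query formulation are our simplifications):

```python
import math, statistics

def build_vpt(points):
    """Recursive vantage-point tree sketch: the first point serves as
    pivot; the median distance mu splits the rest into inner/outer."""
    if not points:
        return None
    pivot, rest = points[0], points[1:]
    if not rest:
        return (pivot, 0.0, None, None)
    dists = [math.dist(pivot, p) for p in rest]
    mu = statistics.median(dists)
    inner = [p for p, d in zip(rest, dists) if d <= mu]
    outer = [p for p, d in zip(rest, dists) if d > mu]
    return (pivot, mu, build_vpt(inner), build_vpt(outer))

def vpt_range(node, q, r, out):
    """Range query: a subtree is pruned only when the triangle
    inequality guarantees it cannot contain a qualifying point."""
    if node is None:
        return
    pivot, mu, inner, outer = node
    d = math.dist(q, pivot)
    if d <= r:
        out.append(pivot)
    if d - r <= mu:
        vpt_range(inner, q, r, out)
    if d + r > mu:
        vpt_range(outer, q, r, out)

pts = [(0, 0), (1, 0), (2, 0), (5, 0), (6, 0)]
tree, res = build_vpt(pts), []
vpt_range(tree, (1.2, 0), 1.0, res)
print(sorted(res))    # [(1, 0), (2, 0)]
```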

The GHT [Uhlmann 1991] is also a binary tree that uses two pivot elements on each level of the tree. All data elements that are closer to the first pivot element are assigned to the left subtree and all elements closer to the other pivot element are assigned to the other subtree (cf. Figure 16). A variant of the GHT is the geometric near-neighbor access tree (GNAT) [Brin 1995]. The main difference is that the GNAT is an m-ary tree that uses m pivots on each level of the tree.

The basic structure of the M-tree [Ciaccia et al. 1997] is similar to the VP-tree. The main difference is that the M-tree is designed for secondary memory and allows overlap in the covered areas to allow easier updates. Note that among all metric index structures the M-tree is the only one that is optimized for large secondary-memory-based data sets. All others are main-memory index structures supporting rather small data sets.

Note that metric indexes are only used in applications where a meaningful distance in vector space is not available. This is because vector spaces contain more information and therefore allow a better structuring of the data than general metric spaces.

6. APPROACHES TO HIGH-DIMENSIONAL INDEXING

In this section, we introduce and briefly discuss the most important index structures for high-dimensional data spaces. We first describe index structures using minimum bounding rectangles as page regions, such as the R-tree, the R*-tree, and the X-tree. We continue with structures using bounding spheres, such as the SS-tree and the TV-tree, and conclude with two structures using combined regions. The SR-tree uses the intersection solid of MBR and bounding sphere as the page region. The page region of a space-filling curve is the union of not necessarily connected hypercubes.

Multidimensional access methods that have not been investigated for query processing in high-dimensional data


Fig. 15. Example vantage-point tree.

Fig. 16. Example generalized hyperplane tree.

spaces, such as hashing-based methods [Nievergelt et al. 1984; Otoo 1984; Hinrichs 1985; Krishnamurthy and Whang 1985; Ouksel 1985; Kriegel and Seeger 1986, 1987, 1988; Freeston 1987; Hutflesz et al. 1988a, b; Henrich et al. 1989], are excluded from the discussion here. In the VAMSplit R-tree [Jain and White 1996] and in the Hilbert R-tree [Kamel and Faloutsos 1994], methods for statically constructing R-trees are presented. Since the VAMSplit R-tree and the Hilbert R-tree are more construction methods than indexing structures of their own, they are also not presented in detail here.

6.1. R-tree, R*-tree, and R+-tree

The R-tree [Guttman 1984] family of index structures uses solid minimum bounding rectangles (MBRs) as page regions.

An MBR is a multidimensional interval of the data space (i.e., an axis-parallel multidimensional rectangle). MBRs are minimal approximations of the enclosed point set: there exists no smaller axis-parallel rectangle also enclosing the complete point set. Therefore, every (d − 1)-dimensional surface area must contain at least one data point. Space partitioning is neither complete nor disjoint. Parts of the data space may not be covered at all by data page regions. Overlap between regions in different branches is allowed, although overlap deteriorates the search performance, especially for high-dimensional data spaces [Berchtold et al. 1996]. The region description of an MBR comprises, for each dimension, a lower and an upper bound. Thus, 2d floating point values are required. This description allows an efficient determination of MINDIST, MINMAXDIST, and MAXDIST using any Lp metric.
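For the Euclidean metric, these three distance functions can be computed from the 2d bounds as follows (a sketch; the MINMAXDIST construction follows Roussopoulos et al. as described earlier, but the function names and the bound representation are ours):

```python
import math

def mindist(q, lo, hi):
    """Smallest possible distance from q to any point in the MBR."""
    return math.sqrt(sum(max(l - c, 0.0, c - h) ** 2
                         for c, l, h in zip(q, lo, hi)))

def maxdist(q, lo, hi):
    """Largest possible distance from q to any point in the MBR."""
    return math.sqrt(sum(max(abs(c - l), abs(c - h)) ** 2
                         for c, l, h in zip(q, lo, hi)))

def minmaxdist(q, lo, hi):
    """Upper bound on the distance to the nearest point guaranteed to
    lie in the MBR: each face nearest to q in one dimension must
    contain at least one data point, by the MBR property."""
    rm = [l if c <= (l + h) / 2 else h for c, l, h in zip(q, lo, hi)]
    rM = [l if c >= (l + h) / 2 else h for c, l, h in zip(q, lo, hi)]
    total = sum((c - m) ** 2 for c, m in zip(q, rM))
    best = min(total - (q[k] - rM[k]) ** 2 + (q[k] - rm[k]) ** 2
               for k in range(len(q)))
    return math.sqrt(best)

q, lo, hi = (0, 0), (1, 1), (3, 3)
# MINDIST <= MINMAXDIST <= MAXDIST always holds:
print(mindist(q, lo, hi), minmaxdist(q, lo, hi), maxdist(q, lo, hi))
```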


R-trees were originally designed for spatial databases, that is, for the management of two-dimensional objects with a spatial extension (e.g., polygons). In the index, these objects are represented by their corresponding MBRs. In contrast to point objects, it is possible that no overlap-free partition for a set of such objects exists at all. The same problem also occurs when R-trees are used to index data points, but only in the directory part of the index. Page regions are treated as spatially extended, atomic objects in their parent nodes (no forced split). Therefore, it is possible that a directory page cannot be split without creating overlap among the newly created pages [Berchtold et al. 1996].

According to our framework of high-dimensional index structures, two heuristics have to be defined to handle the insert operation: the choice of a suitable page to insert the point into, and the management of page overflow. When searching for a suitable page, one of three cases may occur.

—The point is contained in exactly one page region. In this case, the corresponding page is used.

—The point is contained in several different page regions. In this case, the page region with the smallest volume is used.

—No region contains the point. In this case, the region that yields the smallest volume enlargement is chosen. If several such regions yield minimum enlargement, the region with the smallest volume among them is chosen.
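The three cases can be sketched as a small choose-subtree function (the names and the (lo, hi) region representation are ours):

```python
def volume(lo, hi):
    v = 1.0
    for l, h in zip(lo, hi):
        v *= h - l
    return v

def enlarged(lo, hi, p):
    """MBR after enlarging it to cover point p."""
    return (tuple(min(l, c) for l, c in zip(lo, p)),
            tuple(max(h, c) for h, c in zip(hi, p)))

def choose_subtree(point, regions):
    """Insert heuristic sketch: among the child regions (lo, hi),
    prefer containing regions of minimal volume; otherwise pick the
    minimal volume enlargement, ties broken by smaller volume."""
    containing = [r for r in regions
                  if all(l <= c <= h for c, l, h in zip(point, r[0], r[1]))]
    if containing:
        return min(containing, key=lambda r: volume(*r))
    return min(regions,
               key=lambda r: (volume(*enlarged(*r, point)) - volume(*r),
                              volume(*r)))

regions = [((0, 0), (2, 2)), ((3, 0), (6, 2))]
# (2.9, 1) lies in neither region; enlarging the second region costs
# far less volume than enlarging the first.
print(choose_subtree((2.9, 1), regions))    # ((3, 0), (6, 2))
```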

The insert algorithm starts with the root and chooses in each step a child node by applying the above rules. Page overflows are generally handled by splitting the page. Four different algorithms have been published for the purpose of finding the right split dimension (also called split axis) and the split hyperplane. They are distinguished according to their time complexity with varying page capacity C. Details are provided in Gaede and Gunther [1998]:

—an exponential algorithm,
—a quadratic algorithm,
—a linear algorithm, and
—Greene's [1989] algorithm.

Fig. 17. Misled insert operations.

Guttman [1984] reports only slight differences between the linear and the quadratic algorithm; however, an evaluation study performed by Beckmann et al. [1990] reveals disadvantages for the linear algorithm. The quadratic algorithm and Greene's algorithm are reported to yield similar search performance.

In the insert algorithm, the suitable data page for the object is found in O(log n) time, by examining a single path of the index. It seems to be an advantage that only a single path is examined for the determination of the data page into which a point is inserted. An uncontrolled number of paths, in contrast, would violate the demand of an O(n log n) time complexity for the index construction. Figure 17 shows, however, that inserts are often misled in such tie situations. It is intuitively clear that the point must be inserted into page p2,1, because p2,1 is the only page on the second index level that contains the point. But the insert algorithm faces a tie situation at the first index level because both pages, p1 as well as p2, cover the point. According to the heuristics, the smaller page p1 is chosen. The page p2,1, as a child of p2, will never be under consideration. The result of this misled insert is that the page p1,2 unnecessarily becomes enlarged by a large factor and an additional overlap situation between the pages p1,2 and p2,1 arises. Therefore, overlap at or near the data level is mostly a consequence of some initial overlap in the directory levels near the root (which would, by itself, be tolerable).

The initial overlap usually stems from the inability to split a higher-level page without overlap, because all child pages have independently grown extended

ACM Computing Surveys, Vol. 33, No. 3, September 2001.

page regions. For an overlap-free split, a dimension is needed in which the projections of the page regions have no overlap at some point. It has been shown in Berchtold et al. [1996] that the existence of such a point becomes less likely as the dimension of the data space increases. The reason simply is that the projection of each child page to an arbitrary dimension is not much smaller than the corresponding projection of the parent page. If we assume all page regions to be hypercubes of side length A (parent page) and a (child page), respectively, we get a = A · (1/Ceff)^(1/d), which is substantially below A if d is small but actually in the same order of magnitude as A if d is sufficiently high.
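The formula follows from volume conservation (Ceff · a^d = A^d), and a few numbers make the effect concrete. A minimal sketch, with Ceff = 30 chosen purely for illustration:

```python
# Side length ratio a/A = (1/Ceff)**(1/d) for hypercube pages holding
# Ceff children each (volume conservation: Ceff * a**d = A**d).
Ceff = 30  # assumed effective page capacity, for illustration only

for d in (2, 16, 100):
    ratio = (1.0 / Ceff) ** (1.0 / d)
    print(d, round(ratio, 3))
```

For d = 2 the child side length is below a fifth of the parent's, while for d = 100 it is already above 96% of it, which is why an overlap-free split dimension becomes so unlikely.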

The R*-tree [Beckmann et al. 1990] is an extension of the R-tree based on a careful study of the R-tree algorithms under various data distributions. In contrast to Guttman, who optimizes only for a small volume of the created page regions, Beckmann et al. identify the following optimization objectives:

—minimize overlap between page regions,
—minimize the surface of page regions,
—minimize the volume covered by internal nodes, and
—maximize the storage utilization.

The heuristic for the choice of a suitable page to insert a point is modified in the third alternative: no page region contains the point. In this case, a distinction is made as to whether the child page is a data page or a directory page. If it is a data page, then the region is taken that yields the smallest enlargement of the overlap. In the case of a tie, further criteria are the volume enlargement and the volume. If the child node is a directory page, the region with the smallest volume enlargement is taken. In case of doubt, the volume decides.

As in Greene's algorithm, the split heuristic has certain phases. In the first phase, the split dimension is determined:

—for each dimension, the objects are sorted according to their lower bound and according to their upper bound;
—a number of partitionings with a controlled degree of asymmetry are encountered; and
—for each dimension, the surface areas of the MBRs of all partitionings are summed up and the least sum determines the split dimension.

In the second phase, the split plane is determined, minimizing these criteria:

—overlap between the page regions, and
—when in doubt, least coverage of dead space.
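The two phases can be sketched in code. The following is a simplified illustration for point entries (the original algorithm handles MBRs and sorts by both lower and upper bounds); `min_fill` and the use of total volume as a stand-in for "least coverage of dead space" are assumptions of this sketch:

```python
def mbr(points):
    """Minimum bounding rectangle of a point set: (lows, highs)."""
    return ([min(c) for c in zip(*points)], [max(c) for c in zip(*points)])

def margin(box):
    """Surface proxy: sum of side lengths."""
    return sum(hi - lo for lo, hi in zip(*box))

def volume(box):
    v = 1.0
    for lo, hi in zip(*box):
        v *= hi - lo
    return v

def overlap(a, b):
    v = 1.0
    for (alo, blo), (ahi, bhi) in zip(zip(a[0], b[0]), zip(a[1], b[1])):
        v *= max(0.0, min(ahi, bhi) - max(alo, blo))
    return v

def rstar_split(points, min_fill):
    d = len(points[0])

    # Phase 1: the split axis is the one with the least sum of margins
    # over all distributions respecting the minimum fill degree.
    def margin_sum(axis):
        s = sorted(points, key=lambda p: p[axis])
        return sum(margin(mbr(s[:k])) + margin(mbr(s[k:]))
                   for k in range(min_fill, len(s) - min_fill + 1))

    axis = min(range(d), key=margin_sum)

    # Phase 2: on that axis, minimize overlap between the two page
    # regions; ties are broken by the smaller total volume.
    s = sorted(points, key=lambda p: p[axis])
    k = min(range(min_fill, len(s) - min_fill + 1),
            key=lambda k: (overlap(mbr(s[:k]), mbr(s[k:])),
                           volume(mbr(s[:k])) + volume(mbr(s[k:]))))
    return s[:k], s[k:]
```

On two well-separated clusters this sketch recovers the intuitive split between them.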

Splits can often be avoided by the concept of forced reinsert. If a node overflow occurs, a defined percentage of the objects with the highest distances from the center of the region are deleted from the node and inserted into the index again, after the region has been adapted. By this means, the storage utilization will grow to a factor between 71 and 76%. Additionally, the quality of partitioning improves because unfavorable decisions in the beginning of index construction can be corrected this way.

Performance studies report improvements between 10 and 75% over the R-tree. In higher-dimensional data spaces, the split algorithm proposed in Beckmann et al. [1990] leads to a deteriorated directory. Therefore, the R*-tree is not adequate for these data spaces; rather, it has to load the entire index in order to process most queries. A detailed explanation of this effect is given in Berchtold et al. [1996]. The basic problem of the R-tree, overlap coming up at high index levels and then propagating down by misled insert operations, is alleviated by more appropriate heuristics but not solved.

The heuristic of the R*-tree split to optimize for page regions with a small surface (i.e., for square/cubelike page regions) is beneficial, in particular with respect to range queries and nearest-neighbor queries. As pointed out in Section 4 (cost models), the access probability corresponds to the Minkowski sum of the page region and the query sphere. The Minkowski sum primarily consists of the page region, which is enlarged at each surface segment. If the page regions are

Fig. 18. Shapes of page regions and their suitability for similarity queries.

optimized for a small surface, they directly optimize the Minkowski sum. Figure 18 shows an extreme, nonetheless typical, example of volume-equivalent pages and their Minkowski sums. The square (1 × 1 unit) yields, with 3.78 units, a substantially lower Minkowski sum than the volume-equivalent rectangle (3 × 1/3) with 5.11 units. Note again that the effect becomes stronger with an increasing number of dimensions, as every dimension is a potential source of imbalance. For spherical queries, however, spherical page regions yield the lowest Minkowski sum (3.55 units). Spherical page regions are discussed later.
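The cited values can be reproduced with elementary geometry. For a 2D region, the Minkowski sum with a circular query of radius r has area (region area) + (perimeter) · r + πr²; the sketch below assumes a query radius of 0.5, which is not stated explicitly in the text but matches the figure's numbers up to rounding:

```python
import math

r = 0.5  # assumed query radius; chosen so the values match the figure

def minkowski_rect(w, h, r):
    """Area of a w x h rectangle enlarged by a circular query of radius r:
    original area + perimeter * r + area of the query circle."""
    return w * h + 2 * (w + h) * r + math.pi * r * r

square = minkowski_rect(1.0, 1.0, r)          # close to the cited 3.78
rect = minkowski_rect(3.0, 1.0 / 3.0, r)      # close to the cited 5.11
R = math.sqrt(1.0 / math.pi)                  # radius of a circle of unit area
sphere = math.pi * (R + r) ** 2               # close to the cited 3.55
```

The comparison makes the surface argument explicit: all three regions have unit area, so the differences stem entirely from the perimeter term.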

The R+-tree [Stonebraker et al. 1986; Sellis et al. 1987] is an overlap-free variant of the R-tree. To guarantee no overlap, the split algorithm is modified by a forced-split strategy. Child pages that are an obstacle in overlap-free splitting of some page are simply cut into two pieces at a suitable position. It is possible, however, that these forced splits must be propagated down until the data page level is reached. The number of pages can even increase exponentially from level to level. As we have pointed out before, the extension of the child pages is not much smaller than the extension of the parent if the dimension is sufficiently high. Therefore, high dimensionality leads to many forced split operations. Pages that are subject to a forced split are split although no overflow has occurred. The resulting pages are utilized by less than 50%. The more forced splits are raised, the more the storage utilization of the complete index will deteriorate.

A further problem which more or less concerns all of the data organizing techniques described in this survey is the decreasing fanout of the directory nodes with increasing dimension. For the R-tree family, for example, the internal nodes have to store 2d high and low bounds in order to describe a minimum bounding rectangle in d-dimensional space.

6.2. X-Tree

The R-tree and the R*-tree have primarily been designed for the management of spatially extended, two-dimensional objects, but have also been used for high-dimensional point data. Empirical studies [Berchtold et al. 1996; White and Jain 1996], however, showed a deteriorated performance of R*-trees for high-dimensional data. The major problem of R-tree-based index structures in high-dimensional data spaces is overlap. In contrast to low-dimensional spaces, there exist only a few degrees of freedom for splits in the directory. In fact, in most situations there exists only a single "good" split axis. An index structure that does not use this split axis will produce highly overlapping MBRs in the directory and thus show a deteriorated performance in high-dimensional spaces. Unfortunately, this specific split axis might lead to unbalanced partitions. In this case, a split should be avoided in order to avoid underfilled nodes.

The X-tree [Berchtold et al. 1996] is an extension of the R*-tree which is directly designed for the management of high-dimensional objects and based on the analysis of problems arising in high-dimensional data spaces. It extends the R*-tree by two concepts:

— overlap-free split according to a split-history, and

— supernodes with an enlarged page ca-pacity.

If one records the history of data page splits in an R-tree-based index structure, this results in a binary tree. The index starts with a single data page A covering almost the whole data space and inserts data items. If the page overflows, the index splits the page into two new pages A′ and B. Later on, each of these pages might be split again into new pages. Thus the history of all splits may be described as a binary tree, having split dimensions (and positions) as nodes and having the current data pages as leaf nodes. Figure 19 shows

Fig. 19. Example of the split history.

an example of such a process. In the lower half of the figure, the appropriate directory node is depicted. If the directory node overflows, we have to divide the set of data pages (the MBRs A′′, B′′, C, D, E) into two partitions. Therefore, we have to choose a split axis first. Now, what are potential candidates for split axes in our example? Suppose we chose dimension 5 as a split axis. Then we would have to put A′′ and E into one of the partitions. However, A′′ and E have never been split according to dimension 5. Thus they span almost the whole data space in this dimension. If we put A′′ and E into one of the partitions, the MBR of this partition in turn will span the whole data space. This obviously leads to high overlap with the other partition, regardless of the shape of the other partition. If one looks at the example in Figure 19, it becomes clear that only dimension 2 may be used as a split dimension. The X-tree generalizes this observation and always uses the split dimension with which the root node of the particular split tree is labeled. This guarantees an overlap-free directory. However, the split tree might be unbalanced. In this case it is advantageous not to split at all because splitting would create one underfilled node and another almost overflowing node. Thus the storage utilization in the directory would decrease dramatically and the directory would degenerate. In this case the X-tree does not split and creates an enlarged directory node instead, a supernode. The higher the dimensionality, the more supernodes will be created and the larger the supernodes become. To also operate on lower-dimensional spaces efficiently, the X-tree split algorithm also includes a geometric split algorithm. The whole split algorithm works as follows. In case of a data page split, the X-tree uses the R*-tree split algorithm or any other topological split algorithm. In case of directory nodes, the X-tree first tries to split the node using a topological split algorithm. If this split leads to highly overlapping MBRs, the X-tree applies the overlap-free split algorithm based on the split history as described above. If this leads to an unbalanced directory, the X-tree simply creates a supernode.

The X-tree shows a high performance gain compared to R*-trees for all query types in medium-dimensional spaces. For small dimensions, the X-tree shows a behavior almost identical to R-trees; for higher dimensions the X-tree also has to visit such a large number of nodes that a linear scan is less expensive. It is impossible to provide exact values here because many factors such as the number of data items, the dimensionality, the distribution, and the query type have a high influence on the performance of an index structure.

6.3. Structures with a kd-Tree Directory

Like the R-tree and its variants, the kd-B-tree [Robinson 1981] uses hyperrectangle-shaped page regions. An adaptive kd-tree [Bentley 1975, 1979] is used for space partitioning (cf. Figure 20). Therefore, complete and disjoint space partitioning is guaranteed. Obviously, the page regions are (hyper)rectangles, but not minimum bounding rectangles. The general advantage of kd-tree-based partitioning is that the decision of which subtree to use is always unambiguous. The deletion operation is also supported in a better way than in R-tree variants because leaf nodes with a common parent exactly comprise a hyperrectangle of the data space. Thus they can be merged without violating the conditions of complete, disjoint space partitioning.

Fig. 20. The kd-tree.

Fig. 21. Incomplete versus complete decomposition for clustered and correlated data.

Complete partitioning has the disadvantage that page regions are generally larger than necessary. Particularly in high-dimensional data spaces, large parts of the data space are often not occupied by data points at all. Real data often are clustered or correlated. If the data distribution is cluster shaped, it is intuitively clear that large parts of the space are empty. But also the presence of correlations (i.e., one dimension is more or less dependent on the values of one or more other dimensions) leads to empty parts of the data space, as depicted in Figure 21. Index structures that do not use complete partitioning are superior, because the larger page regions of a complete partitioning yield a higher access probability; such pages are accessed more often during query processing than minimum bounding page regions. The second problem is that kd-trees in principle are unbalanced. Therefore, it is not directly possible to pack contiguous subtrees into directory pages. The kd-B-tree approaches this problem by a concept involving forced splits:

If some page has an overflow condition, it is split by an appropriately chosen hyperplane. The entries are distributed among the two pages and the split is propagated up the tree. Unfortunately, regions on lower levels of the tree may also be intersected by the split plane; these must be split as well (forced split). As every region in the subtree can be affected, the time complexity of the insert operation is O(n) in the worst case. A minimum storage utilization guarantee cannot be provided. Therefore, theoretical considerations about the index size are difficult.

The hB-tree (holey brick) [Lomet and Salzberg 1989, 1990; Evangelidis 1994] also uses a kd-tree directory to define the page regions of the index. In this approach, splitting of a node is based on multiple attributes. This means that page regions do not correspond to solid rectangles but to rectangles from which other rectangles have been removed (holey bricks). With this technique, the forced split of the kd-B-tree and the R+-tree is avoided.

Fig. 22. The kd-B-tree.

Fig. 23. The Minkowski sum of a holey brick.

For similarity search in high-dimensional spaces, we can state the same benefits and shortcomings of a complete space decomposition as in the kd-B-tree, depicted in Figure 22. In addition, we can state that the cavities of the page regions decrease the volume of the page region, but hardly decrease the Minkowski sum (and thus the access probability of a page). This is illustrated in Figure 23, where two large cavities are removed from a rectangle, reducing its volume by more than 30%. The Minkowski sum, however, is not reduced in the left cavity, because it is not as wide as the perimeter of the query. In the second cavity, there is only a very small area where the page region is not touched. Thus the cavities reduce the access probability of the page by less than 1%.

The directory of the LSDh-tree [Henrich 1998] is also an adaptive kd-tree [Bentley 1975, 1979] (see Figure 24). In contrast to R-tree variants and kd-B-trees, the region description is coded in a sophisticated way, leading to reduced space requirements for the region description. A specialized paging strategy collects parts of the kd-tree into directory pages. Some levels at the top of the kd-tree are assumed to be fixed in main memory. They are called the internal directory, in contrast to the external directory, which is subject to paging. In each node, only the split axis (e.g., 8 bits for up to 256-dimensional data spaces) and the position where the split plane intersects the split axis (e.g., 32 bits for a float number) have to be stored. Two pointers to child nodes require 32 bits each. To describe k regions, (k − 1) nodes are required, leading to a total amount of 104 · (k − 1) bits for the complete directory. R-treelike structures require for each region description two float values for each dimension plus the child node pointer. Therefore, the lowest level of the directory alone needs (32 + 64 · d) · k bits for the region description. While the space requirement of the R-tree directory grows linearly with increasing dimension, it is constant (theoretically logarithmic, for very large dimensionality) for the LSDh-tree. Note that this argument also holds for the hBπ-tree. See Evangelidis et al. [1997] for a more detailed discussion of the issue. For 16-dimensional data spaces, R-tree directories are more than 10 times larger than the corresponding LSDh-tree directory.
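These size figures can be checked directly. A small sketch, with k (the number of regions) as a free parameter:

```python
def lsdh_bits(k):
    """LSDh-tree directory: (k - 1) kd-tree nodes of 8 + 32 + 2 * 32 = 104
    bits each (split axis, split position, two child pointers)."""
    return 104 * (k - 1)

def rtree_bits(k, d):
    """Lowest R-tree directory level alone: per region, two 32-bit floats
    per dimension plus a 32-bit child pointer."""
    return (32 + 64 * d) * k

ratio = rtree_bits(10_000, 16) / lsdh_bits(10_000)  # exceeds 10 for d = 16
```

For d = 16 the per-region cost is 32 + 64 · 16 = 1056 bits versus roughly 104 bits per kd-tree node, which reproduces the "more than 10 times larger" claim.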

The rectangle representing the region of a data page can be determined from the split planes in the directory. It is called the potential data region and is not explicitly stored in the index.

One disadvantage of the kd-tree directory is that the data space is completely covered with potential data regions. In cases where major parts of the data space are empty, this results in performance degeneration. To overcome this drawback, a concept called coded actual data regions (cadr) is introduced. The cadr is a multidimensional interval conservatively approximating the MBR of the points stored in a data page. To save space in the description of the cadr, the potential data region is quantized into a grid of 2^(z·d) cells. Therefore, only 2 · z · d bits are additionally required for each cadr. The parameter z can be chosen by the user. Good results are achieved using a value z = 5 (see Figure 25).

Fig. 24. The LSDh-tree.

Fig. 25. Region approximation using the LSDh-tree.
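One possible encoding of a cadr quantizes the MBR conservatively onto the 2^z grid per dimension of its potential data region. This is a sketch only; the actual bit layout used in the LSDh-tree may differ:

```python
import math

def encode_cadr(mbr_low, mbr_high, pr_low, pr_high, z):
    """Grid coordinates (z bits per bound and dimension, 2*z*d bits total)
    of the smallest grid-aligned box containing the MBR."""
    cells = 1 << z  # 2**z grid cells per dimension
    lo_codes, hi_codes = [], []
    for ml, mh, pl, ph in zip(mbr_low, mbr_high, pr_low, pr_high):
        scale = cells / (ph - pl)
        # round the lower bound down and the upper bound up (conservative)
        lo_codes.append(max(0, math.floor((ml - pl) * scale)))
        hi_codes.append(min(cells - 1, math.ceil((mh - pl) * scale) - 1))
    return lo_codes, hi_codes
```

Because both bounds are rounded outward, the decoded region always contains the true MBR, preserving the conservative-approximation property the text requires.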

The most important advantage of the complete partitioning using potential data regions is that they allow a maintenance guaranteeing no overlap. It has been pointed out in the discussion of the R-tree variants and of the X-tree that overlap is a particular problem in high-dimensional data spaces. By the complete partitioning of the kd-tree directory, tie situations that lead to overlap do not arise. On the other hand, the regions of the index pages are not able to adapt equally well to changes in the actual data distribution as page regions that are not forced into the kd-tree directory. The description of the page regions in terms of splitting planes forces the regions to be overlap-free anyway. When a point has to be inserted into an LSDh-tree, there always exists a unique potential data region in which the point has to be inserted. In contrast, the MBR of an R-tree may have to be enlarged for an insert operation, which causes overlap between data pages in some cases. A situation where no overlap-free enlargement is possible is depicted in Figure 26. The coded actual data regions may have to be enlarged during an insert operation. As they are completely contained in a potential page region, overlap cannot arise either.

Fig. 26. No overlap-free insert is possible.

The split strategy for LSDh-trees is rather simple. The split dimension is increased by one compared to the parent node in the kd-tree directory. The only exception from this rule is that a dimension having too few distinct values for splitting is left out.

As reported in Henrich [1998], the LSDh-tree shows a performance that is very similar to that of the X-tree, except that inserts are done much faster in an LSDh-tree because no complex computation takes place. Using a bulk-loading technique to construct the index, both index structures are equal in performance. Also from an implementation point of view, both structures are of similar complexity. The LSDh-tree has a rather complex directory structure and simple algorithms, whereas the X-tree has a rather straightforward directory and complex algorithms.

6.4. SS-Tree

In contrast to all previously introduced index structures, the SS-tree [White and Jain 1996] uses spheres as page regions. For maintenance efficiency, the spheres are not minimum bounding spheres. Rather, the centroid point (i.e., the average value in each dimension) is used as the center for the sphere, and the minimum radius is chosen such that all objects are included in the sphere. The region description therefore comprises the centroid point and the radius. This allows an efficient determination of the MINDIST and of the MAXDIST, but not of the MINMAXDIST. The authors suggest using the RKV algorithm, but they do not provide any hints on how to prune the branches of the index efficiently.

Fig. 27. No overlap-free split is possible.
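With a centroid/radius description, MINDIST and MAXDIST of a query point to a sphere region follow directly. A minimal sketch using Euclidean distance (function names are illustrative):

```python
import math

def sphere_mindist(q, center, radius):
    """Smallest distance from q to any point of the sphere (0 if q lies inside)."""
    return max(0.0, math.dist(q, center) - radius)

def sphere_maxdist(q, center, radius):
    """Largest distance from q to any point of the sphere."""
    return math.dist(q, center) + radius
```

A MINMAXDIST, by contrast, would require knowing where on the sphere's surface a data point is guaranteed to lie, which the centroid/radius description does not reveal; this is the pruning limitation mentioned above.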

For insert processing, the tree is descended choosing the child node whose centroid is closest to the point, regardless of volume or overlap enlargement. Meanwhile, the new centroid point and the new radius are determined. When an overflow condition occurs, a forced reinsert operation is raised, as in the R*-tree. 30% of the objects with the highest distances from the centroid are deleted from the node, all region descriptions are updated, and the objects are reinserted into the index.

The split determination is based merely on the criterion of variance. First, the split axis is determined as the dimension yielding the highest variance. Then, the split plane is determined by considering all possible split positions that fulfill space utilization guarantees. The sum of the variances on each side of the split plane is minimized.
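A minimal sketch of this variance-based split; here the variance is measured along the split axis, and `min_fill` stands in for the space utilization guarantee (both simplifying assumptions of the sketch):

```python
def variance_split(points, min_fill):
    """Split axis = dimension of highest variance; split position
    minimizes the summed variance of both sides along that axis."""
    d = len(points[0])

    def var(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / len(vals)

    axis = max(range(d), key=lambda a: var([p[a] for p in points]))
    s = sorted(points, key=lambda p: p[axis])
    k = min(range(min_fill, len(s) - min_fill + 1),
            key=lambda k: var([p[axis] for p in s[:k]]) +
                          var([p[axis] for p in s[k:]]))
    return s[:k], s[k:]
```

Minimizing the summed variance tends to place the split between clusters, which keeps the radii of the two resulting spheres small.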

It was pointed out already in Section 6.1 (cf. Figure 18 in particular) that spheres are theoretically superior to volume-equivalent MBRs because the Minkowski sum is smaller. The general problem of spheres is that they are not amenable to an easy overlap-free split, as depicted in Figure 27. MBRs have in general a smaller volume, and, therefore, the advantage in the Minkowski sum is more than compensated. The SS-tree outperforms the R*-tree by a factor of two; however, it does not reach the performance of the LSDh-tree and the X-tree.

6.5. TV-Tree

The TV-tree [Lin et al. 1995] is designed especially for real data that are subject to the Karhunen–Loève Transform (also known as principal component analysis), a mapping that preserves distances and eliminates linear correlations. Such data yield a high variance, and therefore a good selectivity, in the first few dimensions, whereas the last few dimensions are of minor importance for query processing. Indexes storing KL-transformed data tend to have the following properties.

—The last few attributes are never used for cutting branches in query processing. Therefore, it is not useful to split the data space in the corresponding dimensions.

—Branching according to the first few attributes should be performed as early as possible, that is, in the topmost levels of the index. Then the extension of the regions of lower levels (especially of data pages) is often zero in these dimensions.

Regions of the TV-tree are described using so-called telescope vectors (TV), that is, vectors that may be dynamically shortened. A region has k inactive dimensions and α active dimensions. The inactive dimensions form the greatest common prefix of the vectors stored in the subtree. Therefore, the extension of the region is zero in these dimensions. In the α active dimensions, the region has the form of an Lp-sphere, where p may be 1, 2, or ∞. The region has an infinite extension in the remaining dimensions, which are supposed either to be active in the lower levels of the index or to be of minor importance for query processing. Figure 28 depicts the extension of a telescope vector in space.

The region description comprises α floating point values for the coordinates of the center point in the active dimensions and one float value for the radius. The coordinates of the inactive dimensions are stored in higher levels of the index (exactly in the level where a dimension turns from active into inactive). To achieve a uniform capacity of directory nodes, the number α of active dimensions is constant in all pages. The concept of telescope vectors increases the capacity of the directory pages. It was experimentally determined that a low number of active dimensions (α = 2) yields the best search performance.

The insert algorithm of the TV-tree chooses the branch to insert a point according to the following criteria (with decreasing priority):

—minimum increase of the number of overlapping regions,
—minimum decrease of the number of inactive dimensions,
—minimum increase of the radius, and
—minimum distance to the center.
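The decreasing-priority rule amounts to a lexicographic comparison. A minimal sketch, in which each candidate branch carries the precomputed effects of inserting the point into it (the field names are illustrative, not from the paper):

```python
def choose_branch(candidates):
    """Pick the branch minimizing the TV-tree criteria in priority order.
    Each candidate is a dict of precomputed insertion effects."""
    return min(candidates, key=lambda c: (c["overlap_increase"],
                                          c["inactive_dims_lost"],
                                          c["radius_increase"],
                                          c["center_distance"]))
```

Python's tuple comparison evaluates the criteria left to right, so a later criterion only decides when all earlier ones tie, which is exactly the stated priority scheme.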

To cope with page overflows, the authors propose performing a reinsert operation, as in the R*-tree. The split algorithm determines the two seed-points (seed-regions in the case of a directory page) having the least common prefix or (in case of doubt) having maximum distance. The objects are then inserted into one of the new subtrees using the above criteria for the subtree choice in insert processing, while storage utilization guarantees are considered.

The authors report a good speedup in comparison to the R*-tree when applying the TV-tree to data that fulfill the precondition stated at the beginning of this section. Other experiments [Berchtold et al. 1996], however, show that the X-tree and the LSDh-tree outperform the TV-tree on uniform or other real data (not amenable to the KL transformation).

6.6. SR-Tree

The SR-tree [Katayama and Satoh 1997] can be regarded as the combination of the R*-tree and the SS-tree. It uses the intersection solid between a rectangle and a sphere as the page region. The rectangular part is, as in R-tree variants, the minimum bounding rectangle of all points stored in the corresponding subtree. The spherical part is, as in the SS-tree, the minimum sphere around the centroid point of the stored objects. Figure 29 depicts the resulting geometric object. Regions of SR-trees have the most complex description among all index structures presented in this section: they comprise 2d floating point values for the MBR and d + 1 floating point values for the sphere.

The motivation for using a combination of sphere and rectangle, presented by the authors, is that according to an analysis presented in White and Jain [1996], spheres are basically better suited for processing nearest-neighbor and range queries using the L2 metric. On the other hand, spheres are difficult to maintain and tend to produce much overlap in splitting, as depicted previously in Figure 27. The authors therefore believe that a combination of R-tree and SS-tree will overcome both disadvantages.

The authors define the following function as the distance between a query point q and a region R:

MINDIST(q, R) = max(MINDIST(q, R.MBR), MINDIST(q, R.Sphere)).

Fig. 28. Telescope vectors.

Fig. 29. Page regions of an SR-tree.

Fig. 30. Incorrect MINDIST in the SR-tree.

This is not the correct minimum distance to the intersection solid, as depicted in Figure 30. Both distances, to the MBR and to the sphere (meeting the corresponding solids at the points MMBR and MSphere, respectively), are smaller than the distance to the intersection solid, which is met at the point MR where the sphere intersects the rectangle. However, it can be shown that the above function MINDIST(q, R) is a lower bound of the correct distance function. Therefore, it is guaranteed that processing of range and nearest-neighbor queries produces no false dismissals. Still, the efficiency can be worsened by the incorrect distance function. The MAXDIST function can be defined as the minimum of the MAXDIST functions applied to the MBR and the sphere, although a similar error is made as in the definition of MINDIST. Since no MINMAXDIST definition exists for spheres, the MINMAXDIST function of the MBR must be applied. This is also correct in the sense that no false dismissals are produced, but in this case no knowledge about the sphere is exploited at all. Some potential for performance increase is wasted.
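The lower-bound property is easy to see in code: each component MINDIST underestimates the distance to the intersection solid, so their maximum is still an underestimate. A sketch with illustrative function names:

```python
import math

def mindist_mbr(q, lows, highs):
    """MINDIST from q to a minimum bounding rectangle: for each dimension,
    the distance to the nearest face (0 if q lies between the bounds)."""
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(q, lows, highs)))

def mindist_sphere(q, center, radius):
    """MINDIST from q to a sphere region (0 if q lies inside)."""
    return max(0.0, math.dist(q, center) - radius)

def sr_mindist(q, lows, highs, center, radius):
    """SR-tree lower-bound distance: maximum of both component MINDISTs.
    Underestimates the true distance to the intersection solid."""
    return max(mindist_mbr(q, lows, highs),
               mindist_sphere(q, center, radius))
```

Because the intersection solid is contained in both the MBR and the sphere, any point of it is at least as far from q as either component MINDIST, which is why the maximum of the two is a safe pruning bound.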

Using the definitions above, range and nearest-neighbor query processing using both the RKV algorithm and the HS algorithm are possible.

Insert processing and the split algorithm are taken from the SS-tree and only modified in a few details of minor importance. In addition to the algorithms for the SS-tree, the MBRs have to be updated and determined after inserts and node splits. Information about the MBRs is considered neither in the choice of branches nor in the determination of the split.

The reported performance results, compared to the SS-tree and the R*-tree, suggest that the SR-tree outperforms both index structures. It is, however, open whether the SR-tree outperforms the X-tree or the LSDh-tree. No experimental comparison has been done yet to the best of the authors' knowledge. Comparing the index structures indirectly, by comparing both to the performance of the R*-tree, we could draw the conclusion that the SR-tree does not reach the performance of the LSDh-tree or the X-tree.

6.7. Space Filling Curves

Space filling curves (for an overview see Sagan [1994]) like Z-ordering [Morton 1966; Finkel and Bentley 1974; Abel and Smith 1983; Orenstein and Merret 1984; Orenstein 1990], Gray Codes [Faloutsos 1985, 1988], or the Hilbert curve [Faloutsos and Roseman 1989; Jagadish 1990; Kamel and Faloutsos 1993] are mappings from a d-dimensional data space (original space) into a one-dimensional data space (embedded space). Using space filling curves, distances are not exactly preserved, but points that are close to each other in the original space are likely to be close to each other in the embedded space. Therefore, these mappings are called distance-preserving mappings.

Z-ordering is defined as follows. The data space is first partitioned into two halves of identical volume, perpendicular to the d0-axis. The volume on the side of the lower d0-values gets the name 〈0〉 (as a bit string); the other volume gets the name 〈1〉. Then each of the volumes is partitioned perpendicular to the d1-axis, and the resulting subpartitions of 〈0〉 get the names 〈00〉 and 〈01〉, and the subpartitions of 〈1〉 get the names 〈10〉 and 〈11〉, respectively. When all axes have been used for splitting, d0 is used for a second split, and so on. The process stops when a user-defined basic resolution br is reached. Then we have a total number of 2^br grid cells, each numbered with an individual bit string. If only grid cells with the basic resolution br are considered, all bit strings have the same length and can therefore be interpreted as binary representations of integer numbers. The other space filling curves are defined similarly, but the numbering scheme is slightly more sophisticated, so that more neighboring cells get subsequent integer numbers. Some two-dimensional examples of space filling curves are depicted in Figure 31.

Fig. 31. Examples of space filling curves.

ACM Computing Surveys, Vol. 33, No. 3, September 2001.

Datapoints are transformed by assigning to them the number of the grid cell in which they are located. Without presenting the details, we let SFC(p) be the function that assigns p to the corresponding grid cell number. Vice versa, SFC−1(c) returns the corresponding grid cell as a hyperrectangle. Then any one-dimensional indexing structure capable of processing range queries can be applied for storing SFC(p) for every point p in the database. We assume in the following that a B+-tree [Comer 1979] is used.
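As an illustration, a minimal sketch of SFC for the Z-ordering case: the grid cell number of a point is obtained by interleaving the bits of its quantized coordinates. The function name and the round-robin bit order are our choices for illustration, not taken from the literature cited above.

```python
def z_order_sfc(grid_coords, bits_per_dim):
    """Z-ordering sketch: interleave the coordinate bits, most
    significant bit first, cycling through the dimensions d0, d1, ...
    grid_coords are integer cell coordinates in [0, 2**bits_per_dim)."""
    code = 0
    for bit in range(bits_per_dim - 1, -1, -1):
        for x in grid_coords:
            code = (code << 1) | ((x >> bit) & 1)
    return code
```

With bits_per_dim = 2 in two dimensions this numbers the 16 cells of the basic resolution; under this bit order, for example, the cell (1, 1) receives number 3 and the cell (3, 3) receives number 15.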

Processing of insert and delete operations and exact match queries is very simple because the points inserted or sought merely have to be transformed using the SFC function.

In contrast, range and nearest-neighbor queries are based on distance calculations of page regions, which have to be determined accordingly. In B-trees, before a page is accessed, only the interval I = [lb..ub] of values in this page is known. Therefore, the page region is the union of all grid cells having a cell number between lb and ub. The region of an index based on a space filling curve is thus a combination of rectangles. Based on this observation, we can define a corresponding MINDIST and, analogously, a MAXDIST function:

MINDIST(q, I) = min_{lb ≤ c ≤ ub} {MINDIST(q, SFC−1(c))}
MAXDIST(q, I) = max_{lb ≤ c ≤ ub} {MAXDIST(q, SFC−1(c))}.

Again, no MINMAXDIST function can be provided because there is no minimum bounding property to exploit. The question is how these functions can be evaluated efficiently, without enumerating all grid cells in the interval [lb..ub]. This is possible by splitting the interval recursively into two parts [lb..s[ and [s..ub], where s has the form 〈p100...00〉. Here, p stands for the longest common prefix of lb and ub. Then we determine the MINDIST and the MAXDIST to the rectangular blocks numbered with the bit strings 〈p0〉 and 〈p1〉. Any interval having a MINDIST greater than the MAXDIST of any other interval, or greater than the MINDIST of any terminal interval (see below), can be excluded from further consideration. The decomposition of an interval stops when the interval covers exactly one rectangle; such an interval is called a terminal interval. MINDIST(q, I) is then the minimum among the MINDISTs of all terminal intervals. An example is presented in Figure 32. The shaded area is the page region, a set of contiguous grid cell values I. In the first step, the interval is split into two parts I1 and I2, determining the MINDIST and MAXDIST (not depicted) of the surrounding rectangles. I1 is terminal, as it comprises a rectangle. In the second step, I2 is split into I21 and I22, where I21 is terminal. Since the MINDIST to I21 is smaller than the other two MINDIST values, I1 and I22 are discarded. Therefore, MINDIST(q, I21) is equal to MINDIST(q, I).

Fig. 32. MINDIST determination using space filling curves.
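The recursive decomposition into terminal intervals can be sketched as follows. For simplicity, this version splits the enclosing block at its midpoint (which is exactly the split at s = 〈p100...00〉) and omits the MINDIST-based pruning; the function name is ours.

```python
def terminal_intervals(lb, ub, lo, hi):
    """Decompose the cell interval [lb, ub] into terminal intervals,
    i.e., intervals covering exactly one rectangular block of the full
    code range [lo, hi] (hi - lo + 1 must be a power of two)."""
    if ub < lo or hi < lb:
        return []                  # no overlap with this block
    if lb <= lo and hi <= ub:
        return [(lo, hi)]          # block completely covered: terminal
    mid = (lo + hi) // 2           # split at <p100...00>
    return (terminal_intervals(lb, ub, lo, mid)
            + terminal_intervals(lb, ub, mid + 1, hi))
```

MINDIST(q, I) is then the minimum of MINDIST(q, ·) over the rectangles corresponding to the returned intervals. For example, decomposing [2..11] within the 16-cell range [0..15] yields the blocks [2..3], [4..7], and [8..11].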

A similar algorithm to determine MAXDIST(q, I) would exchange the roles of MINDIST and MAXDIST.

6.8. Pyramid-Tree

The Pyramid-tree [Berchtold et al. 1998b] is an index structure that, similar to the Hilbert technique, maps a d-dimensional point into a one-dimensional space and uses a B+-tree to index the one-dimensional space. Obviously, queries have to be translated in the same way. In the data pages of the B+-tree, the Pyramid-tree stores both the d-dimensional points and the one-dimensional key. Thus, no inverse transformation is required, and the refinement step can be done without lookups to another file. The specific mapping used by the Pyramid-tree is called Pyramid-mapping. It is based on a special partitioning strategy that is optimized for range queries on high-dimensional data. The basic idea is to divide the data space such that the resulting partitions are shaped like the peels of an onion. Such partitions cannot be efficiently stored by R-tree-like or kd-tree-like index structures. Instead, the Pyramid-tree achieves the partitioning by first dividing the d-dimensional space into 2d pyramids having the center point of the space as their top. In a second step, the single pyramids are cut into slices parallel to the basis of the pyramid, forming the data pages. Figure 33 depicts this partitioning technique.

Fig. 33. Partitioning the data space into pyramids.

Fig. 34. Properties of pyramids: (a) numbering of pyramids; (b) point in pyramid.

This technique can be used to compute a mapping as follows. In the first step, we number the pyramids as shown in Figure 34(a). Given a point, it is easy to determine in which pyramid it is located. Then we determine the so-called height of the point within its pyramid, that is, the orthogonal distance of the point to the center point of the data space, as shown in Figure 34(b). In order to map a d-dimensional point into a one-dimensional value, we simply add the two numbers: the number of the pyramid in which the point is located and the height of the point within this pyramid.
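A minimal sketch of this mapping, assuming the data space is normalized to [0, 1]^d with center point (0.5, ..., 0.5); the function name is ours and ties between dimensions are broken arbitrarily:

```python
def pyramid_value(v):
    """Map a d-dimensional point v in [0,1]^d to its one-dimensional
    pyramid value: pyramid number plus height within the pyramid."""
    d = len(v)
    # the pyramid is decided by the dimension in which the point
    # deviates most from the center point
    j = max(range(d), key=lambda k: abs(0.5 - v[k]))
    i = j if v[j] < 0.5 else j + d   # pyramids 0..d-1 and d..2d-1
    height = abs(0.5 - v[j])         # orthogonal distance to the center
    return i + height                # height < 0.5 keeps the ranges disjoint
```

In two dimensions, for instance, (0.1, 0.5) lies in pyramid 0 at height 0.4 and maps to 0.4, while (0.9, 0.6) lies in pyramid 2 and maps to 2.4.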

Query processing is a nontrivial task on a Pyramid-tree because, for a given query range, we have to determine the


Table I. High-Dimensional Index Structures and Their Properties

Name | Region | Disjoint | Complete | Criteria for Insert | Criteria for Split | Reinsert
R-tree | MBR | No | No | Volume enlargement, volume | (Various algorithms) | No
R*-tree | MBR | No | No | Overlap enlargement, volume enlargement, volume | Surface area, overlap, dead space coverage | Yes
X-tree | MBR | No | No | Overlap enlargement, volume enlargement, volume | Split history, surface/overlap, dead space coverage | No
LSDh-tree | kd-tree region | Yes | No/Yes | (Unique due to complete, disjoint partitioning) | Cyclic change of dimensions, distinct values | No
SS-tree | Sphere | No | No | Proximity to centroid | Variance | Yes
TV-tree | Sphere with reduced dim. | No | No | Overlap of regions, inactive dim., radius of region, distance to center | Seeds with least common prefix, maximum distance | Yes
SR-tree | Intersection sphere/MBR | No | No | Proximity to centroid | Variance | Yes
Space filling curves | Union of rectangles | Yes | Yes | (Unique due to complete, disjoint partitioning) | According to space filling curve | No
Pyramid-tree | Trunks of pyramids | Yes | Yes | (Unique due to complete, disjoint partitioning) | According to Pyramid-mapping | No

affected pyramids and the affected heights within these pyramids. The details of how this can be done are explained in Berchtold et al. [1998b]. Although the details of the algorithm are hard to understand, it is not computationally hard; rather, it consists of a variety of cases that have to be distinguished and of simple computations.

The Pyramid-tree is the only index structure known thus far that is not affected by the so-called curse of dimensionality. This means that, for uniform data and range queries, the performance of the Pyramid-tree even improves if one increases the dimensionality of the data space. An analytical explanation of this phenomenon is given in Berchtold et al. [1998b].

6.9. Summary

Table I shows the index structures described above and their most important properties. The first column contains the name of the index structure, the second shows which geometrical region is represented by a page, and the third and fourth columns show whether the index structure provides a disjoint and complete partitioning of the data space. The last three columns describe the algorithms used: the strategy used to insert new data items (column 5), the criteria used to divide the objects into subpartitions in case of an overflow (column 6), and whether the insert algorithm uses the concept of forced reinserts (column 7).

Since, so far, no extensive and objective comparison between the different index structures has been published, only structural arguments can be used to compare the different approaches. Experimental comparisons tend to depend highly on the data used in the experiments. Even higher is the influence of seemingly minor parameters such as the size and location of queries, or their statistical distribution. The higher the dimensionality of the data, the more these influences lead to differing results. Thus we provide a comparison among the indexes listing only properties, not trying to say anything about the "overall" performance of a single index. In fact, most probably, there is no overall performance; rather, one index will


Table II. Qualitative Comparison of High-Dimensional Index Structures

Name | Problems in High-D | Supported Query Types | Locality of Node Splits | Storage Utilization | Fanout / Size of Index Entries
R-tree | Poor split algorithm leads to deteriorated directories | NN, region, range | Yes | Poor | Poor, linearly dimension dependent
R*-tree | Ditto | NN, region, range | Yes | Medium | Poor, linearly dimension dependent
X-tree | High probability of queries overlapping MBRs leads to poor performance | NN, region, range | Yes | Medium | Poor, linearly dimension dependent
LSDh-tree | Changing data distribution deteriorates the directory | NN, region, range | No | Medium | Very good, dimension independent
SS-tree | High overlap in directory | NN | Yes | Medium | Very good, dimension independent
TV-tree | Only useful for specific data | NN | Yes | Medium | Poor, somewhat dimension dependent
SR-tree | Very large directory sizes | NN | Yes | Medium | Very poor, linearly dimension dependent
Space filling curves | Poor space partitioning | NN, region, range | Yes | Medium | As good as B-tree, dimension independent
Pyramid-tree | Problems with asymmetric queries | Region, range | Yes | Medium | As good as B-tree, dimension independent

outperform other indexes in a special situation, whereas this index is quite useless for other configurations of the database. Table II shows such a comparison. The first column lists the name of the index; the second column explains the biggest problem of this index when the dimension increases. The third column lists the supported types of queries. In the fourth column, we show whether a split in the directory causes "forced splits" on lower levels of the directory. The fifth column shows the storage utilization of the index, which is only a statistical value depending on the type of data and, sometimes, even on the order of insertion. The last column describes the fanout in the directory, which in turn depends on the size of a single entry in a directory node.

7. IMPROVEMENTS, OPTIMIZATIONS, AND FUTURE RESEARCH ISSUES

During the past years, a significant amount of work has been invested not in developing new index structures but in improving the performance of existing index structures. As a result, a variety of techniques has been proposed for using or tuning index structures. In this section, we present a selection of those techniques. Furthermore, we point out a selection of problems that have not yet been addressed in the context of high-dimensional indexing, or whose solutions cannot be considered sufficient.

Tree-Striping

From the variety of cost models that have been developed, one might conclude that if the data space has a sufficiently high dimensionality, no index structure can succeed. This has been contradicted by the development of index structures that are not severely affected by the dimensionality of the data space. On the other hand, one has to be very careful in judging the implications of a specific cost model. A lesson all researchers in the area of high-dimensional index structures learned was


that things are very sensitive to the change of parameters. A model of nearest-neighbor queries cannot directly be used to make any claims about the behavior in the case of range queries. Still, the research community agreed that in the case of nearest-neighbor queries, there exists a dimension above which a sequential scan will be faster than any indexing technique for most relevant data distributions.

Tree-striping is a technique that tries to tackle the problem from a different perspective. If it is hard to solve the d-dimensional problem of query processing, why not try to solve k l-dimensional problems, where k · l = d? The specific work presented in Berchtold et al. [2000c] focuses on the processing of range queries in a high-dimensional space. It generalizes the well-known inverted lists and multidimensional indexing approaches. A theoretical analysis of the generalized technique shows that both inverted lists and multidimensional indexing approaches are far from optimal. A consequence of the analysis is that the use of a set of multidimensional indexes provides considerable improvements over one d-dimensional index (multidimensional indexing) or d one-dimensional indexes (inverted lists). The basic idea of tree-striping is to use the optimal number k of lower-dimensional indexes, determined by a theoretical analysis, for efficient query processing. A given query is split into k lower-dimensional queries that are processed independently; in a final step, the single results are merged. As the merging step also involves I/O costs, and these costs increase with a decreasing dimensionality of a single index, there exists an optimal dimensionality for the single indexes that can be determined analytically. Note that tree-striping has serious limitations, especially for nearest-neighbor queries and skewed data, where, in many cases, the d-dimensional index performs better than any lower-dimensional index.
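A toy sketch of the tree-striping idea for range queries: split the d attributes into k groups, answer the subquery for each group independently (a linear scan stands in for a lower-dimensional index here), and merge the partial results by intersection. The round-robin attribute grouping and the function name are illustrative assumptions.

```python
def striped_range_query(points, lo, hi, k):
    """Answer the range query [lo, hi] with k independent
    lower-dimensional subqueries whose results are merged."""
    d = len(lo)
    groups = [range(g, d, k) for g in range(k)]   # round-robin attribute split
    result = None
    for dims in groups:
        # subquery: check only the dimensions belonging to this group
        hits = {i for i, p in enumerate(points)
                if all(lo[j] <= p[j] <= hi[j] for j in dims)}
        result = hits if result is None else result & hits
    return sorted(result)
```

Only points qualifying in every lower-dimensional subquery survive the merge, so the result equals that of the full d-dimensional range query.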

Voronoi Approximations

In another approach [Berchtold et al. 1998c, 2000d] to overcome the curse of dimensionality for nearest-neighbor search, the results of any nearest-neighbor search are precomputed. This corresponds to a computation of the Voronoi cell of each datapoint. The Voronoi cell of a point p contains all points that have p as their nearest neighbor. In high-dimensional spaces, the exact computation of a Voronoi cell is computationally very hard. Thus, rather than computing exact Voronoi cells, the algorithm stores conservative approximations of the Voronoi cells in an index structure that is efficient for high-dimensional data spaces. As a result, nearest-neighbor search corresponds to a simple point query on the index structure. Although the technique is based on a precomputation of the solution space, it is dynamic; that is, it supports insertions of new datapoints. Furthermore, an extension of the technique to a k-nearest-neighbor search is given in Berchtold et al. [2000d].

Parallel Nearest-Neighbor Search

Most similarity search techniques map the data objects into some high-dimensional feature space. The similarity search then corresponds to a nearest-neighbor search in the feature space, which is computationally very intensive. In Berchtold et al. [1997a], the authors present a new parallel method for fast nearest-neighbor search in high-dimensional feature spaces. The core problem of designing a parallel nearest-neighbor algorithm is to find an adequate distribution of the data onto the disks. Unfortunately, the known declustering methods do not perform well for high-dimensional nearest-neighbor search. In contrast, the proposed method has been optimized based on the special properties of high-dimensional spaces and therefore provides a near-optimal distribution of the data items among the disks. The basic idea of this data declustering technique is to assign the buckets corresponding to different quadrants of the data space to different disks. The authors show that their technique, in contrast to other declustering methods, guarantees that all buckets corresponding to neighboring quadrants are assigned to different


disks. The specific mapping of points to disks is done by the following formula:

col(c) = XOR_{i=0}^{d−1} g(i),  where g(i) = i + 1 if c_i = 1, and g(i) = 0 otherwise.

The input is a bit string defining the quadrant in which the point to be declustered is located. However, not just any number of disks may be used for this declustering technique; in fact, the number required is linear in the number of dimensions. Therefore, the authors present an extension of their technique adapted to an arbitrary number of disks. A further extension is a recursive declustering technique that allows an improved adaptation to skewed and correlated data distributions.
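The declustering formula above can be sketched as follows. Neighboring quadrants differ in exactly one bit i, so their disk numbers differ by XOR with i + 1 ≠ 0 and are therefore always distinct; the function name is ours.

```python
def disk_number(c):
    """XOR-based declustering sketch: c is the quadrant bit string,
    c[i] = 1 if the quadrant lies in the upper half of dimension i."""
    col = 0
    for i, bit in enumerate(c):
        if bit == 1:
            col ^= i + 1        # XOR of (i + 1) over all set bits
    return col
```

For example, in three dimensions the quadrants 〈100〉 and 〈110〉 are neighbors and receive the disk numbers 1 and 1 XOR 2 = 3, respectively.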

An approach to similarity query processing using disk arrays is presented in Papadopoulos and Manolopoulos [1998]. The authors propose two new algorithms for nearest-neighbor search on a single processor with multiple disks. Their solution relies on a well-known page distribution technique for low-dimensional data spaces [Kamel and Faloutsos 1992] called a proximity index. Upon a split, the MBR of a newly created node is compared with the MBRs stored in its father node (i.e., its siblings). The new node is assigned to the disk that stores the "least proximal" pages with respect to the new page region; thus the selected disk contains sibling nodes that are far from the new node. The first algorithm, called full parallel similarity search (FPSS), determines the threshold sphere (cf. Figure 35), an upper bound of the nearest-neighbor distance according to the maximum distance between the query point and the nearest page region. Then, all pages that are not pruned by the threshold sphere are called in by a parallel request to all disks. The second algorithm, candidate reduction similarity search (CRSS), applies a heuristic that leads to an intermediate form between depth-first and breadth-first search of the index. Pages that are completely contained in the threshold sphere are processed with a higher priority than pages that are merely intersected by it. The authors compare FPSS and CRSS with a (nonexistent) optimal parallel algorithm that knows the distance of the nearest neighbor in advance, and report up to 100% more page accesses of CRSS compared to the optimal algorithm. The same authors also propose a solution for shared-nothing parallel architectures [Papadopoulos and Manolopoulos 1997a]. Their architecture distributes the data pages of the index over the secondary servers, while the complete directory is held in the primary server. Their static page distribution strategy is based on a fractal curve (sorting according to the Hilbert value of the centroid of the MBR). The k-nn algorithm first performs a depth-first search in the directory. When the bottom level of the directory is reached, a sphere around the query point is determined that encloses as many data pages as required to guarantee that k points are stored in them (i.e., assuming that the page capacity is ≥ k, the sphere is chosen such that one page is completely contained). A parallel range query is then performed, first accessing a smaller number of data pages obtained by a cost model.

Fig. 35. The threshold sphere for FPSS and CRSS.

Compression Techniques

Recently, the VA-file [Weber et al. 1998] was developed, an index structure that is actually not an index structure. Based on the cost model proposed in Berchtold et al. [1997b], the authors prove that, under certain assumptions, above a certain dimensionality no index structure can process a nearest-neighbor query efficiently. Therefore, they suggest accelerating the sequential scan by the use of data compression.

ACM Computing Surveys, Vol. 33, No. 3, September 2001.

Page 45: Searching in high dimensional spaces index structures for improving the performance of multimedia databases

366 C. Bohm et al.

The basic idea of the VA-file is to keep two files: a bit-compressed, quantized version of the points, and their exact representation. Both files are unsorted; however, the positions of the points in the two files agree.

The quantization of the points is determined by an irregular grid laid over the data space. The resolution of the grid in each dimension corresponds to 2^b, where b is the number of bits per dimension that are used to approximate the coordinates. The grid lines correspond to the quantiles of the projections of the points onto the corresponding axes. These quantiles are assumed to change rarely; changing the quantiles requires a reconstruction of the compressed file. k-nearest-neighbor queries are processed by the multistep paradigm: the quantized points are loaded into main memory by a sequential scan (filter step), and candidates that cannot be pruned are refined, that is, their exact coordinates are called in from the second file. Several access strategies for timing filter and refinement steps have been proposed. Basically, the speedup of the VA-file compared to a simple sequential scan corresponds to the compression rate, because reading large files sequentially from disk yields a linear time complexity with respect to the file length. The computational effort of determining distances between the query point and the quantized datapoints is also improved compared to the sequential scan, by precomputing the squared distances between the query point and the grid lines. CPU speedups, however, do not yield large factors and are independent of the compression rate. The most important overhead in query processing is the refinements, each of which requires an expensive random disk access. With decreasing resolution, the number of points to be refined increases, thus limiting the compression ratio. The authors report five to six bits per dimension to be optimal.
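The filter step can be sketched as follows; for brevity, a regular grid replaces the quantile-based grid of the VA-file, and the function names are ours.

```python
import math

def approximate(point, b):
    """Quantize each coordinate in [0, 1) to a b-bit cell number."""
    cells = 1 << b
    return [min(int(x * cells), cells - 1) for x in point]

def lower_bound_dist(query, approx, b):
    """Smallest possible distance between the query point and any
    point inside the approximation cell (used to prune candidates)."""
    cells = 1 << b
    s = 0.0
    for q, c in zip(query, approx):
        lo, hi = c / cells, (c + 1) / cells   # cell boundaries in this dim
        if q < lo:
            s += (lo - q) ** 2
        elif q > hi:
            s += (q - hi) ** 2
    return math.sqrt(s)
```

A candidate whose lower-bound distance already exceeds the current kth-nearest distance is pruned without touching its exact representation in the second file.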

There are some major drawbacks of the VA-file. First, the deterioration of index structures is much more prevalent in artificial data than in data sets from real-world applications; for such data, index structures are efficiently applicable for much higher dimensions. The second drawback is the number of bits per dimension, which is a system parameter. Unfortunately, the authors do not provide any model or guideline for the selection of a suitable bit rate. To overcome these drawbacks, the IQ-tree has recently been proposed by Berchtold et al. [2000a], which is a three-level tree index structure exploiting quantization (cf. Figure 36). The first level is a flat directory consisting of MBRs and the corresponding pointers to pages on the second level. The pages on the second level contain the quantized versions of the datapoints. In contrast to the VA-file, the quantization is not based on quantiles but is a regular decomposition of the page regions. The authors claim that regular quantization based on the page regions adapts to skewed and correlated data distributions equally well as quantiles do. The suitable compression rate is determined for each page independently, according to a cost model proposed in Berchtold et al. [2000b]. Finally, the bottom level of the IQ-tree contains the exact representation of the datapoints. For processing nearest-neighbor queries, the authors propose a fast index scan that essentially subsumes the advantages of indexes and scan-based methods. The algorithm collects accesses to neighboring pages and performs chained I/O requests. The length of such chains is determined according to a cost model. In situations where a sequential scan is clearly indicated, the algorithm degenerates automatically to the sequential scan. In other situations, where the search can be directly designated, the algorithm performs the priority search of the HS algorithm. In intermediate situations, the algorithm accesses chains of intermediate length, thus clearly outperforming both the sequential scan and the HS algorithm. The bottom level of the IQ-tree is accessed according to the usual multistep strategy.

Fig. 36. Structure of the IQ-tree.

Bottom-Up Construction

Usually, the performance of dynamically inserting a new datapoint into a multidimensional index structure is poor. The reason for this is that most structures have to consider multiple paths in the tree where the point could be inserted. Furthermore, split algorithms are complex and computationally intensive; for example, a single split in an X-tree might take up to the order of a second. Therefore, a number of bulk-load algorithms for multidimensional index structures have been proposed. Bulk-loading an index means building an index on an entire database in a single process, which can be done much more efficiently than inserting the points one at a time. Most of the bulk-load algorithms, such as the one proposed in van den Bercken et al. [1997], are not especially adapted to the case of high-dimensional data spaces. In Berchtold et al. [1998a], however, the authors proposed a new bulk-loading technique for high-dimensional indexes that exploits a priori knowledge of the complete data set to improve both construction time and query performance. The algorithm operates in a manner similar to the Quicksort algorithm and has an average runtime complexity of O(n log n). In contrast to other bulk-loading techniques, the query performance is additionally improved by optimizing the shape of the bounding boxes, by completely avoiding overlap, and by clustering the pages on disk. A sophisticated unbalanced split strategy is used, leading to a better space partitioning.
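A minimal Quicksort-style bulk-loading sketch, assuming a balanced median split for clarity; note that the algorithm of Berchtold et al. [1998a] uses a sophisticated unbalanced split strategy instead, and that the sort used here to find the median would be replaced by linear-time partitioning to obtain the O(n log n) average complexity mentioned above.

```python
def bulk_load(points, capacity, dim=0):
    """Recursively partition the point set at the median of a split
    dimension (cycled round-robin) until each partition fits into
    one data page; returns the list of data pages."""
    if len(points) <= capacity:
        return [list(points)]            # one finished data page
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2                  # balanced split for simplicity
    nxt = (dim + 1) % len(points[0])     # next split dimension
    return (bulk_load(pts[:mid], capacity, nxt)
            + bulk_load(pts[mid:], capacity, nxt))
```

Because the whole data set is known in advance, the pages produced this way are non-overlapping by construction and can be written to disk in clustered order.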

Another important issue would be to apply the knowledge that has been aggregated to other areas such as data reduction, data mining (e.g., clustering), or visualization, where people have to deal with tens to hundreds of attributes and therefore face a high-dimensional data space. Most of the lessons learned also apply to these areas. Examples of successful approaches making use of these side-effects are Agrawal et al. [1998] and Berchtold et al. [1998d].

Future Research Issues

Although significant progress has been made in understanding the nature of high-dimensional spaces and in developing techniques that can operate in these spaces, there are still many open questions.

A first problem is that most of the understanding that the research community has developed during the last years is restricted to the case of uniform and independent data. Not only are all proposed indexing techniques optimized for this case; almost all theoretical considerations, such as cost models, are also restricted to this simple case. The interesting observation is that index structures do not suffer from "real" data. Rather, they nicely take advantage of nonuniform distributions. In fact, a uniform distribution seems to be the worst thing that can happen to an index structure. One reason for this effect is that often the data are located only in a subspace of the data space, and if the index adapts to this situation, it actually behaves as if the data were lower-dimensional. A promising approach to understanding and explaining this effect theoretically has been followed in Faloutsos and Kamel [1994] and Böhm [1998], where the concept of the fractal dimension is applied. However, even this approach cannot cover "real" effects such as local skewness.

A second interesting research issue is the question of which partitioning strategies perform well in high-dimensional spaces. As previous research (e.g., the Pyramid-tree) has shown, the partitioning does not have to be balanced to be optimal for certain queries. The open question is what an optimal partitioning schema for nearest-neighbor queries would be. Does it need to be balanced or better unbalanced? Is it based upon bounding boxes or on pyramids?


How does the optimum change when the data set grows in size or dimensionality? There are many open questions that need to be answered.

A third open research issue is the approximate processing of nearest-neighbor queries. The first question is what a useful definition of approximate nearest-neighbor search in high-dimensional spaces is, and how the fuzziness introduced by the definition may be exploited for efficient query processing. A first approach to approximate nearest-neighbor search has been proposed in Gionis et al. [1999].

Other interesting research issues include the parallel processing of nearest-neighbor queries in high-dimensional space and the data mining and visualization of high-dimensional spaces. The parallel processing aims at finding appropriate declustering and query processing strategies to overcome the difficulties in high-dimensional spaces. A first approach in this direction has already been presented in Berchtold et al. [1997a]. The efforts in the area of data mining and visualization of high-dimensional feature spaces (for an example see Hinneburg and Keim [1998]) try to understand and explore high-dimensional feature spaces. Also, the application of compression techniques to improve the query performance is an interesting and promising research area; a first approach, the VA-file, has recently been proposed in Weber et al. [1998].

8. CONCLUSIONS

Research in high-dimensional index structures has been very active and productive over the past few years, resulting in a multitude of interesting new approaches for indexing high-dimensional data. Since it is very difficult to follow up on this discussion, in this survey we have tried to provide insight into the effects occurring when indexing high-dimensional spaces and to give an overview of the principal ideas of the index structures that have been proposed to overcome these problems. There are still a number of interesting open research problems, and we expect the field to remain a fruitful research area over the next years. Due to the increasing importance of multimedia databases in various application areas, and due to the remarkable results of the research, we also expect the research on high-dimensional indexing to have a major impact on many practical applications and commercial multimedia database systems.

APPENDIX

A. LEMMA 1

The RKV algorithm has a worst-case space complexity of O(log n).

PROOF. The only source of dynamic memory assignment in the RKV algorithm is the recursive calls of the function RKV algorithm. The recursion depth is at most equal to the height of the indexing structure. The height of all high-dimensional index structures presented in this section is of the complexity O(log n). Since a constant amount of memory (one data or directory page) is allocated in each call, Lemma 1 follows.
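The recursion pattern underlying this bound can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Node class, the mindist helper, and the data layout are hypothetical stand-ins for an R-tree-like structure with rectangular page regions.

```python
import math

class Node:
    """Hypothetical index node: a directory node holds (region, child)
    pairs; a leaf holds data points. A region is a pair of coordinate
    tuples (lower, upper)."""
    def __init__(self, children=None, points=None):
        self.children = children
        self.points = points

def mindist(query, region):
    """Lower bound of the distance from query to any point inside the
    rectangular region."""
    lo, hi = region
    return math.sqrt(sum(
        (l - q) ** 2 if q < l else ((q - h) ** 2 if q > h else 0.0)
        for q, l, h in zip(query, lo, hi)))

def rkv_search(node, query, best=(math.inf, None)):
    """Depth-first branch-and-bound search: the only dynamically held
    state is the recursion stack, whose depth is bounded by the tree
    height -- hence O(log n) space."""
    if node.points is not None:                  # data page: inspect points
        for p in node.points:
            d = math.dist(query, p)
            if d < best[0]:
                best = (d, p)
        return best
    # visit children by increasing MINDIST and prune hopeless subtrees
    for region, child in sorted(node.children,
                                key=lambda e: mindist(query, e[0])):
        if mindist(query, region) < best[0]:
            best = rkv_search(child, query, best)
    return best
```

Each stack frame holds one page of the path from the root, which is exactly the constant-per-call allocation the proof refers to.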

B. LEMMA 2

The HS algorithm has a space complexity of O(n) in the worst case.

PROOF. The following scenario describes the worst case. Query processing starts with the root in the APL. The root is replaced by its child nodes, which are on level h − 1 if h is the height of the index. All nodes on level h − 1 are replaced by their child nodes, and so on, until all data nodes are in the APL. At this state, it is possible that no data page has been excluded from the APL because no data point has been encountered yet. The situation described above occurs, for example, if all data objects are located on a sphere around the query point. Thus, all data pages are in the APL, and the APL is maximal because the APL grows only by replacing a page by its descendants. If all data pages are in the APL, it has a length of O(n).
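The APL behavior described above can be sketched with a priority queue ordered by MINDIST, which is also the mechanism behind the distance-ordered page accesses of Lemmas 4 and 5. This is a simplified sketch, not the original formulation: the node layout and names are hypothetical, assuming rectangular page regions.

```python
import heapq
import itertools
import math

def mindist(query, region):
    """MINDIST: lower bound of the distance from query to the
    rectangular region ((lower...), (upper...))."""
    lo, hi = region
    return math.sqrt(sum(
        (l - q) ** 2 if q < l else ((q - h) ** 2 if q > h else 0.0)
        for q, l, h in zip(query, lo, hi)))

def hs_nearest(root, query):
    """HS-style nearest-neighbor search over nodes given as
    ('leaf', [points]) or ('dir', [(region, child), ...]).
    The APL is a min-heap of (MINDIST, node) entries; in the worst
    case every data page is inserted before any point is seen, so
    the APL can grow to length O(n)."""
    counter = itertools.count()    # tie-breaker so heapq never compares nodes
    apl = [(0.0, next(counter), root)]             # active page list
    best = (math.inf, None)                        # closest point candidate
    while apl:
        dist, _, (kind, entries) = heapq.heappop(apl)
        if dist >= best[0]:        # no remaining page can hold a closer point
            break
        if kind == 'leaf':                         # data page: update candidate
            for p in entries:
                d = math.dist(query, p)
                if d < best[0]:
                    best = (d, p)
        else:                                      # replace page by its children
            for region, child in entries:
                heapq.heappush(apl, (mindist(query, region),
                                     next(counter), child))
    return best
```

Because the heap always yields the entry with minimum MINDIST, pages are loaded in increasing distance order, and the loop stops exactly when the candidate distance drops below every remaining MINDIST.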

ACM Computing Surveys, Vol. 33, No. 3, September 2001.



C. LEMMA 3

Let nndist be the distance between the query point and its nearest neighbor. All pages that intersect a sphere around the query point having a radius equal to nndist (the so-called nearest-neighbor sphere) must be accessed for query processing. This condition is necessary and sufficient.

PROOF.

1. Sufficiency: If all data pages intersecting the nn-sphere are accessed, then all points in the database with a distance less than or equal to nndist are known to the query processor. No point closer than the nearest known point can exist in the database.

2. Necessity: If a page region intersects the nearest-neighbor sphere but is not accessed during query processing, the corresponding subtree could include a point that is closer to the query point than the nearest-neighbor candidate. Therefore, accessing all intersecting pages is necessary.

D. LEMMA 4

The HS algorithm accesses pages in the order of increasing distance to the query point.

PROOF. Due to the lower-bounding property of page regions, the distance between the query point and a page region is always greater than or equal to the distance between the query point and the region of the parent of the page. Therefore, the minimum distance between the query point and any page in the APL can only increase or remain unchanged, never decrease, through the processing step of loading a page and replacing the corresponding APL entry. Since the active page with minimum distance is always accessed, the pages are accessed in the order of increasing distances to the query point.

E. LEMMA 5

The HS algorithm is optimal in terms of the number of page accesses.

PROOF. According to Lemma 4, the HS algorithm accesses pages in the order of increasing distance to the query point q. Let m be the lowest MINDIST in the APL. Processing stops if the distance of q to the cpc (closest point candidate) is less than m. Due to the lower-bounding property, processing of any page in the APL cannot encounter any point with a distance to q less than m. The distance between the cpc and q cannot fall below m during processing. Therefore, exactly the pages with a MINDIST less than or equal to the nearest-neighbor distance are processed by the HS algorithm. According to Lemma 3, these pages must be loaded by any correct nearest-neighbor algorithm. Thus, the HS algorithm yields an optimal number of page accesses.

REFERENCES

ABEL, D. AND SMITH, J. 1983. A data structure and algorithm based on a linear key for a rectangle retrieval problem. Comput. Vis. 24, 1–13.

AGRAWAL, R., FALOUTSOS, C., AND SWAMI, A. 1993. Efficient similarity search in sequence databases. In Proc. 4th Int. Conf. on Foundations of Data Organization and Algorithms, LNCS 730, 69–84.

AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., AND RAGHAVAN, P. 1998. Automatic subspace clustering of high-dimensional data for data mining applications. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Seattle), 94–105.

AGRAWAL, R., LIN, K., SAWHNEY, H., AND SHIM, K. 1995. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proc. 21st Int. Conf. on Very Large Databases, 490–501.

ALTSCHUL, S., GISH, W., MILLER, W., MYERS, E., AND LIPMAN, D. 1990. A basic local alignment search tool. J. Molecular Biol. 215, 3, 403–410.

AOKI, P. 1998. Generalizing “search” in generalized search trees. In Proc. 14th Int. Conf. on Data Engineering (Orlando, FL), 380–389.

AREF, W. AND SAMET, H. 1991. Optimization strategies for spatial query processing. In Proc. 17th Int. Conf. on Very Large Databases (Barcelona), 81–90.

ARYA, S. 1995. Nearest neighbor searching and applications. PhD thesis, University of Maryland, College Park, MD.

ARYA, S., MOUNT, D., AND NARAYAN, O. 1995. Accounting for boundary effects in nearest neighbor searching. In Proc. 11th Symp. on Computational Geometry (Vancouver, Canada), 336–344.

BAEZA-YATES, R., CUNTO, W., MANBER, U., AND WU, S. 1994. Proximity matching using fixed-queries trees. In Proc. Combinatorial Pattern Matching, LNCS 807, 198–212.

BAYER, R. AND MCCREIGHT, E. 1977. Organization and maintenance of large ordered indices. Acta Inf. 1, 3, 173–189.

BECKMANN, N., KRIEGEL, H.-P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Atlantic City, NJ), 322–331.

BELUSSI, A. AND FALOUTSOS, C. 1995. Estimating the selectivity of spatial queries using the correlation fractal dimension. In Proc. 21st Int. Conf. on Very Large Databases (Zurich), 299–310.

BENTLEY, J. 1975. Multidimensional search trees used for associative searching. Commun. ACM 18, 9, 509–517.

BENTLEY, J. 1979. Multidimensional binary search in database applications. IEEE Trans. Softw. Eng. 4, 5, 397–409.

BERCHTOLD, S. AND KEIM, D. 1998. High-dimensional index structures—Database support for next decade's applications. Tutorial, ACM SIGMOD Int. Conf. on Management of Data (Seattle, WA).

BERCHTOLD, S., BÖHM, C., BRAUNMÜLLER, B., KEIM, D., AND KRIEGEL, H.-P. 1997a. Fast parallel similarity search in multimedia databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data.

BERCHTOLD, S., BÖHM, C., JAGADISH, H., KRIEGEL, H.-P., AND SANDER, J. 2000a. Independent quantization: An index compression technique for high-dimensional data spaces. In Proc. 16th Int. Conf. on Data Engineering.

BERCHTOLD, S., BÖHM, C., KEIM, D., AND KRIEGEL, H.-P. 1997b. A cost model for nearest neighbor search in high-dimensional data space. In Proc. ACM PODS Symp. on Principles of Database Systems (Tucson, AZ).

BERCHTOLD, S., BÖHM, C., KEIM, D., AND KRIEGEL, H.-P. 2001. On optimizing processing of nearest neighbor queries in high-dimensional data space. In Proc. Conf. on Database Theory, 435–449.

BERCHTOLD, S., BÖHM, C., KEIM, D., KRIEGEL, H.-P., AND XU, X. 2000c. Optimal multidimensional query processing using tree striping. In Proc. Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), 244–257.

BERCHTOLD, S., BÖHM, C., AND KRIEGEL, H.-P. 1998a. Improving the query performance of high-dimensional index structures using bulk-load operations. In Proc. 6th Int. Conf. on Extending Database Technology (Valencia, Spain).

BERCHTOLD, S., BÖHM, C., AND KRIEGEL, H.-P. 1998b. The pyramid-technique: Towards indexing beyond the curse of dimensionality. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Seattle, WA), 142–153.

BERCHTOLD, S., ERTL, B., KEIM, D., KRIEGEL, H.-P., AND SEIDL, T. 1998c. Fast nearest neighbor search in high-dimensional spaces. In Proc. 14th Int. Conf. on Data Engineering (Orlando, FL).

BERCHTOLD, S., JAGADISH, H., AND ROSS, K. 1998d. Independence diagrams: A technique for visual data mining. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (New York), 139–143.

BERCHTOLD, S., KEIM, D., AND KRIEGEL, H.-P. 1996. The X-tree: An index structure for high-dimensional data. In Proc. 22nd Int. Conf. on Very Large Databases (Bombay), 28–39.

BERCHTOLD, S., KEIM, D., KRIEGEL, H.-P., AND SEIDL, T. 2000d. Indexing the solution space: A new technique for nearest neighbor search in high-dimensional space. IEEE Trans. Knowl. Data Eng., 45–57.

BEYER, K., GOLDSTEIN, J., RAMAKRISHNAN, R., AND SHAFT, U. 1999. When is “nearest neighbor” meaningful? In Proc. Int. Conf. on Database Theory, 217–235.

BÖHM, C. 1998. Efficiently indexing high-dimensional databases. PhD thesis, University of Munich, Germany.

BÖHM, C. 2000. A cost model for query processing in high-dimensional data spaces. To appear in ACM Trans. Database Syst.

BOZKAYA, T. AND OZSOYOGLU, M. 1997. Distance-based indexing for high-dimensional metric spaces. SIGMOD Rec. 26, 2, 357–368.

BRIN, S. 1995. Near neighbor search in large metric spaces. In Proc. 21st Int. Conf. on Very Large Databases (Zurich), 574–584.

BURKHARD, W. AND KELLER, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230–236.

CHEUNG, K. AND FU, A. 1998. Enhanced nearest neighbour search on the R-tree. SIGMOD Rec. 27, 3, 16–21.

CHIUEH, T. 1994. Content-based image indexing. In Proc. 20th Int. Conf. on Very Large Databases (Chile), 582–593.

CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M-tree: An efficient access method for similarity search in metric spaces. In Proc. 23rd Int. Conf. on Very Large Databases (Greece), 426–435.

CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998. A cost model for similarity queries in metric spaces. In Proc. 17th ACM Symp. on Principles of Database Systems (Seattle), 59–67.

CLEARY, J. 1979. Analysis of an algorithm for finding nearest neighbors in Euclidean space. ACM Trans. Math. Softw. 5, 2, 183–192.

COMER, D. 1979. The ubiquitous B-tree. ACM Comput. Surv. 11, 2, 121–138.

CORRAL, A., MANOLOPOULOS, Y., THEODORIDIS, Y., AND VASSILAKOPOULOS, M. 2000. Closest pair queries in spatial databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 189–200.


EASTMAN, C. 1981. Optimal bucket size for nearest neighbor searching in k-d trees. Inf. Proc. Lett. 12, 4.

EVANGELIDIS, G. 1994. The hBπ-tree: A concurrent and recoverable multi-attribute index structure. PhD thesis, Northeastern University, Boston, MA.

EVANGELIDIS, G., LOMET, D., AND SALZBERG, B. 1997. The hBπ-tree: A multiattribute index supporting concurrency, recovery and node consolidation. VLDB J. 6, 1, 1–25.

FALOUTSOS, C. 1985. Multiattribute hashing using gray codes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 227–238.

FALOUTSOS, C. 1988. Gray codes for partial match and range queries. IEEE Trans. Softw. Eng. 14, 1381–1393.

FALOUTSOS, C. AND GAEDE, V. 1996. Analysis of n-dimensional quadtrees using the Hausdorff fractal dimension. In Proc. 22nd Int. Conf. on Very Large Databases (Mumbai, India), 40–50.

FALOUTSOS, C. AND KAMEL, I. 1994. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In Proc. 13th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (Minneapolis, MN), 4–13.

FALOUTSOS, C. AND LIN, K.-I. 1995. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia data. In Proc. ACM SIGMOD Int. Conf. on Management of Data (San Jose, CA), 163–174.

FALOUTSOS, C. AND ROSEMAN, S. 1989. Fractals for secondary key retrieval. In Proc. 8th ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 247–252.

FALOUTSOS, C., BARBER, R., FLICKNER, M., AND HAFNER, J. 1994a. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 231–262.

FALOUTSOS, C., RANGANATHAN, M., AND MANOLOPOULOS, Y. 1994b. Fast subsequence matching in time-series databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 419–429.

FALOUTSOS, C., SELLIS, T., AND ROUSSOPOULOS, N. 1987. Analysis of object-oriented spatial access methods. In Proc. ACM SIGMOD Int. Conf. on Management of Data.

FINKEL, R. AND BENTLEY, J. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Inf. 4, 1, 1–9.

FREESTON, M. 1987. The BANG file: A new kind of grid file. In Proc. ACM SIGMOD Int. Conf. on Management of Data (San Francisco), 260–269.

FRIEDMAN, J., BENTLEY, J., AND FINKEL, R. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3, 3, 209–226.

GAEDE, V. 1995. Optimal redundancy in spatial database systems. In Proc. 4th Int. Symp. on Advances in Spatial Databases (Portland, ME), 96–116.

GAEDE, V. AND GÜNTHER, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170–231.

GIONIS, A., INDYK, P., AND MOTWANI, R. 1999. Similarity search in high dimensions via hashing. In Proc. 25th Int. Conf. on Very Large Databases (Edinburgh), 518–529.

GREENE, D. 1989. An implementation and performance analysis of spatial data access methods. In Proc. 5th IEEE Int. Conf. on Data Engineering.

GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Boston), 47–57.

HELLERSTEIN, J., KOUTSOUPIAS, E., AND PAPADIMITRIOU, C. 1997. On the analysis of indexing schemes. In Proc. 16th SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (Tucson, AZ), 249–256.

HELLERSTEIN, J., NAUGHTON, J., AND PFEFFER, A. 1995. Generalized search trees for database systems. In Proc. 21st Int. Conf. on Very Large Databases (Zurich), 562–573.

HENRICH, A. 1994. A distance-scan algorithm for spatial access structures. In Proc. 2nd ACM Workshop on Advances in Geographic Information Systems (Gaithersburg, MD), 136–143.

HENRICH, A. 1998. The LSDh-tree: An access structure for feature vectors. In Proc. 14th Int. Conf. on Data Engineering (Orlando, FL).

HENRICH, A., SIX, H.-W., AND WIDMAYER, P. 1989. The LSD-tree: Spatial access to multidimensional point and non-point objects. In Proc. 15th Int. Conf. on Very Large Databases (Amsterdam, The Netherlands), 45–53.

HINNEBURG, A. AND KEIM, D. 1998. An efficient approach to clustering in large multimedia databases with noise. In Proc. Int. Conf. on Knowledge Discovery in Databases (New York).

HINRICHS, K. 1985. Implementation of the grid file: Design concepts and experience. BIT 25, 569–592.

HJALTASON, G. AND SAMET, H. 1995. Ranking in spatial databases. In Proc. 4th Int. Symp. on Large Spatial Databases (Portland, ME), 83–95.

HJALTASON, G. AND SAMET, H. 1998. Incremental distance join algorithms for spatial databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 237–248.

HUTFLESZ, A., SIX, H.-W., AND WIDMAYER, P. 1988a. Globally order preserving multidimensional linear hashing. In Proc. 4th IEEE Int. Conf. on Data Engineering, 572–579.

HUTFLESZ, A., SIX, H.-W., AND WIDMAYER, P. 1988b. Twin grid files: Space optimizing access schemes. In Proc. ACM SIGMOD Int. Conf. on Management of Data.

JAGADISH, H. 1990. Linear clustering of objects with multiple attributes. In Proc. ACM SIGMOD Int. Conf. on Management of Data (Atlantic City, NJ), 332–342.

JAGADISH, H. 1991. A retrieval technique for similar shapes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 208–217.

JAIN, R. AND WHITE, D. 1996. Similarity indexing: Algorithms and performance. In Proc. SPIE Storage and Retrieval for Image and Video Databases IV (San Jose, CA), 62–75.

KAMEL, I. AND FALOUTSOS, C. 1992. Parallel R-trees. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 195–204.

KAMEL, I. AND FALOUTSOS, C. 1993. On packing R-trees. In Proc. Int. Conf. on Information and Knowledge Management (CIKM), 490–499.

KAMEL, I. AND FALOUTSOS, C. 1994. Hilbert R-tree: An improved R-tree using fractals. In Proc. 20th Int. Conf. on Very Large Databases, 500–509.

KATAYAMA, N. AND SATOH, S. 1997. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 369–380.

KNUTH, D. 1975. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, MA.

KORN, F. AND MUTHUKRISHNAN, S. 2000. Influence sets based on reverse nearest neighbor queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 201–212.

KORN, F., SIDIROPOULOS, N., FALOUTSOS, C., SIEGEL, E., AND PROTOPAPAS, Z. 1996. Fast nearest neighbor search in medical image databases. In Proc. 22nd Int. Conf. on Very Large Databases (Mumbai, India), 215–226.

KORNACKER, M. 1999. High-performance generalized search trees. In Proc. 25th Int. Conf. on Very Large Databases (Edinburgh).

KRIEGEL, H.-P. AND SEEGER, B. 1986. Multidimensional order preserving linear hashing with partial expansions. In Proc. Int. Conf. on Database Theory, LNCS 243, Springer-Verlag, New York.

KRIEGEL, H.-P. AND SEEGER, B. 1987. Multidimensional dynamic quantile hashing is very efficient for non-uniform record distributions. In Proc. 3rd Int. Conf. on Data Engineering, 10–17.

KRIEGEL, H.-P. AND SEEGER, B. 1988. PLOP-hashing: A grid file without directory. In Proc. 4th Int. Conf. on Data Engineering, 369–376.

KRISHNAMURTHY, R. AND WHANG, K.-Y. 1985. Multilevel grid files. IBM Research Center Report, Yorktown Heights, NY.

KUKICH, K. 1992. Techniques for automatically correcting words in text. ACM Comput. Surv. 24, 4, 377–440.

LIN, K., JAGADISH, H., AND FALOUTSOS, C. 1995. The TV-tree: An index structure for high-dimensional data. VLDB J. 3, 517–542.

LOMET, D. AND SALZBERG, B. 1989. The hB-tree: A robust multiattribute search structure. In Proc. 5th IEEE Int. Conf. on Data Engineering, 296–304.

LOMET, D. AND SALZBERG, B. 1990. The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Trans. Database Syst. 15, 4, 625–658.

MANDELBROT, B. 1977. Fractal Geometry of Nature. W.H. Freeman, New York.

MEHROTRA, R. AND GARY, J. 1993. Feature-based retrieval of similar shapes. In Proc. 9th Int. Conf. on Data Engineering.

MEHROTRA, R. AND GARY, J. 1995. Feature-index-based similar shape retrieval. In Proc. 3rd Working Conf. on Visual Database Systems.

MORTON, G. 1966. A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. IBM Ltd.

MUMFORD, D. 1987. The problem of robust shape descriptors. In Proc. 1st IEEE Int. Conf. on Computer Vision.

NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK, K. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1, 38–71.

ORENSTEIN, J. 1990. A comparison of spatial query processing techniques for native and parameter spaces. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 326–336.

ORENSTEIN, J. AND MERRET, T. 1984. A class of data structures for associative searching. In Proc. 3rd ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 181–190.

OTOO, E. 1984. A mapping function for the directory of a multidimensional extendible hashing. In Proc. 10th Int. Conf. on Very Large Databases, 493–506.

OUKSEL, M. 1985. The interpolation-based grid file. In Proc. 4th ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 20–27.

OUKSEL, A. AND MAYES, O. 1992. The nested interpolation-based grid file. Acta Inf. 29, 335–373.

PAGEL, B.-U., SIX, H.-W., TOBEN, H., AND WIDMAYER, P. 1993. Towards an analysis of range query performance in spatial data structures. In Proc. 12th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (Washington, DC), 214–221.

PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1997a. Nearest neighbor queries in shared-nothing environments. Geoinf. 1, 1, 1–26.

PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1997b. Performance of nearest neighbor queries in R-trees. In Proc. 6th Int. Conf. on Database Theory, LNCS 1186, Springer-Verlag, New York, 394–408.

PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1998. Similarity query processing using disk arrays. In Proc. ACM SIGMOD Int. Conf. on Management of Data.


RIEDEL, E., GIBSON, G., AND FALOUTSOS, C. 1998. Active storage for large-scale data mining and multimedia. In Proc. 24th Int. Conf. on Very Large Databases, 62–73.

ROBINSON, J. 1981. The k-d-B-tree: A search structure for large multidimensional dynamic indexes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 10–18.

ROUSSOPOULOS, N., KELLEY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 71–79.

SAGAN, H. 1994. Space Filling Curves. Springer-Verlag, New York.

SCHROEDER, M. 1991. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W.H. Freeman, New York.

SEEGER, B. AND KRIEGEL, H.-P. 1990. The buddy tree: An efficient and robust access method for spatial database systems. In Proc. 16th Int. Conf. on Very Large Databases (Brisbane), 590–601.

SEIDL, T. 1997. Adaptable similarity search in 3-D spatial database systems. PhD thesis, University of Munich, Germany.

SEIDL, T. AND KRIEGEL, H.-P. 1997. Efficient user-adaptable similarity search in large multimedia databases. In Proc. 23rd Int. Conf. on Very Large Databases (Athens).

SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C. 1987. The R+-tree: A dynamic index for multi-dimensional objects. In Proc. 13th Int. Conf. on Very Large Databases (Brighton, GB), 507–518.

SHAWNEY, H. AND HAFNER, J. 1994. Efficient color histogram indexing. In Proc. Int. Conf. on Image Processing, 66–70.

SHOICHET, B., BODIAN, D., AND KUNTZ, I. 1992. Molecular docking using shape descriptors. J. Comput. Chem. 13, 3, 380–397.

SPROULL, R. 1991. Refinements to nearest neighbor searching in k-dimensional trees. Algorithmica, 579–589.

STANOI, I., AGRAWAL, D., AND ABBADI, A. 2000. Reverse nearest neighbor queries for dynamic databases. In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 44–53.

STONEBRAKER, M., SELLIS, T., AND HANSON, E. 1986. An analysis of rule indexing implementations in data base systems. In Proc. Int. Conf. on Expert Database Systems.

THEODORIDIS, Y. AND SELLIS, T. 1996. A model for the prediction of R-tree performance. In Proc. 15th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (Montreal), 161–171.

UHLMANN, J. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Proc. Lett., 145–157.

VAN DEN BERCKEN, J., SEEGER, B., AND WIDMAYER, P. 1997. A general approach to bulk loading multidimensional index structures. In Proc. 23rd Int. Conf. on Very Large Databases (Athens).

WALLACE, T. AND WINTZ, P. 1980. An efficient three-dimensional aircraft recognition algorithm using normalized Fourier descriptors. Comput. Graph. Image Proc. 13, 99–126.

WEBER, R., SCHEK, H.-J., AND BLOTT, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. Int. Conf. on Very Large Databases (New York).

WHITE, D. AND JAIN, R. 1996. Similarity indexing with the SS-tree. In Proc. 12th Int. Conf. on Data Engineering (New Orleans).

YAO, A. AND YAO, F. 1985. A general approach to d-dimensional geometric queries. In Proc. ACM Symp. on Theory of Computing.

YIANILOS, P. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. 4th ACM-SIAM Symp. on Discrete Algorithms, 311–321.

YIANILOS, P. 1999. Excluded middle vantage point forests for nearest neighbor search. In Proc. DIMACS Implementation Challenge (Baltimore, MD).

Received August 1998; revised March 2000; accepted November 2000


