Range Reverse Nearest Neighbor Queries
Reuben Pereira, Abhinav Agshikar, Gaurav Agarwal, Pranav Keni
{reubpereira, abhi2chat, grvmast, pranvprk}@gmail.com
Abstract: Reverse nearest neighbor (RNN) queries have a broad application
base in decision support, profile-based marketing, resource allocation, data
mining, and related areas. Previous work on RNN, visible nearest neighbor and
visible RNN queries has considered only point queries, which are rare in
real-world settings. In this paper we introduce a novel variant of RNN
queries, Range Reverse Nearest Neighbor queries, which operate over a
region rather than a point.
Keywords: Data Mining, Query processing, Spatial Databases, Nearest Neigh-
bor Queries.
1 Introduction
Data Mining is a relatively young and interdisciplinary field of computer science
that discovers new patterns from large data sets involving methods at the intersection
of artificial intelligence, machine learning, statistics and database systems.
While the Reverse Nearest Neighbor (RNN) search problem, i.e. finding all objects
in a database that have a given query q among their k nearest neighbors, has been
studied extensively in recent years, considerably less work has been done to support
RNN queries over a region that may not be indexed by a point access method. Likewise,
little research addresses obstacle handling for range queries. This paper proposes a
novel approach to handle queries such as "finding the apartment, among a set of
buildings, that is closest to a park."
2 Range Reverse Nearest Neighbor Queries
2.1 Preliminaries
Suppose that we are interested in a particular region and want to find its reverse
nearest data points. To expand the query point used in RNN to an area, we can
employ a Range Nearest Neighbor Query.
Now suppose we want to buy an apartment, with the requirement that it be near a
particular monument, a recreational area, or even a road. We can treat this region
of interest as a query region, so the query must be a range query rather than a
point query. The concern is that the apartment must be the nearest one to the
specified area; conversely, the area must have that apartment as a reverse nearest
neighbor. In short, we have a two-fold query that combines a range query (the query
is a region rather than a point) with a reverse nearest neighbor query.
Andrzej M.J. Skulimowski (ed.): Proceedings of KICSS'2013, pp. 509-518 © Progress & Business Publishers, Kraków 2013
2.2 Problem Statement
The above discussion leads to the following problem statement:
For a query region R and a set of data points D, the Range Reverse Nearest
Neighbor (R′RNN) query returns the nearest data point d ∈ D that has a point of
the region as its nearest neighbor (NN).
This is illustrated in the diagram below:
Fig. 1. R′RNN Query Space
We are interested in finding the Range Reverse Nearest Neighbor of the query
region R shown, given the data points d1 to d8. Depending on their locations, the
data points are processed as follows.
First, we eliminate every data point that has some other data point as its nearest
neighbor. For example, d2 has d8 as its nearest neighbor and vice versa, so both can
be eliminated; d7 can also be eliminated, since d5 is its nearest neighbor.
This leaves the points d1, d3, d4, d5 and d6 to process. As can be seen, d5 is the
nearest neighbor to the region, so d5 is returned as the Range Reverse Nearest
Neighbor of the query region R.
The above concept can be extended to k nearest neighbors for any value of k; for
example, for k = 2 we may additionally obtain d4 or d6 in the above case.
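The semantics just walked through can be sketched in brute-force form. The following Python sketch is ours, not the paper's algorithm: it assumes Euclidean distance and an axis-aligned query rectangle given by its lower and upper corners, and uses exhaustive scans where Section 2.5 uses an R-tree.

```python
import math

def mindist_point_rect(p, lo, hi):
    """Distance from point p to the closest point of the axis-aligned
    rectangle [lo, hi] (0 if p lies inside the rectangle)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def range_rnn(points, lo, hi):
    """Brute-force R'RNN: keep each point whose nearest neighbour is the
    query region (i.e. the region is closer than any other data point),
    then return the surviving point nearest to the region."""
    candidates = []
    for i, p in enumerate(points):
        d_region = mindist_point_rect(p, lo, hi)
        d_nearest = min(math.dist(p, q)
                        for j, q in enumerate(points) if j != i)
        if d_region < d_nearest:        # region is p's nearest neighbour
            candidates.append((d_region, p))
    return min(candidates)[1] if candidates else None
```

The `d_region < d_nearest` test performs the elimination step described above: points like d2, d7 and d8 in Fig. 1 fail it because another data point lies closer to them than the region does.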
2.3 Metrics used for Nearest Neighbor Queries
The nearest neighbor query relies on a metric; a relative comparison of candidate
points under this metric determines the query results. The metric can be as simple
as Euclidean distance or as complex as the difference between two patterns. Here we
discuss two metrics commonly used in nearest neighbor analysis: MINDIST and
MINMAXDIST.
MINDIST [6]: The distance of a point P in Euclidean space E(n) of dimension n
from a rectangle R = (S, T) in the same space, denoted MINDIST(P, R), is

MINDIST(P, R) = Σ_{i=1..n} |p_i − r_i|², where r_i = s_i if p_i < s_i, r_i = t_i if p_i > t_i, and r_i = p_i otherwise.

The square of the distance is used since it requires fewer and cheaper computations.
The distance from the point P to any object in the MBR R is always greater than or
equal to MINDIST(P, R); thus MINDIST is a lower bound on the distance of any object
in the MBR. This lower bound can be used to prune an MBR whose lower bound exceeds
the distance to the current nearest neighbor, without computing the distance to
every object in the MBR.
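The definition translates directly into code; a minimal sketch (squared distance, as in [6]):

```python
def mindist(p, s, t):
    """Squared MINDIST(P, R) for point p and MBR R = (s, t), following
    the definition above: per dimension, measure the gap to the nearer
    face, or zero when p's coordinate falls inside the extent."""
    total = 0.0
    for pi, si, ti in zip(p, s, t):
        if pi < si:
            ri = si          # p lies below the lower face
        elif pi > ti:
            ri = ti          # p lies above the upper face
        else:
            ri = pi          # p is within the extent: no contribution
        total += (pi - ri) ** 2
    return total
```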
Fig. 2. MINDIST
MINMAXDIST [6]: For a point P in Euclidean space E(n) of dimension n and an
MBR R = (S, T) of the same dimensionality, MINMAXDIST(P, R) is defined as:

MINMAXDIST(P, R) = min_{1≤k≤n} ( |p_k − rm_k|² + Σ_{i≠k} |p_i − rM_i|² ),

where rm_k = s_k if p_k ≤ (s_k + t_k)/2 and rm_k = t_k otherwise, and rM_i = s_i
if p_i ≥ (s_i + t_i)/2 and rM_i = t_i otherwise.

For each dimension, this construction takes the nearer face of the MBR in that
dimension and the farther face in every other dimension, and minimizes over the
dimensions. Since every face of an MBR touches at least one object, there is at
least one object within the MBR whose distance from P is less than or equal to
MINMAXDIST(P, R). MINMAXDIST is therefore used as an upper bound for pruning.
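A sketch of the same definition in code (squared, to match MINDIST):

```python
def minmaxdist(p, s, t):
    """Squared MINMAXDIST(P, R) for point p and MBR R = (s, t): for each
    dimension k take the nearer face in k and the farther face in every
    other dimension, then minimize over k."""
    # squared distance to the farther face, per dimension (rM_i)
    far = [(pi - (si if pi >= (si + ti) / 2 else ti)) ** 2
           for pi, si, ti in zip(p, s, t)]
    # squared distance to the nearer face, per dimension (rm_k)
    near = [(pi - (si if pi <= (si + ti) / 2 else ti)) ** 2
            for pi, si, ti in zip(p, s, t)]
    total_far = sum(far)
    return min(near[k] + (total_far - far[k]) for k in range(len(p)))
```

For any point and MBR, MINDIST(P, R) ≤ MINMAXDIST(P, R), which is what makes the pair usable as lower and upper pruning bounds.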
Fig. 3. MINMAXDIST in 3D Space
These definitions, however, apply only to a point query. We therefore extend the
concept of MINDIST and MINMAXDIST to a range query: the original definitions
assume a query point, but we now have a query region defined by its two endpoints
S′ and T′, as shown below.
Fig. 4. Revised Query Regions for R′RNN
We use the notation q_k^i, which denotes either S′ or T′ in the k-th dimension, and
define the two metrics over the region accordingly.
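As an illustration, here is one natural way to extend MINDIST to a query rectangle: in each dimension, the query interval [S′_k, T′_k] contributes the gap between the two intervals rather than a point-to-interval gap. This is our sketch of that construction, not necessarily the exact formulation used in the paper.

```python
def range_mindist(qs, qt, s, t):
    """Squared MINDIST between a query rectangle Q = (qs, qt) and an
    MBR R = (s, t): per dimension, the separation of the two intervals
    (zero where they overlap). One plausible extension of the point
    formula, with q_k replaced by the nearer query endpoint."""
    total = 0.0
    for a, b, c, d in zip(qs, qt, s, t):
        gap = max(c - b, 0.0, a - d)   # gap between [a, b] and [c, d]
        total += gap ** 2
    return total
```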
2.4 Data Structures
Efficient processing of NN queries requires spatial data structures which capitalize on
the proximity of the objects to focus the search on potential neighbors only [6]. The
most widely used structure is the R*-tree.
R-trees were proposed as a natural extension of B-trees to more than one dimension.
Each non-leaf R-tree node contains an array of (RECT, pointer) entries, while a
leaf node holds the object itself in place of the pointer. Here RECT denotes an
n-dimensional Minimum Bounding Rectangle (MBR). A given region is divided into
MBRs (sub-regions) according to the placement of the data points.
Fig. 5. R Tree Construction
As shown in Figure 5, the entire region is subdivided into three major MBRs, A, B
and C. Within these are the data points D, E, F, G and so on, approximated as
rectangles. Below it is the R-tree for the region: the root node contains the major
MBRs, and the leaf nodes eventually contain the actual data points.
R*-trees [1] differ from R-trees in the algorithms used for insertion and for
splitting or joining nodes. R*-trees have been found to perform as well as or
better than R-trees in most applications [1].
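The (RECT, pointer) layout described above can be sketched as follows; the class and function names are ours, and the R*-tree insertion and split algorithms [1] are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# A rectangle is a (lower corner, upper corner) pair of coordinate tuples.
Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]

def intersects(r1: Rect, r2: Rect) -> bool:
    """True when two axis-aligned rectangles overlap in every dimension."""
    return all(l1 <= h2 and l2 <= h1
               for l1, h1, l2, h2 in zip(r1[0], r1[1], r2[0], r2[1]))

@dataclass
class Node:
    is_leaf: bool
    # (RECT, pointer): the pointer is a child Node in non-leaf nodes
    # and the object itself in leaf nodes.
    entries: List[Tuple[Rect, Union["Node", object]]]

def search(node: Node, q: Rect) -> list:
    """Report every object whose MBR intersects query rectangle q."""
    hits = []
    for rect, child in node.entries:
        if intersects(rect, q):
            hits.extend([child] if node.is_leaf else search(child, q))
    return hits
```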
2.5 Algorithm for Range Reverse Nearest Neighbor
Algorithms for efficient RNN processing have already been discussed [9]. The TPL
algorithm is a two-phase algorithm, Filter & Refinement, as follows:
First, a pruning strategy, the half-plane method, is applied. This pruning
eliminates candidates that cannot belong to the reverse nearest neighbor candidate
set; they are placed in a refinement set that is later used to remove false
positives. However, existing work applies only to a query point, so we must adapt
the half-plane concept to a data point and a query region.
We have worked out two cases for the above scenario in two dimensions and im-
plemented them in MATLAB as shown below:
Case 1: Here the data point is located along one face of the query region, as shown
in Figure 7; it lies within the x extent of the range. The half-plane in this case
consists of a parabola along the face and two perpendicular bisectors beyond the
face.
Fig. 6. Half Plane Case 1
Fig. 7.
A half-plane is the locus of all points equidistant from the data point and from
the range, the distance to the range being MINDIST. In region I, therefore, we need
the locus of points equidistant from the point p and the line ab. With ab as the
directrix and p as the focus, this locus is a parabola, as shown in figure 6.
In region II, the shortest distance to the range is always the distance to point a,
so we need the locus of points equidistant from p and a: the perpendicular bisector
of the two points [9]. Similarly, the half-plane in region III is the perpendicular
bisector of p and b.
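The three-region test can be sketched as a single predicate. The sketch below assumes the Case 1 layout, with the face ab a horizontal segment and the data point p within its x extent; the function name is ours.

```python
import math

def closer_to_range(z, p, a, b):
    """Case 1 half-plane test: is point z nearer to the face ab (a
    horizontal segment from a to b) than to the data point p?  In
    region I the boundary is the parabola with directrix ab and focus
    p; in regions II and III it degenerates to the perpendicular
    bisectors of (p, a) and (p, b)."""
    (ax, ay), (bx, by) = a, b            # face endpoints, with ay == by
    zx, zy = z
    if zx < ax:                          # region II: nearest range point is a
        d_range = math.dist(z, a)
    elif zx > bx:                        # region III: nearest range point is b
        d_range = math.dist(z, b)
    else:                                # region I: perpendicular drop onto ab
        d_range = abs(zy - ay)
    return d_range < math.dist(z, p)
```

A candidate for which `closer_to_range` is false is closer to p than to the range, so the range cannot be its nearest neighbor and it can be moved into the refinement set.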
Fig. 8. Half Plane Case 2
Fig. 9.
Case 2: Here the data point does not lie along any face of the query region, as
shown in figure 9; it is not within any dimensional extent of the range. The
half-plane in this case consists of three perpendicular bisectors and two
parabolas, as shown in figure 8.
In region III, the shortest distance of any point to the range is always the
distance to point b, so the locus of points equidistant from p and b is a
perpendicular bisector. Similarly, in regions II and V the shortest distances to
the range are to points a and c respectively.
In regions I and IV, we can consider lines ab and bc to be directrices with point p
as the focus.
Given the directrix ax + by + c = 0 and the focus (u, v), the general equation of
the parabola follows from equating the squared distance to the focus with the
squared distance to the directrix:

(ax + by + c)² / (a² + b²) = (x − u)² + (y − v)²

which expands to

b²x² + a²y² − 2abxy − x(2u(a² + b²) + 2ac) − y(2v(a² + b²) + 2bc) = c² − u²(a² + b²) − v²(a² + b²)
Since the MBRs always have sides parallel to the axes, the parabolas are symmetric
about one of the axes; the equation of the directrix therefore has either a = 0 or
b = 0, so one of the squared terms and the xy term vanish.
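The expansion can be checked numerically. The helper below (ours) evaluates the left side minus the right side of the expanded equation; it is zero exactly on the parabola. With directrix x = −1 (a = 1, b = 0, c = 1) and focus (1, 0), for instance, the parabola is y² = 4x.

```python
def parabola_residual(x, y, a, b, c, u, v):
    """LHS minus RHS of the expanded parabola equation for directrix
    ax + by + c = 0 and focus (u, v); zero when (x, y) lies on the
    parabola, i.e. is equidistant from focus and directrix."""
    lhs = (b * x) ** 2 + (a * y) ** 2 - 2 * a * b * x * y \
        - x * (2 * u * (a**2 + b**2) + 2 * a * c) \
        - y * (2 * v * (a**2 + b**2) + 2 * b * c)
    rhs = c**2 - (u**2 + v**2) * (a**2 + b**2)
    return lhs - rhs
```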
The TPL algorithm accesses nodes/points in ascending order of their distance
(MINDIST) from the query to retrieve a set of potential candidates, maintained in
a candidate set Sc. All points that cannot be RNNs of the query are pruned by the
above pruning strategy and inserted (without being visited) into a refinement point
set Sp; eliminated nodes are inserted into a refinement node set Sn. In the second
step, the entries in Sp and Sn are used to eliminate false hits.
The algorithm for R′RNN follows a similar approach to TPL, but with the modified
MINDIST, MINMAXDIST and half-planes discussed earlier. The R′RNN algorithm takes
the data R-tree Tp and a query region Q as inputs, and outputs exactly all the
R′RNNs of Q.
Algorithm R′RNN (Tp, Q)
1: Initialize sets Sc = ∅, Sp = ∅, Sn = ∅, Sr = ∅
2: Filter (Tp, Q, Sc, Sp, Sn)
3: Refinement (Q, Sc, Sp, Sn, Sr)
4: Return Sr
3 Conclusion
In this paper, we have introduced a novel variant of reverse nearest neighbor
queries that we term Range Reverse Nearest Neighbor (R′RNN) queries, together with
a geometric derivation of the modifications the algorithm requires. Such queries
are useful in practical scenarios and can be extended toward pattern recognition
techniques through higher-dimensional queries. The next step is to test the
algorithm on standard data sets and compare it against issuing multiple point
queries. Researchers can delve further into the techniques developed in this paper
and advance the field of data mining.
Acknowledgement
We would like to thank Prof. Gajanan Gawde, Prof. Manisha Naik Gaonkar and
Prof. Sebastian Mesquita for their invaluable guidance throughout our research. We
are also grateful to the Head of Department and Faculty of Computer Engineering
Department at Goa College of Engineering, Farmagudi, Goa-India.
References
1. Beckmann, N., Kriegel, H., Schneider, R., Seeger, B.: The R*-tree: An Efficient and Robust
Access Method for Points and Rectangles. ACM SIGMOD Record 19.2 (1990) 322-331
2. Gao, Y., Zheng, B., Chen, G., Lee, W., Lee, K. C. K., Li, Q.: Visible Reverse k-Nearest
Neighbor Queries. ICDE'09, IEEE 25th International Conference on Data Engineering
(2009) 1203-1206
3. Goldstein, J., Ramakrishnan, R., Shaft, U., Yu, J.: Processing Queries by Linear
Constraints. Proceedings of the ACM PODS Symposium (1997)
4. Hu, H., Lee, D. L.: Range Nearest Neighbor Query. IEEE Transactions on Knowledge and
Data Engineering 18.1 (2006) 78-91
5. Korn, F., Muthukrishnan, S.: Influence Sets Based on Reverse Nearest Neighbor
Queries. ACM SIGMOD Record 29.2 (2000) 201-212
6. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest Neighbor Queries. ACM SIGMOD
Record 24.2 (1995) 71-79
7. Singh, A., Ferhatosmanoglu, H., Tosun, A.: High Dimensional Reverse Nearest Neighbor
Queries. Proceedings of the Twelfth International Conference on Information and
Knowledge Management, ACM (2003) 91-98
8. Stanoi, I., Agrawal, D., El Abbadi, A.: Reverse Nearest Neighbor Queries for Dynamic
Databases. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery (2000) 44-53
9. Tao, Y., Papadias, D., Lian, X.: Reverse kNN Search in Arbitrary Dimensionality.
Proceedings of the Thirtieth International Conference on Very Large Data Bases,
VLDB Endowment 30 (2004) 744-755