Range Reverse Nearest Neighbor Queries
Reuben Pereira, Abhinav Agshikar, Gaurav Agarwal, Pranav Keni
{reubpereira, abhi2chat, grvmast, pranvprk}@gmail.com
Abstract: Reverse nearest neighbor (RNN) queries have a broad application
base in decision support, profile-based marketing, resource allocation, data
mining, and related areas. Previous work on RNN, visible nearest neighbor and
visible RNN queries has considered only point queries, which are rare in
real-world settings. In this paper we introduce a novel variant of RNN
queries, Range Reverse Nearest Neighbor queries, which operate over a
region rather than a point.
Keywords: Data Mining, Query processing, Spatial Databases, Nearest Neigh-
bor Queries.
1 Introduction
Data Mining is a relatively young and interdisciplinary field of computer science
that discovers new patterns from large data sets involving methods at the intersection
of artificial intelligence, machine learning, statistics and database systems.
While the Reverse Nearest Neighbor (RNN) search problem, i.e. finding all objects
in a database that have a given query q among their k nearest neighbors, has been
studied extensively in recent years, considerably less work has been done to support
RNN queries over a region that may not be indexed by a point access method. Likewise,
little research addresses obstacle handling for range queries. This paper proposes a
novel approach to handle queries such as "finding the apartment, among a set of
buildings, that is closest to a park."
2 Range Reverse Nearest Neighbor Queries
2.1 Preliminaries
Suppose that we are interested in a particular region and want to find its reverse
nearest data points. To expand the query point used in RNN to an area, we can
employ a Range Nearest Neighbor Query.
Now suppose we want to buy an apartment, with the requirement that it be near a
particular monument, a recreational area, or even a road. We can treat this region
of interest as a query region, so the query must be a range query rather than a
point query. The concern is that the apartment must be the nearest one to the
specified area; conversely, the area must have that apartment as a reverse nearest
neighbor. In short, we have a two-fold query that combines a range query (the query
is a region rather than a point) with a reverse nearest neighbor query.
Andrzej M.J. Skulimowski (ed.): Proceedings of KICSS'2013, pp. 509-518 © Progress & Business Publishers, Kraków 2013
2.2 Problem Statement
The above discussion leads to the following problem statement:
For a query region R and a set of data points D, the Range Reverse Nearest
Neighbor (R′RNN) query returns the nearest data point d ∈ D that has a point of
the region as its nearest neighbor (NN).
This is illustrated in the diagram below:
Fig. 1. R′RNN Query Space
We are interested in finding the Range Reverse Nearest Neighbor of the query
region R shown, given the data points d1 to d8. Depending on their locations, the
data points are processed as follows.
First, we eliminate every data point that has some other data point as its nearest
neighbor. For example, d2 has d8 as its nearest neighbor and vice versa, so both can
be eliminated; d7 can also be eliminated, since d5 is its nearest neighbor.
This leaves the points d1, d3, d4, d5 and d6 to process. As can be seen, d5 is the
nearest neighbor to the region, so d5 is returned as the Range Reverse Nearest
Neighbor of the query region R.
The above concept can be extended to k nearest neighbors for any value of k; for
example, for k = 2 we may additionally obtain d4 or d6 in the above case.
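The semantics just walked through can be sketched in brute-force form. The following Python sketch is ours, not the paper's algorithm: it assumes Euclidean distance and an axis-aligned query rectangle given by its lower and upper corners, and uses exhaustive scans where Section 2.5 uses an R-tree.

```python
import math

def mindist_point_rect(p, lo, hi):
    """Distance from point p to the closest point of the axis-aligned
    rectangle [lo, hi] (0 if p lies inside the rectangle)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def range_rnn(points, lo, hi):
    """Brute-force R'RNN: keep each point whose nearest neighbour is the
    query region (i.e. the region is closer than any other data point),
    then return the surviving point nearest to the region."""
    candidates = []
    for i, p in enumerate(points):
        d_region = mindist_point_rect(p, lo, hi)
        d_nearest = min(math.dist(p, q)
                        for j, q in enumerate(points) if j != i)
        if d_region < d_nearest:        # region is p's nearest neighbour
            candidates.append((d_region, p))
    return min(candidates)[1] if candidates else None
```

The `d_region < d_nearest` test performs the elimination step described above: points like d2, d7 and d8 in Fig. 1 fail it because another data point lies closer to them than the region does.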
2.3 Metrics used for Nearest Neighbor Queries
The nearest neighbor query relies on a metric; a relative comparison of candidate
points under this metric determines the query results. The metric can be as simple
as Euclidean distance or as complex as the difference between two patterns. Here we
discuss two metrics commonly used in nearest neighbor analysis: MINDIST and
MINMAXDIST.
MINDIST [6]: The distance of a point P in Euclidean space E(n) of dimension n
from a rectangle R = (S, T) in the same space, denoted MINDIST(P, R), is

MINDIST(P, R) = Σ_{i=1..n} |p_i − r_i|², where r_i = s_i if p_i < s_i, r_i = t_i if p_i > t_i, and r_i = p_i otherwise.

The square of the distance is used since it requires fewer and cheaper computations.
The distance from the point P to any object in the MBR R is always greater than or
equal to MINDIST(P, R); thus MINDIST is a lower bound on the distance of any object
in the MBR. This lower bound can be used to prune an MBR whose lower bound exceeds
the distance to the current nearest neighbor, without computing the distance to
every object in the MBR.
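The definition translates directly into code; a minimal sketch (squared distance, as in [6]):

```python
def mindist(p, s, t):
    """Squared MINDIST(P, R) for point p and MBR R = (s, t), following
    the definition above: per dimension, measure the gap to the nearer
    face, or zero when p's coordinate falls inside the extent."""
    total = 0.0
    for pi, si, ti in zip(p, s, t):
        if pi < si:
            ri = si          # p lies below the lower face
        elif pi > ti:
            ri = ti          # p lies above the upper face
        else:
            ri = pi          # p is within the extent: no contribution
        total += (pi - ri) ** 2
    return total
```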
Fig. 2. MINDIST
MINMAXDIST [6]: For a point P in Euclidean space E(n) of dimension n and an
MBR R = (S, T) of the same dimensionality, MINMAXDIST(P, R) is defined as:

MINMAXDIST(P, R) = min_{1≤k≤n} ( |p_k − rm_k|² + Σ_{i≠k} |p_i − rM_i|² ),

where rm_k = s_k if p_k ≤ (s_k + t_k)/2 and rm_k = t_k otherwise, and rM_i = s_i
if p_i ≥ (s_i + t_i)/2 and rM_i = t_i otherwise.

For each dimension, this construction takes the nearer face of the MBR in that
dimension and the farther face in every other dimension, and minimizes over the
dimensions. Since every face of an MBR touches at least one object, there is at
least one object within the MBR whose distance from P is less than or equal to
MINMAXDIST(P, R). MINMAXDIST is therefore used as an upper bound for pruning.
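A sketch of the same definition in code (squared, to match MINDIST):

```python
def minmaxdist(p, s, t):
    """Squared MINMAXDIST(P, R) for point p and MBR R = (s, t): for each
    dimension k take the nearer face in k and the farther face in every
    other dimension, then minimize over k."""
    # squared distance to the farther face, per dimension (rM_i)
    far = [(pi - (si if pi >= (si + ti) / 2 else ti)) ** 2
           for pi, si, ti in zip(p, s, t)]
    # squared distance to the nearer face, per dimension (rm_k)
    near = [(pi - (si if pi <= (si + ti) / 2 else ti)) ** 2
            for pi, si, ti in zip(p, s, t)]
    total_far = sum(far)
    return min(near[k] + (total_far - far[k]) for k in range(len(p)))
```

For any point and MBR, MINDIST(P, R) ≤ MINMAXDIST(P, R), which is what makes the pair usable as lower and upper pruning bounds.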
Fig. 3. MINMAXDIST in 3D Space
These definitions, however, apply only to a point query. We therefore extend the
concept of MINDIST and MINMAXDIST to a range query: the original definitions
assume a query point, but we now have a query region defined by its two endpoints
S′ and T′, as shown below.
Fig. 4. Revised Query Regions for R′RNN
We use the notation q_k^i, which denotes either S′ or T′ in the k-th dimension, and
define the two metrics over the region accordingly.
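As an illustration, here is one natural way to extend MINDIST to a query rectangle: in each dimension, the query interval [S′_k, T′_k] contributes the gap between the two intervals rather than a point-to-interval gap. This is our sketch of that construction, not necessarily the exact formulation used in the paper.

```python
def range_mindist(qs, qt, s, t):
    """Squared MINDIST between a query rectangle Q = (qs, qt) and an
    MBR R = (s, t): per dimension, the separation of the two intervals
    (zero where they overlap). One plausible extension of the point
    formula, with q_k replaced by the nearer query endpoint."""
    total = 0.0
    for a, b, c, d in zip(qs, qt, s, t):
        gap = max(c - b, 0.0, a - d)   # gap between [a, b] and [c, d]
        total += gap ** 2
    return total
```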
2.4 Data Structures
Efficient processing of NN queries requires spatial data structures which capitalize on
the proximity of the objects to focus the search on potential neighbors only [6]. The
most widely used structure is the R*-tree.
R-trees were proposed as a natural extension of B-trees to more than one dimension.
Each non-leaf R-tree node contains an array of (RECT, pointer) entries, while a
leaf node holds the object itself in place of the pointer. Here RECT denotes an
n-dimensional Minimum Bounding Rectangle (MBR). A given region is divided into
MBRs (sub-regions) according to the placement of the data points.
Fig. 5. R Tree Construction
As shown in Figure 5, the entire region is subdivided into three major MBRs, A, B
and C. Within these are the data points D, E, F, G and so on, approximated as
rectangles. Below it is the R-tree for the region: the root node contains the major
MBRs, and the leaf nodes eventually contain the actual data points.
R*-trees [1] differ from R-trees in the algorithms used for insertion and for
splitting or joining nodes. R*-trees have been found to perform as well as or
better than R-trees in most applications [1].
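The (RECT, pointer) layout described above can be sketched as follows; the class and function names are ours, and the R*-tree insertion and split algorithms [1] are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# A rectangle is a (lower corner, upper corner) pair of coordinate tuples.
Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]

def intersects(r1: Rect, r2: Rect) -> bool:
    """True when two axis-aligned rectangles overlap in every dimension."""
    return all(l1 <= h2 and l2 <= h1
               for l1, h1, l2, h2 in zip(r1[0], r1[1], r2[0], r2[1]))

@dataclass
class Node:
    is_leaf: bool
    # (RECT, pointer): the pointer is a child Node in non-leaf nodes
    # and the object itself in leaf nodes.
    entries: List[Tuple[Rect, Union["Node", object]]]

def search(node: Node, q: Rect) -> list:
    """Report every object whose MBR intersects query rectangle q."""
    hits = []
    for rect, child in node.entries:
        if intersects(rect, q):
            hits.extend([child] if node.is_leaf else search(child, q))
    return hits
```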
2.5 Algorithm for Range Reverse Nearest Neighbor
Algorithms for efficient RNN processing have already been discussed [9]. The TPL
algorithm is a two-phase algorithm, Filter & Refinement, as follows:
First, a pruning strategy, the half-plane method, is applied. This pruning
eliminates candidates that cannot belong to the reverse nearest neighbor candidate
set; they are placed in a refinement set that is later used to remove false
positives. However, existing work applies only to a query point, so we must adapt
the half-plane concept to a data point and a query region.
We have worked out two cases for the above scenario in two dimensions and im-
plemented them in MATLAB as shown below:
Case 1: Here the data point is located along one face of the query region, as shown
in Figure 7; it lies within the x extent of the range. The half-plane in this case
consists of a parabola along the face and two perpendicular bisectors beyond the
face.
Fig. 6. Half Plane Case 1
Fig. 7.
A half-plane is the locus of all points equidistant from the data point and from
the range, the distance to the range being MINDIST. In region I, therefore, we need
the locus of points equidistant from the point p and the line ab. With ab as the
directrix and p as the focus, this locus is a parabola, as shown in figure 6.
In region II, the shortest distance to the range is always the distance to point a,
so we need the locus of points equidistant from p and a: the perpendicular bisector
of the two points [9]. Similarly, the half-plane in region III is the perpendicular
bisector of p and b.
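The three-region test can be sketched as a single predicate. The sketch below assumes the Case 1 layout, with the face ab a horizontal segment and the data point p within its x extent; the function name is ours.

```python
import math

def closer_to_range(z, p, a, b):
    """Case 1 half-plane test: is point z nearer to the face ab (a
    horizontal segment from a to b) than to the data point p?  In
    region I the boundary is the parabola with directrix ab and focus
    p; in regions II and III it degenerates to the perpendicular
    bisectors of (p, a) and (p, b)."""
    (ax, ay), (bx, by) = a, b            # face endpoints, with ay == by
    zx, zy = z
    if zx < ax:                          # region II: nearest range point is a
        d_range = math.dist(z, a)
    elif zx > bx:                        # region III: nearest range point is b
        d_range = math.dist(z, b)
    else:                                # region I: perpendicular drop onto ab
        d_range = abs(zy - ay)
    return d_range < math.dist(z, p)
```

A candidate for which `closer_to_range` is false is closer to p than to the range, so the range cannot be its nearest neighbor and it can be moved into the refinement set.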
Fig. 8. Half Plane Case 2
Fig. 9.
Case 2: Here the data point does not lie along any face of the query region, as
shown in figure 9; it is not within any dimensional extent of the range. The
half-plane in this case consists of three perpendicular bisectors and two
parabolas, as shown in figure 8.
In region III, the shortest distance of any point to the range is always the
distance to point b, so the locus of points equidistant from p and b is a
perpendicular bisector. Similarly, in regions II and V the shortest distances to
the range are to points a and c respectively.
In regions I and IV, we can consider lines ab and bc to be directrices with point p
as the focus.
Given the directrix ax + by + c = 0 and the focus (u, v), the general equation of
the parabola follows from equating the squared distance to the focus with the
squared distance to the directrix:

(ax + by + c)² / (a² + b²) = (x − u)² + (y − v)²

which expands to

b²x² + a²y² − 2abxy − x(2u(a² + b²) + 2ac) − y(2v(a² + b²) + 2bc) = c² − u²(a² + b²) − v²(a² + b²)
Since the MBRs always have sides parallel to the axes, the parabolas are symmetric
about one of the axes; the equation of the directrix therefore has either a = 0 or
b = 0, so one of the squared terms and the xy term vanish.
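The expansion can be checked numerically. The helper below (ours) evaluates the left side minus the right side of the expanded equation; it is zero exactly on the parabola. With directrix x = −1 (a = 1, b = 0, c = 1) and focus (1, 0), for instance, the parabola is y² = 4x.

```python
def parabola_residual(x, y, a, b, c, u, v):
    """LHS minus RHS of the expanded parabola equation for directrix
    ax + by + c = 0 and focus (u, v); zero when (x, y) lies on the
    parabola, i.e. is equidistant from focus and directrix."""
    lhs = (b * x) ** 2 + (a * y) ** 2 - 2 * a * b * x * y \
        - x * (2 * u * (a**2 + b**2) + 2 * a * c) \
        - y * (2 * v * (a**2 + b**2) + 2 * b * c)
    rhs = c**2 - (u**2 + v**2) * (a**2 + b**2)
    return lhs - rhs
```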
The TPL algorithm accesses nodes/points in ascending order of their distance
(MINDIST) from the query to retrieve a set of potential candidates, maintained in
a candidate set Sc. All points that cannot be RNNs of the query are pruned by the
above pruning strategy and inserted (without being visited) into a refinement point
set Sp; eliminated nodes are inserted into a refinement node set Sn. In the second
step, the entries in Sp and Sn are used to eliminate false hits.
The algorithm for R′RNN follows a similar approach to TPL, but with the modified
MINDIST, MINMAXDIST and half-planes discussed earlier. The R′RNN algorithm takes
the data R-tree Tp and a query region Q as inputs, and outputs exactly all the
R′RNNs of Q.
Algorithm R′RNN (Tp, Q)
1: Initialize sets Sc = ∅, Sp = ∅, Sn = ∅, Sr = ∅
2: Filter (Tp, Q, Sc, Sp, Sn)
3: Refinement (Q, Sc, Sp, Sn, Sr)
4: Return Sr
3 Conclusion
In this paper, we have introduced a novel variant of reverse nearest neighbor
queries that we term Range Reverse Nearest Neighbor (R′RNN) queries, together with
a geometric derivation of the modifications the algorithm requires. Such queries
are useful in practical scenarios and can be extended toward pattern recognition
techniques through higher-dimensional queries. The next step is to test the
algorithm on standard data sets and compare it against issuing multiple point
queries. Researchers can delve further into the techniques developed in this paper
and advance the field of data mining.
Acknowledgement
We would like to thank Prof. Gajanan Gawde, Prof. Manisha Naik Gaonkar and
Prof. Sebastian Mesquita for their invaluable guidance throughout our research. We
are also grateful to the Head of Department and Faculty of Computer Engineering
Department at Goa College of Engineering, Farmagudi, Goa-India.
References
1. Beckmann, N., Kriegel, H., Schneider, R., Seeger, B.: The R*-tree: An Efficient and Robust
Access Method for Points and Rectangles. ACM SIGMOD Record 19.2 (1990) 322-331
2. Gao, Y., Zheng, B., Chen, G., Lee, W., Lee, K. C. K., Li, Q.: Visible Reverse k-Nearest
Neighbor Queries. ICDE'09, IEEE 25th International Conference on Data Engineering
(2009) 1203-1206
3. Goldstein, J., Ramakrishnan, R., Shaft, U., Yu, J.: Processing Queries by Linear
Constraints. Proceedings of the ACM PODS Symposium (1997)
4. Hu, H., Lee, D. L.: Range Nearest Neighbor Query. IEEE Transactions on Knowledge and
Data Engineering 18.1 (2006) 78-91
5. Korn, F., Muthukrishnan, S.: Influence Sets Based on Reverse Nearest Neighbor
Queries. ACM SIGMOD Record 29.2 (2000) 201-212
6. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest Neighbor Queries. ACM SIGMOD
Record 24.2 (1995) 71-79
7. Singh, A., Ferhatosmanoglu, H., Tosun, A.: High Dimensional Reverse Nearest Neighbor
Queries. Proceedings of the Twelfth International Conference on Information and
Knowledge Management, ACM (2003) 91-98
8. Stanoi, I., Agrawal, D., El Abbadi, A.: Reverse Nearest Neighbor Queries for Dynamic
Databases. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery (2000) 44-53
9. Tao, Y., Papadias, D., Lian, X.: Reverse kNN Search in Arbitrary Dimensionality.
Proceedings of the Thirtieth International Conference on Very Large Data Bases,
VLDB Endowment 30 (2004) 744-755