EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING
by
JAIM AHMED
(Under the Direction of Maria Hybinette)
ABSTRACT
We introduce a new algorithm for K-nearest neighbor queries that uses clustering and
caching to improve performance. The main idea is to reduce the distance computation cost
between the query point and the data points in the data set. We use a divide-and-conquer
approach. First, we divide the training data into clusters based on similarity between the data
points in terms of Euclidean distance. Next we use linearization for faster lookup. The data
points in a cluster can be sorted based on their similarity (measured by Euclidean distance) to the
center of the cluster. Fast search data structures such as the B-tree can be utilized to store data
points keyed by their distance from the cluster center and to perform fast search; the B-tree also
supports range search well. We achieve a further performance boost by using B-
tree based data caching. In this work we provide details of the algorithm, an implementation,
and experimental results in a robot navigation task.
INDEX WORDS: K-Nearest Neighbors, Execution, Caching.
EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING
by
JAIM AHMED
B.S., Southern Polytechnic State University, 1997
A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment
of the Requirements for the Degree
MASTER OF SCIENCE
ATHENS, GEORGIA
2009
© 2009
Jaim Ahmed
All Rights Reserved
EFFICIENT K-NEAREST NEIGHBOR QUERIES USING CLUSTERING WITH CACHING
by
JAIM AHMED
Major Professor: Maria Hybinette
Committee: Eileen T. Kraemer
Khaled Rasheed
Electronic Version Approved:
Maureen Grasso
Dean of the Graduate School
The University of Georgia
May, 2009
DEDICATION
First of all my dedication goes out to my wife Jennifer for her support and inspiration
especially when the going got tough. Also, my dedication goes to my parents for their
unconditional love and motivation. My final dedication goes to my sister and brother-in-law for
their genuine friendship and kindness.
ACKNOWLEDGEMENTS
First of all, I express my sincere gratitude to my Major Advisor Dr. Maria Hybinette for
her constant support and encouragement. Dr. Hybinette has been very kind with her time and
wisdom. She has been a shining example of hard work and dedication and will remain a source
of inspiration for me forever.
I would also like to thank my committee members Dr. Eileen Kraemer and Dr. Khaled
Rasheed for their time and consideration. Special thanks to Dr. Tucker Balch for his helpful
suggestions and consultations. Also, thanks to the Borg lab for access to example data and their
helpful suggestions.
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS.........................................................................................................v
LIST OF TABLES...................................................................................................................viii
LIST OF FIGURES ...................................................................................................................ix
CHAPTER
1 Introduction ...............................................................................................................1
1.1 Overview.........................................................................................................1
1.2 Problem Domain..............................................................................................3
1.3 What is K-nearest Neighbor Search?................................................................6
1.4 Contributions ...................................................................................................8
2 Related Work...........................................................................................................10
3 Background .............................................................................................................15
3.1 Data Clustering..............................................................................................15
3.2 Data Caching .................................................................................................19
3.3 Basic KNN Search.........................................................................................20
3.4 KD-tree Data Structure ..................................................................................22
4 System Architecture.................................................................................................24
4.1 Pre-processing ...............................................................................................25
4.2 ckSearch Runtime Queries.............................................................................31
5 Experiments & Results ............................................................................................45
5.1 Setup Information ..........................................................................................45
5.2 The effect of the size of the data set ...............................................................46
5.3 The effect of data dimension on the performance ...........................................51
5.4 The effect of search radius on the performance ..............................................54
5.5 The effect of search radius on accuracy..........................................................57
5.6 The effect of the number of clusters ...............................................................58
6 Conclusion...............................................................................................................62
REFERENCES .........................................................................................................................64
APPENDICES ..........................................................................................................................67
A Notation Table .........................................................................................................67
B Implementation Pseudocode ....................................................................................68
LIST OF TABLES
Page
Table 5.1: The effect of data size on performance (k=1) ............................................................47
Table 5.2: Effect of data size on performance (k=3) ..................................................................48
Table 5.3: Effect of data size on performance (k=10) ................................................................49
Table 5.4: ckSearch speedup over linear search .........................................................................50
Table 5.5: The effect of data dimension on performance (N=50K) ............................................52
Table 5.6: The effect of data dimension on performance (N=100K) ..........................................52
Table 5.7: ckSearch speedup over linear search for various dimensions.....................................53
Table 5.8: The effect of search radius on performance (k = 3) ...................................................55
Table 5.9: The effect of search radius on performance (k = 10) .................................................56
Table 5.10: The effect of the search radius on query accuracy ...................................................57
Table 5.11: The effect of the number of clusters on performance (k=1) .....................................59
Table 5.12: The effect of the number of clusters on performance (k=5) .....................................60
Table A.1: List of various notations used in this thesis ..............................................................67
LIST OF FIGURES
Page
Figure 1.1: Autonomous robot being trained to navigate through obstacles..................................4
Figure 1.2: Autonomous robot navigation sensors input ..............................................................5
Figure 1.3: Pictorial representations of KNN search ....................................................................6
Figure 3.1: Data clustering in 2-dimensional space....................................................................16
Figure 3.2: Stages in data clustering ..........................................................................................17
Figure 3.3: Typical application cache structure..........................................................................18
Figure 3.4: Basic KNN search process represented in 2-dimensional space ...............................20
Figure 3.5: Basic KNN Search Algorithm .................................................................................21
Figure 3.6: KD-tree data structure .............................................................................................22
Figure 4.1: Cluster data linearization .........................................................................................26
Figure 4.2: B-tree data structure ...............................................................................................29
Figure 4.3: Data cluster to B-tree correlation .............................................................................34
Figure 4.4: ckSearch algorithm data caching scheme.................................................................36
Figure 4.5: Cluster search rule 1 (Cluster exclusion rule)...........................................................39
Figure 4.6: Cluster search rule 2 (Cluster search region rule).....................................................40
Figure 4.7: Cluster search rule 3 (Cluster contains query sphere)...............................................42
Figure 4.8: Cluster search rule 4 (Cluster intersects query sphere) .............................................43
Figure 5.1: Performance vs. data set size chart (k = 1) ...............................................................47
Figure 5.2: Performance vs. data set size chart (k = 3) ...............................................................48
Figure 5.3: Performance vs. data set size chart (k = 10) .............................................................49
Figure 5.4: Chart showing ckSearch speedup over the linear search ..........................................50
Figure 5.5: Data dimension vs. performance chart (N = 50K) ....................................................52
Figure 5.6: Data dimension vs. performance chart (N = 100K) ..................................................53
Figure 5.7: Search radius vs. performance for 10000 data records .............................................55
Figure 5.8: Search radius vs. performance chart for 10,000 data records (k = 10) ......................56
Figure 5.9: Search radius vs. query accuracy chart ....................................................................58
Figure 5.10: The number of clusters vs. performance chart for 50000 data records (k = 1).........59
Figure 5.11: The number of clusters vs. performance chart (k = 5) ............................................60
Figure B.1: ckSearch KNN algorithm........................................................................................68
Figure B.2: SeachClusters(q) pseudocode..................................................................................69
Figure B.3: The SearchCache(q) algorithm pseudocode ............................................................70
Figure B.4: The SearchLeftNodes(leafNodei, keyleft) pseudocode..............................................71
Figure B.5: The SearchRightNodes(leafNodei, keyright) pseudocode ..........................................72
CHAPTER 1
INTRODUCTION
In this research, we introduce an efficient algorithm for K-nearest neighbor queries that uses
clustering, a pruning of the search space, and caching to improve performance. We call our
algorithm ckSearch. The main goal of this work is to improve performance of queries in a k-
nearest neighbor (KNN) system.
In this chapter we provide an overview of the KNN algorithm, and brief coverage of
the performance challenges facing KNN implementations. We describe our application and
experimental domain, and then provide details on our approach.
1.1 Overview
The K-nearest neighbor algorithm (KNN) is a well-known statistical search or
learning method used in a wide range of problem solving domains: e.g., robotics navigation
[32], data mining [33], and image processing [11]. In robotic navigation KNN is used to
select an appropriate action for a robot by evaluating the K most similar instances from the
‘nearest neighbor feature set’ in the training data. In forestry, KNN is used to map satellite
image data to inventory forest resources [34], and in wine evaluation KNN is used to classify
wines, where the feature space includes alcohol level, hue, and wine opacity [35]. More formally, KNN
finds the K closest (or most similar) points to a query point among N points in a d-
dimensional attribute (or feature) space. K is the number of neighbors that are considered
from a training data set and typically ranges from 1 to 20.
Advantages of the KNN algorithm include that it is fairly simple to implement and well
suited for multi-modal classes [36]. However, a major disadvantage of KNN
implementations is their high computational cost, especially when coupled with a large
amount of data. The high cost is partly due to computing Euclidean distances between the N
neighboring data points and the query point. Further, many KNN implementations degrade in
performance as the data becomes higher dimensional (i.e., they suffer from the “curse of
dimensionality”); typically, performance starts to degrade when the number of features
reaches 20 or more [10]. Another drawback of KNN concerns its significant memory
requirements, especially for Locality Sensitive Hashing (LSH) based KNN systems [6].
A key idea of our ckSearch algorithm is to improve performance by avoiding costly
distance computations for the KNN search. We use a divide-and-conquer approach. First, we
divide the training data into clusters based on similarity between the data points in terms of
Euclidean distance. Next we perform a linearization of data points in each cluster for faster
lookup. The data points in a cluster can be sorted based on their similarity (measured by
Euclidean distance) to the center of the cluster. Our data linearization process takes
advantage of this similarity and produces metric indexes for each data point in a cluster. Fast
search data structures such as the B-tree can be utilized to store data points based on their
metric indexes. Next we load the data points into a memory aware B-tree data structure. We
achieve a further performance boost using B-tree based data caching.
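The pre-processing steps just described can be sketched in a few lines of Python. This is a sketch only: the cluster centers are assumed to be given by a prior clustering pass, and a sorted list stands in for the B-tree keyed on distance to the center.

```python
import math
from bisect import insort

def euclidean(a, b):
    # Euclidean distance between two equal-dimension points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linearize_clusters(points, centers):
    """Assign each point to its nearest center, then keep each cluster
    sorted by distance to that center (the point's 'metric index').
    The sorted (distance, point) list stands in for the B-tree."""
    clusters = {i: [] for i in range(len(centers))}
    for p in points:
        i = min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
        insort(clusters[i], (euclidean(p, centers[i]), p))
    return clusters

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centers = [(0.0, 0.0), (5.0, 5.0)]  # assumed given by a prior clustering pass
clusters = linearize_clusters(points, centers)
# clusters[0] holds the two points near the origin, nearest-first
```

In the actual system the sorted distances become B-tree keys, so a range query around a query point's own distance to a cluster center retrieves candidate neighbors quickly.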
The ckSearch cache policy pre-fetches clusters closer (or more similar) to the query
point into the cache in anticipation of what may be needed next, and it avoids checking the
cache if the needed cluster has not been placed there. This policy avoids some cache
misses. At runtime, the ckSearch system first evaluates the cache upon receiving the query
point and then searches for the k closest points in the cache. The cache is organized
hierarchically in a B-tree structure and thereby reduces distance computations. In the case of a cache miss,
the ckSearch algorithm searches the main B-tree for the k nearest neighbors using our new
method.
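The caching behavior described above can be illustrated with a minimal sketch; the cluster IDs, prefetch call, and hit/miss counters are illustrative, not the thesis's actual data structures.

```python
class ClusterCache:
    """Sketch of the ckSearch cache policy: clusters near the query are
    prefetched, and a lookup skips the cache entirely when the wanted
    cluster is known to be absent, avoiding a guaranteed miss."""

    def __init__(self):
        self.store = {}              # cluster_id -> cached cluster data
        self.hits = 0
        self.misses = 0

    def prefetch(self, cluster_id, data):
        # load a nearby cluster in anticipation of the next query
        self.store[cluster_id] = data

    def lookup(self, cluster_id):
        if cluster_id not in self.store:
            self.misses += 1
            return None              # caller falls back to the main B-tree
        self.hits += 1
        return self.store[cluster_id]

cache = ClusterCache()
cache.prefetch(3, ["points of cluster 3"])
hit = cache.lookup(3)        # served from the cache
miss = cache.lookup(7)       # None: search the main B-tree instead
```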
1.2 Problem Domain
A focus of this research is to improve performance of the KNN approach and to
demonstrate its performance in a real-world problem. We assessed our approach using data
from an autonomous robot navigation experiment. The existing solution for this system uses
the KD-tree algorithm that partitions the training data set recursively (KD-trees are
specialized BSP trees). A KD-tree based algorithm provided direction and speed commands
for a robot based on learned perception examples. One of our objectives is to improve
performance of the existing KD-tree approach. In order to improve data processing speed, we
introduce our novel ckSearch algorithm that utilizes data clustering and data caching. In
addition, our ckSearch system utilizes several rules to further reduce or avoid costly distance
calculations. Even though our system has been assessed for efficient execution of the KNN
algorithm in a robotics domain, it is expected to perform well in any domain that
utilizes a KNN algorithm. One such domain could be image processing, where a KNN
algorithm is used to classify comparable image pixels.
Figure 1.1: An autonomous robot being trained to navigate through obstacles.
Figure 1.1 shows an autonomous robot being trained to navigate through obstacles.
Green lines show the sensor readings and the yellow arrow shows the direction. These sensor
readings are used as training data to classify speed and direction during an autonomous run. This image is
used by permission from the Borg Lab at Georgia Institute of Technology.
Autonomous robot navigation in unstructured outdoor environments is a challenging
area of active research. At the core of this navigation task, identifying obstacles and
traversing around these obstacles plays a vital role in reaching the robot’s target destination.
There is a recent trend of using KNN-based approaches in autonomous robotics research
[32]. Autonomous robots can function and perform desired tasks in unstructured
environments without continuous human guidance, but they rely on algorithms such as KNN
for learned data classification. Typically, sensors collect obstacle data and the
decision making system must decide which action to take based on previously learned
behavior [1].
Figure 1.2: Autonomous robot navigation sensors input.
Figure 1.2 shows a representation of a robot’s sensor input for navigation. Each green
line represents an estimate of free space from the robot to an obstacle. At each time step there
are 60 such inputs, which make up a 60-dimensional data point. The yellow arrow shows the
direction input by the robot trainer, and the blue arrowhead shows the original direction of the
target path. Later, the robot uses these 60-dimensional sensor data and the direction taken by
the trainer as the training data set to decide speed and direction during an autonomous run.
In this manner, the robot can move through its operating environment without human
assistance, using a KNN algorithm to dictate which direction to move and what the speed
should be based on previously learned data. Needless to say, this decision making process
must be efficient, accurate, and swift to enable the robot to cope with its environment and
avoid obstacles. Most of the current KNN algorithms (such as KD-tree) are too slow for the
task. As defined, this is our problem domain. It was determined that the existing KD-tree
based nearest neighbor search algorithm suffered performance degradation from the “curse of
dimensionality” and needed improvement. In this research we worked to devise a suitable
algorithm to speed up the classification of such robots’ direction and speed data.
Figure 1.3: Pictorial representations of KNN search.
1.3 What is K-nearest Neighbor Search?
The k-nearest neighbor (KNN) search is a variation of the nearest neighbor algorithm in which
the task is to find the k closest points to the query point. The nearest neighbor search
algorithm, along with its variations, is frequently used to solve problems in areas such as
robotics, data mining, multi-key database retrieval, and pattern classification. Discovering a
way to reduce the computational complexity of nearest neighbor search is of considerable
interest in these areas.
The KNN search, also known as the similarity search, can be expressed as an
optimization problem for finding the closest points in metric spaces [2]. Given a set Nset of N
points in a metric space M and a query point q ∈ M, the problem is to find the k points in
Nset closest to the query point q. Usually, M is considered to be a d-dimensional
Euclidean space and distance is measured by Euclidean distance or Manhattan distance.
A significant cost of the KNN approach is due to the computation of the O(l) distance
function, especially when an application uses vectors with a high dimensionality such as
sensor data from an autonomous robot [3]. A full search solution involves calculating the
distance between the target vector q, and every vector pi, in order to find the k closest to q.
Although full search ensures the best possible search results, this solution is often infeasible
due to its O(nl) cost. Autonomous robot decision making applications often involve
searching a large database for a closest match to a query case [4].
A simple solution to the KNN search problem is to compute the distance from the
query point to all the other points in the database, keeping track of the data points with the
smallest distances calculated so far [5]. This sequential full search finds the k nearest neighbors by progressively
updating the current nearest neighbor pj when a data point is found closer to the query point
than the current nearest neighbor. With each update, the current KNN search radius shrinks
to the actual kth nearest neighbor distance. The final nearest neighbor is one of the data
points inside the current nearest neighbor search radius. Thus, in the sequential full search,
the distances of all N data points to the query point are computed with the search complexity
being N distance computations per query point.
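The sequential full search just described can be written directly; a max-heap of size k holds the current candidates, and its largest entry is the shrinking search radius. Names here are illustrative.

```python
import heapq
import math

def knn_full_search(points, q, k):
    """Brute-force KNN: compute the distance from q to every point,
    keeping the k smallest seen so far in a max-heap (distances are
    negated because heapq is a min-heap)."""
    heap = []
    for p in points:
        d = math.dist(q, p)
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:            # inside the current search radius
            heapq.heapreplace(heap, (-d, p))
    return sorted((-nd, p) for nd, p in heap)

pts = [(1.0, 1.0), (2.0, 2.0), (0.5, 0.5), (9.0, 9.0)]
result = knn_full_search(pts, (0.0, 0.0), k=2)
# nearest two points: (0.5, 0.5), then (1.0, 1.0)
```

Every query still costs N distance computations, which is exactly the cost ckSearch is designed to avoid.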
The number of distance calculations required by any KNN algorithm grows with the size of
the data set. Further, the “curse of dimensionality” increases the number of
calculations tremendously. One approach to reducing the complexity of the nearest neighbor
search is to reduce the number of data points to be searched. Our approach to KNN search
focuses on an inexpensive way of eliminating data points from consideration using
computationally inexpensive rules, thereby avoiding a more expensive distance computation.
The rules determine those data points which cannot be nearer to the query point than the
current nearest neighbor.
The computational demands of KNN queries have increased in recent years.
Moreover, the advent of new research areas using learning algorithms such as autonomous
robotics and other artificial intelligence domains has drawn interest back to nearest neighbor
search. Currently, use of large databases containing millions of image records for a vision
based navigational system is quite common [1]. Naturally, these new challenges have
prompted a fresh look at nearest neighbor search and the ways it can help solve new
problems.
As mentioned above, we apply our cluster-based KNN search method to the task of
steering and speed decision making for an autonomous robot using training data. Our
approach also utilizes a data caching strategy to improve performance. Moreover, the
ckSearch algorithm is general enough to produce good performance in problem
domains such as pattern recognition in image processing, information extraction in data
mining, and text classification.
1.4 Contributions
Results of our research will be of interest to those investigating high performance
memory-based learning methods. In particular, we have implemented a system that supports
fast and exact KNN queries without scanning the entire data set. Our novel contributions
include:
• A geometry-based method for pruning the search space at query time. Some
existing approaches (e.g. Approximate Nearest Neighbor) also prune, but are
not able to provide exact responses to queries.
• Further improved performance using caching.
Our solution is based on a framework consisting of three major components: (1) Pre-
processing of data points into clusters; (2) Data point mapping to a metric data structure; and
(3) Implementation of smart caching. We have designed our caching strategy based on the
assumption that a data cache can boost performance in repeated calculation algorithms such
as KNN. The approach takes advantage of an algorithm that balances the cost and
performance of each component in order to achieve an overall reduction in cost to improve
performance [4]. Using the above-mentioned techniques along with rules to avoid unnecessary
computation, our algorithm achieves a performance improvement over linear search and KD-
tree based KNN algorithms. The performance evaluation section details these experiments
and results.
The rest of the thesis is organized as follows: Chapter 2 discusses related work done
by various other researchers in this area. Chapter 3 presents background information and
various concepts used in this project. Chapter 4 describes in detail our proposed approach
and all the related information. The experiments are discussed and the results are presented in
Chapter 5. Finally, Chapter 6 presents the conclusions of this thesis and describes future
work.
CHAPTER 2
RELATED WORK
This related work chapter surveys recent research on autonomous robot
navigation as well as on the KNN algorithm. Navigation is one of the most challenging skills
required of a mobile robot. Among researchers there has been a recent trend of using the
KNN algorithm to classify learned data. In this chapter, we present some of the related
work done in this area.
The 6D SLAM (Simultaneous Localization and Mapping) system is based on a scan-
matching technique, where scan matching builds on the well-known iterative closest point
(ICP) algorithm [3]. This system employs a cached KD-tree to improve the performance of the
iterative closest point algorithm. Since the KD-tree itself suffers a performance breakdown
with high-dimensional data points, we believe 6D SLAM will suffer performance
deterioration with high-dimensional navigation data [17].
Another approach by researchers to solve the navigation problem is based on stereo
vision of the robot system. Binary classifiers were used to augment stereo vision for
enhanced autonomous robot navigation. However, this system does not use any single optimized
binary classifier; instead, it suggests using several generic classifiers such as SVM, the Simple
Fisher Algorithm, and Fisher LDA. This approach also suggests creating and storing learned
models of traversable and non-traversable terrain. We believe generic binary classifiers are
prone to performance degradation which can affect performance of this system [1].
Some researchers applied memory-based robot learning to solve similar problems.
Memory-based neural networks were used to learn the task to be performed [22], whether
identifying navigational hot spots or making decisions. These researchers also augmented a
nearest neighbor network with a local model network.
Next, we present several related works in the KNN search area. There has been a
long line of research on solving the nearest neighbor search problem. A large number of
solutions have been proposed to reduce the cost of the nearest neighbor search. The quality and
usefulness of these various proposed solutions are determined by the time complexity of the
queries as well as the space complexity of any search data structures that must be maintained.
The current KNN techniques can be divided into five major approaches. These approaches
are: data partitioning approach, dimensionality reduction approach, locality sensitive hashing
(LSH), scanning based approach, and linearization approach.
The most prominent is the data partitioning approach. It is also known as the space
partitioning, spatial index, or spatial access method. Data partitioning techniques such as
the KD-tree [22] or Grid-file [25] iteratively bisect the search space into regions containing a
fraction of the points of the parent region. Queries are performed via traversal of the tree
from the root to a leaf by evaluating the query point at each split. One of the main drawbacks
of this approach is the “curse of dimensionality,” a problem caused by the exponential
increase in volume associated with adding extra dimensions to a mathematical space. Data
partitioning techniques perform well with low-dimensional data points. With high-dimensional
data, on the other hand, a partitioning technique’s performance quickly degrades because of
the exponential increase in volume associated with iterative partitioning of the
high-dimensional Euclidean search space. Multi-dimensional
indexes such as R-trees [46] have been shown to be inefficient for supporting range queries
in high-dimensional databases [19].
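As a concrete illustration of the data partitioning approach, a minimal KD-tree with nearest neighbor backtracking can be sketched as follows (a sketch only, not the implementation evaluated in this thesis):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively bisect the point set on alternating axes; each node
    stores the median point and the splitting axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, q, best=None):
    """Descend toward q, then backtrack into the far branch only when
    the splitting plane is closer than the best distance found so far."""
    if node is None:
        return best
    if best is None or math.dist(q, node["point"]) < math.dist(q, best):
        best = node["point"]
    axis = node["axis"]
    if q[axis] < node["point"][axis]:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, q, best)
    if abs(q[axis] - node["point"][axis]) < math.dist(q, best):
        best = nearest(far, q, best)   # the far region may hold a closer point
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
# nearest neighbor of (9, 2) in this set is (8, 1)
```

The backtracking test is precisely where high dimensionality hurts: as d grows, the splitting plane is almost always within the current search radius, and the search degenerates toward a full scan.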
Dimensionality reduction approaches first apply “dimensionality reduction” techniques
to the data and then insert the reduced data into indexing trees. Dimension reduction is the
process of reducing the number of random variables or attributes being considered; it is
divided into feature selection and feature extraction. There are costs associated with
performing dimension reduction and the subsequent data indexing. That is why this
technique performs well on low-dimensional data sets but suffers when the data dimension
increases.
Locality sensitive hashing (LSH) is a comparatively new nearest neighbor search
approach. It is a technique for grouping points into buckets based on a distance metric
operation on the points. Points that are close to each other under the chosen metric are
mapped to the same bucket with high probability. Theoretically, for a database of n vectors
of d dimensions, the time complexity of finding the nearest neighbor of an object using
locality sensitive hashing is sub-linear in n and only polynomial in d. A key requirement of
applying LSH to a particular space and distance measure is to identify a family of locality
sensitive functions, satisfying the properties [26]. Thus, locality sensitive hashing is only
applicable for specific spaces and distance measures where such families of functions have
been identified, such as real vector spaces with distance measures, or bit vectors with the
Hamming distance [28]. Also, because locality sensitive hashing techniques are based on
hashing, they have a large memory footprint; the large amount of memory that must be
allocated to apply LSH is certainly a major drawback [27].
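A minimal illustration of the LSH idea using random-hyperplane hash bits follows; the hyperplanes here are fixed for reproducibility, whereas a real LSH family draws them at random and uses several hash tables.

```python
def hyperplane_hash(point, planes):
    """One bit per hyperplane, set by which side of the plane the point
    lies on; nearby points agree on most bits and thus tend to land in
    the same bucket."""
    return tuple(int(sum(a * b for a, b in zip(point, w)) >= 0)
                 for w in planes)

# Fixed planes for illustration only.
planes = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
near_a = (1.0, 1.0, 1.0)
near_b = (1.01, 0.99, 1.0)   # a tiny perturbation of near_a
far = (-1.0, -1.0, -1.0)
# near_a and near_b hash to the same bucket; far hashes elsewhere
```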
Scanning based approaches such as the VA-file [17] divide the data space into 2^b
rectangular cells, where b denotes a user-specified number of bits. Each cell is allocated a bit-
string of length b that approximates the data points that fall into it. The VA-file is based
on the idea of object approximation and approximates object shapes by their minimum
bounding box. The VA-file itself is simply an array of these compact, geometric
approximations. The nearest neighbor search starts by scanning this entire file of
approximations and filtering out irrelevant points based on their approximations. Unlike
grid-files or R-trees, the VA-file does not organize these cells hierarchically.
Linearization approaches, such as space-filling curve methods (e.g., the Z-order curve), map d-
dimensional points into a one-dimensional space (curve). As a result, one can issue a range
query along the curve to find k-nearest neighbors.
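The Z-order mapping can be sketched by interleaving coordinate bits into a Morton key. This is a generic sketch; ckSearch's own linearization instead keys points by their distance to the cluster center.

```python
def morton_key(point, bits=8):
    """Interleave the bits of each integer coordinate, mapping a
    d-dimensional point onto the one-dimensional Z-order curve."""
    key = 0
    d = len(point)
    for bit in range(bits):
        for dim, coord in enumerate(point):
            key |= ((coord >> bit) & 1) << (bit * d + dim)
    return key

# Nearby points tend to receive nearby keys, so a 1-D range query along
# the curve serves as a coarse filter for k-nearest-neighbor candidates.
keys = [morton_key(p) for p in [(0, 0), (1, 1), (200, 200)]]
```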
As is evident from the discussion so far, most conventional approaches to the
KNN search suffer from drawbacks related either to performance or to memory space
complexity. Our proposed ckSearch approach, described in detail later in this
thesis, is a novel approach to the KNN search problem. It utilizes clustering to achieve data
partitioning. It makes smart but balanced use of data caching to boost performance. It
avoids the curse of dimensionality by mapping d-dimensional points into a one-dimensional
space using a linearization approach. It uses an indexing tree as its data structure to evade large
memory requirements. Moreover, it introduces metric index caching to the KNN algorithm.
As described in this chapter, there have been metric-based KNN systems before, but our
proposed data clustering along with smart use of a data cache is a unique and novel approach.
Our solution has been carefully designed to overcome the disadvantages many of these
conventional approaches suffer while retaining the benefits the above-mentioned techniques
enjoy.
CHAPTER 3
BACKGROUND
This chapter provides comprehensive background information for our project. It is
important to remind the reader that the main goal of this project is to design a fast cluster-
based KNN algorithm. In addition, this KNN algorithm must be able to process autonomous
robot navigation (sensor) data quickly so that the robot can decide on direction and speed
without stalling or running into obstacles. As mentioned above, the actual algorithm will be
described in the next chapter, while all the necessary background information is explained
here. For ease of exposition, the background is divided into four sections: data clustering,
data caching, basic nearest neighbor search, and the KD-tree search algorithm.
3.1 Data Clustering
Data clustering is an essential component of our ckSearch algorithm and is considered part of
the pre-processing step. A large portion of the cost of a KNN search is due to repeated
evaluation of the distance function, whose cost is linear in the number of dimensions. This is
especially true when an application contains points with many dimensions, such as the
navigation sensor readings of an autonomous robot. The central strategy for reducing these
repeated, and in some cases unnecessary, distance computations is to partition the data space,
and data clustering is one of several ways to achieve this goal.
Figure 3.1: Data clustering in 2-dimensional space.
Cluster analysis is the organization of a collection of patterns, usually a vector of
measurements or a point in a multidimensional space, into clusters based on similarity [9].
Ideally, patterns within a valid cluster are more similar to each other than they are to a pattern
belonging to a different cluster. Since data points in a large database or data set are often
clustered or correlated, data clustering as a data partitioning technique seems ideal. The
diversity of techniques for data representation, similarity between data elements, and
categorizing data elements has generated a range of clustering methods.
Typical pattern clustering activity involves the following steps [9]:
(1) pattern representation
(2) definition of a pattern proximity measure appropriate to the data domain
(3) clustering or grouping
(4) data abstraction
(5) assessment of output if needed
Figure 3.2: Stages in data clustering
Pattern representation refers to the number of classes, the number of available
patterns, and the features available to a clustering algorithm. It is divided into feature
selection and feature extraction. Feature selection is the process of identifying the
most effective subset of the original features to use in clustering. Feature extraction is the
use of one or more transformations of the input features to produce new, salient features.
Either or both of these techniques can be used to obtain an appropriate set of features for
clustering.
Pattern proximity is usually measured by a distance function defined on pairs of
patterns. A variety of distance functions are used depending on the data domain. The
Euclidean distance function is the most popular of these and is often used to measure the
similarity between two patterns; other similarity measures can be used to capture the
conceptual similarity between patterns.
The clustering step can be performed in a variety of ways. There are several major
clustering techniques available, such as hierarchical, partitional, fuzzy, probabilistic, and
graph-theoretic methods, to name a few. K-means clustering, a partition-based
technique, was used in this project: it is simple and a good fit for the data
partitioning required by a nearest neighbor search algorithm. Several other
clustering schemes exist in the literature, such as BIRCH [30], CLARANS, and DBSCAN [31].
Data abstraction is the next step in the clustering process (the output assessment
step being optional). It is the process of extracting a simple representation of the data set. A
typical data abstraction is a compact description of each cluster, usually in terms of a cluster
prototype or representative pattern such as the centroid.
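For illustration, the clustering and abstraction steps above can be sketched with a minimal K-means implementation. This is an illustrative toy under simple assumptions (Euclidean proximity, cluster centroids as the data abstraction); the function and variable names are our own, not those of any particular library.

```python
import math
import random

def euclidean(p, q):
    # Pattern proximity: Euclidean distance between two d-dimensional points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=20, seed=0):
    # Partition `points` into k clusters; return (centroids, assignments).
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        assign = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members,
        # i.e., the "data abstraction" (one representative per cluster).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, assign
```

On two well-separated groups of points, the returned centroids converge to the group means, which then serve as the reference points for indexing.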
In this project, data indexing is not dependent on the underlying clustering method,
but the clustering strategy is expected to influence data retrieval performance.
Figure 3.3: Typical application cache structure
3.2 Data Caching
Data caching is a general technique used to enhance the performance of data access when the
original data is expensive to compute compared to the cost of reading the cache. In a KNN
search, a large high-dimensional data set is accessed repeatedly with each query, so a
data cache can prove extremely effective. When data is cached,
the most recently accessed data from the high-dimensional data set is stored in a memory
buffer. This data cache is thus a temporary storage area where frequently accessed data can
be kept for rapid access. When our ckSearch algorithm needs to access data, it first checks
the cache. If it finds what it is looking for there, it uses
the data from the cache instead of going to the data source. Using the data cache,
our proposed algorithm can therefore achieve shorter access times and boost performance. Even though
a data cache is favorable, there are computational costs associated with data caching,
primarily an accumulation of data retrieval cost, data maintenance cost, and cache
miss cost. Thus, our proposed algorithm implements a comprehensive caching strategy to
keep cache cost from offsetting performance gains.
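The check-cache-first pattern described above can be sketched generically as follows. This is a simple dictionary-backed illustration of the idea, not the actual ckSearch cache (which is B-tree based); the class and parameter names are hypothetical.

```python
class DataCache:
    """Tiny cache-aside wrapper: check the cache first, fall back to the
    expensive data source on a miss, and remember the result."""
    def __init__(self, fetch, capacity=128):
        self.fetch = fetch          # expensive lookup against the data source
        self.capacity = capacity
        self.store = {}             # key -> cached value
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.store:       # cache hit: skip the data source entirely
            self.hits += 1
            return self.store[key]
        self.misses += 1            # cache miss: pay the retrieval cost
        value = self.fetch(key)
        if len(self.store) >= self.capacity:
            # Maintenance cost: evict the oldest inserted entry.
            self.store.pop(next(iter(self.store)))
        self.store[key] = value
        return value
```

The hit and miss counters make the trade-off discussed above observable: performance improves only while hits outnumber the combined retrieval, maintenance, and miss costs.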
Figure 3.4: Basic KNN search process represented in 2-dimensional space
3.3 Basic KNN Search
In order to search for the k nearest neighbors of a query point q, the distance of the kth nearest
neighbor to q defines the minimum radius rmin required for retrieving the complete answer
set. It is not possible to calculate this distance in advance, because the points surrounding
q are unknown without further scanning. Thus, iteratively
increasing the search radius and examining the neighbors within that search sphere is a viable
approach.
To describe this algorithm: the task is to find the k nearest neighbors of a query
point q. The search starts with a query sphere
defined by a relatively small radius r about q, SearchSphere(q, r). Naturally, all
data spaces the query sphere intersects have to be searched for potential k nearest neighbors.
The search sphere is expanded iteratively until all k nearest neighbor points are found. In this
process, all data subspaces intersecting the current query sphere are checked. If
enlarging the query sphere does not introduce new nearest neighbor points, the current
KNN result set R is considered final (assuming the size of the current result
set is k). The search starts with a small initial radius, which keeps the search space small
and avoids unwanted calculations; the goal is to minimize unnecessary search cost.
Arguably, a search sphere with a larger radius may contain all k
nearest points, but the cost of going through all the data points outweighs the benefits.
Basic KNN Search(k):
1 R = empty; // The result set
2 Search sphere radius, r = as small as possible;
3 Find all data spaces intersecting the current query sphere;
4 Check all intersecting data spaces for k nearest neighbors;
5
6 if R.Size() == k // all k nearest neighbors found
7 exit;
8 else
9 increase search radius r;
10 goto line 3; // restart the search with the larger sphere
END;
Figure 3.5: Basic KNN Search Algorithm
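The pseudocode in Figure 3.5 can be sketched as a small runnable function. This is a flat-scan illustration with an assumed starting radius and growth factor, not the cluster-aware ckSearch algorithm.

```python
import math

def basic_knn_search(points, q, k, r0=0.5, growth=2.0):
    """Iteratively grow a query sphere around q until it holds at least k
    points, then return the k nearest of them (Figure 3.5 in Python)."""
    dist = lambda p: math.dist(p, q)   # Euclidean distance to the query
    r = r0                             # start with a small search radius
    while True:
        # "Check all intersecting data spaces": here, one flat data space.
        R = [p for p in points if dist(p) <= r]
        if len(R) >= k:                # enough candidates inside the sphere
            # Every point outside the sphere is farther than radius r,
            # so the k nearest inside the sphere are the global k nearest.
            return sorted(R, key=dist)[:k]
        r *= growth                    # otherwise enlarge the sphere, retry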
We performed several performance comparisons between the ckSearch algorithm and
a KD-tree based multi-dimensional indexing structure, detailed in the experiment section of
this paper. We believe it is important to understand the KD-tree algorithm in order to
understand those comparisons, so a comprehensive account of the KD-tree is included in the
following section.
Figure 3.6: KD-tree data structure
3.4 KD-tree data structure
K-dimensional search trees, i.e., KD-trees, are a generalization of binary search trees
designed to handle multidimensional records. In a KD-tree, a multidimensional
record is identified with its corresponding multidimensional key x = (x(1), x(2), . . ., x(K)),
where each x(n), 1 ≤ n ≤ K, refers to the value of the nth attribute of the key x. Each x(n)
belongs to some totally ordered domain Dn, and x is an element of D = D1 × D2 × . . . × DK.
Therefore, each multidimensional key may be viewed as a point in a K-dimensional
space, and its nth attribute can be viewed as the nth coordinate of such a point. Without loss of
generality, we assume that Dn = [0,1] for all 1 ≤ n ≤ K, and hence that D is the hypercube
[0,1]^K [10]. A KD-tree for a set of K-dimensional records is a binary tree such that:
(1) Each node contains a K-dimensional record and has an associated discriminant
n ∈ {1, 2, . . . , K}.
(2) For every node with key x and discriminant n, any record in the left sub-tree with
key y satisfies y(n) < x(n), and any record in the right sub-tree with key y satisfies
y(n) > x(n).
(3) The root node has depth 0 and discriminant 1. All nodes at depth d have discriminant
(d mod K) + 1.
There are many implementations of KD-trees, both homogeneous and non-
homogeneous. Non-homogeneous KD-trees contain only one value in each internal
node, together with pointers to its left and right sub-trees; all records
are stored in external nodes. The expected cost of a single insertion in a random KD-tree is
O(log n), while the expected cost of building the whole tree is O(n log n). Deletions in
KD-trees have an expected cost of O(log n), and nearest neighbor queries are supported
in O(log n) expected time [10].
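For concreteness, a toy KD-tree (build plus nearest neighbor query) might look as follows. This is an illustrative sketch under the definitions above, not the indexing structure actually used in our experiments; the 0-based axis corresponds to the discriminant (d mod K) + 1 of property (3).

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a KD-tree; the discriminant cycles through the
    coordinates, matching property (3) above (0-based here)."""
    if not points:
        return None
    axis = depth % len(points[0])            # discriminant for this depth
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # median split balances the tree
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
        "axis": axis,
    }

def nearest(node, q, best=None):
    """Nearest neighbor query: descend toward q, then unwind and check
    whether the splitting plane could hide a closer point."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], q) < math.dist(best, q):
        best = node["point"]
    axis = node["axis"]
    if q[axis] < node["point"][axis]:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, q, best)
    # Visit the far side only if the query sphere crosses the split plane.
    if abs(q[axis] - node["point"][axis]) < math.dist(best, q):
        best = nearest(far, q, best)
    return best
```

The pruning test in the last step is what gives the KD-tree its expected logarithmic query time: whole subtrees are skipped whenever the splitting plane is farther away than the best candidate found so far.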
CHAPTER 4
SYSTEM ARCHITECTURE
In this section, we describe the system architecture of ckSearch, which includes our scalable
and efficient KNN search mechanism.
A number of solutions have been introduced to reduce the cost of the KNN search.
The quality and usefulness of these solutions are limited by the computational time
complexity of the queries as well as the space complexity of the relevant search data
structures. As mentioned in the Related Work chapter, solutions face tradeoffs that affect
performance and are prone to the curse of dimensionality phenomenon as the number of
attributes increases. When the number of attributes is large, KNN implementations either
require large memory allocations (due to space complexity) or fall victim to time
complexity. A rule of thumb is that KNN algorithms work well for 20 or fewer attributes [10].
Our ckSearch technique balances both time and space complexities to achieve an
overall reduction in both. Our cluster-based approach uses caching to minimize the cost of
searching high-dimensional data. Our solution, detailed in this section, includes two phases:
(1) The pre-processing of data points; and (2) Runtime queries. In the pre-processing step the
d-dimensional data set is partitioned into data clusters based on similarity between the data
points. We discuss both phases in detail in the preprocessing and runtime query sections
below. The following observations influenced the design of the ckSearch system:
Observation 1: (Data partitioning)
Data space partitioning can reduce redundant distance computations while searching for k
nearest neighbors in a high dimensional data domain. Simple clustering algorithms such as
K-means clustering can reduce computational cost by separating high-dimensional data
points into clusters based on similarity.
Observation 2: (Data reference)
Reference to a cluster centroid may expose similarity or dissimilarity between data points
within a cluster and data points across different clusters. Moreover, data points in a cluster
can be sorted based on their distance from a reference point such as the cluster centroid.
Observation 3: (Data Caching)
Data caching can substantially reduce search time by pre-fetching data and reducing the cost
of distance calculation for the KNN search. Cache miss expenditure must be kept in check by
using smart cache strategies and rules to predict cache miss scenarios.
4.1 Pre-Processing
Step 1: Data Partitioning – K-means Clustering
Data clustering is an essential component of our algorithm. By clustering as a pre-processing
step, we are able to improve the performance of queries at runtime. A direct approach to
reducing the complexity of the nearest neighbor search is to reduce the number of data points
investigated. The central strategy for reducing these repeated, and in some cases unnecessary,
distance computations is to partition the data space. CkSearch splits the data space into
partitions and uses data clustering, based on data similarities, to avoid examining
unnecessary data points in multidimensional data (Observation 1). The first step
is to cluster the data set using an existing K-means clustering algorithm.
K-means clustering is a simple, partition-based clustering technique and a good fit
for the data partitioning required for nearest neighbor search. It is important to mention that
even though our approach uses K-means clustering, it does not depend on this particular
clustering technique; we could just as easily have selected another clustering algorithm such
as DBSCAN [31], CLARANS, or BIRCH [30]. In our algorithm, the number of clusters is
selected based on the number of records in the data set: we choose 5 clusters for up to 10,000
records and add 2 clusters for every additional 5,000 records.
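The cluster-count rule just described can be written down directly. The sketch below assumes partial steps are rounded up to the next full 5,000-record step; the function name is our own.

```python
def num_clusters(n_records):
    # 5 clusters for up to 10,000 records, then 2 more clusters for
    # every additional 5,000 records (partial steps rounded up).
    if n_records <= 10000:
        return 5
    extra_steps = -(-(n_records - 10000) // 5000)   # ceiling division
    return 5 + 2 * extra_steps
```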
Figure 4.1: Cluster data linearization
Once we have selected a number of cluster centers, we can use them to index our
data. Figure 4.1 shows cluster data linearization based on the distance between the center
and each data point in that cluster. The cluster center is the starting point of the segment,
and the cluster boundary is the maximum of the segment.
Step 2: Data index construction & Index structure
After the clustering phase, our algorithm constructs the data index. This data index is a single
dimensional value based on the distance between the data point and a reference point in a
data partition. During this part of the process, each high-dimensional point is transformed
into a point in a single dimensional space.
This conversion is commonly known as data linearization or data mapping.
Linearization is achieved by selecting a reference point and then ordering all data points
according to their distances from the selected reference point. This aligns well with
Observation 2, which states that reference to a cluster center may expose similarity or
dissimilarity of data points in a cluster; this similarity or dissimilarity is exposed by
linearization in the form of data mapping. Several types of reference points can
be used for the linearization process. Typically the center of a cluster is used as a reference
point, but some linearization techniques use either a boundary (edge) point or a random
point as reference. Ad hoc linearization approaches, such as space-filling curve methods
like the Z-order curve [15], map d-dimensional points into a one-dimensional space (curve). For
further cost reduction, we use a three-step data linearization algorithm, described in the
following section.
First, a reference point is identified for each partition or data cluster; the center of
each cluster is selected as the reference point. In the second step, the Euclidean
distance between the data point pi and its cluster center Ci is computed. In the
final step, the following simple linear function completes the conversion (i.e., the
data mapping), transforming each high-dimensional data point into a key, keyi, in a single
dimensional space.
keyi = distance(pi, Ci) + m × µ; (4.1)
In the above function (4.1), the term keyi represents the single dimensional index
value for a data point after the linearization process [11]. According to the research work on
data partitioning by Agbhari & Makinouchi [11], data points in a cluster can be referenced
and mapped through a fixed reference point such as the cluster center; we utilized this
concept to perform data linearization in this project. The term distance(pi, Ci) is the
Euclidean distance between the data point pi and the cluster center reference point Ci, and
returns a single dimensional distance value. The next parameter, m, is the number of the
data cluster being processed. If there are M clusters in total, then 0 ≤ m ≤ M − 1; with 10
clusters, for example, m is one of the values in [0, 1, 2, . . ., 9].
The last parameter µ is a constant that stretches the data ranges. The constant µ serves as
a multiplier to the parameter m so that all points in a partition or cluster map to a
region between m × µ and (m + 1) × µ. Because of the µ multiplier, function (4.1)
correctly maps the cluster center to the minimum boundary (starting index) of this region
and the furthest data point in the cluster to the maximum boundary (index) of the region.
Moreover, all other data points in the cluster map appropriately between the minimum and
maximum indices. As a result, one can issue a range query to find the nearest neighbors,
enabling the use of an efficient single dimensional index structure such as the B-tree.
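Equation (4.1) translates directly into code. The sketch below uses an assumed value of µ for illustration; µ must exceed every within-cluster distance so that each cluster occupies its own disjoint key range.

```python
import math

def linearize(point, center, m, mu):
    # Equation (4.1): key_i = distance(p_i, C_i) + m * mu.
    # With mu larger than any within-cluster distance, cluster m
    # occupies the half-open key range [m * mu, (m + 1) * mu).
    return math.dist(point, center) + m * mu

MU = 100.0                       # assumed stretch constant for this example
center0, center1 = (0.0, 0.0), (50.0, 50.0)
key_a = linearize((3.0, 4.0), center0, m=0, mu=MU)    # cluster 0 point
key_b = linearize((53.0, 54.0), center1, m=1, mu=MU)  # cluster 1 point
```

Here key_a = 5.0 falls in cluster 0's range [0, 100) and key_b = 105.0 falls in cluster 1's range [100, 200), so a one-dimensional range query never mixes points from different clusters.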
Figure 4.2: B-tree data structure
Figure 4.2 above shows a B-tree data structure. The leaf nodes contain data points. B-
tree is especially optimized for search operations.
Step 3: Data structure & data loading
The selection of appropriate data structures is an integral part of any efficient search
algorithm design. For a fast data retrieval algorithm such as ckSearch, it is vital to use
speedy data structures. In the ckSearch system, we use three different data structures. The core
structure is the B-tree, which serves as the main data storage for our system. We also
use one-dimensional and two-dimensional arrays; the two-dimensional array
stores the minimum and maximum data distance for each cluster. Any balanced
tree such as a B-tree works well as a fast cache data structure because of its rapid data
retrieval time. Accordingly, an instance of the B-tree is used for the data caching
implementation as well.
A B-tree is a data structure that keeps data sorted and allows searches, insertions, and
deletions in logarithmic time. It is optimized for systems that read and write large segments
of data, such as data clusters, databases, and file systems. In B-trees, non-leaf nodes can have
a variable number of child nodes and serve as guides to the leaf nodes. Search operations on
in-memory B-trees are significantly faster than on in-memory red-black trees and AVL trees
[28]. The B-tree fits our ckSearch algorithm well because costly insertion operations
are performed only during the pre-processing index loading time. During the actual ckSearch
runtime, only inexpensive search operations are performed (on the B-tree) to locate the k
nearest neighbors. This strategy further improves overall processing time.
After the data linearization process described in the previous section, the mapped
points are loaded into the B-tree. The transformed data point indexes serve as keys for the
data structure, and only leaf nodes store the actual data points. The conventional B-tree was
modified so that each leaf node is linked to its neighboring leaf nodes on both sides; this
modification further speeds up retrieval of nearest neighbor points.
In our algorithm, a two-dimensional array stores the maximum distance,
distMaxi, between each cluster center Ci and the furthest data point in that cluster;
the minimum distances distMini are stored in this two-dimensional array as well.
Our algorithm uses the distMaxi and distMini values to eliminate unnecessary
out-of-boundary (data space) computations. A separate one-dimensional array stores the
cluster centers.
4.2 ckSearch Runtime Queries
In this section, we describe the ckSearch query. After loading the indexes into the tree-based
data structure, the pre-processing part of the algorithm concludes. At this point, our algorithm
performs the fast KNN search.
How ckSearch Works
In this section, we describe the search process of ckSearch. The overall technique is to
solve the KNN problem iteratively. It begins by selecting a small radius ri defining a small
area around the query point and then iteratively increases the radius up to a maximum rmax.
The search space grows until all k nearest neighbors are found or the "STOP"
criterion has been met (r reaches rmax).
As explained above, during pre-processing the data points are clustered (using
K-means), reference points are selected (cluster centers), data linearization
is completed, and the data points are loaded into a B-tree data structure. The actual search
begins by consulting the cache hit-miss strategy and determining the outcome based on the
cache rules described below in the "Cache Strategy" section. Regardless of the outcome of
the cache strategy, the ckSearch algorithm next inspects the following two stopping criteria:
• The search radius ri has reached its maximum threshold value rmax and the k nearest
neighbors still have not been found.
• The distance(pmax, q) value, the distance between query point q and the furthest data
point pmax in the result set R, is less than or equal to the current search radius ri, and
the size of the result set is k. In this case, we can be sure that the algorithm has found
all k nearest neighbors of query point q, and further increasing the query area (i.e.,
search radius ri) would only incur redundant computational cost.
Next, if the outcome of the cache hit-miss strategy is a hit, the algorithm enters the
SearchCache(q) sub-routine (see Appendix B, figure B.3). The data cache is a B-tree index
structure modified to access left and right leaf nodes. The algorithm iteratively
runs the SearchCache(q) sub-routine until stopped by the stopping criteria mentioned in the
"Cache Strategy" section below. In each iteration, it increases the search radius ri by an
increment amount, rincrement, to widen the search space. If instead a cache miss
occurs at the beginning of the search, our ckSearch algorithm enters a loop where it first
checks the stopping criteria and then enters the SearchClusters(q) routine.
The SearchClusters(q) routine (see Appendix B, figure B.2) is an important part
of our algorithm because it applies the "Cluster Search Rules" to eliminate significant
computation cost. It checks every cluster iteratively and takes one of the following three
actions:
• Exclude the cluster from the search: If the cluster in question does not contain or
intersect the search sphere of the query point q and falls under the cluster exclusion
rule (Rule 1), the cluster is exempted from the KNN search. This yields a significant
reduction in computation cost.
• Call SearchLeftNodes(), searching the cluster inwards and ignoring nodes to the right: If
the cluster in question intersects the query search sphere according to the cluster-
intersects-query-sphere rule (Rule 4), the data space inward toward the cluster
center must be searched. In this case, only nodes to the left of the query node in the
B-tree need to be searched; nodes to the right (in the B-tree) are ignored because
they reside outside the cluster boundary. Thus, our algorithm only calls the
SearchLeftNodes(leafNodei, keyleft) sub-routine (Appendix B, figure B.4) in the next
step to search for the k nearest neighbors.
• Perform an exhaustive search: If the data cluster contains the query point q, as
determined by the cluster-contains-query-sphere rule (Rule 3), then an exhaustive
search of the cluster must be completed to find the k nearest neighbors. The data
space is traversed by searching inward and outward from the cluster center, because
potential nearest neighbors can be to the left or right of the query node in the B-tree.
The search routines SearchLeftNodes(leafNodei, keyleft) and
SearchRightNodes(leafNodei, keyright) are used for searching inward and outward of
the cluster center, respectively.
Next, our ckSearch algorithm locates the leaf node leafNodei (in the B-tree)
where a point with the query index keyquery would be stored. Intuitively, this
leafNodei has a high probability of holding the nearest neighbors of the query point,
because the data points stored in leafNodei have a similar distance from the cluster center as
the query point q and therefore reside in the same region of the data space. The sub-routine
getQueryLeaf(btree, keyquery) returns this leaf node.
Next, based on the cluster search rules (as described in the "Cluster Search Rules"
section), the ckSearch algorithm either calls SearchLeftNodes(leafNodei, keyleft) alone
for Rule 4 or calls both SearchLeftNodes(leafNodei, keyleft) and
SearchRightNodes(leafNodei, keyright) for Rule 3. Each of these sub-routines has built-in
loops to check for the k nearest neighbors in leafNodei. Moreover, these routines check left
and right sibling leaf nodes according to the inward or outward data search (Rule 3 or Rule 4).
Figure 4.3: Data cluster to B-tree correlation
Figure 4.3 above shows the data cluster to B-tree correlation: how the data points
in a cluster are stored in the B-tree leaf nodes (bottom level). The data points are sorted
by their one-dimensional linearized distance from the cluster center, which is used as the
key.
It is important to mention that the actual discovery of the nearest neighbors happens in the
SearchLeftNodes(leafNodei, keyleft) and SearchRightNodes(leafNodei, keyright) sub-routines,
because each of these two search routines iteratively calculates the distance between each
data point in leafNodei and the query point q. The k data points with the shortest distance to
the query point are returned as the result set.
If the query sphere contains the first element of a node, then it is likely that its
predecessor with respect to distance from the cluster center may also be close to q. Thus, the
SearchLeftNodes(leafNodei, keyleft) also examines its left sibling leaf node for nearest
neighbors. On the other hand if the query sphere contains the last element of a node, for the
same reason as stated above, the SearchRightNodes(leafNodei, keyright) routine examines its
right sibling leaf node for nearest neighbors.
At the end of these phases the algorithm re-examines the two stopping criteria
mentioned above. It checks the KNN result set R and stops if the k nearest neighbors have
been identified, ensuring that further enlargement of the search sphere would not change the
result set: the search stops only if the distance of the furthest data point in the answer set R
from the query point q is less than or equal to the current search radius ri. Otherwise, it
increases the search radius and repeats the entire process. Figures B.1 through B.5 in
Appendix B give pseudocode for the sub-routines mentioned above.
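Because the linearized keys are kept sorted in the B-tree leaves, the inward/outward neighbor scan reduces to a range query over sorted keys. A sorted-array stand-in for the B-tree illustrates the idea (the function name is our own, not a sub-routine from Appendix B):

```python
import bisect

def range_candidates(sorted_keys, key_query, r):
    """Return indices of keys within [key_query - r, key_query + r].
    This mimics following left/right sibling leaves from the query leaf."""
    lo = bisect.bisect_left(sorted_keys, key_query - r)
    hi = bisect.bisect_right(sorted_keys, key_query + r)
    return range(lo, hi)

keys = [1.0, 2.5, 4.0, 4.2, 7.9, 9.5]          # linearized keys, sorted
idx = list(range_candidates(keys, 4.1, 1.0))    # query key 4.1, radius 1.0
```

Note that candidates found this way must still be verified with the full d-dimensional distance to q: equal distance from the cluster center does not imply closeness in the original space, which is why the sub-routines above recompute distances for each candidate point.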
Figure 4.4: ckSearch algorithm data caching scheme
Figure 4.4 shows the ckSearch algorithm data caching scheme. In this example, the
query point q and data point A reside in the same leaf node of the ckSearch cache. This is a
cache hit scenario.
Cache Strategy
Data caching is an important component of our KNN search algorithm. A data cache can
prove extremely effective in a KNN search process that repeatedly accesses a large
high-dimensional data set. A fast cache implementation can dramatically
reduce the number of distance computations by simply storing frequently accessed data in a
data cache. On the other hand, expensive cache misses can degrade performance. Thus, we
have developed a cache strategy that reduces redundant computation while avoiding expensive
cache misses (and therefore costly B-tree insertion operations). This cache strategy
comprises the following rules:
• Reduce the cost of insertion operations as much as possible by reducing frequent
cache updates. The underlying data structure of our cache strategy is a B-tree,
ideal for fast cache implementations; inserting a record into a B-tree requires
O(log n) operations in the worst case.
• Conduct preliminary checks before performing costly cache searches to reduce cache-
miss cost. We take this conservative approach to make sure that cache hits remain a
performance boost for the ckSearch system and are not overwhelmed by too many
cache misses. For a given query point, we find the closest cluster by calculating the
distance between the query point and the cluster centers. Then we check whether the
closest cluster to the query point is the same as the cluster stored in the data cache (a
B-tree structure). Our assumption is that two consecutive query points will fall in the
same cluster, and possibly in the same region of that cluster, so their k nearest
neighbors will also be in the same region of the cluster.
• Perform an additional check by matching the query point's leaf node from the data
cache B-tree with the leaf node from the actual data storage B-tree. These two leaf
nodes essentially indicate the same region of the same data cluster. If the two leaf
nodes turn out to be the same, then the current query point falls in the same data
region as the previous query point, because our data structure keeps the leaf nodes
sorted by distance from the cluster center; for two leaf nodes to be the same, the data
points stored in them must be located in the same region of a cluster. In that case, our
ckSearch algorithm proceeds to retrieve the k nearest neighbors from the data cache.
• If the above checks indicate a cache miss, our algorithm skips the data cache and
performs the search on the main data storage B-tree. At the end of the query search,
the leaf nodes containing the nearest neighbors are loaded into the data cache B-tree
for the next query iteration as part of the CacheUpdate process.
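The preliminary checks in the rules above can be sketched as a single predicate. This is an illustrative stand-in: the names are hypothetical, and a key-interval comparison substitutes for the actual leaf-node match between the two B-trees.

```python
import math

def probable_cache_hit(q, centers, cached_cluster, cached_leaf_range):
    """Apply the preliminary checks: is q's closest cluster the cached one,
    and does q's linearized key fall inside the cached leaf's key range?"""
    # Check 1: the closest cluster must match the cluster held in the cache.
    closest = min(range(len(centers)), key=lambda i: math.dist(q, centers[i]))
    if closest != cached_cluster:
        return False
    # Check 2 (stand-in for the leaf-node comparison): the query's key,
    # its distance from the cluster center, must land in the cached
    # leaf's key interval.
    key = math.dist(q, centers[closest])
    lo, hi = cached_leaf_range
    return lo <= key <= hi
```

Only when both checks pass does the search proceed against the cache; otherwise the algorithm falls through to the main B-tree, keeping cache-miss cost low.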
Cluster Search Rules
Our online search strategy depends critically on the search radius parameter ri. We initially
select ri to be conservatively small; if a query does not return enough points, we gradually
increase the value of ri. In this section we describe several cluster search rules based on the
query radius, the cluster boundary, and the location of the query point. Using these
parameters and simple geometric calculations, it is possible to determine with certainty that
some clusters cannot contain any of the k nearest neighbors. These clusters can be completely
excluded from computation, eliminating a significant amount of computational cost. The
following rules are applied during query time (runtime) of ckSearch.
Figure 4.5: Cluster search rule 1 (Cluster exclusion rule)
The above figure (figure 4.5) illustrates cluster search rule 1 (the cluster exclusion
rule). In this example, the query point is outside the cluster M1. This cluster can be excluded
from the KNN search operations, avoiding expensive distance computations.
Rule 1: The cluster exclusion rule
A cluster can be excluded from nearest neighbor search if the following condition is true,
distance(Ci, q) - ri > distMaxi (4.2)
Employing this exclusion strategy, a cluster and all of its data points can be removed
from the KNN search, reducing the distance computation cost accordingly.
Let Ci be the reference point (cluster center) of the cluster Mi. Now, the query point q
has a search radius ri. As described above, ri is the search radius of the search area where the
ckSearch system looks for possible nearest points. The distance between the cluster center
and the query point q is denoted by distance(Ci, q). Moreover, the distance between Ci and
the furthest data points in cluster Mi is denoted by distMaxi. Given the condition
distance(Ci, q) > distMaxi, we can say that the cluster Mi can be excluded from the KNN search
if the query point q and its query sphere rest outside the cluster boundary.
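Rule 1 reduces to the single comparison of equation 4.2. As a minimal Python sketch (the thesis implementation is in Java; `math.dist` is used here for Euclidean distance):

```python
import math

def exclude_cluster(center, dist_max, query, r):
    """Rule 1 (equation 4.2): the cluster can be skipped when the query
    sphere of radius r lies entirely outside the cluster boundary."""
    return math.dist(center, query) - r > dist_max
```

For example, a cluster of radius 2 centered at the origin is excluded for a query at distance 5 with search radius 1, but not once the sphere reaches the boundary.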
Figure 4.6: Cluster search rule 2 (Cluster search region rule)
The above figure (Figure 4.6) shows Cluster search rule 2 (Cluster search region
rule). This rule describes the valid search region for a query point in a cluster. It
ensures valid search computations in a cluster and avoids unnecessary iterations in the
invalid region.
Rule 2: Cluster search region rule
When a cluster is searched for nearest neighbor points, the effective search range is
distmin = max(distMini, distance(Ci, q) - ri)
distmax = min(distMaxi, distance(Ci, q) + ri)
Then, the effective search region is within [distmin, distmax] (4.3)
A carefully selected search region can further reduce the cost of nearest neighbor search.
Moreover, a range query can be performed using this search range within an affected cluster.
Most importantly, search termination rules can be set up based on this search range
while scanning the leaf nodes of the B-tree index structure for the nearest neighbor, which
speeds up data retrieval from the B-tree.
Let the distance between the cluster center Ci and the query point q be denoted by
distance(Ci, q), and let the query point q have search radius ri. The distances from the
cluster center Ci to the furthest and closest data points in cluster Mi are denoted by
distMaxi and distMini, respectively. From these quantities we can deduce the effective search
region of a cluster, because no data point lies beyond this search region.
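Rule 2's clipped range can be written down directly. The Python sketch below is illustrative only (the thesis implementation is in Java); it clips the query interval [distance(Ci, q) - ri, distance(Ci, q) + ri] to the band of distances actually occupied by cluster points.

```python
def effective_search_range(dist_center_q, r, dist_min, dist_max):
    """Rule 2 (equation 4.3): clip the query's distance interval to the
    annulus [distMini, distMaxi] occupied by the cluster's points."""
    lo = max(dist_min, dist_center_q - r)
    hi = min(dist_max, dist_center_q + r)
    return lo, hi
```

Because leaf nodes are sorted by distance from the cluster center, this [distmin, distmax] interval maps directly to a contiguous key range in the B-tree.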
Figure 4.7: Cluster search rule 3 (Cluster contains query sphere)
Figure 4.7 above illustrates cluster search rule 3 (cluster contains query sphere). This
rule applies when the query point q and its search region, based on query radius r1, lie
completely inside the cluster M1. Thus, cluster M1 contains q's query sphere.
Rule 3: Cluster contains query sphere rule
The query sphere with radius ri is completely contained in the affected cluster Mi if the
following condition is true.
distance(Ci, q) + ri ≤ distMaxi (4.4)
Knowing that the query point q and its query search sphere are completely contained
in the partition (cluster) is an important piece of information, because it can be
used to restrict the nearest neighbor search to that cluster and in turn reduce
search-related computation cost.
Let distance(Ci, q) be the distance between the cluster center Ci and the query
point q, and let distMaxi be the radius of the cluster Mi. Given distance(Ci, q) ≤ distMaxi, it
follows that if distance(Ci, q) + ri ≤ distMaxi then the cluster Mi
completely contains the query sphere (see figure 4.7).
Figure 4.8: Cluster search rule 4 (Cluster intersects query sphere)
Figure 4.8 above shows cluster search rule 4 (cluster intersects query sphere). This
rule covers the case where the query sphere of q only intersects the cluster M1; some of the
k nearest neighbors may therefore lie outside the cluster M1.
Rule 4: Cluster intersects query sphere rule
The query sphere with radius ri intersects the affected cluster Mi if the following condition
is true.
distance(Ci, q) - ri ≤ distMaxi (4.5)
Similar to the above section, it is important to know if a cluster is intersecting with
the search sphere of the query point q. In this case, the nearest neighbor point may be in the
cluster in question. It is also possible that the nearest neighbor is located in another cluster.
Thus, the iterative process of searching may continue.
Let distance(Ci, q) be the distance between the cluster center Ci and the query
point q, and let distMaxi be the radius of the cluster Mi. Assuming distance(Ci, q) > distMaxi,
so that the query point lies outside the affected cluster, it follows that if
distance(Ci, q) - ri ≤ distMaxi then the cluster Mi partially intersects the
query sphere (see figure 4.8).
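Rules 3 and 4 differ only in the sign applied to the radius. A minimal Python sketch of both predicates (illustrative only; the thesis implementation is in Java):

```python
import math

def contains_query_sphere(center, dist_max, query, r):
    """Rule 3 (equation 4.4): the cluster fully contains the query sphere."""
    return math.dist(center, query) + r <= dist_max

def intersects_query_sphere(center, dist_max, query, r):
    """Rule 4 (equation 4.5): the cluster at least overlaps the query sphere."""
    return math.dist(center, query) - r <= dist_max
```

Note that containment implies intersection: whenever Rule 3 holds, Rule 4 holds as well, so Rule 4 is the weaker test used to decide whether a cluster must be visited at all.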
CHAPTER 5
EXPERIMENTS & RESULTS
In this section, we detail experimental setups, describe our experiments, and present the
results of those experiments. The main objective of our experiments is to evaluate the
performance of our ckSearch system. The indexing strategies of ckSearch are tested on
different data sets varying data set size, dimensions, and data distribution.
We use the KD-tree algorithm as a benchmark for comparison. The KD-tree
algorithm is an effective and commonly used KNN method based on a multi-dimensional
indexing structure. Moreover, the KD-tree algorithm is especially appealing for comparison
as it is similar to our ckSearch multi-dimensional indexing tree structure. The focus of our
research is to speed up the learned data classification process, and is especially applicable for
an existing autonomous robot that currently uses a KD-tree based KNN system. According to
Arya and Silverman [2], linear search serves as an effective KNN search technique. So, for
completeness we also compare the performance of ckSearch with the linear search KNN
technique.
5.1 Setup information
The ckSearch search algorithm and related K-means clustering technique were implemented
in Java. A tree-based indexing structure was used as the primary data structure along with a
two dimensional array to store cluster information. The linear search and the KD-tree
implementations are Java based implementations. The KD-tree code was obtained from our
colleagues at Georgia Institute of Technology [37]. Experiments were performed on a 1.5-
GHz PC with 512 megabytes main memory, running Microsoft Windows XP version 2002
SP3.
For our training set we used training data generated by an autonomous robot guided
by a human. We also created synthetic test data sets ranging from 10,000 to 100,000 records
with various dimensions (such as: 9, 18, 36, 50, and 60). For each query, a d-dimensional
point is used. One hundred query trials were used for each experiment, and we averaged
the total performance time over the trials to even out I/O costs.
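The averaging methodology can be sketched as follows. The thesis implementation is in Java and does not name its timer; this Python sketch uses `time.perf_counter` as an assumed stand-in.

```python
import time

def avg_query_time_ms(search_fn, queries):
    """Average wall-clock time per query over a batch of trials,
    smoothing out per-query I/O jitter."""
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(queries)
```

Running many trials and dividing the total time keeps one unusually slow (or cached) query from dominating a reported number.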
5.2 The effect of the size of the data set
The size of the data set can play a significant role in the performance of a KNN algorithm
where searches are O(n) with respect to the number of items stored for the query. In order to
evaluate the performance of our system with this criterion, we conducted a series of
experiments: We used a 60-dimensional data set with k set to 1, 3, and 10. During these
experiments we gradually increased the number of data points in the data set. We started with
5000 data points and increased it to 10000, 20000, 40000, 50000, 60000, 80000, and 100000.
With each data set we also recorded the performance time of the ckSearch and compared it
with the performance time of the linear search implementation. The results are tabulated in
table 5.1.
The following table shows the effect of the size of the data set on the ckSearch and the
linear search implementations of the KNN algorithm. The size of the data set was gradually
increased and the performance times were recorded.
Data Size Dimension Linear search (ms) ckSearch (ms) k
5000 60 44.010621 3.574 1
10000 60 79.029039 5.663 1
20000 60 159.794052 15.7322 1
40000 60 219.094885 12.4334 1
50000 60 264.959094 16.3343 1
80000 60 264.959094 26.3441 1
100000 60 721.357044 34.59425 1
Table 5.1: The effect of data size on performance (k=1)
Figure 5.1: Performance vs. data set size chart (k = 1)
Data Size Dimension Linear search (ms) ckSearch (ms) k
5000 60 73.031907 13.91388 3
10000 60 98.910615 29.22268 3
20000 60 174.237228 40.72744 3
40000 60 278.986854 87.66623 3
50000 60 325.127635 68.59792 3
80000 60 520.718415 155.5077 3
100000 60 873.28166 122.8424 3
Table 5.2: Effect of data size on performance (k=3)
Figure 5.2: Performance vs. data set size chart (k = 3)
Data Size Dimension Linear search (ms) ckSearch (ms) k
5000 60 73.031907 13.91388 10
10000 60 98.910615 29.22268 10
20000 60 174.237228 40.72744 10
40000 60 278.986854 87.66623 10
50000 60 325.127635 68.59792 10
80000 60 520.718415 155.5077 10
100000 60 873.28166 122.8424 10
Table 5.3: Effect of data size on performance (k=10)
Figure 5.3: Performance vs. data set size chart (k = 10)
Data Size Dimension k = 1 k = 3 k = 10
5000 60 12.32791 5.248852694 3.226432
10000 60 13.96273 3.384720805 2.27282
20000 60 10.17797 4.278128799 2.999871
40000 60 17.66894 3.182375559 3.516738
50000 60 16.55994 4.739613768 2.510961
80000 60 10.19073 3.348506537 4.066482
100000 60 20.85194 7.10895797 3.356904
Table 5.4: ckSearch speedup over linear search
Figure 5.4: Chart showing ckSearch speedup over the linear search
The tables 5.1, 5.2, and 5.3 show the results of the “effect of the size of the data set”
experiment. In this experiment we evaluated the performance of the ckSearch algorithm
against an implementation of the linear search KNN algorithm. The results clearly show that
the ckSearch performed far better than the linear search. The speedup chart (figure 5.4)
verifies that ckSearch achieves and maintains a steady speedup over the linear search
method for several values of k. ckSearch also copes much better than linear search as
larger data sets increase the number of required computations.
5.3 The effect of data dimension on the performance
The number of dimensions of a data set can influence the performance of a KNN algorithm.
This happens due to increased complexity of Euclidean distance computations associated
with high-dimensional data. An autonomous robot's navigational data set can be
high-dimensional: an autonomous robot system may use high-dimensional sensor arrays or
high-dimensional image processing for navigation. Thus, we focused on evaluating the
ckSearch system performance on high-dimensional data.
In this experiment, we used large data sets with 50000 and 100000 records. The k value
was set to 1 for the first experiment and the data set size was set to 50000. For the second
experiment, the k value was set to 3 and the data set with 100000 records was used. For each
of these experiments, we used a 9-dimensional data set to begin with. Then, the number of
dimensions was gradually increased to 18, 36, 60, and 75. The performance time of each
experiment was recorded. In order to perform comparisons, the same experiments were
performed with the KD-tree implementation and the linear search technique. We tabulated the
experiment results in the following tables.
The following table shows the effect of the dimension of the data set on the ckSearch and
KD-tree implementations of the KNN algorithm. The dimension of the data set was gradually
increased and the performance times were recorded.
Dimension Data Set Size KD-tree (ms) Linear Search (ms) ckSearch (ms)
9 50000 125.3342 268.728288 22.26529
18 50000 112.7768 291.717802 19.05041
36 50000 140.987 324.284511 18.37775
60 50000 157.7129 345.627796 30.71787
75 50000 222.6564 561.185849 27.38437
Table 5.5: The effect of data dimension on performance (N=50K)
Figure 5.5: Data dimension vs. performance chart (N = 50K, k = 1)
Dimension Data Set Size k Linear Search (ms) ckSearch (ms)
9 100000 3 419.812183 35.07446
18 100000 3 451.023915 92.92075
36 100000 3 477.775919 158.4776
60 100000 3 571.532918 122.8424
75 100000 3 660.196533 140.9969
Table 5.6: The effect of data dimension on performance (N=100K)
Figure 5.6: Data dimension vs. performance chart (N = 100K, k = 3)
Dimension Data Set Size Linear Search (ms) ckSearch (ms) Speedup
9 100000 419.812183 35.07446 11.96917
18 100000 451.023915 92.92075 4.853856
36 100000 477.775919 158.4776 3.014785
60 100000 571.532918 122.8424 4.652569
75 100000 660.196533 140.9969 4.682349
Table 5.7: ckSearch speedup over linear search for various dimensions
In this experiment we compared the ckSearch performance with the KD-tree and linear
search implementations. The experiment results clearly show that the ckSearch system
performed better than both the KD-tree and the linear search method. Moreover, the
ckSearch achieved considerable speedup over linear search (see table 5.7). The results also
show that as the number of dimensions increases, the KD-tree and linear search
performance gradually degrades; with the larger data set (100000 records) in particular,
the linear search performance degrades at a higher rate. ckSearch, on the other hand, shows
robustness to increasing dimension: its performance time grows at a much slower rate than
that of the KD-tree and linear search systems (see figure 5.6).
5.4 The effect of search radius on the performance
The search radius is an important factor for the ckSearch system. The ckSearch system uses
incremental radius based search. Typically, a small search sphere is used and enlarged when
the search condition cannot be met. Our proposed ckSearch system relies on the search
sphere to minimize repeated costly distance calculations to optimize performance. Thus, it is
important to study the effect of the search radius on performance of the ckSearch system.
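The incremental radius strategy described above can be sketched as follows. The thesis implementation is in Java and searches a B-tree; this Python sketch substitutes a brute-force range scan as a stand-in for that range search, and its default radius parameters are illustrative.

```python
import math

def radius_knn(data, q, k, r0=1.0, r_inc=1.0, r_max=10.0):
    """Incremental-radius search sketch: grow the query sphere until it
    holds at least k points or the radius cap r_max is reached."""
    r = r0
    hits = []
    while r <= r_max:
        # Brute-force stand-in for the B-tree range search over the sphere.
        hits = sorted((p for p in data if math.dist(p, q) <= r),
                      key=lambda p: math.dist(p, q))
        if len(hits) >= k:
            return hits[:k]
        r += r_inc
    return hits  # best effort if fewer than k points lie within r_max
```

A small starting radius keeps the searched region (and thus the distance computations) small, at the risk of extra enlargement iterations when the sphere comes up empty.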
In this experiment, we used a data set with 10000 records. We used a k value of 3 for
the first part of the experiment and a k value of 10 for the second part of the experiment. The
radius value was gradually increased from 1.0 meters to 10.0 meters. The performance time
of each experiment was recorded.
The following table shows the effect of search radius on ckSearch query performance.
Even though the KD-tree and the linear search methods do not use a search radius, we have
listed the KD-tree and the linear search performance results to compare with the ckSearch
system.
Radius (m) Data Set Size KD-tree (ms) Linear Search (ms) ckSearch (ms)
1.0 10000 293.99938 109.38541 5.122073
2.0 10000 293.99938 109.38541 13.31026
3.0 10000 293.99938 109.38541 19.89829
4.0 10000 293.99938 109.38541 26.42299
5.0 10000 293.99938 109.38541 34.06723
6.0 10000 293.99938 109.38541 39.26128
7.0 10000 293.99938 109.38541 42.53429
8.0 10000 293.99938 109.38541 46.6953
9.0 10000 293.99938 109.38541 49.8422
10.0 10000 293.99938 109.38541 51.81746
Table 5.8: The effect of search radius on performance (k = 3)
Figure 5.7: Search radius vs. performance chart for 10,000 data records (k = 3)
Radius (m) Data Set Size Linear Search (ms) ckSearch (ms)
1.0 10000 169.24721 11.324462
2.0 10000 169.24721 27.077553
3.0 10000 169.24721 43.68484
4.0 10000 169.24721 63.019417
5.0 10000 169.24721 76.492185
6.0 10000 169.24721 85.420351
7.0 10000 169.24721 94.545402
8.0 10000 169.24721 102.19342
9.0 10000 169.24721 112.9639
10.0 10000 169.24721 114.11205
Table 5.9: The effect of search radius on performance (k = 10)
Figure 5.8: Search radius vs. performance chart for 10,000 data records (k = 10)
Considering the experimental results listed in the tables 5.8 and 5.9, the search radius
has a significant impact on ckSearch performance. We observe a sharp increase in
performance time as the search radius increases. We believe this is due to an increase in the
number of redundant distance computations. As will be shown in the accuracy experiments,
the ckSearch algorithm finds the results well before reaching the maximum radius of 10.0
meters used in the above experiments.
5.5 The effect of search radius on accuracy
This experiment is similar to the above experiment regarding the effect of search radius on
performance time. In this experiment, we evaluate the effect of search radius on data
accuracy. Typically, a small search radius is used to start; this radius is enlarged when
the search condition cannot be met. In this experiment we started with the
radius at 1.0 meters and went up to 10.0 meters. Since the ckSearch system relies on the search
sphere to minimize repeated costly distance calculations to optimize performance, it is
important to study the effect of the search radius on accuracy of the ckSearch system.
In this experiment, for each of the radius values used during the query, we recorded the
number of correct nearest neighbors found by the ckSearch algorithm. The results from this
experiment are shown in the table below.
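The accuracy figure reported below can be computed as the fraction of the true k nearest neighbors that the radius-limited search recovers. A minimal Python sketch of that metric (illustrative only; the thesis does not give this code):

```python
def knn_accuracy(approx, exact):
    """Percentage of the true nearest neighbors `exact` that appear
    in the radius-limited result set `approx`."""
    exact_set = {tuple(p) for p in exact}
    found = sum(1 for p in approx if tuple(p) in exact_set)
    return 100.0 * found / len(exact)
```

With a too-small radius the sphere misses true neighbors and the percentage drops; once the sphere covers all k true neighbors the metric reaches 100%.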
Radius (m) Data Set Size k = 3 k = 10
1.0 10000 0.0 0.0
2.0 10000 85.423 76.667
3.0 10000 97.355 97.702
4.0 10000 99.856 99.113
5.0 10000 100.0 99.822
6.0 10000 100.0 100.0
7.0 10000 100.0 100.0
8.0 10000 100.0 100.0
9.0 10000 100.0 100.0
10.0 10000 100.0 100.0
Table 5.10: The effect of the search radius on query accuracy
Figure 5.9: Search radius vs. query accuracy chart (N = 10000, d = 60)
The experimental results in the table show that the accuracy of the query search gets
better as the search radius increases from 1.0 meters to 10.0 meters. The larger search radius
allows the ckSearch algorithm to assess more of the nearest neighbors. Thus, the accuracy of
the search increases as the search radius increases. It is also important to notice that the
ckSearch algorithm achieves 100% accuracy well before the maximum radius value of 10.0
meters. This indicates that selecting a proper radius is important for the performance of the
ckSearch system.
5.6 The effect of the number of clusters
The number of clusters can affect the performance of a cluster based algorithm. Even though
clustering for ckSearch is part of the pre-processing stage and does not directly affect
performance time, it can indirectly influence the ckSearch algorithm time. In order to find out
the effect of the number of clusters, we performed several experiments investigating how
the number of clusters affects the ckSearch system. As the number of clusters increases, it
is plausible that computational complexity, and in turn computation time, could increase.
The number of clusters was gradually increased and the subsequent performance was
recorded in this experiment. We used 5, 10, 20, 30, and 50 clusters and the size of the data set
was 50,000 records. We varied the number of nearest neighbor values k (1 and 5) and
conducted two separate experiments. The results of the experiments are tabulated below.
Cluster k KD-tree (ms) Linear Search (ms) ckSearch (ms)
5 1 266.2874 246.287447 29.59359
10 1 266.2874 246.287447 27.03791
20 1 266.2874 246.287447 29.7673
30 1 266.2874 246.287447 17.83305
50 1 266.2874 246.287447 21.50379
Table 5.11: The effect of the number of clusters on performance (k=1)
Figure 5.10: The number of clusters vs. performance chart for 50000 data records (k = 1)
Cluster k Linear Search (ms) ckSearch (ms)
5 5 363.641468 116.6476
10 5 363.641468 121.0026
20 5 363.641468 104.7671
30 5 363.641468 86.15084
50 5 363.641468 114.5369
Table 5.12: The effect of the number of clusters on performance (k=5)
Figure 5.11: The number of clusters vs. performance chart (k = 5)
Tables 5.11 and 5.12 illustrate the results of our experiments with the number of
clusters. Our initial hypothesis was that as the number of clusters increases, performance
will decrease because more clusters will take longer to search. Interestingly, according to
our results, ckSearch performance times remain nearly the same or increase only very
slightly. We hypothesize that this is because, although the data records are spread over
more clusters, most of the added cluster searches are eliminated by the "cluster search
rules", which prevent the ckSearch system from unnecessary searching.
CHAPTER 6
CONCLUSION
In this thesis, we introduced a new algorithm for K-nearest neighbor queries that uses
clustering and caching to improve performance. The main idea is to reduce the distance
computation cost between the query point and the data points in the data set. We used a
divide-and-conquer approach. First, we divide the training data into clusters based on
similarity between the data points in terms of Euclidean distance. Next we use linearization
for faster lookup. The data points in a cluster can be sorted based on their similarity
(measured by Euclidean distance) to the center of the cluster. Fast search data structures such
as the B-tree can be utilized to store data points based on their distance from the cluster
center and perform fast data search. The B-tree algorithm is good for range search as well.
We achieve a further performance boost by using B-tree based data caching. In this work we
provided details of the algorithm, an implementation, and experimental results in a robot
navigation task.
We conducted extensive experiments on the performance and the accuracy of the
ckSearch algorithm. In order to confirm performance improvement of KNN queries, we
performed experiments on the ckSearch system with large and small data sets. Several of our
experiments focused on the performance of the ckSearch algorithm with high-dimensional data
sets, since many KNN search algorithms degrade in performance on high-dimensional
data. The results show that our algorithm is both effective and efficient. In fact,
the ckSearch algorithm achieves performance improvement over both the KD-tree and the
linear scan KNN algorithms.
In the future we will further improve our system by adding an analysis to select the
best possible initial search radius for the ckSearch algorithm. It is conceivable that
selecting too small a search radius results in many unnecessary iterations. We want to
remedy this weakness of the system by adding a search radius selection analysis.
REFERENCES
[1] M. Procopio, T. Strohmann, A. Bates, G. Grudic, J. Mulligan. Using Binary
Classifiers to Augment Stereo Vision for Enhanced Autonomous Robot
Navigation. April 2007.
[2] Arya, S., D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An
Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed
Dimensions. Journal of the ACM, vol. 45, no. 6, pp. 891-923
[3] V. Ramasubramanian, Kuldip K. Paliwal. Fast nearest-neighbor search
algorithms based on approximation-elimination search. January 1999.
[4] J. Chua, P. Tischer. A Framework for the Construction of Fast Nearest
Neighbour Search Algorithms. Monash University, Australia.
[5] J. Chua, P. Tischer. Minimal Cost Spanning Trees for Nearest-Neighbour
Matching. Monash University, Australia.
[6] V. Athitsos, M. Potamias, P. Papapetrou, G. Kollios. Nearest Neighbor
Retrieval Using Distance-Based Hashing. In Proc. IEEE International
Conference on Data Engineering (ICDE), April 2008.
[7] Y. Hsueh, R. Zimmermann, M. Yang. Approximate Continuous K Nearest
Neighbor Queries for Continuous Moving Objects with Pre-Defined Paths.
Department of Computer Science, University of Southern California.
[8] W. Shang, H. Huang, H. Zhu, Y. Lin, Z. Wang, Y. Qu. An Improved kNN –
Fuzzy kNN Algorithm. School of Computer and Information Technology,
Beijing Jiaotong University, China.
[9] A. Jain, M. Murty, P. Flynn. Data Clustering: A Review. Michigan State
University, U.S.A.
[10] A. Duch, V. Castro, C. Martinez. Randomized K-Dimensional Binary Search
Trees. September, 1998.
[11] Z. Aghbari, A. Makinouchi. Linearization Approach for Efficient KNN Search
of High-Dimensional Data. University of Sharjah, Sharjah, UAE.
[12] R. Weber, H. Schek, S. Blott. A Quantitative Analysis and Performance Study for
Similarity-Search Methods in High-Dimensional Spaces. ETH Zentrum, Zurich.
[13] A. Thomasian, L. Zhang. The Stepwise Dimensionality Increasing (SDI) Index
for High-Dimensional Data. May, 2006.
[14] B. Zheng, W. Lee, D. Lee. Search K Nearest Neighbors on Air. Hong Kong
University of Science and Technology, Clear Water Bay, Hong Kong.
[15] H. Zhang, A. Berg, M. Maire, J. Malik. SVM-KNN: Discriminative Nearest
Neighbor Classification for Visual Category Recognition. University of
California, Berkeley, California.
[16] C. Yu, B. Ooi, K. Tan, H. Jagadish. Indexing the Distance: An Efficient Method
to KNN Processing. Proc. Of the 27th VLDB Conference, Roma, Italy, 2001
[17] A. Nuchter, K. Lingemann, J. Hertzberg. 6D SLAM with Cached kd-tree Search.
University of Osnabruck, Osnabruck, Germany.
[18] G. Neto, H. Costelha, P. Lima. Topological Navigation in Configuration Space
Applied to Soccer Robots. Instituto Superior Tecnico, Portugal.
[19] C. Yu, S. Wang. Efficient Index based KNN join processing for high
dimensional data. Information and Software Technology. May 2006.
[20] G. DeSouza, A. Kak. Vision for Mobile Robot Navigation: A Survey. IEEE
Transactions on pattern analysis and machine intelligence, vol. 24, no. 2,
February, 2002.
[21] E. Plaku, L. Kavraki. Distributed Computation of the knn Graph for Large
High-Dimensional Point Sets. Journal of Parallel and Distributed Computing,
2007, vol. 67(3), pp. 346-359.
[22] J. L. Bentley. Multidimensional Binary Search Trees in Database Applications.
IEEE Trans. on Software Engineering, SE-5(4):333-340, July 1979.
[23] N. Ripperda, C. Brenner. Marker-Free Registration of Terrestrial Laser Scans
Using the Normal Distribution Transform. University of Hannover, Germany.
[24] C. Atkeson, S. Schaal. Memory-Based Neural Networks For Robot Learning.
GIT, Atlanta, Georgia.
[25] J. Nievergelt, H. Hinterberger, K. Sevcik. The Grid File: An Adaptable,
Symmetric Multikey File Structure. ACM Trans. on Database Systems, 9(1):38-71, 1984.
[26] A. Gionis, P. Indyk, R. Motwani. Similarity search in high dimensions via
Hashing. In International Conference on Very Large Databases (VLDB), 1999
pp. 518-529.
[27] V. Athitsos, M. Potamias, P. Papapetrou, G. Kollios. Nearest Neighbor Retrieval
Using Distance-Based Hashing.
[28] A. Andoni, P. Indyk. Efficient algorithms for substring nearest neighbor
Problem. In ACM-SIAM Symposium on Discrete Algorithms (SODA). 2006,
pp. 1203 – 1212.
[29] T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: A new data clustering
algorithm and its applications. Data Mining and Knowledge Discovery.
[30] G. Grizaite, R. Oberperfler. DBSCAN Clustering Algorithm. January 31, 2005.
[31] T. Bingmann. "STX B+ Trees Template Classes: Speed Test Results." 2008.
Idlebox. Accessed 4 April, 2009.
<http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html>
[32] D. Bentivegna. Learning from Observation Using Primitives. Doctoral Dissertation,
Georgia Institute of Technology, 2004.
[33] L. Xiong, S. Chitti. Mining multiple private databases using a kNN classifier.
In Proceedings of the 2007 ACM symposium on Applied computing. 2007,
pp. 435 - 440.
[34] H. Franco-Lopez, A. Ek, M. Bauer. Estimation and mapping of forest stand density,
volume, and cover type using the k-nearest neighbors method. Remote Sensing of
Environment, Vol. 77, No. 3, 2001, pp. 251-274.
[35] H. Maarse, P. Slump, A.Tas, J. Schaefer. Classification of wines according to type
Journal Zeitschrift für Lebensmitteluntersuchung und -Forschung A. Vol .184,
No. 3, March, 1987, pp. 198-203.
[36] A. Sohail, P. Bhattacharya. Classification of Facial Expressions Using
K-Nearest Neighbor Classifier. Computer Vision/Computer Graphics Collaboration
Techniques. Vol .4418, June, 2007, pp. 555-566.
[37] S. Arya, D. Mount. "ANN: A Library for Approximate Nearest Neighbor
Searching." August 4, 2006. ANN. Accessed 14 April, 2009.
<http://www.cs.umd.edu/~mount/ANN/>
APPENDIX A
NOTATION TABLE
Notation
Table A.1 lists a variety of symbols, functions, and parameters used in this thesis. The
following terms and notations are used throughout, especially in the pseudo-code section
of the algorithm.
d Number of dimensions
N Number of data points
D ∈ Ω Data set
Ω = [0,1]^d Data space
R Result set containing the k nearest neighbors
Ci Cluster center reference point
r Radius of a search sphere
rincrement Radius increment value
rmax Maximum radius value for the STOP criterion
pi A data point p in the ith cluster
distMaxi Maximum radius of a partition Mi
distMini Distance between Ci and the closest point to Ci
pmax The furthest data point from q in the KNN result set R
FurthestPoint(R, q) Furthest point from query point q in set R
SearchRadius(q) Search radius of query point q
SearchSphere(q, r) Sphere with query point q at its center and radius r
distNearestq Nearest distance to query point q
distance(pi, Ci) Distance between point pi and cluster center Ci
keyi B-tree index key of nodes and data entries in a leaf node
datai Data entries in a leaf node of a B-tree
distCenter Distance from query point q to cluster center Ci
GetNearest(q) Nearest neighbor to query point q
Table A.1: List of various notations used in this thesis
APPENDIX B
IMPLEMENTATION PSEUDOCODE
ckSearch_KNN(q):
    initialize();
    loadBTree();
    r = initial radius value;
    rincrement = increment value;
    R = empty;

    if (IsCacheHit(q) == true):
        while (r < rmax):
            if (distance(pmax, q) < r and R.Size() == k):
                STOP;
                return;
            r = r + rincrement;
            SearchCache(q);

    else if (IsCacheHit(q) == false):
        while (r < rmax):
            if (distance(pmax, q) < r and R.Size() == k):
                STOP;
                return;
            r = r + rincrement;
            SearchClusters(q);
        UpdateCache();

End ckSearch_KNN;
Figure B.1: ckSearch KNN algorithm
The figure above shows the ckSearch KNN query algorithm pseudocode. This is one of
the several methods utilized to implement ckSearch algorithm.
SearchClusters(q):
    for i = 0 to (M - 1):
        distCenter = distance(Ci, q);

        if (exclude(i, q) == true):          // Rule 1: cluster exclusion
            SKIP CLUSTERi;

        else if (intersects(i, q) == true):  // Rule 4: cluster intersects query sphere
            keyquery = i * µ + distCenter;
            leafNodei = getQueryLeaf(btree, keyquery);
            keyleft = i * µ + (distCenter - r);
            SearchLeftNodes(leafNodei, keyleft);

        else if (contains(i, q) == true):    // Rule 3: cluster contains query sphere
            keyquery = i * µ + distCenter;
            leafNodei = getQueryLeaf(btree, keyquery);
            keyleft = i * µ + (distCenter - r);
            SearchLeftNodes(leafNodei, keyleft);
            keyright = i * µ + (distCenter + r);
            SearchRightNodes(leafNodei, keyright);
    // end of for loop
END;
Figure B.2: The SearchClusters(q) pseudocode
Figure B.2 above shows the pseudocode of the main cluster search algorithm.
SearchClusters(q) is part of our proposed ckSearch KNN search algorithm.
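The core of SearchClusters is the linearized key keyi = i * µ + distance(pi, Ci), which lets a one-dimensional index answer each cluster's range query. Below is a minimal Python sketch using a sorted list with bisect as a stand-in for the B-tree; the function names are illustrative, and µ is assumed to exceed every cluster radius so the key ranges of different clusters never overlap.

```python
import bisect
import math

def build_index(clusters, mu):
    """Stand-in for the B-tree: one sorted list of (key, point) pairs,
    where key = i * mu + distance(p, Ci) for the ith cluster center Ci."""
    entries = []
    for i, (center, pts) in enumerate(clusters):
        for p in pts:
            entries.append((i * mu + math.dist(p, center), p))
    entries.sort(key=lambda e: e[0])
    return entries

def range_scan(entries, i, mu, dist_center, r):
    """Emulates the SearchLeftNodes/SearchRightNodes range: return the
    points of cluster i whose key lies in
    [i*mu + distCenter - r, i*mu + distCenter + r]."""
    keys = [k for k, _ in entries]
    lo = bisect.bisect_left(keys, i * mu + dist_center - r)
    hi = bisect.bisect_right(keys, i * mu + dist_center + r)
    return [p for _, p in entries[lo:hi]]
```

Because the keys of one cluster occupy a disjoint interval, a single range scan never leaks points from a neighboring cluster.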
SearchCache(q):
    for each cached cluster i:               // searching all cached clusters
        distCenter = distance(Ci, q);

        if (exclude(i, q) == true):          // Cluster Rule #1
            SKIP CLUSTERi;

        else if (intersects(i, q) == true):  // Cluster Rule #2
            keyquery = i * µ + distCenter;
            leafNodei = getQueryLeaf(btree, keyquery);
            keyleft = i * µ + (distCenter – r);
            SearchLeftNodes(leafNodei, keyleft);

        else if (contains(i, q) == true):    // Cluster Rule #3
            keyquery = i * µ + distCenter;
            leafNodei = getQueryLeaf(btree, keyquery);
            keyleft = i * µ + (distCenter – r);
            SearchLeftNodes(leafNodei, keyleft);
            keyright = i * µ + (distCenter + r);
            SearchRightNodes(leafNodei, keyright);
END;
Figure B.3: The SearchCache(q) pseudocode
Figure B.3 above shows the pseudocode of the cache search algorithm.
SearchCache(q) is part of our proposed ckSearch KNN search system.
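The pseudocode does not pin down how cached clusters are stored or evicted. One plausible realization, sketched below purely as an assumption, is an LRU map from cluster id to that cluster's already-loaded leaf entries; the class name, capacity, and eviction policy are all hypothetical, not taken from the thesis.

```python
from collections import OrderedDict

class ClusterCache:
    """Hypothetical cluster cache backing IsCacheHit/SearchCache/UpdateCache:
    an LRU map from cluster id to that cluster's (key, point) leaf entries."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # cluster id -> list of (key, point)

    def is_cache_hit(self, cluster_id):
        return cluster_id in self.entries

    def get(self, cluster_id):
        self.entries.move_to_end(cluster_id)  # mark as recently used
        return self.entries[cluster_id]

    def update(self, cluster_id, leaf_entries):  # plays the role of UpdateCache()
        self.entries[cluster_id] = leaf_entries
        self.entries.move_to_end(cluster_id)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

On a hit, SearchCache scans only the cached clusters; on a miss, SearchClusters scans all clusters and update() admits the ones just searched.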
SearchLeftNodes(leafNodei, keyleft):
    for each datai in leafNodei:             // search leafNodei for nearest neighbors
        if (R.Size() == k):
            if (distance(pmax, q) > distance(datai, q)):
                Remove pmax from R;
                Add datai to R;
        else:
            Add datai to R;
    // end of for loop

    leftLeafNode = GetLeftLeafNode(leafNodei);

    while (true):
        SearchLeafNode(leftLeafNode);        // search leftLeafNode for nearest neighbors
        keyOfMinRecord = key of the left-most entry of leftLeafNode;

        if (keyOfMinRecord < keyleft OR cluster boundary reached):
            break;                           // reached the search-sphere limit
        leftLeafNode = GetLeftLeafNode(leftLeafNode);
END;
Figure B.4: The SearchLeftNodes(leafNodei, keyleft) pseudocode
Figure B.4 above shows the SearchLeftNodes(leafNodei, keyleft) function,
which searches the leaf nodes to the left of the query leaf for nearest-neighbor
points. It is one of the central functions of the ckSearch implementation.
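The result-set update in SearchLeftNodes (remove pmax, add datai) can be made O(log k) per candidate with a max-heap keyed on distance to the query. A minimal Python sketch, assuming points are tuples of floats; the function name is illustrative.

```python
import heapq
import math

def add_candidates(result_heap, candidates, query, k):
    """Maintain the k closest points seen so far. Python's heapq is a
    min-heap, so distances are negated to keep pmax (the furthest
    retained point) at the heap root for O(log k) replacement."""
    for p in candidates:
        d = math.dist(p, query)
        if len(result_heap) < k:
            heapq.heappush(result_heap, (-d, p))
        elif d < -result_heap[0][0]:             # closer than current pmax
            heapq.heapreplace(result_heap, (-d, p))  # drop pmax, add p
    return result_heap
```

This replaces the linear "find pmax in R" step of the pseudocode with a heap lookup, which matters when k is large.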
SearchRightNodes(leafNodei, keyright):
    for each datai in leafNodei:             // search leafNodei for nearest neighbors
        if (R.Size() == k):
            if (distance(pmax, q) > distance(datai, q)):
                Remove pmax from R;
                Add datai to R;
        else:
            Add datai to R;
    // end of for loop

    rightLeafNode = GetRightLeafNode(leafNodei);

    while (true):
        SearchLeafNode(rightLeafNode);       // search rightLeafNode for KNN
        keyOfMaxRecord = key of the right-most entry of rightLeafNode;

        if (keyOfMaxRecord > keyright OR cluster boundary reached):
            break;                           // reached the search-sphere limit
        rightLeafNode = GetRightLeafNode(rightLeafNode);
END;
Figure B.5: The SearchRightNodes(leafNodei, keyright) pseudocode
Figure B.5 above shows the SearchRightNodes(leafNodei, keyright) function,
which searches the leaf nodes to the right of the query leaf for nearest-neighbor
points. It is one of the central functions of the ckSearch implementation.
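Over a sorted key array, the early-terminating walk shared by SearchLeftNodes and SearchRightNodes reduces to a linear scan that stops at the search-sphere boundary. A minimal Python sketch of the rightward case; the flat entries list and start index stand in for the B-tree leaf chain and are assumptions of this sketch.

```python
def search_right(entries, start, key_right):
    """Scan (key, point) entries to the right of the query position and
    stop as soon as a key exceeds key_right = i*mu + distCenter + r,
    the right edge of the search sphere."""
    found = []
    for key, point in entries[start:]:
        if key > key_right:   # passed the search-sphere limit
            break
        found.append(point)
    return found
```

The leftward case is symmetric: scan toward smaller keys and stop once a key drops below the left edge of the sphere.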