
A Mixed Hierarchical Algorithm for Nearest Neighbor Search

Carlo del Mundo
Virginia Tech
2202 Kraft Dr., Knowledge Works II Building
Blacksburg, VA
[email protected]

Mariam Umar
Virginia Tech
2202 Kraft Dr., Knowledge Works II Building
Blacksburg, VA
[email protected]

ABSTRACT
The k nearest neighbor (kNN) search is a computationally intensive application critical to fields such as image processing, statistics, and biology. Recent works have demonstrated the efficacy of k-d tree based implementations on multi-core CPUs. It is unclear, however, whether such tree based implementations are amenable to execution on high-density processors typified today by the graphics processing unit (GPU). This work seeks to map and optimize kNN to massively parallel architectures such as the GPU. Our approach synthesizes a clustering technique, k-means, with traditional brute force methods to prune the search space while taking advantage of data-parallel execution of kNN on the GPU. Overall, our general case GPU version outperforms a single-threaded CPU by factors as high as 108.

1. INTRODUCTION & MOTIVATION
kNN is a fundamental algorithm for classifying objects. It works by finding the nearest neighbors of one or several query points in a metric space. Figure 1 depicts an example of kNN for a 2D Euclidean metric space. In the context of this paper, the input data set is referred to as the reference points (shown as circles), and the targets are referred to as query points (shown as an X). The two closest neighbors (K = 2) of the query point are shown in green.

Computing kNN presents a prohibitive cost for large inputs and dimensionalities. This work seeks to capitalize on the rich parallel resources of GPUs to accelerate kNN calculations. Traditional techniques focus on k-d tree based data structures to achieve O(log N) searches by pruning the search space at each level of the tree. To the best of our knowledge, no works have focused on search space pruning techniques for kNN on the GPU.

We explore the use of a clustering technique, known as k-means, to perform offline groupings of like-coordinate points. Our approach takes advantage of the properties of clusters.

Figure 1: Nearest neighbor search for N = 12 and K = 2. The red X represents the query point, q, and its nearest neighbors are shown in green.

Points are clustered together based on a convergence criterion. These clusters of near-proximity points have centers, known as centroids, which are calculated as the average coordinates of the points within a cluster. We assume that the nearest neighbor of a query point, q, belongs to the cluster closest to q. This fundamental assumption prunes the search space by discarding points that do not belong to the closest cluster.
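For concreteness (a standard definition, stated here for reference), the centroid c_j of a cluster S_j is the coordinate-wise mean of its member points:

    c_j = \frac{1}{|S_j|} \sum_{x \in S_j} x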

This tree-like pruning behavior can significantly cut down on the number of data points to test while avoiding branch divergence penalties. In this work, we characterize the brute force (BF) linear kNN method for both the CPU and GPU. We then demonstrate the efficacy of our hierarchical algorithm on the GPU against the BF CPU. To that end, our contributions are as follows.

1. A characterization of the data-parallel brute force algorithm on CPU and GPU

2. The design, implementation, and characterization of a mixed hierarchical kNN algorithm that prunes the search space via clustering

The rest of the document is outlined as follows. Section 2 discusses related work, Section 3 presents our hierarchical approach, and finally, Section 4 summarizes and discusses our results.

2. RELATED WORK
2.1 K-means
Clustering is a widely studied problem, of which k-means is the canonical clustering method. Pelleg and Moore [8] discuss implementation issues for k-means, such as poor scaling and convergence to local minima, and propose several solutions to these problems.

Alsabti [1] explored a k-d tree based implementation of k-means and claims that calculating k-means with a k-d tree approach improves performance by two orders of magnitude. Bradley and Fayyad [4] show that the choice of initial points strongly affects performance, and they argue that defining a better initial point helps k-means converge to a better minimum.

Finding a better initial point improves solutions for both continuous and discrete data sets. Kanungo et al. [7] use Lloyd's algorithm for k-means. Their approach differs from the conventional one in that they construct a k-d tree over the data points rather than the query points. They claim that their implementation performs better on both synthetically generated and real data sets.

K-means has been a popular clustering algorithm for many decades, but theoretical bounds on its running time have only recently been established. Arthur and Vassilvitskii [2] note that although the method is simple and fast in practice, its theoretical complexity is not well understood: even when the initial cluster centers are chosen uniformly at random, the running time of the cluster calculation can be superpolynomial. They derive a lower bound on the running time of k-means using "reset widgets", gadgets introduced to force the computation to run much longer.

Farivar et al. [5] have implemented k-means on GPUs, structuring their implementation around the architecture of the GPU. They take both performance and efficiency into account, arguing that executing data-intensive tasks on the GPU is advantageous given power constraints. Their speedups reach factors as high as 68 on the NVIDIA 8800 Ultra GTX. A serious concern about their reported speedup is that it does not account for data transfer times between the CPU and GPU, which may become a bottleneck for larger data sets.

2.2 kNN
One prominent tree based nearest neighbor algorithm is based on the work of Arya et al. [3]. This work focuses on creating a tree data structure on the CPU to cut search and space complexity down to O(log N) and O(N), respectively. The authors discuss the implications of nearest neighbor search in higher dimensionalities (d > 20) and how to avoid common pitfalls. Their methodology involves a modified k-d tree, also known as a bounding box decomposition tree. Finally, they relax the constraints of the nearest neighbor problem in order to gain substantial speedup with respect to algorithmic complexity.

Garcia et al. propose a GPU-based implementation of the kNN algorithm [6]. They implemented a brute-force approach to the kNN problem by composing the computation as a series of matrix and sorting operations. By leveraging CUDA and CUBLAS, the authors have shown substantial speedups, by factors as high as 64 and 189, over Arya's work on multi-core CPUs.

3. APPROACH
The two common approaches to kNN are (1) the linear brute force (BF) approach and (2) the k-d tree approach. Instead, we propose a mixed hierarchical algorithm that uses a combination of BF and clustering.

In the BF algorithm, a simple Euclidean distance kernel is applied to an array of points. This approach does not require any data structure beyond an array, but it is relatively slow, on the order of O(N), where N is the number of points in the list. More efficient partitioning techniques use spatial data structures such as k-d trees, reducing the complexity of the search to O(log N) [3]. Unfortunately, k-d tree based implementations of nearest neighbor search have only been widely studied on the CPU.
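To make the data-parallel distance stage concrete, the following is a minimal sketch (not the exact kernel used in our experiments) in which each GPU thread computes the squared Euclidean distance between the query point and one reference point. The kernel name, row-major array layout, and dimensionality parameter d are illustrative assumptions.

    // Illustrative brute-force distance kernel: one thread per reference point.
    // Assumes refs is a row-major n x d array and query holds d coordinates.
    __global__ void squared_distances(const float* refs, const float* query,
                                      float* dists, int n, int d) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = 0.0f;
        for (int k = 0; k < d; ++k) {
            float diff = refs[i * d + k] - query[k];
            acc += diff * diff;          // squared distance preserves ordering
        }
        dists[i] = acc;
    }

A launch such as squared_distances<<<(n + 255) / 256, 256>>>(refs, query, dists, n, d) covers all N reference points; sorting or reducing the resulting dists array then yields the nearest neighbors.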

We propose a mixed hierarchical algorithm that first compresses the data via a clustering scheme. Since points are clustered by a distance metric, we assume that a query point's nearest neighbors lie within the same cluster as the query point. Determining that cluster is done by computing the Euclidean distance between the query point and each cluster's centroid. Finally, a brute force, data-parallel approach traverses the remaining data points within the identified cluster. This approach effectively prunes the number of reference points by a factor related to the cluster size.

Figure 2 compares and contrasts the following algorithms: (1) brute force, (2) the k-d tree, and (3) our proposed hierarchical algorithm. The brute force algorithm, shown in (a), calculates distances between the query point and every other point (O(N)). In the k-d tree method, the coordinate space is subdivided into a set of tiles. Though this has algorithmic complexity of O(log N) for search, it is unclear whether such a data structure maps well onto the GPU. Finally, our hierarchical algorithm, shown in (c), first partitions data points into clusters. Distance calculations are first done between each cluster's centroid and the query point. Once the closest cluster is determined, a brute force approach is applied to the points within that cluster.

Our implementations are based on the pseudocode outlined in Sections 3.1, 3.2, and 3.3 for the brute force kNN, k-means clustering, and our mixed algorithm, respectively.

3.1 Brute force kNN algorithm
Given a set of query points, Q, and a set of input data points, I, perform the following for each query point, qi:

1. Compute the Euclidean distance between qi and each point in I.

2. Sort the distances in ascending order. The k nearest neighbors of query point qi are the first k entries in the sorted array.
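The following is a minimal host-side sketch of these two steps; the function name, point layout, and use of std::partial_sort are illustrative choices, not part of the original pseudocode.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Sketch of the brute-force kNN in Section 3.1. Reference points are stored
    // row-major: n points of dimension d. Returns the indices of the k nearest
    // reference points to the query point.
    std::vector<size_t> bf_knn(const std::vector<float>& refs, size_t n, size_t d,
                               const std::vector<float>& query, size_t k) {
        std::vector<float> dist(n);
        for (size_t i = 0; i < n; ++i) {              // Step 1: distances to qi
            float acc = 0.0f;
            for (size_t j = 0; j < d; ++j) {
                float diff = refs[i * d + j] - query[j];
                acc += diff * diff;
            }
            dist[i] = std::sqrt(acc);
        }
        std::vector<size_t> idx(n);
        for (size_t i = 0; i < n; ++i) idx[i] = i;
        // Step 2: order by ascending distance; the first k indices are the kNN.
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                          [&](size_t a, size_t b) { return dist[a] < dist[b]; });
        idx.resize(k);
        return idx;
    }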

3.2 K-means clustering algorithm
The k-means clustering algorithm takes a parameter, C, the total number of clusters into which a set of reference data points is grouped.

[Figure 2: three panels (a), (b), and (c); both axes span 0 to 20.]

Figure 2: Approaches to kNN. In (a), the brute force nearest neighbor algorithm is shown. For a query point, q, a distance calculation is performed against every other point. In (b), the k-d tree nearest neighbor algorithm subdivides the coordinate space into equally spaced tiles. The search complexity is of order O(log N). Finally, in (c), an example of our proposed hierarchical algorithm is shown. Instead of traversing a tree structure on the GPU, clustering is performed on the CPU and the clusters are transferred to the GPU. To effectively narrow down the query point to its nearest neighbors, a distance calculation is performed from the query point to each cluster's centroid. The brute force approach is then applied to the points in the closest cluster. Our assumption is that the nearest neighbor will be contained within the closest cluster.

1. Choose C initial points as the initial clusters. The centroid of each such cluster is initialized to the coordinates of its initial point.

2. Calculate the Euclidean distance between each point and the current centroids. Assign each point to its nearest cluster.

3. Recalculate the centroid of each cluster based on all points belonging to that cluster.

4. Repeat the previous two steps until the centroids converge.
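As a concrete CPU-side sketch of these four steps for 2-D points, the loop below seeds the centroids with the first C points and then iterates assignment and recentering. The function name, 2-D specialization, and fixed iteration cap standing in for a convergence test are illustrative assumptions.

    #include <cstddef>
    #include <vector>

    // Sketch of the k-means loop in Section 3.2 for 2-D points (illustrative).
    // pts is n x 2 row-major; returns per-point cluster labels. Centroids are
    // seeded with the first C points (step 1) and refined (steps 2-4).
    std::vector<int> kmeans(const std::vector<float>& pts, size_t n, size_t C,
                            std::vector<float>& centroids, int max_iters = 100) {
        centroids.assign(pts.begin(), pts.begin() + 2 * C);   // step 1
        std::vector<int> label(n, 0);
        for (int it = 0; it < max_iters; ++it) {
            for (size_t i = 0; i < n; ++i) {                  // step 2: assign
                float best = 1e30f; int bestc = 0;
                for (size_t c = 0; c < C; ++c) {
                    float dx = pts[2*i]   - centroids[2*c];
                    float dy = pts[2*i+1] - centroids[2*c+1];
                    float d2 = dx*dx + dy*dy;
                    if (d2 < best) { best = d2; bestc = (int)c; }
                }
                label[i] = bestc;
            }
            std::vector<float> sum(2 * C, 0.0f);
            std::vector<int> cnt(C, 0);
            for (size_t i = 0; i < n; ++i) {                  // step 3: recenter
                sum[2*label[i]]   += pts[2*i];
                sum[2*label[i]+1] += pts[2*i+1];
                ++cnt[label[i]];
            }
            for (size_t c = 0; c < C; ++c)
                if (cnt[c] > 0) {
                    centroids[2*c]   = sum[2*c]   / cnt[c];
                    centroids[2*c+1] = sum[2*c+1] / cnt[c];
                }
            // step 4: a full implementation stops when centroids move less than
            // a tolerance; here we simply cap the number of iterations.
        }
        return label;
    }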

3.3 Mixed Algorithm
1. Calculate a set of C clusters using the algorithm in Section 3.2.

2. For each query point, qi:

(a) Determine the closest cluster to qi by calculating the Euclidean distance between each cluster's centroid and the query point.

(b) Apply the brute force algorithm as detailed in Section 3.1 to the closest cluster.
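A host-side sketch of this procedure for 2-D points is shown below. The helper name mixed_knn and the precomputed per-cluster membership lists are illustrative assumptions; in the GPU implementation, the per-cluster brute force stage runs as data-parallel kernels.

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Sketch of the mixed algorithm in Section 3.3 for 2-D points (illustrative).
    // members[c] lists the indices of the reference points assigned to cluster c
    // by the k-means pass of Section 3.2.
    std::vector<size_t> mixed_knn(const std::vector<float>& pts,
                                  const std::vector<float>& centroids,
                                  const std::vector<std::vector<size_t>>& members,
                                  const std::vector<float>& query, size_t k) {
        // Step 2(a): find the cluster whose centroid is closest to the query.
        size_t best = 0;
        float best_d2 = 1e30f;
        for (size_t c = 0; c < members.size(); ++c) {
            float dx = centroids[2*c]   - query[0];
            float dy = centroids[2*c+1] - query[1];
            float d2 = dx*dx + dy*dy;
            if (d2 < best_d2) { best_d2 = d2; best = c; }
        }
        // Step 2(b): brute force over points in the chosen cluster only.
        std::vector<std::pair<float, size_t>> scored;
        for (size_t idx : members[best]) {
            float dx = pts[2*idx]   - query[0];
            float dy = pts[2*idx+1] - query[1];
            scored.push_back({dx*dx + dy*dy, idx});
        }
        std::sort(scored.begin(), scored.end());       // ascending distance
        std::vector<size_t> result;
        for (size_t i = 0; i < k && i < scored.size(); ++i)
            result.push_back(scored[i].second);
        return result;
    }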

3.4 Limitations
• Boundary cases. Like the k-d tree implementation, boundary cases are an issue. Suppose a query point lies on the boundary between two clusters. In this situation, both clusters must be traversed in order to determine the nearest neighbor. The problem is further exacerbated for query points on the boundaries between N clusters. This could be alleviated by creating a bounding volume for each cluster, thereby identifying the potential clusters that contain the nearest neighbor.

• Costs of creating the clusters. The start-up cost of clustering can be prohibitive for large sample sizes. We assume that the cost of creating the clusters beforehand will be amortized by fast query searches.

• Number and size of clusters. Empirical testing must be performed to determine the optimal number of clusters and the size of the respective clusters. Too many clusters increase the overhead of determining which cluster a query point belongs to; similarly, a large cluster increases the overhead of the brute force stage.

4. RESULTS AND DISCUSSION
Here, we outline our experimental setup, results, and discussion.

4.1 Experimental Testbed

Table 1: Experimental Testbed.

OS/Kernel:  Debian Wheezy 7.0, v3.2.35-2 64-bit
Software:   CUDA 5.0, Driver v313.30
CPU:        Intel Celeron E3300 (2 cores, 2.5 GHz)
GPU:        NVIDIA Tesla C2075 (448 cores, 1.15 GHz)
Compiler:   nvcc -O3 (only optimizes CPU code)

Our experimental testbed is listed in Table 1. Throughout our experiments, we used the nvcc compiler with the -O3 flag to partially offset the cost of a slow CPU. The compiler flag improved CPU performance by a factor of five over the build without it.

For our dataset, we use the USA-Central nodes data set and vary the input size from 1 to 64 MB. We fix the number of query points (Q) and neighbor points (K) to one. Finally, we fix the number of clusters at 10.

Our implementations are broken down into three kernels: (1) distance computation, (2) sort, and (3) k-means. We run these kernels on the CPU, GPU, or both. Our experiments are as follows.

1. BF CPU. Distance Computation (on CPU), Sort (on CPU).

2. BF GPU. Distance Computation (on GPU), Sort (on GPU).

3. BF GPU + k-means. K-means (on CPU), Distance Computation (on GPU), Sort (on GPU).

Since the number of neighbor points (K) is fixed to one, a reduction operation can be substituted for the sorting operation, and we demonstrate the efficacy of reduction vs. sorting. We implemented the distance computation on the CPU and GPU and k-means on the CPU; we used std::sort and thrust::sort for the CPU and GPU sorts, respectively, and thrust::reduce for the GPU reduction operation. We do not include the execution time of k-means for our mixed algorithm. Furthermore, we assume that cluster creation is negligible.
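To illustrate why the K = 1 case admits a reduction in place of a sort (a sketch with an assumed wrapper name, not our production code): thrust::sort orders all N distances, whereas a single pass with thrust::min_element suffices to find the nearest neighbor.

    #include <thrust/device_vector.h>
    #include <thrust/extrema.h>
    #include <thrust/sort.h>

    // dists holds one distance per reference point, already computed on the GPU.
    float nearest_distance(thrust::device_vector<float>& dists) {
        // General case (any K): sort ascending and read off the first K entries.
        // thrust::sort(dists.begin(), dists.end());

        // Special case K = 1: a single reduction replaces the O(N log N) sort.
        thrust::device_vector<float>::iterator it =
            thrust::min_element(dists.begin(), dists.end());
        return *it;   // dereferencing copies the minimum back to the host
    }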

4.2 Results

Figure 3: Results for BF CPU, BF GPU, and BF GPU with k-means. (Execution time in ms versus number of reference points in MB, from 1 to 64, for the series CPU BF, GPU BF, GPU BF + KM, and GPU BF + KM + Reduction.)

Our primary results are shown in Figure 3. We note that both axes are on a logarithmic scale. In all cases, our GPU versions outperform the CPU versions, with performance improving with each successive GPU implementation. Figure 4 depicts the execution time for N = 64 MB for {GPU BF + KM} and {GPU BF + KM + Reduction}, broken down into their constituent stages: distance/sort and distance/reduction, respectively. Recall that the sorting operation in the BF algorithm can be replaced by a reduction operation when the number of neighbor points (K) is one. Substituting a reduction for the sort improves performance by a factor of 20; in addition, we note that the execution time is no longer dominated by the sorting stage, but rather by the distance stage.

4.3 Discussion
The linear growth of all experiments is expected since the BF algorithm requires O(N) search time.

[Figure 4: (a) Sort 91.5%, Distance 8.5%; (b) Reduction 34.7%, Distance 65.3%.]

Figure 4: Percentage of execution time for the distance and sort/reduction stages at N = 64 MB. In (a), the execution of {BF GPU + KM} is shown, and in (b), the execution of {BF GPU + KM + Reduction}. Sorting is the dominant component of the GPU BF algorithm with k-means, comprising 91.5% of the execution time for 64 MB. When a reduction operation is substituted for sorting, the distance component of the BF GPU becomes the dominant factor.

We note that the differences in execution time among the GPU versions can be attributed to differences in algorithmic design. The naive {GPU BF} version computes both distance and sorting for all points in the data set. The {GPU BF + KM} version computes distance and sorting for only a subset of points (those within the selected cluster). Finally, {GPU BF + KM + Reduction} is similar to {GPU BF + KM}, but performs a reduction operation in lieu of sorting.

Overall, our fastest GPU version, {GPU BF + KM + Reduction}, outperforms our CPU implementation by factors as high as 822. We note, however, that this is a special corner case of the kNN computation (where K = 1). For the general case, {GPU BF + KM} outperforms our CPU implementation by factors as high as 108.

5. REFERENCES
[1] Khaled Alsabti. An efficient k-means clustering algorithm. In Proceedings of the IPPS/SPDP Workshop on High Performance Data Mining, 1998.
[2] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In Nina Amenta and Otfried Cheong, editors, Symposium on Computational Geometry, pages 144–153. ACM, 2006.
[3] Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, and Angela Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM, 45(6):891–923, November 1998.
[4] P. S. Bradley and Usama M. Fayyad. Refining initial points for k-means clustering. Pages 91–99. Morgan Kaufmann, 1998.
[5] Reza Farivar, Daniel Rebolledo, Ellick Chan, and Roy H. Campbell. A parallel implementation of k-means clustering on GPUs. In PDPTA, pages 340–345, 2008.
[6] V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud. K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 3757–3760, 2010.
[7] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine Piatko, Ruth Silverman, and Angela Y. Wu. The analysis of a simple k-means clustering algorithm, 2000.
[8] Dan Pelleg and Andrew Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, pages 727–734. Morgan Kaufmann, 2000.

