
Journal of Machine Learning Research 7 (2006) 1135–1158 Submitted 2/05; Published 6/06

New Algorithms for Efficient High-Dimensional Nonparametric Classification

Ting Liu TINGLIU@CS.CMU.EDU
Andrew W. Moore AWM@CS.CMU.EDU
Alexander Gray AGRAY@CS.CMU.EDU

Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213, USA

Editor: Claire Cardie

Abstract

This paper is about non-approximate acceleration of high-dimensional nonparametric operations such as k-nearest-neighbor classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the data points close to the query, but merely need to answer questions about the properties of that set of data points. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. This is applicable to many algorithms in nonparametric statistics, memory-based learning and kernel-based learning. But for clarity, this paper concentrates on pure k-NN classification. We introduce new ball-tree algorithms that on real-world data sets give accelerations from 2-fold to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results include data sets with up to 10^6 dimensions and 10^5 records, and demonstrate non-trivial speed-ups while giving exact answers.

Keywords: ball-tree, k-NN classification

1. Introduction

Nonparametric models have become increasingly popular in the statistics and probabilistic AI communities. These models are extremely useful when the underlying distribution of the problem is unknown except that which can be inferred from samples. One simple, well-known nonparametric classification method is called the k-nearest-neighbors or k-NN rule. Given a data set V ⊂ R^D containing n points, it finds the k closest points to a query point q ∈ R^D, typically under the Euclidean distance, and chooses the label corresponding to the majority. Despite the simplicity of this idea, it was famously shown by Cover and Hart (Cover and Hart, 1967) that asymptotically its error is within a factor of 2 of the optimal. Its simplicity allows it to be easily and flexibly applied to a variety of complex problems. It has applications in a wide range of real-world settings, in particular pattern recognition (Duda and Hart, 1973; Draper and Smith, 1981); text categorization (Uchimura and Tomita, 1997); database and data mining (Guttman, 1984; Hastie and Tibshirani, 1996); information retrieval (Deerwester et al., 1990; Faloutsos and Oard, 1995; Salton and McGill, 1983); image and multimedia search (Faloutsos et al., 1994; Pentland et al., 1994; Flickner et al., 1995; Smeulders and Jain, 1996); machine learning (Cost and Salzberg, 1993); statistics and data analysis (Devroye and Wagner, 1982; Koivune and Kassam, 1995) and also combination with other



methods (Woods et al., 1997). However, these methods all remain hampered by their computational complexity.
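To make that cost concrete, here is a minimal brute-force k-NN classifier in Python. It is our own illustrative sketch (the function and variable names are not from the paper), and it is exactly the "naive" linear scan that the ball-tree methods discussed below aim to beat.

    import numpy as np

    def knn_classify_naive(train_X, train_y, q, k=9):
        """Brute-force k-NN: O(n * D) distance computations for every query.

        train_X : (n, D) array of training points
        train_y : (n,) array of 0/1 class labels
        q       : (D,) query point
        """
        dists = np.linalg.norm(train_X - q, axis=1)   # Euclidean distance to every training point
        nearest = np.argsort(dists)[:k]               # indices of the k closest points
        return int(2 * train_y[nearest].sum() > k)    # majority vote over the k labels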

Several effective solutions exist for this problem when the dimension D is small, including Voronoi diagrams (Preparata and Shamos, 1985), which work well for two-dimensional data. Other methods are designed to work for problems with moderate dimension (i.e., tens of dimensions), such as k-D trees (Friedman et al., 1977; Preparata and Shamos, 1985), R-trees (Guttman, 1984), and ball-trees (Fukunaga and Narendra, 1975; Omohundro, 1991; Uhlmann, 1991; Ciaccia et al., 1997). Among these tree structures, ball-trees, or metric-trees (Omohundro, 1991), represent the practical state of the art for achieving efficiency in the largest dimension possible (Moore, 2000; Clarkson, 2002) without resorting to approximate answers. They have been used in many different ways, in a variety of tree search algorithms and with a variety of "cached sufficient statistics" decorating the internal leaves, for example in Omohundro (1987); Deng and Moore (1995); Zhang et al. (1996); Pelleg and Moore (1999); Gray and Moore (2001). However, many real-world problems are posed with very large dimensions that are beyond the capability of such search structures to achieve sub-linear efficiency, for example in computer vision, in which each pixel of an image represents a dimension. Thus, the high-dimensional case is the long-standing frontier of the nearest-neighbor problem.

With one exception, the proposals involving tree-based or other data structures have considered the generic nearest-neighbor problem, not that of nearest-neighbor classification specifically. Many proposals designed specifically for nearest-neighbor classification have been made, virtually all of them pursuing the idea of reducing the number of training points. In most of these approaches, such as Hart (1968), although the runtime is reduced, so is the classification accuracy. Several similar training set reduction schemes yielding only approximate classifications have been proposed (Fisher and Patrick, 1970; Gates, 1972; Chang, 1974; Ritter et al., 1975; Sethi, 1981; Palau and Snapp, 1998). Our method achieves the exact classification that would be achieved by exhaustive search for the nearest neighbors. A few training set reduction methods have the capability of yielding exact classifications. Djouadi and Bouktache (1997) described both approximate and exact methods; however, a speedup of only about a factor of two over exhaustive search was reported for the exact case, for simulated, low-dimensional data. Lee and Chae (1998) also achieve exact classifications, but only obtained a speedup over exhaustive search of about 1.7. It is in fact common among the results reported for training set reduction methods that only 40-60% of the training points can be discarded, i.e., no important speedups are possible with this approach when the Bayes risk is not insignificant. Zhang and Srihari (2004) pursued a combination of training set reduction and a tree data structure, but theirs is an approximate method.

In this paper, we propose two new ball-tree based algorithms, which we will call KNS2 and KNS3. They are both designed for binary k-NN classification. We focus only on the binary case, since there are many binary classification problems, such as anomaly detection (Kruegel and Vigna, 2003), drug activity detection (Komarek and Moore, 2003), and video segmentation (Qi et al., 2003). Liu et al. (2004b) applied similar ideas to many-class classification and proposed a variation of the k-NN algorithm. KNS2 and KNS3 share the same insight that the task of k-nearest-neighbor classification of a query q need not require us to explicitly find those k nearest neighbors. To be more specific, there are three similar but in fact different questions: (a) "What are the k nearest


neighbors of q?" (b) "How many of the k nearest neighbors of q are from the positive class?" and (c) "Are at least t of the k nearest neighbors from the positive class?" Much research has focused on the first question (a), but uses of proximity queries in statistics far more frequently require computations of types (b) and (c). In fact, for the k-NN classification problem, when the threshold t is set, it is sufficient to answer the much simpler question (c). The triangle inequality underlying a ball-tree has the advantage of bounding the distances between data points, and can thus help us estimate the nearest neighbors without explicitly finding them. In this paper, we test our algorithms on 17 synthetic and real-world data sets, with dimensions ranging from 2 to 1.1 × 10^6 and the number of data points ranging from 10^4 to 4.9 × 10^5. We observe up to 100-fold speedup compared against highly optimized traditional ball-tree-based k-NN, in which the neighbors are found explicitly.

Omachi and Aso (2000) proposed a fast k-NN classifier based on a branch-and-bound method. Their algorithm shares some ideas with KNS2, but it did not fully explore the idea of doing k-NN classification without explicitly finding the k-nearest-neighbor set, and the speed-up the algorithm achieves is limited. In section 4, we address Omachi and Aso's method in more detail.

We will first describe ball-trees and the traditional way of using them (which we call KNS1), which computes problem (a). Then we will describe a new method (KNS2) for problem (b), designed for the common setting of skewed-class data. We will then describe a new method (KNS3) for problem (c), which removes the skewed-class assumption, applying to arbitrary classification problems. At the end of Section 5 we will say a bit about the relative value of KNS2 versus KNS3.

2. Ball-Tree

A ball-tree (Fukunaga and Narendra, 1975; Omohundro, 1991; Uhlmann, 1991; Ciaccia et al., 1997; Moore, 2000) is a binary tree where each node represents a set of points, called Points(Node). Given a data set, the root node of a ball-tree represents the full set of points in the data set. A node can be either a leaf node or a non-leaf node. A leaf node explicitly contains a list of the points represented by the node. A non-leaf node has two children nodes, Node.child1 and Node.child2, where

    Points(Node.child1) ∩ Points(Node.child2) = ∅
    Points(Node.child1) ∪ Points(Node.child2) = Points(Node)

Points are organized spatially. Each node has a distinguished point called a Pivot. Depending on the implementation, the Pivot may be one of the data points, or it may be the centroid of Points(Node). Each node records the maximum distance of the points it owns to its pivot. Call this the radius of the node:

    Node.Radius = max_{x ∈ Points(Node)} |Node.Pivot − x|

Nodes lower down the tree have smaller radius. This is achieved by insisting, at tree construction time, that

    x ∈ Points(Node.child1) ⇒ |x − Node.child1.Pivot| ≤ |x − Node.child2.Pivot|

    x ∈ Points(Node.child2) ⇒ |x − Node.child2.Pivot| ≤ |x − Node.child1.Pivot|


Provided that our distance function satisfies the triangle inequality, we can bound the distance from a query point q to any point in any ball-tree node. If x ∈ Points(Node) then we know that:

    |x − q| ≥ |q − Node.Pivot| − Node.Radius        (1)

    |x − q| ≤ |q − Node.Pivot| + Node.Radius        (2)

Here is an easy proof of the inequality. According to the triangle inequality, we have |x − q| ≥ |q − Node.Pivot| − |x − Node.Pivot|. Given x ∈ Points(Node) and that Node.Radius is the maximum distance of the points the node owns to its pivot, |x − Node.Pivot| ≤ Node.Radius, so |x − q| ≥ |q − Node.Pivot| − Node.Radius. Similarly, we can prove Equation (2). □

A ball-tree is constructed top-down. There are several ways to construct them, and practical algorithms trade off the cost of construction (which can be an inefficient O(R^2) for a data set with R points, for example) against the tightness of the radius of the balls. Moore (2000) describes a fast way of constructing a ball-tree appropriate for computational statistics. If a ball-tree is balanced, then the construction time is O(C R log R), where C is the cost of a point-point distance computation (which is O(m) if there are m dense attributes, and O(f m) if the records are sparse with only a fraction f of attributes taking non-zero values). Figure 1 shows a 2-dimensional data set and the first few levels of a ball-tree.

Figure 1: An example of a ball-tree structure. (1a) A data set; (1b) the root node A; (1c) the 2 children B and C; (1d) the 4 grandchildren D, E, F and G; (1e) the internal tree structure.
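For concreteness, the following Python sketch shows one simple way to realize the structure just described: a node with a pivot and radius, a crude top-down construction obeying the splitting rule above, and the distance bounds of Equations (1) and (2). The two-pivot splitting heuristic, the leaf size, and all names are our own simplifications, not the construction of Moore (2000).

    import numpy as np

    class BallNode:
        def __init__(self, points):
            self.points = points                              # kept at every node here for simplicity;
                                                              # a real tree stores them only at leaves
            self.pivot = points.mean(axis=0)                  # here: the centroid of Points(Node)
            self.radius = np.max(np.linalg.norm(points - self.pivot, axis=1))
            self.child1 = self.child2 = None

    def build_ball_tree(points, leaf_size=20):
        node = BallNode(points)
        if len(points) <= leaf_size:
            return node                                       # leaf: keep the explicit point list
        # Crude split: pick a point far from the centroid as pivot1, the point farthest
        # from pivot1 as pivot2, then assign each point to the closer pivot.
        p1 = points[np.argmax(np.linalg.norm(points - node.pivot, axis=1))]
        p2 = points[np.argmax(np.linalg.norm(points - p1, axis=1))]
        to_child1 = (np.linalg.norm(points - p1, axis=1) <=
                     np.linalg.norm(points - p2, axis=1))
        if to_child1.all() or (~to_child1).all():             # degenerate split (duplicate points)
            return node
        node.child1 = build_ball_tree(points[to_child1], leaf_size)
        node.child2 = build_ball_tree(points[~to_child1], leaf_size)
        return node

    def min_dist(node, q):
        """Lower bound of Equation (1): |x - q| >= |q - pivot| - radius, floored at 0."""
        return max(np.linalg.norm(q - node.pivot) - node.radius, 0.0)

    def max_dist(node, q):
        """Upper bound of Equation (2): |x - q| <= |q - pivot| + radius."""
        return np.linalg.norm(q - node.pivot) + node.radius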

3. KNS1: Conventional k Nearest Neighbor Search with Ball-Tree

In this paper, we call the conventional ball-tree-based search (Uhlmann, 1991) KNS1. Let PS be a set of data points, PS ⊆ V, where V is the training set. We begin with the following definition:


Say that PS consists of the k-NN of q in V if and only if

    ( |V| ≥ k and PS are the k-NN of q in V )  or  ( |V| < k and PS == V )        (3)

We now define a recursive procedure called BallKNN with the following inputs and output.

    PS^out = BallKNN(PS^in, Node)

Let V be the set of points searched so far, on entry. Assume that PS^in consists of the k-NN of q in V. This function efficiently ensures that on exit, PS^out consists of the k-NN of q in V ∪ Points(Node). We define

    D_sofar = { ∞                          if |PS^in| < k
              { max_{x ∈ PS^in} |x − q|    if |PS^in| == k        (4)

D_sofar is the minimum distance within which points become interesting to us.

    Let D^Node_minp = { max(|q − Node.Pivot| − Node.Radius, D^{Node.parent}_minp)   if Node ≠ Root
                      { max(|q − Node.Pivot| − Node.Radius, 0)                      if Node == Root        (5)

D^Node_minp is the minimum possible distance from any point in Node to q. This is computed using the bound given by Equation (1) and the fact that all points covered by a node must be covered by its parent. This property implies that D^Node_minp will never be less than the minimum possible distance of its ancestors. Step 2 of section 4 explains this optimization further. See Figure 2 for details.

Experimental results show that KNS1 (conventional ball-tree-based k-NN search) achieves significant speedup over Naive k-NN when the dimension d of the data set is moderate (less than 30). In the best case, the complexity of KNS1 can be as good as O(d log R), given a data set with R points. However, as d increases, the benefit achieved by KNS1 degrades, and when d is really large, in the worst case, the complexity of KNS1 can be as bad as O(dR). Sometimes it is even slower than Naive k-NN search, due to the curse of dimensionality.
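A compact Python rendering of this conventional search is sketched below; it follows the logic of the BallKNN procedure shown in Figure 2 but, for simplicity, uses the plain bound of Equation (1) rather than the parent-tightened D^Node_minp of Equation (5), and it relies on the hypothetical BallNode sketch given in Section 2. The heap bookkeeping and all names are our own.

    import heapq
    import numpy as np

    def ball_knn(node, q, k, ps=None):
        """Recursive search in the spirit of Figure 2. `ps` is a max-heap of
        (-distance, point) pairs holding the k nearest points found so far;
        D_sofar is the largest of those distances (Equation (4))."""
        if ps is None:
            ps = []                                            # PS^in = {} at the top-level call
        d_sofar = -ps[0][0] if len(ps) == k else np.inf
        d_minp = max(np.linalg.norm(q - node.pivot) - node.radius, 0.0)
        if d_minp >= d_sofar:
            return ps                                          # prune: Node cannot improve PS
        if node.child1 is None:                                # leaf: naive linear scan
            for x in node.points:
                d = np.linalg.norm(x - q)
                if d < d_sofar:
                    heapq.heappush(ps, (-d, tuple(x)))
                    if len(ps) == k + 1:
                        heapq.heappop(ps)                      # drop the furthest of the k+1
                    d_sofar = -ps[0][0] if len(ps) == k else np.inf
            return ps
        # non-leaf: search the nearer child first, then the further one
        d1 = np.linalg.norm(q - node.child1.pivot)
        d2 = np.linalg.norm(q - node.child2.pivot)
        near, far = (node.child1, node.child2) if d1 <= d2 else (node.child2, node.child1)
        ps = ball_knn(near, q, k, ps)
        return ball_knn(far, q, k, ps)

Calling ball_knn(root, q, k) on a tree returned by build_ball_tree yields a heap whose entries hold the (negated) distances of the k nearest neighbors.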

In the following sections, we describe our new algorithms KNS2 and KNS3. These two algorithms are both based on the ball-tree structure, but by using different search strategies, we explore how much speedup can be achieved beyond KNS1.

4. KNS2: Faster k-NN Classification for Skewed-Class Data

In many binary classification domains, one class is much more frequent than the other. For example, in the High Throughput Screening data sets described in section 6.2, it is far more common for the result of an experiment to be negative than positive. In detection of fraudulent telephone calls (Fawcett and Provost, 1997) or credit card transactions (Stolfo et al., 1997), legitimate transactions are far more common than fraudulent ones. In insurance risk modeling (Pednault et al., 2000), a very small percentage of the policy holders file one or more claims in a given time period. There are many other examples of domains with similar intrinsic imbalance, and therefore, classification with a skewed distribution is important. Much research has focused on designing clever methods to solve this type of problem (Cardie and Howe, 1997).


Procedure BallKNN(PS^in, Node)
begin
  if (D^Node_minp ≥ D_sofar) then             /* If this condition is satisfied, then it is impossible
    Return PS^in unchanged.                      for a point in Node to be closer than the previously
                                                 discovered kth nearest neighbor. */
  else if (Node is a leaf)                    /* If a leaf, do a naive linear scan. */
    PS^out = PS^in
    ∀ x ∈ Points(Node)
      if (|x − q| < D_sofar) then
        add x to PS^out
        if (|PS^out| == k + 1) then
          remove furthest neighbor from PS^out
          update D_sofar
  else                                        /* If a non-leaf, explore the nearer of the two child
    node1 = child of Node closest to q           nodes, then the further. It is likely that further
    node2 = child of Node furthest from q        search will immediately prune itself. */
    PS^temp = BallKNN(PS^in, node1)
    PS^out = BallKNN(PS^temp, node2)
end

Figure 2: A call of BallKNN({}, Root) returns the k nearest neighbors of q in the ball-tree.

The new algorithm introduced in this section, KNS2, is designed to accelerate k-NN based classification in such skewed-data scenarios.

KNS2 answers the type (b) question described in the introduction, namely, "How many of the k nearest neighbors are in the positive class?" The key idea of KNS2 is that we can answer question (b) without explicitly finding the k-NN set.

KNS2 attacks the problem by building two ball-trees: a Postree for the points from the positive (small) class, and a Negtree for the points from the negative (large) class. Since the number of points from the positive (small) class is so small, it is quite cheap to find the exact k nearest positive points of q by using KNS1. The idea of KNS2 is to first search Postree using KNS1 to find the set Posset_k of the k nearest positive neighbors, and then search Negtree while using Posset_k as bounds to prune nodes far away, and at the same time estimating the number of negative points to be inserted into the true nearest neighbor set. The search can be stopped as soon as we get the answer to question (b). Empirically, much more pruning can be achieved by KNS2 than by conventional ball-tree search. A concrete description of the algorithm is as follows:

Let Root_pos be the root of Postree, and Root_neg be the root of Negtree. Then, we classify a new query point q in the following fashion:


• Step 1 — "Find positive": Find the k nearest positive class neighbors of q (and their distances to q) using conventional ball-tree search.

• Step 2 — "Insert negative": Do sufficient search on the negative tree to prove that the number of positive data points among the k nearest neighbors is n for some value of n.

Step 2 is achieved using a new recursive search called NegCount. In order to describe NegCount, we need the following four definitions.

• The Dists Array. Dists is an array of elements Dists_1 ... Dists_k consisting of the distances to the k nearest positive neighbors of q found so far, sorted in increasing order of distance. For notational convenience we will also write Dists_0 = 0 and Dists_{k+1} = ∞.

• Pointset V. Define pointset V as the set of points in the negative balls visited so far in the search.

• The Counts Array (n, C) (n ≤ k+1). C is an array of counts containing n+1 array elements C_0, C_1, ..., C_n. Say (n, C) summarize interesting negative points for pointset V if and only if

  1. ∀ i = 0, 1, ..., n:   C_i = |V ∩ {x : |x − q| < Dists_i}|        (6)

     Intuitively, C_i is the number of points in V whose distances to q are closer than Dists_i. In other words, C_i is the number of negative points in V closer than the i-th positive neighbor to q.

  2. C_i + i ≤ k (i < n),  C_n + n > k.

     This simply declares that the length n of the C array is as short as possible while accounting for the k members of V that are nearest to q. Such an n exists since C_0 = 0 and C_{k+1} = the total number of negative points. To make the problem interesting, we assume that the number of negative points and the number of positive points are both greater than k.

• D^Node_minp and D^Node_maxp. Here we will continue to use D^Node_minp, which is defined in Equation (5). Symmetrically, we also define D^Node_maxp as follows:

      Let D^Node_maxp = { min(|q − Node.Pivot| + Node.Radius, D^{Node.parent}_maxp)   if Node ≠ Root
                        { |q − Node.Pivot| + Node.Radius                              if Node == Root        (7)

  D^Node_maxp is the maximum possible distance from any point in Node to q. This is computed using the bound in Equation (2) and the property of a ball-tree that all the points covered by a node must be covered by its parent. This property implies that D^Node_maxp will never be greater than the maximum possible distance of its ancestors.

Figure 3 gives a good example. There are three nodes p, c1 and c2, where c1 and c2 are p's children, and q is the query point. In order to compute D^{c1}_minp, we first compute |q − c1.Pivot| − c1.Radius, which is the dotted line in the figure, but D^{c1}_minp can be further bounded by D^{p}_minp, since it is impossible for any point to be in the shaded area. Similarly, we get the equation for D^{c1}_maxp.

Figure 3: An example to illustrate how to compute D^Node_minp.

D^Node_minp and D^Node_maxp are used to estimate the counts array (n, C). Again we take advantage of the triangle inequality of the ball-tree. For any Node, if there exists an i (i ∈ [1, n]) such that Dists_{i−1} ≤ D^Node_minp and D^Node_maxp < Dists_i, then for all x ∈ Points(Node), Dists_{i−1} ≤ |x − q| < Dists_i. According to the definition of C, we can then add |Points(Node)| to C_i, C_{i+1}, ..., C_n. The function of D^Node_minp, similar to KNS1, is to help prune uninteresting nodes.
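As a small illustration of this counting rule (our own helper, not the paper's code): once D^Node_maxp has been located between two consecutive entries of the Dists array, the whole node's point count can be folded into C in one pass.

    import bisect

    def add_node_to_counts(C, dists, d_maxp, num_points):
        """Add |Points(Node)| to every C_j with Dists_j > D^Node_maxp.

        C     : counts C[0..n], where C[j] = number of negative points seen with |x - q| < Dists[j]
        dists : Dists[0..n], sorted, with Dists[0] = 0
        """
        i = bisect.bisect_right(dists, d_maxp)   # smallest i with d_maxp < Dists[i]
        for j in range(i, len(C)):               # every point of the node is closer than Dists[j] for j >= i
            C[j] += num_points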

Step 2 of KNS2 is implemented by the recursive function below:

    (n^out, C^out) = NegCount(n^in, C^in, Node, jparent, Dists)

See Figure 4 for the detailed implementation of NegCount. Assume that on entry (n^in, C^in) summarize interesting negative points for pointset V, where V is the set of points visited so far during the search. This algorithm efficiently ensures that, on exit, (n^out, C^out) summarize interesting negative points for V ∪ Points(Node). In addition, jparent is a temporary variable used to prevent multiple counts for the same point. This variable relates to the implementation of KNS2, and we do not go into its details in this paper.

We can stop the procedure when n^out becomes 1 (which means all the k nearest neighbors of q are in the negative class) or when we run out of nodes. At the end of the search, n^out − 1 is the number of positive points among the k nearest neighbors of q. The top-level call is

    NegCount(k, C_0, Negtree.Root, k+1, Dists)

where C_0 is an array of zeroes and Dists is as defined in Step 2, obtained by applying KNS1 to the Postree.

There are at least two situations where this algorithm can run faster than simply finding the k-NN. First of all, when n = 1, we can stop and exit, since this means we have found at least k negative points closer than the nearest positive neighbor to q. Notice that the k negative points we have found are not necessarily the exact k nearest neighbors to q, but this will not change the answer to our question.


Procedure NegCount(n^in, C^in, Node, jparent, Dists)
begin
  n^out := n^in                                     /* Variables to be returned by the search.
  C^out := C^in                                        Initialize them here. */
  Compute D^Node_minp and D^Node_maxp
  Search for i, j ∈ [1, n^out], such that
      Dists_{i−1} ≤ D^Node_minp < Dists_i
      Dists_{j−1} ≤ D^Node_maxp < Dists_j
  For all index ∈ [j, jparent)                      /* Re-estimate C^out. Only update the counts
      C^out_index := C^out_index + |Points(Node)|      less than jparent to avoid counting twice. */
  Update n^out, such that
      C^out_{n^out−1} + (n^out − 1) ≤ k,  C^out_{n^out} + n^out > k
  Set Dists_{n^out} := ∞
  (1) if (n^out == 1)                               /* At least k negative points closer to q
        Return (1, C^out)                              than the closest positive one: done! */
  (2) if (i == j)                                   /* Node is located between two adjacent
        Return (n^out, C^out)                          positive points, no need to split. */
  (3) if (Node is a leaf)
        ∀ x ∈ Points(Node)
            Compute |x − q|
        Update and return (n^out, C^out)
  (4) else
        node1 := child of Node closest to q
        node2 := child of Node furthest from q
        (n^temp, C^temp) := NegCount(n^in, C^in, node1, j, Dists)
        if (n^temp == 1)
            Return (1, C^out)
        else
            (n^out, C^out) := NegCount(n^temp, C^temp, node2, j, Dists)
end

Figure 4: Procedure NegCount.

This situation happens frequently for skewed data sets. The second situation is as follows: a Node can also be pruned if it is located exactly between two adjacent positive points, or if it is farther away than the n-th positive point. This is because in these situations there is no need to figure out which negative point is closer within the Node. Especially as n gets smaller, we have more chances to prune a node, because Dists_{n^in} decreases as n^in decreases.
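To tie the pieces together, here is a simplified executable sketch of the KNS2 idea in Python. It is not the authors' implementation: it merges Steps 1 and 2, drops the jparent bookkeeping and the parent-tightened bounds of Equations (5) and (7), assumes both classes contain more than k points, and reuses the hypothetical build_ball_tree/ball_knn helpers sketched earlier.

    import numpy as np

    def kns2_classify(pos_tree, neg_tree, q, k, t):
        """Sketch of KNS2: True iff at least t of the k nearest neighbors of q are positive.
        pos_tree / neg_tree are ball-trees (BallNode) over the positive / negative points."""
        # Step 1 -- "Find positive": distances to the k nearest positive points (via KNS1).
        pos_heap = ball_knn(pos_tree, q, k)
        dists = [0.0] + sorted(-d for d, _ in pos_heap) + [np.inf]   # Dists_0 .. Dists_{k+1}
        # Step 2 -- "Insert negative": count negatives relative to the positive distances.
        C = [0] * (k + 2)        # C[i] = number of counted negatives with |x - q| < dists[i]
        n = [k + 1]              # invariant: C[i] + i <= k for i < n, and C[n] + n > k

        def add_points(count, idx):
            for m in range(idx, k + 2):
                C[m] += count
            while n[0] > 1 and C[n[0] - 1] + (n[0] - 1) > k:
                n[0] -= 1        # shrink n while the invariant allows it

        def neg_count(node):
            if n[0] == 1:
                return           # >= k negatives are closer than the nearest positive: done
            d_minp = max(np.linalg.norm(q - node.pivot) - node.radius, 0.0)
            d_maxp = np.linalg.norm(q - node.pivot) + node.radius
            if d_minp >= dists[n[0]]:
                return           # node lies beyond Dists_n: it cannot change the answer
            i = int(np.searchsorted(dists[:n[0] + 1], d_minp, side='right'))
            j = int(np.searchsorted(dists[:n[0] + 1], d_maxp, side='right'))
            if i == j:           # whole node sits between two adjacent positive distances
                add_points(len(node.points), j)
            elif node.child1 is None:                      # leaf: resolve each point exactly
                for x in node.points:
                    d = np.linalg.norm(x - q)
                    if d < dists[n[0]]:
                        add_points(1, int(np.searchsorted(dists[:n[0] + 1], d, side='right')))
            else:                                          # non-leaf: nearer child first
                d1 = np.linalg.norm(q - node.child1.pivot)
                d2 = np.linalg.norm(q - node.child2.pivot)
                first, second = ((node.child1, node.child2) if d1 <= d2
                                 else (node.child2, node.child1))
                neg_count(first)
                neg_count(second)

        neg_count(neg_tree)
        return (n[0] - 1) >= t   # n - 1 positives lie among the k nearest neighbors

The early return when n reaches 1 and the whole-node counting when i == j correspond to the two pruning situations described above.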

Omachi and Aso (2000) proposed a k-NN method based on branch and bound. For simplicity, we call their algorithm KNSV. KNSV is similar to KNS2, in that for the binary-class case, it also


builds two trees, one for each class. For consistency, let us still call them Postree and Negtree. KNSV first searches the tree whose center of gravity is closer to q. Without loss of generality, we assume Negtree is closer, so KNSV will search Negtree first. Instead of fully exploring the tree, it does a greedy depth-first search only to find k candidate points. Then KNSV moves on to search Postree. The search is the same as conventional ball-tree search (KNS1), except that it uses the k-th candidate negative point to bound the distance. After the search of Postree is done, KNSV counts how many of the k nearest neighbors so far are from the negative class. If the number is more than k/2, the algorithm stops. Otherwise, KNSV will go back to search Negtree for the second time, this time fully searching the tree. KNSV has advantages and disadvantages. The first advantage is that it is simple, and thus easy to extend to the many-class case. Also, if the first guess of KNSV is correct and the k candidate points are good enough to prune away many nodes, it will be faster than conventional ball-tree search. But there are some obvious drawbacks to the algorithm. First, the guess of the winner class is based only on which class's center of gravity is closest to q. Notice that this is a pure heuristic, and the probability of making a mistake is high. Second, using a greedy search to find the k candidate nearest neighbors carries a high risk, since these candidates might not even be close to the true nearest neighbors. In that case, the chance of pruning away nodes from the other class becomes much smaller. We can imagine that in many situations, KNSV will end up searching the first tree yet another time. Finally, we want to point out that KNSV claims it can perform well for many-class nearest neighbors, but this is based on the assumption that the winner class contains at least k/2 points within the nearest neighbors, which is often not true for the many-class case. Compared to KNSV, KNS2's advantages are (i) it uses the skewness property of a data set, which can be robustly detected before the search, and (ii) its more careful design gives KNS2 more chances to speed up the search.

5. KNS3: Are at Least t of the k Nearest Neighbors Positive?

In this paper's second new algorithm, we remove KNS2's constraint of an assumed skewness in the class distribution. Instead, we answer a weaker question: "are at least t of the k nearest neighbors positive?", where the questioner must supply t and k. In the usual k-NN rule, t represents a majority with respect to k, but here we consider the slightly more general form which might be used, for example, during classification with known false positive and false negative costs.

In KNS3, we define two important quantities:

    D^pos_t = distance of the t-th nearest positive neighbor of q        (8)

    D^neg_m = distance of the m-th nearest negative neighbor of q        (9)

where m + t = k + 1.

Before introducing the algorithm, we state and prove an important proposition, which relates the two quantities D^pos_t and D^neg_m to the answer computed by KNS3.

Proposition 1  D^pos_t ≤ D^neg_m if and only if at least t of the k nearest neighbors of q are from the positive class.


Proof: If D^pos_t ≤ D^neg_m, then there are at least t positive points closer than the m-th negative point to q. This also implies that if we draw a ball centered at q with radius equal to D^neg_m, then there are exactly m negative points and at least t positive points within the ball. Since t + m = k + 1, if we use D_k to denote the distance of the k-th nearest neighbor, we get D_k ≤ D^neg_m, which means that at most m − 1 of the k nearest neighbors of q are from the negative class. This is equivalent to saying that at least t of the k nearest neighbors of q are from the positive class. On the other hand, if at least t of the k nearest neighbors are from the positive class, then D^pos_t ≤ D_k, and the number of negative points among them is at most k − t < m, so D_k ≤ D^neg_m. This implies that D^pos_t ≤ D^neg_m. □

Figure 5 provides an illustration. In this example, k = 5 and t = 3. We use black dots to represent positive points, and white dots to represent negative points.

Figure 5: An example of D^pos_t and D^neg_m.

The reason to redefine the problem of KNS3 is to transform a k-nearest-neighbor searching problem into a much simpler counting problem. In fact, in order to answer the question, we do not even have to compute the exact values of D^pos_t and D^neg_m; instead, we can estimate them. We define Lo(D^pos_t) and Up(D^pos_t) as the lower and upper bounds of D^pos_t, and similarly we define Lo(D^neg_m) and Up(D^neg_m) as the lower and upper bounds of D^neg_m. If at any point Up(D^pos_t) ≤ Lo(D^neg_m), we know D^pos_t ≤ D^neg_m; on the other hand, if Up(D^neg_m) ≤ Lo(D^pos_t), we know D^neg_m ≤ D^pos_t.

Now our computational task is to efficiently estimate Lo(D^pos_t), Up(D^pos_t), Lo(D^neg_m) and Up(D^neg_m), and a ball-tree structure is very convenient for doing so. Below is the detailed description:

At each stage of KNS3 we have two sets of balls in use, called P and N, where P is a set of balls from Postree, built from the positive data points, and N consists of balls from Negtree, built from the negative data points.

Both sets have the property that if a ball is in the set, then neither its ball-tree ancestors nor descendants are in the set, so that each point in the training set is a member of one or zero balls in P ∪ N. Initially, P = {Postree.root} and N = {Negtree.root}. Each stage of KNS3 analyzes P to estimate Lo(D^pos_t) and Up(D^pos_t), and analyzes N to estimate Lo(D^neg_m) and Up(D^neg_m). If possible, KNS3 terminates with the answer; else it chooses an appropriate ball from P or N, replaces that ball with its two children, and repeats the iteration. Figure 6 shows one stage of KNS3. The balls


involved are labeled a through g, and we have

    P = {a, b, c, d},   N = {e, f, g}

Notice that although c and d are inside b, they are not descendants of b. This is possible because when a ball is split, we only require the point sets of its children to be disjoint; the balls covering the children nodes may overlap.

Figure 6: A configuration at the start of a stage.

In order to compute Lo(D^pos_t), we need to sort the balls u ∈ P such that

    ∀ u_i, u_j ∈ P:  i < j ⇒ D^{u_i}_minp ≤ D^{u_j}_minp

Then

    Lo(D^pos_t) = D^{u_j}_minp,  where  Σ_{i=1}^{j−1} |Points(u_i)| < t  and  Σ_{i=1}^{j} |Points(u_i)| ≥ t

Symmetrically, in order to compute Up(D^pos_t), we sort u ∈ P such that

    ∀ u_i, u_j ∈ P:  i < j ⇒ D^{u_i}_maxp ≤ D^{u_j}_maxp

Then

    Up(D^pos_t) = D^{u_j}_maxp,  where  Σ_{i=1}^{j−1} |Points(u_i)| < t  and  Σ_{i=1}^{j} |Points(u_i)| ≥ t

Similarly, we can compute Lo(D^neg_m) and Up(D^neg_m).
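The bound computation just described is easy to express in code. The following Python helper is a sketch under the same assumptions as the earlier BallNode sketch (each ball knows its pivot, radius and owned points); the name estimate_bound and its return convention are ours.

    import numpy as np

    def estimate_bound(balls, i, q):
        """Lower and upper bounds on the distance from q to the i-th nearest point
        owned by the (disjoint) balls in `balls`."""
        def bound(key):
            # Sort the balls by the chosen per-ball distance bound, then walk along the
            # sorted list until the cumulative point count first reaches i.
            total = 0
            for b in sorted(balls, key=key):
                total += len(b.points)
                if total >= i:
                    return key(b)
            return np.inf                                   # fewer than i points in the set
        lo = bound(lambda b: max(np.linalg.norm(q - b.pivot) - b.radius, 0.0))   # uses D_minp
        up = bound(lambda b: np.linalg.norm(q - b.pivot) + b.radius)             # uses D_maxp
        return lo, up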

To illustrate this, it is useful to depict a ball as an interval, where the two ends of the interval denote the minimum and maximum possible distances from a point owned by the ball to the query. Figure 7(a) shows an example. Notice that we also mark "+5" above the interval to denote the number of points owned by the ball B. After we have this representation, both P and N can be represented as sets of intervals, each interval corresponding to a ball. This is shown in Figure 7(b). For example, the second horizontal line denotes the fact that ball b contains four positive points, and that the distance from


any location in b to q lies in the range [0, 5]. The value of Lo(D^pos_t) can be understood as the answer to the following question: if we tried to slide all the positive points within their bounds as far to the left as possible, where would the t-th closest positive point lie? Similarly, we can estimate Up(D^pos_t) by sliding all the positive points to the right ends of their bounds.

Figure 7: (a) The interval representation of a ball B owning 5 points, on a distance axis from 0 to 5. (b) The interval representation of the configuration in Figure 6, showing Lo(D^pos_7), Up(D^pos_7), Lo(D^neg_6) and Up(D^neg_6).

For example, in Figure 6, let k = 12 and t = 7. Then m = 12 − 7 + 1 = 6. We can estimate (Lo(D^pos_7), Up(D^pos_7)) and (Lo(D^neg_6), Up(D^neg_6)), and the results are shown in Figure 7. Since the two intervals (Lo(D^pos_7), Up(D^pos_7)) and (Lo(D^neg_6), Up(D^neg_6)) overlap, no conclusion can be made at this stage. Further splitting needs to be done to refine the estimation.

Below is the pseudocode of the KNS3 algorithm. We define a loop procedure called PREDICT with the following input and output:

    Answer = PREDICT(P, N, t, m)

The Answer, a boolean value, is TRUE if there are at least t of the k nearest neighbors from the positive class, and FALSE otherwise. Initially, P = {Postree.root} and N = {Negtree.root}. The threshold t is given, and m = k − t + 1.

Before we describe the algorithm, we first introduce two definitions.

Define:

    (Lo(D^S_i), Up(D^S_i)) = Estimatebound(S, i)        (10)

Here S is either set P or N, and we are interested in the i-th nearest neighbor of q from set S. The output is the lower and upper bounds. The concrete procedure for estimating the bounds was just described.

Notice that the estimation of the upper and lower bounds could be very loose in the beginning, and will not give us enough information to answer the question. In this case, we will need to split a ball-tree node and re-estimate the bounds. With more and more nodes being split, our estimation becomes more and more precise, and the procedure can be stopped as soon as Up(D^pos_t) ≤ Lo(D^neg_m) or Up(D^neg_m) ≤ Lo(D^pos_t). The function Pick(P, N) below chooses one node, either from P or N, to split. There are different strategies for picking a node; for simplicity, our implementation just randomly picks a node to split.

Define:

    split_node = Pick(P, N)        (11)

Here split_node is the node chosen to be split. See Figure 8.

Procedure PREDICT(P, N, t, m)
begin
  Repeat
    (Lo(D^pos_t), Up(D^pos_t)) = Estimatebound(P, t)      /* See Definition 10. */
    (Lo(D^neg_m), Up(D^neg_m)) = Estimatebound(N, m)
    if (Up(D^pos_t) ≤ Lo(D^neg_m)) then
      Return TRUE
    if (Up(D^neg_m) ≤ Lo(D^pos_t)) then
      Return FALSE
    split_node = Pick(P, N)
    remove split_node from P or N
    insert split_node.child1 and split_node.child2 into P or N
end

Figure 8: Procedure PREDICT.

Our explanation of KNS3 was simplified for clarity. In order to avoid frequent searches over the full lengths of the sets N and P, they are represented as priority queues. Each set in fact uses two queues: one prioritized by D^u_maxp and the other by D^u_minp. This ensures that the costs of all argmins, deletions and splits are logarithmic in the queue size.
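Putting PREDICT together with the bound estimation, a rough Python sketch looks like the following. It is a simplification of the procedure above: plain lists instead of the two priority queues, a "split the largest remaining ball" heuristic in place of Pick (the paper's implementation picks randomly), and an exact fallback once only leaf balls remain. It reuses the hypothetical estimate_bound and BallNode sketches.

    import numpy as np

    def kns3_predict(pos_root, neg_root, q, k, t):
        """True iff at least t of the k nearest neighbors of q are positive."""
        m = k - t + 1
        P, N = [pos_root], [neg_root]
        while True:
            lo_pos, up_pos = estimate_bound(P, t, q)
            lo_neg, up_neg = estimate_bound(N, m, q)
            if up_pos <= lo_neg:
                return True                       # D^pos_t <= D^neg_m is certain
            if up_neg <= lo_pos:
                return False                      # D^neg_m <= D^pos_t is certain
            candidates = [(len(b.points), b, S) for S in (P, N) for b in S
                          if b.child1 is not None]
            if not candidates:                    # only leaf balls left: resolve exactly
                d_pos = sorted(np.linalg.norm(x - q) for b in P for x in b.points)[t - 1]
                d_neg = sorted(np.linalg.norm(x - q) for b in N for x in b.points)[m - 1]
                return d_pos <= d_neg
            _, node, S = max(candidates, key=lambda c: c[0])
            S.remove(node)                        # replace the chosen ball by its two children
            S.extend([node.child1, node.child2])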

Some people may ask: "It seems that KNS3 has more advantages than KNS2: it removes the assumption of skewness of the data set, and in general it has more chances to prune away nodes. Why do we still need KNS2?" The answer is that KNS2 does have its own advantages. It answers a more difficult question than KNS3. Knowing exactly how many of the nearest neighbors are from the positive class can be especially useful when the threshold for deciding a class is not known. In that case, KNS3 does not work at all, since we cannot provide a static t for answering question (c), but KNS2 can still work very well. On the other hand, the implementation of KNS2 is much simpler than that of KNS3. For instance, it does not need the priority queues we just described. So there do exist cases where KNS2 is faster than KNS3.


6. Experimental Results

To evaluate our algorithms, we used both real data sets (from the UCI and KDD repositories) and synthetic data sets designed to exercise the algorithms in various ways.

6.1 Synthetic Data Sets

We have six synthetic data sets. The first synthetic data set is called Ideal, as illustrated in Figure 9(a). All the data in the upper left area are assigned to the positive class, and all the data in the lower right area are assigned to the negative class. The second data set is called Diag2d, as illustrated in Figure 9(b). The data are uniformly distributed in a 10 by 10 square. The data above the diagonal are assigned to the positive class, and those below the diagonal are assigned to the negative class. We made several variants of Diag2d to test the robustness of KNS3. Diag2d(10%) has 10% of the data of Diag2d. Diag3d is a cube with uniformly distributed data classified by a diagonal plane. Diag10d is a 10-dimensional hypercube with uniformly distributed data classified by a hyper-diagonal plane. Noise-diag2d has the same data as Diag2d(10%), but 1% of the data was assigned to the wrong class.

Figure 9: Synthetic Data Sets. (a) Ideal; (b) Diag2d (100,000 data points).

Table 1 summarizes the synthetic data sets used in the empirical analysis.

6.2 Real-World Data Sets

We used UCI & KDD data (listed in Table 2), but we also experimented with data sets of particular current interest within our laboratory.

Life Sciences. These were proprietary data sets (ds1 and ds2) similar to the publicly available Open Compound Database provided by the National Cancer Institute (NCI Open Compound Database, 2000). The two data sets are sparse. We also present results on data sets derived from ds1 and ds2, denoted ds1.10pca, ds1.100pca and ds2.100anchor, obtained by linear projection using principal component analysis (PCA).


Data Set       Num. of records   Num. of Dimensions   Num. of positive   Num.pos/Num.neg
Ideal          10000             2                    5000               1
Diag2d(10%)    10000             2                    5000               1
Diag2d         100000            2                    50000              1
Diag3d         100000            3                    50000              1
Diag10d        100000            10                   50000              1
Noise2d        10000             2                    5000               1

Table 1: Synthetic Data Sets


Link Detection. The first, Citeseer, is derived from the Citeseer web site (Citeseer, 2002) and lists the names of collaborators on published materials. The goal is to predict whether J_Lee (the most common name) was a collaborator for each work based on who else is listed for that work. We use J_Lee.100pca to represent the linear projection of the data to 100 dimensions using PCA. The second link detection data set is derived from the Internet Movie Database (IMDB, 2002) and is denoted imdb; it uses a similar approach, but the task is to predict the participation of Mel Blanc (again the most common participant).

UCI/KDD data. We use four large data sets from the KDD/UCI repository (Bay, 1999). The data sets can be identified from their names. They were converted to binary classification problems. Each categorical input attribute was converted into n binary attributes by a 1-of-n encoding (where n is the number of possible values of the attribute); a minimal sketch of this encoding appears after the list below.

1. Letter originally had 26 classes: A-Z. We performed binary classification using the letter A as the positive class and "Not A" as negative.

2. Ipums (from ipums.la.97). We predict farm status, which is binary.

3. Movie is a data set from the Informedia digital video library project (2001): the shot boundary task of the TREC-2001 Video Track organized by NIST, consisting of 4 hours of video, or 13 MPEG-1 video files at slightly over 2 GB of data.

4. Kdd99(10%) has a binary prediction: Normal vs. Attack.
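For illustration, the 1-of-n encoding mentioned above can be written in a few lines of Python; this is a generic sketch, not the preprocessing code used for the experiments.

    def one_of_n(values):
        """1-of-n encoding of a categorical column: each of the n possible
        values becomes its own binary attribute."""
        categories = sorted(set(values))
        return [[1 if v == c else 0 for c in categories] for v in values]

    # e.g. one_of_n(["red", "green", "red"]) -> [[0, 1], [1, 0], [0, 1]]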

6.3 Methodology

The data set ds2 is particularly interesting, because its dimension is 1.1 × 10^6. Our first experiment is designed especially for it. We use k = 9 and t = ⌈k/2⌉, and print out the distribution of time taken for queries by the three algorithms KNS1, KNS2, and KNS3. This is aimed at understanding the range of behavior of the algorithms under huge dimensions (some queries will be harder, or take longer, for an algorithm than other queries). We randomly took 1% of the negative records (881) and 50% of the positive records (105) as test data (986 points in total), and train on the remaining 87372 data points.


Data Set        Num. of records   Num. of Dimensions   Num. of positive   Num.pos/Num.neg
ds1             26733             6348                 804                0.03
ds1.10pca       26733             10                   804                0.03
ds1.100pca      26733             100                  804                0.03
ds2             88358             1.1×10^6             211                0.002
ds2.100anchor   88358             100                  211                0.002
J_Lee.100pca    181395            100                  299                0.0017
Blanc_Mel       186414            10                   824                0.004

Letter          20000             16                   790                0.04
Ipums           70187             60                   119                0.0017
Movie           38943             62                   7620               0.24
Kdd99(10%)      494021            176                  97278              0.24

Table 2: Real Data Sets

For our second set of experiments, we did 10-fold cross-validation on all the data sets. For each data set, we tested k = 9 and k = 101, in order to show the effect of a small value and a large value. For KNS3, we used t = ⌈k/2⌉: a data point is classified as positive iff the majority of its k nearest neighbors are positive. Since we use cross-validation, each experiment required R k-NN classification queries (where R is the number of records in the data set) and each query involved the k-NN among 0.9R records. A naive implementation with no ball-trees would thus require about 0.9R^2 distance computations. We want to emphasize here that these algorithms are all exact. No approximations were used in the classifications.
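For example, for the Letter data set with R = 20000 records, this is 0.9 × 20000^2 = 3.6 × 10^8 distance computations, which matches the Naive entry for Letter in Table 3.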

6.4 Results

Figure 10 shows the histograms of times and speed-ups for queries on the ds2 data set. For Naive k-NN, all the queries take 87372 distance computations. For KNS1, all the queries take more than 1.0 × 10^4 distance computations (the average number of distances computed is 1.3 × 10^5, which is greater than 87372), and thus traditional ball-tree search is worse than "naive" linear scan. For KNS2, most of the queries take less than 4.0 × 10^4 distance computations, while a few points take longer; the average number of distances computed is 6233. For KNS3, all the queries take less than 1.0 × 10^4 distance computations; the average number of distances computed is 3411. The lower three figures illustrate the speed-up achieved by KNS1, KNS2 and KNS3 over naive linear scan. The figures show the distribution of the speedup obtained for each query. From Figure 10(d) we can see that on average, KNS1 is even slower than the naive algorithm. KNS2 gets from 2- to 250-fold speedups; on average, it has a 14-fold speedup. KNS3 gets from 2- to 2500-fold speedups; on average, it has a 26-fold speedup.

Table 3 shows the results for the second set of experiments. The second column lists the computational cost of naive k-NN, both in terms of the number of distance computations and the wall-clock time on an unloaded 2 GHz Pentium.


Figure 10: Histograms over the test queries (x-axes: number of distance computations, or folds of speedup over naive; y-axes: number of queries). (a) Distribution of times taken for queries of KNS1. (b) Distribution of times taken for queries of KNS2. (c) Distribution of times taken for queries of KNS3. (d) Distribution of speedup achieved for queries of KNS1. (e) Distribution of speedup achieved for queries of KNS2. (f) Distribution of speedup achieved for queries of KNS3.

We then examine the speedups of KNS1 (traditional use of a ball-tree) and our two new ball-tree methods (KNS2 and KNS3). Generally speaking, the speedups achieved in distance computations by all three algorithms are greater than the corresponding speedups in wall-clock time. This is expected, because the wall-clock time also includes the time for building ball-trees, generating priority queues and searching. We can see that for the synthetic data sets, KNS1 and KNS2 yield 2- to 700-fold speedups over naive. KNS3 yields 2- to 4500-fold speedups. Notice that KNS2 cannot beat KNS1 for these data sets, because KNS2 is designed to speed up k-NN search on data sets with unbalanced output classes. Since all the synthetic data sets have equal numbers of data points from the positive and negative classes, KNS2 has no advantage.

It is notable that for some high-dimensional data sets, KNS1 does not produce an acceleration


over naive. KNS2 and KNS3 do, however, and in some cases they are hundreds of times faster than KNS1.

                          NAIVE                  KNS1                 KNS2                 KNS3
                          dists       time       dists     time       dists     time       dists     time
                                      (secs)     speedup   speedup    speedup   speedup    speedup   speedup
Ideal          k=9        9.0×10^7    30         96.7      56.5       112.9     78.5       4500      486
               k=101                             23.0      10.2       24.7      14.7       4500      432
Diag2d(10%)    k=9        9.0×10^7    30         91        51.1       88.2      52.4       282       27.1
               k=101                             22.3      8.7        21.3      9.3        167.9     15.9
Diag2d         k=9        9.0×10^9    3440       738       366        664       372        2593      287
               k=101                             202.9     104        191       107.5      2062      287
Diag3d         k=9        9.0×10^9    4060       361       184.5      296       184.5      1049      176.5
               k=101                             111       56.4       95.6      48.9       585       78.1
Diag10d        k=9        9.0×10^9    6080       7.1       5.3        7.3       5.2        12.7      2.2
               k=101                             3.3       2.5        3.1       1.9        6.1       0.7
Noise2d        k=9        9.0×10^7    40         91.8      20.1       79.6      30.1       142       42.7
               k=101                             22.3      4          16.7      4.5        94.7      43.5
ds1            k=9        6.4×10^8    4830       1.6       1.0        4.7       3.1        12.8      5.8
               k=101                             1.0       0.7        1.6       1.1        10        4.2
ds1.10pca      k=9        6.4×10^8    420        11.8      11.0       33.6      21.4       71        20
               k=101                             4.6       3.4        6.5       4.0        40        6.1
ds1.100pca     k=9        6.4×10^8    2190       1.7       1.8        7.6       7.4        23.7      29.6
               k=101                             0.97      1.0        1.6       1.6        16.4      6.8
ds2            k=9        8.5×10^9    105500     0.64      0.24       14.0      2.8        25.6      3.0
               k=101                             0.61      0.24       2.4       0.83       28.7      3.3
ds2.100anchor  k=9        7.0×10^9    24210      15.8      14.3       185.3     144        580       311
               k=101                             10.9      14.3       23.0      19.4       612       248
J_Lee.100pca   k=9        3.6×10^10   142000     2.6       2.4        28.4      27.2       15.6      12.6
               k=101                             2.2       1.9        12.6      11.6       37.4      27.2
Blanc_Mel      k=9        3.8×10^10   44300      3.0       3.0        47.5      60.8       51.9      60.7
               k=101                             2.9       3.1        7.1       33         203       134.0
Letter         k=9        3.6×10^8    290        8.5       7.1        42.9      26.4       94.2      25.5
               k=101                             3.5       2.6        9.0       5.7        45.9      9.4
Ipums          k=9        4.4×10^9    9520       195       136        665       501        1003      515
               k=101                             69.1      50.4       144.6     121        5264      544
Movie          k=9        1.4×10^9    3100       16.1      13.8       29.8      24.8       50.5      22.4
               k=101                             9.1       7.7        10.5      8.1        33.3      11.6
Kddcup99(10%)  k=9        2.7×10^11   1670000    4.2       4.2        574       702        4         4.1
               k=101                             4.2       4.2        187.7     226.2      3.9       3.9

Table 3: Number of distance computations and wall-clock time for Naive k-NN classification (2nd column). Acceleration for normal use of KNS1 (in terms of number of distances and time). Accelerations of the new methods KNS2 and KNS3 in the other columns. Naive times are independent of k.

7. Comments and Related Work

Why k-NN? k-NN is an old classification method, often not achieving the highest possible accuracies when compared to more complex methods. Why study it? There are many reasons. k-NN is a useful sanity check or baseline against which to check more sophisticated algorithms, provided k-NN is tractable. It is often the first line of attack in a new complex problem due to its simplicity and flexibility. The user need only provide a sensible distance metric. The method is easy to interpret once this distance metric is understood. We have already mentioned its compelling


theoretical properties, which explain its surprisingly good performance in practice in many cases. For these reasons and others, k-NN is still popular in some fields that need classification, for example Computer Vision and QSAR analysis of High Throughput Screening data (e.g., Zheng and Tropsha, 2000). Finally, we believe that the same insights that accelerate k-NN will apply to more modern algorithms. From a theoretical viewpoint, many classification algorithms can be viewed simply as the nearest-neighbor method with a certain broader notion of distance function; see for example Baxter and Bartlett (1998) for such a broad notion. RKHS kernel methods are another example of a broadened notion of distance function. More concretely, we have applied similar ideas to speed up nonparametric Bayes classifiers, in work to be submitted.

Applicability of other proximity query work. For the problem of "find the k nearest data points" (as opposed to our question of "perform k-NN or kernel classification") in high dimensions, the frequent failure of a traditional ball-tree to beat naive search has led to some very ingenious and innovative alternatives, based on random projections, hashing discretized cubes, and acceptance of approximate answers. For example, Gionis et al. (1999) give a hashing method that was demonstrated to provide speedups over a ball-tree-based approach in 64 dimensions by a factor of 2-5, depending on how much error in the approximate answer was permitted. Another approximate k-NN idea is in Arya et al. (1998), one of the first k-NN approaches to use a priority queue of nodes, in this case achieving a 3-fold speedup with an approximation to the true k-NN. In Liu et al. (2004a), we introduced a variant of ball-tree structures which allows non-backtracking search to speed up approximate nearest neighbor, and we observed up to 700-fold accelerations over conventional ball-tree-based k-NN. A similar idea has been proposed by Indyk (2001). However, these approaches are based on the notion that any points falling within a factor of (1 + ε) times the true nearest-neighbor distance are acceptable substitutes for the true nearest neighbor. Noting in particular that distances in high-dimensional spaces tend to occupy a decreasing range of continuous values (Hammersley, 1950), it remains unclear whether schemes based upon the absolute values of the distances rather than their ranks are relevant to the classification task. Our approach, because it need not find the k-NN to answer the relevant statistical question, finds an answer without approximation. The fact that our methods are easily modified to allow (1 + ε) approximation in the manner of Arya et al. (1998) suggests an obvious avenue for future research.

No free lunch. For uniform high-dimensional data no amount of trickery can save us. The explanation for the promising empirical results is that all the inter-dependences in the data mean we are working in a space of much lower intrinsic dimensionality (Maneewongvatana and Mount, 2001). Note, though, that in experiments not reported here, QSAR and vision k-NN classifiers give better performance on the original data than on PCA-projected low-dimensional data, indicating that some of these dependencies are non-linear.

Summary. This paper has introduced and evaluated two new algorithms for more effectively exploiting spatial data structures during k-NN classification. We have shown significant speedups on high-dimensional data sets without resorting to approximate answers or sampling. The result is that the k-NN method now scales to many large high-dimensional data sets that previously were not tractable for it, and are still not tractable for many popular methods such as support vector machines.


References

S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6):891–923, 1998. URL citeseer.ist.psu.edu/arya94optimal.html.

J. Baxter and P. Bartlett. The Canonical Distortion Measure in Feature Space and 1-NN Classification. In Advances in Neural Information Processing Systems 10. Morgan Kaufmann, 1998.

S. D. Bay. UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Dept of Information and Computer Science, 1999.

C. Cardie and N. Howe. Improving minority class prediction using case-specific feature weights. In Proceedings of the 14th International Conference on Machine Learning, pages 57–65. Morgan Kaufmann, 1997. URL citeseer.nj.nec.com/cardie97improving.html.

C. L. Chang. Finding prototypes for nearest neighbor classifiers. IEEE Trans. Computers, C-23(11):1179–1184, November 1974.

P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd VLDB International Conference, September 1997.

K. Clarkson. Nearest Neighbor Searching in Metric Spaces: Experimental Results for sb(S). 2002.

S. Cost and S. Salzberg. A Weighted Nearest Neighbour Algorithm for Learning with Symbolic Features. Machine Learning, 10:57–67, 1993.

T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Trans. Information Theory, IT-13, no. 1:21–27, 1967.

S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.

K. Deng and A. W. Moore. Multiresolution Instance-based Learning. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 1233–1239, San Francisco, 1995. Morgan Kaufmann.

L. Devroye and T. J. Wagner. Nearest neighbor methods in discrimination, volume 2. P. R. Krishnaiah and L. N. Kanal, eds., North-Holland, 1982.

A. Djouadi and E. Bouktache. A fast algorithm for the nearest-neighbor classifier. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(3):277–282, March 1997.

N. R. Draper and H. Smith. Applied Regression Analysis, 2nd ed. John Wiley, New York, 1981.

R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.

C. Faloutsos and D. W. Oard. A survey of information retrieval and filtering methods. Technical Report CS-TR-3514, Carnegie Mellon University Computer Science Department, 1995.


C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and William Equitz. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3/4):231–262, 1994.

T. Fawcett and F. J. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291–316, 1997. URL citeseer.nj.nec.com/fawcett97adaptive.html.

F. P. Fisher and E. A. Patrick. A preprocessing algorithm for nearest neighbor decision rules. In Proc. Nat'l Electronic Conf., volume 26, pages 481–485, December 1970.

M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: the QBIC system. IEEE Computer, 28:23–32, 1995.

J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226, September 1977.

K. Fukunaga and P. M. Narendra. A Branch and Bound Algorithm for Computing K-Nearest Neighbors. IEEE Trans. Computers, C-24(7):750–753, 1975.

G. W. Gates. The reduced nearest neighbor rule. IEEE Trans. Information Theory, IT-18(5):431–433, May 1972.

A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th VLDB Conference, 1999.

A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.

A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of Database Systems. Assn for Computing Machinery, April 1984.

J. M. Hammersley. The Distribution of Distances in a Hypersphere. Annals of Mathematical Statistics, 21:447–452, 1950.

P. E. Hart. The condensed nearest neighbor rule. IEEE Trans. Information Theory, IT-14(5):515–516, May 1968.

T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(6):607–615, June 1996.

P. Indyk. On approximate nearest neighbors under l∞ norm. Journal of Computer and System Sciences, 63(4), 2001.

CMU Informedia digital video library project. The TREC-2001 video track organized by NIST shot boundary task, 2001.

V. Koivune and S. Kassam. Nearest neighbor filters for multivariate data. In IEEE Workshop on Nonlinear Signal and Image Processing, 1995.


P. Komarek and A. W. Moore. Fast robust logistic regression for large sparse datasets with binary outputs. In Artificial Intelligence and Statistics, 2003.

C. Kruegel and G. Vigna. Anomaly detection of web-based attacks. In Proceedings of the 10th ACM Conference on Computer and Communications Security, pages 251–261, 2003.

E. Lee and S. Chae. Fast design of reduced-complexity nearest-neighbor classifiers using triangular inequality. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(5):562–566, May 1998.

T. Liu, A. W. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In Proceedings of Neural Information Processing Systems, 2004a.

T. Liu, K. Yang, and A. Moore. The IOC algorithm: Efficient many-class non-parametric classification for high-dimensional data. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD), 2004b.

S. Maneewongvatana and D. M. Mount. The analysis of a probabilistic approach to nearest neighbor searching. In Proceedings of WADS 2001, 2001.

A. W. Moore. The Anchors Hierarchy: Using the Triangle Inequality to Survive High-Dimensional Data. In Twelfth Conference on Uncertainty in Artificial Intelligence. AAAI Press, 2000.

S. Omachi and H. Aso. A fast algorithm for a k-nn classifier based on branch and bound method and computational quantity estimation. In Systems and Computers in Japan, vol. 31, no. 6, pp. 1–9, 2000.

S. M. Omohundro. Bumptrees for Efficient Function, Constraint, and Classification Learning. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3. Morgan Kaufmann, 1991.

S. M. Omohundro. Efficient Algorithms with Neural Network Behaviour. Journal of Complex Systems, 1(2):273–347, 1987.

A. M. Palau and R. R. Snapp. The labeled cell classifier: A fast approximation to k nearest neighbors. In Proceedings of the 14th International Conference on Pattern Recognition, 1998.

E. P. D. Pednault, B. K. Rosen, and C. Apte. Handling imbalanced data sets in insurance risk modeling, 2000.

D. Pelleg and A. W. Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. ACM, 1999.

A. Pentland, R. Picard, and S. Sclaroff. Photobook: Content-based manipulation of image databases, 1994. URL citeseer.ist.psu.edu/pentland95photobook.html.

F. P. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, 1985.

Y. Qi, A. Hauptman, and T. Liu. Supervised classification for video shot segmentation. In Proceedings of IEEE International Conference on Multimedia and Expo, 2003.


G. L. Ritter, H. B. Woodruff, S. R. Lowry, and T. L. Isenhour. An algorithm for a selective nearest neighbor decision rule. IEEE Trans. Information Theory, IT-21(11):665–669, November 1975.

G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York, NY, 1983.

I. K. Sethi. A fast algorithm for recognizing nearest neighbors. IEEE Trans. Systems, Man, and Cybernetics, SMC-11(3):245–248, March 1981.

A. Smeulders and R. Jain, editors. Image Databases and Multi-media Search. World Scientific Publishing Company, 1996.

S. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. Chan. Credit card fraud detection using meta-learning: Issues and initial results, 1997. URL citeseer.nj.nec.com/stolfo97credit.html.

Y. Hamamoto, S. Uchimura, and S. Tomita. A bootstrap technique for nearest neighbor classifier design. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(1):73–79, 1997.

J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40:175–179, 1991.

K. Woods, K. Bowyer, and W. P. Kegelmeyer Jr. Combination of multiple classifiers using local accuracy estimates. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997.

B. Zhang and S. Srihari. Fast k-nearest neighbor classification using cluster-based trees. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(4):525–528, April 2004.

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems: PODS 1996. ACM, 1996.

W. Zheng and A. Tropsha. A Novel Variable Selection QSAR Approach based on the K-Nearest Neighbor Principle. J. Chem. Inf. Comput. Sci., 40(1):185–194, 2000.
