
A fast approximately k-nearest-neighbour search algorithm for classification tasks

Francisco Moreno-Seco, Luisa Micó, and José Oncina*

Dept. Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante, Spain
{paco,mico,oncina}@dlsi.ua.es

Abstract. The k-nearest-neighbour (k-NN) search algorithm is widely used in pattern classification tasks. A large set of fast k-NN search algorithms have been developed in order to obtain lower error rates. Most of them are extensions of fast NN search algorithms in which the condition of finding exactly the k nearest neighbours is imposed. All these algorithms calculate a number of distances that increases with k, and they usually need a vector-space representation. If the condition of finding exactly the k nearest neighbours is relaxed, further reductions in the number of distance computations can be obtained. In this work we propose a modification of LAESA (Linear Approximating and Eliminating Search Algorithm, a fast NN search algorithm for metric spaces) that uses a certain neighbourhood to lower error rates and reduce the number of distance computations at the same time.

Keywords: Nearest Neighbour, Metric Spaces, Pattern Recognition.

* The authors wish to thank the Spanish CICyT for partial support of this work through project TIC97-0941.


1 Introduction

Non-parametric classification is one of the most widely used techniques in pattern recognition [2]. One of the simplest (and most popular) techniques is the nearest-neighbour (NN) classifier which, given an unknown sample x, finds the prototype p in the training set which is closest to x and then classifies x in the same class as p. The NN classifier usually obtains acceptable error rates, but it is possible to obtain better (lower) error rates using a number k of nearest neighbours. Thus, a k-nearest-neighbour (k-NN) classifier finds the k nearest neighbours of the sample x and then, through a voting process, classifies x in the class which has the most representatives among those k nearest neighbours.
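
In code, the k-NN rule is just one distance computation per prototype followed by a vote. The following minimal Python sketch is our own illustration (names such as knn_classify and dist are not taken from any reference implementation):

    from collections import Counter

    def knn_classify(prototypes, labels, x, k, dist):
        # Exhaustive search: one distance computation per training prototype.
        d = [dist(x, p) for p in prototypes]
        nearest = sorted(range(len(prototypes)), key=d.__getitem__)[:k]
        # Classify x in the class with most representatives among the k nearest.
        votes = Counter(labels[i] for i in nearest)
        return votes.most_common(1)[0][0]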

Usually, these classifiers are implemented through an exhaustive search; that is, all the distances between the sample and the prototypes in the training set are calculated. When the representation space is a Euclidean space, this exhaustive search is usually not very time-consuming. On the other hand, when working on a metric space¹ in which the temporal cost of calculating the distance between two prototypes is high (as, for instance, the edit distance when classifying handwritten characters [8]), the exhaustive search becomes impractical.

¹ A metric space is a representation space which has some kind of metric defined.

Several algorithms (AESA [7], LAESA [5] and TLAESA [4], among others) have been developed which find the nearest neighbour in a metric space with a low number of distance calculations. Also, the AESA algorithm has been extended [1] to find the k nearest neighbours with a low number of distance calculations. In this paper, we present an extension of the LAESA algorithm that uses at most k neighbours to classify the sample. Although this extension does not find exactly the k nearest neighbours, the error rates obtained are very close to those of a classifier that uses the exact k nearest neighbours.

In the next section we introduce the LAESA algorithm and the proposed modification, the Approximating k-LAESA (Ak-LAESA) algorithm. The following sections describe the experiments and the results obtained, which show that Ak-LAESA obtains error rates close to those of a k-NN classifier while calculating a reduced number of distances. Finally, we present the conclusions and outline some future work.

2 The Ak-LAESA algorithm

The Ak-LAESA algorithm is derived from the LAESA algorithm [5, 6]. This latter algorithm relies on the triangle inequality to prune the training set: for each prototype p it maintains a lower bound of the real distance between p and the sample x (if the distances to a base prototype b are known, then |d(p, b) − d(x, b)| ≤ d(x, p)), and this bound is used to eliminate all those prototypes that cannot be closer to the sample than a given one. The LAESA algorithm has two main steps:

1. Preprocessing step:
   This step is carried out before the classification begins. First, it selects a set of a number of prototypes² called base prototypes. Then, it calculates and stores the actual distances between each base prototype and all the other prototypes in the training set (including the base prototypes).
   ² This number depends exclusively on the dimensionality of the data (see [5] for more details).

2. Classification step:
   During this step, the distances calculated in the preprocessing step are used to obtain a lower bound of the distance between each prototype and the sample. Using this lower bound, the algorithm iteratively finds a good candidate to nearest neighbour, calculates the actual distance between this candidate and the sample, and prunes the training set³, until there is only one prototype left: the nearest neighbour.
   ³ All the prototypes whose lower bound of the distance is bigger than the actual distance between the candidate and the sample can be safely eliminated.
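
A minimal Python sketch of this lower bound and of the pruning rule it enables follows; it is our own illustration under assumed data structures, not the authors' code, and it costs no new distance computations beyond those already made to the base prototypes:

    def lower_bound(p, D, d_x_to_base):
        # D[p][j]: precomputed distance from prototype p to base prototype j.
        # d_x_to_base: {j: d(x, base_j)} for the base distances computed so far
        # during the search (assumed non-empty).
        # By the triangle inequality, |D[p][j] - d(x, base_j)| <= d(x, p).
        return max(abs(D[p][j] - dxb) for j, dxb in d_x_to_base.items())

    def prune(candidates, D, d_x_to_base, d_candidate):
        # A prototype whose lower bound already exceeds the distance from x
        # to the current candidate cannot be the nearest neighbour: drop it.
        return [p for p in candidates
                if lower_bound(p, D, d_x_to_base) < d_candidate]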

The Ak-LAESA (see figure 1) is a simple but powerful evolution of the LAESA algorithm: it simply stops the search for the nearest neighbour when the number of remaining (not pruned) prototypes is less than k, and then classifies the sample by voting among those prototypes. Our experiments show that the number of prototypes used in the voting is in general smaller than k, and that those prototypes are not exactly the k nearest neighbours. Despite this, the error rates of the algorithm are very close (within 1%) to those of a classifier using the exact k nearest neighbours. Moreover, the algorithm calculates fewer distances than a classifier using the exact k nearest neighbours, and also fewer than the LAESA algorithm itself.

3 Experiments

In [6] we presented some results of the Ak-LAESA algorithm on synthetic data using 2 classes. In this work, the algorithm for generating clustered data that appears in [3] has been used to generate synthetic data from 4 classes, with two values for the dimensionality of the generated data: 6 and 10. This algorithm generates random synthetic data from different classes (clusters) with a given maximum overlap between them. Each class follows a Gaussian distribution with the same variance and a different, randomly chosen mean. The overlap was set to 0.04 and the variance to 0.05, in order to obtain low error rates (less than 8% for the NN classifier).
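
The overlap-controlled generator of [3] is not reproduced here; the Python sketch below only mirrors the general setup described above (equal-variance isotropic Gaussian classes with randomly chosen means), with the overlap control deliberately omitted, so all names and parameter choices are our own assumptions:

    import numpy as np

    def make_clusters(n_classes=4, dim=6, n_per_class=256,
                      variance=0.05, seed=0):
        # Each class is a Gaussian with the same variance and a random mean.
        # Unlike the algorithm of [3], this sketch does not bound the overlap.
        rng = np.random.default_rng(seed)
        means = rng.uniform(0.0, 1.0, size=(n_classes, dim))
        X = np.vstack([rng.normal(m, np.sqrt(variance), size=(n_per_class, dim))
                       for m in means])
        y = np.repeat(np.arange(n_classes), n_per_class)
        return X, y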

Once the synthetic data were prepared, several partitions into training and test sets of different sizes were made. The training set sizes used were 1024, 2048, 3072, 4096, 6144 and 8192; the test set size was 512 for all training set sizes. All experiments were repeated 16 times with different training and test sets of each size, in order to obtain statistically significant results.

In the first experiment we compare Ak-LAESA error rates with those of the NN classifier and various k-NN classifiers. The experiment was repeated for k = 7 and k = 17, using data of different dimensionalities (see figure 2).

Input:     P ⊆ E, n = |P|;       { finite set of training prototypes }
           B ⊆ P, m = |B|;       { set of base prototypes }
           D ∈ R^(n×m);          { precomputed n×m array of interprototype distances }
           x ∈ E;                { test sample }
           k ∈ N;                { maximum number of neighbours to use }
Output:    c ∈ N;                { class assigned to the sample x }
Functions: d : E × E → R;        { distance function }
           VOTING : E → N;       { voting function }
Variables: p, q, s, b ∈ P;
           G ∈ R^n;              { lower bounds array }
           p' ∈ E;               { best candidate to nearest neighbour }
           d', d_xs, g_p, g_q, g_b ∈ R;
           nd ∈ N;               { number of computed distances }
           stop : Boolean;       { used to stop the search }
begin
  d' := ∞; p' := indeterminate; G := [0];
  s := arbitrary_element(B); nd := 0; stop := false;
  while |P| > 0 and not stop do
    d_xs := d(x, s); nd := nd + 1;                  { distance computation }
    P := P − {s};
    if d_xs < d' then
      d' := d_xs; p' := s                           { updating d', p' }
    endif
    b := indeterminate; g_b := ∞; q := indeterminate; g_q := ∞;
    for every p ∈ P do                              { eliminating and approximating loop }
      if s ∈ B then
        G[p] := max(G[p], |D[p, s] − d_xs|);        { updating G, if possible }
      endif
      g_p := G[p];
      if p ∈ B then                                 { approximating: selecting from B }
        if g_p < g_b then g_b := g_p; b := p endif
      else                                          { p ∉ B }
        if g_p >= d' then
          P := P − {p}                              { eliminating from P − B }
        else                                        { approximating: selecting from P − B }
          if g_p < g_q then g_q := g_p; q := p endif
        endif
      endif
    endfor
    if b ≠ indeterminate then s := b
    elsif |P| < k then stop := true                 { stop the search }
    else s := q endif
  endwhile
  P := P ∪ {p'};                                    { retrieve the best candidate to NN }
  c := VOTING(P);
end

Fig. 1. Algorithm Ak-LAESA
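
As a runnable counterpart of figure 1, the following Python transcription is a sketch under our own naming (ak_laesa, dist, prototypes and so on are not from the paper); it assumes the preprocessing step has already produced the n × m matrix D of distances to the base prototypes, and it omits the distance-count bookkeeping (nd) used only for the experiments:

    import numpy as np

    def ak_laesa(prototypes, labels, base_idx, D, x, k, dist):
        # prototypes: training set P; labels: their classes.
        # base_idx: indices of the base prototypes B (columns of D).
        # D[p, j] = dist(prototypes[p], prototypes[base_idx[j]]), precomputed.
        P = set(range(len(prototypes)))       # not-yet-pruned prototypes
        B = set(base_idx)
        col = {b: j for j, b in enumerate(base_idx)}
        G = np.zeros(len(prototypes))         # lower bounds of d(x, p)
        d_best, p_best = float("inf"), None   # best candidate to NN so far
        s = base_idx[0]                       # arbitrary base prototype
        stop = False
        while P and not stop:
            d_xs = dist(x, prototypes[s])     # one real distance per iteration
            P.discard(s)
            if d_xs < d_best:
                d_best, p_best = d_xs, s      # updating d', p'
            b, g_b, q, g_q = None, float("inf"), None, float("inf")
            for p in list(P):                 # eliminating and approximating loop
                if s in B:                    # tighten G via the triangle inequality
                    G[p] = max(G[p], abs(D[p, col[s]] - d_xs))
                g_p = G[p]
                if p in B:                    # approximating: selecting from B
                    if g_p < g_b:
                        g_b, b = g_p, p
                elif g_p >= d_best:
                    P.discard(p)              # eliminating from P - B
                elif g_p < g_q:               # approximating: selecting from P - B
                    g_q, q = g_p, p
            if b is not None:
                s = b
            elif len(P) < k:
                stop = True                   # fewer than k survivors: stop
            else:
                s = q
        P.add(p_best)                         # retrieve the best candidate to NN
        votes = [labels[p] for p in P]
        return max(set(votes), key=votes.count)

As in figure 1, the |P| < k test is only reached once no base prototype remains unexplored, so the final voting set contains the best candidate found plus at most k − 1 surviving prototypes.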


As we could expect, Ak-LAESA has lower error rates than the NN classifier (and also lower than those of some k-NN classifiers), but they are slightly higher than those of the k-NN classifiers using the same value of k. Another point is that Ak-LAESA does not always use k prototypes to classify the sample: it uses at most k. This feature could explain the difference between Ak-LAESA and k-NN classifier error rates for a fixed value of k. In order to perform a more accurate comparison, the average number of neighbours used by Ak-LAESA was calculated (see figure 6): 5.27 for k = 7 when the dimensionality was 6, and 6.78 with dimensionality 10; for k = 17, the numbers were 9.22 and 15.42, respectively. Figure 2 compares Ak-LAESA with a k-NN classifier that used approximately the same number of prototypes (6, 7, 10 and 16, respectively), and also with k-NN classifiers with k = 7 and k = 17. The error rates of the NN and 3-NN classifiers have also been plotted.

Even though Ak-LAESA error rates are slightly higher than those of the k-NN classifiers (but lower than those of an NN classifier), our experiments show that the number of distance calculations of Ak-LAESA does not depend on the size of the training set (see figure 3), while the number of distance calculations of a k-NN classifier (using exhaustive search) is exactly the size of the training set. When the temporal cost of calculating the distance between two prototypes is high, Ak-LAESA is therefore faster than the k-NN classifier, while obtaining error rates very close to it.

The second experiment was performed to show that Ak-LAESA calculates fewer distances than the LAESA algorithm. Also, as with the LAESA algorithm, the number of distance calculations does not depend on the training set size. Figure 3 shows the average number of distance calculations of the LAESA algorithm and of Ak-LAESA with k = 3 and k = 7.

The aim of the third experiment was to show the behaviour of Ak-LAESA as the value of k increases; a comparison with the behaviour of a k-NN classifier was also made. As shown in figure 4, the error rate starts (with dimensionality 6) at a value close to 7% in both classifiers. As the value of k increases, the rate tends to 4% for the k-NN classifier and to 5% for Ak-LAESA. Figure 5 plots the average difference between Ak-LAESA error rates and k-NN classifier error rates, showing that the k-NN classifier error rates are lower, but the difference is (with dimensionality 6 and 10) less than 1%.

Finally, an experiment was carried out to find out how many of the prototypes used by the Ak-LAESA algorithm for classification were actually among the k nearest neighbours (see figure 6). The tables in figure 6 show, for two different values of k:

1. the average number of prototypes used by Ak-LAESA in the voting process,
2. how many of these prototypes are among the k nearest neighbours (also as a percentage), that is, how many of the final prototypes used by Ak-LAESA are one of the k nearest neighbours,
3. how many are among the 2k nearest neighbours, and
4. how many are among the 4k nearest neighbours.

[Figure 2: four panels plotting error rate (%) against training set size, for dimensionality 6 and 10 and for k = 7 and k = 17; each panel shows the NN, 3-NN and k-NN classifiers and the corresponding Ak-LAESA.]

Fig. 2. Error rates of the Ak-LAESA classifier compared to k-NN classifiers when the training set size increases.

[Figure 3: two panels (dimensionality 6 and 10) plotting the number of distance calculations against training set size, for LAESA, A3-LAESA and A7-LAESA.]

Fig. 3. Average number of distance calculations of the LAESA and Ak-LAESA algorithms, as the training set size increases.

[Figure 4: four panels plotting error rate (%) against the value of k (up to 200) for the k-NN classifier and for Ak-LAESA, at dimensionality 6 and 10, with 2048, 4096 and 8192 prototypes.]

Fig. 4. Error rates of the Ak-LAESA classifier compared to the k-NN classifier, as the value of k increases.

[Figure 5: a single panel plotting the average difference between error rates (%) against the value of k (up to 400), for dimensionality 6 and 10.]

Fig. 5. Average difference between the error rates of the Ak-LAESA classifier and a k-NN classifier, as the value of k increases.

The results show that the number of prototypes used by Ak-LAESA which are among the k nearest neighbours decreases as the dimensionality increases, while at the same time Ak-LAESA error rates improve (see figure 4).

Dimensionality 6
Value of k   Prototypes used   Among k NN    Among 2k NN   Among 4k NN
7            5.27              2.33 (44%)    3.04 (58%)    3.74 (71%)
17           9.22              5.14 (56%)    6.49 (70%)    7.61 (83%)

Dimensionality 10
Value of k   Prototypes used   Among k NN    Among 2k NN   Among 4k NN
7            6.78              1.42 (21%)    1.86 (27%)    2.56 (38%)
17           15.42             3.61 (23%)    5.57 (36%)    8.15 (53%)

Fig. 6. Prototypes used by the Ak-LAESA which are among the k nearest neighbours.

4 Conclusions and future work

We have developed a fast classifier based on the LAESA algorithm [5] which obtains error rates close to those of a k-NN classifier while calculating a low number of distances. The temporal and spatial complexities of the Ak-LAESA algorithm are the same as those of the LAESA algorithm. The Ak-LAESA error rates (and their behaviour as k increases) are very close to those of a classifier that uses the exact k nearest neighbours. The performance of Ak-LAESA does not seem to degrade as the size of the training set grows, and its behaviour seems to improve as the dimensionality of the data increases.

In the future, we will explore the behaviour of Ak-LAESA with data of higher dimensionalities than those used in these experiments. We also plan to apply the algorithm to tasks with real data.

Currently, the base prototype selection algorithm of Ak-LAESA is borrowed from the LAESA algorithm, and we have also used LAESA's optimal number of base prototypes for each value of the dimensionality. We believe that a different number of base prototypes, or a different selection algorithm for them, could improve the error rates of Ak-LAESA, especially with low-dimensionality data.

Acknowledgements: The authors wish to thank Jorge Calera-Rubio and Mikel L. Forcada for their invaluable help in writing this paper.


References

1. Aibar, P., Juan, A., Vidal, E.: Extensions to the approximating and eliminating search algorithm (AESA) for finding k-nearest-neighbours. New Advances and Trends in Speech Recognition and Coding (1993) 23-28
2. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. Wiley (1973)
3. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall (1988)
4. Micó, L., Oncina, J., Carrasco, R.C.: A fast branch and bound nearest neighbour classifier in metric spaces. Pattern Recognition Letters 17 (1996) 731-739
5. Micó, L., Oncina, J., Vidal, E.: A new version of the nearest neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing-time and memory requirements. Pattern Recognition Letters 15 (1994) 9-17
6. Moreno-Seco, F., Oncina, J., Micó, L.: Improving the LAESA algorithm error rates. In: Proceedings of the VIII Symposium Nacional de Reconocimiento de Formas y Análisis de Imágenes, Bilbao (1999) 413-419
7. Vidal, E.: New formulation and improvements of the Nearest-Neighbour Approximating and Eliminating Search Algorithm (AESA). Pattern Recognition Letters 15 (1994) 1-7
8. Wagner, R.A., Fischer, M.J.: The String-to-String Correction Problem. Journal of the Association for Computing Machinery 21(1) (1974) 168-173

