A fast approximately k-nearest-neighbour search algorithm for classification tasks

Francisco Moreno-Seco, Luisa Micó, and José Oncina

Dept. Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante, Spain
{paco,mico,oncina}@dlsi.ua.es
Abstract. The k-nearest-neighbour (k-NN) search algorithm is widely used in pattern classification tasks, since using k neighbours usually yields lower error rates than the nearest neighbour alone. A large set of fast k-NN search algorithms has been developed; most of them are extensions of fast NN search algorithms on which the condition of finding exactly the k nearest neighbours is imposed. All these algorithms compute a number of distances that increases with k, and a vector-space representation of the data is usually required. If the condition of finding exactly the k nearest neighbours is relaxed, further reductions in the number of distance computations can be obtained. In this work we propose a modification of LAESA (Linear Approximating and Eliminating Search Algorithm, a fast NN search algorithm for metric spaces) that uses a certain neighbourhood to lower error rates and reduce the number of distance computations at the same time.

Keywords: Nearest Neighbour, Metric Spaces, Pattern Recognition.
The authors wish to thank the Spanish CICyT for partial support of this work through project TIC97-0941.
1 Introduction
Non-parametric classification is one of the most widely used techniques in pattern recognition [2]. One of the simplest (and most popular) techniques is the nearest-neighbour (NN) classifier which, given an unknown sample x, finds the prototype p in the training set which is closest to x, and then classifies x in the same class as p. The NN classifier usually obtains acceptable error rates, but it is possible to obtain better (lower) error rates using a number k of nearest neighbours. Thus, a k-nearest-neighbour (k-NN) classifier finds the k nearest neighbours of the sample x and then, through a voting process, classifies x in the class which has most representatives among those k nearest neighbours.
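The voting rule just described can be sketched as follows; this is a minimal Python illustration with hypothetical names (and an exhaustive distance computation), not an implementation from the paper:

```python
from collections import Counter

def knn_classify(x, prototypes, labels, dist, k):
    """Classify sample x by majority vote among its k nearest prototypes."""
    # Rank all prototypes by distance to x and keep the k closest.
    ranked = sorted(range(len(prototypes)), key=lambda i: dist(x, prototypes[i]))
    top_k = ranked[:k]
    # Vote: the class with most representatives among the k neighbours wins.
    votes = Counter(labels[i] for i in top_k)
    return votes.most_common(1)[0][0]

# Toy usage with the absolute-difference distance on 1-D points.
euclid = lambda a, b: abs(a - b)
protos = [0.0, 0.2, 0.9, 1.0, 1.1]
classes = ["A", "A", "B", "B", "B"]
print(knn_classify(0.85, protos, classes, euclid, k=3))  # B
```

With k = 1 this reduces to the plain NN classifier.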
Usually, these classifiers are implemented through an exhaustive search; that is, all the distances between the sample and the prototypes in the training set are calculated. When the representation space is a Euclidean space, this exhaustive search is usually not very time-consuming. On the other hand, when working on a metric space (a representation space with some kind of metric defined on it) in which the temporal cost of calculating the distance between two prototypes is high (as for instance the edit distance when classifying handwritten characters [8]), the exhaustive search becomes impractical.
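To illustrate why such distances are costly: the edit (Levenshtein) distance of [8] is computed by dynamic programming in time proportional to the product of the string lengths, so every single distance computation is expensive. A standard sketch (not code from the paper):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions turning string a into string b.  Runs in O(len(a)*len(b))
    time, which makes exhaustive search over many string prototypes costly."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from a[:0] to every prefix of b
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

print(edit_distance("kitten", "sitting"))  # 3
```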
Several algorithms (AESA [7], LAESA [5] and TLAESA [4], among others) have been developed which find the nearest neighbour in a metric space with a low number of distance calculations. Also, the AESA algorithm has been extended [1] to find the k nearest neighbours with a low number of distance calculations. In this paper, we present an extension of the LAESA algorithm that uses at most k neighbours to classify the sample. Although this extension does not find exactly the k nearest neighbours, the error rates obtained are very close to those of a classifier that uses the exact k nearest neighbours.
In the next section we introduce the LAESA algorithm and the proposed modification, the Approximating k-LAESA (Ak-LAESA) algorithm. The following sections describe the experiments and the results obtained, which show that Ak-LAESA obtains error rates close to those of a k-NN classifier while computing a reduced number of distances. Finally, we present the conclusions and outline some future work.
2 The Ak-LAESA algorithm
The Ak-LAESA algorithm is derived from the LAESA algorithm [5, 6]. The latter relies on the triangle inequality to prune the training set: it maintains a lower bound of the real distance between each prototype and the sample, which is used to eliminate all those prototypes that cannot be closer to the sample than a given one. The LAESA algorithm has two main steps:
1. Preprocessing step:
This step is carried out before the classification begins. First, it selects a set of base prototypes; their number depends exclusively on the dimensionality of the data (see [5] for details). Then, it calculates and stores the actual distances between each base prototype and all the other prototypes in the training set (including the base prototypes).
2. Classification step:
During this step, the distances calculated in the preprocessing step are used to obtain a lower bound of the distance between each prototype and the sample. Using this lower bound, the algorithm iteratively finds a good candidate to nearest neighbour, calculates the actual distance between this candidate and the sample, and prunes the training set (all prototypes whose lower bound exceeds the actual distance between the candidate and the sample can be safely eliminated), until there is only one prototype left: the nearest neighbour.
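The lower bound exploited here follows from the triangle inequality: for any base prototype b, |d(x, b) - d(p, b)| <= d(x, p). A minimal Python sketch of this bounding-and-pruning idea (helper names are ours, not the authors'):

```python
def lower_bounds(x_to_base, base_to_protos):
    """Lower-bound d(x, p) for every prototype p via the triangle inequality:
    |d(x, b) - d(p, b)| <= d(x, p) for each base prototype b.
    x_to_base[b]       : computed distances from the sample x to each base.
    base_to_protos[b][p]: distances stored during preprocessing.
    Returns the tightest (largest) bound per prototype over all bases."""
    bounds = {}
    for b, d_xb in x_to_base.items():
        for p, d_pb in base_to_protos[b].items():
            g = abs(d_xb - d_pb)
            if g > bounds.get(p, 0.0):
                bounds[p] = g
    return bounds

def prune(candidates, bounds, d_best):
    """Keep only prototypes that could still beat the current best distance."""
    return [p for p in candidates if bounds.get(p, 0.0) < d_best]

# 1-D toy: base b0 at 0.0, sample x at 10.0, prototypes at 1.0 and 20.0.
print(lower_bounds({"b0": 10.0}, {"b0": {"p1": 1.0, "p2": 20.0}}))
# {'p1': 9.0, 'p2': 10.0}
```

For points on a line the bound is exact; in general it only gets tighter as more base prototypes are used.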
The Ak-LAESA (see figure 1) is a simple but powerful evolution of the LAESA algorithm that simply stops the search for the nearest neighbour when the number of remaining (not pruned) prototypes is less than a number k. Then, it classifies the sample by voting among those prototypes. Our experiments show that the number of prototypes used in the voting is in general smaller than k, and that those prototypes are not exactly the k nearest neighbours. Despite this, the error rates of this algorithm are very close (within 1%) to those of a classifier using the exact k nearest neighbours. Moreover, the algorithm calculates fewer distances than a classifier using the k nearest neighbours, and also fewer than the LAESA algorithm itself.
3 Experiments
In [6] we presented some results of the Ak-LAESA algorithm on synthetic data with 2 classes. In this work, the algorithm for generating clustered data that appeared in [3] has been used to generate synthetic data from 4 classes, with two values for the dimensionality of the generated data: 6 and 10. This algorithm generates random synthetic data from different classes (clusters) with a given maximum overlap between them. Each class follows a Gaussian distribution with the same variance and a different, randomly chosen mean. The overlap was set to 0.04 and the variance to 0.05, in order to obtain low error rates (less than 8% for the NN classifier).
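A minimal sketch of this kind of clustered data generation (not the exact algorithm of [3], which additionally controls the maximum overlap; names and defaults are ours):

```python
import random

def make_clusters(n_classes=4, n_per_class=256, dim=6, variance=0.05, seed=0):
    """Generate clustered synthetic data: each class is an isotropic Gaussian
    with a randomly chosen mean and a common variance."""
    rng = random.Random(seed)
    std = variance ** 0.5
    data, labels = [], []
    for c in range(n_classes):
        mean = [rng.uniform(0.0, 1.0) for _ in range(dim)]
        for _ in range(n_per_class):
            data.append([rng.gauss(mu, std) for mu in mean])
            labels.append(c)
    return data, labels

X, y = make_clusters()
print(len(X), len(X[0]), sorted(set(y)))  # 1024 6 [0, 1, 2, 3]
```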
Once the synthetic data were prepared, several partitions into training and test sets of different sizes were made. The training set sizes used were 1024, 2048, 3072, 4096, 6144 and 8192; the test set size was 512 for all training set sizes. All experiments were repeated 16 times with different training and test sets of each size, in order to obtain statistically significant results.
In the first experiment we compare the Ak-LAESA error rates with those of the NN classifier and of several k-NN classifiers. The experiment has been repeated for k = 7 and k = 17, using data of different dimensionalities (see figure 2). As we
Input:     P ⊆ E, n = |P|;     { finite set of training prototypes }
           B ⊆ P, m = |B|;     { set of base prototypes }
           D ∈ R^(n×m);        { precomputed n×m array of interprototype distances }
           x ∈ E;              { test sample }
           k ∈ N;              { maximum number of neighbours to use }
Output:    c ∈ N;              { class assigned to the sample x }
Functions: d : E × E → R;      { distance function }
           VOTING : 2^E → N;   { voting function }
Variables: p, q, s, b ∈ P;
           G ∈ R^n;            { lower bounds array }
           p* ∈ E;             { best candidate to nearest neighbour }
           d*, d_xs, g_p, g_q, g_b ∈ R;
           n_d ∈ N;            { number of computed distances }
           stop : Boolean;     { used to stop the search }
begin
  d* := ∞; p* := indeterminate; G := [0];
  s := arbitrary_element(B); n_d := 0; stop := false;
  while |P| > 0 and not stop do
    d_xs := d(x, s); n_d := n_d + 1;          { distance computation }
    P := P − {s};
    if d_xs < d* then
      d* := d_xs; p* := s                     { updating d*, p* }
    endif
    b := indeterminate; g_b := ∞; q := indeterminate; g_q := ∞;
    for every p ∈ P do                        { eliminating and approximating loop }
      if s ∈ B then
        G[p] := max(G[p], |D[p, s] − d_xs|)   { updating G, if possible }
      endif
      g_p := G[p];
      if p ∈ B then                           { approximating: selecting from B }
        if g_p < g_b then g_b := g_p; b := p endif
      else                                    { p ∉ B }
        if g_p >= d* then
          P := P − {p}                        { eliminating from P − B }
        else                                  { approximating: selecting from P − B }
          if g_p < g_q then g_q := g_p; q := p endif
        endif
      endif
    endfor
    if b ≠ indeterminate then s := b
    elsif |P| < k then stop := true           { stop the search }
    else s := q endif
  endwhile
  P := P ∪ {p*};                              { retrieve the best candidate to NN }
  c := VOTING(P);
end

Fig. 1. Algorithm Ak-LAESA
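The pseudocode of figure 1 can be transliterated into Python roughly as follows. This is an illustrative sketch, not the authors' implementation: the distance counter is renamed n_dist (the pseudocode reuses n), D is represented as a dict of dicts indexed by prototype and base, and voting is a simple majority count.

```python
import math
from collections import Counter

def ak_laesa(x, protos, labels, bases, D, dist, k):
    """Sketch of Ak-LAESA.  protos: prototype indices; bases: subset used as
    base prototypes; D[p][b]: precomputed distance between prototype p and
    base b; dist(x, p): the (expensive) actual distance computation."""
    P = set(protos)
    G = {p: 0.0 for p in protos}        # lower bounds array
    d_best, p_best = math.inf, None     # best candidate to nearest neighbour
    s = next(iter(bases))               # start from an arbitrary base prototype
    n_dist, stop = 0, False
    while P and not stop:
        d_xs = dist(x, s); n_dist += 1  # the only actual distance computations
        P.discard(s)
        if d_xs < d_best:
            d_best, p_best = d_xs, s
        b, g_b, q, g_q = None, math.inf, None, math.inf
        for p in list(P):               # eliminating and approximating loop
            if s in bases:              # tighten the lower bound, if possible
                G[p] = max(G[p], abs(D[p][s] - d_xs))
            g_p = G[p]
            if p in bases:              # approximating: selecting from B
                if g_p < g_b:
                    g_b, b = g_p, p
            elif g_p >= d_best:
                P.discard(p)            # eliminating from P - B
            elif g_p < g_q:             # approximating: selecting from P - B
                g_q, q = g_p, p
        if b is not None:
            s = b
        elif len(P) < k:
            stop = True                 # fewer than k prototypes remain: vote
        else:
            s = q
    P.add(p_best)                       # retrieve the best candidate to NN
    votes = Counter(labels[p] for p in P)
    return votes.most_common(1)[0][0], n_dist

# Toy 1-D example: 6 prototypes, bases 0 and 5, absolute-difference distance.
coords = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
labels = ["A", "A", "A", "B", "B", "B"]
D = {p: {b: abs(coords[p] - coords[b]) for b in (0, 5)} for p in range(6)}
label, n_used = ak_laesa(1.05, list(range(6)), labels, {0, 5}, D,
                         lambda x, p: abs(x - coords[p]), k=3)
print(label)  # B
```

Note that only distances from the sample to the successively chosen candidates s are computed; all other prototypes are handled through the precomputed bounds.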
could expect, Ak-LAESA has lower error rates than the NN classifier (and also lower than some k-NN classifiers), but they are slightly higher than those of the k-NN classifier using the same value of k. Another point is that Ak-LAESA does not always use k prototypes to classify the sample: it uses at most k. This feature could explain the difference between the Ak-LAESA error rates and the k-NN classifier error rates for a fixed value of k. The average number of neighbours used by Ak-LAESA was calculated in order to perform a more accurate comparison (see figure 6). The numbers were 5.27 for k = 7 when the dimensionality was 6, and 6.78 prototypes with dimensionality 10. For k = 17, the numbers were 9.22 and 15.42, respectively. Figure 2 shows a comparison of Ak-LAESA with a k-NN classifier that used approximately the same number of prototypes (6, 7, 10 and 16, respectively), and also with k-NN classifiers with k = 7 and k = 17. The error rates of an NN and a 3-NN classifier have also been plotted.
Even though the Ak-LAESA error rates are slightly higher than those of the k-NN classifiers (but lower than those of an NN classifier), our experiments show that the number of distance calculations of Ak-LAESA does not depend on the size of the training set (see figure 3), while the number of distance calculations of a k-NN classifier (when using exhaustive search) is exactly the size of the training set. When the temporal cost of calculating the distance between two prototypes is high, Ak-LAESA is therefore faster than the k-NN classifier, while obtaining error rates very close to those of the k-NN classifier.
The second experiment was performed to show that Ak-LAESA calculates fewer distances than the LAESA algorithm. Also, as happens with the LAESA algorithm, the number of distance calculations does not depend on the training set size. Figure 3 shows the average number of distance calculations of the LAESA algorithm and of Ak-LAESA with k = 3 and k = 7.
The aim of the third experiment was to show the behaviour of Ak-LAESA as the value of k increases, and a comparison with the behaviour of a k-NN classifier was also made. As shown in figure 4, the error rate starts (with dimensionality 6) at a value close to 7% for both classifiers. As the value of k increases, the rate tends to 4% for the k-NN classifier and to 5% for Ak-LAESA. Figure 5 plots the average difference between the Ak-LAESA error rates and the k-NN classifier error rates, showing that the k-NN classifier error rates are lower, but the difference is (with dimensionality 6 and 10) less than 1%.
Finally, an experiment was developed to find out how many of the prototypes used by the Ak-LAESA algorithm to classify were actually among the k nearest neighbours (see figure 6). The tables in figure 6 show, for two different values of k:
1. the average number of prototypes used by Ak-LAESA in the voting process,
2. how many of these prototypes are among the k nearest neighbours (and also the percentage), that is, how many of the final prototypes used by Ak-LAESA are one of the k nearest neighbours,
3. how many are among the 2k nearest neighbours, and
4. how many are among the 4k nearest neighbours.
[Figure: four panels plotting error rate (%) against training set size (0-8000), for dimensionality 6 and 10 and for k = 7 and k = 17, each comparing NN, 3-NN, k-NN and Ak-LAESA classifiers.]
Fig. 2. Error rates of the Ak-LAESA classifier compared to k-NN classifiers as the training set size increases.
[Figure: two panels (dimensionality 6 and 10) plotting the number of distance calculations against training set size (0-8000) for LAESA, A3-LAESA and A7-LAESA.]
Fig. 3. Average number of distance calculations of the LAESA and Ak-LAESA algorithms as the training set size increases.
[Figure: four panels plotting error rate (%) against the value of k (0-200) for the k-NN classifier and Ak-LAESA, at dimensionality 6 and 10, with 2048, 4096 and 8192 prototypes.]
Fig. 4. Error rates of the Ak-LAESA classifier compared to the k-NN classifier as the value of k increases.
[Figure: average difference between error rates (%) against the value of k (0-400), for dimensionality 6 and 10.]
Fig. 5. Average difference between the error rates of the Ak-LAESA classifier and a k-NN classifier as the value of k increases.
The results show that the number of prototypes used by Ak-LAESA which are among the k nearest neighbours decreases as the dimensionality increases, while at the same time the Ak-LAESA error rates improve (see figure 4).
Dimensionality 6
Value of k  Prototypes used  Among k NN   Among 2k NN  Among 4k NN
 7          5.27             2.33 (44%)   3.04 (58%)   3.74 (71%)
17          9.22             5.14 (56%)   6.49 (70%)   7.61 (83%)

Dimensionality 10
Value of k  Prototypes used  Among k NN   Among 2k NN  Among 4k NN
 7          6.78             1.42 (21%)   1.86 (27%)   2.56 (38%)
17         15.42             3.61 (23%)   5.57 (36%)   8.15 (53%)

Fig. 6. Prototypes used by the Ak-LAESA which are among the k nearest neighbours.
4 Conclusions and future work
We have developed a fast classifier based on the LAESA algorithm [5] which obtains error rates close to those of a k-NN classifier while calculating a low number of distances. Moreover, the time and space complexities of the Ak-LAESA algorithm are the same as those of the LAESA algorithm. The Ak-LAESA error rates (and their behaviour as k increases) are very close to those of a classifier that uses the k nearest neighbours. The Ak-LAESA performance does not seem to degrade as the size of the training set grows, and its behaviour seems to improve as the dimensionality of the data increases.
In the future, we will explore the behaviour of Ak-LAESA with data of dimensionalities higher than those used in these experiments. We also plan to apply the algorithm to tasks with real data.
Currently, the base prototype selection algorithm of Ak-LAESA has been borrowed from the LAESA algorithm, and we have also used LAESA's optimal number of base prototypes for each value of the dimensionality. We think that a different number of base prototypes, or a different selection algorithm for them, can improve the error rates of Ak-LAESA, especially with low-dimensionality data.
Acknowledgements: The authors wish to thank Jorge Calera-Rubio and Mikel L. Forcada for their invaluable help in writing this paper.
References
1. Aibar, P., Juan, A., Vidal, E.: Extensions to the approximating and eliminating search algorithm (AESA) for finding k-nearest-neighbours. New Advances and Trends in Speech Recognition and Coding (1993) 23-28
2. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. Wiley (1973)
3. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall (1988)
4. Micó, L., Oncina, J., Carrasco, R.C.: A fast branch and bound nearest neighbour classifier in metric spaces. Pattern Recognition Letters 17 (1996) 731-739
5. Micó, L., Oncina, J., Vidal, E.: A new version of the nearest neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognition Letters 15 (1994) 9-17
6. Moreno-Seco, F., Oncina, J., Micó, L.: Improving the LAESA algorithm error rates. In: Proceedings of the VIII Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes, Bilbao (1999) 413-419
7. Vidal, E.: New formulation and improvements of the Nearest-Neighbour Approximating and Eliminating Search Algorithm (AESA). Pattern Recognition Letters 15 (1994) 1-7
8. Wagner, R.A., Fischer, M.J.: The String-to-String Correction Problem. Journal of the Association for Computing Machinery 21(1) (1974) 168-173