
An evidence-theoretic k-NN rule with parameter optimization*

L. M. Zouhal†,‡ and T. Denœux†

† Université de Technologie de Compiègne – U.M.R. CNRS 6599 Heudiasyc
BP 20529 – F-60205 Compiègne cedex – France
email: Thierry.Denoeux@hds.utc.fr

‡ Centre International des Techniques Informatiques
Lyonnaise des Eaux

* Technical Report, Heudiasyc. To appear in IEEE Transactions on Systems, Man and Cybernetics.

Abstract

This paper presents a learning procedure for optimizing the parameters in the evidence-theoretic k-nearest neighbor rule, a pattern classification method based on the Dempster-Shafer theory of belief functions. In this approach, each neighbor of a pattern to be classified is considered as an item of evidence supporting certain hypotheses concerning the class membership of that pattern. Based on this evidence, basic belief masses are assigned to each subset of the set of classes. Such masses are obtained for each of the k nearest neighbors of the pattern under consideration and aggregated using Dempster's rule of combination. In many situations, this method was found experimentally to yield lower error rates than other methods using the same information. However, the problem of tuning the parameters of the classification rule was so far unresolved. In this paper, we propose to determine optimal or near-optimal parameter values from the data by minimizing an error function. This refinement of the original method is shown experimentally to result in a substantial improvement of classification accuracy.

1 Introduction

In the classical approach to statistical pattern recognition, the entities to be classified are assumed to be selected by some form of random experiment. The feature vector describing each entity is then a random vector with a well-defined (though unknown) probability density function depending on the pattern category. Based on these densities and on the prior probability of each class, posterior probabilities can be defined, and the optimal Bayes decision rule can then theoretically be used for classifying an arbitrary pattern with minimal expected risk. Since the class-conditional densities and prior probabilities are usually unknown, they need to be estimated from the data. Many methods have been proposed for building consistent estimators of the posterior probabilities under various assumptions. However, for finite sample size, the resulting estimates generally do not provide a faithful representation of the fundamental uncertainty pertaining to the class of a pattern to be classified. For example, if only a relatively small number of training vectors is available, and a new pattern is encountered that is very dissimilar from all previous patterns, the uncertainty is quite high, and this situation of near-ignorance is not reflected by the outputs of a conventional parametric or nonparametric statistical classifier, whose principle fundamentally relies on asymptotic assumptions. This problem is particularly acute in situations in which decisions need to be made based on weak information, as commonly encountered in system diagnosis applications, for example.

As an attempt to provide an answer to the above problem, it was recently suggested to reformulate the pattern classification problem by considering the following question: given a training set of finite size containing feature vectors with known (or partly known) classification, and a suitable distance measure, how can one characterize the uncertainty pertaining to the class of a new pattern? In a recent paper [2], an answer to this question was proposed based on the Dempster-Shafer theory of evidence [12]. The approach consists in considering each neighbor of a pattern to be classified as an item of evidence supporting certain hypotheses concerning the class membership of that pattern. Based on this evidence, basic belief masses are assigned to each subset of the set of classes. Such masses are obtained for each of the k nearest neighbors of the pattern under consideration and aggregated using Dempster's rule of combination. Given a finite set of actions and losses associated to each action and each class, decisions can then be made by using some generalization of Bayes decision theory.

In many situations, this method was found experimentally to yield lower error rates than other methods based on the same information. However, the problem of optimizing the parameters involved in the classification rule was so far unresolved. In this paper, we propose to determine optimal or near-optimal parameter values from the data by minimizing a certain error function. This refinement of the original method is shown experimentally to result in a substantial improvement of classification accuracy.

The paper is organized as follows. The evidence-theoretic k-NN rule is first recalled in Section 2. The basic concepts of the Dempster-Shafer theory are assumed to be known to the reader, who is invited to refer to [12] and [14] for detailed presentations, and to [2] for a short introduction. Section 3 describes the learning procedure, as well as an approximation to it that allows the error function to be near-optimized very efficiently. Simulation results are then presented in Section 4, and Section 5 concludes the paper.

2 The evidence-theoretic k-NN rule

We consider the problem of classifying entities into M categories or classes. The set of classes is denoted by Ω = {ω_1, …, ω_M}. The available information is assumed to consist in a training set T = {(x^(1), ω^(1)), …, (x^(N), ω^(N))} of N n-dimensional patterns x^(i), i = 1, …, N, and their corresponding class labels ω^(i), i = 1, …, N, taking values in Ω.¹ The similarity between patterns is assumed to be correctly measured by a certain distance function d(·, ·).

Let x be a new vector to be classified on the basis of the information contained in T. Each pair (x^(i), ω^(i)) constitutes a distinct item of evidence regarding the class membership of x. If x is "close" to x^(i) according to the relevant metric d, then one will be inclined to believe that both vectors belong to the same class. On the contrary, if d(x, x^(i)) is very large, then the consideration of x^(i) will leave us in a situation of almost complete ignorance concerning the class of x. Consequently, this item of evidence may be postulated to induce a basic belief assignment (BBA) m(·|x^(i)) over Ω, defined by:

\[ m(\{\omega_q\} \mid x^{(i)}) = \alpha\,\phi_q(d^{(i)}) \tag{1} \]

\[ m(\Omega \mid x^{(i)}) = 1 - \alpha\,\phi_q(d^{(i)}) \tag{2} \]

\[ m(A \mid x^{(i)}) = 0, \qquad \forall A \in 2^\Omega \setminus \{\Omega, \{\omega_q\}\} \tag{3} \]

where d^(i) = d(x, x^(i)), ω_q is the class of x^(i) (ω^(i) = ω_q), α is a parameter such that 0 < α < 1, and φ_q is a decreasing function verifying φ_q(0) = 1 and lim_{d→∞} φ_q(d) = 0. Note that m(·|x^(i)) reduces to the vacuous belief function (m(Ω|x^(i)) = 1) when the distance between x and x^(i) tends to infinity, reflecting a state of total ignorance. When d denotes the Euclidean distance, a rational choice for φ_q was shown in [2] to be:

\[ \phi_q(d) = \exp(-\gamma_q d^2) \tag{4} \]

γ_q being a positive parameter associated to class ω_q.

As a result of considering each training pattern in turn, we obtain N BBAs that can be combined using Dempster's rule of combination to form a resulting BBA m synthesizing one's final belief regarding the class of x:

\[ m = m(\cdot \mid x^{(1)}) \oplus \cdots \oplus m(\cdot \mid x^{(N)}) \tag{5} \]

¹ In this paper, we assume for simplicity the class of each training vector to be known with certainty. The more general situation in which the training set is only imperfectly labeled has been introduced in [4]. However, the problem of optimizing the parameters in the general case is not completely solved yet (see Section 5).

Since those training patterns situated far from x actually provide very little information, it is sufficient to consider only the k nearest neighbors of x in this combination. An alternative definition of m is therefore:

\[ m = m(\cdot \mid x^{(i_1)}) \oplus \cdots \oplus m(\cdot \mid x^{(i_k)}) \tag{6} \]

where I_k = {i_1, …, i_k} contains the indices of the k nearest neighbors of x in T. Adopting this latter definition, m can be shown [2] to have the following expression:

\[ m(\{\omega_q\}) = \frac{1}{K}\left(1 - \prod_{i \in I_{k,q}} \bigl(1 - \alpha\,\phi_q(d^{(i)})\bigr)\right) \prod_{r \neq q}\; \prod_{i \in I_{k,r}} \bigl(1 - \alpha\,\phi_r(d^{(i)})\bigr), \qquad \forall q \in \{1, \ldots, M\} \tag{7} \]

\[ m(\Omega) = \frac{1}{K} \prod_{r=1}^{M}\; \prod_{i \in I_{k,r}} \bigl(1 - \alpha\,\phi_r(d^{(i)})\bigr) \tag{8} \]

where I_{k,q} is the subset of I_k corresponding to those neighbors of x belonging to class ω_q, and K is a normalizing factor. Hence, the focal elements of m are the singletons and the whole frame Ω. Consequently, the credibility and the plausibility of each class ω_q are respectively equal to:

\[ \mathrm{bel}(\{\omega_q\}) = m(\{\omega_q\}) \tag{9} \]

\[ \mathrm{pl}(\{\omega_q\}) = m(\{\omega_q\}) + m(\Omega) \tag{10} \]

The pignistic probability distribution as defined by Smets [13] is given by:

\[ \mathrm{BetP}(\{\omega_q\}) = \sum_{\{A \subseteq \Omega \,:\, \omega_q \in A\}} \frac{m(A)}{|A|} = m(\{\omega_q\}) + \frac{m(\Omega)}{M} \tag{11} \]

for q = 1, …, M. Let us now assume that, based on this evidential corpus, a decision has to be made regarding the assignment of x to a class, and let us denote by α_q the action of assigning x to class ω_q. Let us further assume that the loss incurred in case of a wrong classification is equal to 1, while the loss corresponding to a correct classification is equal to 0. Then, the lower and the upper expected losses [3] associated to action α_q are respectively equal to:

\[ R_*(\alpha_q \mid x) = 1 - \mathrm{pl}(\{\omega_q\}) \tag{12} \]

\[ R^*(\alpha_q \mid x) = 1 - \mathrm{bel}(\{\omega_q\}) \tag{13} \]

The expected loss relative to the pignistic distribution is:

\[ R_{\mathrm{bet}}(\alpha_q \mid x) = 1 - \mathrm{BetP}(\{\omega_q\}) \tag{14} \]

Given the particular form of m, the three strategies consisting in minimizing R_*, R^* and R_bet lead to the same decision in that case: the pattern is assigned to the class with maximum belief assignment. Other decision strategies, including the possibility of pattern rejection as well as the existence of unknown classes, are studied in [3, 4].
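To make the rule concrete, here is a minimal numpy sketch of Equations (1)-(11) and of the decision above, under the assumptions already stated (Euclidean distance, φ_q(d) = exp(−γ_q d²)); the function and variable names are ours, not the paper's.

```python
import numpy as np

def et_knn(x, X, y, gamma, alpha=0.95, k=5):
    """Evidence-theoretic k-NN rule, Eqs. (1)-(11): combine the k nearest
    neighbors' simple BBAs with Dempster's rule and return the normalized
    singleton masses, the mass on Omega, and the pignistic probabilities."""
    M = len(gamma)                              # classes are labeled 0..M-1
    d2 = np.sum((X - x) ** 2, axis=1)           # squared Euclidean distances
    nn = np.argsort(d2)[:k]                     # indices of the k nearest neighbors
    # prod[q] = product over neighbors of class q of (1 - alpha*phi_q(d)), cf. Eqs. (7)-(8)
    prod = np.ones(M)
    for i in nn:
        prod[y[i]] *= 1.0 - alpha * np.exp(-gamma[y[i]] * d2[i])
    total = prod.prod()                         # unnormalized m(Omega), Eq. (8)
    m = (1.0 - prod) * total / prod             # unnormalized m({w_q}), Eq. (7)
    K = m.sum() + total                         # normalizing factor
    return m / K, total / K, m / K + total / (K * M)   # masses and BetP, Eq. (11)

# toy usage: the class with maximum pignistic probability is selected
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
m, m_omega, betp = et_knn(np.array([0.5, 0.5]), X, y, gamma=np.ones(2))
print(betp.argmax(), m_omega)                   # decision and residual ignorance
```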

3 Parameter optimization

3.1 The approach

In the above description of the evidence-theoretic k-NN rule, we left open the question of the choice of the parameters α and γ = (γ_1, …, γ_M)^t appearing in Equations (1) and (4). Whereas the value of α proves in practice not to be too critical, the tuning of the other parameters was found experimentally to have a significant influence on classification accuracy. In [2], it was proposed to set α = 0.95 and γ_q to the inverse of the mean distance between training patterns belonging to class ω_q. Although this heuristic yields good results on average, the efficiency of the classification procedure can be improved if these parameters are determined as the values optimizing a performance criterion. Such a criterion can be defined as follows.

Let us consider a training pattern x^(ℓ) belonging to class ω_q. The class membership of x^(ℓ) can be encoded as a vector t^(ℓ) = (t_1^(ℓ), …, t_M^(ℓ))^t of M binary indicator variables t_j^(ℓ), defined by t_j^(ℓ) = 1 if j = q and t_j^(ℓ) = 0 otherwise. By considering the k nearest neighbors of x^(ℓ) in the training set, one obtains a "leave-one-out" BBA m^(ℓ) characterizing one's belief concerning the class of x^(ℓ) if this pattern were to be classified using the other training patterns. Based on m^(ℓ), an output vector P^(ℓ) = (BetP^(ℓ)({ω_1}), …, BetP^(ℓ)({ω_M}))^t of pignistic probabilities can be computed, BetP^(ℓ) being the pignistic probability distribution associated to m^(ℓ). Ideally, vector P^(ℓ) should be as "close" as possible to vector t^(ℓ), closeness being defined, for example, according to the squared error E(x^(ℓ)):

\[ E(x^{(\ell)}) = \left(P^{(\ell)} - t^{(\ell)}\right)^t \left(P^{(\ell)} - t^{(\ell)}\right) = \sum_{q=1}^{M} \left(P_q^{(\ell)} - t_q^{(\ell)}\right)^2 \tag{15} \]

The mean squared error over the whole training set T of size N is finally equal to:

\[ E = \frac{1}{N} \sum_{\ell=1}^{N} E(x^{(\ell)}) \tag{16} \]

Function E can be used as a cost function for tuning the parameter vector γ. The analytical expression for the gradient of E(x^(ℓ)) with respect to γ can be calculated, allowing the parameters γ_q to be determined iteratively by a gradient search procedure (see Appendix A). Alternatively, the minimum of function E can be approximated in one step for large N, using the approach described in the sequel.

3.2 One-step procedure

For an arbitrary training pattern x^(ℓ) and fixed parameters, vector P^(ℓ) can be regarded as a function of two vectors:

1. a vector d² = (d^(i_1)², …, d^(i_k)²)^t of squared distances between x^(ℓ) and its k nearest neighbors, and

2. a vector containing the class labels of these neighbors.

For small k and large N, d² can be assumed to be close to zero,² allowing each component P_q^(ℓ) to be approximated by a Taylor series expansion around 0 up to the first order:

\[ P_q^{(\ell)}(d^2) \approx P_q^{(\ell)}(0) + \nabla_{d^2} P_q^{(\ell)t}\Big|_{d^2 = 0}\; d^2 \tag{17} \]

The first term in this expression can be readily obtained from Equations (7), (8) and (11) by setting d^(i) to 0 for all i, which leads to:

\[ P_q^{(\ell)}(0) = \frac{(1-\alpha)^{k-k_q}}{K}\left(1 - (1-\alpha)^{k_q}\right) + \frac{(1-\alpha)^{k}}{MK} \tag{18} \]

where k_q is the number of neighbors of x^(ℓ) in class ω_q and

\[ K = \sum_{q=1}^{M} (1-\alpha)^{k-k_q} - (M-1)(1-\alpha)^{k} \tag{19} \]

² This assumption is justified by the following result [1]: regarding the training set as a sample drawn from some probability distribution, the k-th nearest neighbor of x^(ℓ) converges to x^(ℓ) with probability one as the sample size increases with k fixed.
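For intuition, Equations (18)-(19) say that at zero distance the pignistic output depends only on α, k and the class counts among the neighbors; a small sketch of this closed form (our notation):

```python
import numpy as np

def betp_zero_distance(k_counts, alpha=0.95):
    """P_q(0) of Eqs. (18)-(19): pignistic probabilities when all k
    neighbor distances are set to 0; k_counts[q] = # neighbors in class q."""
    kq = np.asarray(k_counts, dtype=float)
    k, M = kq.sum(), len(kq)
    m = (1 - alpha) ** (k - kq) * (1 - (1 - alpha) ** kq)  # unnormalized m({w_q})
    m_omega = (1 - alpha) ** k                             # unnormalized m(Omega)
    K = m.sum() + m_omega                                  # Eq. (19)
    return (m + m_omega / M) / K

print(betp_zero_distance([3, 1, 1]))   # most mass goes to the majority class
```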

The computation of the first-order term,

\[ \nabla_{d^2} P_q^{(\ell)t}\Big|_{d^2 = 0}\; d^2 = \sum_{i=1}^{k} \frac{\partial P_q^{(\ell)}}{\partial d^{(i)2}}(0)\; d^{(i)2} \tag{20} \]

is more involved (see Appendix B). This term can be shown to be of the form A^(ℓ)γ, where A^(ℓ) is a square matrix of size M. As a consequence, both E(x^(ℓ)) and E can be approximated by quadratic forms of γ, which allows the minimum to be approached directly by solving a system of linear equations.

Figures 1 and 2 show the quality of this approximation in the case of two Gaussian classes with covariance matrices Σ_1 = Σ_2 = I, the data set containing the same number of samples of each class. Displayed are the mean squared error as a function of (γ_1, γ_2) (Figure 1) and its quadratic approximation (Figure 2), for a fixed value of k. The minima of the two functions differ only very slightly, which proves the relevance of the approximation in that case. Note that the quality of the approximation depends on both k and N.

4 Numerical experiments

The performances of the above methods were compared to those of the voting k-NN rule with randomly resolved ties, the distance-weighted k-NN rule [6], the fuzzy k-NN rule proposed by Keller et al. [9], and the evidence-theoretic rule without parameter optimization [2]. Experiments were carried out on a set of standard artificial and real-world benchmark classification tasks. The main characteristics of the data sets used are summarized in Table 1.

Data sets B1 and B2 were generated using a method proposed in [7]. The data consists in three Gaussian classes in 10 dimensions, with diagonal covariance matrices D_1, D_2 and D_3. The i-th diagonal element D_{qi} of D_q is defined as a function of two parameters a and b:

\[ D_{1i}(a, b) = a + (b - a)\,\frac{i - 1}{n - 1} \]

\[ D_{2i}(a, b) = a + (b - a)\,\frac{n - i}{n - 1} \]

\[ D_{3i}(a, b) = a + (b - a)\,\frac{\min(i,\; n - i)}{n/2} \]

where n = 10 is the input dimension; each of the three classes was assigned a fixed mean vector μ_q and a covariance matrix D_q obtained for particular values of a and b.

The ionosphere data set (Ion) was collected by a radar system consisting of a phased array of 16 high-frequency antennas with a total transmitted power of the order of 6.4 kilowatts [11]. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere; "bad" returns are those that do not.

The vehicle data set (Veh) was collected from silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS. Four model vehicles were used for the experiment: a bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. The data was used to distinguish 3D objects within a 2D silhouette of the objects [11].

The sonar data (Son) were used by Gorman and Sejnowski in a study of the classification of sonar signals using a neural network [8]. The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

Test error rates are represented as a function of k in Figures 3 to 12. The results for synthetic data are averages over several independent training sets. Test error rates obtained for the values of k yielding the best classification of training vectors are presented in Table 2.

As can be seen from these results, the evidence-theoretic rule with optimized γ presented in this paper always performed as well as or better than the four other rules tested, and improved significantly, at the 95% confidence level, over the evidence-theoretic rule with fixed γ on data sets B1 and B2 when considering the best results obtained over the range of k considered. However, the most distinctive feature of this rule seems to be its robustness with respect to the number k of neighbors taken into consideration (Figures 3 to 12). By optimizing γ, the method learns to discard those neighbors whose distance to the pattern under consideration is too high. Practically, this property is of great importance, since it relieves the designer of the system from the burden of searching for the optimal value of k. When the number of training patterns is large, the amount of computation may be further reduced by adopting the approximate one-step procedure for optimizing γ, which gives reasonably good results for small k (Figures 3, 5, 7, 9 and 11). However, the use of the exact procedure should be preferred for small and medium-sized training sets.

5 Concluding remarks

A technique for optimizing the parameters in the evidence-theoretic k-NN rule has been presented. The classification rule obtained by this method has proved superior to the voting, distance-weighted and fuzzy rules on a number of benchmark problems. A remarkable property achieved with this approach is the relative insensitivity of the results to the choice of k.

The method can be generalized in several ways. First of all, one can assume a more general metric than the Euclidean one considered so far, and apply the principles described in this paper to search for the optimal metric [15]. For instance, let Γ_q be a positive definite diagonal matrix with diagonal elements γ_{q,1}, …, γ_{q,n}. The distance between an input vector x and a learning vector x^(i) belonging to class ω_q can be defined as:

\[ d(x, x^{(i)})^2 = (x - x^{(i)})^t\, \Gamma_q\, (x - x^{(i)}) = \sum_{j=1}^{n} \gamma_{q,j}\,(x_j - x_j^{(i)})^2 \tag{21} \]

The parameters γ_{q,j} for 1 ≤ q ≤ M and 1 ≤ j ≤ n can then be optimized using exactly the same approach as described in this paper, which may in some cases result in a further improvement of the classification results. A more general form could even be assumed for Γ_q, with however the risk of a dramatic increase in the number of parameters for large input dimensions.
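For illustration, the per-class weighted distance of Equation (21) is a one-line computation; a minimal sketch with hypothetical names:

```python
import numpy as np

def weighted_d2(x, xi, gamma_q):
    """Squared distance of Eq. (21): Gamma_q is diagonal, so the quadratic
    form reduces to a weighted sum of squared coordinate differences."""
    return float(np.sum(gamma_q * (x - xi) ** 2))

# usage: class-specific weights emphasize the first feature
print(weighted_d2(np.array([1.0, 2.0]), np.array([0.0, 0.0]),
                  gamma_q=np.array([2.0, 0.5])))   # 2*1 + 0.5*4 = 4.0
```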

More fundamentally, the method can also be extended to handle the more general situation in which the class membership of the training patterns is itself affected by uncertainty. For example, let us assume that the class of each training pattern x^(i) is only known to lie in a subset A^(i) of Ω (such a situation may typically arise, e.g., in medical diagnosis problems in which some records in a database are related to patients for which only a partial or uncertain diagnosis is available). A natural extension of Equations (1)-(3) is then:

\[ m(A^{(i)} \mid x^{(i)}) = \alpha\,\phi(d^{(i)}) \]

\[ m(\Omega \mid x^{(i)}) = 1 - \alpha\,\phi(d^{(i)}) \]

\[ m(B \mid x^{(i)}) = 0, \qquad \forall B \in 2^\Omega \setminus \{\Omega, A^{(i)}\} \]

with φ(d) = exp(−γd²), γ being a positive parameter (note that we cannot define a separate parameter for each class in this case, since the class of x^(i) is only partially known). The BBAs defined in that way correspond to simple belief functions and can be combined in linear time with respect to the number of classes. For optimizing γ, the error criterion defined in Equation (15) has to be generalized in some way. With the same notations as in Section 3.1, a possible expression for the error concerning pattern x^(ℓ) is:

\[ E(x^{(\ell)}) = \left(1 - \mathrm{BetP}^{(\ell)}(A^{(\ell)})\right)^2 \tag{22} \]

which reflects the fact that the pignistic probability of x^(ℓ) belonging to A^(ℓ), given the other training patterns, should be as high as possible. The value of γ minimizing the mean error may then be determined using an iterative search procedure. Experiments with this approach are currently under way and will be reported in future publications.
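A minimal sketch of this extension (our names and representation, with label sets stored as frozensets): it combines the simple BBAs by carrying a dictionary of focal sets, which is a generic scheme rather than the linear-time one alluded to above, and then evaluates the error criterion of Equation (22).

```python
def combine_set_labeled(label_sets, s):
    """Unnormalized conjunctive combination of simple BBAs with mass s[i]
    on the label set A^(i) (a frozenset) and 1 - s[i] on Omega (key None)."""
    m = {None: 1.0}
    for A, si in zip(label_sets, s):
        new = {}
        for B, mB in m.items():
            C = A if B is None else (A & B)        # focal set intersected with A^(i)
            new[C] = new.get(C, 0.0) + mB * si
            new[B] = new.get(B, 0.0) + mB * (1 - si)
        m = new
    return m                                       # may hold mass on the empty set

def betp_from_masses(m, classes):
    """Pignistic transform; conflict (empty-set mass) is renormalized away."""
    mass = {B: v for B, v in m.items() if B is None or len(B) > 0}
    K = sum(mass.values())
    p = dict.fromkeys(classes, 0.0)
    for B, v in mass.items():
        members = list(classes) if B is None else B
        for w in members:
            p[w] += v / (K * len(members))
    return p

# usage: s_i = alpha*exp(-gamma*d_i^2) for three partially labeled neighbors
sets = [frozenset({0}), frozenset({0, 1}), frozenset({2})]
p = betp_from_masses(combine_set_labeled(sets, [0.9, 0.5, 0.3]), classes=range(3))
E = (1 - sum(p[w] for w in {0, 1})) ** 2           # Eq. (22) with A^(l) = {0, 1}
```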

A Computation of the derivatives of E w.r.t. γ_q

Let x^(ℓ) be a training pattern and m^(ℓ) the BBA obtained by classifying x^(ℓ) using its k nearest neighbors in the training set. Function m^(ℓ) is computed according to Equations (7) and (8):

\[ m^{(\ell)}(\{\omega_q\}) = \frac{1}{K}\left(1 - \prod_{i \in I_{k,q}^{(\ell)}} \bigl(1 - \alpha\,\phi_q(d^{(\ell,i)})\bigr)\right) \prod_{r \neq q}\; \prod_{i \in I_{k,r}^{(\ell)}} \bigl(1 - \alpha\,\phi_r(d^{(\ell,i)})\bigr), \qquad \forall q \in \{1, \ldots, M\} \tag{23} \]

\[ m^{(\ell)}(\Omega) = \frac{1}{K} \prod_{r=1}^{M}\; \prod_{i \in I_{k,r}^{(\ell)}} \bigl(1 - \alpha\,\phi_r(d^{(\ell,i)})\bigr) \tag{24} \]

where I_{k,r}^(ℓ) denotes the set of indices of the k nearest neighbors of pattern x^(ℓ) belonging to class ω_r, d^(ℓ,i) is the distance between x^(ℓ) and x^(i), and K is a normalizing factor. In the following, we shall assume that:

\[ \phi_q(d^{(\ell,i)}) = \exp\left(-\gamma_q\, (d^{(\ell,i)})^2\right) \tag{25} \]

which will simply be denoted by φ_q^(ℓ,i). To simplify the calculations, we further introduce the unnormalized orthogonal sum [13] m'^(ℓ), defined as m'^(ℓ)(A) = K m^(ℓ)(A) for all A ⊆ Ω. We also denote by m^(ℓ,i) the unnormalized orthogonal sum of the m(·|x^(j)) for all j ∈ I_k^(ℓ), j ≠ i; that is, m'^(ℓ) = m^(ℓ,i) ⊙ m(·|x^(i)), where ⊙ denotes the unnormalized orthogonal sum operator. More precisely, we have:

\[ m'^{(\ell)}(\{\omega_q\}) = m(\{\omega_q\} \mid x^{(i)}) \left( m^{(\ell,i)}(\{\omega_q\}) + m^{(\ell,i)}(\Omega) \right) + m(\Omega \mid x^{(i)})\; m^{(\ell,i)}(\{\omega_q\}), \qquad \forall q \in \{1, \ldots, M\} \tag{26} \]

\[ m'^{(\ell)}(\Omega) = m^{(\ell,i)}(\Omega)\; m(\Omega \mid x^{(i)}) \tag{27} \]

The error for pattern x^(ℓ) is:

\[ E(x^{(\ell)}) = \sum_{q=1}^{M} \left(P_q^{(\ell)} - t_q^{(\ell)}\right)^2 \tag{28} \]

where t^(ℓ) is the class membership vector for pattern x^(ℓ) and P_q^(ℓ) is the pignistic probability of class ω_q computed from m^(ℓ) as P_q^(ℓ) = m^(ℓ)({ω_q}) + m^(ℓ)(Ω)/M.

The derivative of E(x^(ℓ)) with respect to each parameter γ_q can be computed as:

\[ \frac{\partial E(x^{(\ell)})}{\partial \gamma_q} = \sum_{i \in I_{k,q}^{(\ell)}} \frac{\partial E(x^{(\ell)})}{\partial \phi_q^{(\ell,i)}}\; \frac{\partial \phi_q^{(\ell,i)}}{\partial \gamma_q} \tag{29} \]

with

\[ \frac{\partial E(x^{(\ell)})}{\partial \phi_q^{(\ell,i)}} = \sum_{r=1}^{M} \frac{\partial E(x^{(\ell)})}{\partial P_r^{(\ell)}}\; \frac{\partial P_r^{(\ell)}}{\partial \phi_q^{(\ell,i)}} = \sum_{r=1}^{M} 2 \left(P_r^{(\ell)} - t_r^{(\ell)}\right) \left( \frac{\partial m^{(\ell)}(\{\omega_r\})}{\partial \phi_q^{(\ell,i)}} + \frac{1}{M}\; \frac{\partial m^{(\ell)}(\Omega)}{\partial \phi_q^{(\ell,i)}} \right) \tag{30} \]

and

\[ \frac{\partial \phi_q^{(\ell,i)}}{\partial \gamma_q} = -\,(d^{(\ell,i)})^2\; \phi_q^{(\ell,i)} \tag{31} \]

The derivatives in Equation (30) can be computed as:

\[ \frac{\partial m^{(\ell)}(\{\omega_r\})}{\partial \phi_q^{(\ell,i)}} = \frac{\partial}{\partial \phi_q^{(\ell,i)}}\, \frac{m'^{(\ell)}(\{\omega_r\})}{K} = \frac{1}{K^2} \left( K\, \frac{\partial m'^{(\ell)}(\{\omega_r\})}{\partial \phi_q^{(\ell,i)}} - m'^{(\ell)}(\{\omega_r\})\, \frac{\partial K}{\partial \phi_q^{(\ell,i)}} \right) \tag{32} \]

\[ \frac{\partial m^{(\ell)}(\Omega)}{\partial \phi_q^{(\ell,i)}} = \frac{1}{K^2} \left( K\, \frac{\partial m'^{(\ell)}(\Omega)}{\partial \phi_q^{(\ell,i)}} - m'^{(\ell)}(\Omega)\, \frac{\partial K}{\partial \phi_q^{(\ell,i)}} \right) \tag{33} \]

\[ \frac{\partial K}{\partial \phi_q^{(\ell,i)}} = \sum_{r=1}^{M} \frac{\partial m'^{(\ell)}(\{\omega_r\})}{\partial \phi_q^{(\ell,i)}} + \frac{\partial m'^{(\ell)}(\Omega)}{\partial \phi_q^{(\ell,i)}} \tag{34} \]

Finally,

\[ \frac{\partial m'^{(\ell)}(\{\omega_r\})}{\partial \phi_q^{(\ell,i)}} = \alpha \left( m^{(\ell,i)}(\{\omega_r\}) + m^{(\ell,i)}(\Omega) \right) \delta_{rq} - \alpha\, m^{(\ell,i)}(\{\omega_r\}) \tag{35} \]

where δ is the Kronecker symbol, and

\[ \frac{\partial m'^{(\ell)}(\Omega)}{\partial \phi_q^{(\ell,i)}} = -\,\alpha\, m^{(\ell,i)}(\Omega) \tag{36} \]

which completes the calculation of the gradient of E(x^(ℓ)) w.r.t. γ_q. To account for the constraint γ_q ≥ 0, we introduce new parameters η_q (q = 1, …, M) such that:

\[ \gamma_q = \eta_q^2 \tag{37} \]

and we compute ∂E(x^(ℓ))/∂η_q as:

\[ \frac{\partial E(x^{(\ell)})}{\partial \eta_q} = \frac{\partial E(x^{(\ell)})}{\partial \gamma_q}\; \frac{\partial \gamma_q}{\partial \eta_q} = 2\,\eta_q\, \frac{\partial E(x^{(\ell)})}{\partial \gamma_q} \tag{38} \]
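In code, this reparametrization is a thin wrapper around any routine returning ∂E/∂γ; a minimal sketch with hypothetical names:

```python
def grad_wrt_eta(eta, dE_dgamma):
    """Chain rule of Eq. (38): with gamma_q = eta_q**2 (Eq. 37), the
    constraint gamma_q >= 0 holds automatically for any real eta_q."""
    return 2.0 * eta * dE_dgamma(eta ** 2)

# usage with a dummy quadratic error gradient dE/dgamma = gamma - 1:
print(grad_wrt_eta(0.5, lambda gamma: gamma - 1.0))   # 2*0.5*(0.25-1) = -0.75
```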


B Linearization

We consider the expansion of P_q^(ℓ) around d² = 0 by a Taylor series up to the first order:

\[ P_q^{(\ell)}(d^2) \approx P_q^{(\ell)}(0) + \left(\nabla_{d^2} P_q^{(\ell)}(0)\right)^t d^2 \tag{39} \]

where P_q^(ℓ)(0) is given by Equations (18) and (19). In the following, we shall compute the first-order term in the above equation, and deduce from that result a method for determining an approximation to the optimal parameter vector. To simplify the notation, the superscript (ℓ) will be omitted from the following calculations.

As a result of the definition of the pignistic probability, we have:

\[ \frac{\partial P_q}{\partial d^{(i)2}} = \frac{\partial m(\{\omega_q\})}{\partial d^{(i)2}} + \frac{1}{M}\; \frac{\partial m(\Omega)}{\partial d^{(i)2}} \tag{40} \]

The derivatives of m({ω_q}) and m(Ω) can be more conveniently expressed as functions of the unnormalized BBA m':

\[ \frac{\partial m(\{\omega_q\})}{\partial d^{(i)2}} = \frac{1}{K^2} \left( K\, \frac{\partial m'(\{\omega_q\})}{\partial d^{(i)2}} - m'(\{\omega_q\})\, \frac{\partial K}{\partial d^{(i)2}} \right) \tag{41} \]

\[ \frac{\partial m(\Omega)}{\partial d^{(i)2}} = \frac{1}{K^2} \left( K\, \frac{\partial m'(\Omega)}{\partial d^{(i)2}} - m'(\Omega)\, \frac{\partial K}{\partial d^{(i)2}} \right) \tag{42} \]

To compute ∂m'({ω_q})/∂d^(i)² and ∂m'(Ω)/∂d^(i)², we need to distinguish two cases.

Case 1: i ∈ I_{k,q}. We then have:

\[ \frac{\partial m'(\{\omega_q\})}{\partial d^{(i)2}} = -\,\alpha \gamma_q \exp\bigl(-\gamma_q d^{(i)2}\bigr) \prod_{j \in I_{k,q},\, j \neq i} \bigl(1 - \alpha \exp(-\gamma_q d^{(j)2})\bigr)\; \prod_{r \neq q}\; \prod_{j \in I_{k,r}} \bigl(1 - \alpha \exp(-\gamma_r d^{(j)2})\bigr) \tag{43} \]

\[ \frac{\partial m'(\Omega)}{\partial d^{(i)2}} = \alpha \gamma_q \exp\bigl(-\gamma_q d^{(i)2}\bigr) \prod_{r=1}^{M}\; \prod_{j \in I_{k,r},\, j \neq i} \bigl(1 - \alpha \exp(-\gamma_r d^{(j)2})\bigr) \tag{44} \]

Setting all distances to 0 in the above equations, we have:

\[ \frac{\partial m'(\{\omega_q\})}{\partial d^{(i)2}}\bigg|_{d^2 = 0} = -\,\alpha \gamma_q (1-\alpha)^{k-1} \tag{45} \]

\[ \frac{\partial m'(\Omega)}{\partial d^{(i)2}}\bigg|_{d^2 = 0} = \alpha \gamma_q (1-\alpha)^{k-1} \tag{46} \]

Case 2: i ∈ I_{k,l}, l ≠ q. We have:

\[ \frac{\partial m'(\{\omega_q\})}{\partial d^{(i)2}} = \alpha \gamma_l \exp\bigl(-\gamma_l d^{(i)2}\bigr) \left(1 - \prod_{j \in I_{k,q}} \bigl(1 - \alpha \exp(-\gamma_q d^{(j)2})\bigr)\right) \prod_{r \neq q}\; \prod_{j \in I_{k,r},\, j \neq i} \bigl(1 - \alpha \exp(-\gamma_r d^{(j)2})\bigr) \tag{47} \]

\[ \frac{\partial m'(\Omega)}{\partial d^{(i)2}} = \alpha \gamma_l \exp\bigl(-\gamma_l d^{(i)2}\bigr) \prod_{r=1}^{M}\; \prod_{j \in I_{k,r},\, j \neq i} \bigl(1 - \alpha \exp(-\gamma_r d^{(j)2})\bigr) \tag{48} \]

Setting the distances to zero in the above equations:

\[ \frac{\partial m'(\{\omega_q\})}{\partial d^{(i)2}}\bigg|_{d^2 = 0} = \alpha \gamma_l \left(1 - (1-\alpha)^{k_q}\right)(1-\alpha)^{k-k_q-1} \tag{49} \]

\[ \frac{\partial m'(\Omega)}{\partial d^{(i)2}}\bigg|_{d^2 = 0} = \alpha \gamma_l (1-\alpha)^{k-1} \tag{50} \]

where k_q = |I_{k,q}|. The derivatives of K are simply obtained as follows:

\[ \frac{\partial K}{\partial d^{(i)2}} = \sum_{q=1}^{M} \frac{\partial m'(\{\omega_q\})}{\partial d^{(i)2}} + \frac{\partial m'(\Omega)}{\partial d^{(i)2}} \tag{51} \]

Hence, for i ∈ I_{k,q}:

\[ \frac{\partial K}{\partial d^{(i)2}}\bigg|_{d^2 = 0} = \alpha \gamma_q \sum_{r=1,\, r \neq q}^{M} \left(1 - (1-\alpha)^{k_r}\right)(1-\alpha)^{k-k_r-1} \tag{52} \]

It follows from the preceding calculations that, for i ∈ I_{k,r}, the derivatives of m'({ω_q}), m'(Ω) and K at d² = 0 are proportional to γ_r. Since m'({ω_q}), m'(Ω) and K do not themselves depend on γ for d² = 0, the derivative of P_q is also proportional to γ_r. Hence, we have:

\[ \sum_{i \in I_{k,r}} \frac{\partial P_q}{\partial d^{(i)2}}\bigg|_{d^2 = 0}\; d^{(i)2} = A_{q,r}\, \gamma_r \tag{53} \]

for all r ∈ {1, …, M}, A_{q,r} being some constant not depending on γ. Consequently, we can write:

\[ P_q \approx P_q(0) + \sum_{r=1}^{M} A_{q,r}\, \gamma_r \tag{54} \]

and, expressing this result in matrix form:

\[ P \approx P(0) + A\gamma \tag{55} \]

with A = (A_{i,j}) a square matrix of size M.

The above calculations have been performed for an arbitrary training pattern x. Reintroducing the pattern index ℓ, we have:

\[ P^{(\ell)} \approx P^{(\ell)}(0) + A^{(\ell)}\gamma \tag{56} \]


Introducing these terms into the mean squared error, we have:

\[ E \approx \frac{1}{N} \sum_{\ell=1}^{N} \left(P^{(\ell)} - t^{(\ell)}\right)^t \left(P^{(\ell)} - t^{(\ell)}\right) = \frac{1}{N} \sum_{\ell=1}^{N} \left(P^{(\ell)}(0) - t^{(\ell)} + A^{(\ell)}\gamma\right)^t \left(P^{(\ell)}(0) - t^{(\ell)} + A^{(\ell)}\gamma\right) \]

\[ = \frac{1}{N} \sum_{\ell=1}^{N} \left[ \left(P^{(\ell)}(0) - t^{(\ell)}\right)^t \left(P^{(\ell)}(0) - t^{(\ell)}\right) + 2\,\gamma^t A^{(\ell)t} \left(P^{(\ell)}(0) - t^{(\ell)}\right) + \gamma^t A^{(\ell)t} A^{(\ell)} \gamma \right] \tag{57} \]

The gradient of E with respect to γ is therefore given by:

\[ \nabla_\gamma E = \frac{2}{N} \left( \sum_{\ell=1}^{N} A^{(\ell)t} \left(P^{(\ell)}(0) - t^{(\ell)}\right) + \sum_{\ell=1}^{N} A^{(\ell)t} A^{(\ell)}\, \gamma \right) \tag{58} \]

Minimizing E under the constraint γ ≥ 0 is a nonnegative least-squares problem that may be solved efficiently using, for instance, the algorithm described in [10].
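Concretely, minimizing the quadratic form (57) under γ ≥ 0 amounts to a stacked nonnegative least-squares fit. A sketch, with SciPy's `scipy.optimize.nnls` standing in for the Lawson-Hanson algorithm of [10]; the inputs are the matrices A^(ℓ) and the vectors P^(ℓ)(0), t^(ℓ) computed as above, and the names are ours:

```python
import numpy as np
from scipy.optimize import nnls

def one_step_gamma(A_list, P0_list, t_list):
    """Solve min_{gamma >= 0} sum_l ||P0^(l) - t^(l) + A^(l) gamma||^2
    by stacking the per-pattern blocks into one NNLS problem."""
    C = np.vstack(A_list)                                  # stacked A^(l)
    b = np.concatenate([t - p0 for p0, t in zip(P0_list, t_list)])
    gamma, _ = nnls(C, b)                                  # Lawson-Hanson NNLS
    return gamma
```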


References

[1] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13(1):21-27, 1967.

[2] T. Denœux. A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Transactions on Systems, Man and Cybernetics, 25(5):804-813, 1995.

[3] T. Denœux. Analysis of evidence-theoretic decision rules for pattern classification. Pattern Recognition, 30(7):1095-1107, 1997.

[4] T. Denœux. Application du modèle des croyances transférables en reconnaissance de formes. Traitement du Signal (in press).

[5] T. Denœux and G. Govaert. Combined supervised and unsupervised learning for system diagnosis using Dempster-Shafer theory. In P. Borne et al., editor, CESA'96 IMACS Multiconference, Symposium on Control, Optimization and Supervision, volume 1, Lille, July 1996.

[6] S. A. Dudani. The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, SMC-6:325-327, 1976.

[7] J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84:165-175, 1989.

[8] R. P. Gorman and T. J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1:75-89, 1988.

[9] J. M. Keller, M. R. Gray, and J. A. Givens. A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man and Cybernetics, SMC-15(4):580-585, 1985.

[10] C. L. Lawson and R. J. Hanson. Solving Least Squares Problems. Prentice-Hall, 1974.

[11] P. M. Murphy and D. W. Aha. UCI Repository of machine learning databases [machine-readable data repository]. University of California, Department of Information and Computer Science, Irvine, CA.

[12] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ, 1976.

[13] P. Smets. The combination of evidence in the Transferable Belief Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5):447-458, 1990.

[14] P. Smets and R. Kennes. The Transferable Belief Model. Artificial Intelligence, 66:191-234, 1994.


[15] L. M. Zouhal. Contribution à l'application de la théorie des fonctions de croyance en reconnaissance des formes. PhD thesis, Université de Technologie de Compiègne, 1997.


Biography

Lalla Meriem Zouhal

Lalla Meriem Zouhal received an M.S. in electronics from the Faculté des Sciences Hassan II, Casablanca, Morocco, a DEA (Diplôme d'Etudes Approfondies) in System Control from the Université de Technologie de Compiègne, France, and a PhD from the same institution in 1997. Her research interests concern pattern classification, Dempster-Shafer theory and fuzzy logic.

Thierry Denœux

Thierry Denœux graduated in 1985 as an engineer from the Ecole Nationale des Ponts et Chaussées in Paris, and earned a PhD from the same institution in 1989. He obtained the "Habilitation à diriger des Recherches" from the Institut National Polytechnique de Lorraine. From 1989 to 1992, he was employed by the Lyonnaise des Eaux water company, where he was in charge of research projects concerning the application of neural networks to forecasting and diagnosis. Dr. Denœux joined the Université de Technologie de Compiègne as an assistant professor in 1992. His research interests include artificial neural networks, statistical pattern recognition, uncertainty modeling and data fusion.


List of Tables

1. Main characteristics of the data sets: number of classes (M), training set size (N), test set size (Nt) and input dimension (n).

2. Test error rates obtained with the voting, distance-weighted, fuzzy and evidence-theoretic classification rules for the best value of k (in brackets), with 95% confidence intervals. ETF: evidence-theoretic classifier with fixed γ; ETO: evidence-theoretic classifier with optimized γ.

List of Figures

1. Contour lines of the error function for different values of γ, using the gradient-descent method.

2. Contour lines of the error function for different values of γ, using the linearization method for the pignistic probability vectors.

3. Test error rates on data set B1 as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

4. Test error rates on data set B1 as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.

5. Test error rates on data set B2 as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

6. Test error rates on data set B2 as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.

7. Test error rates on ionosphere data as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

8. Test error rates on ionosphere data as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.

9. Test error rates on vehicle data as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

10. Test error rates on vehicle data as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.

11. Test error rates on sonar data as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

12. Test error rates on sonar data as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.


Table 1: Main characteristics of the data sets: number of classes (M), training set size (N), test set size (Nt) and input dimension (n).

data set    M    N    Nt    n
B1          3    –    –     10
B2          3    –    –     10
Ion         2    –    –     34
Veh         4    –    –     18
Son         2    –    –     60

(– : value not legible in the source scan.)

Table 2: Test error rates obtained with the voting, distance-weighted, fuzzy and evidence-theoretic classification rules for the best value of k (in brackets), with 95% confidence intervals. ETF: evidence-theoretic classifier with fixed γ; ETO: evidence-theoretic classifier with optimized γ.

data set    voting    ETF    ETO    weighted    fuzzy

(The numerical error rates, confidence intervals and best values of k are not legible in the source scan.)


Figure 1: Contour lines of the error function for different values of γ, using the gradient-descent method.

Figure 2: Contour lines of the error function for different values of γ, using the linearization method for the pignistic probability vectors.


Figure 3: Test error rates on data set B1 as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

Figure 4: Test error rates on data set B1 as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.


Figure 5: Test error rates on data set B2 as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

Figure 6: Test error rates on data set B2 as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.


Figure 7: Test error rates on ionosphere data as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

Figure 8: Test error rates on ionosphere data as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.


Figure 9: Test error rates on vehicle data as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

Figure 10: Test error rates on vehicle data as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.


Figure 11: Test error rates on sonar data as a function of k, for the ETF and ETO (gradient and linearization methods) k-NN rules.

Figure 12: Test error rates on sonar data as a function of k, for the voting, ETO, fuzzy and distance-weighted k-NN rules.
