
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 1, JANUARY 1980

Correspondence

On the Sensitivity of the Probability of Error Rule for Feature Selection

MOSHE BEN-BASSAT

Abstract-The low sensitivity of the probability of error rule (Pe rule) for feature selection is demonstrated and discussed. It is shown that under certain conditions features with significantly different discrimination power are considered as equivalent by the Pe rule. The main reason for this phenomenon lies in the fact that, directly, the Pe rule depends only on the most probable class and that, under the stated condition, the prior most probable class remains the posterior most probable class regardless of the result for the observed feature. A rule for breaking ties is suggested to refine the feature ordering induced by the Pe rule. By this tie-breaking rule, when two features have the same value for the expected probability of error, the feature with the higher variance for the probability of error is preferred.

Index Terms-Classification, feature selection, pattern recognition, probability of error, sensitivity of feature selection rules.

I. INTRODUCTION

Adopting the Bayesian approach for the classification of a given object to one of m possible classes, let π_i denote the prior probability for class i, let P_i(X) denote the conditional probability (density) of a feature X under class i, and let π_i(X) denote the posterior probability of class i after observing X. Assuming a zero-one loss matrix, the optimal Bayes rule assigns the object to the class with the highest a posteriori probability. The Bayes risk associated with X reduces to the expected probability of error, Pe(X), which is given by

$P_e(X) = E\left[1 - \max\{\pi_1(X), \ldots, \pi_m(X)\}\right] \qquad (1)$

where, here and throughout the paper, expectation is taken with respect to the mixed distribution of X.

This paper is concerned with the choice of a criterion function for evaluating the differentiation power of a feature or a subset of features. We do not deal in this paper with efficient algorithms for subset selection or sequential selection, although the relationship of these two problems to the choice of the criterion function is mentioned.

Let F denote the set of all available features, each of which is represented by a real random variable which may be multidimensional, e.g., when subsets of features are evaluated. Assuming that the testing cost for each feature is the same, the objective of the feature selection task is to select X ∈ F for which the classifier error rate is minimized. The natural rule for this purpose appears to be the probability of error rule, by which a feature X ∈ F is preferred to a feature Y ∈ F if Pe(X) < Pe(Y), while X and Y are indifferent if Pe(X) = Pe(Y).

Manuscript received June 22, 1978; revised January 3, 1979. This work was supported in part by the U.S. Public Health Services under Grant RO1 HS 01474 from the Health Resources Administration and in part by the National Science Foundation under Grant ENG-7729007 from the Division of Engineering.

The author is with the Center for the Critically Ill, School of Medicine, University of Southern California, Los Angeles, CA 90027, and with the Faculty of Management, Tel Aviv University, Tel Aviv, Israel.

Alternative feature selection rules were previously considered mainly because of computational difficulties, but see also Toussaint [6] and Ben-Bassat [2]-[4]. The purpose of this paper is to point out another deficiency of the probability of error rule (to be denoted henceforth as the Pe rule), namely its low sensitivity in distinguishing between good and better features. This phenomenon was previously considered by Duin et al. [5] for the case of binary features and two classes. The results in our paper, however, are not restricted to this special case.

In Section II, a numerical example is used to demonstrate the low sensitivity of the Pe rule, and a theorem is presented that explains the reasons for the low sensitivity. In Section III, a rule for breaking ties is introduced which resolves the low sensitivity problem. The ranking induced by this rule is also compared to the ranking induced by Shannon's rule and the quadratic information gain rule. Section IV concludes with a discussion and summary of the results.
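The quantities involved in (1) are simple to evaluate for a discrete feature; the following minimal sketch (not from the paper; the function name and data layout are illustrative) computes Pe(X) from the priors and the class-conditional probabilities, which is all the Pe rule needs in order to rank features.

import numpy as np

# Minimal sketch of the Pe rule for a discrete feature (illustrative only).
# priors[i] = prior probability pi_i of class i; cond[i, x] = P_i(x), the
# conditional probability of feature value x under class i.
def expected_error(priors, cond):
    """Pe(X) = E[1 - max_i pi_i(X)], expectation over the mixed distribution."""
    priors = np.asarray(priors, dtype=float)
    cond = np.asarray(cond, dtype=float)
    joint = priors[:, None] * cond        # pi_i * P_i(x)
    mixed = joint.sum(axis=0)             # mixed distribution of X
    posterior = joint / mixed             # pi_i(x), by Bayes' formula
    return float(np.sum(mixed * (1.0 - posterior.max(axis=0))))

# The Pe rule prefers the feature with the smallest expected_error value;
# features with equal values are indifferent under the rule.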

II. SENSITIVITY OF THE Pe RULE

Consider the pattern recognition problem with four classes and four binary features which is presented in Table I. The entries of the table represent the respective conditional probabilities for positive results of the features given the classes. Before doing any computation, it can easily be seen that X4 is a useless feature which can contribute nothing to differentiation between the classes. Features X3 and X2 seem to have some differentiation power (certainly more than X4), but they do not appear to be as strong as feature X1, which differentiates strongly between the most probable class C1 and the rest of the classes C2, C3, and C4. These intuitive arguments will later on be supported by more accurate analysis. Surprisingly enough, the Pe rule fails to distinguish between these features, since the expected probability of error for all four features is 0.20. This implies that by the Pe rule X1, X2, X3, and X4 are indifferent, and therefore, if ties are broken arbitrarily, it is possible that the useless feature X4 will be selected by the Pe rule. The questions to be answered are 1) what is the reason for the insensitivity of the Pe rule and 2) what can be done to resolve it.

The main reason for the low sensitivity of the Pe rule lies in the fact that, directly, it depends only on the posterior probability of the most probable class and that, under certain conditions, the prior most probable class remains the posterior most probable class regardless of the result for the observed feature. The prior and posterior values for the class probabilities may be different; however, the a priori maximum and the a posteriori maximum are attained on the same class. Theorem 1 states the necessary and sufficient conditions for this situation.

Theorem 1: Assume that class t is a priori the most probable class, i.e.,

$\pi_t = \max\{\pi_1, \ldots, \pi_m\}. \qquad (2)$

For a given feature X, assume that for every class i, i ≠ t, there exists a constant K_i such that

$\sup_x \frac{P_i(x)}{P_t(x)} \le K_i. \qquad (3)$

Then class t remains the posterior most probable class regardless of the observation x, if and only if

$\frac{\pi_t}{\pi_i} \ge K_i \quad \text{for every } i \ne t,\ i = 1, 2, \ldots, m. \qquad (4)$

TABLE I
THE PATTERNS MATRIX

  A PRIORI                           FEATURES
  PROBABILITY    CLASS       1       2       3       4
  .8000            1       .800    .900    .250    .400
  .1000            2       .050    .450    .306    .400
  .0500            3       .300    .850    .600    .400
  .0500            4       .200    .850    .920    .400

(Notation: The features as random variables are denoted by upper case letters, e.g., X. Specific values observed for these features are denoted by the corresponding lower case letters, e.g., x.)

Proof:

1) Sufficiency: For any given observation x we get from Bayes' formula

$\frac{\pi_i(x)}{\pi_t(x)} = \frac{\pi_i P_i(x)}{\pi_t P_t(x)}. \qquad (5)$

By (3) and (4) we conclude that, for every i and every possible x, the right-hand side of (5) is less than or equal to 1, which implies that π_t(x) ≥ π_i(x) for every i and every possible x.

The interpretation of this result can also be stated as follows: the posterior odds of class i against class t are equal to the product of the prior odds by the likelihood ratio (5). The least favorable result for class t against class i is that result x for which the supremum in (3) is attained or "almost" attained. Inequality (4) states that the prior odds in favor of class t are higher than the likelihood ratio even for the least favorable result for class t. This implies that after testing the feature X, although the posterior probability of class t may decrease, it still remains higher than the posterior probability of class i regardless of the actual observation x. If this advantage of class t holds against all the other classes, then π_t(x) remains the highest posterior probability regardless of the result x.

2) Necessity: Assume that for every possible x class t remains the most probable class, and assume that for a certain class, say class r, (4) is not true, i.e., π_t/π_r < K_r. Using (5) with i = r, this implies that for the least favorable result for class t against class r, the prior odds are smaller than the likelihood ratio. In this case, the posterior probability for class r will be higher than that of class t, which contradicts our assumption that, for every possible x, π_t(x) remains the highest.

The following theorem, which we will make immediate use of, is well known and its proof can be found, for instance, in Ben-Bassat and Gal [2].

Theorem 2: For every class i and every feature X,

$E[\pi_i(X)] = \pi_i. \qquad (6)$
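For completeness, (6) follows in one line from Bayes' formula and the definition of the mixed distribution (written here for a discrete feature; sums become integrals in the continuous case):

$E[\pi_i(X)] = \sum_x \pi_i(x)\,P(x) = \sum_x \frac{\pi_i P_i(x)}{P(x)}\,P(x) = \pi_i \sum_x P_i(x) = \pi_i,$

where $P(x) = \sum_j \pi_j P_j(x)$ denotes the mixed distribution of X.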

Assume now that the conditions of Theorem 1 are satisfied, i.e., for a given feature X, the prior odds in favor of class t dominate the likelihood ratio for any other class. In this case, the prior most probable class t remains the posterior most probable class regardless of the result for X. This implies

$E\left[\max_i \pi_i(X)\right] = E[\pi_t(X)] = \pi_t \qquad (7)$

where the last equality is based on Theorem 2. This means that

$P_e(X) = 1 - \pi_t = P_e \qquad (8)$

where Pe denotes the actual probability of error if a decision is made prior to testing X.

This result is the key for understanding the insensitivity of the Pe rule. Its significance is that if, for every class i, i ≠ t, the prior odds in favor of the most probable class t dominate the likelihood ratios, then the expected reduction in the probability of error is zero. This implies that by the Pe rule all the features that satisfy the conditions of Theorem 1 are considered equivalent. This is very undesirable because Theorem 1 may be satisfied by a large group of features whose differentiation power is significantly different. This group includes features that are irrelevant to all the classes, features that are irrelevant only to the prior most probable class, and features that are relevant to all the classes but, unfortunately, are not strong enough to change by themselves the prior most probable class. (Irrelevant features are defined and analyzed in [3]. Briefly, a feature X is irrelevant to class i if the posterior probability of class i equals the prior probability of class i regardless of the result for X.)

To illustrate this point, let us consider again the example described above. Feature X4 is an irrelevant feature for all the classes, and therefore the posterior probabilities for all the classes remain unchanged regardless of the result for X4. On the other hand, for feature X1 different posterior probabilities are obtained for positive and negative results; see Table II. A positive result yields a strong confirmation in favor of C1, π_1(X1 = +) = 0.955, and the posterior probability of error for this result is 0.045. A negative result, on the other hand, reduces significantly the posterior probability of C1, π_1(X1 = -) = 0.485, but still C1 is the most probable class. The posterior probability of error for a negative result is 0.515. As noted earlier, however, the expected probability of error for both X1 and X4 is 0.200 and therefore by the Pe rule X1 and X4 are equivalent. Assume that for a given case C2 is the true class, and we proceed sequentially by evaluating the features for just one step look ahead. If feature X4 is selected first, no progress is made regardless of its result, and this test is simply a waste of time and money. On the other hand, if feature X1 is selected first and the result is negative (as it is likely to be if C2 is the true class), then a significant step is made towards ruling out C1 and identifying C2. In fact, if in the second stage X2 is selected (after X1) and a negative result is obtained, then C2 becomes the most probable class (0.592) and this is a major step in the right direction. To summarize, it is extremely insensitive to consider X1 and X4 equivalent as does the Pe rule.

The conditions of Theorem 1 are not unusual and they are quite often satisfied, particularly when one class has relatively high probability compared to the other classes. This may happen, for instance, in the advanced stages of a sequential classification process. Consider, for instance, the case of binary features and let p_ij denote the conditional probability for a positive result of feature X_j given that class i is the true class. If the most probable class C_t has a 0.8 probability, and there are two other classes each with 0.1 probability, then

$\frac{\pi_t}{\pi_i} = \frac{0.8}{0.1} = 8 \quad \text{for every } i,\ i \ne t. \qquad (9)$

In order for (4) to be violated, a feature X_j must satisfy

$\max\left\{\frac{p_{ij}}{p_{tj}},\ \frac{1 - p_{ij}}{1 - p_{tj}}\right\} > 8 \qquad (10)$

which is quite a strong restriction. The feasible areas for (10) are shaded in Fig. 1.
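For a binary feature X_j the supremum in (3) is attained at one of the two outcomes, so K_i = max{p_ij/p_tj, (1 - p_ij)/(1 - p_tj)} and condition (4) can be checked directly from the pattern matrix. The following sketch (illustrative, not part of the paper) runs this check for the priors and entries of Table I; all four features pass, which is why Theorem 1 applies to each of them.

import numpy as np

# Sketch: check condition (4) of Theorem 1 for the binary features of Table I.
priors = np.array([0.80, 0.10, 0.05, 0.05])              # pi_1, ..., pi_4; class 1 is t
p_pos = np.array([[0.800, 0.900, 0.250, 0.400],           # rows: classes C1..C4
                  [0.050, 0.450, 0.306, 0.400],           # p_pos[i, j] = P(X_j = + | C_{i+1})
                  [0.300, 0.850, 0.600, 0.400],
                  [0.200, 0.850, 0.920, 0.400]])
t = 0                                                      # index of the prior most probable class
for j in range(4):
    holds = True
    for i in range(4):
        if i == t:
            continue
        # for a binary feature, sup_x P_i(x)/P_t(x) is attained at "+" or "-"
        K_i = max(p_pos[i, j] / p_pos[t, j],
                  (1.0 - p_pos[i, j]) / (1.0 - p_pos[t, j]))
        holds = holds and (priors[t] / priors[i] >= K_i)
    print(f"X{j+1}: condition (4) holds for every i -> {holds}")   # True for all four features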


TABLE II
CLASS POSTERIOR PROBABILITIES AND FEATURE MIXED DISTRIBUTION

  FEATURE   RESULT     C1      C2      C3      C4    POSTERIOR Pe   FEATURE MIXED DISTRIBUTION
  X1          -       0.485   0.288   0.106   0.121     0.515            0.330
              +       0.955   0.007   0.022   0.015     0.045            0.670
  X2          -       0.533   0.367   0.050   0.050     0.467            0.150
              +       0.847   0.053   0.050   0.050     0.153            0.850
  X3          -       0.865   0.100   0.028   0.006     0.135            0.693
              +       0.652   0.100   0.098   0.150     0.348            0.307
  X4          -       0.800   0.100   0.050   0.050     0.200            0.600
              +       0.800   0.100   0.050   0.050     0.200            0.400
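The entries of Table II follow from Bayes' formula applied to the priors and the pattern matrix of Table I. A short sketch (illustrative, not part of the paper) that reproduces the posterior probabilities, the posterior error, the mixed distribution, and the common value Pe = 0.200:

import numpy as np

# Sketch: recompute Table II and the expected probability of error from Table I.
priors = np.array([0.80, 0.10, 0.05, 0.05])
p_pos = np.array([[0.800, 0.900, 0.250, 0.400],
                  [0.050, 0.450, 0.306, 0.400],
                  [0.300, 0.850, 0.600, 0.400],
                  [0.200, 0.850, 0.920, 0.400]])

for j in range(4):
    Pe = 0.0
    for result, p in (("-", 1.0 - p_pos[:, j]), ("+", p_pos[:, j])):
        joint = priors * p                  # pi_i * P(X_j = result | C_i)
        mixed = joint.sum()                 # mixed probability of this result
        posterior = joint / mixed           # class posterior probabilities
        error = 1.0 - posterior.max()       # posterior probability of error
        Pe += mixed * error
        row = " ".join(f"{q:.3f}" for q in posterior)
        print(f"X{j+1} {result}: {row}  error={error:.3f}  mixed={mixed:.3f}")
    print(f"X{j+1}: expected probability of error = {Pe:.3f}")      # 0.200 for every feature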

Fig. 1. Illustration of inequality (10). [Figure not reproduced; axes: p_tj (horizontal, with marks at 1/8 and 7/8) and p_ij (vertical).]

There exist situations, however, in which we can easily recognize that the conditions of Theorem 1 cannot be satisfied. For a given feature X, define the support set of X under class i, S_i, by

$S_i = \{x \mid P_i(x) > 0\}. \qquad (11)$

The conditions of Theorem 1 can only be satisfied if S_i ⊆ S_t for every i, i ≠ t. Otherwise, the supremum in (3) is not bounded. More explicitly, if there exists x_0 for which P_i(x_0) > 0 for some i ≠ t, and P_t(x_0) = 0, then the occurrence of x_0 will reduce π_t(x_0) to zero while π_i(x_0) > 0.

III. RESOLVING THE INSENSITIVITY OF THE Pe RULE

One way to resolve the insensitivity of the Pe rule is to cease breaking ties arbitrarily and to use instead a "good" tie-breaking rule. The following criterion has been found to be useful.

Tie-Breaking Rule: If two features X and Y have the same value for the expected probability of error, prefer the feature with the greater variance for the probability of error.

The variance of the probability of error for each of the features of Table I is summarized in Table III. The calculation for X1 is as follows:

$E\left[\left(1 - \max_i \pi_i(X_1)\right)^2\right] = (0.045)^2 \times 0.670 + (0.515)^2 \times 0.330 = 0.089 \qquad (12)$

$\mathrm{Var}\left[\text{probability of error for } X_1\right] = 0.089 - (0.200)^2 = 0.049, \qquad (13)$

which corresponds to a standard deviation of $\sqrt{0.049} = 0.221$.
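The same computation for the remaining features gives the spread values collected in Table III below; a quick numerical check (a sketch, not part of the paper) from the Table II entries:

import numpy as np

# Sketch: mean, variance, and standard deviation of the probability of error,
# using the (posterior error, mixed probability) pairs of Table II.
table2 = {"X1": [(0.515, 0.330), (0.045, 0.670)],
          "X2": [(0.467, 0.150), (0.153, 0.850)],
          "X3": [(0.135, 0.693), (0.348, 0.307)],
          "X4": [(0.200, 0.600), (0.200, 0.400)]}

for name, rows in table2.items():
    err = np.array([e for e, _ in rows])
    prob = np.array([p for _, p in rows])
    mean = float(err @ prob)                     # expected Pe, 0.200 for every feature
    var = float((err ** 2) @ prob) - mean ** 2   # e.g. about 0.049 for X1
    print(f"{name}: mean={mean:.3f}  variance={var:.3f}  std={var ** 0.5:.3f}")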

TABLE III
VARIANCE OF THE PROBABILITY OF ERROR

                          X1      X2      X3      X4
  EXPECTED VALUE         0.200   0.200   0.200   0.200
  STANDARD DEVIATION     0.221   0.112   0.098     0

This tie-breaking rule induces the same feature ordering that we would intuitively prefer. The reasoning behind this tie-breaking rule is derived from the fact that with high variance for Pe, and hence for max_i π_i, the posterior probability of the most probable class substantially increases or substantially decreases depending on the observed result. That is, a greater variance means a stronger confirmation¹ in the case that the current most probable class is indeed the true class, and a stronger deviation¹ from the current most probable class if it is not the true class. In this sense, the ranking of the features obtained by this tie-breaking rule (Table III) conforms to the detailed results listed in Table II.

Another way to resolve the insensitivity of the Pe rule is not to use it, but to use instead other feature evaluation rules such as those derived from information theory. Table IV lists the ordering induced by evaluating the information gain using Shannon's entropy and the quadratic entropy, e.g., Toussaint [7]. These rules do not have the deficiency of insensitivity because they directly take into account the entire posterior probability vector and not only the posterior probability of the most probable class. For our example, the Pe rule with the above tie-breaking rule, and Shannon's and the quadratic information gain rules yield the same feature ranking (Tables III and IV). Using the ε-equivalence concept [4], one can show that this result is not surprising. Experimental results with binary features [1] also reveal that on most occasions, the differences in ranking by these three rules are not significant.
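For reference, both information-gain criteria are simple functions of the same posterior vectors used above; a sketch (not from the paper; definitions as commonly used, with Shannon entropy in bits) of how they can be computed for a discrete feature:

import numpy as np

# Sketch: Shannon and quadratic information gain of a discrete feature,
# i.e., the expected reduction in uncertainty of the class probability vector.
def info_gains(priors, cond):
    """Returns (Shannon gain, quadratic gain); cond[i, x] = P_i(x)."""
    priors = np.asarray(priors, dtype=float)
    cond = np.asarray(cond, dtype=float)
    joint = priors[:, None] * cond
    mixed = joint.sum(axis=0)
    post = joint / mixed

    def shannon(p):
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def quadratic(p):
        return float(1.0 - np.sum(p ** 2))

    gain_s = shannon(priors) - sum(mixed[x] * shannon(post[:, x]) for x in range(cond.shape[1]))
    gain_q = quadratic(priors) - sum(mixed[x] * quadratic(post[:, x]) for x in range(cond.shape[1]))
    return gain_s, gain_q

# Applied column by column to the pattern matrix of Table I, both gains place
# X1 far ahead of the other features and assign zero to the irrelevant feature X4.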

IV. SUMMARY

The low sensitivity of the Pe rule to distinguish between good and better features was demonstrated and discussed. Conditions were stated under which features with significantly different discrimination power are considered as equivalent by the Pe rule. The main reason for the low sensitivity of the Pe rule lies in the fact that, directly, it depends only on the posterior most probable class and that, under the stated conditions, the prior most probable class remains the posterior most probable class regardless of the result for the observed feature. (The prior and posterior values for the class probabilities may be different; however, the a priori maximum and the a posteriori maximum are attained on the same class.) Theorem 1 states the necessary and sufficient conditions for this situation. It is shown that these conditions are quite often satisfied, particularly when one class has relatively high probability compared to the other classes. This may happen, for instance, in the advanced stages of a sequential classification process. When this situation occurs, the expected posterior probability of error is equal to the prior, and the considered feature has no advantage over features which may be entirely irrelevant to the classification problem while, in fact, it may have stronger differentiation power.

A rule for breaking ties was suggested to refine the feature ordering induced by the probability of error rule. By this rule, when two features have the same probability of error, the feature with the higher variance for the probability of error is preferred. Breaking ties in this manner was first used in a medical diagnosis system, and in a large number of simulations of sequential classification problems with binary features. In all cases this modification proved to be very effective.

¹ With a certain probability.


TABLE IV
COMPARISON OF RULES

                X1      X2      X3      X4
  Pe           0.20    0.20    0.20    0.20
  SHANNON      0.23    0.08    0.07      0
  QUADRATIC    0.07    0.02    0.01      0

REFERENCES

[1] M. Ben-Bassat, "Myopic policies in sequential classification," IEEE Trans. Comput., vol. C-27, pp. 170-174, 1978.
[2] M. Ben-Bassat and S. Gal, "Properties and convergence of a posteriori probabilities in classification problems," Pattern Recognition, vol. 9, pp. 100-107, 1977.
[3] M. Ben-Bassat, "Irrelevant features in pattern recognition," IEEE Trans. Comput., vol. C-27, pp. 746-749, 1978.
[4] M. Ben-Bassat, "ε-equivalence of feature selection rules," IEEE Trans. Inform. Theory, vol. IT-24, pp. 769-772, 1978.
[5] R. P. W. Duin, B. van Haresma, and L. Roosma, "On the evaluation of independent binary features," IEEE Trans. Inform. Theory, vol. IT-24, pp. 248-249, 1978.
[6] G. T. Toussaint, "Recent progress in statistical methods applied to pattern recognition," in Proc. 2nd Int. Joint Conf. on Pattern Recognition, 1974, pp. 479-488.
[7] G. T. Toussaint, "Feature evaluation with quadratic mutual information," Inform. Processing Lett., vol. 1, pp. 153-156, 1972.

Feature Processing by Optimal Factor Analysis Techniques in Statistical Pattern Recognition

GIACOMO DELLA RICCIA

Abstract-A factor analysis technique is described for processing features in statistical pattern recognition problems. From input features X = (X_1, ..., X_n) two sets of features X' = (X'_1, ..., X'_n) and U = (U_1, ..., U_n) are derived which can improve efficiency in usual classification and clustering algorithms. Our procedure is optimal with respect to a certain criterion imposed on the communalities of the factor analysis model.

Index Terms-Factor analysis, feature processing, Karhunen-Loeve, statistical pattern recognition.

Manuscript received September 7, 1978; revised January 17, 1979.
The author is with the Department of Mathematics, Ben-Gurion University of the Negev, Beer-Sheva, Israel.

I. INTRODUCTION

We stress a special characteristic of the factor analysis model [1], [2] which is particularly useful in statistical pattern recognition applications. The characteristic is that the new variables X' = (X'_1, ..., X'_n), extracted from initial variables X = (X_1, ..., X_n) by a factor analysis procedure, produce an enhancement of correlations existing in the given variables X. Such an enhancement provides better insight into the structure of the input data and therefore it can be used to improve accuracy in classification and clustering problems. Thus, we suggest an original approach to factor analysis where, in order to optimize this correlation amplification effect, we use the criterion that the sum of the so-called communalities must be minimum.

Another advantage of our optimality criterion is that it allows an easy numerical computation of the communality values by using simple convex programming algorithms. Hence, our method overcomes the so-called "estimation of communalities" problem which, as discussed by Watanabe [3], always represented a major source of difficulties in applying factor analysis procedures. Moreover, our approach offers the possibility of combining a factor analysis model with the interesting properties of Karhunen-Loeve expansion techniques [3]-[6].

II. THE OPTIMAL FACTOR ANALYSIS MODEL

Given n variables X = (X_1, ..., X_n) and their correlation matrix R = [r_ij], the purpose of factor analysis (FA) is to "reproduce" the correlations r_ij in terms of the smallest possible number m < n of variables F = (F_1, ..., F_m) on the basis of the following linear model:

$X_i = a_{i1}F_1 + \cdots + a_{im}F_m + d_i U_i, \quad i = 1, \ldots, n \qquad (1)$

where the common factors F = (F_1, ..., F_m) and the specific factors U = (U_1, ..., U_n) are m + n uncorrelated variables. For convenience all variables X, F, and U are normalized with mean value 0 and variance 1.

Let R' = [ρ_ij] be the covariance matrix of the variables X' = (X'_1, ..., X'_n), where

$X'_i = a_{i1}F_1 + \cdots + a_{im}F_m, \quad i = 1, \ldots, n.$

In view of the above conditions we have

$\rho_{ij} = (X'_i, X'_j) = (X_i, X_j) = r_{ij} = \sum_{k=1}^{m} a_{ik}a_{jk}, \quad i \ne j \qquad (2)$

whereas the diagonal terms

$\rho_{ii} = \|X'_i\|^2 = \sum_{k=1}^{m} a_{ik}^2 = c_i^2, \quad i = 1, \ldots, n \qquad (3)$

are the unknown communalities of the problem. We also have

$(X_i, X_i) = 1 = \|X'_i\|^2 + d_i^2 = c_i^2 + d_i^2, \quad i = 1, \ldots, n. \qquad (4)$

The problem then is to find communality values which minimize the algebraic rank m of R'. A numerical solution to this problem is difficult to obtain, and a number of successive approximation methods are described in the literature, which do not always converge or, if they do, do not necessarily lead to acceptable communalities and factor loadings a_ik, d_i which satisfy (2) and (3).

Instead of minimizing the rank of R', we suggest as a criterion for computing communality values to minimize tr(R') = trace of R'. This modified FA procedure can be useful in pattern recognition applications for the following reason. The correlation between X'_i and X'_j is

$r'_{ij} = \frac{r_{ij}}{\|X'_i\|\,\|X'_j\|} = \frac{r_{ij}}{c_i c_j}.$

Since, according to (4), communalities are always between 0 and 1, we have r'_ij ≥ r_ij with a correlation amplification coefficient equal to 1/(c_i c_j). Hence, if the correlation pattern in the initial data X is important in a pattern recognition process, the use of X' as extracted features should be more efficient. For instance, it may improve separation of classes or clusters which may exist in the original data.

The correlation enhancement property of the features X' is due to the fact that, according to (1), they were derived from X by filtering out uncorrelated information (and eventually random errors) represented by the variables U. One way of optimizing this filtering aspect of the FA model is to maximize the total variance of the extracted uncorrelated information and random errors, namely the quantity $\sum_{i=1}^{n} d_i^2$. According to (4) this is equivalent to the requirement that

$\sum_{i=1}^{n} c_i^2 = \mathrm{tr}(R') = \sum_{i=1}^{n} (1 - d_i^2) = n - \sum_{i=1}^{n} d_i^2$

be minimum, which is precisely the criterion we suggested before. We denote our procedure by OFA (optimal factor analysis) in order to underline the optimal filtering criterion involved in it.

In a situation where the minimum tr(R') is a very small fraction of the total variance n = tr(R) of X, it is reasonable to believe that the true data structure of X is more appropriately represented by the uncorrelated variables U, which should be considered as the useful features, whereas, in that case, the correlated components X' would play the role of noisy information to be discarded. This possible use of the specific factors U in pattern recognition problems is quite original, since in ordinary FA applications the specific factors are always treated only as correction terms in the expansion (1).

In [7] we consider the classical R. A. Fisher iris data classification problem, and the computational results we obtained demonstrated the error-reduction capability of X'. In the same paper we treated an example where it was found that the specific factors U, rather than X', were the useful extracted features.

III. COMPUTATIONAL ADVANTAGES OF THE OFA PROCEDURE

The characteristic property of a covariance matrix is to be nonnegative definite, that is to say that all its eigenvalues are nonnegative; in particular, it is positive definite if all its eigenvalues are strictly positive. We denote by R'[α] an n × n covariance matrix with fixed off-diagonal terms r_ij and communality values on the diagonal equal to α = (c_1², ..., c_n²). If R'[α] and R'[β] are two positive definite matrices, then

$\theta R'[\alpha] + (1 - \theta) R'[\beta] = R'[\theta\alpha + (1 - \theta)\beta], \quad 0 \le \theta \le 1$

is also positive definite. This shows that the set of communality values is a convex domain in n-dimensional space. On the boundary of the domain the covariance matrix becomes nonnegative definite with rank m < n. Our problem is to minimize the linear functional C = c_1² + ... + c_n² on the intersection of that domain and the closed unit cube 0 ≤ c_i² ≤ 1, i = 1, ..., n, hence on a closed bounded convex domain. This is easily solved by convex programming techniques leading to accurate numerical solutions. It is well known that a linear functional on a closed bounded convex domain achieves its minimum on a boundary point of the domain. Therefore, our criterion, which was introduced for optimal correlation enhancement, also produces a dimensionality reduction of the problem. The rank of R' when tr(R') is minimum is not necessarily equal to the minimum rank that R' can achieve. However, in pattern recognition problems, discrimination is more important than increased dimensionality reduction. Under certain conditions that will be discussed in a later paper, the rank of R', when tr(R') is minimum, is also minimum. Our simpler procedure for computing communalities can thus be useful in any FA application.
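As an illustration of this computational claim, one possible sketch (not the author's implementation; the solver choice and the example matrix are assumptions) minimizes tr(R') = c_1² + ... + c_n² subject to R'[c] remaining nonnegative definite and 0 ≤ c_i² ≤ 1, using a general-purpose constrained optimizer:

import numpy as np
from scipy.optimize import minimize

# Sketch: OFA communalities by minimizing tr(R') subject to R'[c] being
# nonnegative definite, with the off-diagonal correlations r_ij held fixed.
def ofa_communalities(R):
    """R: correlation matrix of X (unit diagonal); returns estimated c_i^2."""
    n = R.shape[0]

    def r_prime(c2):
        Rp = np.array(R, dtype=float)
        np.fill_diagonal(Rp, c2)
        return Rp

    # nonnegative definiteness <=> smallest eigenvalue of R'[c] is >= 0
    psd_constraint = {"type": "ineq",
                      "fun": lambda c2: float(np.linalg.eigvalsh(r_prime(c2)).min())}
    result = minimize(lambda c2: float(np.sum(c2)),       # objective: tr(R')
                      x0=np.full(n, 0.5),
                      method="SLSQP",
                      bounds=[(0.0, 1.0)] * n,
                      constraints=[psd_constraint])
    return result.x

# Example with a small hypothetical correlation matrix:
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
communalities = ofa_communalities(R)                      # estimated c_1^2, ..., c_n^2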

Once communality values are obtained, the factor loadings a_ik can be derived from a straightforward Karhunen-Loeve (KL) expansion procedure. In fact, it is well known that if λ_1, ..., λ_m are the nonzero eigenvalues of R', then in

$X'_i = a_{i1}F_1 + \cdots + a_{im}F_m, \quad i = 1, \ldots, n \qquad (5)$

the column vector a_k = (a_1k, ..., a_nk) is the eigenvector of R' associated with the eigenvalue λ_k, with the normalization

$\sum_{i=1}^{n} a_{ik}a_{il} = \lambda_k \delta_{kl}, \quad k, l = 1, \ldots, m$

where δ_kl = 1 if k = l and δ_kl = 0 otherwise. Moreover, one can eventually truncate the KL expansion (5) and, with the best approximation in a least mean-square error sense, obtain a further dimensionality reduction of the feature space.
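Continuing the previous sketch (illustrative names only): once R' is assembled from the fixed correlations and the estimated communalities, the loadings follow from its spectral decomposition, with each retained column scaled so that the normalization above holds.

import numpy as np

# Sketch: factor loadings from the KL (spectral) decomposition of R'.
# Column k of A is a_k = sqrt(lambda_k) * v_k, so sum_i a_ik a_il = lambda_k * delta_kl.
def ofa_loadings(R, communalities, tol=1e-8):
    Rp = np.array(R, dtype=float)
    np.fill_diagonal(Rp, communalities)
    lam, vec = np.linalg.eigh(Rp)             # eigenvalues in ascending order
    keep = lam > tol                          # keep only the nonzero eigenvalues
    A = vec[:, keep] * np.sqrt(lam[keep])     # loadings a_ik, one column per factor
    return A[:, ::-1]                         # largest factor first

# Truncating the trailing columns of A gives the further (least mean-square
# error) dimensionality reduction mentioned above.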

In [8] Watanabe also suggested a way to combine FA and KL expansion techniques. However, his method violates some basic assumptions in the FA model and it is applicable only when a certain condition called "the reach of a variable" is satisfied. Our procedure shows that, rather than competing with each other, OFA and KL techniques can usefully merge together without additional conditions.

IV. CONCLUSION

The suggested OFA procedure does not represent by itself a new classification or clustering technique. It is, rather, a processing method which can improve the quality of given features X and as such can be useful in statistical pattern recognition applications. Its advantages are derived from the fact that it establishes an optimal division between common information and uncorrelated information in the initial features and that it can easily be implemented numerically.

REFERENCES

[1] H. H. Harman, Modern Factor Analysis. Chicago, IL: Univ. of Chicago Press, 1967.
[2] A. L. Comrey, A First Course in Factor Analysis. New York: Academic, 1973.
[3] S. Watanabe, "Karhunen-Loeve expansion and factor analysis," in Transactions of the Fourth Prague Conference, J. Kozesnik, Ed. Prague, Czechoslovakia: Czechoslovak Acad. Sci., 1967.
[4] J. Kittler and P. C. Young, "A new approach to feature selection based on the Karhunen-Loeve expansion," Pattern Recognition, vol. 5, pp. 335-352, 1973.
[5] J. M. Mendel and K. S. Fu, Adaptive, Learning and Pattern Recognition Systems. New York: Academic, 1970.
[6] K. Fukunaga and W. L. G. Koontz, "Applications of the KL expansion to feature selection and ordering," IEEE Trans. Comput., vol. C-19, pp. 311-318, 1970.
[7] G. Della Riccia, F. de Santis, and M. Sessa, "Optimal factor analysis and pattern recognition: Applications," in Proc. 1978 Int. Conf. on Cybernetics and Society, Tokyo, Japan, Nov. 1978.
[8] S. Watanabe, Knowing and Guessing. New York: Wiley, 1969, pp. 547-552.

A Decision Theory Approach to the Approximation of Discrete Probability Densities

DIMITRI KAZAKOS AND THEODORE COTSIDAS

Abstract-The problem of approximating a probability density function by a simpler one is considered from a decision theory viewpoint. Among the family of candidate approximating densities, we seek the one that is most difficult to discriminate from the original. This formulation leads naturally to the density at the smallest Bhattacharyya distance. The resulting optimization problem is analyzed in detail.

Index Terms-Histogram reduction, probability density approximation.
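For discrete densities p and q the Bhattacharyya distance is B(p, q) = -ln Σ_x √(p(x) q(x)); a minimal sketch (not from the paper; names are illustrative) of the selection step described in the abstract:

import numpy as np

# Sketch: pick, from a family of candidate discrete densities, the one at the
# smallest Bhattacharyya distance from the original density p.
def bhattacharyya_distance(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.log(np.sum(np.sqrt(p * q))))

def best_approximation(p, candidates):
    return min(candidates, key=lambda q: bhattacharyya_distance(p, q))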

Manuscript received August 28, 1978; revised May 14, 1979. This work was supported by NSF Grant ENG 76-20295.
The authors are with the Department of Electrical Engineering, State University of New York at Buffalo, Amherst, NY 14260.
