
An Algorithm for Nonsupervised Pattern Classification

ISRAEL GITMAN

IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-3, no. 1, January 1973.

Manuscript received April 10, 1972; revised July 7, 1972. The author was with Bell-Northern Research, Ottawa, Ont., Canada. He is now with Network Analysis Corporation, Glen Cove, N.Y. 11542.

Abstract-An algorithm for classifying a data set into an initially unknown number of categories is presented. It is composed of a procedure for selecting initial points, a mode estimation procedure, and a classification rule. An integer-valued function is defined on the sample space and a gradient search technique is used for estimating its modes. A procedure for mode estimation in the case of an infinite data set is also proposed. Sufficient conditions for the convergence to the neighborhood of the modes have been stated. The algorithm was used for clustering multicategory artificially generated data sets and was compared with an optimal classification scheme.

I. INTRODUCTION

TWO DIFFERENT problems in nonsupervised pattern classification (clustering, categorization) can be identified: 1) one in which the number of categories (classes) is known [4]-[6], [13]; and 2) one in which the number of categories is initially not known [2]-[8], [10], [11], [14], [16]. Some of the clustering techniques are composed of two separate parts: a) a mode estimation procedure; and b) a pattern classification rule. The task of estimating the modes is usually the more difficult one, in particular when no a priori information about the underlying data set is available and when the number of modes is not known.

A nonsupervised classification algorithm into an initially unknown number of categories has been proposed by Gitman and Levine [10]. This algorithm classifies the given data set into so-called unimodal fuzzy sets. The distinguishing characteristic of this technique is that it associates with every sample point a characteristic value proportional to the local concentration of points. This characteristic value contributes additional information for discriminating between clusters. In this paper a new clustering technique is proposed. It uses the same concept of characteristic value; however, the definition of a local maximum (mode) and the procedure for detecting it are different.

A local maximum in [10] was defined as a sample point x^l with the highest characteristic value in its neighborhood. The necessary conditions for detecting x^l are that it be an interior point in a symmetric discrete fuzzy set. That is, if x^1 and x^2 are two sample points in the neighborhood of x^l, and if x^1 is nearer to x^l than x^2, then x^1 must have a larger characteristic value than x^2. In this paper the local maximum is not required to be the sample point with the highest characteristic value in its neighborhood, and no symmetry conditions in its neighborhood are required. These advantages have practical significance, for conditions such as 1) and 3) of Theorem 1 or condition 2) of Theorem 4 in reference [10] are not necessary for detecting a mode.

This algorithm is composed of the following procedures: 1) a selection of initial points; 2) a gradient search technique; and 3) a classification rule. The technique proceeds as follows.


An integer-valued function g is defined on the sample space. A set of initial points is selected and a gradient search technique is applied to each of these, aiming to estimate the local maxima of g. Since a local maximum of g is not necessarily a unique point, we apply an elimination procedure on the set of final points arrived at. The points not eliminated are considered as local maxima of g. The classification rule of [10] is then used to partition the data set. The conditions under which convergence to the neighborhood of the modes is obtained, and the conditions on the procedure for selecting initial points which ensure the detection of all modes, are stated in the theorems.

The algorithm was used for clustering data sets drawn from ellipsoidal normal distributions defined in a two-dimensional space. The results were compared with those obtained by using an optimum classification technique. The procedure for selecting initial points was compared with one that uses a uniform distribution over the domain to which the data set is confined. Experiments for determining the sensitivity to changes in parameters are also reported.

II. THE ALGORITHM

A. Mode Estimation Technique

Let X be an L-dimensional metric space, with the unit base vectors e_1, e_2, ..., e_L, and x ∈ X. S = {x^1, ..., x^i, ..., x^N} is the given set of N data points in X to be clustered. (Superscripts will be used to denote sample points of S; subscripts will be used to denote any other point in X.) Define the integer-valued function g(T,x):

    g(T,x) = |{y ∈ S : d(x,y) ≤ T}|    (1)

where the value of g is the number of sample points in the neighborhood T of x. T is a specified parameter and d is the metric. The usual metric will be used, that is, d(x,y) = ||x − y||, where ||x|| is a norm. This function g was adopted because of its intuitive meaning in the clustering problem. It associates high values with points in the center of a cluster and low values with what are sometimes called "wild shots."
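The characteristic value is straightforward to compute directly from its definition. The following minimal sketch (ours, not the author's program; function and variable names are assumptions) evaluates g(T,x) for a small two-dimensional sample:

```python
import numpy as np

def g(x, S, T):
    """Characteristic value (1): the number of sample points of S within
    distance T of x, using the Euclidean norm."""
    return int(np.sum(np.linalg.norm(S - x, axis=1) <= T))

# Tiny illustration: N = 5 points in L = 2 dimensions.
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.1, 5.0]])
print(g(np.array([0.0, 0.0]), S, T=1.5))  # prints 3: the cluster at the origin
```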

The objective in this section is to estimate all the maxima of g. A maximum of g does not necessarily correspond to a unique point in X since g is an integer-valued function. The following iterative procedure is used for mode estimation. Given the current estimate of a mode, x_n, the next estimate, x_{n+1}, is evaluated by the gradient search technique [12]:

    x_{n+1} = x_n + p_n P_n    (2)

where {p_n} is a suitable scalar gain sequence and P_n is an average slope of g at x_n. This slope is given by

    P_n = (1/2c) Σ_{j=1}^{L} e_j [g(T, x_n + c e_j) − g(T, x_n − c e_j)]    (3)

where c is a real constant. A necessary condition for the convergence of x_n to the neighborhood of a maximum of g requires that the sequence {p_n} satisfy the following conditions:

    Σ_n p_n = ∞    (4a)
    p_{n+1} ≤ p_n    (4b)
    lim_{n→∞} p_n = 0.    (4c)
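Equations (2)-(4) translate into a short iteration. The sketch below is an illustration under stated assumptions rather than the paper's implementation: it uses differences of size c for the average slope (3), the decaying gain p_n = p_0/n that the paper later adopts in (31), and the projection onto the bounding hypercube D described next in (5)-(6); all names are ours.

```python
import numpy as np

def g(x, S, T):
    # Characteristic value (1): number of sample points within distance T of x.
    return int(np.sum(np.linalg.norm(S - x, axis=1) <= T))

def estimate_mode(x0, S, T, c=3.0, p0=0.2, n_iter=30):
    """Gradient search (2)-(3) from an initial point x0 (a sketch, not the
    paper's program).  The gain p_n = p0/n satisfies conditions (4a)-(4c)."""
    L = S.shape[1]
    lo, hi = S.min(axis=0), S.max(axis=0)     # hypercube D bounding S, as in (5)
    e = np.eye(L)                             # unit base vectors e_1, ..., e_L
    x = np.array(x0, dtype=float)
    for n in range(1, n_iter + 1):
        # Average slope (3): P_n = (1/2c) sum_j e_j [g(x + c e_j) - g(x - c e_j)].
        P = np.array([g(x + c * e[j], S, T) - g(x - c * e[j], S, T)
                      for j in range(L)]) / (2.0 * c)
        x = x + (p0 / n) * P                  # update (2) with gain p_n = p0/n
        x = np.clip(x, lo, hi)                # orthogonal projection onto D, cf. (6)
    return x
```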

The bounds of the hypercube D which contains the sample set S are computed. The bounds p_j' and p_j'', j = 1,2,...,L, are given by

    p_j' = min_i (x^i · e_j),    p_j'' = max_i (x^i · e_j).    (5)

Whenever a mode estimate obtained by (2) falls outside D, it is projected orthogonally onto it by

    x_{nj} ← p_j',  if x_{nj} < p_j';    x_{nj},  if p_j' ≤ x_{nj} ≤ p_j'';    p_j'',  if x_{nj} > p_j''    (6)

where x_{nj} is the jth component of x_n.

In some cases the given data set is of infinite (or very large) size and may be characterized by an unknown probability density function f(x). One may then use a different random sample of finite size N after each iteration for evaluating g(T,x). However, in this case g is a random variable with the expected value

    E[g(T,x)] = N ∫_{T_x} f(x) dx.    (7)

The variance of g is finite, and it may be assumed that the observations are unbiased. Under these conditions, a stochastic approximation technique of the type [12], with a constant parameter c [17, ch. 6], can be used for estimating x°, which maximizes f. In this paper, a single finite-size sample has been assumed; hence g is deterministic. Thus, although the gradient technique used is similar in form to that in [12], the conditions on the sequence {p_n} can be eased.

Some definitions which will be used in the theorems are now introduced. Suppose that g(T,x) is a unimodal integer-valued function defined on a one-dimensional space. Given any x ∈ D, define the nondisjoint set

    R_x = {y ∈ D | y ≥ x and g(T,y) = g(T,x)}.    (8)

Let r(x) denote the diameter of R_x, that is,

    r(x) = sup {d(x,y) | x, y ∈ R_x}    (9)

    r_s = max_D r(x).    (10)

Furthermore, denote by R_0 the set (interval) on which g(T,x) attains its maximum value, by r_0 the diameter of R_0, and by x_m the midpoint of R_0. Using the gradient technique (2) with the average slope (3) and a sequence {p_n} which satisfies (4), one can state the following theorem.

Theorem 1: If

    c ≥ r_s/2    (11)

then

1) there exists a finite n_0 such that, for n > n_0,

    ||x_n − x_m|| ≤ c    (12)


2)  lim_{n→∞} ||x_n − x_m|| ≤ c − r_0/2.

Proof: Define

    R_c = {x | ||x − x_m|| ≤ c}
    R_s = {x | ||x − x_m|| ≤ c − r_0/2}.

Equation (11) implies that ||P_n|| > 0 for all x ∈ D. Also, since

    max_{x∈D} g(T,x) ≤ N    (13)

one obtains

    0 < ||P_n|| ≤ N/2c    (14)

for all x ∈ D. As a consequence of (14), it is ensured that the effective step size p_n ||P_n|| satisfies conditions (4a) and (4c). That is, there is infinite correction effort, and the correction effort reduces to zero in the limit.

By (4), there exists an integer n_1 such that

    p_{n_1} N/2c < r_0/2.    (15)

For n ≥ n_1, there is no possibility of overshooting. Thus there exists an n_0 ≥ n_1 such that

    ||x_{n_0} − x_m|| ≤ c.    (16)

For n ≥ n_0 one has:

a) if x_n is in R_s, then x_{n+1} is in R_c, by (15);
b) if x_n is in (R_c − R_s), then

    ||x_{n+1} − x_m|| ≤ ||x_n − x_m||    (17)

thus x_{n+1} is in R_c.

This proves 1). To prove 2), let

    R_ε = {x | ||x − x_m|| ≤ c − r_0/2 + ε}

for any ε ≥ 0. From (4), there exists an integer n_ε and an n_s > n_ε such that

    p_{n_s} N/2c < ε    (18)
    ||x_{n_ε} − x_m|| ≤ c − r_0/2 + ε.    (19)

To establish 2) we use arguments a) and b), with n_s and R_ε replacing n_0 and R_c, respectively.

Corollary 1: If r_0 > r_s/2 and r_0 ≥ c ≥ r_s/2, then R_s ⊂ R_0, and, consequently,

    lim_{n→∞} g(T,x_n) = max_{x∈D} g(T,x) = g(T,x_m).    (20)

Suppose that f is a unimodal probability density function which characterizes the given data set, with maximum at x°. Then one can state the following.

Corollary 2: If

    g(T,x_1) ≥ g(T,x_2) ⟹ f(x_1) ≥ f(x_2),  for all x_1, x_2 ∈ D    (21)

then

    lim_{n→∞} ||x_n − x°|| ≤ c.    (22)

Corollary 2 is an immediate consequence of Theorem 1, since (21) implies that ||x_m − x°|| ≤ r_0/2. This corollary demonstrates that it is possible to come as close as one wishes to x°, provided r_s is sufficiently small and (21) is valid.
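As an informal check of Theorem 1 and Corollary 2 in one dimension, the following self-contained sketch (ours; not from the paper) runs iteration (2)-(3) on a unimodal sample and tests whether the final estimate lies within c of the underlying mode:

```python
import numpy as np

def g(x, S, T):
    return int(np.sum(np.abs(S - x) <= T))   # characteristic value (1), L = 1

rng = np.random.default_rng(1)
S = rng.normal(loc=10.0, scale=2.0, size=200)  # unimodal sample, mode near 10

c, p0, T = 1.0, 0.3, 2.0
x = 6.0                                        # a deliberately poor start
for n in range(1, 201):
    slope = (g(x + c, S, T) - g(x - c, S, T)) / (2.0 * c)  # slope (3)
    x += (p0 / n) * slope                                  # update (2)

print(abs(x - 10.0) <= c)  # typically True: the estimate ends within c of the mode
```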

B. Selection of Initial Points

One begins by constructing the sequence A = {y^1, y^2, ..., y^N} of the sample points of S, ordered according to their characteristic values. That is, g(T,y^i) ≥ g(T,y^j), for i < j. Then the following procedure is used.

Procedure: Select y^1; eliminate from A all the sample points in Γ_{y^1} and denote the new ordered sequence by A_1. Select the first point of A_1, say, y^k; eliminate from A_1 all sample points in Γ_{y^k} to result in A_2. This process is continued until, for some i, there are no points in A_i.

This procedure ensures that there is an initial point within the T neighborhood of any sample point of S. It also ensures that some of the initial points will be located in a small (less than T) neighborhood of the modes, because from every Γ_{y^k} in the procedure the sample point with the highest characteristic value is selected.
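A direct reading of the selection procedure, taking Γ_y to be the T-neighborhood of y (an assumption consistent with the guarantee that every sample point has an initial point within distance T), might look as follows:

```python
import numpy as np

def select_initial_points(S, T):
    """Initial-point selection (a sketch): order the sample by characteristic
    value, then repeatedly keep the highest-valued remaining point and delete
    every sample point within distance T of it."""
    vals = np.array([int(np.sum(np.linalg.norm(S - x, axis=1) <= T)) for x in S])
    order = np.argsort(-vals)                 # the sequence A, decreasing g(T, y)
    remaining = set(range(len(S)))
    chosen = []
    for i in order:
        if int(i) not in remaining:
            continue
        chosen.append(int(i))                 # first point of the current A_k
        # Eliminate the T-neighborhood of the chosen point (our reading of Gamma_y).
        near = np.where(np.linalg.norm(S - S[i], axis=1) <= T)[0]
        remaining.difference_update(near.tolist())
    return S[chosen]
```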

Suppose now that g(T,x) is a multimodal integer-valued function, with K modes, defined in a one-dimensional space. Denote by x_{m_i} and r_{0_i}, i = 1,2,...,K, the midpoint and diameter of mode i, respectively, and let r_0 = min_i r_{0_i} and r_u = max_i r_{0_i}. Let {Γ^1, Γ^2, ..., Γ^K} denote a partition of D into sets over each of which g(T,x) has only one mode. Denote by l_i the location in the sequence A of the sample point with the highest characteristic value in Γ^i, i = 1,2,...,K. One may assume that the modes of g are ordered according to their characteristic values g(T,y^{l_i}). Finally, define A_K = {y^{l_1}, y^{l_2}, ..., y^{l_K}}, and let A_J = {x_0^1, x_0^2, ..., x_0^J} denote the set of initial points selected by the procedure. The sufficient conditions which ensure convergence, in the sense of Theorem 1, to all the modes of g are now stated.

Theorem 2: If the conditions a)-d)

a)  ||y^{l_i} − y^{l_j}|| > T,  for i = 2,3,...,K, j = 1,2,...,i−1    (23)
b)  there exists at least one sample point from each R_{0_i}, i = 1,2,...,K
c)  c ≥ r_u/2
d)  p_0 < r_0 c/N    (24)

are satisfied, then

1) there exists a finite n_0 such that, for n ≥ n_0,

    min_i ||x_n^j − x_{m_i}|| ≤ c,  for all j = 1,2,...,J    (25a)
    min_j ||x_n^j − x_{m_i}|| ≤ c,  for all i = 1,2,...,K    (25b)

2)  lim_{n→∞} min_i ||x_n^j − x_{m_i}|| ≤ c − r_{0_i}/2,  for all j = 1,2,...,J    (26a)
    lim_{n→∞} min_j ||x_n^j − x_{m_i}|| ≤ c − r_{0_i}/2,  for all i = 1,2,...,K.    (26b)


[Fig. 1: a one-dimensional sketch with the local maxima x_1^0 and x_2^0 and the separating points h_1 and h_2 marked on the axis.]

Fig. 1. Optimal partition for two different classification rules given local maxima x_1^0 and x_2^0. h_1 denotes the separating point when using the "nearest distance to local maximum" rule, and h_2 when using the rule of this algorithm.

Proof: By definition and condition c), R_{0_i} ⊂ R_c for all i. The procedure for selecting initial points and condition a) ensure that y^{l_i} ∈ A_J, for i = 1,2,...,K; hence A_K ⊂ A_J. It is also claimed that y^{l_i} ∈ R_{0_i}. For, suppose y^{l_i} ∉ R_{0_i}; then by b) there exists some y ∈ R_{0_i}, and the definition of R_{0_i} implies that g(T,y) > g(T,y^{l_i}). Thus y will precede y^{l_i} in the sequence A, which contradicts the definition of y^{l_i}.

Consider any initial point x_0^i = y^{l_i} ∈ A_K. Conditions c) and d) and arguments similar to a) and b) of the proof of Theorem 1 result in, for n ≥ 1,

    ||x_n^i − x_{m_i}|| ≤ c,  for all i = 1,2,...,K.    (27)

This proves (25b). A proof similar to that of 2) of Theorem 1 can be used to obtain (26b).

Consider any initial point x_0^j ∈ (A_J − A_K). Using condition c) and (4) one may apply Theorem 1 to prove (25a) and (26a).

Conditions a) and b) of Theorem 2 ensure that at least one initial point sufficiently close to each mode will be selected. By bounding the maximum correction step (condition d)), it is ensured that these points remain within a distance c from the modes. All other initial points may "cross" the boundaries from Γ^i to Γ^j, i ≠ j. However, condition c) and (4) ensure that each of these points will converge, in the sense of Theorem 1, to some mode.

Remark: The integer-valued function g, under the conditions stated and the gradient technique used, behaves as a continuous function with no stationary points. As a result, an estimate x_n will oscillate in the neighborhood of the mode. Gradient search techniques are naturally extendable to a multidimensional space [17, ch. 4]. Therefore, it may be possible to establish results similar to those of Theorems 1 and 2 for the multidimensional case.

C. Classification Rule

The classification rule used is essentially the same as in [10], with the exception that the modes detected are not necessarily sample points of S. The modes detected are introduced into the sequence A according to their values g(T,x_n). The sample points are now assigned to clusters in the order that they appear in A, using the following rule.

Rule: Assign y^j of location j in A into the cluster in which its nearest neighbor with a higher characteristic value (all points that precede y^j in A) has been assigned. This rule applies to all sample points except for the modes, which generate new clusters.

This classification rule partitions S into subsets such that the characteristic function over each subset has only one mode. Sufficient conditions under which an optimal partition, in the sense of maximum separation [18], is obtained are stated in [10]. The advantage of this rule over others (see [1], [15]) is its capability of identifying a cluster whose closure forms a nonconvex set in the sample space. As an example, Fig. 1 shows the optimal partitions obtained using this rule and the rule which assigns each sample point to the nearest local maximum, given x_1^0 and x_2^0.
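A simplified rendering of the rule is sketched below. Unlike the paper, it assumes the detected modes coincide with sample points and are supplied by index; mode_indices and the other names are ours:

```python
import numpy as np

def classify(S, T, mode_indices):
    """Classification rule (a simplified sketch): points are processed in order
    of decreasing characteristic value; the supplied modes open new clusters,
    and every other point joins the cluster of its nearest neighbor among the
    points already labeled (those that precede it in A)."""
    vals = np.array([int(np.sum(np.linalg.norm(S - x, axis=1) <= T)) for x in S])
    order = np.argsort(-vals)                 # the sequence A
    labels = -np.ones(len(S), dtype=int)
    next_label = 0
    for pos, i in enumerate(order):
        if int(i) in mode_indices or pos == 0:
            labels[i] = next_label            # a mode generates a new cluster
            next_label += 1
            continue
        preceding = order[:pos]               # points with higher characteristic value
        d = np.linalg.norm(S[preceding] - S[i], axis=1)
        labels[i] = labels[preceding[np.argmin(d)]]
    return labels
```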

III. EXPERIMENTS

A. General

This algorithm was applied to clustering data sets drawn from ellipsoidal normal distributions defined in a two-dimensional space. The advantage in using an artificially generated data set is that two reference partitions with which to compare the results are available. One is the partition in which the sample drawn from each distribution is used as a reference category, and the other is the partition obtained by optimally classifying the generated data set. The former was used as the reference partition, and the error of the latter was used, for comparison, as an indication of the difficulty of the specific data set for clustering.

The optimum classification was obtained by using the optimum Bayes recognition procedure [9]: assign x^k to category w_i if

    h_i(x^k) > h_j(x^k),  for all j ≠ i    (28)

where

    h_i(x) = log P(w_i) p(x | w_i),  i = 1,2,3.    (29)

P(w_i), i = 1,2,3, are the a priori probabilities (assumed equal), and

    p(x | w_i) = (2π)^{−L/2} |Σ_i|^{−1/2} exp [−(1/2)(x − μ_i)' Σ_i^{−1} (x − μ_i)],  i = 1,2,3.    (30)

The covariance matrices Σ_i and the mean values μ_i, i = 1,2,3, in (30) were assumed known.

Two types of errors were used to grade the partitions.

1) E_m [10], the mixing error, defines the error caused by points of reference category i being assigned to the same cluster as points of reference category j, i ≠ j.
2) E_t, the total error, is determined by assuming that the largest subset of each reference category in the partition is correctly classified. All the remaining points are classified in error. However, one may not assume more than one subset from a cluster as being correctly classified.

Each of the data sets used consists of three categories, each containing 100 points.
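Equations (28)-(30) above are the standard Gaussian Bayes discriminants; a minimal sketch, with the means, covariances, and priors supplied by the caller, is:

```python
import numpy as np

def h(x, mean, cov, prior):
    # Discriminant (29) with the Gaussian density (30): log P(w_i) + log p(x | w_i).
    L = len(mean)
    diff = x - mean
    return (np.log(prior)
            - 0.5 * L * np.log(2.0 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov))
            - 0.5 * diff @ np.linalg.inv(cov) @ diff)

def bayes_classify(x, means, covs, priors):
    # Decision rule (28): pick the category with the largest discriminant.
    return int(np.argmax([h(x, m, C, P) for m, C, P in zip(means, covs, priors)]))
```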


TABLE I
COVARIANCE MATRICES OF THE THREE DISTRIBUTIONS FOR EACH TEST
(2 x 2 diagonal matrices; entries marked "?" are not legible in the scan)

            Σ_1             Σ_2             Σ_3
TEST 1   diag(50, 50)    diag(50, 50)    diag(50, 50)
TEST 2   diag(10, 20)    diag(25, 10)    diag(?, 10)
TEST 3   diag(?, ?)      diag(?, ?)      diag(?, ?)

Fig. 2. Data sets for the three tests. (a) Test 1. (b) Test 2. (c) Test 3. +: points of reference category 1; □: points of reference category 2; •: points of reference category 3. [The scatter plots themselves are not reproducible from the scan.]


The coordinates of the mean values of the distributions in each data set are μ_1 = (32.64, 45.84), μ_2 = (56.25, 74.94), and μ_3 = (18.36, 35.70). A maximum of 30 iterations per initial point and a maximum of 25 initial points were specified. An elimination procedure was applied to the resulting modes. The justification for this is that a mode of g is not a unique point in X. It also compensates for not specifying a sufficient number of iterations. The elimination procedure is the same as that used for selecting initial points. This implicitly assumes that the pairwise distances between modes are greater than or equal to T.

The core memory required by the computer program is L(2 + K) + N + S words, where S is the storage for the data set and K is the maximum number of initial points, when specified. The program was written so that g(T,x) is evaluated at x_n, and only modifications in g are made for evaluating g(T, x_n ± ce_j), j = 1,2,...,L. If t denotes the computing time for one sample point in one dimension, then the computing time per iteration is approximately 3LNt.

B. Application to Three Data Sets

This section considers the results obtained from the application of the technique to three data sets. The errors E_m and E_t are treated as performance measures of the technique as a whole. The contribution to E_m and E_t from each of the three procedures that constitute the technique has not been estimated. Furthermore, for comparison, the proposed technique is used again, with the exception that the initial points are selected by using a uniform distribution over the domain to which the data set is confined. In the latter case, the same number of initial points (i.e., 25) was used. Also, the same elimination procedure was employed on the final modes.

The data set for each test consists of sample points drawn from three ellipsoidal normal distributions. The covariance matrices of the distributions for each of the three tests are given in Table I. The corresponding data sets that were generated are shown in Fig. 2(a), (b), and (c). In each of the tests performed, the same set of parameter values was used:

    T² = 80.0;    p_0 = 0.2;    c = 3.0.

The parameter p decays with iterations as

    p_n = p_0 n^{−1},  n = 1,2,...    (31)

Table II contains detailed information on the partitions of the data for each test. The first, second, and third column of each test give the results obtained by the optimal technique, the proposed technique, and the technique which uses a uniform distribution for selecting initial points, respectively. The clusters of each partition (in Tables II and III) are ordered vertically according to the number of points contained. Each cluster is split horizontally according to the number of points contained from each reference category. The circled numbers indicate the number of correctly classified points, used for determining E_t. E_m and E_t represent the errors of the partitions, given in percentages.

TABLE II
PARTITIONS OF THE DATA FOR EACH TEST
(the per-cluster point counts are not legible in the scan; the error rows read:)

               TEST 1                  TEST 2                  TEST 3
           OPT   FUNC   UNIF       OPT   FUNC   UNIF       OPT   FUNC   UNIF
E_m [%]   8.00  24.33  26.00     30.00  30.00  35.00      7.66  18.00  16.33
E_t [%]   8.00  42.00  54.00     30.00  44.00  54.33      7.66  25.66  35.33

(OPT: optimal classification; FUNC: initial points selected by the characteristic function; UNIF: initial points selected from a uniform distribution.)

A comparison of the second and third columns of each test (Table II) shows that the number of clusters that result when using a uniform distribution is always greater than that which results when using the proposed procedure for selecting initial points. Also, the former yielded higher E_m and E_t values, except for E_m in test 3. This is due to the fact that some of the initial points generated by the uniform distribution are located in low-density regions. Another contributing factor is the limited number of iterations performed.
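For readers who want to reproduce the setup, a sketch of the data generation is given below; the means are those quoted above, while the diag(50, 50) covariances are an assumption read from the partly legible Table I (Test 1):

```python
import numpy as np

rng = np.random.default_rng(0)
means = [(32.64, 45.84), (56.25, 74.94), (18.36, 35.70)]   # from the text above
covs = [np.diag([50.0, 50.0])] * 3   # assumption: Test 1 entries read from Table I

# 100 points per reference category, as in the paper's experiments.
data = np.vstack([rng.multivariate_normal(m, C, size=100)
                  for m, C in zip(means, covs)])
labels = np.repeat([0, 1, 2], 100)
```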


Fig. 3. Partitions obtained by the algorithm. (a) Test 1. (b) Test 2. (c) Test 3. The form of the sketched boundaries has no quantitative significance. [The plots themselves are not reproducible from the scan.]

The error E_m in tests 2 and 3 is relatively low and compares favorably with the optimal error. The large difference between the errors in columns 1 and 2 of test 1 is due to the fact that the partition of the space associated with the optimal classification generates disjoint subsets, whereas in the clustering problem it is explicitly required to obtain nondisjoint subsets. The error E_t is relatively high. This is due mainly to the high degree of overlap between data points belonging to the various reference categories.

Fig. 3(a), (b), and (c) shows the partitions of the three data sets obtained by the proposed technique. The form of the sketched boundaries has no quantitative significance. It can be seen that the closures of some of the clusters form nonconvex sets. This is a characteristic of the classification rule used.

C. Sensitivity Analysis

The aim of the series of experiments in this section was to determine the sensitivity of the partitions to changes in the parameters T, p_0, and c. For this purpose, the data set of test 1 was employed. The proposed algorithm was then used to partition the data set for different values of the parameters. Table III shows the partitions obtained and the parameters used.


TABLE III
SENSITIVITY OF THE PARTITION OF THE TEST 1 DATA SET TO THE PARAMETERS
(the per-cluster point counts are not legible in the scan; the parameter combinations and error rows read:)

              T² = 60                     T² = 80                     T² = 100
         p0=0.1  p0=0.2  p0=0.3     p0=0.1  p0=0.2  p0=0.3     p0=0.1  p0=0.2  p0=0.3
         c=4.0   c=3.0   c=2.0      c=4.0   c=3.0   c=2.0      c=4.0   c=3.0   c=2.0
E_m [%]  16.33   16.33   16.66      24.33   24.33   24.33      20.00   20.00   20.00
E_t [%]  39.33   39.00   39.00      42.00   42.00   42.00      39.66   39.66   39.66

An examination of Table III shows that for a fixed value of T², very little change in the partitions resulted from varying p_0 and c. In fact, for T² = 80 or T² = 100, identical partitions were obtained for the three different combinations of p_0 and c. It must be emphasized that this does not imply that the results do not depend on p_0 and c. It does, however, indicate that there is a fairly large range of values of these parameters for which the results show almost no variation.

The changes in the results due to variations in the value of T² are larger than any variations observed when p_0 and c were altered. However, when the changes in the results due to the variation of T² are considered together with the size of the variation (i.e., T² = 60 to T² = 100) and the fact that there are no clear boundaries between the reference categories, it does not appear that the partitions that resulted are very sensitive to variation in T². However, it is clear that T² is the dominant parameter among the three.

Some knowledge about the choice of T can be gained by examining the distribution of pairwise distances [7]. Furthermore, given a sample from two distributions with known covariance matrices, there is a minimum distance between the mean values under which the mixture distribution becomes unimodal. This can be used for choosing the parameter T.

IV. CONCLUSIONS

An algorithm for nonsupervised pattern classification into an initially unknown number of clusters has been proposed and examined. The algorithm is composed of a procedure for selecting initial points, a gradient search procedure, and a classification rule. The first two procedures estimate the modes of an integer-valued function defined on the sample space. With these modes determined, a classification rule that has been previously proposed [10] is then employed for partitioning the data set. Sufficient conditions for the detection of all modes are stated.

The technique was applied to clustering data sets drawn from ellipsoidal normal distributions. The total error E_t resulting from this technique was relatively high with respect to that resulting from optimal classification. This is most likely due to the high degree of overlap between data points of the various reference categories.

The sensitivity analysis demonstrates that the technique is not sensitive to perturbation of the parameters p_0 and c in the range of values considered. The changes resulting from the variations of parameter T² are not large when the range of T² used is considered. However, T² seems to be the dominating parameter, and users may have to try several values before drawing definite conclusions from the partitions obtained. The proposed technique partitions a data set into clusters on the basis of the difference in the density of points in the sample space. This is inherent in the characteristic function used. As a result, one may encounter difficulties when the density of sample points in the space is almost uniform.

ACKNOWLEDGMENT

The author is grateful to C. A. Dykema of CDC, San Diego, Calif., and C. S. Jayasuriya of Bell-Northern Research, Ottawa, Ont., Canada, for their assistance in preparing the final manuscript.

REFERENCES

[1] G. H. Ball, "Data analysis in the social sciences: what about the details?," in 1965 Fall Joint Computer Conf., AFIPS Conf. Proc., vol. 27, pt. 1. Washington, D.C.: Spartan, 1965, pp. 533-559.
[2] G. H. Ball and D. J. Hall, "ISODATA, a novel method for data analysis and pattern classification," Stanford Res. Inst., Menlo Park, Calif., Apr. 1965.
[3] R. E. Bonner, "On some clustering techniques," IBM J. Res. Develop., vol. 8, pp. 22-32, Jan. 1964.
[4] D. B. Cooper and P. W. Cooper, "Adaptive pattern recognition and signal detection without supervision," in 1964 IEEE Int. Conv. Rec., pt. 1, pp. 246-256.
[5] A. A. Dorofeyuk, "Teaching algorithms for a pattern recognition machine without a teacher based on the method of potential functions," Automat. Remote Contr., vol. 27, pp. 1728-1737, Dec. 1966.
[6] W. D. Fisher, "On grouping for maximum homogeneity," J. Amer. Statist. Ass., vol. 53, pp. 789-798, 1958.
[7] J. A. Gengerelli, "A method for detecting subgroups in a population and specifying their membership," J. Psychol., vol. 55, pp. 457-468, 1963.
[8] E. W. Forgy, "Detecting natural clusters of individuals," presented at the 1964 Western Psychol. Ass. Meeting, Santa Monica, Calif., Sept. 1964.


[9] K. S. Fu, "Statistical pattern recognition," in Adaptive, Learning and Pattern Recognition Systems: Theory and Application, J. M. Mendel and K. S. Fu, Eds. New York: Academic Press, 1970.
[10] I. Gitman and M. D. Levine, "An algorithm for detecting unimodal fuzzy sets and its application as a clustering technique," IEEE Trans. Comput., vol. C-19, pp. 583-593, July 1970.
[11] T. Kainuma, T. Takekawa, and S. Watanabe, "Reduction of clustering problem to pattern recognition," Pattern Recognition, vol. 1, pp. 195-205, 1969.
[12] J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," Ann. Math. Statist., vol. 23, no. 3, pp. 462-466, 1952.
[13] J. MacQueen, "Some methods of classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Mathematics, Statistics, and Probability. Berkeley, Calif.: Univ. California Press, 1967, pp. 281-297.
[14] R. L. Mattson and J. E. Dammann, "A technique for detecting and coding subclasses in pattern recognition problems," IBM J. Res. Develop., vol. 9, pp. 294-302, July 1965.
[15] G. Nagy, "State of the art in pattern recognition," Proc. IEEE, vol. 56, pp. 836-862, May 1968.
[16] D. J. Rogers and T. T. Tanimoto, "A computer program for classifying plants," Science, vol. 132, pp. 115-118, Oct. 1960.
[17] D. J. Wilde, Optimum Seeking Methods. Englewood Cliffs, N.J.: Prentice-Hall, 1964.
[18] L. A. Zadeh, "Fuzzy sets," Inform. Contr., vol. 8, pp. 338-353, 1965.

Multicategory Learning Classifiers for Character Reading

MASAMICHI SHIMURA

Manuscript received June 9, 1971; revised May 30, 1972. The author is with the Faculty of Engineering Science, Osaka University, Toyonaka, Osaka, Japan.

Abstract-This paper presents properties of several different algorithms suitable for multicategory classification of hand-printed alphanumeric characters. In the character reader the input patterns are generally composed of the template characters and their distorted ones. Using the template patterns, a nonparametric procedure is developed for determining linear discriminant functions. Furthermore, we propose the mechanism which has the ability to recognize even a misprinted character by using the information of the preceding character. The algorithms offer the following advantages: flexibility (cost assignments), simplicity, adaptation, and acceptable performance. Performance of the machines is analyzed and convergence proofs of the learning procedures in the machines are derived. We also present some results of computer experiments.

I. INTRODUCTION

IN RECENT YEARS some attempts have been made to [...] a nonparametric training method. However, when any a priori information regarding input patterns is unknown or when the probability densities of them cannot be expressed mathematically, the nonparametric method must be used for obtaining discriminant functions. In this paper we discuss the three types of nonparametric linear machines used in character reading.

Most of the literature on deterministic adaptive pattern classification is restricted to the two-category problem. The multicategory problem can be readily solved by a technique developed for the two-category classification. However, there exist some difficulties in constructing the multicategory classifier because of its complexity, its long training period, etc. If we use a well-known technique, there are two approaches to obtain a nonparametric multicategory classifier. One is the Perceptron-like machine [1] which has more than two receptor units. The other is the machine in which classification is made by detecting the maximum output corresponding to each category. For the latter type Nilsson [2] has proposed training algorithms and made an analysis of the learning process. Also, Duda and Fossum [3] have given the convergence proof in a different manner. It is known, however, that such machines generally require n x a variable weighting devices in order to classify n-dimensional patterns into a categories. These machines become more complicated as the number of categories increases. For character reading the number of [...] categories for alphanumeric. Therefore, a large number of devices is needed and a rather longer training time is required.

Let us consider the case of character reading. For hand-printed characters, the number of possible variations of a [...] a template or standard pattern, the hand-printed characters can be considered to be distorted patterns of the template. In general, the problem of pattern classification or recognition can be separated into two problems. The first problem is that of abstracting significant features or characteristics from the patterns being considered [...]