
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-18, NO. 11, NOVEMBER 1969

Cluster Mapping with Experimental Computer Graphics

EDWARD A. PATRICK, MEMBER, IEEE, AND FREDERIC P. FISCHER II, STUDENT MEMBER, IEEE

Abstract-The unsupervised estimation problem has been conveniently formulated in terms of a mixture density. It has been shown that a criterion naturally arises whose maximum defines the Bayes minimum risk solution. This criterion is the expected value of the natural log of the mixture density. By making the assumptions that the component densities in the mixture are truncated Gaussian, the criterion has a greatly simplified form. This criterion can be used to resolve mixtures when the number of classes as well as the class covariances are unknown. In this paper a technique is presented where an assumed test covariance is supplied by an experimenter who uses a test function as a "portable magnifying glass" to examine data. Because the experimenter supplies the covariance and thus the test function, the technique is especially suited for interactive data analysis.

Index Terms-Clustering, computer display of mixed data, computer graphics in pattern recognition, interactive data analysis, interactive pattern recognition system, mixture density, pattern recognition, sorting data, unsupervised estimation of densities.

INTRODUCTION

THE UNSUPERVISED estimation problem is conveniently formulated in terms of mixtures and mixing parameters; let x_1, x_2, ..., x_n be l-dimensional vector samples, where x has a density function h(x | B):

    h(x | B) = Σ_{i=1}^{M} f(x | α_i) P(α_i)    (1)

    B = {α_i, P(α_i)}, i = 1, ..., M    (2)

where B is in the parameter space and h(x | B) is called a mixture. It has been shown by Patrick and Costello [12], [13] that a criterion naturally arises whose maximum defines the Bayes minimum risk solution. This criterion, denoted η(B), is the expected value of the natural logarithm of the mixture density of the observation vectors:

    η(B) = ∫ ln h(x | B) h(x | B*) dx    (3)

where B* is the true parameter point. They have also shown that it is possible to approximate the Bayes solution by using η(B) as a regression function for stochastic

Manuscript received June 9, 1969; revised July 1, 1969. This work was supported by the Rome Air Development Center under Contract F 30 602-68-C-0186. This paper was presented at the IEEE Computer Group Conference, Minneapolis, Minn., June 17-19, 1969.

The authors are with the School of Electrical Engineering, Purdue University, Lafayette, Ind.

approximation or for finding a partition of the sample space. The latter proceeds as follows: for an asymptotic minimum risk solution it is sufficient to find the parameter vector B that maximizes

    η(B) = ∫ ln [ Σ_{k=1}^{M} f(x | α_k) P(α_k) ] h(x) dx.

The sample space is partitioned into M disjoint regions, where the regions are defined by

    S_k = { x : f(x | α_k) P(α_k) > f(x | α_j) P(α_j), all j ≠ k },  k = 1, 2, ..., M.    (4)

It is assumed that over each partitioned set the class density is Gaussian having mean vector γ_k and covariance matrix Φ_k, with the density truncated at the partition boundary. The true mixture h(x) is assumed bounded. Under these assumptions η(B) can be expanded and reduced to

    η(B) = Σ_{k=1}^{M} P(α_k) ln [ P(α_k) / ( (2π)^{l/2} |Φ_k|^{1/2} ) ] - l/2.    (5)

Then maximizing η(B) is equivalent to finding the partition (4) which maximizes (5). Given the partition (4), the maximum likelihood estimate of the mean vector γ_k is the sample mean of the samples in the kth region of the partition.
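As a concrete illustration of the partition (4) and of the maximum likelihood estimate of the means, the sketch below (ours, not from the paper; the two-component parameter point, sample sizes, and variable names are purely illustrative) assigns each sample to the region S_k in which f(x | α_k) P(α_k) is largest and then takes the sample mean within each region.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)

    # Illustrative candidate parameter point B = {gamma_k, Phi_k, P(alpha_k)} with M = 2.
    gammas = [np.array([0.0, 0.0]), np.array([5.0, -5.0])]
    Phis   = [1.5 * np.eye(2), 1.5 * np.eye(2)]
    priors = [0.5, 0.5]

    # Samples x_1, ..., x_n (here drawn from the same two components, for illustration).
    x = np.vstack([rng.multivariate_normal(g, P, 100) for g, P in zip(gammas, Phis)])

    # Partition (4): assign x to S_k when f(x|alpha_k)P(alpha_k) exceeds every f(x|alpha_j)P(alpha_j).
    scores = np.column_stack([p * multivariate_normal(g, P).pdf(x)
                              for g, P, p in zip(gammas, Phis, priors)])
    region = scores.argmax(axis=1)

    # Maximum likelihood estimate of gamma_k: the sample mean of the samples in the kth region.
    gamma_hat = [x[region == k].mean(axis=0) for k in range(len(gammas))]
    print(gamma_hat)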

If the covariances {Φ_i} are known and the classes are "well separated," there is no problem in visually determining the samples belonging to a class in the two-dimensional case. For data vectors of dimensionality higher than two, it is clear that data vectors close to a particular data vector x_i belong to the same class as does x_i. In this paper we take such a "clustering" approach to the problem, which can be motivated by the criterion η(B) under the assumptions that the covariances {Φ_i} are known and the classes are separable. The approach is to define a cluster as a set of observations which "likely" originated from the same mixture component; thus a local neighborhood philosophy is utilized. Essentially, each observation x_j is associated with (mapped to) the parameters of the Gaussian mixture component dominating in a neighborhood about x_j. Hence, observations drawn from the same mixture component are mapped approximately to the same point in the parameter space. Clusters are identified by finding the subsets of the observations mapped to the same "fuzzy" point.


Whereas the quality of unsupervised estimation can be measured by a criterion η(B), the clustering approach taken in this paper does not utilize such a criterion. The advantage of not using such a criterion when clustering is simplification, as well as the fact that the technique will always work when samples are "well separated" and the covariances known. A disadvantage of the clustering approach is that it can fail to separate classes when the above assumptions are violated; then a criterion like η(B) should be used. There are many approaches which can be called

clustering, as is indicated in a literature review by Ball [1]. The least complex solution may be that of Sebestyen [2], where clusters are defined to be the set of all points within a distance T of a cluster center. Clusters also can be defined in terms of a thresholded similarity matrix [3]. Then, a cluster may be the set of all points which can be connected together with links of length less than a threshold T. The entire information of the data set may be used to define a cluster, as did Ball and Hall when they considered finding the mean distance from the samples within a group to their mean. A partial listing of papers on clustering and unsupervised estimation is [1]-[7], [8], [10], [12], [13]. Pearson [4] appears to be the earliest work, and Nagy [14] has presented a recent review of pattern recognition including the clustering problem.

THE CLUSTER MAP

Although this approach will not be taken, an intuitive way to achieve the above results using a neighborhood philosophy is as follows: suppose that we are given a neighborhood about the point x_j. For example, the neighborhood may be the set of all points inside the circle drawn through the rth nearest neighbor to x_j (see Fig. 1). Let t(x) equal one on the inside of this region, and let t(x) equal 0 elsewhere. For this particular t(x) and x_j, assume that the dth mixture component dominates all other components inside this neighborhood. More specifically, assume that

    h(x) t(x) = t(x) Σ_{i=1}^{M} P_i f(x | Φ_i, γ_i)
              ≈ t(x) P_d f(x | Φ_d, γ_d)    (6)

where f(x | Φ_d, γ_d) is the Gaussian mixture component dominant at x_j. Thus, h(x)t(x) is approximately a single, truncated, multivariate Gaussian function. The truncating function t(x) removes the other mixture components from consideration, as is shown in Fig. 2. Since h(x)t(x) is a truncated Gaussian density, a relationship between the moments of f(x)t(x) and the moments of h(x) might be determined. Using a sample density¹ ĥ(x) instead of h(x), estimates of the moments of h(x)t(x) could be used to obtain estimates {Φ_d, γ_d}. If the same procedure is carried out at each observation, observations under the same component would be

¹ ĥ(x) = (1/n) Σ_{i=1}^{n} δ(x - x_i); δ(x) is the Dirac delta.


Fig. 1. Clustering.


Fig. 2. 0-1 truncating function.

associated with approximately the same estimates in the parameter space. It would then be a simple accomplishment to identify these tight "clusters" in the parameter space. Although conceptually promising, this approach is rejected for the reason that the relations between moments of a truncated multivariate Gaussian and the moments of the Gaussian are unknown, except for the univariate case [10]. To avoid this problem, another way is used to estimate the parameters of the mixture using moment estimators.
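The intuitive 0-1 neighborhood version just described can be sketched as follows (an illustrative fragment of our own, not the authors' code; the function name and the choice of r are hypothetical). For each observation the neighborhood is the circle drawn through the rth nearest neighbor, and the local sample mean and covariance are taken over the samples inside it.

    import numpy as np

    def local_moments_01(x, j, r):
        """Local mean and covariance of the samples inside the circle drawn
        through the r-th nearest neighbor of x[j], i.e., with the 0-1
        truncating function t(x) of Fig. 2 as the weight."""
        d = np.linalg.norm(x - x[j], axis=1)
        radius = np.sort(d)[r]          # distance to the r-th nearest neighbor (d[j] itself is 0)
        inside = x[d <= radius]         # points where t(x) = 1
        return inside.mean(axis=0), np.cov(inside, rowvar=False)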

Instead of choosing a truncating function t(x) which is one inside and zero outside a neighborhood, choose a better suited function: the ubiquitous multivariate Gaussian density function centered at x_j with covariance matrix² Σ. We still may visualize a "pseudo-neighborhood" of x_j to be those points within the circle of concentration of the Gaussian function, t(x) = t(x | Σ, x_j), with mean x_j and covariance Σ. The choice of a Gaussian function leads to great mathematical simplicity. Then, using a theorem by Miller [11], on page 24,

    t(x | Σ, x_j) f(x | Φ_d, γ_d) = k f(x | R, C)

    R = (Φ_d^{-1} + Σ^{-1})^{-1}    (7a)

    C = R (Φ_d^{-1} γ_d + Σ^{-1} x_j)    (7b)

    k = (1/(2π))^{l/2} ( |R^{-1}| |Φ_d| |Σ| )^{-1/2} exp{ (1/2) [ C^t R^{-1} C - γ_d^t Φ_d^{-1} γ_d - x_j^t Σ^{-1} x_j ] }.

² This covariance matrix Σ for the test function will be supplied interactively by the operator.


In other words, a "truncated" Gaussian mixture component is also Gaussian, and its mean and covariance matrix are given above in terms of the moments of the mixture component and the truncating function t(x). (See Fig. 3.)

In particular, given the moments of the truncated function and the function t(x), the inverse relationship is easily obtained:

    Φ_d = (R^{-1} - Σ^{-1})^{-1}
    γ_d = (Φ_d Σ^{-1} + I)(C - x_j) + x_j.    (8)

Thus, by (6) and (7),

    h(x) t(x) = P_d k f(x | R, C).    (9)
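Relations (7) and (8) are standard properties of products of Gaussian densities and can be checked numerically; the sketch below (ours, with arbitrary two-dimensional parameters) forms the product t(x | Σ, x_j) f(x | Φ_d, γ_d) pointwise, compares it with k f(x | R, C), and then recovers Φ_d and γ_d from R and C through (8).

    import numpy as np
    from scipy.stats import multivariate_normal

    # Arbitrary illustrative parameters: component (gamma_d, Phi_d) and test function (x_j, Sigma).
    gamma_d = np.array([1.0, -2.0]); Phi_d = np.array([[2.0, 0.3], [0.3, 1.0]])
    x_j     = np.array([0.5,  0.0]); Sigma = np.array([[1.5, 0.0], [0.0, 1.5]])

    Phi_inv, Sigma_inv = np.linalg.inv(Phi_d), np.linalg.inv(Sigma)

    # (7a), (7b) and the normalizing constant k.
    R = np.linalg.inv(Phi_inv + Sigma_inv)
    C = R @ (Phi_inv @ gamma_d + Sigma_inv @ x_j)
    k = ((2 * np.pi) ** (-1)                       # (1/(2*pi))^(l/2) with l = 2
         * np.sqrt(np.linalg.det(R) / (np.linalg.det(Phi_d) * np.linalg.det(Sigma)))
         * np.exp(0.5 * (C @ np.linalg.inv(R) @ C
                         - gamma_d @ Phi_inv @ gamma_d
                         - x_j @ Sigma_inv @ x_j)))

    # t(x|Sigma, x_j) f(x|Phi_d, gamma_d) should equal k f(x|R, C) at any point x.
    x = np.array([0.7, -1.1])
    lhs = multivariate_normal(x_j, Sigma).pdf(x) * multivariate_normal(gamma_d, Phi_d).pdf(x)
    rhs = k * multivariate_normal(C, R).pdf(x)
    assert np.isclose(lhs, rhs)

    # (8): the inverse relations recover Phi_d and gamma_d from R and C.
    Phi_back   = np.linalg.inv(np.linalg.inv(R) - Sigma_inv)
    gamma_back = (Phi_back @ Sigma_inv + np.eye(2)) @ (C - x_j) + x_j
    assert np.allclose(Phi_back, Phi_d) and np.allclose(gamma_back, gamma_d)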

The preceding development is used in the following way: the moments of the function h(x)t(x) are estimated using the moments of the function ĥ(x)t(x), where ĥ(x) is the sample density. That is, letting

    m_0 = (1/n) Σ_{s=1}^{n} t(x_s | Σ, x_j)

    m_1 = (1/(n m_0)) Σ_{s=1}^{n} x_s t(x_s | Σ, x_j)

    m_2 = (1/(n m_0)) Σ_{s=1}^{n} (x_s - m_1)(x_s - m_1)^t t(x_s | Σ, x_j),

then

    C = m_1  and  R = m_2.    (10)

Assuming that the test function is such that relation (9) is true, the above sample moments (10) are used as estimators of the moments of the truncated Gaussian function. Thus, estimators of the parameters characterizing the dominating mixture component can be found using the inverse relations (8):

    Φ_d = (R^{-1} - Σ^{-1})^{-1},   γ_d = (Φ_d Σ^{-1} + I)(C - x_j) + x_j.    (11)

By (11), a set of parameters (γ_d, Φ_d) may be associated with each x_j, j = 1, 2, ..., n (d corresponds to the mixture component to which the sample x_j belongs). If the separability assumed by the cluster model holds, t(x) is suitably chosen, and there are a sufficient number of observations, the set {(γ_d, Φ_d)} should be well clustered in the parameter space.
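A sketch of the mapping (10)-(11) for a single observation might read as follows (illustrative code, not taken from the paper; the function and variable names are ours). The Gaussian test function supplies the weights, the weighted sample moments give C and R, and the inverse relations (11) give the component estimates.

    import numpy as np
    from scipy.stats import multivariate_normal

    def cluster_map(x, j, Sigma):
        """Map observation x[j] to estimates (gamma_d, Phi_d) of the dominating
        mixture component, per (10) and (11); x is n-by-l, Sigma is the
        operator-supplied test-function covariance."""
        t = multivariate_normal(x[j], Sigma).pdf(x)            # weights t(x_s | Sigma, x_j)
        m0 = t.mean()                                          # (1/n) sum of the weights
        m1 = (t[:, None] * x).sum(axis=0) / (len(x) * m0)      # weighted mean, estimate of C
        dev = x - m1
        m2 = (t[:, None, None] * dev[:, :, None] * dev[:, None, :]).sum(axis=0) / (len(x) * m0)  # estimate of R
        Phi_d = np.linalg.inv(np.linalg.inv(m2) - np.linalg.inv(Sigma))                 # first relation of (11)
        gamma_d = (Phi_d @ np.linalg.inv(Sigma) + np.eye(x.shape[1])) @ (m1 - x[j]) + x[j]
        return gamma_d, Phi_d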

In some special applications, the covariances Φ_i are equal and known, so that much of the calculation can be avoided. Then it is sufficient to obtain estimates of just the means of the dominant component:

    γ_d = (Φ Σ^{-1} + I)(C - x_j) + x_j    (12)

Fig. 3. A graded truncating function.

where Φ_i = Φ for all indexes i. If Σ is chosen to be an α^{-1} multiple of Φ, then

    γ_d = -α x_j + (1 + α) C.    (13)

Implementation complexity of this algorithm for the above special case grows with the product n^2 l instead of the product n^2 l^2. In either case, the full data set must be stored.
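For this special case, one pass of the simplified map (13) over all n observations might be sketched as follows (our own illustration, not the authors' code; alpha is the ratio Φ = αΣ chosen by the experimenter). Only the weighted local mean C is needed at each point, which is where the n^2 l operation count comes from.

    import numpy as np
    from scipy.stats import multivariate_normal

    def cluster_map_equal_cov(x, Sigma, alpha):
        """Simplified map (13), gamma_d = -alpha * x_j + (1 + alpha) * C,
        applied to every observation; assumes equal known class covariances
        with Sigma chosen so that Phi = alpha * Sigma."""
        y = np.empty_like(x)
        for j in range(len(x)):
            t = multivariate_normal(x[j], Sigma).pdf(x)      # weights t(x_s | Sigma, x_j)
            C = (t[:, None] * x).sum(axis=0) / t.sum()       # weighted local mean estimating C
            y[j] = -alpha * x[j] + (1 + alpha) * C
        return y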

This approach is especially suited for interactive data analysis and classification where the experimenter supplies the test function covariance matrix Σ from the keyboard. If he assumes that Σ = α^{-1} Φ, then the simplified expression (13) results. Using a computer output display and assuming a two-dimensional observation space, the experimenter observes that the point x_j maps to the point γ_d according to (13). The effect of mapping all the samples this way is to produce a tighter cluster of points. When dimensionality is greater than two, the problem remains as to how to display the clustered points on a computer output display. Solutions to this problem utilize mappings from a higher dimensional space to either one or two dimensional space such as those in [9], [15], [16].

If Φ_i = Φ is unknown, it may not be unreasonable to assume Σ = Φ, make a good initial guess at Φ, say Φ^0, and update

    Φ^n = (1/(n+1)) ( Φ^0 + n R^n )

where the term R^n is

    R^n = (1/(n m_0)) Σ_{s=1}^{n} (x_s - m_1)(x_s - m_1)^t t(x_s | Φ^{n-1}, x_j)

with

    m_0 = (1/n) Σ_{s=1}^{n} t(x_s | Φ^{n-1}, x_j)

    m_1 = (1/(n m_0)) Σ_{s=1}^{n} x_s t(x_s | Φ^{n-1}, x_j).


Although this estimate of Φ is calculated for the class associated with x_j, it can be used for all classes because of the assumption Φ_i = Φ. Furthermore, the Φ's calculated for various points x_j can be averaged to produce an estimate of Φ with lower variance.

The conclusion is that the test function approach is a useful technique in an interactive data analysis and classification system when it is used properly within the limitations of the assumptions of the clustering approach. The alternative is a rigorous technique of unsupervised estimation based on a criterion such as η(B).

Interactive analysis of data provides the experimenter with results not easily described theoretically. For instance, as shown in the experimental examples in the next section, successive application of this cluster mapping to points mapped from the observation space to the parameter space results in tight clusters in the parameter space.

EXPERIMENTAL EXAMPLE

Examples using the new clustering algorithm, programmed on the CDC 6500 computer with clusters displayed on the CDC 252 display screen, are presented below; for each example, the test function covariance matrix was determined interactively.

Example 1: The first example is chosen so that there are three well-separated clusters. Specifically, the mixture density (14) is

    h(x) = Σ_{i=1}^{3} (1/3) f(x | Φ_i, γ_i)    (14)

where f(x | Φ_i, γ_i) is the bivariate Gaussian distribution with parameters:

    Φ_i = [ 1.5   0  ]     i = 1, 2, 3
          [  0   1.5 ]

    γ_1 = (0, 0)
    γ_2 = (5, -5)
    γ_3 = (-5, 5).    (15)

Two hundred and fifty observations were independently drawn from a random vector generator designed to have the density (14). These observations are shown in Fig. 4. The clustering transformation (11) was then used to map the 250 observations to the parameter space of means and covariances. The result was that the observations became more tightly clustered in the parameter space.

The transformation was reapplied to the points in the parameter space which resulted from the transformation of points in the observation space; then the transformation was sequentially applied four times to the points in the parameter space. Fig. 5 shows the results after the fifth application of the clustering transformation. Approximately ten seconds of computation time were required to transform all the observations once.
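A rough reconstruction of Example 1 along these lines is sketched below (our own illustration, not the authors' program; the paper applied the full transformation (11) with an interactively chosen test covariance, whereas this sketch uses the simplified map (13) with guessed values of Σ and α).

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(1)

    # Mixture (14)-(15): three equally likely bivariate components with covariance 1.5 I.
    means = [np.array([0.0, 0.0]), np.array([5.0, -5.0]), np.array([-5.0, 5.0])]
    x = np.vstack([rng.multivariate_normal(m, 1.5 * np.eye(2), 84) for m in means])[:250]

    Sigma, alpha = 1.5 * np.eye(2), 1.0   # guessed test covariance and ratio (chosen interactively in the paper)
    for _ in range(5):                    # five successive applications, as in Fig. 5
        y = np.empty_like(x)
        for j in range(len(x)):
            t = multivariate_normal(x[j], Sigma).pdf(x)        # test-function weights
            C = (t[:, None] * x).sum(axis=0) / t.sum()         # weighted local mean
            y[j] = -alpha * x[j] + (1 + alpha) * C             # simplified map (13)
        x = y
    print(np.round(x[::50], 2))           # the mapped points should concentrate near the three means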


Fig. 4. Original data.

Fig. 5. After fifth mapping.

Fig. 6. Labeled original data.

Example 2: The second example is identical to Example 1 except that the three classes are not as well separated:

    Φ_i = [ 2.25    0   ]     i = 1, 2, 3
          [  0    2.25  ]

    γ_1 = (4.0, 0)
    γ_2 = (-4.0, -2.0)
    γ_3 = (4.0, -2.0).    (16)

One hundred and fifty observations were drawn randomly with density (14) with parameter values (16). Fig. 6 displays these observations. Fig. 7 indicates the results after the seventh application of the modified cluster algorithm.

Example 3: This third example is a two-class two-dimensional problem where class 1 has a relatively large variance in dimension two, while class 2 has a relatively large variance in dimension one.


Fig. 7. After seventh mapping.


Fig. 8. Original data.

Fig. 9. After fifth mapping.

Specifically, the respective covariance matrices are diagonal, with the larger variance in dimension two for Φ_1 and in dimension one for Φ_2; the respective mean vectors are

    γ_1 = (0, -2)
    γ_2 = (0, 5),

and the respective class probabilities are equal.³ A total of 100 observations were independently drawn from a random vector generator designed to have density (14) for each class. A computer output display of the 100 samples is shown in Fig. 8. A's identify observations from class 1 while B's identify observations from class 2. The mapped samples after five applications of the mapping are shown in Fig. 9.

This algorithm is part of INTERSPACE (Interactive System for Pattern Analysis, Classification, and Enhancement).

REFERENCES

[1] G. H. Ball, "Data analysis in the social sciences: what about the details," 1965 Fall Joint Computer Conf., AFIPS Proc., vol. 27, pt. 1. Washington, D.C.: Spartan, 1965, pp. 533-559.
[2] G. S. Sebestyen, "Pattern recognition by an adaptive process of sample set construction," IRE Trans. Information Theory, vol. IT-8, pp. 82-91, September 1962.
[3] R. E. Bonner, "On some clustering techniques," IBM J. Research and Develop., vol. 8, pp. 22-32, January 1964.
[4] K. Pearson, "Contributions to the mathematical theory of evolution," Philosophical Trans. Royal Society of London, series A, vol. 185, pp. 77-100, 1894; also in K. Pearson, Early Statistical Papers, reprinted for the Biometrika Trustees. London, England: Cambridge University Press, 1948, pp. 1-40.
[5] C. R. Rao, Advanced Statistical Methods in Biometric Research. New York: Wiley, 1952.
[6] D. B. Cooper and P. W. Cooper, "Nonsupervised adaptive signal detection and pattern recognition," Information and Control, vol. 7, pp. 416-444, September 1964.
[7] J. C. Hancock and E. A. Patrick, "Learning probability spaces for classification and recognition of patterns with or without supervision," School of Electrical Engineering, Purdue University, Lafayette, Ind., Tech. Rept. 65-21, November 1965.
[8] V. Hasselblad, "Estimation of parameters for a mixture of normal distributions," Technometrics, vol. 8, pp. 431-444, August 1966.
[9] E. A. Patrick, D. R. Anderson, and F. K. Bechtel, "Mapping multidimensional space to one dimension for computer output display," IEEE Trans. Computers, vol. C-17, pp. 949-953, October 1968.
[10] A. C. Cohen, "Estimating the mean and variance of normal populations from singly truncated and doubly truncated samples," Ann. Math. Stat., vol. 21, pp. 557-569, 1950.
[11] K. S. Miller, Multidimensional Gaussian Distributions. New York: Wiley, 1964.
[12] E. A. Patrick and J. P. Costello, "On unsupervised estimation algorithms," Proc. 1969 IEEE Internatl. Symp. on Information Theory, January 1969; also, School of Electrical Engineering, Purdue University, Lafayette, Ind., Tech. Rept. 69-18, June 1969.
[13] E. A. Patrick and J. P. Costello, "On some approaches to unsupervised estimation," School of Electrical Engineering, Purdue University, Lafayette, Ind., Tech. Rept. 68-7, August 1968.
[14] G. Nagy, "State of the art in pattern recognition," Proc. IEEE, vol. 56, pp. 836-862, May 1968.
[15] J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis," IEEE Trans. Computers, vol. C-18, pp. 401-409, May 1969.
[16] R. N. Shepard and J. D. Carroll, "Parametric representation of nonlinear data structures," in Multivariate Analysis, P. R. Krishnaiah, Ed. New York: Academic Press, 1966.

³ The two mean vectors and covariance matrices are unknown and estimated according to (12).


