BAYES MODIFICATION OF SOME CLUSTERING CRITERIA
By
M. J. Symons
Department of Biostatistics, The University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 880
AUGUST 1973
Revised OCTOBER 1977
Bayes Modification of Some Clustering Criteria
M. J. Symons
Department of Biostatistics, University of North Carolina
Chapel Hill, N. C. 27514, USA
Summary
Clustering criteria, like the trace and determinant of the within groups sum of squares matrix, can be related to models involving multivariate normal distributions. Inherent in these criteria is a tendency to split a data set into approximately equal sized groups regardless of the underlying proportion of mix, unless the groups are widely separated. In this paper we derive criteria using a Bayesian approach which are more sensitive to disparate group sizes. These are extensions of standard criteria which include the number of units assigned to each sub-group, the number of sub-groups and the number of measurements on each unit, as part of the criterion to be optimized. In addition, a new criterion is deduced for the situation where the covariance matrices of the clusters are not homogeneous. An example is included.
A Bayes criterion is also compared with the maximum likelihood approach
of Wolfe (1967, 1969) and Day (1969). The results suggest that the solution
using the Bayes criterion will provide useful initial parameter estimates for
maximum likelihood routines with mixtures of multivariate normals.
1. INTRODUCTION
Numerous techniques are available for partitioning a set of observations into homogeneous sub-groups. For example, the method of cluster analysis discussed by Edwards and Cavalli-Sforza (1965) measures homogeneity by minimizing the trace of the within groups sum of squares, using an algorithm that imposes a hierarchical structure on the sub-groups. In such cases the criterion and algorithm together implicitly determine the definition of a cluster. Rao (1952) states that a cluster has a very vague meaning and perhaps the Kendall and Buckland
(1958) definition as "a group of contiguous elements of a statistical population"
is specific enough for most purposes. An alternate approach is to first define a cluster and then deduce the criterion which measures contiguousness or similarity in a cluster. A subsequent step would be the devising of an algorithm to determine clusters so defined. This seems preferable to the uncertain interpretation of clusters implicitly defined by a criterion-algorithm combination.
The model discussed here is for a clustering situation where the sampled population is thought to be composed of several distinct sub-populations and the intent is to group together all those observations belonging to the same sub-population. As statistical structure, we suppose that each observation arises from one of a small number of different distributions, one modeling each sub-population. Hence the model for the population is a mixture, the component distributions modeling the sub-populations. Various choices of distributions would lead to different criteria. Here they are taken as multivariate normals and this is the same model as discussed by Wolfe (1967, 1969) and Day (1969).
However, their approach is from a traditional point of view. Geisser (1966) uses the same model and a Bayesian approach, but assumes that the mixing weight for each component is known. This model is also a special case of the model in Scott and Symons (1971b); however, a Bayesian point of view is stressed here. A discussion of the approach in the present paper and that of Wolfe and Day is presented in Section 4, along with a comparison of results from the two approaches using the Fisher Iris data.
A cluster is defined as those observations coming from the same component distribution in the mixture. The problem is formulated as one of predicting the component origin of each observation. Scott and Symons (1971b) approached this problem using maximum likelihood methods and found that the solution is equivalent to some standard clustering methods, depending on the assumptions about the covariance matrices of the multivariate normals in the mixture and assuming that the observations are equally likely from any component. This suggested that the standard clustering methods would perform best when the sub-groups are represented in about the same proportions. The approach here is closely related to the joint normal classification of Geisser (1966), but with prior probabilities of an observation coming from any component being unknown. Also, to conform with the clustering setting presented by Scott and Symons (1971b), sample data from the components are presumed unavailable.
Two of the criteria derived are shown to be modifications of standard clustering criteria. The modifications primarily involve the number of units assigned to each sub-group, the number of sub-groups, and the number of measurements on each unit. The Bayes criteria are shown to be more sensitive to disparate group sizes than the corresponding standard criteria. When the covariance matrices of the component normals may not be presumed homogeneous, a new criterion is deduced using the Bayes approach.
2. Bayes Approach
Let the sample consist of n independent observations Y = (y_1, ..., y_n), with y_i representing p measurements on the i-th unit. No provision for previous samples from any of the components is made, since little or no prior knowledge is available in most applications where clustering techniques are used. However, in a more general discussion of classification such a provision is reasonable and easily incorporated; see for example Scott and Symons (1971b) and especially Geisser (1966). The sampled population is modeled by a mixture of G p-variate normal distributions with means μ_1, ..., μ_G and covariance matrices Σ_1, ..., Σ_G. Each observation may arise from the g-th constituent normal with probability π_g, where 0 ≤ π_g ≤ 1 for g = 1, ..., G and the π_g sum to one.
All the observations from the same component in the mixture model are considered as a cluster. A sample observation y_i from a mixture model can be thought of as resulting from two random steps. First there is a multinomial trial, specifically, the realization of an indicator random variable taking the value g with probability π_g; second, a draw from the g-th component normal. It is the outcome of this unobservable multinomial trial for each observation y_i, call it z_i, that we would like to predict.
The device of introducing notation for the missing information on class membership of an observation is not new. Hartley and Rao (1968) use a "decision parameter" which takes on the value of unity if the g-th group is the one selected for the observation and zero if not. With this parameter introduced, and in an ANOVA setting, they formulate the problem of classification of an observation of uncertain class origin, or ANOVA cell, and the estimation of the usual parameters of an ANOVA as one single estimation problem. They presume "pilot samples," i.e., observations of known class origin, as part of the available data. The class origin is presumed to be deterministic and, quite appropriately in their ANOVA setting, there is no consideration of any random structure for determining the class origin of an observation, such as a mixture of the possible ANOVA cell origins for each observation. Their main emphasis is on the estimation of the ANOVA model parameters when the class origin is uncertain for some observations.
The likelihood of Y is completely determined by the parameters θ = (π_1, ..., π_G, μ_1, ..., μ_G, Σ_1, ..., Σ_G) and Z = (z_1, ..., z_n). Note that 1 ≤ z_i ≤ G and z_i = g with probability π_g. The likelihood of the data Y is given by

$$L(Y \mid \theta, Z) = \prod_{g=1}^{G} \pi_g^{n_g} \, |\Sigma_g|^{-n_g/2} \exp\Big\{ -\tfrac{1}{2} \sum_{g=1}^{G} \sum_{i \in C_g} (y_i - \mu_g)' \Sigma_g^{-1} (y_i - \mu_g) \Big\}, \qquad (1)$$

where C_g is the collection of y_i's with z_i = g, and n_g is the number of observations in C_g. Note that this likelihood is conditional on a specific value of Z and that there are G^n possible allocations of the n observations to the G components.
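For concreteness, the classification likelihood (1) can be evaluated directly for a candidate allocation. The following is a minimal numpy sketch (my own illustration, not code from the paper; the function name and argument shapes are assumptions), with the (2π) constant restored for a proper log-density:

```python
import numpy as np

def class_log_lik(Y, z, pi, mu, Sigma):
    """Log of the classification likelihood (1): each y_i is evaluated under
    the component named by its label z_i, weighted by pi_g.
    Shapes: Y (n, p); z (n,) with labels 0..G-1; pi (G,); mu (G, p);
    Sigma (G, p, p)."""
    n, p = Y.shape
    ll = 0.0
    for g in range(len(pi)):
        Yg = Y[z == g]
        ng = len(Yg)
        if ng == 0:
            continue  # the mixture model allows empty components
        Sinv = np.linalg.inv(Sigma[g])
        _, logdet = np.linalg.slogdet(Sigma[g])
        dev = Yg - mu[g]
        quad = np.einsum('ij,jk,ik->', dev, Sinv, dev)  # sum of Mahalanobis terms
        ll += ng * np.log(pi[g]) - 0.5 * ng * (logdet + p * np.log(2 * np.pi)) \
              - 0.5 * quad
    return ll
```

Evaluating this for each of the G^n allocations is what the conditioning on Z means in practice.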
The clustering problem then is to find the optimum assignment of the y_i to the components and hence to the clusters C_1, ..., C_G. A Bayesian approach is taken to jointly predicting the most likely outcome z_i of the unobservable multinomial trial in the generation of each of the n observations. This requires the specification of a prior distribution for θ, p(θ), and the combination of this prior with the likelihood (1). The parameters θ are averaged over since they are not of central interest. This is in contrast to the emphasis of Hartley and Rao (1968), where the parameters of the ANOVA model are of primary interest. Here the predictive distribution of Y is of primary importance, i.e., the joint likelihood of y_i coming from the component identified by z_i, i = 1, ..., n. This can be written as

$$L(Y \mid Z) = \int L(Y \mid \theta, Z) \, p(\theta) \, d\theta, \qquad (2)$$

where the integration is over the parameter space of θ. Alternatively, the predictive distribution, f(Y|Z), is the normalized L(Y|Z) in (2). Following Geisser (1966), this could be referred to as the joint prior, rather than posterior, predictive probability of y_i coming from component z_i for i = 1, ..., n, since samples from the components of the mixture are not available in this clustering situation.
The mode of L(Y|Z) is taken as the Bayes estimate of the cluster solution. There are predictable difficulties in determining this mode, since the G^n possible values for Z are not ordered in any simple way. Local extrema are frequently encountered in applications. The explicit form of (2) and a discussion of finding the mode of (2) are presented in the next sub-sections for three cases considered to have the most practical importance.
2.1 Covariance Matrices Equal and Known
If Σ_g = Σ for g = 1, ..., G and Σ is known, the prior

$$p(\theta) = p(\mu_1, \ldots, \mu_G) \, p(\pi_1, \ldots, \pi_G) \propto \Big[ \prod_{g=1}^{G} \pi_g \Big]^{-1} \qquad (3)$$
is selected. This is the product of two independent priors of ignorance, one for the means and the other for the mixing proportions. The choice is consistent with a general philosophy in cluster analysis that the data are to determine the clusters. The priors chosen only delineate the bounds of the parameter space; see Jeffreys (1961) or Lindley (1965) for a further discussion.
The integral of the product of (1) and (3) over the parameter space can be accomplished in two steps. The likelihood factors into a portion which involves only the mixing proportions and the cluster sizes, and the remainder, composed of the determinant of the common and known covariance matrix and the exponential portion. The integral over the mixing parameters amounts to the normalization of a G − 1 variate Dirichlet distribution (see Johnson and Kotz (1972), p. 233) and yields the terms Γ(n_g) = (n_g − 1)! that depend on the cluster sizes. Constants like n = Σ n_g are not included. The integral over the mean vectors is the product of the normalization of G p-variate normals, all with the same known covariance matrix. It contributes factors of n_g^{−p/2} and an exponential term of −½ tr(W Σ^{−1}). Similar details of the required matrix manipulations are given in Anderson (1958), p. 46. The result, L(Y|Z), is then proportional to

$$\exp\big\{ -\tfrac{1}{2} \operatorname{tr}(W \Sigma^{-1}) \big\} \prod_{g=1}^{G} \Gamma(n_g) \, n_g^{-p/2}, \qquad (4)$$

where

$$W = \sum_{g=1}^{G} \sum_{i \in C_g} (y_i - \bar{y}_g)(y_i - \bar{y}_g)' \qquad (5)$$

and ȳ_g is the mean of those y_i allocated to the g-th group and W is the within groups sum of squares matrix.
The Bayes prediction of Z, denoted Ẑ, is that allocation of each y_i, i = 1, ..., n, to one of the G components in the mixture that maximizes (4). Equivalently, the grouping minimizes the negative of two times the natural logarithm of (4), or

$$\operatorname{tr}(W) + \sum_{g=1}^{G} \big\{ p \ln(n_g) - 2 \ln \Gamma(n_g) \big\}, \qquad (6)$$

with a pre-analysis rotation to remove correlations and standardize the variances of the observations so that Σ = I, the p × p identity matrix. For the situation where Σ is known, this is easily accomplished by the transformation x = Ty, where T is the square root matrix or the transpose of the orthogonal matrix in the spectral decomposition of the covariance matrix Σ. Edwards and Cavalli-Sforza (1965) proposed the trace of the within sum of squares matrix, tr(W), as a clustering criterion.
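The whitening transformation and criterion (6) can both be sketched in a few lines. This is an illustration of my own (function names are assumptions), using the stdlib `lgamma` for the ln Γ(n_g) terms:

```python
import numpy as np
from math import lgamma, log

def whiten(Y, Sigma):
    """Pre-analysis transformation x = T y, with T the inverse symmetric
    square root of the known covariance, so the transformed data have
    identity covariance."""
    vals, vecs = np.linalg.eigh(Sigma)
    T = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return Y @ T.T

def bayes_trace_criterion(Y, z, G):
    """Criterion (6): tr(W) + sum_g [ p ln(n_g) - 2 ln Gamma(n_g) ],
    assuming the data have already been whitened (Sigma = I)."""
    p = Y.shape[1]
    crit = 0.0
    for g in range(G):
        Yg = Y[z == g]
        dev = Yg - Yg.mean(axis=0)  # deviations from the group mean
        crit += (dev ** 2).sum() + p * log(len(Yg)) - 2 * lgamma(len(Yg))
    return crit
```

The two penalty terms per group are exactly the modification that distinguishes (6) from the plain tr(W) criterion.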
The Bayes formulation (6) appends to tr(W) terms involving the number of observations allocated to each component, the number of components and the number of variates. The overall effect of these terms is to favor unequal splits rather than equal ones. This can be seen in Tables 1 and 2, where the total contribution of the Bayes modification is partitioned into portions coming from the averaging over the mixing parameters and the averaging over the means. The contribution of the terms from the mixing parameters dominates that of the terms from the means, except where the number of variables and number of groups becomes large relative to the total sample size. This is evident in Table 2 for four groups and partitions of 20 observations with four and eight variables. Also, with fixed total sample size and fixed number of groups, as more variates are added less importance is given to unequal splits, i.e., the value of the component from averaging over the mixing proportions becomes less negative. This provides protection against determining spurious clusters of unbalanced size due to variation in irrelevant additional measurements. Analogous statements can be made for a fixed total sample size and a fixed number of variates as the number of groups increases. Also, holding the number of groups and the number of variables both fixed, the magnitude of the contribution of the Bayes modification increases with the total sample size, providing additional sensitivity to disparate group sizes. The trace of the within groups sum of squares, of course, enters the final determination of the clusters and could be the deciding factor for clearly separated groups. In summary, the Bayes modification adds the potential for increased sensitivity to disparate group sizes.
The Bayes modification of the trace of the within groups sum of squares will be optimized by an allocation of the n observations to the G groups which balances the preference for compact, equal sized groups by the trace portion and the net tendency toward unequal splits by the terms involving the cluster sizes, number of clusters and number of variables. As noted above, less weight is given to unequal sized splits when the number of variables and number of groups is large relative to the total sample size. In such situations one could expect equal sized cluster solutions, unless the groups are clearly separated, and then the trace of the within groups sum of squares alone could be decisive in determining a sound cluster solution. (Insert Tables 1 and 2)
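The decomposition of the Bayes modification into its two sources can be tabulated directly. The sketch below is an illustration of my own construction (not a reproduction of Tables 1 and 2): for G = 2 it computes the mixing-proportion term −2 Σ ln Γ(n_g) and the means term p Σ ln(n_g) for every two-group split of n = 20 with p = 4 variates, showing that the total is most negative, i.e., most favored, at the most unequal splits:

```python
from math import lgamma, log

def modification_terms(n1, n2, p):
    """The two sources of the Bayes modification for G = 2: averaging over
    the mixing proportions contributes -2 * sum_g ln Gamma(n_g); averaging
    over the means contributes p * sum_g ln(n_g)."""
    mixing = -2.0 * (lgamma(n1) + lgamma(n2))
    means = p * (log(n1) + log(n2))
    return mixing, means

# Total modification for each two-group split of n = 20 with p = 4 variates.
totals = {}
for n1 in range(1, 11):
    mixing, means = modification_terms(n1, 20 - n1, 4)
    totals[(n1, 20 - n1)] = mixing + means
```

Since the criterion is minimized, the more negative totals at splits like (1, 19) show the pull toward unequal group sizes described above.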
The Bayes criterion with known mixing parameters would include the same terms from averaging over the means and the term −2 Σ n_g ln(π_g) rather than −2 Σ ln Γ(n_g), each sum being over g = 1, ..., G, since no averaging over the uncertainty of the mixing parameters would be required when they are known. Note that if each group is equally likely as the origin of each observation, i.e., π_g = G^{−1} for g = 1, 2, ..., G, then the sum over the G groups of this new term is a constant whatever the cluster sizes. Hence the criterion

$$\operatorname{tr}(W) + p \sum_{g=1}^{G} \ln(n_g) \qquad (7)$$

is the Bayes analogue of the tr(W) criterion proposed by Edwards and Cavalli-Sforza (1965). Note that the Bayes criterion (7) makes explicit use of the assumption that each of the G components is equally likely as the origin of any one of the n observations.
The practical aspects of finding the allocation which minimizes (6), (7), or the trace criterion alone require search routines such as the one compiled by McRae (1971). After saving the best of several randomly determined initial allocations of the y_i to the G groups, his routine produces a relative minimum for a chosen criterion, in the sense that any reassignment of one of the observations to a different group results in a larger value of the criterion, but does not guarantee an absolute minimum. Although the mixture model accommodates the possibility of no observation coming from any one group, generally for cluster analysis applications the search over the G^n values of Z is restricted to ensure that at least one observation is allocated to each group. Note that (6) is unbounded for any n_g = 0, and such an allocation clearly would not correspond to a partition of the observations minimizing (6) or (7).
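McRae's routine itself is not reproduced here, but the single-reassignment descent it performs can be sketched as follows. This is a hypothetical simplification of my own (random restarts are omitted, and criterion (6) is hard-coded for concreteness); like the original, it stops at a relative, not guaranteed absolute, minimum and never empties a group:

```python
import numpy as np
from math import lgamma, log

def bayes_trace(Y, z, G):
    # Criterion (6), assuming pre-whitened data (Sigma = I).
    p = Y.shape[1]
    val = 0.0
    for g in range(G):
        Yg = Y[z == g]
        dev = Yg - Yg.mean(axis=0)
        val += (dev ** 2).sum() + p * log(len(Yg)) - 2 * lgamma(len(Yg))
    return val

def local_search(Y, z0, G, criterion):
    """Move one observation at a time to the group that most lowers the
    criterion; repeat until no single reassignment improves it."""
    z = z0.copy()
    improved = True
    while improved:
        improved = False
        for i in range(len(z)):
            cur = z[i]
            if np.sum(z == cur) == 1:
                continue  # reassignment would leave a group empty
            best_g, best_val = cur, criterion(Y, z, G)
            for g in range(G):
                if g == cur:
                    continue
                z[i] = g  # try observation i in group g
                val = criterion(Y, z, G)
                if val < best_val - 1e-12:
                    best_g, best_val = g, val
            z[i] = best_g
            if best_g != cur:
                improved = True
    return z
```

Starting the descent from several random allocations and keeping the best result approximates the strategy described in the text.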
2.2 Covariance Matrices Equal and Unknown
When Σ_g = Σ for g = 1, ..., G and Σ is unknown, the vague prior

$$p(\theta) \propto |\Sigma|^{-\frac{1}{2}(p+1)} \Big[ \prod_{g=1}^{G} \pi_g \Big]^{-1} \qquad (8)$$

is used to delineate the bounds of the parameter space. The prior on Σ, |Σ|^{−½(p+1)}, is a generalization of the 1/σ² prior of ignorance for the univariate normal variance parameter discussed by Jeffreys (1961). It is of the general form used by Geisser and Cornfield (1963) and is invariant under power transformations.
After forming the product of the likelihood (1) and prior (8), an integration over the parameter sub-space of Σ is required in addition to that described in Section 2.1. This involves the normalization of a Wishart distribution, the result of which can be obtained by inspection from Anderson (1958), p. 154. This contributes a factor containing the determinant of the within groups sum of squares and some constants. The marginal likelihood of the data given an allocation Z, L(Y|Z), then is proportional to

$$|W|^{-\frac{1}{2}(n-G)} \prod_{g=1}^{G} \Gamma(n_g) \, n_g^{-\frac{1}{2}p}. \qquad (9)$$
The Bayes partition Ẑ of the data into G groups maximizes (9), or equivalently minimizes

$$(n-G)\ln|W| + \sum_{g=1}^{G} \big\{ p \ln(n_g) - 2 \ln \Gamma(n_g) \big\}. \qquad (10)$$

The Bayes criterion (10) is more sensitive to disparate group sizes.
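As with (6), criterion (10) is straightforward to evaluate for a candidate allocation. A minimal sketch of my own (the function name is an assumption), pooling the group scatter matrices into W before taking the log-determinant:

```python
import numpy as np
from math import lgamma, log

def bayes_det_criterion(Y, z, G):
    """Criterion (10): (n - G) ln|W| + sum_g [ p ln(n_g) - 2 ln Gamma(n_g) ],
    for a common but unknown covariance matrix."""
    n, p = Y.shape
    W = np.zeros((p, p))
    penalty = 0.0
    for g in range(G):
        Yg = Y[z == g]
        dev = Yg - Yg.mean(axis=0)
        W += dev.T @ dev  # pooled within groups sum of squares
        penalty += p * log(len(Yg)) - 2 * lgamma(len(Yg))
    _, logdetW = np.linalg.slogdet(W)
    return (n - G) * logdetW + penalty
```

Dropping the penalty terms recovers (a monotone function of) the standard |W| criterion, which is the comparison made in Section 3.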
2.3 Covariance Matrices Unequal and Unknown

In this situation where the covariance structure may differ from group to group, the prior chosen is

$$p(\theta) \propto \Big[ \prod_{g=1}^{G} \pi_g \Big]^{-1} \prod_{g=1}^{G} |\Sigma_g|^{-\frac{1}{2}(p+1)}. \qquad (11)$$

As with the prior (8), the parameter space is delineated with only vague a priori information, in both the spirit of Jeffreys (1961) and of cluster analysis. The priors on the means and covariances for each group are independent of one another and each is presumed independent of the prior on the mixing proportions.
The averaging of the nuisance parameters from the product of the likelihood (1) and prior (11) involves no integrations different from those discussed in the previous case. However, some additional terms, which were constants with the integration over the common unknown Σ, appear due to the normalization of a Wishart distribution for each covariance matrix Σ_g, g = 1, ..., G. With unequal and unknown covariance matrices, the Bayes partition of the data into G groups maximizes

$$L(Y \mid Z) \propto \prod_{g=1}^{G} |W_g|^{-\frac{1}{2}(n_g-1)} \, n_g^{-\frac{1}{2}p} \, 2^{\frac{1}{2}p(n_g+p)} \, \Gamma(n_g) \prod_{i=1}^{p} \Gamma\big(\tfrac{1}{2}\{n_g+p+1-i\}\big), \qquad (12)$$

where

$$W_g = \sum_{i \in C_g} (y_i - \bar{y}_g)(y_i - \bar{y}_g)'. \qquad (13)$$

Equivalently, Ẑ minimizes

$$\sum_{g=1}^{G} \Big\{ (n_g-1)\ln|W_g| + p\ln(n_g) - p(n_g+p)\ln 2 - 2\Big[ \ln\Gamma(n_g) + \sum_{i=1}^{p} \ln\Gamma\big(\tfrac{1}{2}\{n_g+p+1-i\}\big) \Big] \Big\}. \qquad (14)$$
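Criterion (14) can likewise be evaluated directly for a candidate allocation. The sketch below is my own illustration of the formula (the function name is an assumption); the restriction n_g ≥ p + 1 discussed next is left to the caller:

```python
import numpy as np
from math import lgamma, log

def bayes_hetero_criterion(Y, z, G):
    """Criterion (14) for unequal, unknown covariances: the sum over groups of
    (n_g - 1) ln|W_g| + p ln(n_g) - p(n_g + p) ln 2
    - 2 [ ln Gamma(n_g) + sum_{i=1}^p ln Gamma((n_g + p + 1 - i)/2) ].
    Each group must have n_g >= p + 1 so that W_g is nonsingular."""
    p = Y.shape[1]
    total = 0.0
    for g in range(G):
        Yg = Y[z == g]
        ng = len(Yg)
        dev = Yg - Yg.mean(axis=0)
        _, logdetWg = np.linalg.slogdet(dev.T @ dev)  # per-group scatter W_g
        mvgamma = sum(lgamma(0.5 * (ng + p + 1 - i)) for i in range(1, p + 1))
        total += ((ng - 1) * logdetWg + p * log(ng)
                  - p * (ng + p) * log(2.0) - 2.0 * (lgamma(ng) + mvgamma))
    return total
```

The product of gamma functions is the per-group Wishart normalization that was constant under the common-Σ integration of Section 2.2.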
Situations involving unequal covariance matrices pose tremendous challenges to clustering algorithms. One need only visualize two cigar-shaped clusters with common location and crossed orientation of the covariance matrices. Chernoff (1970) comments briefly on this case, but the inherent difficulties are very formidable unless the clusters are well separated. A traditional analogue to the Bayes criterion (12) could be the product of the determinants of the individual within group sum of squares matrices.

The search over the G^n possible partitions of the n observations into G groups needs further comment for this case. In addition to the remarks following criteria (6) and (10), the search for Ẑ must be further restricted so that n_g ≥ p + 1, so that W_g will not be singular. The search could be extended to include those partitions with 1 ≤ n_g ≤ p if one used W, the pooled within groups sum of squares, to approximate W_g. One might replace W_g by n_g W/n or (n_g − 1) W/(n − G). This imposes an average covariance structure on the clusters with n_g ≤ p, but allows an expanded search for the optimum allocation of the n observations to include all partitions with the number allocated to each group being at least one, as with the two previous cases.
2.4 Discussion of Approaches

In view of the statistical structure that is presumed to derive these criteria, they may be viewed as instruments of "simultaneous discrimination" or "joint normal classification" as used by Geisser (1966). The traditional clustering criteria usually involve little distributional structure, but do attempt to optimally allocate each of the n observations to one of several groups. The criteria presented here are optimized by a particular allocation of each of the observations, the parameters of the presumed distributions being of secondary importance. The attention is focused on the predictive distribution of Y|Z, where Z = (z_1, ..., z_n) and z_i is the result of an unobservable multinomial random variable that specifies the component origin of the observation y_i.
3. Numerical Illustration of the Sensitivity of the Bayes Criterion (10) with Disparate Group Sizes

The determinant of the within groups sum of squares matrix, |W|, and the Bayes criterion (10) derived in Section 2.2 were compared using various portions of the Fisher Iris data. (See Kendall and Stuart [1966], p. 318.) Briefly, these data are 50 observations on each of three types of Iris: Setosa, Versicolor and Virginica. Each observation is composed of four measurements on the flower: sepal length, sepal width, petal length, and petal width. The Bayes criterion (10) and |W| are each options in an approximate routine constructed by McRae (1971) with slight modifications.
As an initial investigation of the performance of the Bayes modification of |W|, five balanced data sets were constructed involving the Versicolor and Setosa plants and five balanced sets involving the Versicolor and Virginica plants. The performance on the balanced sets is discussed more fully after the presentation of the analysis results for the unbalanced data sets.

[Insert Table 3]
A total of ten unbalanced data sets were constructed to test the ability of the Bayes modification of |W| to separate the disparate sized clusters. Five unbalanced data sets were created by splitting the Versicolor plants into five groups of 10 and combining each sub-group with all 50 Virginica plants. Five more unbalanced sets were produced in the same way by splitting the Virginica plants into five groups of 10 and combining each sub-group with all 50 Versicolor plants.

[Insert Table 4]

The results are summarized in Table 4. Except for the two data sets discussed below, the increased sensitivity of the Bayes criterion (10) as compared with
the standard |W| criterion is clearly demonstrated. Since the Versicolor and Virginica overlap slightly, the |W| criterion tends to split the combined 10 plants of one type and the 50 of the other type into two equal sized groups. This is to be expected on theoretical grounds, since |W| is essentially the Bayes criterion for the situation of Section 2.2 with the added assumption that the mixing parameters are known a priori, each to be G^{−1}; compare (10) with (7) and the related discussion of the trace criterion, from which the appropriate Bayes modification can be seen. That is, the Bayes interpretation of |W| is essentially that each observation is equally likely from either of the two groups. Allowing the mixing parameters to be unknown provides the terms involving the group sizes, in particular the Γ(n_g) factors, as part of the criterion to be optimized, and hence the sensitivity to unequal group sizes.
The results in Table 4 with the Bayes criterion for two data sets, the 3rd 10 Versicolor with 50 Virginica and the 4th 10 Versicolor with 50 Virginica, may suggest that the tendency toward unequal group sizes is too strong. The criterion in each case produces groups of about 10 and 50 observations. However, the smaller group of observations corresponds to Virginica plants rather than the Versicolor ones, as can be seen from the number of mis-classifications for these two unbalanced data sets. A glance at Figure 1 of the 3rd 10 Versicolor and 50 Virginica plants reveals a group of about 10 Virginica measurements about as equally prominent as the 10 Versicolor ones. It is this group of 10 or 11 Virginica observations that the Bayes criterion groups as one cluster. A search around the allocation of the 10 Versicolor (3rd or 4th set) to one group and the 50 Virginica to the other group by the Bayes criterion (10) yields another local minimum. But as can be seen by comparing the criterion values for the "solution start" row with the minimum in the "random start" row, the separation of the 10 or so Virginica plants is preferred to that which splits off the 10 Versicolor plants.

[Insert Figure 1]
This feature of the Virginica measurements brings to the front the fundamental question of judging the significance of cluster solutions. Is the separation of the clusters statistically significant or is it a random feature of the data set at hand? Clearly the upper 10 or 11 Virginica are competing with the 10 Versicolor as the smaller group to be split off. As the 10 Versicolor are moved away from the 50 Virginica, there is a distance (a separation in the means for the two groups of about two standard deviations in each of the four variables works) for which the Bayes criterion (10) correctly groups the 10 Versicolor. At this distance |W| still divides the data set into two equal groups, but as the separation is increased more, |W| also makes the correct division. When is the separation statistically significant? Little is available on this question; see for example Engleman and Hartigan (1969) and more recently Lee (1977). The problems of judging the significance between one and more groups, or more generally between G = G_0 and G = G_1, are difficult. Wolfe (1971) also addresses this problem.
4. Comparison with the Wolfe-Day Approach
Wolfe (1967, 1969) and Day (1969) approach the problem of estimating the mixture component origin of the observations y_i in a rather indirect manner. Estimation of θ is addressed first. The likelihood maximized by Wolfe and Day can be written as

$$L(Y \mid \theta) = \prod_{i=1}^{n} \Big[ \sum_{g=1}^{G} \pi_g \, N_p(y_i \mid \mu_g, \Sigma_g) \Big] = \sum_{Z} L(Y \mid \theta, Z), \qquad (15)$$

a marginal of (1), since the summation of L(Y|θ,Z) is over all G^n allocations of the n y_i to the G components. The notation N_p(y_i|μ_g, Σ_g) denotes the p-variate normal density evaluated at y_i given μ_g and Σ_g. By maximizing (15) the maximum likelihood (ML) estimates of the π_g's, μ_g's, and Σ_g's are obtained. Given these estimates of the mixture parameters, the approach then replaces the parameters by their ML estimates and assigns each observation to the mixture component with the largest estimated density height, scaled by its estimated proportion of representation in the mixture. That is, for g = 1, ..., G, π_g is replaced by π̂_g, μ_g by μ̂_g, and Σ_g by Σ̂_g. Then for i = 1, ..., n, y_i is assigned to the component for which π̂_g N_p(y_i|μ̂_g, Σ̂_g) is the largest, g = 1, ..., G. Notice that this procedure does not include the variability in the ML estimates of the parameters. However, the Bayes solution averages the likelihood over these nuisance parameters, thereby incorporating their variability into the estimation of the optimum allocation of the observations.
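Wolfe's program is not available here, but the workflow it implements, maximizing (15) and then assigning by largest weighted density height, can be sketched with a small EM loop for the homogeneous-covariance case, started from a hard allocation such as a Bayes criterion (10) solution. This is an illustrative reconstruction of my own (the function name and iteration count are assumptions), not Wolfe's actual routine:

```python
import numpy as np

def em_mixture(Y, z_init, G, n_iter=100):
    """EM for the homogeneous-covariance normal mixture likelihood (15),
    initialized from a hard allocation. Returns the estimated weights,
    means, common covariance, and the assignment of each observation to
    the component with the largest weighted density height."""
    n, p = Y.shape
    # Initial parameter estimates from the hard allocation.
    pi = np.array([np.mean(z_init == g) for g in range(G)])
    mu = np.array([Y[z_init == g].mean(axis=0) for g in range(G)])
    Sigma = sum((Y[z_init == g] - mu[g]).T @ (Y[z_init == g] - mu[g])
                for g in range(G)) / n
    for _ in range(n_iter):
        # E-step: log of pi_g * N_p(y_i | mu_g, Sigma), up to a constant.
        Sinv = np.linalg.inv(Sigma)
        _, logdet = np.linalg.slogdet(Sigma)
        logr = np.empty((n, G))
        for g in range(G):
            dev = Y - mu[g]
            quad = np.einsum('ij,jk,ik->i', dev, Sinv, dev)
            logr[:, g] = np.log(pi[g]) - 0.5 * (logdet + quad)
        r = np.exp(logr - logr.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)  # responsibilities
        # M-step: weighted proportions, means, and pooled covariance.
        ng = r.sum(axis=0)
        pi = ng / n
        mu = (r.T @ Y) / ng[:, None]
        Sigma = sum((r[:, g, None] * (Y - mu[g])).T @ (Y - mu[g])
                    for g in range(G)) / n
    return pi, mu, Sigma, logr.argmax(axis=1)
```

Passing a clustering solution as `z_init` is the initialization strategy suggested by the comparison in this section.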
[Insert Table 5]

The performance of the Wolfe-Day approach is illustrated in Table 5. Using a program supplied by Wolfe, the same ten unbalanced data sets presented in Table 4 and in Section 3 were analyzed. The mixture of multivariate normals was assumed to be homogeneous in the covariance matrices, i.e., corresponding to the situation in Section 2.2. Two sets of results were obtained with Wolfe's program: one using the initial estimates for the parameters (π_g's, μ_g's, and Σ) provided by the program, and another using the best Bayes criterion (10) solution to estimate the proportions, means, and common covariance matrix. Comparing
first the Bayes criterion solutions with the solutions provided entirely by Wolfe's program, the Bayes criterion does quite well. Wolfe's solutions also have a tendency toward equal sized clusters, as did the |W| criterion. The solutions by Wolfe's program with initial estimates of the parameters based on the best solution with the Bayes criterion were very much like the Bayes criterion (10) solutions. This is not surprising since, as Day (1969) points out, relative maxima are frequently encountered. However, the fact that the likelihood for these solutions was larger in all cases is noteworthy. We also note that in two cases the increase in the likelihood was at the expense of a few more iterations, but in about one-half of the cases the increased likelihood was complemented by a marked reduction in iterations.

These results suggest that the Bayes criteria might provide useful initial estimates for ML estimation routines for mixtures of multivariate normals like Wolfe's. The use of a clustering criterion to generate initial estimates for ML routines is not new; in fact Wolfe uses MacQueen's (1965) clustering procedure together with a hierarchical grouping by Ward (1967) to obtain initial estimates for his ML equations. However, the Bayes criteria are derived for three specific situations with a mix of multivariate normals. The demonstration of improved performance by the Bayes criterion in Table 5 for these preliminary comparisons is then not unexpected.
5. Further Discussion
5.1 Estimation of Parameters in Mixtures of Normals
Marriott (1975) has pointed out that the approach of Wolfe and Day provides estimates of the parameters θ which have asymptotically desirable properties. It is also pointed out that estimates of θ based on the partition Ẑ that minimizes |W| are inconsistent. This also applies to the Bayes modification. For example, the distance between the means will be over-estimated and the common variance for each measurement will be under-estimated. This is due to the overlap of the mixture components and the truncation process involved in the optimization to find Ẑ. We note, however, that in spite of these obvious shortcomings, the estimates of θ based on Ẑ still provide useful initial estimates for MLE routines, as pointed out in Section 4 and as shown by the results of Table 5.
More importantly, we note that in determining the optimum allocation from the Bayes approach and from the Wolfe-Day approach, there is a difference in the primary intent of each approach. The clustering procedures provide an allocation of each observation to one of several groups. The approach of Wolfe-Day is primarily to obtain the maximum likelihood estimates of the parameters in the mixture. The component origin estimation is a subsequent consideration, which proceeds as if the ML estimates of the parameters in the mixture are in fact the true values of θ. The Bayes approach described here focuses on the estimation of the component origin of each observation, averaging over the uncertainty involving the mixture parameters to determine the optimum component origin of each observation. This is a difference in the estimation philosophy of the two approaches. The Bayes approach to estimating the parameters of the mixture would involve normalization of the product of the likelihood (15) and a prior, p(θ), for the mixture parameters to determine the posterior distribution of θ. With vague forms of the prior, such as those employed in Section 2, the Bayes mode of the posterior distribution of θ and the maximum likelihood estimates of θ would tend to be the same with large samples.
5.2 Practical Computational Aspects
The practical difficulties with determining either the ML estimate of θ
or the optimizing allocation Ẑ are formidable. Both are plagued by relative
minima. With small samples one should feel more confident with the search
routines for Ẑ. Several approximate algorithms are available to search for
the optimum partition into several groups. For example, see Forgy (1965),
MacQueen (1965), Friedman and Rubin (1969), and McRae (1971). These seem to
work well in practice, as illustrated by the results in Table 5 using McRae's
routine, but they provide no assurance that the optimum has been reached.
Relative minima present difficulties for such routines; see Friedman and
Rubin (1969), p. 1165 and the results in Tables 3, 4 and 5. Feasible search
routines guaranteeing an optimum over the G^n partitions are for the most part
not available. See, for example, the difficulties with only two groups and the
simple tr(W) criterion examined by Scott and Symons (1971a). Not only do the
number of groups and the sample size enter, but also the number of dimensions.
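Search routines of the Forgy-MacQueen-McRae type can be sketched as a simple exchange pass: move one unit at a time to whichever group lowers the criterion, and stop when a full pass produces no improvement. The sketch below uses the tr(W) criterion; it illustrates the style of such routines, not McRae's actual MIKCA program, and like those routines it can stall in a relative minimum.

```python
def trace_w(data, labels, G):
    # tr(W): pooled within-groups sum of squared deviations about group means.
    total = 0.0
    for g in range(G):
        pts = [x for x, lab in zip(data, labels) if lab == g]
        if not pts:
            continue
        for j in range(len(pts[0])):
            mean = sum(x[j] for x in pts) / len(pts)
            total += sum((x[j] - mean) ** 2 for x in pts)
    return total

def exchange_search(data, labels, G):
    # Repeatedly try moving each unit to another group, keeping any move
    # that lowers the criterion, until a full pass makes no change.
    labels = list(labels)
    best = trace_w(data, labels, G)
    improved = True
    while improved:
        improved = False
        for i in range(len(data)):
            for g in range(G):
                if g == labels[i]:
                    continue
                old = labels[i]
                labels[i] = g
                val = trace_w(data, labels, G)
                if val < best:
                    best = val
                    improved = True
                else:
                    labels[i] = old
    return labels, best
```

Starting the routine from several random allocations and keeping the best result is the usual guard against relative minima, though it offers no guarantee.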
The maximum likelihood approaches face convergence problems and can produce
unreliable results with small samples. For example, with the equal sized sets of
10 plants in the lower half of Table 3, solutions comparable to those presented
from the clustering criteria were produced for the fourth and fifth sets of ten.
For the first set of 10, a 19-one split was found as compared with a 10-10 split
for the clustering criteria. No solution was produced after 100 iterations for
the second and third sets. However, with larger samples, the maximum likelihood
approaches perform much better. This is fortunate, since the search routines
become questionable as the number of partitions possible increases exponentially
with the sample size. Consequently, to estimate the optimum allocation, one
can recommend the search routines with small samples, and the maximum likelihood
approach of Wolfe-Day for large samples. Initial estimates for the ML routines
could be obtained utilizing the search routines over the whole data set, or
over subsets if the sample size is very large.
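The exponential growth mentioned above is easy to quantify: with G labeled groups there are G^n possible allocations of n units, and the Stirling numbers of the second kind count the partitions into a given number of nonempty unlabeled groups. A short sketch (our illustration, not part of the paper):

```python
from math import comb, factorial

def stirling2(n, g):
    """Number of partitions of n units into exactly g nonempty
    (unlabeled) groups: the Stirling number of the second kind."""
    return sum((-1) ** j * comb(g, j) * (g - j) ** n
               for j in range(g + 1)) // factorial(g)

# Even with only two groups the count is hopeless for exhaustive search
# at the sample sizes of Tables 1 and 2:
print(stirling2(4, 2))    # 7
print(stirling2(20, 2))   # 524287, i.e. 2**19 - 1
print(stirling2(60, 2))   # 2**59 - 1, about 5.8e17
```

For n = 60 and two groups there are already more than 10^17 candidate partitions, which is why only approximate search routines are feasible.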
Geisser, S. and Cornfield, J. (1963). Posterior distributions for multivariate normal parameters. Journal of the Royal Statistical Society, Series B 25, 368-76.
Hartley, H. O. and Rao, J. N. K. (1968). Classification and estimation in analysis of variance problems. Review of the International Statistical Institute 36, 141-7.
Jeffreys, H. (1961). Theory of Probability. Clarendon, Oxford.
Johnson, N. L. and Kotz, S. (1972). Distributions in Statistics: Continuous Multivariate Distributions. Wiley, New York.
Kendall, M. G. and Buckland, W. R. (1958). A Dictionary of Statistical Terms. Hafner, New York.
Kendall, M. G. and Stuart, A. (1966). The Advanced Theory of Statistics, Vol. 3. Hafner, New York.
Lee, K. L. (1977). Multivariate tests for clusters. Journal of the American Statistical Association. Accepted for publication.
Lindley, D. V. (1965). Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 2: Inference. Cambridge University Press.
MacQueen, J. (1965). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 281-297.
McRae, D. J. (1971). MIKCA: a Fortran IV iterative K-means cluster analysis program. Behavioral Science 16, 423-4.
Rao, C. R. (1952). Advanced Statistical Methods in Biometric Research. Wiley, New York.
Scott, A. J. and Symons, M. J. (1971a). Clustering methods based on likelihood ratio criteria. Biometrics 27, 387-97.
Scott, A. J. and Symons, M. J. (1971b). Clustering methods based on likelihood ratio criteria. Biometrics 27, 387-97.
Ward, J. H., Jr. (1967). PERSUB Reference Manual: PRL-TR-67-3(II). Personnel Research Laboratory, Lackland AFB, Texas.
Wolfe, J. H. (1967). NORMIX: computational methods for estimating the parameters of multivariate normal mixtures of distributions. Research Memo. SRM 68-2. U. S. Naval Personnel Research Activity, San Diego, California.
Wolfe, J. H. (1969). Pattern clustering by multivariate mixture analysis. Research Memo. SRM 69-17. U. S. Naval Personnel Research Activity, San Diego, California.
Wolfe, J. H. (1971). A Monte Carlo study of the sampling distribution of the likelihood ratio for mixtures of multinormal distributions. Technical Bulletin STB 72-2. U. S. Naval Personnel Research Activity, San Diego, California.
Acknowledgement

The author wishes to acknowledge the assistance of Kerry L. Lee, a
graduate student in Biostatistics at the time the initial results of this
research were obtained. The helpful comments of the referees and thorough
review by an associate editor are gratefully acknowledged. This included
the notational convention of this paper and pointing out that an earlier
version of the criterion for the situation with unequal and unknown covariance
matrices had not included all the terms from normalizing each of the G
requisite Wishart distributions. More recently, the excellent typing and
patience of Pauline Kopec and the expert programming assistance of Thomas
Langenderfer are also appreciated.
LIST OF FIGURES

FIGURE 1: Scatter Diagram of Third 10 Iris Versicolor (*) and 50 Iris Virginica (0)
LIST OF TABLES

TABLE 1: Contribution of the Bayes Modification Portion of Cluster Criterion (6) by Component and Total for Three Selections of Number of Variables and Two Sample Sizes with Three Selected Allocations of the Number of Observations to Two Groups

TABLE 2: Contribution of the Bayes Modification Portion of Cluster Criterion (6) by Component and Total for Three Selections of Number of Variables and Two Sample Sizes with Three Selected Allocations of the Number of Observations to Four Groups

TABLE 3: Comparison of Cluster Solutions with the Determinant of the Within Groups Sum of Squares and Bayes Modification (10) for Various Balanced Sets of Setosa, Versicolor and Virginica Plant Measurements

TABLE 4: Comparison of Cluster Solutions with the Determinant of the Within Groups Sum of Squares and Bayes Modification (10) for Ten Unbalanced Data Sets Composed of Versicolor and Virginica Observations

TABLE 5: Comparison of Cluster Solutions with Wolfe's Maximum Likelihood Approach and Bayes Criterion (9) for Ten Unbalanced Data Sets Composed of Versicolor and Virginica Observations
TABLE 1

Contribution of the Bayes Modification Portion of Cluster Criterion (6) by Component and Total for Three Selections of Number of Variables and Two Sample Sizes with Three Selected Allocations of the Number of Observations to Two Groups

Number of   Allocation        Component from      Component from          Total Contribution of the
Variables   n_g (g=1,2): n    averaging over      averaging over the      Bayes Modification Portion:
P                             the means:          mixing proportions:     Σ[P ln(n_g) − 2 ln Γ(n_g)]
                              P Σ ln(n_g)         −2 Σ ln Γ(n_g)

2           1, 19: 20           5.89              -72.79                  -66.90
            6, 14: 20           8.86              -54.68                  -45.82
            10, 10: 20          9.21              -51.21                  -42.00
4           1, 19: 20          11.78              -72.79                  -61.01
            6, 14: 20          17.72              -54.68                  -36.96
            10, 10: 20         18.42              -51.21                  -32.79
8           1, 19: 20          23.56              -72.79                  -49.24
            6, 14: 20          35.45              -54.68                  -19.23
            10, 10: 20         36.84              -51.21                  -14.37
2           5, 55: 60          11.23              -335.00                 -323.76
            20, 40: 60         13.37              -291.94                 -278.57
            30, 30: 60         13.61              -285.03                 -271.42
4           5, 55: 60          22.47              -335.00                 -312.53
            20, 40: 60         26.74              -291.94                 -265.20
            30, 30: 60         27.21              -285.03                 -257.82
8           5, 55: 60          44.93              -335.00                 -290.06
            20, 40: 60         53.48              -291.94                 -238.47
            30, 30: 60         54.42              -285.03                 -230.61
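The column formulas of Table 1 can be reproduced directly with the log-gamma function. A minimal sketch (the function name is ours, purely for checking the entries):

```python
from math import log, lgamma

def bayes_modification(alloc, p):
    """Bayes modification portion of criterion (6) for an allocation
    (n_1, ..., n_G) of units to G groups with p measurements per unit:
    the means component p * sum ln(n_g), the mixing-proportions
    component -2 * sum ln Gamma(n_g), and their total."""
    means_part = p * sum(log(n) for n in alloc)
    mixing_part = -2.0 * sum(lgamma(n) for n in alloc)
    return means_part, mixing_part, means_part + mixing_part

# The (10, 10: 20), P = 2 row of Table 1: about (9.21, -51.21, -42.00).
print(bayes_modification((10, 10), 2))
```

The same call with four-group allocations such as (1, 1, 1, 17) reproduces the rows of Table 2.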
TABLE 2

Contribution of the Bayes Modification Portion of Cluster Criterion (6) by Component and Total for Three Selections of Number of Variables and Two Sample Sizes with Three Selected Allocations of the Number of Observations to Four Groups

Number of   Allocation           Component from      Component from          Total Contribution of the
Variables   n_g (g=1,...,4): n   averaging over      averaging over the      Bayes Modification Portion:
P                                the means:          mixing proportions:     Σ[P ln(n_g) − 2 ln Γ(n_g)]
                                 P Σ ln(n_g)         −2 Σ ln Γ(n_g)

2           1, 1, 1, 17: 20        5.67              -61.34                  -55.68
            3, 3, 7, 7: 20        12.18              -29.09                  -16.91
            5, 5, 5, 5: 20        12.88              -25.42                  -12.55
4           1, 1, 1, 17: 20       11.33              -61.34                  -50.01
            3, 3, 7, 7: 20        24.36              -29.09                   -4.73
            5, 5, 5, 5: 20        25.75              -25.42                    0.33
8           1, 1, 1, 17: 20       22.67              -61.34                  -38.68
            3, 3, 7, 7: 20        48.71              -29.09                   19.62
            5, 5, 5, 5: 20        51.50              -25.42                   26.08
2           3, 3, 3, 51: 60       14.46              -301.11                 -286.67
            10, 10, 20, 20: 60    21.19              -208.57                 -187.37
            15, 15, 15, 15: 60    21.66              -201.53                 -179.87
4           3, 3, 3, 51: 60       28.91              -301.11                 -272.20
            10, 10, 20, 20: 60    42.39              -208.57                 -166.18
            15, 15, 15, 15: 60    43.33              -201.53                 -158.20
8           3, 3, 3, 51: 60       57.82              -301.11                 -243.29
            10, 10, 20, 20: 60    84.77              -208.57                 -123.79
            15, 15, 15, 15: 60    86.66              -201.53                 -114.87
TABLE 3

Comparison of Cluster Solutions with the Determinant of the Within Groups Sum of Squares and Bayes Modification (10) for Various Balanced Sets of Setosa, Versicolor and Virginica Plant Measurements

Criterion°           All 50 Versicolor  1st 30 Versicolor  Last 30 Versicolor  1st 20 Versicolor  Last 20 Versicolor
                     & All 50 Setosa    & 1st 30 Setosa    & Last 30 Setosa    & 1st 20 Setosa    & Last 20 Setosa

Determinant Criterion
  random start:      50,50; 0: 1190.48*  30,30; 0: 127.87   30,30; 0: 143.26    20,20; 0: 21.52    20,20; 0: 20.55
  solution start:    50,50; 0: 1190.48   30,30; 0: 127.87   30,30; 0: 143.26    20,20; 0: 21.52    20,20; 0: 20.55
  solution value:    50,50; 0: 1190.48   30,30; 0: 127.87   30,30; 0: 143.26    20,20; 0: 21.52    20,20; 0: 20.55

Bayes Modification (10)
  random start:      50,50; 0: 147.08    30,30; 0: 23.54    30,30; 0: 30.13     20,20; 0: -16.77   20,20; 0: -18.52
  solution start:    50,50; 0: 147.08    30,30; 0: 23.54    30,30; 0: 30.13     20,20; 0: -16.77   20,20; 0: -18.52
  solution value:    50,50; 0: 147.08    30,30; 0: 23.54    30,30; 0: 30.13     20,20; 0: -16.77   20,20; 0: -18.52

Criterion°           1st 10 Versicolor  2nd 10 Versicolor  3rd 10 Versicolor   4th 10 Versicolor  5th 10 Versicolor
                     & 1st 10 Virginica & 2nd 10 Virginica & 3rd 10 Virginica  & 4th 10 Virginica & 5th 10 Virginica

Determinant Criterion
  random start:      10,10; 0: 2.36                         9,11; 1: 13.28      7,13; 5: 0.87      5,15; 5: 10.68
                                                                                4,16; 8: 1.19
  solution start:    10,10; 0: 2.36     10,10; 0: 13.68     7,13; 5: 0.87       10,10; 0: 14.19    10,10; 0: 0.49
  solution value:    10,10; 0: 2.36     10,10; 0: 13.68     10,10; 0: 2.60      10,10; 0: 14.19    10,10; 0: 0.49

Bayes Modification (10)
  random start:      5,15; 9: -5.13     1,19; 9: 2.74       4,16; 8: -39.62     3,17; 7: 3.51      10,10; 0: -45.68
                     10,10; 0: -17.33   9,11; 3: 19.98                          6,14; 8: 4.19
  solution start:    10,10; 0: -17.33   10,10; 0: 14.30     4,16; 8: -39.62     10,10; 0: 14.96    10,10; 0: -45.68
  solution value:    10,10; 0: -17.33   10,10; 0: 14.30     10,10; 0: -15.60    10,10; 0: 14.96    10,10; 0: -45.68

* The table entries are: the first two numbers are the cluster sizes; the third is the number of misclassifications; the criterion value is shown after the colon for each solution.
° The criteria are each to be minimized. The "random start" solutions are from the search for the allocation minimizing the specified criterion as provided by McRae's program (1971). The "solution start" provides the allocation specified in the column heading as the preliminary allocation for the exchange portion of the routine; see p. 9 of text. The "solution value" gives the criterion value at the allocation specified in the column heading.
TABLE 4

Comparison of Cluster Solutions with the Determinant of the Within Groups Sum of Squares and Bayes Modification (10) for Ten Unbalanced Data Sets Composed of Versicolor and Virginica Observations

Criterion°           1st 10 Versicolor  2nd 10 Versicolor  3rd 10 Versicolor   4th 10 Versicolor  5th 10 Versicolor
                     & All 50 Virginica & All 50 Virginica & All 50 Virginica  & All 50 Virginica & All 50 Virginica

Determinant Criterion
  random start:      26,34;16: 963.37*  26,34;16: 1404.25  30,30;20: 884.08    30,30;20: 1112.46  27,33;23: 1011.40
                     30,30;20: 1153.04  29,31;19: 1417.73  26,34;16: 1128.96   24,36;14: 1049.83  29,31;19: 1086.22
  solution start:    10,50; 0: 1228.70  10,50; 0: 1441.27  10,50; 0: 1242.35   10,50; 0: 1428.75  10,50; 0: 1069.49
  solution value:    10,50; 0: 1228.70  10,50; 0: 1441.27  10,50; 0: 1242.35   10,50; 0: 1428.75  10,50; 0: 1069.49

Bayes Modification (10)
  random start:      10,50; 0: 122.72   8,52; 2: 129.22    10,50;20: 115.82    11,49;21: 119.34   10,50; 0: 114.67
                     26,34;16: 139.54                      30,30;20: 135.68    8,52; 2: 119.96    12,48;22: 122.28
                                                           23,37;15: 147.58
  solution start:    10,50; 0: 122.72   10,50; 0: 131.97   10,50; 0: 125.41    10,50; 0: 131.47   10,50; 0: 114.67
  solution value:    10,50; 0: 122.72   10,50; 0: 131.97   10,50; 0: 125.41    10,50; 0: 131.47   10,50; 0: 114.67

Criterion°           All 50 Versicolor  All 50 Versicolor  All 50 Versicolor   All 50 Versicolor  All 50 Versicolor
                     & 1st 10 Virginica & 2nd 10 Virginica & 3rd 10 Virginica  & 4th 10 Virginica & 5th 10 Virginica

Determinant Criterion
  random start:      13,47; 3: 254.42   26,34;22: 392.60   28,32;22: 216.32    15,45; 5: 368.57   24,36;20: 206.86
                     29,31;29: 330.66   23,37;21: 399.94   16,44; 6: 220.89    7,53; 5: 426.53    10,50; 2: 221.14
  solution start:    10,50; 0: 323.28   10,50; 0: 459.40   10,50; 0: 315.00    10,50; 0: 484.94   10,50; 0: 233.31
  solution value:    10,50; 0: 323.28   10,50; 0: 459.40   10,50; 0: 315.00    10,50; 0: 484.94   10,50; 0: 233.31

Bayes Modification (10)
  random start:      13,47; 3: 41.04    8,52; 4: 57.25     6,54; 4: 41.53      6,54; 6: 44.68     9,51; 1: 22.24
                     29,31;25: 85.96    28,32;22: 53.75
  solution start:    10,50; 0: 45.28    10,50; 0: 65.65    10,50; 0: 43.77     10,50; 0: 68.80    10,50; 0: 26.36
  solution value:    10,50; 0: 45.28    10,50; 0: 65.65    10,50; 0: 43.77     10,50; 0: 68.80    10,50; 0: 26.36

* The table entries are: the first two numbers are the cluster sizes; the third is the number of misclassifications; the criterion value is shown after the colon for each solution.
° The criteria are each to be minimized. The "random start" solutions are from the search for the allocation minimizing the specified criterion as provided by McRae's program (1971). The "solution start" provides the allocation specified in the column heading as the preliminary allocation for the exchange portion of the routine; see p. 9 of text. The "solution value" gives the criterion value at the allocation specified in the column heading.
TABLE 5

Comparison of Cluster Solutions with Wolfe's Maximum Likelihood Approach and Bayes Criterion (9) for Ten Unbalanced Data Sets Composed of Versicolor and Virginica Observations

Technique°            1st 10 Versicolor  2nd 10 Versicolor  3rd 10 Versicolor   4th 10 Versicolor  5th 10 Versicolor
                      & All 50 Virginica & All 50 Virginica & All 50 Virginica  & All 50 Virginica & All 50 Virginica

Bayes Criterion (10)  10,50; 0: 122.72*  8,52; 2: 129.22    10,50;20: 115.82    11,49;21: 119.34   10,50; 0: 114.67

Wolfe's Maximum       26,34;16: 127.22   9,51; 1: 127.83    22,38;14: 132.05    8,52; 2: 132.13    19,41;29: 126.99
Likelihood            18 iterations      8 iterations       15 iterations       8 iterations       12 iterations

Wolfe's Maximum       10,50; 0: 131.57   9,51; 1: 127.83    11,49;21: 136.65    11,49;21: 133.91   10,50; 0: 135.34
Likelihood with       6 iterations       8 iterations       21 iterations       8 iterations       4 iterations
Initial Estimates
from the Bayes
Criterion (10)
solution

Technique°            All 50 Versicolor  All 50 Versicolor  All 50 Versicolor   All 50 Versicolor  All 50 Versicolor
                      & 1st 10 Virginica & 2nd 10 Virginica & 3rd 10 Virginica  & 4th 10 Virginica & 5th 10 Virginica

Bayes Criterion (10)  13,47; 3: 41.04    8,52; 4: 57.25     6,54; 4: 41.53      6,54; 6: 44.68     9,51; 1: 22.24

Wolfe's Maximum       27,33;29: 161.68   8,52; 2: 163.36    30,30;28: 171.01    21,39;21: 162.01   23,37;23: 172.38
Likelihood            10 iterations      16 iterations      21 iterations       11 iterations      9 iterations

Wolfe's Maximum       13,47; 3: 174.07   8,52; 4: 164.60    9,51; 3: 179.13     6,54; 6: 170.00    9,51; 1: 182.71
Likelihood with       4 iterations       21 iterations      18 iterations       4 iterations       5 iterations
Initial Estimates
from the Bayes
Criterion (10)
solution

* The table entries are: the first two numbers are the cluster sizes; the third is the number of misclassifications. If more than one minimum was found with the Bayes Criterion, the criterion value is shown after the colon for each solution. The likelihood value with Wolfe's ML approach is shown to compare the results using different initial estimates of the parameters.
° The Bayes criterion is to be minimized and Wolfe's ML procedure is to maximize.