BAYES MODIFICATION OF SOME CLUSTERING CRITERIA
By
M. J. Symons
Department of Biostatistics, The University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 880
AUGUST 1973
Revised OCTOBER 1977
Bayes Modification of Some Clustering Criteria
M. J. Symons
Department of Biostatistics, University of North Carolina
Chapel Hill, N. C. 27514, USA
Summary
Clustering criteria, like the trace and determinant of the within groups sum of squares matrix, can be related to models involving multivariate normal distributions. Inherent in these criteria is a tendency to split a data set into approximately equal sized groups regardless of the underlying proportion of mix, unless the groups are widely separated. In this paper we derive criteria using a Bayesian approach which are more sensitive to disparate group sizes. These are extensions of standard criteria which include the number of units assigned to each sub-group, the number of sub-groups and the number of measurements on each unit, as part of the criterion to be optimized. In addition, a new criterion is deduced for the situation where the covariance matrices of the clusters are not homogeneous. An example is included.
A Bayes criterion is also compared with the maximum likelihood approach
of Wolfe (1967, 1969) and Day (1969). The results suggest that the solution
using the Bayes criterion will provide useful initial parameter estimates for
maximum likelihood routines with mixtures of multivariate normals.
1. INTRODUCTION
Numerous techniques are available for partitioning a set of observations into homogeneous sub-groups. For example, the method of cluster analysis discussed by Edwards and Cavalli-Sforza (1965) measures homogeneity by minimizing the trace of the within groups sum of squares, using an algorithm that imposes a hierarchical structure on the sub-groups. In such cases the criterion and algorithm together implicitly determine the definition of a cluster. Rao (1952) states that a cluster has a very vague meaning and perhaps the Kendall and Buckland
(1958) definition as "a group of contiguous elements of a statistical population"
is specific enough for most purposes. An alternate approach is to first define a cluster and then deduce the criterion which measures contiguousness or similarity in a cluster. A subsequent step would be the devising of an algorithm to determine clusters so defined. This seems preferable to the uncertain interpretation of clusters implicitly defined by a criterion-algorithm combination.
The model discussed here is for a clustering situation where the sampled population is thought to be composed of several distinct sub-populations and the intent is to group together all those observations belonging to the same sub-population. As statistical structure, we suppose that each observation arises from one of a small number of different distributions, one modeling each sub-population. Hence the model for the population is a mixture, the component distributions modeling the sub-populations. Various choices of distributions would lead to different criteria. Here they are taken as multivariate normals and this is the same model as discussed by Wolfe (1967, 1969) and Day (1969).
However, their approach is from a traditional point of view. Geisser (1966) uses the same model and a Bayesian approach, but assumes that the mixing weight for each component is known. This model is also a special case of the model in Scott and Symons (1971b); however, a Bayesian point of view is stressed here. A discussion of the approach in the present paper and that of Wolfe and Day is presented in Section 4, along with a comparison of results from the two approaches using the Fisher Iris data.
A cluster is defined as those observations coming from the same component distribution in the mixture. The problem is formulated as one of predicting the component origin of each observation. Scott and Symons (1971b) approached this problem using maximum likelihood methods and found that the solution is equivalent to some standard clustering methods, depending on the assumptions about the covariance matrices of the multivariate normals in the mixture and assuming that the observations are equally likely from any component. This suggested that the standard clustering methods would perform best when the sub-groups are represented in about the same proportions. The approach here is closely related to the joint normal classification of Geisser (1966), but with prior probabilities of an observation coming from any component being unknown. Also, to conform with the clustering setting presented by Scott and Symons (1971b), sample data from the components are presumed unavailable.
Two of the criteria derived are shown to be modifications of standard clustering criteria. The modifications primarily involve the number of units assigned to each sub-group, the number of sub-groups, and the number of measurements on each unit. The Bayes criteria are shown to be more sensitive to disparate group sizes than the corresponding standard criteria. When the covariance matrices of the component normals may not be presumed homogeneous, a new criterion is deduced using the Bayes approach.
2. Bayes Approach
Let the sample consist of n independent observations Y = (y_1, ..., y_n), with y_i representing p measurements on the i-th unit. No provision for previous samples from any of the components is made, since little or no prior knowledge is available in most applications where clustering techniques are used. However, in a more general discussion of classification such a provision is reasonable and easily incorporated; see for example Scott and Symons (1971b) and especially Geisser (1966). The sampled population is modeled by a mixture of G p-variate normal distributions with means μ_1, ..., μ_G and covariance matrices Σ_1, ..., Σ_G. Each observation may arise from the g-th constituent normal with probability π_g, where 0 ≤ π_g ≤ 1 for g = 1, ..., G and the π_g sum to one.
All the observations from the same component in the mixture model are considered as a cluster. A sample observation y_i from a mixture model can be thought of as resulting from two random steps. First there is a multinomial trial, specifically, the realization of an indicator random variable taking the value g with probability π_g; second, a draw from the g-th component normal. It is the outcome of this unobservable multinomial trial for each observation y_i, call it z_i, that we would like to predict.
The device of introducing notation for the missing information on class membership of an observation is not new. Hartley and Rao (1968) use a "decision parameter" which takes on the value of unity if the g-th group is the one selected for the observation and zero if not. With this parameter introduced, and in an ANOVA setting, they formulate the problem of classification of an observation of uncertain class origin, or ANOVA cell, and the estimation of the usual parameters of an ANOVA as one single estimation problem. They presume "pilot samples," i.e., observations of known class origin, as part of the available data. The class origin is presumed to be deterministic and, quite appropriately in their ANOVA setting, there is no consideration of any random structure for determining the class origin of an observation, such as a mixture of the possible ANOVA cell origins for each observation. Their main emphasis is on the estimation of the ANOVA model parameters when the class origin is uncertain for some observations.
The likelihood of Y is completely determined by the parameters θ = (π_1, ..., π_G, μ_1, ..., μ_G, Σ_1, ..., Σ_G) and Z = (z_1, ..., z_n). Note that 1 ≤ z_i ≤ G and z_i = g with probability π_g. The likelihood of the data Y is given by

$$L(Y \mid \theta, Z) = \prod_{g=1}^{G} \pi_g^{n_g} \, |\Sigma_g|^{-n_g/2} \exp\Big\{ -\tfrac{1}{2} \sum_{g=1}^{G} \sum_{i \in C_g} (y_i - \mu_g)' \Sigma_g^{-1} (y_i - \mu_g) \Big\}, \qquad (1)$$

where C_g is the collection of y_i's with z_i = g, and n_g is the number of observations in C_g. Note that this likelihood is conditional on a specific value of Z and that there are G^n possible allocations of the n observations to the G components.
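For concreteness, the classification likelihood (1) can be evaluated directly for a candidate allocation. The following is a minimal numpy sketch (my own illustration, not code from the paper; the function name and argument shapes are assumptions), with the (2π) constant restored for a proper log-density:

```python
import numpy as np

def class_log_lik(Y, z, pi, mu, Sigma):
    """Log of the classification likelihood (1): each y_i is evaluated under
    the component named by its label z_i, weighted by pi_g.
    Shapes: Y (n, p); z (n,) with labels 0..G-1; pi (G,); mu (G, p);
    Sigma (G, p, p)."""
    n, p = Y.shape
    ll = 0.0
    for g in range(len(pi)):
        Yg = Y[z == g]
        ng = len(Yg)
        if ng == 0:
            continue  # the mixture model allows empty components
        Sinv = np.linalg.inv(Sigma[g])
        _, logdet = np.linalg.slogdet(Sigma[g])
        dev = Yg - mu[g]
        quad = np.einsum('ij,jk,ik->', dev, Sinv, dev)  # sum of Mahalanobis terms
        ll += ng * np.log(pi[g]) - 0.5 * ng * (logdet + p * np.log(2 * np.pi)) \
              - 0.5 * quad
    return ll
```

Evaluating this for each of the G^n allocations is what the conditioning on Z means in practice.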
The clustering problem then is to find the optimum assignment of the y_i to the components and hence to the clusters C_1, ..., C_G. A Bayesian approach is taken to jointly predicting the most likely outcome z_i of the unobservable multinomial trial in the generation of each of the n observations. This requires the specification of a prior distribution for θ, p(θ), and the combination of this prior with the likelihood (1). The parameters θ are averaged over since they are not of central interest. This is in contrast to the emphasis of Hartley and Rao (1968), where the parameters of the ANOVA model are of primary interest. Here the predictive distribution of Y is of primary importance, i.e., the joint likelihood of y_i coming from the component identified by z_i, i = 1, ..., n. This can be written as

$$L(Y \mid Z) = \int L(Y \mid \theta, Z) \, p(\theta) \, d\theta, \qquad (2)$$

where the integration is over the parameter space of θ. Alternatively, the predictive distribution, f(Y|Z), is the normalized L(Y|Z) in (2). Following Geisser (1966), this could be referred to as the joint prior, rather than posterior, predictive probability of y_i coming from component z_i for i = 1, ..., n, since samples from the components of the mixture are not available in this clustering situation.
The mode of L(Y|Z) is taken as the Bayes estimate of the cluster solution. There are predictable difficulties in determining this mode, since the G^n possible values for Z are not ordered in any simple way. Local extrema are frequently encountered in applications. The explicit form of (2) and a discussion of finding the mode of (2) are presented in the next sub-sections for three cases considered to have the most practical importance.
2.1 Covariance Matrices Equal and Known
If Σ_g = Σ for g = 1, ..., G and Σ is known, the prior

$$p(\theta) = p(\mu_1, \ldots, \mu_G) \, p(\pi_1, \ldots, \pi_G) \propto \Big[ \prod_{g=1}^{G} \pi_g \Big]^{-1} \qquad (3)$$
is selected. This is the product of two independent priors of ignorance, one for the means and the other for the mixing proportions. The choice is consistent with a general philosophy in cluster analysis that the data are to determine the clusters. The priors chosen only delineate the bounds of the parameter space; see Jeffreys (1961) or Lindley (1965) for a further discussion.
The integral of the product of (1) and (3) over the parameter space can be accomplished in two steps. The likelihood factors into a portion which involves only the mixing proportions and the cluster sizes, and the remainder, composed of the determinant of the common and known covariance matrix and the exponential portion. The integral over the mixing parameters amounts to the normalization of a G − 1 variate Dirichlet distribution (see Johnson and Kotz (1972), p. 233) and yields the terms Γ(n_g) = (n_g − 1)! that depend on the cluster sizes. Constants like n = Σ n_g are not included. The integral over the mean vectors is the product of the normalization of G p-variate normals, all with the same known covariance matrix. It contributes factors of n_g^{−p/2} and an exponential term of −½ tr(W Σ^{−1}). Similar details of the required matrix manipulations are given in Anderson (1958), p. 46. The result, L(Y|Z), is then proportional to

$$\exp\big\{ -\tfrac{1}{2} \operatorname{tr}(W \Sigma^{-1}) \big\} \prod_{g=1}^{G} \Gamma(n_g) \, n_g^{-p/2}, \qquad (4)$$

where

$$W = \sum_{g=1}^{G} \sum_{i \in C_g} (y_i - \bar{y}_g)(y_i - \bar{y}_g)' \qquad (5)$$

and ȳ_g is the mean of those y_i allocated to the g-th group and W is the within groups sum of squares matrix.
The Bayes prediction of Z, denoted Ẑ, is that allocation of each y_i, i = 1, ..., n, to one of the G components in the mixture that maximizes (4). Equivalently, the grouping minimizes the negative of two times the natural logarithm of (4), or

$$\operatorname{tr}(W) + \sum_{g=1}^{G} \big\{ p \ln(n_g) - 2 \ln \Gamma(n_g) \big\}, \qquad (6)$$

with a pre-analysis rotation to remove correlations and standardize the variances of the observations so that Σ = I, the p × p identity matrix. For the situation where Σ is known, this is easily accomplished by the transformation x = Ty, where T is the square root matrix or the transpose of the orthogonal matrix in the spectral decomposition of the covariance matrix Σ. Edwards and Cavalli-Sforza (1965) proposed the trace of the within sum of squares matrix, tr(W), as a clustering criterion.
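The whitening transformation and criterion (6) can both be sketched in a few lines. This is an illustration of my own (function names are assumptions), using the stdlib `lgamma` for the ln Γ(n_g) terms:

```python
import numpy as np
from math import lgamma, log

def whiten(Y, Sigma):
    """Pre-analysis transformation x = T y, with T the inverse symmetric
    square root of the known covariance, so the transformed data have
    identity covariance."""
    vals, vecs = np.linalg.eigh(Sigma)
    T = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return Y @ T.T

def bayes_trace_criterion(Y, z, G):
    """Criterion (6): tr(W) + sum_g [ p ln(n_g) - 2 ln Gamma(n_g) ],
    assuming the data have already been whitened (Sigma = I)."""
    p = Y.shape[1]
    crit = 0.0
    for g in range(G):
        Yg = Y[z == g]
        dev = Yg - Yg.mean(axis=0)  # deviations from the group mean
        crit += (dev ** 2).sum() + p * log(len(Yg)) - 2 * lgamma(len(Yg))
    return crit
```

The two penalty terms per group are exactly the modification that distinguishes (6) from the plain tr(W) criterion.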
The Bayes formulation (6) appends to tr(W) terms involving the number of observations allocated to each component, the number of components and the number of variates. The overall effect of these terms is to favor unequal splits rather than equal ones. This can be seen in Tables 1 and 2, where the total contribution of the Bayes modification is partitioned into portions coming from the averaging over the mixing parameters and the averaging over the means. The contribution of the terms from the mixing parameters dominates that of the terms from the means, except where the number of variables and number of groups becomes large relative to the total sample size. This is evident in Table 2 for four groups and partitions of 20 observations with four and eight variables. Also, with fixed total sample size and fixed number of groups, as more variates are added less importance is given to unequal splits, i.e., the value of the component from averaging over the mixing proportions becomes less negative. This provides protection against determining spurious clusters of unbalanced size due to variation in irrelevant additional measurements. Analogous statements can be made for a fixed total sample size and a fixed number of variates as the number of groups increases. Also, holding the number of groups and the number of variables both fixed, the magnitude of the contribution of the Bayes modification increases with the total sample size, providing additional sensitivity to disparate group sizes. The trace of the within groups sum of squares, of course, enters the final determination of the clusters and could be the deciding factor for clearly separated groups. In summary, the Bayes modification adds the potential for increased sensitivity to disparate group sizes.
The Bayes modification of the trace of the within groups sum of squares will be optimized by an allocation of the n observations to the G groups which balances the preference for compact, equal sized groups by the trace portion and the net tendency toward unequal splits by the terms involving the cluster sizes, number of clusters and number of variables. As noted above, less weight is given to unequal sized splits when the number of variables and number of groups is large relative to the total sample size. In such situations one could expect equal sized cluster solutions, unless the groups are clearly separated, and then the trace of the within groups sum of squares alone could be decisive in determining a sound cluster solution. (Insert Tables 1 and 2)
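The decomposition of the Bayes modification into its two sources can be tabulated directly. The sketch below is an illustration of my own construction (not a reproduction of Tables 1 and 2): for G = 2 it computes the mixing-proportion term −2 Σ ln Γ(n_g) and the means term p Σ ln(n_g) for every two-group split of n = 20 with p = 4 variates, showing that the total is most negative, i.e., most favored, at the most unequal splits:

```python
from math import lgamma, log

def modification_terms(n1, n2, p):
    """The two sources of the Bayes modification for G = 2: averaging over
    the mixing proportions contributes -2 * sum_g ln Gamma(n_g); averaging
    over the means contributes p * sum_g ln(n_g)."""
    mixing = -2.0 * (lgamma(n1) + lgamma(n2))
    means = p * (log(n1) + log(n2))
    return mixing, means

# Total modification for each two-group split of n = 20 with p = 4 variates.
totals = {}
for n1 in range(1, 11):
    mixing, means = modification_terms(n1, 20 - n1, 4)
    totals[(n1, 20 - n1)] = mixing + means
```

Since the criterion is minimized, the more negative totals at splits like (1, 19) show the pull toward unequal group sizes described above.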
The Bayes criterion with known mixing parameters would include the same terms from averaging over the means and the term −2 Σ n_g ln(π_g) rather than −2 Σ ln Γ(n_g), each sum being over g = 1, ..., G, since no averaging over the uncertainty of the mixing parameters would be required when they are known. Note that if each group is equally likely as the origin of each observation, i.e., π_g = G^{−1} for g = 1, 2, ..., G, then the sum over the G groups of this new term is a constant whatever the cluster sizes. Hence the criterion

$$\operatorname{tr}(W) + p \sum_{g=1}^{G} \ln(n_g) \qquad (7)$$

is the Bayes analogue of the tr(W) criterion proposed by Edwards and Cavalli-Sforza (1965). Note that the Bayes criterion (7) makes explicit use of the assumption that each of the G components is equally likely as the origin of any one of the n observations.
The practical aspects of finding the allocation which minimizes (6), (7), or the trace criterion alone require search routines such as the one compiled by McRae (1971). After saving the best of several randomly determined initial allocations of the y_i to the G groups, his routine produces a relative minimum for a chosen criterion, in the sense that any reassignment of one of the observations to a different group results in a larger value of the criterion, but does not guarantee an absolute minimum. Although the mixture model accommodates the possibility of no observation coming from any one group, generally for cluster analysis applications the search over the G^n values of Z is restricted to ensure that at least one observation is allocated to each group. Note that (6) is unbounded for any n_g = 0, and such an allocation clearly would not correspond to a partition of the observations minimizing (6) or (7).
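McRae's routine itself is not reproduced here, but the single-reassignment descent it performs can be sketched as follows. This is a hypothetical simplification of my own (random restarts are omitted, and criterion (6) is hard-coded for concreteness); like the original, it stops at a relative, not guaranteed absolute, minimum and never empties a group:

```python
import numpy as np
from math import lgamma, log

def bayes_trace(Y, z, G):
    # Criterion (6), assuming pre-whitened data (Sigma = I).
    p = Y.shape[1]
    val = 0.0
    for g in range(G):
        Yg = Y[z == g]
        dev = Yg - Yg.mean(axis=0)
        val += (dev ** 2).sum() + p * log(len(Yg)) - 2 * lgamma(len(Yg))
    return val

def local_search(Y, z0, G, criterion):
    """Move one observation at a time to the group that most lowers the
    criterion; repeat until no single reassignment improves it."""
    z = z0.copy()
    improved = True
    while improved:
        improved = False
        for i in range(len(z)):
            cur = z[i]
            if np.sum(z == cur) == 1:
                continue  # reassignment would leave a group empty
            best_g, best_val = cur, criterion(Y, z, G)
            for g in range(G):
                if g == cur:
                    continue
                z[i] = g  # try observation i in group g
                val = criterion(Y, z, G)
                if val < best_val - 1e-12:
                    best_g, best_val = g, val
            z[i] = best_g
            if best_g != cur:
                improved = True
    return z
```

Starting the descent from several random allocations and keeping the best result approximates the strategy described in the text.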
2.2 Covariance Matrices Equal and Unknown
When Σ_g = Σ for g = 1, ..., G and Σ is unknown, the vague prior

$$p(\theta) \propto |\Sigma|^{-\frac{1}{2}(p+1)} \Big[ \prod_{g=1}^{G} \pi_g \Big]^{-1} \qquad (8)$$

is used to delineate the bounds of the parameter space. The prior on Σ, |Σ|^{−½(p+1)}, is a generalization of the 1/σ² prior of ignorance for the univariate normal variance parameter discussed by Jeffreys (1961). It is of the general form used by Geisser and Cornfield (1963) and is invariant under power transformations.
After forming the product of the likelihood (1) and prior (8), an integration over the parameter sub-space of Σ is required in addition to that described in Section 2.1. This involves the normalization of a Wishart distribution, the result of which can be obtained by inspection from Anderson (1958), p. 154. This contributes a factor containing the determinant of the within groups sum of squares and some constants. The marginal likelihood of the data given an allocation Z, L(Y|Z), then is proportional to

$$|W|^{-\frac{1}{2}(n-G)} \prod_{g=1}^{G} \Gamma(n_g) \, n_g^{-\frac{1}{2}p}. \qquad (9)$$
The Bayes partition Ẑ of the data into G groups maximizes (9), or equivalently minimizes

$$(n-G)\ln|W| + \sum_{g=1}^{G} \big\{ p \ln(n_g) - 2 \ln \Gamma(n_g) \big\}. \qquad (10)$$

The Bayes criterion (10) is more sensitive to disparate group sizes.
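As with (6), criterion (10) is straightforward to evaluate for a candidate allocation. A minimal sketch of my own (the function name is an assumption), pooling the group scatter matrices into W before taking the log-determinant:

```python
import numpy as np
from math import lgamma, log

def bayes_det_criterion(Y, z, G):
    """Criterion (10): (n - G) ln|W| + sum_g [ p ln(n_g) - 2 ln Gamma(n_g) ],
    for a common but unknown covariance matrix."""
    n, p = Y.shape
    W = np.zeros((p, p))
    penalty = 0.0
    for g in range(G):
        Yg = Y[z == g]
        dev = Yg - Yg.mean(axis=0)
        W += dev.T @ dev  # pooled within groups sum of squares
        penalty += p * log(len(Yg)) - 2 * lgamma(len(Yg))
    _, logdetW = np.linalg.slogdet(W)
    return (n - G) * logdetW + penalty
```

Dropping the penalty terms recovers (a monotone function of) the standard |W| criterion, which is the comparison made in Section 3.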
2.3 Covariance Matrices Unequal and Unknown

In this situation where the covariance structure may differ from group to group, the prior chosen is

$$p(\theta) \propto \Big[ \prod_{g=1}^{G} \pi_g \Big]^{-1} \prod_{g=1}^{G} |\Sigma_g|^{-\frac{1}{2}(p+1)}. \qquad (11)$$

As with the prior (8), the parameter space is delineated with only vague a priori information, in both the spirit of Jeffreys (1961) and of cluster analysis. The priors on the means and covariances for each group are independent of one another and each is presumed independent of the prior on the mixing proportions.
The averaging of the nuisance parameters from the product of the likelihood (1) and prior (11) involves no integrations different from those discussed in the previous case. However, some additional terms, which were constants with the integration over the common unknown Σ, appear due to the normalization of a Wishart distribution for each covariance matrix Σ_g, g = 1, ..., G. With unequal and unknown covariance matrices, the Bayes partition of the data into G groups maximizes

$$L(Y \mid Z) \propto \prod_{g=1}^{G} |W_g|^{-\frac{1}{2}(n_g-1)} \, n_g^{-\frac{1}{2}p} \, 2^{\frac{1}{2}p(n_g+p)} \, \Gamma(n_g) \prod_{i=1}^{p} \Gamma\big(\tfrac{1}{2}\{n_g+p+1-i\}\big), \qquad (12)$$

where

$$W_g = \sum_{i \in C_g} (y_i - \bar{y}_g)(y_i - \bar{y}_g)'. \qquad (13)$$

Equivalently, Ẑ minimizes

$$\sum_{g=1}^{G} \Big\{ (n_g-1)\ln|W_g| + p\ln(n_g) - p(n_g+p)\ln 2 - 2\Big[ \ln\Gamma(n_g) + \sum_{i=1}^{p} \ln\Gamma\big(\tfrac{1}{2}\{n_g+p+1-i\}\big) \Big] \Big\}. \qquad (14)$$
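Criterion (14) can likewise be evaluated directly for a candidate allocation. The sketch below is my own illustration of the formula (the function name is an assumption); the restriction n_g ≥ p + 1 discussed next is left to the caller:

```python
import numpy as np
from math import lgamma, log

def bayes_hetero_criterion(Y, z, G):
    """Criterion (14) for unequal, unknown covariances: the sum over groups of
    (n_g - 1) ln|W_g| + p ln(n_g) - p(n_g + p) ln 2
    - 2 [ ln Gamma(n_g) + sum_{i=1}^p ln Gamma((n_g + p + 1 - i)/2) ].
    Each group must have n_g >= p + 1 so that W_g is nonsingular."""
    p = Y.shape[1]
    total = 0.0
    for g in range(G):
        Yg = Y[z == g]
        ng = len(Yg)
        dev = Yg - Yg.mean(axis=0)
        _, logdetWg = np.linalg.slogdet(dev.T @ dev)  # per-group scatter W_g
        mvgamma = sum(lgamma(0.5 * (ng + p + 1 - i)) for i in range(1, p + 1))
        total += ((ng - 1) * logdetWg + p * log(ng)
                  - p * (ng + p) * log(2.0) - 2.0 * (lgamma(ng) + mvgamma))
    return total
```

The product of gamma functions is the per-group Wishart normalization that was constant under the common-Σ integration of Section 2.2.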
Situations involving unequal covariance matrices pose tremendous challenges to clustering algorithms. One need only visualize two cigar-shaped clusters with common location and crossed orientation of the covariance matrices. Chernoff (1970) comments briefly on this case, but the inherent difficulties are very formidable unless the clusters are well separated. A traditional analogue to the Bayes criterion (12) could be the product of the determinants of the individual within group sum of squares matrices.

The search over the G^n possible partitions of the n observations into G groups needs further comment for this case. In addition to the remarks following criteria (6) and (10), the search for Ẑ must be further restricted so that n_g ≥ p + 1, so that W_g will not be singular. The search could be extended to include those partitions with 1 ≤ n_g ≤ p if one used W, the pooled within groups sum of squares, to approximate W_g. One might replace W_g by n_g W/n or (n_g − 1) W/(n − G). This imposes an average covariance structure on the clusters with n_g ≤ p, but allows an expanded search for the optimum allocation of the n observations to include all partitions with the number allocated to each group being at least one, as with the two previous cases.
2.4 Discussion of Approaches

In view of the statistical structure that is presumed to derive these criteria, they may be viewed as instruments of "simultaneous discrimination" or "joint normal classification" as used by Geisser (1966). The traditional clustering criteria usually involve little distributional structure, but do attempt to optimally allocate each of the n observations to one of several groups. The criteria presented here are optimized by a particular allocation of each of the observations, the parameters of the presumed distributions being of secondary importance. The attention is focused on the predictive distribution of Y|Z, where Z = (z_1, ..., z_n) and z_i is the result of an unobservable multinomial random variable that specifies the component origin of the observation y_i.
3. Numerical Illustration of the Sensitivity of the Bayes Criterion (10) with Disparate Group Sizes

The determinant of the within groups sum of squares matrix, |W|, and the Bayes criterion (10) derived in Section 2.2 were compared using various portions of the Fisher Iris data. (See Kendall and Stuart [1966], p. 318.) Briefly, these data are 50 observations on each of three types of Iris: Setosa, Versicolor and Virginica. Each observation is composed of four measurements on the flower: sepal length, sepal width, petal length, and petal width. The Bayes criterion (10) and |W| are each options in an approximate routine constructed by McRae (1971) with slight modifications.
As an initial investigation of the performance of the Bayes modification of |W|, five balanced data sets were constructed involving the Versicolor and Setosa plants and five balanced sets involving the Versicolor and Virginica plants. The performance on the balanced sets is discussed more fully after the presentation of the analysis results for the unbalanced data sets.

[Insert Table 3]
A total of ten unbalanced data sets were constructed to test the ability of the Bayes modification of |W| to separate the disparate sized clusters. Five unbalanced data sets were created by splitting the Versicolor plants into five groups of 10 and combining each sub-group with all 50 Virginica plants. Five more unbalanced sets were produced in the same way by splitting the Virginica plants into five groups of 10 and combining each sub-group with all 50 Versicolor plants.

[Insert Table 4]

The results are summarized in Table 4. Except for the two data sets discussed below, the increased sensitivity of the Bayes criterion (10) as compared with
the standard |W| criterion is clearly demonstrated. Since the Versicolor and Virginica overlap slightly, the |W| criterion tends to split the combined 10 plants of one type and the 50 of the other type into two equal sized groups. This is to be expected on theoretical grounds, since |W| is essentially the Bayes criterion for the situation of Section 2.2 with the added assumption that the mixing parameters are known a priori, each to be G^{−1}; compare (10) with (7) and the related discussion of the trace criterion, from which the appropriate Bayes modification can be seen. That is, the Bayes interpretation of |W| is essentially that each observation is equally likely from either of the two groups. Allowing the mixing parameters to be unknown provides the terms involving the group sizes, in particular the Γ(n_g) factors, as part of the criterion to be optimized, and hence the sensitivity to unequal group sizes.
The results in Table 4 with the Bayes criterion for two data sets, the 3rd 10 Versicolor with 50 Virginica and the 4th 10 Versicolor with 50 Virginica, may suggest that the tendency toward unequal group sizes is too strong. The criterion in each case produces groups of about 10 and 50 observations. However, the smaller group of observations corresponds to Virginica plants rather than the Versicolor ones, as can be seen from the number of mis-classifications for these two unbalanced data sets. A glance at Figure 1 of the 3rd 10 Versicolor and 50 Virginica plants reveals a group of about 10 Virginica measurements about as equally prominent as the 10 Versicolor ones. It is this group of 10 or 11 Virginica observations that the Bayes criterion groups as one cluster. A search around the allocation of the 10 Versicolor (3rd or 4th set) to one group and the 50 Virginica to the other group by the Bayes criterion (10) yields another local minimum. But as can be seen by comparing the criterion values for the "solution start" row with the minimum in the "random start" row, the separation of the 10 or so Virginica plants is preferred to that which splits off the 10 Versicolor plants.

[Insert Figure 1]
This feature of the Virginica measurements brings to the front the fundamental question of judging the significance of cluster solutions. Is the separation of the clusters statistically significant or is it a random feature of the data set at hand? Clearly the upper 10 or 11 Virginica are competing with the 10 Versicolor as the smaller group to be split off. As the 10 Versicolor are moved away from the 50 Virginica, there is a distance (a separation in the means for the two groups of about two standard deviations in each of the four variables works) for which the Bayes criterion (10) correctly groups the 10 Versicolor. At this distance |W| still divides the data set into two equal groups, but as the separation is increased more, |W| also makes the correct division. When is the separation statistically significant? Little is available on this question; see for example Engleman and Hartigan (1969) and more recently Lee (1977). The problems of judging the significance between one and more groups, or more generally between G = G_0 and G = G_1, are difficult. Wolfe (1971) also addresses this problem.
4. Comparison with the Wolfe-Day Approach
Wolfe (1967, 1969) and Day (1969) approach the problem of estimating the mixture component origin of the observations y_i in a rather indirect manner. Estimation of θ is addressed first. The likelihood maximized by Wolfe and Day can be written as

$$L(Y \mid \theta) = \prod_{i=1}^{n} \Big[ \sum_{g=1}^{G} \pi_g \, N_p(y_i \mid \mu_g, \Sigma_g) \Big] = \sum_{Z} L(Y \mid \theta, Z), \qquad (15)$$

a marginal of (1), since the summation of L(Y|θ,Z) is over all G^n allocations of the n y_i to the G components. The notation N_p(y_i|μ_g, Σ_g) denotes the p-variate normal density evaluated at y_i given μ_g and Σ_g. By maximizing (15) the maximum likelihood (ML) estimates of the π_g's, μ_g's, and Σ_g's are obtained. Given these estimates of the mixture parameters, the approach then replaces the parameters by their ML estimates and assigns each observation to the mixture component with the largest estimated density height, scaled by its estimated proportion of representation in the mixture. That is, for g = 1, ..., G, π_g is replaced by π̂_g, μ_g by μ̂_g, and Σ_g by Σ̂_g. Then for i = 1, ..., n, y_i is assigned to the component for which π̂_g N_p(y_i|μ̂_g, Σ̂_g) is the largest, g = 1, ..., G. Notice that this procedure does not include the variability in the ML estimates of the parameters. However, the Bayes solution averages the likelihood over these nuisance parameters, thereby incorporating their variability into the estimation of the optimum allocation of the observations.
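Wolfe's program is not available here, but the workflow it implements, maximizing (15) and then assigning by largest weighted density height, can be sketched with a small EM loop for the homogeneous-covariance case, started from a hard allocation such as a Bayes criterion (10) solution. This is an illustrative reconstruction of my own (the function name and iteration count are assumptions), not Wolfe's actual routine:

```python
import numpy as np

def em_mixture(Y, z_init, G, n_iter=100):
    """EM for the homogeneous-covariance normal mixture likelihood (15),
    initialized from a hard allocation. Returns the estimated weights,
    means, common covariance, and the assignment of each observation to
    the component with the largest weighted density height."""
    n, p = Y.shape
    # Initial parameter estimates from the hard allocation.
    pi = np.array([np.mean(z_init == g) for g in range(G)])
    mu = np.array([Y[z_init == g].mean(axis=0) for g in range(G)])
    Sigma = sum((Y[z_init == g] - mu[g]).T @ (Y[z_init == g] - mu[g])
                for g in range(G)) / n
    for _ in range(n_iter):
        # E-step: log of pi_g * N_p(y_i | mu_g, Sigma), up to a constant.
        Sinv = np.linalg.inv(Sigma)
        _, logdet = np.linalg.slogdet(Sigma)
        logr = np.empty((n, G))
        for g in range(G):
            dev = Y - mu[g]
            quad = np.einsum('ij,jk,ik->i', dev, Sinv, dev)
            logr[:, g] = np.log(pi[g]) - 0.5 * (logdet + quad)
        r = np.exp(logr - logr.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)  # responsibilities
        # M-step: weighted proportions, means, and pooled covariance.
        ng = r.sum(axis=0)
        pi = ng / n
        mu = (r.T @ Y) / ng[:, None]
        Sigma = sum((r[:, g, None] * (Y - mu[g])).T @ (Y - mu[g])
                    for g in range(G)) / n
    return pi, mu, Sigma, logr.argmax(axis=1)
```

Passing a clustering solution as `z_init` is the initialization strategy suggested by the comparison in this section.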
[Insert Table 5]

The performance of the Wolfe-Day approach is illustrated in Table 5. Using a program supplied by Wolfe, the same ten unbalanced data sets presented in Table 4 and in Section 3 were analyzed. The mixture of multivariate normals was assumed to be homogeneous in the covariance matrices, i.e., corresponding to the situation in Section 2.2. Two sets of results were obtained with Wolfe's program: one using the initial estimates for the parameters (π_g's, μ_g's, and Σ) provided by the program, and another using the best Bayes criterion (10) solution to estimate the proportions, means, and common covariance matrix. Comparing
first the Bayes criterion solutions with the solutions provided entirely by Wolfe's program, the Bayes criterion does quite well. Wolfe's solutions also have a tendency toward equal sized clusters, as did the |W| criterion. The solutions by Wolfe's program with initial estimates of the parameters based on the best solution with the Bayes criterion were very much like the Bayes criterion (10) solutions. This is not surprising since, as Day (1969) points out, relative maxima are frequently encountered. However, the fact that the likelihood for these solutions was larger in all cases is noteworthy. We also note that in two cases the increase in the likelihood was at the expense of a few more iterations, but in about one-half of the cases the increased likelihood was complemented by a marked reduction in iterations.

These results suggest that the Bayes criteria might provide useful initial estimates for ML estimation routines for mixtures of multivariate normals like Wolfe's. The use of a clustering criterion to generate initial estimates for ML routines is not new; in fact Wolfe uses MacQueen's (1965) clustering procedure together with a hierarchical grouping by Ward (1967) to obtain initial estimates for his ML equations. However, the Bayes criteria are derived for three specific situations with a mix of multivariate normals. The demonstration of improved performance by the Bayes criterion in Table 5 for these preliminary comparisons is then not unexpected.
5. Further Discussion
5.1 Estimation of Parameters in Mixtures of Normals
Marriott (1975) has pointed out that the approach of Wolfe and Day provides estimates of the parameters θ which have asymptotically desirable properties. It is also pointed out that estimates of θ based on the partition Ẑ that minimizes |W| are inconsistent. This also applies to the Bayes modification. For example, the distance between the means will be over-estimated and the common variance for each measurement will be under-estimated. This is due to the overlap of the mixture components and the truncation process involved in the optimization to find Ẑ. We note, however, that in spite of these obvious shortcomings, the estimates of θ based on Ẑ still provide useful initial estimates for MLE routines, as pointed out in Section 4 and as shown by the results of Table 5.
More importantly, we note that in determining the optimum allocation from the Bayes approach and from the Wolfe-Day approach, there is a difference in the primary intent of each approach. The clustering procedures provide an allocation of each observation to one of several groups. The approach of Wolfe-Day is primarily to obtain the maximum likelihood estimates of the parameters in the mixture. The component origin estimation is a subsequent consideration, which proceeds as if the ML estimates of the parameters in the mixture are in fact the true values of θ. The Bayes approach described here focuses on the estimation of the component origin of each observation, averaging over the uncertainty involving the mixture parameters to determine the optimum component origin of each observation. This is a difference in the estimation philosophy of the two approaches. The Bayes approach to estimating the parameters of the mixture would involve normalization of the product of the likelihood (15) and a prior, p(θ), for the mixture parameters to determine the posterior distribution of θ. With vague forms of the prior, such as those employed in Section 2, the Bayes mode of the posterior distribution of θ and the maximum likelihood estimates of θ would tend to be the same with large samples.
5.2 Practical Computational Aspects
The practical difficulties with determining either the ML estimate of θ
or the optimizing allocation Ẑ are formidable. Both are plagued by relative
minima. With small samples one should feel more confident with the search
routines for Ẑ. Several approximate algorithms are available to search for
the optimum partition into several groups. For example, see Forgy (1965),
MacQueen (1965), Friedman and Rubin (1969), and McRae (1971). These seem to
work well in practice, as illustrated by the results in Table 5 using McRae's
routine, but they provide no assurance that the optimum has been reached.
Relative minima present difficulties for such routines; see Friedman and
Rubin (1969), p. 1165 and the results in Tables 3, 4 and 5. Feasible search
routines guaranteeing an optimum over the G^n partitions are for the most part
not available. See, for example, the difficulties with only two groups and the
simple tr(W) criterion examined by Scott and Symons (1971a). Not only do the
number of groups and the sample size enter, but also the number of dimensions.
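Search routines of the Forgy-MacQueen-McRae type can be sketched as a simple exchange pass: move one unit at a time to whichever group lowers the criterion, and stop when a full pass produces no improvement. The sketch below uses the tr(W) criterion; it illustrates the style of such routines, not McRae's actual MIKCA program, and like those routines it can stall in a relative minimum.

```python
def trace_w(data, labels, G):
    # tr(W): pooled within-groups sum of squared deviations about group means.
    total = 0.0
    for g in range(G):
        pts = [x for x, lab in zip(data, labels) if lab == g]
        if not pts:
            continue
        for j in range(len(pts[0])):
            mean = sum(x[j] for x in pts) / len(pts)
            total += sum((x[j] - mean) ** 2 for x in pts)
    return total

def exchange_search(data, labels, G):
    # Repeatedly try moving each unit to another group, keeping any move
    # that lowers the criterion, until a full pass makes no change.
    labels = list(labels)
    best = trace_w(data, labels, G)
    improved = True
    while improved:
        improved = False
        for i in range(len(data)):
            for g in range(G):
                if g == labels[i]:
                    continue
                old = labels[i]
                labels[i] = g
                val = trace_w(data, labels, G)
                if val < best:
                    best = val
                    improved = True
                else:
                    labels[i] = old
    return labels, best
```

Starting the routine from several random allocations and keeping the best result is the usual guard against relative minima, though it offers no guarantee.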
The maximum likelihood approaches face convergence problems and can produce
unreliable results with small samples. For example, with the equal sized sets of
10 plants in the lower half of Table 3, solutions comparable to those presented
from the clustering criteria were produced for the fourth and fifth sets of ten.
For the first set of 10, a 19-one split was found as compared with a 10-10 split
for the clustering criteria. No solution was produced after 100 iterations for
the second and third sets. However, with larger samples, the maximum likelihood
approaches perform much better. This is fortunate, since the search routines
become questionable as the number of partitions possible increases exponentially
with the sample size. Consequently, to estimate the optimum allocation, one
can recommend the search routines with small samples, and the maximum likelihood
approach of Wolfe-Day for large samples. Initial estimates for the ML routines
could be obtained utilizing the search routines over the whole data set, or
over subsets if the sample size is very large.
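The exponential growth mentioned above is easy to quantify: with G labeled groups there are G^n possible allocations of n units, and the Stirling numbers of the second kind count the partitions into a given number of nonempty unlabeled groups. A short sketch (our illustration, not part of the paper):

```python
from math import comb, factorial

def stirling2(n, g):
    """Number of partitions of n units into exactly g nonempty
    (unlabeled) groups: the Stirling number of the second kind."""
    return sum((-1) ** j * comb(g, j) * (g - j) ** n
               for j in range(g + 1)) // factorial(g)

# Even with only two groups the count is hopeless for exhaustive search
# at the sample sizes of Tables 1 and 2:
print(stirling2(4, 2))    # 7
print(stirling2(20, 2))   # 524287, i.e. 2**19 - 1
print(stirling2(60, 2))   # 2**59 - 1, about 5.8e17
```

For n = 60 and two groups there are already more than 10^17 candidate partitions, which is why only approximate search routines are feasible.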
Geisser, S. and Cornfield, J. (1963). Posterior distributions for multivariate normal parameters. Journal of the Royal Statistical Society, Series B 25, 368-76.
Hartley, H. O. and Rao, J. N. K. (1968). Classification and estimation in analysis of variance problems. Review of the International Statistical Institute 36, 141-7.
Jeffreys, H. (1961). Theory of Probability. Clarendon, Oxford.
Johnson, N. L. and Kotz, S. (1972). Distributions in Statistics: Continuous Multivariate Distributions. Wiley, New York.
Kendall, M. G. and Buckland, W. R. (1958). A Dictionary of Statistical Terms. Hafner, New York.
Kendall, M. G. and Stuart, A. (1966). The Advanced Theory of Statistics, Vol. 3. Hafner, New York.
Lee, K. L. (1977). Multivariate tests for clusters. Journal of the American Statistical Association. Accepted for publication.
Lindley, D. V. (1965). Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 2: Inference. Cambridge University Press.
MacQueen, J. (1965). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 281-297.
McRae, D. J. (1971). MIKCA: a Fortran IV iterative K-means cluster analysis program. Behavioral Science 16, 423-4.
Rao, C. R. (1952). Advanced Statistical Methods in Biometric Research. Wiley, New York.
Scott, A. J. and Symons, M. J. (1971a). Clustering methods based on likelihood ratio criteria. Biometrics 27, 387-97.
Scott, A. J. and Symons, M. J. (1971b). Clustering methods based on likelihood ratio criteria. Biometrics 27, 387-97.
Ward, J. H., Jr. (1967). PERSUB Reference Manual: PRL-TR-67-3(II). Personnel Research Laboratory, Lackland AFB, Texas.
Wolfe, J. H. (1967). NORMIX: computational methods for estimating the parameters of multivariate normal mixtures of distributions. Research Memo. SRM 68-2. U. S. Naval Personnel Research Activity, San Diego, California.
Wolfe, J. H. (1969). Pattern clustering by multivariate mixture analysis. Research Memo. SRM 69-17. U. S. Naval Personnel Research Activity, San Diego, California.
Wolfe, J. H. (1971). A Monte Carlo study of the sampling distribution of the likelihood ratio for mixtures of multinormal distributions. Technical Bulletin STB 72-2. U. S. Naval Personnel Research Activity, San Diego, California.
Acknowledgement

The author wishes to acknowledge the assistance of Kerry L. Lee, a
graduate student in Biostatistics at the time the initial results of this
research were obtained. The helpful comments of the referees and thorough
review by an associate editor are gratefully acknowledged. This included
the notational convention of this paper and pointing out that an earlier
version of the criterion for the situation with unequal and unknown covariance
matrices had not included all the terms from normalizing each of the G
requisite Wishart distributions. More recently, the excellent typing and
patience of Pauline Kopec and the expert programming assistance of Thomas
Langenderfer are also appreciated.
LIST OF FIGURES

FIGURE 1: Scatter Diagram of Third 10 Iris Versicolor (*) and 50 Iris Virginica (0)
LIST OF TABLES

TABLE 1: Contribution of the Bayes Modification Portion of Cluster Criterion (6) by Component and Total for Three Selections of Number of Variables and Two Sample Sizes with Three Selected Allocations of the Number of Observations to Two Groups

TABLE 2: Contribution of the Bayes Modification Portion of Cluster Criterion (6) by Component and Total for Three Selections of Number of Variables and Two Sample Sizes with Three Selected Allocations of the Number of Observations to Four Groups

TABLE 3: Comparison of Cluster Solutions with the Determinant of the Within Groups Sum of Squares and Bayes Modification (10) for Various Balanced Sets of Setosa, Versicolor and Virginica Plant Measurements

TABLE 4: Comparison of Cluster Solutions with the Determinant of the Within Groups Sum of Squares and Bayes Modification (10) for Ten Unbalanced Data Sets Composed of Versicolor and Virginica Observations

TABLE 5: Comparison of Cluster Solutions with Wolfe's Maximum Likelihood Approach and Bayes Criterion (9) for Ten Unbalanced Data Sets Composed of Versicolor and Virginica Observations
TABLE 1

Contribution of the Bayes Modification Portion of Cluster Criterion (6) by Component and Total for Three Selections of Number of Variables and Two Sample Sizes with Three Selected Allocations of the Number of Observations to Two Groups

Number of   Allocation        Component from      Component from          Total Contribution of the
Variables   n_g (g=1,2): n    averaging over      averaging over the      Bayes Modification Portion:
P                             the means:          mixing proportions:     Σ[P ln(n_g) − 2 ln Γ(n_g)]
                              P Σ ln(n_g)         −2 Σ ln Γ(n_g)

2           1, 19: 20           5.89              -72.79                  -66.90
            6, 14: 20           8.86              -54.68                  -45.82
            10, 10: 20          9.21              -51.21                  -42.00
4           1, 19: 20          11.78              -72.79                  -61.01
            6, 14: 20          17.72              -54.68                  -36.96
            10, 10: 20         18.42              -51.21                  -32.79
8           1, 19: 20          23.56              -72.79                  -49.24
            6, 14: 20          35.45              -54.68                  -19.23
            10, 10: 20         36.84              -51.21                  -14.37
2           5, 55: 60          11.23              -335.00                 -323.76
            20, 40: 60         13.37              -291.94                 -278.57
            30, 30: 60         13.61              -285.03                 -271.42
4           5, 55: 60          22.47              -335.00                 -312.53
            20, 40: 60         26.74              -291.94                 -265.20
            30, 30: 60         27.21              -285.03                 -257.82
8           5, 55: 60          44.93              -335.00                 -290.06
            20, 40: 60         53.48              -291.94                 -238.47
            30, 30: 60         54.42              -285.03                 -230.61
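The column formulas of Table 1 can be reproduced directly with the log-gamma function. A minimal sketch (the function name is ours, purely for checking the entries):

```python
from math import log, lgamma

def bayes_modification(alloc, p):
    """Bayes modification portion of criterion (6) for an allocation
    (n_1, ..., n_G) of units to G groups with p measurements per unit:
    the means component p * sum ln(n_g), the mixing-proportions
    component -2 * sum ln Gamma(n_g), and their total."""
    means_part = p * sum(log(n) for n in alloc)
    mixing_part = -2.0 * sum(lgamma(n) for n in alloc)
    return means_part, mixing_part, means_part + mixing_part

# The (10, 10: 20), P = 2 row of Table 1: about (9.21, -51.21, -42.00).
print(bayes_modification((10, 10), 2))
```

The same call with four-group allocations such as (1, 1, 1, 17) reproduces the rows of Table 2.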
TABLE 2

Contribution of the Bayes Modification Portion of Cluster Criterion (6) by Component and Total for Three Selections of Number of Variables and Two Sample Sizes with Three Selected Allocations of the Number of Observations to Four Groups

Number of   Allocation           Component from      Component from          Total Contribution of the
Variables   n_g (g=1,...,4): n   averaging over      averaging over the      Bayes Modification Portion:
P                                the means:          mixing proportions:     Σ[P ln(n_g) − 2 ln Γ(n_g)]
                                 P Σ ln(n_g)         −2 Σ ln Γ(n_g)

2           1, 1, 1, 17: 20        5.67              -61.34                  -55.68
            3, 3, 7, 7: 20        12.18              -29.09                  -16.91
            5, 5, 5, 5: 20        12.88              -25.42                  -12.55
4           1, 1, 1, 17: 20       11.33              -61.34                  -50.01
            3, 3, 7, 7: 20        24.36              -29.09                   -4.73
            5, 5, 5, 5: 20        25.75              -25.42                    0.33
8           1, 1, 1, 17: 20       22.67              -61.34                  -38.68
            3, 3, 7, 7: 20        48.71              -29.09                   19.62
            5, 5, 5, 5: 20        51.50              -25.42                   26.08
2           3, 3, 3, 51: 60       14.46              -301.11                 -286.67
            10, 10, 20, 20: 60    21.19              -208.57                 -187.37
            15, 15, 15, 15: 60    21.66              -201.53                 -179.87
4           3, 3, 3, 51: 60       28.91              -301.11                 -272.20
            10, 10, 20, 20: 60    42.39              -208.57                 -166.18
            15, 15, 15, 15: 60    43.33              -201.53                 -158.20
8           3, 3, 3, 51: 60       57.82              -301.11                 -243.29
            10, 10, 20, 20: 60    84.77              -208.57                 -123.79
            15, 15, 15, 15: 60    86.66              -201.53                 -114.87
TABLE 3

Comparison of Cluster Solutions with the Determinant of the Within Groups Sum of Squares and Bayes Modification (10) for Various Balanced Sets of Setosa, Versicolor and Virginica Plant Measurements

Criterion°           All 50 Versicolor  1st 30 Versicolor  Last 30 Versicolor  1st 20 Versicolor  Last 20 Versicolor
                     & All 50 Setosa    & 1st 30 Setosa    & Last 30 Setosa    & 1st 20 Setosa    & Last 20 Setosa

Determinant Criterion
  random start:      50,50; 0: 1190.48*  30,30; 0: 127.87   30,30; 0: 143.26    20,20; 0: 21.52    20,20; 0: 20.55
  solution start:    50,50; 0: 1190.48   30,30; 0: 127.87   30,30; 0: 143.26    20,20; 0: 21.52    20,20; 0: 20.55
  solution value:    50,50; 0: 1190.48   30,30; 0: 127.87   30,30; 0: 143.26    20,20; 0: 21.52    20,20; 0: 20.55

Bayes Modification (10)
  random start:      50,50; 0: 147.08    30,30; 0: 23.54    30,30; 0: 30.13     20,20; 0: -16.77   20,20; 0: -18.52
  solution start:    50,50; 0: 147.08    30,30; 0: 23.54    30,30; 0: 30.13     20,20; 0: -16.77   20,20; 0: -18.52
  solution value:    50,50; 0: 147.08    30,30; 0: 23.54    30,30; 0: 30.13     20,20; 0: -16.77   20,20; 0: -18.52

Criterion°           1st 10 Versicolor  2nd 10 Versicolor  3rd 10 Versicolor   4th 10 Versicolor  5th 10 Versicolor
                     & 1st 10 Virginica & 2nd 10 Virginica & 3rd 10 Virginica  & 4th 10 Virginica & 5th 10 Virginica

Determinant Criterion
  random start:      10,10; 0: 2.36                         9,11; 1: 13.28      7,13; 5: 0.87      5,15; 5: 10.68
                                                                                4,16; 8: 1.19
  solution start:    10,10; 0: 2.36     10,10; 0: 13.68     7,13; 5: 0.87       10,10; 0: 14.19    10,10; 0: 0.49
  solution value:    10,10; 0: 2.36     10,10; 0: 13.68     10,10; 0: 2.60      10,10; 0: 14.19    10,10; 0: 0.49

Bayes Modification (10)
  random start:      5,15; 9: -5.13     1,19; 9: 2.74       4,16; 8: -39.62     3,17; 7: 3.51      10,10; 0: -45.68
                     10,10; 0: -17.33   9,11; 3: 19.98                          6,14; 8: 4.19
  solution start:    10,10; 0: -17.33   10,10; 0: 14.30     4,16; 8: -39.62     10,10; 0: 14.96    10,10; 0: -45.68
  solution value:    10,10; 0: -17.33   10,10; 0: 14.30     10,10; 0: -15.60    10,10; 0: 14.96    10,10; 0: -45.68

* The table entries are: the first two numbers are the cluster sizes; the third is the number of misclassifications; the criterion value is shown after the colon for each solution.
° The criteria are each to be minimized. The "random start" solutions are from the search for the allocation minimizing the specified criterion as provided by McRae's program (1971). The "solution start" provides the allocation specified in the column heading as the preliminary allocation for the exchange portion of the routine; see p. 9 of text. The "solution value" gives the criterion value at the allocation specified in the column heading.
TABLE 4

Comparison of Cluster Solutions with the Determinant of the Within Groups Sum of Squares and Bayes Modification (10) for Ten Unbalanced Data Sets Composed of Versicolor and Virginica Observations

Criterion°           1st 10 Versicolor  2nd 10 Versicolor  3rd 10 Versicolor   4th 10 Versicolor  5th 10 Versicolor
                     & All 50 Virginica & All 50 Virginica & All 50 Virginica  & All 50 Virginica & All 50 Virginica

Determinant Criterion
  random start:      26,34;16: 963.37*  26,34;16: 1404.25  30,30;20: 884.08    30,30;20: 1112.46  27,33;23: 1011.40
                     30,30;20: 1153.04  29,31;19: 1417.73  26,34;16: 1128.96   24,36;14: 1049.83  29,31;19: 1086.22
  solution start:    10,50; 0: 1228.70  10,50; 0: 1441.27  10,50; 0: 1242.35   10,50; 0: 1428.75  10,50; 0: 1069.49
  solution value:    10,50; 0: 1228.70  10,50; 0: 1441.27  10,50; 0: 1242.35   10,50; 0: 1428.75  10,50; 0: 1069.49

Bayes Modification (10)
  random start:      10,50; 0: 122.72   8,52; 2: 129.22    10,50;20: 115.82    11,49;21: 119.34   10,50; 0: 114.67
                     26,34;16: 139.54                      30,30;20: 135.68    8,52; 2: 119.96    12,48;22: 122.28
                                                           23,37;15: 147.58
  solution start:    10,50; 0: 122.72   10,50; 0: 131.97   10,50; 0: 125.41    10,50; 0: 131.47   10,50; 0: 114.67
  solution value:    10,50; 0: 122.72   10,50; 0: 131.97   10,50; 0: 125.41    10,50; 0: 131.47   10,50; 0: 114.67

Criterion°           All 50 Versicolor  All 50 Versicolor  All 50 Versicolor   All 50 Versicolor  All 50 Versicolor
                     & 1st 10 Virginica & 2nd 10 Virginica & 3rd 10 Virginica  & 4th 10 Virginica & 5th 10 Virginica

Determinant Criterion
  random start:      13,47; 3: 254.42   26,34;22: 392.60   28,32;22: 216.32    15,45; 5: 368.57   24,36;20: 206.86
                     29,31;29: 330.66   23,37;21: 399.94   16,44; 6: 220.89    7,53; 5: 426.53    10,50; 2: 221.14
  solution start:    10,50; 0: 323.28   10,50; 0: 459.40   10,50; 0: 315.00    10,50; 0: 484.94   10,50; 0: 233.31
  solution value:    10,50; 0: 323.28   10,50; 0: 459.40   10,50; 0: 315.00    10,50; 0: 484.94   10,50; 0: 233.31

Bayes Modification (10)
  random start:      13,47; 3: 41.04    8,52; 4: 57.25     6,54; 4: 41.53      6,54; 6: 44.68     9,51; 1: 22.24
                     29,31;25: 85.96    28,32;22: 53.75
  solution start:    10,50; 0: 45.28    10,50; 0: 65.65    10,50; 0: 43.77     10,50; 0: 68.80    10,50; 0: 26.36
  solution value:    10,50; 0: 45.28    10,50; 0: 65.65    10,50; 0: 43.77     10,50; 0: 68.80    10,50; 0: 26.36

* The table entries are: the first two numbers are the cluster sizes; the third is the number of misclassifications; the criterion value is shown after the colon for each solution.
° The criteria are each to be minimized. The "random start" solutions are from the search for the allocation minimizing the specified criterion as provided by McRae's program (1971). The "solution start" provides the allocation specified in the column heading as the preliminary allocation for the exchange portion of the routine; see p. 9 of text. The "solution value" gives the criterion value at the allocation specified in the column heading.
TABLE 5

Comparison of Cluster Solutions with Wolfe's Maximum Likelihood Approach and Bayes Criterion (9) for Ten Unbalanced Data Sets Composed of Versicolor and Virginica Observations

Technique°            1st 10 Versicolor  2nd 10 Versicolor  3rd 10 Versicolor   4th 10 Versicolor  5th 10 Versicolor
                      & All 50 Virginica & All 50 Virginica & All 50 Virginica  & All 50 Virginica & All 50 Virginica

Bayes Criterion (10)  10,50; 0: 122.72*  8,52; 2: 129.22    10,50;20: 115.82    11,49;21: 119.34   10,50; 0: 114.67

Wolfe's Maximum       26,34;16: 127.22   9,51; 1: 127.83    22,38;14: 132.05    8,52; 2: 132.13    19,41;29: 126.99
Likelihood            18 iterations      8 iterations       15 iterations       8 iterations       12 iterations

Wolfe's Maximum       10,50; 0: 131.57   9,51; 1: 127.83    11,49;21: 136.65    11,49;21: 133.91   10,50; 0: 135.34
Likelihood with       6 iterations       8 iterations       21 iterations       8 iterations       4 iterations
Initial Estimates
from the Bayes
Criterion (10)
solution

Technique°            All 50 Versicolor  All 50 Versicolor  All 50 Versicolor   All 50 Versicolor  All 50 Versicolor
                      & 1st 10 Virginica & 2nd 10 Virginica & 3rd 10 Virginica  & 4th 10 Virginica & 5th 10 Virginica

Bayes Criterion (10)  13,47; 3: 41.04    8,52; 4: 57.25     6,54; 4: 41.53      6,54; 6: 44.68     9,51; 1: 22.24

Wolfe's Maximum       27,33;29: 161.68   8,52; 2: 163.36    30,30;28: 171.01    21,39;21: 162.01   23,37;23: 172.38
Likelihood            10 iterations      16 iterations      21 iterations       11 iterations      9 iterations

Wolfe's Maximum       13,47; 3: 174.07   8,52; 4: 164.60    9,51; 3: 179.13     6,54; 6: 170.00    9,51; 1: 182.71
Likelihood with       4 iterations       21 iterations      18 iterations       4 iterations       5 iterations
Initial Estimates
from the Bayes
Criterion (10)
solution

* The table entries are: the first two numbers are the cluster sizes; the third is the number of misclassifications. If more than one minimum was found with the Bayes Criterion, the criterion value is shown after the colon for each solution. The likelihood value with Wolfe's ML approach is shown to compare the results using different initial estimates of the parameters.
° The Bayes criterion is to be minimized and Wolfe's ML procedure is to maximize.