
Computational Statistics & Data Analysis 41 (2003) 429–440
www.elsevier.com/locate/csda

Mixture model clustering for mixed data with missing information

Lynette Hunt*, Murray Jorgensen
Department of Statistics, University of Waikato, Hamilton, New Zealand

Received 1 March 2002

Abstract

One difficulty with classification studies is unobserved or missing observations that often occur in multivariate datasets. The mixture likelihood approach to clustering has been well developed and is much used, particularly for mixtures where the component distributions are multivariate normal. It is shown that this approach can be extended to analyse data with mixed categorical and continuous attributes and where some of the data are missing at random in the sense of Little and Rubin (Statistical Analysis with Missing Data, Wiley, New York).
© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Clustering; Mixed data; Missing at random

1. Introduction

Missing observations are frequently seen in multivariate data sets. For example, the specimen may be damaged and thus not all attributes can be measured, or an inexpensive and easily administered test may be administered to all items in the sample whilst the more expensive test may only be administered to a random sub-sample of the items. In such situations, the data matrix will be incomplete, with not all attributes being observed for all items. These missing values can be regarded as accidental missing values.

Review papers in the literature on partially missing data include those by Afifi and Elashoff (1966), Hartley and Hocking (1971), Orchard and Woodbury (1972), and

* Corresponding author. Fax: +64-7-838-4155.
E-mail address: [email protected] (L. Hunt).
URL: http://www.stats.waikato.ac.nz/Staff/index.html

0167-9473/03/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-9473(02)00190-1


Dempster et al. (1977), and monographs on partially missing data by Little and Rubin (1987), and Schafer (1997). The approaches appropriate for handling such data in classification studies are restricted due to the reluctance of the investigator to make assumptions about the data (Gordon, 1999) and the lack of a formal model for cluster analysis. Given the objective of clustering the data, we need to implement some technique when the data to be clustered are incomplete.

Gordon (1999, p. 26) notes that Gower's (1971) general (dis)similarity coefficient can be used as one strategy to cope with missing variables, by assuming that the contribution that would have been provided by the incompletely recorded variable to the proximity between the two items is equal to the weighted mean of the contributions provided by the variables for which complete information is available.

Data are described as 'missing at random' when the probability that a variable is missing for a particular individual may depend on the values of the observed variables for that individual, but not on the value of the missing variable. That is, the distribution of the missing data mechanism does not depend on the missing values. For example, censored data are certainly not missing at random.

Rubin (1976) showed that the process that causes the missing data can be ignored when making likelihood-based inferences about the parameter of the data if the data are 'missing at random' and the parameter of the missing data process is 'distinct' from the parameter of the data. When the data are missing in this manner, the appropriate likelihood is simply the density of the observed data, regarded as a function of the parameters. 'Missing at random' is a central concept in the work of Little and Rubin (1987).

The EM algorithm of Dempster et al. (1977) is a general iterative procedure for maximum likelihood estimation in incomplete data problems. Their general model includes both the conceptual missing data formulation used in finite mixture models and the accidental missing data discussed earlier. Many authors, for example McLachlan and Krishnan (1997), have discussed the EM algorithm and its properties.

Little and Schluchter (1985) present maximum likelihood procedures using the EM algorithm for the general location model with missing data. They note that their model reduces to that of Day (1969) for K-component multivariate normal mixtures when there is one K-level categorical variable that is completely missing. Little and Rubin (1987) and Schafer (1997) point out that the parametric mixture models lend themselves well to implementing incomplete data methods. We implement their approach to produce explicit methodology that enables the clustering of mixed (categorical/continuous) data using a mixture likelihood approach when data are missing at random. We illustrate this approach by clustering Byar's prostate cancer data. It is shown that the proposed methodology can detect meaningful structure in mixed data when there is a fairly extreme amount of missing information.

2. The mixture approach to clustering data

Suppose that p attributes are measured on n individuals. Let x_1, ..., x_n be the observed values of a random sample from a mixture of K underlying populations in unknown proportions π_1, ..., π_K. Let the density of x_i in the kth group be f_k(x_i; θ_k), where θ_k is the parameter vector for group k, and let Φ = (π′, θ′)′, where π = (π_1, ..., π_K)′ and θ = (θ_1, ..., θ_K)′. The density of x_i can be written as

    f(x_i; Φ) = Σ_{k=1}^{K} π_k f_k(x_i; θ_k),

where Σ_{k=1}^{K} π_k = 1 and π_k ≥ 0 for k = 1, ..., K.

The EM algorithm of Dempster et al. (1977) is applied to the finite mixture model by viewing the data as incomplete. In the case of mixtures of distributions, the 'missing' data are the unobserved indicators of group membership. Let the vector of indicator variables, z_i = (z_i1, ..., z_iK)′, be defined by

    z_ik = 1 if individual i ∈ group k,  and  z_ik = 0 if individual i ∉ group k,

where z_i, i = 1, ..., n, are independently and identically distributed according to a multinomial distribution generated by a single trial of an experiment with K mutually exclusive outcomes having probabilities π_1, ..., π_K.

Let Φ̂ denote the maximum likelihood estimate of Φ. Then each observation x_i can be allocated to group k on the basis of the estimated posterior probabilities. The estimated posterior probability that observation x_i belongs to group k is given by

    ẑ_ik = pr(individual i ∈ group k | x_i; Φ̂)
         = π̂_k f_k(x_i; θ̂_k) / Σ_{k′=1}^{K} π̂_k′ f_k′(x_i; θ̂_k′)

for k = 1, ..., K; and x_i is assigned to group k if

    ẑ_ik > ẑ_ik′  for k′ = 1, ..., K, k′ ≠ k.

Finite mixture models are frequently fitted where the component densities f_k(x; θ_k) are taken to be multivariate normal; i.e., x_i ∼ N_p(μ_k, Σ_k) if observation i belongs to group k. This model has been studied by Titterington et al. (1985), and by McLachlan and Basford (1988). Further details on the maximum likelihood estimates of the components of Φ can be found in McLachlan and Peel (2000, p. 82).
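As a concrete illustration of the posterior-probability formula above, the following Python sketch (hypothetical code, not from the paper; the function names and parameter values are illustrative) evaluates ẑ_ik for a two-component univariate normal mixture and assigns an observation to the group with the largest posterior probability.

```python
# Hypothetical illustration of the E-step posterior probabilities
# z_ik = pi_k f_k(x_i) / sum_k' pi_k' f_k'(x_i) for a univariate normal mixture.
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_probs(x, pis, mus, sigmas):
    """Estimated posterior probability that observation x belongs to each group."""
    weighted = [pi * normal_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas)]
    total = sum(weighted)
    return [w / total for w in weighted]

# An observation near the first component is decisively assigned to it.
z = posterior_probs(0.2, pis=[0.6, 0.4], mus=[0.0, 5.0], sigmas=[1.0, 1.0])
assignment = max(range(len(z)), key=lambda k: z[k])
```

The posteriors sum to one by construction, and the assignment rule is exactly the ẑ_ik > ẑ_ik′ comparison of the text.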

The latent class model described, for example, by Everitt (1984), is a finite mixture model for data where each of the p attributes is discrete. Suppose that the jth attribute can take on levels 1, ..., M_j and let λ_kjm be the probability that, for individuals from group k, the jth attribute has level m. Then, conditional on individual i belonging to group k,

    f_k(x_i; θ_k) = Π_{j=1}^{p} λ_{kj,x_ij}.

In other words, within each group the distributions of the p attributes are independent. This property has been termed local independence.
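The local-independence product above can be sketched in a few lines of Python (hypothetical code; the probability table `lam` is made up for illustration):

```python
# Hypothetical sketch of the latent-class within-group density under local
# independence: f_k(x_i) = prod_j lam[k][j][x_ij], where lam[k][j][m] is the
# probability that attribute j takes level m in group k.
from math import prod

def latent_class_density(x, lam, k):
    """Within-group density of a discrete observation x for group k."""
    return prod(lam[k][j][x[j]] for j in range(len(x)))

# Two groups, two binary attributes (levels coded 0 and 1).
lam = [
    [[0.9, 0.1], [0.8, 0.2]],  # group 0: attribute-level probabilities
    [[0.3, 0.7], [0.4, 0.6]],  # group 1
]
f0 = latent_class_density([0, 0], lam, k=0)  # 0.9 * 0.8 = 0.72
```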

2.1. Multimix

Jorgensen and Hunt (1996) and Hunt and Jorgensen (1999) proposed a general class of mixture models to include data having both continuous and categorical attributes.


This model, which they dubbed the 'Multimix' model, was conceived of initially as a joint generalization of both latent class models and mixtures of multivariate normal distributions. They suggested an approach based on a form of local independence by partitioning the observational vector x_i such that

    x_i = (x_i1 | ... | x_il | ... | x_iL)′,

where the attributes within partition cell x_il are independent of the attributes in partition cell x_il′, for l ≠ l′, within each of the K sub-populations. Thus if individual i belongs to group k, we can write

    f_k(x_i) = Π_{l=1}^{L} f_kl(x_il).

In this paper, we restrict ourselves to the following distributions suggested for the partition cells:

Discrete distribution: x_il is a one-dimensional discrete attribute taking values 1, ..., M_l with probabilities λ_kl1, ..., λ_klM_l. We will denote this distribution by D(λ_kl1, ..., λ_klM_l).

Multivariate normal distribution: x_il is a p_l-dimensional vector with a N_{p_l}(μ_kl, Σ_kl) distribution if individual i is in group k.

See Hunt and Jorgensen (1999) for the maximum likelihood estimates for the components of Φ. This approach included the latent class model and mixtures of multivariate normal distributions as special cases.
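A minimal sketch of the Multimix within-group density (hypothetical code, not the authors' implementation) combines one univariate normal cell with one discrete cell, multiplying the cell densities as in f_k(x_i) = Π_l f_kl(x_il):

```python
# Hypothetical sketch: Multimix within-group density as a product over
# independent partition cells, here one normal cell and one discrete cell
# D(lam_1, ..., lam_M). All parameter values are made up for illustration.
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def multimix_density(x_cont, x_disc, mu, sigma, lam):
    """f_k(x_i) = f_normal(x_cont) * lam[x_disc] under local independence of cells."""
    return normal_pdf(x_cont, mu, sigma) * lam[x_disc]

f = multimix_density(x_cont=1.0, x_disc=1, mu=1.0, sigma=2.0, lam=[0.25, 0.75])
```

With more cells the product simply acquires more factors, one per partition cell.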

2.2. Graphical models

A revealing alternative way of looking at these multivariate models is within the framework of graphical models described by Lauritzen and Wermuth (1989). In this framework graphs are associated with models. The graph of a model contains a vertex corresponding to each variable in the model. Edges are assigned such that the absence of a connected path between two vertices corresponds to independence of the corresponding variables. If no path exists between two vertices after a set of vertices has been removed, then this means the variables represented by the two vertices are independent conditionally on knowing the values of the variables corresponding to the removed vertices. Latent class models for p variables are represented by a graph on p+1 vertices corresponding to the p variables plus one categorical variable indicating the cluster. Each of the p variables is joined by a single edge to the cluster variable, and these are the only edges in the graph.

A clique in a graph is a maximal set of vertices such that an edge connects each pair of vertices in the set. No independence assumptions are made about variables in a clique. The graph of a Multimix model may be described as follows: corresponding to each cell in the partition of attributes there is a clique of vertices, each clique forms a complete graph, none of these graphs are directly connected, but all vertices in each are joined to an additional vertex that represents a categorical latent variable giving the cluster assignment of each observation. As a special case, locally independent multivariate mixture models may also be described in the language of graphical models


by a graph in which the edges connect the latent cluster variable to each of the p manifest variables. If all variables in this special case are discrete we have a latent class model. Edwards (1995) provides a gentle introduction to the concepts of graphical modelling.

Multimix models with only continuous variables are mixtures of multivariate normals in which the covariance matrices are each block-diagonal with the same block pattern. Banfield and Raftery (1993) consider other kinds of restrictions to covariance matrices in mixtures of multivariate normals, with possible limitations on volume, orientation and shape of the component distributions.

2.3. Missing data

Little and Rubin (1987, Chapter 3) review several 'quick' methods for coping with missing data in multivariate statistical analyses. Essentially, they consider
(1) 'complete-case' methods, which discard observations in which any variable is missing;
(2) 'available-case' methods based on pairwise sample covariances using all observations in which both variables are observed;
(3) methods based on filling-in or 'imputing' the missing values.
They conclude '... it is hard to recommend any of the simple methods discussed since (1) their performance is unreliable; (2) they often require ad hoc adjustments to yield satisfactory estimates, and (3) it is not easy to distinguish situations when the methods work from when they fail'. Little and Rubin (1987) go on to develop methods for handling missing data based on the EM algorithm. Essentially their methods are of two kinds: those for which the missing data mechanism is ignorable, the data being missing at random, and those for which a model must be specified describing the mechanism by which the data come to be missing.
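The contrast between the first two 'quick' methods can be made concrete with a small numpy example (hypothetical code and data, for illustration only): the complete-case mean discards any row containing a missing value, while the available-case mean uses whatever observations each variable has.

```python
# Hypothetical illustration of 'complete-case' versus 'available-case'
# estimation on a tiny data matrix with one missing entry (NaN).
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [6.0, 30.0]])

# Complete-case: drop every row with any missing value, then average.
complete = X[~np.isnan(X).any(axis=1)]      # keeps rows 0 and 2 only
complete_case_mean = complete.mean(axis=0)  # [3.5, 20.0]

# Available-case: each column averages over its own observed entries.
available_case_mean = np.nanmean(X, axis=0)  # [3.0, 20.0]
```

Even in this toy case the two estimates of the first variable's mean differ, because complete-case deletion also throws away the observed value 2.0.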

In this paper, we put forward a method for mixture model clustering based on the assumption that the data are missing at random and hence the missing data mechanism is ignorable. It is natural to ask whether we can do without the 'missing at random' assumption and work with non-ignorable missing data mechanisms. However, the effective use of non-ignorable models requires knowledge of the missing data mechanism that will quite often be lacking in an exploratory clustering situation. Molenberghs et al. (1999) discuss the use of non-ignorable models in the context of longitudinal categorical data. They note that such models cannot be tested by the data and advocate using a range of models in a sensitivity analysis, while employing as much context-derived information as possible. Because our interest in this paper is directed towards the general clustering problem, we confine ourselves to methods that are technically valid when the missing data mechanism is ignorable.

We now present a form of Multimix suitable for multivariate data sets with missing data. This model reduces to that given by Hunt and Jorgensen (1999) when all the data are observed.

Suppose we write the observation vector x_i in the form (x_obs,i, x_miss,i), where x_obs,i and x_miss,i, respectively, denote the observed and missing attributes for observation i. This is a formal notation only and does not imply that the data are rearranged to achieve this pattern. In fitting the mixture model, there are now two types of missing data that have to be considered: one is the conceptual 'missing' data, the unobserved indicator of group membership, and the other is the unintended or accidental missing data values. However, these unintended missing values can also be of two different types. They may be continuous and belong to a multivariate normal partition cell, or a categorical variable involved in a partition cell with a discrete distribution.

The E step of the EM algorithm requires the calculation of Q(Φ; Φ^(t)) = E{L_C(Φ) | x_obs, Φ^(t)}, the expectation of the complete data log-likelihood conditional on the observed data and the current value of the parameters. We calculate Q(Φ; Φ^(t)) by replacing z_ik with

    ẑ_ik = z_ik^(t) = E(z_ik | x_obs,i, Φ^(t))
         = π_k f_k(x_obs,i; θ_k^(t)) / Σ_{k′=1}^{K} π_k′ f_k′(x_obs,i; θ_k′^(t)).

That is, z_ik is replaced by ẑ_ik, the estimate of the posterior probability that individual i belongs to group k.

The remaining calculations in the E step require the calculation of the expected value of the complete data sufficient statistics for each partition cell l, conditional on the observed data and the current values of the parameters for that partition cell.

For each discrete partition cell l and each value m_l of x_il, the E step calculates

    E(z_ik δ_ilm | x_obs,i; θ_k^(t)) = ẑ_ik δ_ilm                                        if x_il is observed,
                                     = ẑ_ik E(δ_ilm | x_obs,i; θ_k^(t)) = ẑ_ik λ_klm^(t)  if x_il is missing,

where we have defined an indicator variable

    δ_ilm = 1 if x_il = m,  and  δ_ilm = 0 otherwise.

Let

    δ̂_ilm = δ_ilm       if x_il is observed,
          = λ_klm^(t)   if x_il is missing.

Then this expectation can be written in the form

    E(z_ik δ_ilm | x_obs,i; θ_k^(t)) = ẑ_ik δ̂_ilm

for k = 1, ..., K, each categorical x_il and each value m_l of x_il.

For multivariate normal partition cells, depending on the attributes observed for individual i in the cell, these expectations may require the use of the sweep operator described originally by Beaton (1964). The version of sweep we use is the one defined by Dempster (1969), also described in Little and Rubin (1987, pp. 112–119). Little and Rubin (1987) and Schafer (1997) demonstrate the usefulness of sweep in maximum likelihood estimation for multivariate missing-data problems. Hunt (1996) implemented this approach with mixtures of multivariate normal distributions. The approach is adapted in the following manner:

Suppose we form the augmented covariance matrix A_l using the current estimates of the parameters for group k in cell l, where

    A_l = | -1        μ_k1       μ_k2      ...   μ_kp_l     |
          | μ_k1      σ_k11      σ_k12     ...   σ_k1p_l    |
          | μ_k2      σ_k21      ...       ...   σ_k2p_l    |
          | ...       ...        ...       ...   ...        |
          | μ_kp_l    σ_kp_l1    ...       ...   σ_kp_lp_l  |

and the rows and columns of A_l are indexed from 0 to p_l. Then sweeping on row and column 1 corresponds to sweeping on x_i1, and sweeping on row and column j corresponds to sweeping on x_ij. Sweeping on the elements of A_l corresponding to the observed x_ij in cell l yields the conditional distribution (regression) of the missing x_ij′ on the observed x_ij in the cell.
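The sweep step above can be sketched in numpy. This is a hypothetical illustration using the standard Beaton/Dempster definition of sweep, not the authors' code; the bivariate mean and covariance are made up. Sweeping the augmented matrix on the observed attribute leaves the regression of the missing attribute on the observed one in the corresponding column, with the residual variance on the diagonal.

```python
# Hypothetical sketch of the sweep operator applied to the augmented
# covariance matrix of a bivariate normal cell with mean (1, 2) and
# covariance [[2, 1], [1, 3]]; row/column 0 holds -1 and the means.
import numpy as np

def sweep(A, k):
    """Return a copy of the symmetric matrix A swept on row and column k."""
    A = A.astype(float)
    d = A[k, k]
    out = A - np.outer(A[:, k], A[k, :]) / d  # a_ij - a_ik * a_kj / a_kk
    out[:, k] = A[:, k] / d                   # swept column
    out[k, :] = A[k, :] / d                   # swept row
    out[k, k] = -1.0 / d                      # swept pivot
    return out

A = np.array([[-1.0, 1.0, 2.0],
              [ 1.0, 2.0, 1.0],
              [ 2.0, 1.0, 3.0]])

B = sweep(A, 1)  # sweep on the observed attribute x1
intercept, slope, resid_var = B[0, 2], B[1, 2], B[2, 2]
# E(x2 | x1) = intercept + slope * x1, with Var(x2 | x1) = resid_var
```

These values agree with the usual conditional-normal formulae: E(x2 | x1) = μ2 + (σ21/σ11)(x1 − μ1) and Var(x2 | x1) = σ22 − σ21²/σ11.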

The remaining calculations in the E step for multivariate normal partition cells are as follows:

    E(z_ik x_ij | x_obs,i; θ_k^(t)) = ẑ_ik x_ij                         if x_ij is observed,
                                    = ẑ_ik E(x_ij | x_obs,i; θ_k^(t))   if x_ij is missing;

    E(z_ik x_ij² | x_obs,i; θ_k^(t)) = E(z_ik | x_obs,i; θ_k^(t)) E(x_ij² | x_obs,i; θ_k^(t))
                                     = ẑ_ik x_ij²                                                          if x_ij is observed,
                                     = ẑ_ik [(E(x_ij | x_obs,i; θ_k^(t)))² + Var(x_ij | x_obs,i; θ_k^(t))]  if x_ij is missing.

For j ≠ j′,

    E(z_ik x_ij x_ij′ | x_obs,i; θ_k^(t))
        = ẑ_ik x_ij x_ij′                                  if x_ij and x_ij′ are observed,
        = ẑ_ik x_ij E(x_ij′ | x_obs,i; θ_k^(t))            if x_ij is observed and x_ij′ is missing,
        = ẑ_ik E(x_ij | x_obs,i; θ_k^(t)) x_ij′            if x_ij is missing and x_ij′ is observed,
        = ẑ_ik [E(x_ij | x_obs,i; θ_k^(t)) E(x_ij′ | x_obs,i; θ_k^(t)) + Cov(x_ij, x_ij′ | x_obs,i; θ_k^(t))]  if x_ij and x_ij′ are missing,

for i = 1, ..., n; k = 1, ..., K; x_ij ∈ x_il, where x_il is a multivariate normal partition cell.

It can be seen from the above expectations that when there is only one factor x_ij missing, the missing x_ij are replaced by the conditional mean of x_ij, given the set of values x_obs,i observed for that individual in that cell and the current estimates of the parameters for the cell. However, for the conditional expectations used in the calculation of the covariance matrix, i.e. E(z_ik x_ij² | x_obs,i; θ_k^(t)) and E(z_ik x_ij x_ij′ | x_obs,i; θ_k^(t)), then, respectively, if x_ij is missing, or if x_ij and x_ij′ are missing in that cell, the conditional mean of x_ij is adjusted by the conditional covariances as shown above. These conditional means and the non-zero conditional covariances are found by using the sweep operator on the augmented covariance matrix that has been created using the current estimates of the parameters for that particular multivariate normal partition cell. The augmented covariance matrix is swept on the observed attributes x_obs,i in cell l such that these attributes are the predictors in the regression equation and the remaining attributes are the outcome variables for that cell.

In the M step of the algorithm, the new parameter estimates Φ^(t+1) are estimated from the complete data sufficient statistics.

Mixing proportions:

    π_k^(t+1) = (1/n) Σ_{i=1}^{n} z_ik^(t)   for k = 1, ..., K.

Discrete distribution parameters:

    λ_klm^(t+1) = (1/(n π̂_k)) Σ_{i=1}^{n} ẑ_ik δ̂_ilm   for k = 1, ..., K; m = 1, ..., M_l,

where l indexes a discrete partition cell x_l.

Multivariate normal parameters:

    μ_kj^(t+1) = (1/(n π̂_k)) E( Σ_{i=1}^{n} z_ik^(t) x_ij | x_obs,i; θ_k^(t) ),

    σ_kjj′^(t+1) = (1/(n π̂_k)) E( Σ_{i=1}^{n} z_ik^(t) x_ij x_ij′ | x_obs,i; θ_k^(t) ) − μ_kj^(t+1) μ_kj′^(t+1)

for k = 1, ..., K. Here j and j′ index the continuous attributes belonging to a multivariate normal cell x_l.

Let the conditional covariance between attributes j and j′ for individual i, given that individual i belongs in group k, be

    C_ki,jj′^(t) = 0                                    if x_ij or x_ij′ is observed,
                 = Cov(x_ij, x_ij′ | x_obs,i; θ_k^(t))  if x_ij and x_ij′ are missing,

and let the imputed value for attribute j of individual i, given the current value of the parameters and that the individual belongs in group k, be

    x̂_ij,k^(t) = x_ij                          if x_ij is observed,
              = E(x_ij | x_obs,i; θ_k^(t))    if x_ij is missing.


The parameter estimates for the mean and the variance or covariance terms can be written in the form

    μ_kj^(t+1) = (1/(n π̂_k)) Σ_{i=1}^{n} z_ik^(t) x̂_ij,k^(t),

    σ_kjj′^(t+1) = (1/(n π̂_k)) Σ_{i=1}^{n} z_ik^(t) [ (x̂_ij,k^(t) − μ_kj^(t+1)) (x̂_ij′,k^(t) − μ_kj′^(t+1)) + C_ki,jj′^(t) ]

for k = 1, ..., K. Here again j and j′ index the continuous attributes belonging to a multivariate normal cell x_l.
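The mean and covariance updates above can be sketched numerically. The following Python fragment is a hypothetical illustration (not the authors' code), with made-up posterior probabilities, imputed values, and conditional covariances for a single group and one pair of attributes; the third individual has both attributes imputed, so its cross-product term picks up the conditional covariance C.

```python
# Hypothetical numpy sketch of the M-step updates for one group k and two
# continuous attributes j, j' in the same normal cell: missing entries are
# replaced by imputed values xhat, and the cross-product is corrected by the
# conditional covariance C where both attributes were missing.
import numpy as np

z = np.array([1.0, 0.5, 0.5])           # posterior probabilities z_ik
xhat = np.array([[1.0, 2.0],            # i = 0: fully observed
                 [2.0, 3.0],            # i = 1: fully observed
                 [1.5, 2.5]])           # i = 2: both attributes imputed
C = np.array([0.0, 0.0, 0.4])           # Cov(x_ij, x_ij' | x_obs) where both missing

n_pi = z.sum()                          # estimate of n * pi_k
mu = (z[:, None] * xhat).sum(axis=0) / n_pi
dev = xhat - mu
sigma_jj = (z * (dev[:, 0] * dev[:, 1] + C)).sum() / n_pi
```

Dropping the C term would understate the covariance, since imputed values are conditional means and carry no residual variability of their own.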

3. Application

The approach will be illustrated by considering the clustering of cases on the basis of the pre-trial variables of the prostate cancer clinical trial data of Byar and Green (1980), reproduced in Andrews and Herzberg (1985, pp. 261–274). The data are available at http://lib.stat.cmu.edu/datasets/Andrews/T46.1. The data were obtained from a randomized clinical trial comparing four treatments for 506 patients with prostatic cancer. These patients had been grouped on clinical criteria into Stage 3 and Stage 4 of the disease. As reported by Byar and Green, Stage 3 represents local extension of the disease with no clinical evidence of distant metastasis, whilst Stage 4 represents distant metastasis as evidenced by acid phosphatase levels, X-rays, or both.

There are 12 pre-trial covariates measured on each patient; seven may be taken to be continuous, four to be discrete, and one variable (SG) is an index nearly all of whose values lie between 7 and 15, and which could be considered either discrete or continuous. We treat SG as a continuous variable. Two of the discrete covariates have two levels, one has four levels, and the fourth discrete covariate has seven levels. As detailed in Hunt and Jorgensen (1999), two variables, SZ and AP, have been transformed to make their distributions more symmetric.

Thirty-one individuals have at least one of the pre-trial covariates missing, giving a total of 62 missing values. As only approximately 1% of the data are missing, more missing observations were created, where the probability of an observation on an attribute being missing was taken independently of all other data values. Missing values generated in this manner are missing completely at random, and the missing data mechanism is ignorable for likelihood inferences (Little and Rubin, 1987; Schafer, 1997).

Missing values were created by assigning each attribute of each individual a random digit generated from the discrete {0, 1} distribution, where the probability of a zero was taken, respectively, as 0.10, 0.15, 0.20, 0.25 and 0.30. Attributes for an individual were recorded as missing when the assigned random digit was zero. This process was repeated 10 times for each of the probabilities chosen.
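The masking scheme described above amounts to independent Bernoulli missingness per entry. A minimal numpy sketch (hypothetical code, with a stand-in random matrix rather than the actual trial data) is:

```python
# Hypothetical sketch of the missingness mechanism: each attribute of each
# individual is masked independently with probability p, which is missing
# completely at random (MCAR).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 12))             # stand-in for the 506 x 12 data matrix
p = 0.30                                   # probability an entry is recorded missing

mask = rng.random(X.shape) < p             # True where the entry is masked
X_missing = np.where(mask, np.nan, X)
frac_missing = np.isnan(X_missing).mean()  # close to p on average
```

Because each entry is masked independently of all data values, the resulting mechanism is ignorable for likelihood inference, as the text notes.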

We report fully the results taken from one pattern of missing data where the probability of an observation on an attribute being missing was 0.30. This illustrates the approach on a fairly extreme case of the type of data that would be analysed using these methods. The data set reported in detail here had 1870 values recorded as missing. These missing values were such that only five individuals had all attributes observed. One individual had all 12 attributes recorded as missing and was deleted from further analysis.

The mixture method of clustering requires the specification of the number of underlying clusters to be fitted to the model. Determination of the appropriate number of underlying clusters is still an unresolved problem, and there does not appear to be a universally superior method of determining the cluster number (see, for example, Celeux and Soromenho (1996) and the references therein). In this paper, the problem of determining the group number is peripheral to the theory being presented, and we shall consider fitting two clusters to the model. This decision was based on having the clinical classification of the data into two groups, Stage 3 and Stage 4. We shall examine the extent to which the proposed techniques can rediscover these stages.

Hunt and Jorgensen (1999) report a complete case clustering of the 12 pre-trial covariates where individuals that had missing values in any of these covariates were omitted from further analysis, leaving 475 out of the original 506 individuals available. Hunt (1996) and Hunt and Jorgensen (1999) discuss a fitting strategy for incorporating local associations within the model. We report the results for the model where the three attributes WT, SBP and DBP are in one partition cell. This was the partitioning preferred by Hunt (1996).

We regard the data as a random sample from the distribution

    f(x; Φ) = Σ_{k=1}^{2} π_k Π_{l=1}^{10} f_kl(x_l; θ_kl),

where the component distributions f_kl(x_il; θ_kl) are N_3(μ_kl, Σ_kl) for the partition cell containing WT, SBP and DBP, N(μ_kl, σ_kl²) for the remaining continuous attributes, and D(λ_kl1, ..., λ_klM_l) for each of the four categorical attributes.

This model was fitted iteratively using the EM algorithm with an initial grouping based on the clinical classification. In the search for other maxima, the model was also fitted from a number of starting values generated by splitting the individuals into two groups both randomly and using various criteria. The first M step is then performed on the basis of these initial groupings. For discrete partition cells the initial estimates of the probabilities λ_klm, m = 1, ..., M_l, are calculated using the available data. For multivariate normal cells, the estimates of the means are calculated using the available data for that cell and in that group. The estimates for the variances and covariances are calculated in this first M step by replacing the missing values in the cell by the group mean for that cell and then calculating the estimates using the 'filled in' data set. The convergence criterion used was to cease iterating when the difference in the log-likelihoods at iteration t and iteration t − 10 was 10⁻¹⁰. Several local maxima were found, and the solution of the likelihood was taken to be the one corresponding to the largest of these. Each individual was assigned to the group to which it has the highest estimated posterior probability of belonging.


Table 1
Agreements and differences between the clinical and model classifications

    Clinical            Model classification
    classification      Group 1     Group 2

    Stage 3             265         26
    Stage 4             41          173

It can be seen from Table 1 that the clinical classification and the 'statistical diagnosis' are different for 67 individuals. Examination of the posterior probabilities showed that 19 of these individuals are decisively assigned to a different group than the one corresponding to the clinical classification, and nine have greater posterior probabilities lying between 0.5 and 0.6. Another comparison between the clinical classification and the model fit can be obtained by comparing the estimated parameters for the model with their counterparts using the clinical classification. Agreement was fairly close.

Hunt (1996) found in the complete case analysis that 40 of the 475 individuals were assigned to a different group than the one corresponding to the clinical classification. She found that survival status gave insight into the model classifications and the differences between the 'statistical diagnosis' and the clinical classifications. The model classification gave a better indication of prognosis, with patients in Group 1 having a higher probability of being alive or dying from other causes, whereas patients in Group 2 had more chance of dying from prostatic cancer.

4. Discussion

When clustering real multivariate data sets having large numbers of attributes, it is rare that all variables are either categorical or continuous, as some approaches based on finite mixture models require. The Multimix approach allows the clustering of mixed data containing both types of variables.

Missing values are also a problem in many classification studies. The lack of a formal model restricts the number of approaches that can cope with incomplete datasets. The finite mixture model lends itself well to coping with missing values. We have a well specified, yet flexible, model whose parameters can be estimated by maximum likelihood. As we have shown, the fitting method can be extended to cope with unintended missing data.

The approach implemented in this paper works extremely well for the mixed data set that had a very large amount of missing data. The model has performed well, detecting the structure known to exist in the data whilst simultaneously coping with an extreme amount of missing data. The parameter estimates for the clusters and the estimates of the missing attributes conditional on the group assignment are reasonable. However, as with all problems involving incomplete data, the mechanism that gives rise to the missing values does need careful investigation.

440 L. Hunt, M. Jorgensen / Computational Statistics & Data Analysis 41 (2003) 429–440

References

Afifi, A.A., Elashoff, R.M., 1966. Missing observations in multivariate statistics I: review of the literature. J. Amer. Statist. Assoc. 61, 595–604.
Andrews, D.F., Herzberg, A.M., 1985. Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer, New York.
Banfield, J.D., Raftery, A.E., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821.
Beaton, A.E., 1964. The use of special matrix operators in statistical calculus. Educational Testing Service Research Bulletin, RB-64-51.
Byar, D.P., Green, S.B., 1980. The choice of treatment for cancer patients based on covariate information: application to prostate cancer. Bull. Cancer 67, 477–490.
Celeux, G., Soromenho, G., 1996. An entropy criterion for assessing the number of clusters in a mixture model. J. Classification 13, 195–212.
Day, N.E., 1969. Estimating the components of a mixture of normal distributions. Biometrika 56, 463–474.
Dempster, A.P., 1969. Elements of Continuous Multivariate Analysis. Addison-Wesley, Reading, MA.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B 39, 1–38.
Edwards, D., 1995. Introduction to Graphical Modelling. Springer, New York.
Everitt, B.S., 1984. A note on parameter estimation for Lazarsfeld's latent class model using the EM algorithm. Multivariate Behavioral Res. 19, 79–89.
Gordon, A.D., 1999. Classification. Chapman & Hall/CRC Press, London.
Gower, J.C., 1971. A general coefficient of similarity and some of its properties. Biometrics 27, 857–874.
Hartley, H.O., Hocking, R.R., 1971. The analysis of incomplete data. Biometrics 27, 783–823.
Hunt, L.A., 1996. Clustering using finite mixture models. Ph.D. Thesis, Department of Statistics, University of Waikato, New Zealand.
Hunt, L.A., Jorgensen, M.A., 1999. Mixture model clustering using the Multimix program. Austral. and New Zealand J. Statist. 41, 153–171.
Jorgensen, M.A., Hunt, L.A., 1996. Mixture model clustering of data sets with categorical and continuous variables. In: Proceedings of the Conference on Information, Statistics and Induction in Science, Melbourne, 1996, pp. 375–384.
Lauritzen, S.L., Wermuth, N., 1989. Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Statist. 17, 31–57.
Little, R.J.A., Rubin, D.B., 1987. Statistical Analysis with Missing Data. Wiley, New York.
Little, R.J.A., Schluchter, M.D., 1985. Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika 72, 497–512.
McLachlan, G.J., Basford, K.E., 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. Wiley, New York.
McLachlan, G.J., Peel, D., 2000. Finite Mixture Models. Wiley, New York.
Molenberghs, G., Goetghebeur, E.J.T., Lipsitz, S.R., Kenward, M.G., 1999. Nonrandom missingness in categorical data: strengths and limitations. Amer. Statist. 53, 110–118.
Orchard, T., Woodbury, M.A., 1972. A missing information principle: theory and applications. In: Proceedings of the Sixth Berkeley Symposium, Vol. 1, pp. 697–715.
Rubin, D.B., 1976. Inference and missing data. Biometrika 63, 581–592.
Schafer, J.L., 1997. Analysis of Incomplete Multivariate Data. Chapman & Hall, London.
Titterington, D.M., Smith, A.F.M., Makov, U.E., 1985. Statistical Analysis of Finite Mixture Distributions. Wiley, New York.

