
Cluster Validity Analysis Using Subsampling*

Osman Abul†‡   Anthony Lo†   Reda Alhajj†   Faruk Polat‡   Ken Barker†

†Dept. of Computer Science
University of Calgary
Calgary, Alberta, Canada
{abul, chiul, alhajj, barker}@cpsc.ucalgary.ca

‡Dept. of Computer Engineering
Middle East Technical University
Ankara, Turkey
[email protected]

Abstract - Cluster validity investigates whether generated clusters are true clusters or due to chance. This is usually done based on subsampling stability analysis. Related to this problem is estimating the true number of clusters in a given dataset. There are a number of methods described in the literature to handle both purposes. In this paper, we propose three methods for estimating confidence in the validity of a clustering result. The first method validates the clustering result by employing supervised classifiers. The dataset is divided into training and test sets, and the accuracy of the classifier is evaluated on the test set. This method computes confidence in the generalization capability of the clustering. The second method is based on the fact that if a clustering is valid, then each of its subsets should be valid as well. The third method is similar to the second method; it takes the dual approach, i.e., each cluster is expected to be stable and compact. Confidence is estimated by repeating the process a number of times on subsamples. Experimental results illustrate the effectiveness of the proposed methods.

1 Introduction

The word "clustering" (unsupervised classification) refers to methods of grouping objects based on some similarity measure between them. Clustering algorithms can be classified into four classes, namely Partitional, Hierarchical, Density-based and Grid-based [8]. Each of these classes has subclasses and different corresponding approaches, e.g., conceptual, fuzzy, self-organizing maps, etc. The clustering task can be divided into the following five steps (the last two are optional) [9]: 1) Pattern representation; 2) Pattern proximity measure definition; 3) Clustering; 4) Data abstraction; and 5) Cluster validity analysis.

In this paper, we only consider the last step, which somehow measures the effectiveness of the other steps. For a given dataset, the produced clustering depends on the parameters of the applied clustering algorithm. It is usually the case that different algorithms, and even the same algorithm with distinct parameters, generate different clustering results. Cluster validity analysis refers to how to assess the confidence in the resulting clusters.

*0-7803-7952-7/03/$17.00 © 2003 IEEE.

For datasets with few dimensions, the clustering result can be visualized, and hence clusters can be validated by human experts. But this becomes nearly impossible for large dimensions, and hence other automatic methods are needed. The main criteria for the evaluation of clustering results are [8]: compactness (i.e., members of each cluster should be close to each other) and separation (i.e., the clusters should be widely spaced). Based on these criteria, a number of indices have been proposed for evaluating clusters and selecting the best possible number of clusters. In most cases, assessing validity turns into determining the best parameters for a clustering algorithm. Confidence estimation is addressed in relatively few research papers, where confidence is given in terms of the proportion of cases clustering together. Our motivation for the work described in this paper is estimating confidence in each cluster, i.e., not addressing specific cases. For this purpose, we propose three meta-methods from this perspective for the cluster validity problem. To the best of our knowledge, these methods are novel, and test results demonstrate their effectiveness. The methods are all based on subsampling of the dataset. They are general and can be used for evaluating clustering results generated by a wide range of existing clustering algorithms.

The first method starts by producing a clustering using a given clustering algorithm, with the number of clusters specified. Second, it randomly samples from the labeled clusters. Third, it builds a supervised classifier on the selected subset, and the induced classifier evaluates the non-selected portion. The random subsampling and evaluation steps are repeated many times. Finally, the overall accuracy gives the stability of the clustering. These steps are repeated for all possible numbers of clusters, for comparison to clustering results generated by different clustering algorithms. Instead of random subsampling, 10-fold cross-validation can also be used.


The second method is based on subset selection of the original clusters. First of all, clusters are found by employing a given clustering algorithm. For each subset of these clusters, an algorithm that estimates the true number of clusters is used. The argument here is that, if the given clustering is stable, then we expect the number of clusters estimated for each subset to be the same as the cardinality of labels of the selected subset. The confidence is computed as the proportion of correct estimations. It may be the case that the clustering result contains a large number of clusters (say 20 clusters). In this case, trying all subsets becomes computationally intractable, so we resort to subset sampling instead. If the validity of clustering results generated by randomized algorithms like k-means is the concern, all the steps should be repeated for averaging, for both the first and the second methods.

The third method uses the idea that if a clustering is stable, further clustering of the cases in every cluster will reveal one cluster. For each of the clusters, an estimator algorithm is run and is expected to report that there is only one cluster. The whole step is repeated several times with dataset subsampling, i.e., a bootstrapping approach is employed for confidence estimation. Confidence is computed similarly to the second method.

The rest of the paper is organized as follows. In Section 2, some background and recent work on cluster validity are given. Section 3 presents our three methods for cluster validity analysis. Experimental results are presented in Section 4. Section 5 gives the conclusions.

2 Cluster Validity and Stability

There are basically three methods for assessment of validity: internal, external and relative [9, 8, 7]. Internal indices measure how well the clustering result reflects the structure inherent in the dataset. Here, only inherent features of the dataset are used for the measurement, i.e., no external information is consulted. Usually, between and within sum-of-squares matrices are used as inherent features. There are a number of indices available, including silhouette, gap and gapPC [7]. These indices also define how to select the best number of clusters.
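As a brief illustration of an internal index, the following sketch (ours, not from the paper) computes the silhouette index of a k-means clustering for several cluster counts using scikit-learn; the dataset and parameter choices are arbitrary and only for demonstration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# Silhouette internal index for several cluster counts; higher values
# indicate more compact, better-separated clusters.
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```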

In external assessment of validity, there is a known a priori structure; an external index is computed using this structure and the generated structure. These indices define a measure of the degree of match between the two structures. The indices are usually defined on contingency tables of the two partitions. Entry n_ij (row i and column j) of this table is the number of patterns that belong to cluster i in the a priori partition and cluster j in the generated partition. Indices on contingency tables include Jaccard, Rand and FM. The FM measure is used in the Clest algorithm; it is given below [7].

$$ FM = \frac{Z - n}{\sqrt{\left(\sum_{i=1}^{R} n_{i\cdot}^{2} - n\right)\left(\sum_{j=1}^{C} n_{\cdot j}^{2} - n\right)}} \qquad (1) $$

where $n = \sum_{i=1}^{R}\sum_{j=1}^{C} n_{ij}$, $Z = \sum_{i=1}^{R}\sum_{j=1}^{C} n_{ij}^{2}$, $n_{i\cdot} = \sum_{j=1}^{C} n_{ij}$ and $n_{\cdot j} = \sum_{i=1}^{R} n_{ij}$; R and C represent the numbers of clusters in the a priori and the generated partitions, respectively.
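The FM index can be computed directly from two label vectors via their contingency table. The Python sketch below is our illustration of Eq. (1); the function name and the example labelings are ours.

```python
import numpy as np

def fm_index(labels_a, labels_b):
    """Fowlkes-Mallows external index between two partitions."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    n = labels_a.size
    # Build the R x C contingency table n_ij.
    rows = np.unique(labels_a)
    cols = np.unique(labels_b)
    table = np.array([[np.sum((labels_a == r) & (labels_b == c)) for c in cols]
                      for r in rows], dtype=float)
    Z = np.sum(table ** 2)
    P = np.sum(table.sum(axis=1) ** 2) - n
    Q = np.sum(table.sum(axis=0) ** 2) - n
    return (Z - n) / np.sqrt(P * Q)

# Example: two labelings of six patterns.
print(fm_index([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0]))
```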

Relative assessment compares two structures and measures their relative merit. The idea is to run the clustering algorithm for the possible number of parameters (e.g., for each possible number of clusters) and identify the clustering scheme that best fits the dataset. Recent work on cluster validity research concentrates on a kind of relative index called cluster stability [2, 3, 11, 12, 7, 13, 15]. Cluster stability exploits the fact that when multiple data sources are sampled from the same distribution, the clustering algorithms are expected to behave in the same way and produce similar structures.

In the work described in [13], supervised predictors are built on each clustered resampling of the original dataset, and their match with the original clustering labeling is used as a measure of stability or degree of match. They show that the selection of the supervised classification algorithm does make a difference, but the measured validity is still valid for other choices. They define an instability measure by taking a game-theoretic approach. The number of clusters minimizing this instability measure is used as the best cluster count in the dataset.

The work described in [2] presents an algorithm for estimating the true number of clusters. For each cluster count, the dataset is resampled twice and clustered using the same generic clustering algorithm. The similarity between these two clusterings is measured using either the Jaccard coefficient or the matching coefficient. The resampling and similarity computations are repeated many times for each number of clusters for confidence estimation. The averaged values are used as measures of stability of the clustering generated by the given clustering algorithm. Histograms and cumulative distributions are generated and plotted for selecting the best cluster count. The smallest stable cluster count is estimated as the correct number of clusters; the decision is obvious in the cumulative distribution diagram. They also give a measure for automating this process. The algorithm has a nice property: if there is no large gap between similarities across all cluster counts, the dataset is said not to tend toward clustering, i.e., the cluster count is 1.

Another resampling-based method is given in [12]. In their setting, the original dataset is clustered first, and a number of subsamples are gathered, each of which is clustered independently using the same clustering algorithm. A figure-of-merit measure (i.e., degree of match in the connectivity matrix) is defined between the original clustering and each of the subsampled clusterings. The figure of merit is computed for each possible parameter set. The plot of the figure-of-merit measure against parameter values is used for selecting the best parameters.


A Gaussian finite mixture based method for estimating the true number of clusters is described in [14]. The algorithm first divides the dataset into training and test subsets. Next, for each cluster count k, a model is fitted to the training set using the Expectation Maximization (EM) algorithm. Then, the resulting parameter set is evaluated on the test set. These steps are repeated many times and averaged. These averages are used for estimating the true number of clusters.
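A hedged sketch of this train/test idea (our illustration, not the exact algorithm of [14]): fit a Gaussian mixture by EM on random training splits for each candidate cluster count and average the held-out log-likelihood.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X = load_iris().data
rng = np.random.RandomState(0)

for k in range(1, 10):
    scores = []
    for _ in range(20):  # repeated random train/test splits
        train, test = train_test_split(X, test_size=0.5,
                                       random_state=rng.randint(1 << 30))
        gm = GaussianMixture(n_components=k, random_state=0).fit(train)
        scores.append(gm.score(test))  # mean test log-likelihood per sample
    print(k, round(np.mean(scores), 2))
```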

3 The Proposed Three Methods

For the methods discussed in this section, we denote the input dataset by T, having n patterns, each of them having p dimensions. So, T is effectively an n × p matrix.

The proposed algorithms can be used for different cluster counts and different clusterings, even ones generated by different clustering algorithms. We collect their confidence measures for all possible numbers of clusters. These data can be used for relative confidence estimation of clustering algorithms on the given dataset. Any clustering algorithm operating on numeric values (e.g., k-means, ORCLUS, PAM, CLARA) and having the cluster count as a parameter can be used in confidence estimation. For randomized algorithms like k-means, confidences should be averaged over several runs.

The ORCLUS algorithm is proposed for high-dimensional datasets. The idea behind the algorithm is finding (potentially) different arbitrarily projected subspaces for each of the clusters. It is an iterative algorithm and starts with an initial partition and the original axis system. In each iteration, first of all, patterns are assigned to a cluster based on their projected distance to the seeds of the current clustering. Then, the centroids of the clusters (seeds) are recomputed and new projected subspaces are computed for each of the clusters. Following this, closer seeds are merged to obtain a smaller number of clusters. Iteration continues until the user-specified number of clusters is found and the projected subspace dimensionality of each cluster reaches the user-specified minimum.

Contrary to feature selection methods, which select the dimensions with the larger eigenvalues, this algorithm selects the smaller-eigenvalue subspaces. The reason behind this is to reduce the variability in the projected subspace, i.e., to reduce the within-cluster distance. The algorithm is capable of detecting outliers and scales to very large databases; for details see [1].
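To illustrate the subspace choice (a minimal sketch of ours, not the ORCLUS algorithm itself), the eigenvectors of a cluster's covariance matrix with the smallest eigenvalues span the projection with the least within-cluster spread:

```python
import numpy as np

def least_spread_subspace(cluster_points, q):
    """Return an orthonormal basis (p x q) of the q directions of smallest
    variance for the given cluster, as ORCLUS-style methods use."""
    cov = np.cov(cluster_points, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, :q]                   # q smallest-eigenvalue directions

# Example: project a random cluster onto its 2 least-spread directions.
pts = np.random.RandomState(0).randn(100, 5)
basis = least_spread_subspace(pts, 2)
print((pts @ basis).shape)  # (100, 2)
```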

3.1 The First Method: Supervised learning based approach

This method validates the result of clustering with supervised classifiers. The rationale behind this method is that if the labels generated by the clustering algorithm are valid (i.e., clusters are well-separated), then they can be used by the classifier to classify clusters with high accuracy. So, this accuracy information can be used for comparing different clustering algorithms with the same input parameters. Additionally, repeated measurements of accuracies on the perturbed dataset can be used for estimating the validity of clustering algorithms. Doing so facilitates the measurement of confidence in cluster validity for multiple (not just two) clustering algorithms on the same basis. The classifier is trained on a perturbed version of the labeled patterns, and its accuracy is tested on the patterns not selected for training. For confidence estimation, the subsampling is repeated many times. The average accuracy is used as a measure of confidence in the validity of the clustering. The whole process is sketched next in Algorithm 3.1.

Algorithm 3.1 (Supervised learning based method)
Input: T = dataset, K = number of clusters, B = number of subsamplings

    f = 0.7
    L = Cluster(T, K)
    For b = 1 to B do
        Lb = subsample(L, f)
        Cb = Build_Classifier(Lb)
        Ab = Compute_Accuracy(Cb, L - Lb)
    end For
    A = (1/B) * Σ_{b=1}^{B} Ab
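A minimal Python sketch of Algorithm 3.1, under our own choices: k-means as the clustering algorithm and a Gaussian naive Bayes classifier as a stand-in for DLDA (which scikit-learn does not provide directly).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

def method1_confidence(T, K, B=10, f=0.7, seed=0):
    rng = np.random.RandomState(seed)
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(T)
    accs = []
    for _ in range(B):
        # Random subsample of the labeled patterns for training.
        train_idx = rng.rand(len(T)) < f
        clf = GaussianNB().fit(T[train_idx], labels[train_idx])
        # Evaluate the induced classifier on the non-selected portion.
        accs.append(clf.score(T[~train_idx], labels[~train_idx]))
    return np.mean(accs)

T = load_iris().data
for K in range(2, 10):
    print(K, round(method1_confidence(T, K), 2))
```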

Any clustering algorithm that partitions the patterns can be used to decide on L in Algorithm 3.1; the Diagonal Linear Discriminant Analysis (DLDA) algorithm [6] is used to decide on Cb. The authors of [6] tested several algorithms, and DLDA was found to be one of the best as far as their settings and datasets are concerned. It is also employed in the Clest algorithm, which is a cluster estimation/validation method using a discrimination analysis approach [7]. As noted in [13], a wide range of supervised classifiers can be used for verification. For these reasons, DLDA is employed in this work.

DLDA is based on the Maximum Likelihood (ML) approach. A classifier C classifies an instance x by using the class-conditional probabilities:

$$ C(x) = \arg\max_{k} P(x \mid y = k) \qquad (2) $$

For multivariate normal class density probabilities, i.e., $P(x \mid y = k) \sim N(\mu_k, \Sigma_k)$, the classifier becomes:

$$ C(x) = \arg\min_{k} \left\{ (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) + \log |\Sigma_k| \right\} \qquad (3) $$

The special case is obtained when the class densities have the same diagonal covariance matrix. In this case, the classification formula known as the DLDA discrimination rule is obtained as follows:

$$ C(x) = \arg\min_{k} \sum_{j=1}^{p} \frac{(x_j - \mu_{kj})^{2}}{\sigma_j^{2}} \qquad (4) $$
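A small numpy sketch of the DLDA rule in Eq. (4) (our illustration), assuming per-class means and a pooled diagonal variance estimated from labeled training data:

```python
import numpy as np

def dlda_fit(X, y):
    """Estimate per-class means and a pooled diagonal variance."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    var = np.mean([X[y == k].var(axis=0) for k in classes], axis=0)
    return classes, means, var

def dlda_predict(x, classes, means, var):
    """Apply the DLDA discrimination rule of Eq. (4)."""
    dists = np.sum((x - means) ** 2 / var, axis=1)
    return classes[np.argmin(dists)]

# Toy usage with two well-separated classes.
X = np.vstack([np.random.randn(20, 3), np.random.randn(20, 3) + 5])
y = np.array([0] * 20 + [1] * 20)
model = dlda_fit(X, y)
print(dlda_predict(np.array([5.0, 5.0, 5.0]), *model))  # -> 1
```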


3.2 The Second Method: Checking subsets of clusters

This method is designed for measuring the validity of a clustering using subsets of its clusters. The idea is that if the clustering is valid, then each of its subsets is expected to be valid as well. In this method, we aim to measure relative validity without referring to individual patterns. To test the validity of subsets, subsampling-based cluster count estimation algorithms can be used. This way, the confidence in validity is computed based on the stability of subsets of the original clustering. As in the previous method, multiple algorithms with the same parameters can be compared. The process is outlined next in Algorithm 3.2.

Algorithm 3.2 (Subsets of clusters based method)
Input: T = dataset, K = number of clusters, B = number of subset subsamplings, kmax = maximum number of clusters

    L = Cluster(T, K)
    For b = 1 to min(B, 2^K - 1) do
        Lb = patterns belonging to the b'th subset of the K cluster labels
        Kb = Estimate_ClusterCount(Lb, kmax)
        Ab = 1(Kb == number of labels in the b'th subset)
    end For
    A = (1 / min(B, 2^K - 1)) * Σ_b Ab
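A minimal Python sketch of Algorithm 3.2 (our illustration), using k-means for clustering, the subset selection probability of 0.5 described in the next paragraph, and a pluggable cluster-count estimator (Clest in this paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def method2_confidence(T, K, estimate_cluster_count, B=10, kmax=9, seed=0):
    rng = np.random.RandomState(seed)
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(T)
    hits = []
    for _ in range(B):
        # Pick each cluster label independently with probability 0.5.
        chosen = [k for k in range(K) if rng.rand() < 0.5]
        if not chosen:
            continue
        subset = T[np.isin(labels, chosen)]
        # A correct estimation recovers the number of selected labels.
        hits.append(estimate_cluster_count(subset, kmax) == len(chosen))
    return np.mean(hits)
```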

In each iteration of Algorithm 3.2, a subset of labels is selected randomly to form Lb. For K clusters, cluster i, 1 ≤ i ≤ K, is selected with probability a, i.e., uniform selection. We set a = 0.5 to make the expected size of the selected cluster label set K/2. A prediction-based resampling algorithm, namely Clest [7], is used as the Estimate_ClusterCount procedure in Algorithm 3.2. In fact, Clest is a method having several parameters, and different instantiations of the parameters result in different algorithms. For example, the actual clustering and classifier algorithms are generic. The algorithm is given next.

Algorithm 3.3 (Clest algorithm)
Input: T = dataset, B = number of runs, B0 = number of resamplings, kmax = maximum number of clusters

    T0 = T
    For k = 2 to kmax do
        For i = 0 to B0 do
            For b = 1 to B do
                Randomly split Ti into non-overlapping learning and test sets
                Apply clustering algorithm P to the learning set
                Build a classifier using the labeled learning set
                Apply the resulting classifier to the test set
                Apply the clustering algorithm to the test set
                s_{k,i,b} = FM external index comparing the two sets of labels
            end For
            t_{k,i} = median(s_{k,i,1}, ..., s_{k,i,B})
            T_{i+1} = randomly generated uniform dataset in the range of T
        end For
        t_k = (1/B0) * Σ_{i=1}^{B0} t_{k,i}
        p_k = #{ i | t_{k,i} >= t_{k,0}, i = 1..B0 } / B0
        d_k = t_{k,0} - t_k
    end For
    K = { k | 2 <= k <= kmax, p_k <= pmax, d_k >= dmin }
    K̂ = 1 if K is empty, argmax_{k in K} d_k otherwise

Clest is based on the idea of hypothesis testing. It tries to estimate the true number of clusters against the null hypothesis that there is no need for clustering, i.e., there is only one cluster. The strongest cluster count against the null hypothesis is used for the estimation. If there is no evidence above the threshold, then the null hypothesis is accepted. Since the clustering and classifier algorithms are generic, one should select concrete algorithms. In the original algorithm, the authors selected the Partitioning Around Medoids (PAM) algorithm for clustering and DLDA for classification.
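The decision step at the end of Algorithm 3.3 is small enough to spell out. The sketch below (ours) assumes the observed statistics t_{k,0} and the reference statistics t_{k,i}, i = 1..B0, have already been computed:

```python
import numpy as np

def clest_decide(t_obs, t_ref, pmax=0.05, dmin=0.05):
    """t_obs[k] = t_{k,0}; t_ref[k] = list of t_{k,i}, i = 1..B0.
    Returns the estimated number of clusters (1 under the null)."""
    candidates = {}
    for k, ref in t_ref.items():
        ref = np.asarray(ref, dtype=float)
        p_k = np.mean(ref >= t_obs[k])  # fraction of reference stats >= observed
        d_k = t_obs[k] - ref.mean()     # evidence against the null
        if p_k <= pmax and d_k >= dmin:
            candidates[k] = d_k
    if not candidates:
        return 1                        # accept the null: one cluster
    return max(candidates, key=candidates.get)

# Hypothetical statistics for k = 2..4.
print(clest_decide({2: 0.9, 3: 0.6, 4: 0.5},
                   {2: [0.50, 0.55, 0.60],
                    3: [0.55, 0.60, 0.58],
                    4: [0.52, 0.50, 0.49]}))  # -> 2
```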

3.3 The Third Method: Cluster tendency based approach

In this method, every cluster generated by the clustering algorithm using subsampling is evaluated against the null hypothesis that there is only one cluster. The motivation for this method is: if a clustering algorithm produces reasonable structures for every subsample, then every cluster is expected to be a compact structure, i.e., a structure not having any further tendency toward sub-clusters. The process is presented next in Algorithm 3.4.

Algorithm 3.4 (Cluster tendency based method)
Input: T = dataset, K = number of clusters, B = number of subsamplings, kmax = maximum number of clusters

    f = 0.7
    For b = 1 to B do
        Tb = subsample(T, f)
        Lb = Cluster(Tb, K)
        For each cluster c in Lb do
            Kb,c = Estimate_ClusterCount(c, kmax)
            Ab,c = 1(Kb,c == 1)
        end For
    end For
    A = (1/B) * Σ_{b=1}^{B} [ Σ_c Ab,c / (number of clusters in Lb) ]

The dataset is subsampled and clustered many times, and the cluster count within each cluster of each clustering result is estimated using Algorithm 3.3. For stable and compact clusters, high scores are expected. These scores can be used for comparing multiple clustering algorithms.
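A minimal Python sketch of Algorithm 3.4 (our illustration), again with k-means and a pluggable cluster-count estimator such as Clest:

```python
import numpy as np
from sklearn.cluster import KMeans

def method3_confidence(T, K, estimate_cluster_count, B=10, kmax=9, f=0.7, seed=0):
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(B):
        # Subsample the dataset and cluster the subsample.
        idx = rng.rand(len(T)) < f
        Tb = T[idx]
        labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(Tb)
        # Each cluster should show no further sub-cluster tendency.
        hits = [estimate_cluster_count(Tb[labels == c], kmax) == 1
                for c in range(K)]
        scores.append(np.mean(hits))
    return np.mean(scores)
```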

4 Experiments and Results

To test the methods proposed in this paper, we utilized the Iris and Ionosphere datasets from the UCI machine learning repository [4]. The former dataset has 4 continuous attributes; it contains 150 patterns and 3 classes, with 50 patterns per class. The latter dataset has 34 continuous attributes; it is a binary classification task and contains 351 patterns. Finally, neither dataset contains missing values.

For all of the three methods, we analyze the clustering results generated by two well-known algorithms: k-means and ORCLUS, with an initial seed count of 25 and a target dimension of 4. For both datasets, the analysis is done for cluster counts in the range 2 to 9. Parameter B is set to 10 for all three methods. In all of the experiments, the parameters of Algorithm 3.3 are set as follows: pmax = 0.05, dmin = 0.05, size of learning set = 2n/3, B = 20, B0 = 20 and kmax = 9. Since these two clustering algorithms contain random components, all the values are averaged over 5 runs.

Table 1: Results of first method for Iris dataset

# of clusters   k-means   Orclus
2               1.00      1.00
3               1.00      1.00
4               0.97      0.99
5               0.98      0.99
6               0.95      0.92
7               0.98      0.92
8               1.00      0.97
9               1.00      0.98

Table 2: Results of first method for Ionosphere dataset

# of clusters   k-means   Orclus
2               0.85      0.66
3               0.61      0.78
4               0.77      0.76
5               0.73      0.70
6               0.66      0.71
7               0.82      0.70
8               0.81      0.67
9               0.81      0.69

Table 3: Results of second method for Iris dataset

# of clusters   k-means (FM)   Orclus (FM)   k-means (Rand)   Orclus (Rand)
2               0.60           0.66          0.53             0.46
3               0.35           0.32          0.57             0.51
4               0.38           0.38          0.34             0.30
5               0.30           0.23          0.22             0.16
6               0.29           0.26          0.14             0.14
7               0.23           0.16          0.10             0.08
8               0.16           0.13          0.08             0.20
9               0.14           0.09          0.06             0.10

Results of the first method are given in Tables 1 and 2. The stability confidence for both clustering algorithms is very high, and the results are close to each other. For both datasets, k-means gives the maximal confidence at the true number of clusters. But ORCLUS fails to give maximal confidence at the true number of clusters for the Ionosphere dataset. The result with both datasets is that k-means clustering gives better confidence than ORCLUS, although they are very close. The overall results for the Iris dataset are better than those for the Ionosphere dataset; this is because it is a relatively simple task.

Table 4: Results of second method for Ionosphere dataset

# of clusters   k-means (FM)   Orclus (FM)   k-means (Rand)   Orclus (Rand)
2               0.26           0.40          0.50             0.53
3               0.37           0.51          0.42             0.28
4               0.42           0.42          0.42             0.20
5               0.45           0.40          0.20             0.23
6               0.20           0.23          0.11             0.20
7               0.15           0.13          0.08             0.15
8               0.13           0.10          0.08             0.13
9               0.10           0.10          0.08             0.06

Results of the second method for the two datasets are presented in Tables 3 and 4. The experiments are done using both the FM and the normalized Rand indices in Algorithm 3.3 [7]. For both datasets, the normalized Rand index seems to perform better than the FM index at the true number of clusters. It also gives the maximum results at the true number of clusters among the differing numbers of clusters. For the Iris dataset, the k-means results are better than the ORCLUS results at the true number of clusters, but the reverse is true for the Ionosphere dataset. We claim that this is the result of the ORCLUS algorithm's design for high dimensions. The results seem to gradually decrease with the increasing number of clusters. This is expected and reasonable, since we sample the same number of patterns regardless of the number of clusters, and for a small number of clusters these patterns are expected to be well-separated.

Table 5: Results of third method for Iris dataset

# of clusters   k-means (FM)   Orclus (FM)   k-means (Rand)   Orclus (Rand)
2               0.50           0.53          0.61             0.58
3               0.66           0.52          0.73             0.60
4               0.73           0.54          0.84             0.64
5               0.81           0.61          0.88             0.67
6               0.83           0.66          0.90             0.72
7               0.85           0.73          0.91             0.75
8               0.87           0.77          0.91             0.80
9               0.89           0.83          0.91             0.88

Table 6: Results of third method for Ionosphere dataset

# of clusters   k-means (FM)   Orclus (FM)   k-means (Rand)   Orclus (Rand)
2               0.15           0.36          0.76             0.74
3               0.35           0.33          0.75             0.59
4               0.33           0.51          0.66             0.55
5               0.54           0.46          0.74             0.46
6               0.61           0.50          0.74             0.50
7               0.71           0.73          0.67             0.48
8               0.81           0.75          0.71             0.51
9               0.86           0.81          0.75             0.57


Results of the third method are presented in Tables 5 and 6; the experiments are done using both the FM and the normalized Rand indices in Algorithm 3.3. These results show that k-means generates more compact clusters than ORCLUS. We expect increasing confidence with larger numbers of clusters, because a clustering is more likely to be compact when the number of patterns within each cluster is smaller. There is a notable exception to this with the Ionosphere dataset and the Rand index; in particular, the highest values are obtained at the true number of clusters. But this is not observed with the Iris dataset. Finally, the obtained results show that k-means generates more compact clusters than ORCLUS.

5 Summary and Conclusions

Three approaches for confidence estimation in cluster validity have been presented. In the first method, the clustering result is divided into training and test sets; then the accuracy of a classifier is evaluated on the test set. By repeatedly doing this, we estimate the stability of the clustering result. The second method is based on the idea of validity of the subsets of clusters. Again, by repeatedly doing subset selection and cluster count estimation, the validity of the original clustering is estimated. The third method is similar to the second method and uses the idea that valid clusters should not tend toward further sub-clustering.

All three methods are used for comparison of different clustering algorithms with the same input parameters. Instead of trying to find the best number of clusters in a given dataset, our target is to find a validity estimation of the result of clustering. These validity estimations across different algorithms are comparable, since the algorithms are run with the same input parameters. On the other hand, for a single clustering algorithm, the validity results for different input parameters are not directly comparable. We do not pursue this, because it is rather a different problem; it is the estimation of the true number of clusters in a dataset. But in the second and third methods, a kind of such estimation is used as a subroutine.

We have tested all these methods on two selected datasets, namely Iris and Ionosphere from the UCI machine learning repository. Experimental results show the effectiveness of these methods, and hence our claims are supported and verified.

References

[1] Aggarwal C. and Yu P.S., "Redefining Clustering for High-Dimensional Applications," IEEE TKDE, Vol. 14, 2002.

[2] Ben-Hur A., Elisseeff A. and Guyon I., "A stability based method for discovering structure in clustered data," Proc. of the Pacific Symposium on Biocomputing, 2002.

[3] Ben-Hur A. and Guyon I., "Detecting stable clusters using principal component analysis," in Methods in Molecular Biology, M.J. Brownstein and A. Kohodursky (eds.), Humana Press, pp. 159-182, 2003.

[4] Blake C.L. and Merz C.J., "UCI Repository of machine learning databases," [http://www.ics.uci.edu/~mlearn/MLRepository.html], University of California, Department of Information and Computer Science, 1998.

[5] Buhmann J.M., "Learning and Data Clustering," Handbook of Brain Theory and Neural Networks, MIT Press, 2002.

[6] Dudoit S., Fridlyand J. and Speed T., "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 2001.

[7] Fridlyand J. and Dudoit S., "Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method," University of California, Statistics Department Technical Report No. 600, 2001.

[8] Halkidi M., Batistakis Y. and Vazirgiannis M., "On Clustering Validation Techniques," Journal of Intelligent Information Systems, Vol. 17:2-3, 2001.

[9] Jain A.K., Murty M.N. and Flynn P.J., "Data Clustering: A Review," ACM Computing Surveys, Vol. 31, No. 3, 1999.

[10] Keller A., et al., "Bayesian Classification of DNA Array Expression Data," University of Washington, Computer Science Department Technical Report No. UW-CSE-2000-08-01, 2000.

[11] Kerr M.K. and Churchill G.A., "Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments," Proc. of the National Academy of Sciences, 2000.

[12] Levine E. and Domany E., "Resampling Method for Unsupervised Estimation of Cluster Validity," Neural Computation, 2001.

[13] Roth V., et al., "Stability-Based Model Order Selection in Clustering with Applications to Gene Expression Data," Proc. of ICANN, 2002.

[14] Smyth P., "Clustering Using Monte Carlo Cross-Validation," Proc. of ACM-KDD, 1996.

[15] Zhang K. and Zhao H., "Assessing Reliability of Gene Clusters from Gene Expression Data," Functional Genomics, 2000.

