SuSE: Subspace Selection embedded in an EM algorithm

Laurent Candillier, Isabelle Tellier, Fabien Torre, Olivier Bousquet

To cite this version:

Laurent Candillier, Isabelle Tellier, Fabien Torre, Olivier Bousquet. SuSE : Subspace Selection embedded in an EM algorithm. Conférence d'Apprentissage, 2006, Trégastel, France. 2006. <inria-00471311>

HAL Id: inria-00471311

https://hal.inria.fr/inria-00471311

Submitted on 7 Apr 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



SuSE: Subspace Selection embedded in an EM algorithm

L. Candillier1,2, I. Tellier1, F. Torre1, O. Bousquet2

1 GRAppA, Université Charles de Gaulle, Lille [email protected]

2 Pertinence, 32 rue des Jeûneurs, 75002 [email protected]

Abstract:

Subspace clustering is an extension of traditional clustering that seeks to find clusters embedded in different subspaces within a dataset. This is a particularly important challenge with high dimensional data where the curse of dimensionality occurs. It also has the benefit of providing smaller descriptions of the clusters found.

In this field, we show that using probabilistic models provides many advantages over other existing methods. In particular, we show that the difficult problem of the parameter settings of subspace clustering algorithms can be seen as a model selection problem in the framework of probabilistic models. It thus allows us to design a method that does not require any input parameter from the user.

We also point out the interest in allowing the clusters to overlap. And finally, we show that it is well suited for detecting the noise that may exist in the data, and that this helps to provide a more understandable representation of the clusters found.

1 Introduction

Clustering is a powerful exploration tool capable of uncovering previously unknown patterns in data (Berkhin, 2002). Subspace clustering is an extension of traditional clustering that is based on the observation that different clusters (groups of data points) may exist in different subspaces within a dataset (see figure 1 as an example). Subspace clustering is thus more general than classical feature selection for clustering because each subspace may be local to each cluster, instead of global to all of them.

This is a particularly important challenge with high dimensional data where the curse of dimensionality can degrade the quality of the results. Besides, it helps to get smaller descriptions of the clusters found since clusters are defined on fewer dimensions than the original number of dimensions.

In this paper, we point out the interest in using probabilistic models for subspace clustering. In particular, we show that the difficult problem of the parameter settings



[Figure 1: 3D scatter plot with axes X, Y and Z; the legend marks points belonging to clusters embedded in the subspaces X × Z, Y × Z and X × Y.]

FIG. 1 – Example of four clusters embedded in different subspaces.

of subspace clustering algorithms can be seen as a model selection problem in the framework of probabilistic models. It thus allows us to propose a method that does not require any input parameter from the user. We also show that allowing the clusters to overlap may be necessary in that field.

Moreover, the problem of noise detection can also naturally be included into the probabilistic framework. Yet another contribution of this work is to tackle the problem of providing interpretable results. And we will see that detecting the noise that may exist in the data can help to provide more understandable results. Finally, another advantage of using probabilistic models is that it allows us to naturally mix different types of attributes, under some specific assumptions.

The rest of the paper is organized as follows: in section 2 we present existing subspace clustering methods and discuss their performances; we then describe how to adapt probabilistic models for subspace clustering and propose a new algorithm called SuSE in section 3; the results of our experiments, conducted on artificial as well as real datasets, and where SuSE is compared to other existing methods, are then reported in section 4; finally, section 5 concludes the paper and suggests topics for future research.

2 Subspace clustering

The subspace clustering issue was first introduced in (Agrawal et al., 1998). Many other methods then emerged, among which two families can be distinguished according to their subspace search method:

1. bottom-up subspace search methods (Agrawal et al., 1998; Cheng et al., 1999; Nagesh et al., 1999; Kailing et al., 2004) seek to find clusters in subspaces of increasing dimensionality, and produce as output a set of clusters that can overlap;

2. and top-down subspace search methods (Aggarwal et al., 1999; Woo & Lee, 2002; Yip et al., 2003; Sarafis et al., 2003; Domeniconi et al., 2004) use k-means like methods with original techniques of local feature selection, and produce as output a partition of the dataset.

In (Parsons et al., 2004), the authors have studied and compared these methods. They point out that every method requires input parameters that are difficult for the user to set and that influence the results (density threshold, mean number of relevant dimensions of the clusters, minimal distance between clusters, etc.).

Besides, the existing methods do not tackle the problem of handling the noise that may exist in the data. And no proposition was made for producing an interpretable output, although interpretability is an important challenge in the field of clustering. Yet we will see that the noise detection may also help to provide more understandable results. Moreover, although a proposition was made to integrate categorical attributes in bottom-up approaches, all experiments were conducted on numerical data only.

Finally, let us present a case where both types of existing methods behave badly compared to what we should expect from subspace clustering algorithms. It is the case for the data in figure 2, where one cluster is defined on one dimension and takes random values on another one, and conversely for the other cluster.

[Figure 2: four panels showing the same two elongated, crossing clusters: (a) dataset, (b) bottom-up like results, (c) k-means like results, (d) model based results.]

FIG. 2 – A case where existing subspace clustering methods behave badly contrary to methods based on probabilistic models.

In such a case, for bottom-up subspace search methods, all points belong to the same cluster because they form a continuous zone. These methods thus tend to describe data like these as a unique 2D-space cluster instead of as a pair of 1D-space clusters. It then becomes worse with many dimensions. Conversely, for k-means like methods, as intersections are not allowed, the two clusters may not be retrieved. On the other hand, methods based on probabilistic models and the EM algorithm (Ye & Spetsakis, 2003) are able to identify the two clusters.

However, it is well known that the methods based on probabilistic models and the EM algorithm may be slow to converge. Moreover, it would be interesting to adapt such methods to subspace clustering, by designing a model able to identify the subspaces in which each cluster is embedded. Finally, it would also be interesting to use a model able to handle different types of attributes, and to provide an interpretable output.

In the next section, we present such a new statistical subspace clustering algorithm called SuSE. This algorithm belongs to the family of top-down subspace search methods. We will show that assuming that the data values follow independent distributions on each dimension helps to resolve many issues we have presented. And we will see the interest in using such a statistical approach, in particular in order to resolve the difficult problem of parameter settings.

3 Algorithm SuSE

Let us first introduce some notations. We denote by N the number of data points of the input dataset and M the number of dimensions on which they are defined. These dimensions can be numerical as well as categorical. We suppose values on numerical dimensions are normalized (so that all values belong to the same interval). And we denote by Categories_d the set of all possible categories on a categorical dimension d, and Frequences_d the frequencies of all these categories within the dataset.

3.1 Probabilistic model

The basis of our model is the classical mixture of probability distributions θ = (θ_1, ..., θ_K), where each θ_k is the vector of parameters associated with the kth cluster to be found, denoted by C_k (we denote by K the total number of clusters).

Besides, we assume that the data values follow independent distributions on each dimension. Thus, our model is less expressive than the classical one that takes into account the possible correlations between dimensions. But it allows us to naturally mix numerical and categorical dimensions. We will see that it also allows us to extract some of the dimensions considered as more relevant to each cluster. Besides, the method is thus faster than with the classical model because our model needs fewer parameters (O(M) instead of O(M²)) and operations on matrices are avoided. And finally, it is thus adapted to the presentation of the results as a set of rules (hypercubes in subspaces of the original description space are easily understandable by humans) because each dimension of each cluster is characterized independently from one another.

In our model, we suppose that the data follow Gaussian distributions on numerical dimensions and multinomial distributions on categorical dimensions. So the model has the following parameters θ_k for each cluster C_k: π_k denotes its weight, µ_kd its mean and σ_kd its standard deviation on the numerical dimensions d, and Freqs_kd the frequencies of each category on the categorical dimensions d.

Finally, in order to adapt the model for subspace clustering, we add the parameter R that indicates how many relevant dimensions to consider for the clusters. And we add to the parameters of each cluster C_k the set M_k, of size R, of the dimensions considered as the most relevant to the cluster.
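To make this parameterization concrete, here is a minimal sketch of how the per-cluster parameters θ_k could be represented in code; the class and field names are our own and are not part of the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class ClusterParams:
    """Parameters theta_k of one cluster C_k (illustrative field names)."""
    pi: float                       # cluster weight pi_k
    mu: np.ndarray                  # means mu_kd on the numerical dimensions
    sigma: np.ndarray               # standard deviations sigma_kd on the numerical dimensions
    freqs: List[Dict[str, float]]   # Freqs_kd: category frequencies per categorical dimension
    relevant_dims: List[int] = field(default_factory=list)  # M_k: the R most relevant dimensions
```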

To make these local feature selections, we first associate to each dimension d of each cluster C_k a local weight W_kd that indicates its relevance to the cluster. These weights are computed according to the shape of the distribution of the cluster on the dimension. For example, for numerical dimensions, a high standard deviation will induce a low weight on the dimension whereas a low standard deviation will induce a high weight in determining if a data point belongs to the corresponding cluster.

So these weights are computed as follows. For numerical dimensions, it is one minus the ratio between the local variance and the global variance around µ_kd. And for categorical dimensions, it is the relative frequency of the most probable category.

$$
W_{kd} =
\begin{cases}
1 - \dfrac{\sigma_{kd}^2}{\Sigma_{kd}^2}, \quad \text{with } \Sigma_{kd}^2 = \dfrac{1}{N}\sum_i (X_{id} - \mu_{kd})^2 & \text{if } d \text{ numerical} \\[2ex]
\dfrac{Freqs_{kd}(cat) - Frequences_d(cat)}{1 - Frequences_d(cat)}, \quad \text{with } cat = \operatorname{Argmax}_{c \in Categories_d} Freqs_{kd}(c) & \text{if } d \text{ categorical}
\end{cases}
$$

So the weight W_kd reflects the capability of the dimension d to discriminate between the data points that belong to the specific cluster C_k and the other ones. And the R dimensions with the highest weights, which correspond to the most relevant dimensions of the cluster C_k, can be selected.

With this model, we impose that all the clusters have the same number of relevant dimensions, although the dimensions selected for each cluster may be different. If this does not hold in the data, then some irrelevant dimensions may be selected by some clusters. However, the influence of such irrelevant dimensions would be lower than that of the relevant dimensions.
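As an illustration, the sketch below computes these weights for numerical dimensions only and selects the R most relevant ones; the function and variable names (X, mu_k, sigma_k) are assumptions of ours, not the paper's.

```python
import numpy as np

def local_weights_numerical(X, mu_k, sigma_k):
    """W_kd = 1 - sigma_kd^2 / Sigma_kd^2 for one cluster, numerical dimensions only.

    X       : (N, M) data matrix
    mu_k    : (M,) cluster means mu_kd
    sigma_k : (M,) cluster standard deviations sigma_kd
    """
    # Sigma_kd^2: average squared deviation of *all* points from the cluster mean mu_kd.
    global_var = np.mean((X - mu_k) ** 2, axis=0)
    # Local variance of the cluster itself.
    local_var = sigma_k ** 2
    return 1.0 - local_var / np.maximum(global_var, 1e-12)

def select_relevant_dims(weights, R):
    """M_k: indices of the R dimensions with the highest weights."""
    return np.argsort(weights)[::-1][:R]
```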

3.2 EM algorithm

Given a set D of N data points X_i, Maximum Likelihood Estimation is used to estimate the model parameters that best fit the data. To do this, the EM algorithm is an effective two-step process that seeks to optimize the log-likelihood of the model θ according to the dataset D,

$$LL(\theta|D) = \sum_i \log P(\vec{X_i}|\theta).$$

1. E-step (Expectation): find the class probability of each data point according to the current model parameters.

2. M-step (Maximization): update the model parameters according to the new class probabilities.

These two steps iterate until a stopping criterion is reached. Classically, it stops when LL(θ|D) increases by less than a small positive constant δ from one iteration to another.

The E-step consists in computing the membership probability of each data point X_i to each cluster C_k with parameters θ_k. In our case, the dimensions are assumed to be independent, and each cluster has its own set of relevant dimensions M_k. So the membership probability of a data point to a cluster is the product of membership probabilities on each dimension considered as relevant for the cluster. Besides, to avoid that a probability equal to zero on one dimension cancels the global probability, we use a very small positive constant ε.

$$P(\vec{X_i}|\theta_k) = \prod_{d \in M_k} \max\left(P(X_{id}|\theta_{kd}),\, \varepsilon\right)$$



$$
P(X_{id}|\theta_{kd}) =
\begin{cases}
\dfrac{1}{\sqrt{2\pi}\,\sigma_{kd}}\; e^{-\frac{1}{2}\left(\frac{X_{id}-\mu_{kd}}{\sigma_{kd}}\right)^2} & \text{if } d \text{ numerical} \\[2ex]
Freqs_{kd}(X_{id}) & \text{if } d \text{ categorical}
\end{cases}
$$

$$P(\vec{X_i}|\theta) = \sum_{k=1}^{K} \pi_k \times P(\vec{X_i}|\theta_k)$$

$$P(\theta_k|\vec{X_i}) = \frac{\pi_k \times P(\vec{X_i}|\theta_k)}{P(\vec{X_i}|\theta)}$$

Then the M-step consists in updating the model parameters according to the new class probabilities as follows:

$$\pi_k = \frac{1}{N} \sum_i P(\theta_k|\vec{X_i})$$

$$\mu_{kd} = \frac{\sum_i X_{id} \times P(\theta_k|\vec{X_i})}{\sum_i P(\theta_k|\vec{X_i})}$$

$$\sigma_{kd} = \sqrt{\frac{\sum_i P(\theta_k|\vec{X_i}) \times (X_{id} - \mu_{kd})^2}{\sum_i P(\theta_k|\vec{X_i})}}$$

$$Freqs_{kd}(cat) = \frac{\sum_{\{i \,|\, X_{id} = cat\}} P(\theta_k|\vec{X_i})}{\sum_i P(\theta_k|\vec{X_i})} \quad \forall\, cat \in Categories_d$$

In order to cope with the problem of slow convergence with the classical EM algorithm, the following k-means like stopping criterion can be used: stop whenever the membership of each data point to their most probable cluster does not change. To do this, we introduce a new view on each cluster C_k, corresponding to the set of data points that belong to it:

$$S_k = \left\{ \vec{X_i} \;\middle|\; \operatorname{Argmax}_{j=1}^{K} P(\vec{X_i}|\theta_j) = k \right\}$$

The set of all S_k thus defines a partition on the dataset. However, as we discussed earlier on the example of figure 2, this ability to provide a partition on the object space does not prevent us from considering clusters that may overlap on the description space.

And finally, to cope with the problem of sensitivity to the choice of the initial solution, we run the algorithm many times with random initial solutions and keep the model that optimizes the log-likelihood LL(θ|D).
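To fix ideas, the following sketch implements the two EM steps, the k-means like stopping criterion and the random restarts described above, for numerical data only; it omits categorical dimensions and the uniform noise cluster, and all function and variable names (em_subspace, n_restarts, etc.) are assumptions of ours rather than the paper's code.

```python
import numpy as np

def log_gauss(X, mu, sigma):
    # Componentwise Gaussian density P(X_id | theta_kd), floored by a small epsilon.
    eps = 1e-300
    p = np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.log(np.maximum(p, eps))

def em_subspace(X, K, R, n_restarts=10, max_iter=100, seed=0):
    N, M = X.shape
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        # Random initialization: means drawn from the data, unit variances, uniform weights.
        mu = X[rng.choice(N, K, replace=False)].copy()
        sigma = np.ones((K, M))
        pi = np.full(K, 1.0 / K)
        rel = np.tile(np.arange(M), (K, 1))[:, :R]     # M_k: start with the first R dimensions
        labels = np.full(N, -1)
        for _ in range(max_iter):
            # E-step: log(pi_k) + log P(X_i | theta_k), restricted to each cluster's relevant dimensions.
            logp = np.empty((N, K))
            for k in range(K):
                d = rel[k]
                logp[:, k] = np.log(pi[k]) + log_gauss(X[:, d], mu[k, d], sigma[k, d]).sum(axis=1)
            resp = np.exp(logp - logp.max(axis=1, keepdims=True))
            resp /= resp.sum(axis=1, keepdims=True)    # P(theta_k | X_i)
            # M-step: update pi_k, mu_kd and sigma_kd from the responsibilities.
            nk = resp.sum(axis=0)
            pi = nk / N
            mu = (resp.T @ X) / nk[:, None]
            var = (resp.T @ (X ** 2)) / nk[:, None] - mu ** 2
            sigma = np.sqrt(np.maximum(var, 1e-12))
            # Local feature selection: keep the R dimensions of highest weight W_kd per cluster.
            global_var = np.array([np.mean((X - mu[k]) ** 2, axis=0) for k in range(K)])
            weights = 1.0 - sigma ** 2 / np.maximum(global_var, 1e-12)
            rel = np.argsort(weights, axis=1)[:, ::-1][:, :R]
            # k-means like stopping criterion: stop when hard assignments no longer change.
            new_labels = logp.argmax(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        ll = np.logaddexp.reduce(logp, axis=1).sum()   # LL(theta | D)
        if best is None or ll > best[0]:
            best = (ll, pi, mu, sigma, rel, labels)
    return best
```

A call such as em_subspace(X, K=3, R=2) would return, for the best restart, the log-likelihood, the mixture parameters, the selected dimensions of each cluster and the hard assignments.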

3.3 Model selection

At this stage, our algorithm needs two input parameters: the number K of clusters to be found, and the number R of dimensions to be selected for each cluster. An important advantage of using a probabilistic model over other existing methods is that finding the most appropriate values of these two parameters can be seen as a model selection problem.

So we can for example use the BIC criterion (Ye & Spetsakis, 2003), which consists in adding to the log-likelihood of the model on the data a term that penalizes more complex models. It thus tries to find a compromise between the fit of the model to the data, and the complexity of the model used.

$$BIC(\theta|D) = -2 \times LL(\theta|D) + M_\theta \times \log N$$

M_θ represents the number of independent parameters of the model:

$$M_\theta = \sum_{k=1}^{K} \sum_{d \in M_k}
\begin{cases}
2 & \text{if } d \text{ numerical} \\
|Categories_d| & \text{if } d \text{ categorical}
\end{cases}$$

The BIC criterion must be minimized to optimize the likelihood of the model to the data. So to find the model that best fits the data, we consider different models with different values for the parameters K and R, and the model that minimizes the BIC value is kept.

Contrary to the other existing subspace clustering methods, we thus propose a way to automatically find the most appropriate values for the model parameters, and do not need the user to provide any prior knowledge. The relevance of this method will be studied in the next section.
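A minimal sketch of this model selection loop is given below; it assumes numerical-only data and a fit function (for instance the em_subspace sketch above) returning the log-likelihood and the fitted model, both of which are assumptions of ours.

```python
import numpy as np

def n_free_params(K, R):
    """M_theta for numerical-only data: 2 parameters (mu_kd, sigma_kd) per selected dimension and cluster."""
    return K * R * 2

def bic_model_selection(X, fit, K_range, R_range):
    """Grid search over K and R, keeping the model that minimizes BIC = -2 LL + M_theta log N."""
    N = len(X)
    best = None
    for K in K_range:
        for R in R_range:
            ll, model = fit(X, K, R)          # fit(X, K, R) -> (log-likelihood, model)
            bic = -2.0 * ll + n_free_params(K, R) * np.log(N)
            if best is None or bic < best[0]:
                best = (bic, K, R, model)
    return best
```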

Finally, another advantage of using probabilistic models for subspace clustering is that detecting the noise that may exist in the data can be naturally integrated into the method, by adding a uniform cluster into the model. Moreover, we will see in the next section that handling the noise can also help to get more understandable results.

4 Experiments

Experiments were conducted on artificial as well as real datasets. The first ones are used to observe the evolution of the BIC value according to the number of clusters expected and the number of dimensions selected for each cluster. We also use artificial datasets to observe the robustness of SuSE faced with different types of datasets, in particular datasets containing noise. Then experiments on real datasets are conducted according to the methodology proposed in (Candillier et al., 2005), and thus show the effectiveness of our method on real-life data.

In order to compare our method with existing ones, we conduct these experiments on numerical-only datasets. Three other clustering algorithms are used in these experiments:

– K-means, the well-known full-space clustering algorithm based on the evolution of K centroids that represent the K clusters to be found.

– LAC (Domeniconi et al., 2004), an effective top-down like subspace clustering method that is based on K-means and associates with each centroid a vector of weights on each dimension. At each step and for each cluster, these weights on each dimension are updated according to the dispersion of the members of the cluster on the dimension (the greater the dispersion, the lower the weight).

– And EMI, which refers to clustering by learning a mixture of Gaussians with the EM algorithm under the independence assumption on the dimensions, but without performing local feature selection as is done by SuSE.

4.1 Artificial datasets

Artificial datasets are generated according to the following parameters: N the number of data points in the dataset, M the number of (numerical) dimensions on which they are defined, L the number of clusters, m the mean dimensionality of the subspaces on which the clusters are defined, SD_m and SD_M the minimum and maximum standard deviation of the coordinates of the data points that belong to a same cluster, from its centroid and on its relevant dimensions.

L random data points are chosen on the M-dimensional description space and used as seeds of the L clusters (C_1, ..., C_L) to be generated. Let us denote them by (O_1, ..., O_L). With each cluster is associated a subset of the N data points and a subset (of size close to m) of the M dimensions that will define its specific subspace. Then the coordinates of the data points that belong to a cluster C_k are generated according to a normal distribution with mean O_kd and standard deviation sd_kd ∈ [SD_m, SD_M] on its specific dimensions d. They are generated uniformly between 0 and 100 on the other dimensions.
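A minimal sketch of such a generator is shown below (numerical data only; for simplicity it uses exactly m relevant dimensions per cluster and assigns points to clusters uniformly at random, which are simplifications of ours):

```python
import numpy as np

def make_subspace_dataset(N=300, M=20, L=4, m=5, SDm=3.0, SDM=9.0, seed=0):
    """Generate N points in M dimensions grouped into L clusters, each defined on m
    relevant dimensions (normal there, uniform in [0, 100] everywhere else)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 100.0, size=(N, M))          # irrelevant dimensions: uniform values
    labels = rng.integers(0, L, size=N)               # membership of each point
    for k in range(L):
        seed_point = rng.uniform(0.0, 100.0, size=M)  # cluster seed O_k
        dims = rng.choice(M, size=m, replace=False)   # the cluster's relevant dimensions
        idx = np.where(labels == k)[0]
        for d in dims:
            sd = rng.uniform(SDm, SDM)                # sd_kd drawn in [SDm, SDM]
            X[idx, d] = rng.normal(seed_point[d], sd, size=idx.size)
    return X, labels
```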

For all experiments, 100 artificial datasets are generated with N varying between 50 and 300, M between 10 and 50, L between 2 and 5, m between 3 and 10, SD_m = 3 and SD_M = 9. Then averages of the measures of interest over the various trials are computed.

Our first experiments concern the evolution of the BIC value according to the number R of relevant dimensions selected for each cluster, when the number K of clusters expected is provided. Figure 3 shows such a curve, and thus points out that the BIC value decreases until R reaches m, and then increases. So it experimentally shows that the BIC criterion can be used to automatically determine the most appropriate number of relevant dimensions for each cluster.

Similarly, figure 4 shows the 3D plot of the BIC value according to the number K of clusters to be found, and the number R of relevant dimensions selected for each cluster. For a better visualization of the results, −BIC is reported instead of BIC. It thus experimentally points out that the optimum BIC value is reached when K reaches L and R reaches m.

4.2 Noisy datasets

We also conducted experiments on artificial datasets to observe the robustness of our method to noise. Figure 5 shows the results obtained with or without taking into account the noise that may exist in the data. In this example, we see that detecting the noise leads to more understandable results.

Figure 6 then shows the resistance of SuSE to noise, compared to EMI and LAC. The accuracy of the partition is measured by the average purity of the clusters (the purity of a cluster is the maximum percentage of data points that belong to the same initial concept).
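A small sketch of this purity measure, as we read the definition above (averaging the per-cluster purities with equal weight, which is our interpretation; labels are assumed to be non-negative integers):

```python
import numpy as np

def partition_purity(true_labels, cluster_labels):
    """Average purity: for each found cluster, the largest fraction of its points that
    share the same initial concept, averaged over the clusters."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    purities = []
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        counts = np.bincount(members)                 # occurrences of each initial concept
        purities.append(counts.max() / members.size)
    return float(np.mean(purities))
```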


[Figure 3: curve of the BIC value (y axis) against R − m (x axis).]

FIG. 3 – Evolution of the BIC value according to the number R of relevant dimensions selected for each cluster.

[Figure 4: 3D surface of −BIC against K − L and R − m.]

FIG. 4 – Evolution of the BIC value according to the number K of clusters to be found and the number R of relevant dimensions selected for each cluster.



[Figure 5: two scatter plots on dimensions d16 and d19 with points labelled Noise, A, B, C0 and C1: (a) SuSE without noise detection, (b) SuSE with noise detection.]

FIG. 5 – Example of the interest of the noise detection.



[Figure 6: purity of the partition (y axis, 80–100%) against the noise percentage (x axis, 0–20%) for SuSE, EMI and LAC.]

FIG. 6 – Purity of the partition according to the percentage of noise in the dataset.

We can thus observe that SuSE is more robust to noise than EMI and LAC. Our method is also robust to missing values. When summing over all the data values on one dimension, the only thing to do is to ignore the missing values.

Let us finally note that our method is still robust even if the data are generated by uniform distributions inside given intervals on the relevant dimensions of the clusters, instead of normal distributions.

4.3 Real datasets

We also conducted various experiments on real datasets, following the methodology proposed in (Candillier et al., 2005) that consists in:

1. performing a supervised learning on a given dataset with class information;

2. performing a supervised learning on the same dataset enriched with the information coming from the clustering algorithm to be evaluated:
– perform a clustering without using the class information
– create new attributes from these results
– add these new attributes to the dataset
– and perform the supervised learning on the enriched dataset

3. and comparing the classification errors of both methods.

Thus, if the results of the supervised learning algorithm are improved when some extra knowledge is added from the clustering process, then we conjecture that the clustering process managed to capture some new meaningful and useful information. And the decrease of the error rate of the supervised method when it is helped by the information coming from the clustering allows us to quantify the interest of the clustering algorithm.

One way of creating new attributes from the results of a clustering is for example to add to each data point an identifier of the cluster it belongs to. We could also add to each data point a set of attributes referring to the center of the cluster it belongs to. In these experiments, we use C4.5 (Quinlan, 1993) as the supervised algorithm, but it was shown experimentally in (Candillier et al., 2005) that the results do not depend on the supervised algorithm used.
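As a rough illustration of this cascade evaluation, the sketch below appends a one-hot encoding of the cluster identifier to the attributes and compares a decision tree with and without it; scikit-learn's DecisionTreeClassifier stands in for C4.5 and a single 2-fold cross-validation with balanced accuracy replaces the five 2-fold runs of the paper, all of which are assumptions of ours.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cascade_evaluation(X, y, cluster_ids, cv=2):
    """Compare a decision tree on the raw attributes with one on the attributes
    enriched by the cluster identifier coming from an unsupervised clustering."""
    cluster_ids = np.asarray(cluster_ids)
    # One-hot encode the cluster identifier and append it to the description space.
    onehot = (cluster_ids[:, None] == np.unique(cluster_ids)[None, :]).astype(float)
    X_enriched = np.hstack([X, onehot])
    base = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv,
                           scoring="balanced_accuracy").mean()
    enriched = cross_val_score(DecisionTreeClassifier(), X_enriched, y, cv=cv,
                               scoring="balanced_accuracy").mean()
    return base, enriched
```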

To evaluate the improvement in the results of C4.5 with or without the new information coming from the clustering process, we test both methods on various independent datasets coming from the UCI Machine Learning Repository (Blake & Merz, 1998). On each dataset, we perform five 2-fold cross-validations, as proposed in (Dietterich, 1998). For each 2-fold cross-validation, we compute the balanced error rates of both methods.

Table 1 reports the error rates of C4.5 on the initial datasets, and on the datasets enriched with the corresponding clustering algorithms. Rand is a random clustering that is used as a reference.

        C4.5    C4.5     C4.5        C4.5    C4.5    C4.5
        alone   + Rand   + K-means   + LAC   + EMI   + SuSE
ecoli   48.5    48.3     42.8        40.3    42      43
glass   32.6    40.8     35.7        37      40.4    35.8
image    4.8     6        4.8         4.6     4.6     4.2
iono    14.1    15.8     14.2        13.1     9.8    10.9
iris     7.3     7.9      6.7         3.7     5.1     4.8
pima    31      35       32.1        32.1    30.8    30.5
sonar   31      35.2     30          28.8    28.8    27.4
vowel   29.5    38.5     25          26.4    24.1    23.7
wdbc     5.9     6.8      4.6         3.9     5.1     3.8
wine     8.7     8.8     10.4         9.6     2.7     4.1

TAB. 1 – Balanced error rates (in %) of C4.5 enriched by clustering algorithms. The bold values correspond to the minimum error rates obtained on each dataset.

From this table, we can already observe that most of the time, the results of C4.5 are improved when some information coming from real clustering algorithms is added, whereas adding information from a random clustering degrades the results. Besides, we see that the results of the methods based on the use of probabilistic models are often better than those of K-means based methods.

Then four measures are used to compare the results of C4.5 alone with those of C4.5 helped with the corresponding clustering algorithms:

– nb wins: the number of wins of each method
– sign wins: the number of significant wins, using the 5×2cv F-test (Alpaydin, 1999) to check if the results are significantly different


– wilcoxon: the Wilcoxon signed rank test, which indicates if a method is significantly better than another one on a set of independent problems (if its value is above 1.96)
– and av perf: the mean balanced error rate (in %)

Table 2 shows the results of such an evaluation. The first column concerns the measures obtained using C4.5 on the initial dataset, the second column using C4.5 on the dataset enriched with information coming from the random clustering, and the next ones using C4.5 on the dataset enriched with information coming from the corresponding clustering algorithm.

            C4.5    C4.5     C4.5        C4.5    C4.5    C4.5
            alone   + Rand   + K-means   + LAC   + EMI   + SuSE
nb wins     -       1/9      5/4         7/3     9/1     9/1
sign wins   -       0/1      0/0         1/0     2/0     3/0
wilcoxon    -       -2.67    -0.05       1.31    1.83    2.36
av perf     21.3    24.3     20.6        20      19.3    18.8

TAB. 2 – Comparison of C4.5 alone with C4.5 enriched by clustering algorithms.

It thus shows that SuSE is the only clustering algorithm among the ones tested here that significantly helps C4.5 improve its results, according to the Wilcoxon signed rank test. It is significantly better on 3 datasets according to the 5×2cv F-test. But like SuSE, EMI improves the results of C4.5 nine times out of ten, contrary to K-means and LAC. All algorithms improve the results of C4.5 on average, except the random clustering.

5 Conclusion

We have shown in this paper the interest in using probabilistic models for subspace clustering. Indeed, we have seen that it allows us to transform the difficult problem of the parameter settings into a model selection problem, so that the most appropriate number of relevant dimensions to consider for each cluster can be determined automatically, instead of requiring the user to specify it, as is done by the other subspace clustering methods. Besides, we have also shown the interest in allowing the clusters to overlap in that field.

We have also pointed out the interest in assuming that the data values follow independent distributions on each dimension. Indeed, it allows us to speed up the algorithm, to naturally mix different types of attributes, and to provide an understandable result as a set of rules. However, in the case where correlations between dimensions exist, our method does not provide irrelevant results. Instead, in many cases, it points out the correlations by generating clusters along the axes defined by the correlations. An example of such results is presented in figure 7.

[Figure 7: scatter plot of data correlated across two dimensions, covered by a sequence of clusters found by SuSE along the correlation axis.]

FIG. 7 – Results of SuSE when a correlation between dimensions exists.

The experiments we conducted on artificial datasets pointed out the robustness of our method to noise. Moreover, we have seen that detecting the noise may help to get more understandable results. Then experiments on real datasets pointed out the relevance of using probabilistic models for subspace clustering. In particular, methods based on the use of probabilistic models have been shown to outperform K-means based methods. And more specifically, SuSE has been shown to outperform EMI, thus pointing out the relevance of our proposed method for selecting the most appropriate number of relevant dimensions.

To continue our investigations in that field, we could now conduct more experiments and compare our method with many others, on many other artificial and real datasets. Finally, we could also study in more detail how to design an efficient way to reach the optimum BIC value, referring to the most appropriate number of clusters and number of relevant dimensions to be considered.

References

AGGARWAL C. C., WOLF J. L., YU P. S., PROCOPIUC C. & PARK J. S. (1999). Fast algorithms for projected clustering. In ACM SIGMOD Int. Conf. on Management of Data, p. 61–72.

AGRAWAL R., GEHRKE J., GUNOPULOS D. & RAGHAVAN P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. In ACM SIGMOD Int. Conf. on Management of Data, p. 94–105, Seattle, Washington.

ALPAYDIN E. (1999). Combined 5x2cv F-test for comparing supervised classification learning algorithms. Neural Computation, 11(8), 1885–1892.

BERKHIN P. (2002). Survey Of Clustering Data Mining Techniques. Technical report, Accrue Software, San Jose, California.

BLAKE C. & MERZ C. (1998). UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html].

CANDILLIER L., TELLIER I., TORRE F. & BOUSQUET O. (2005). Cascade evaluation. NIPS 2005 Workshop on Theoretical Foundations of Clustering.

CHENG C. H., FU A. W.-C. & ZHANG Y. (1999). Entropy-based subspace clustering for mining numerical data. In Knowledge Discovery and Data Mining, p. 84–93.

DIETTERICH T. G. (1998). Approximate statistical test for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1923.

DOMENICONI C., PAPADOPOULOS D., GUNOPULOS D. & MA S. (2004). Subspace clustering of high dimensional data. In SIAM Int. Conf. on Data Mining.

KAILING K., KRIEGEL H.-P. & KRÖGER P. (2004). Density-connected subspace clustering for high-dimensional data. In SIAM Int. Conf. on Data Mining, p. 246–257.

NAGESH H., GOIL S. & CHOUDHARY A. (1999). MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical report, Northwestern University.

PARSONS L., HAQUE E. & LIU H. (2004). Evaluating subspace clustering algorithms. In Workshop on Clustering High Dimensional Data and its Applications, SIAM Int. Conf. on Data Mining, p. 48–56.

QUINLAN J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

SARAFIS I. A., TRINDER P. W. & ZALZALA A. M. S. (2003). Towards effective subspace clustering with an evolutionary algorithm. In IEEE Congress on Evolutionary Computation, Canberra, Australia.

WOO K.-G. & LEE J.-H. (2002). FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting. PhD thesis, Korea Advanced Institute of Science and Technology, Department of Electrical Engineering and Computer Science.

YE L. & SPETSAKIS M. (2003). Clustering on Unobserved Data using Mixture of Gaussians. Technical report, York University, Toronto, Canada.

YIP K. Y., CHEUNG D. W. & NG M. K. (2003). A highly-usable projected clustering algorithm for gene expression profiles. In 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics, p. 41–48.

