
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. YY, 2011

Clustering with Multi-Viewpoint based Similarity Measure

Duc Thang Nguyen, Lihui Chen, Senior Member, IEEE, and Chee Keong Chan

Abstract—All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multi-viewpoint based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed not to be in the same cluster with the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity could be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

Index Terms—Document clustering, text mining, similarity measure.

1 INTRODUCTION

Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year. They can be proposed for very distinct research fields, and developed using totally different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple algorithm k-means still remains one of the top 10 data mining algorithms nowadays. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favourite algorithm that practitioners in the related fields choose to use. Needless to mention, k-means has more than a few basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance can be worse than other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios could be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.

A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among data. Basically, there is an implicit assumption that the true intrinsic structure of data could be correctly described by the similarity formula defined and embedded in the clustering criterion function.

The authors are with the Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Ave., Republic of Singapore, 639798. Email: [email protected], {elhchen, eckchan}@ntu.edu.sg.

Hence, the effectiveness of clustering algorithms under this approach depends on the appropriateness of the similarity measure to the data at hand. For instance, the original k-means has a sum-of-squared-error objective function that uses Euclidean distance. In a very sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine similarity instead of Euclidean distance as the measure, is deemed to be more suitable [3], [4].

In [5], Banerjee et al. showed that Euclidean distance was indeed one particular form of a class of distance measures called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in which any kind of Bregman divergence could be applied. Kullback-Leibler divergence was a special case of Bregman divergences that was said to give good clustering results on document datasets. Kullback-Leibler divergence is a good example of a non-symmetric measure. Also on the topic of capturing dissimilarity in data, Pakalska et al. [6] found that the discriminative power of some distance measures could increase when their non-Euclidean and non-metric attributes were increased. They concluded that non-Euclidean and non-metric measures could be informative for statistical learning of data. In [7], Pelillo even argued that the symmetry and non-negativity assumption of similarity measures was actually a limitation of current state-of-the-art clustering approaches. At the same time, clustering still requires more robust dissimilarity or similarity measures; recent works such as [8] illustrate this need.

The work in this paper is motivated by investigations from the above and similar research findings. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method.

Digital Object Identifier 10.1109/TKDE.2011.86 1041-4347/11/$26.00 © 2011 IEEE

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


TABLE 1
Notations

Notation              Description
n                     number of documents
m                     number of terms
c                     number of classes
k                     number of clusters
d                     document vector, ||d|| = 1
S = {d1, ..., dn}     set of all the documents
Sr                    set of documents in cluster r
D = Σ_{di∈S} di       composite vector of all the documents
Dr = Σ_{di∈Sr} di     composite vector of cluster r
C = D/n               centroid vector of all the documents
Cr = Dr/nr            centroid vector of cluster r, nr = |Sr|

Our first objective is to derive a novel method for measuring similarity between data objects in a sparse and high-dimensional domain, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.

The remainder of this paper is organized as follows. In Section 2, we review related literature on similarity and clustering of documents. We then present our proposal for a document similarity measure in Section 3. It is followed by two criterion functions for document clustering and their optimization algorithms in Section 4. Extensive experiments on real-world benchmark datasets are presented and discussed in Sections 5 and 6. Finally, conclusions and potential future work are given in Section 7.

2 RELATED WORK

First of all, Table 1 summarizes the basic notations that will be used extensively throughout this paper to represent documents and related concepts. Each document in a corpus corresponds to an m-dimensional vector d, where m is the total number of terms that the document corpus has. Document vectors are often subjected to some weighting scheme, such as the standard Term Frequency-Inverse Document Frequency (TF-IDF), and normalized to have unit length.
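As a concrete illustration of this representation, the following is a minimal Python sketch; the paper does not prescribe any tooling, so scikit-learn, the toy corpus and the pruning thresholds are assumptions of this sketch, and the stemming step used in the paper is omitted here.

# Minimal sketch (not the authors' code): TF-IDF weighting followed by
# normalization of each document vector to unit length, as in Table 1.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "oil prices rose sharply after the opec report",
    "the central bank raised interest rates again",
    "crude oil exports increased as prices recovered",
    "the bank cut interest rates on fears of recession",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # stop-word removal
    min_df=2,              # drop words appearing in fewer than two documents (assumed threshold)
    max_df=0.995,          # drop words appearing in more than 99.5% of documents (assumed threshold)
    norm="l2",             # each document vector d satisfies ||d|| = 1
)
X = vectorizer.fit_transform(corpus)   # n x m sparse matrix with unit-length rows
print(X.shape)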

The principal definition of clustering is to arrange data objects into separate clusters such that the intra-cluster similarity as well as the inter-cluster dissimilarity is maximized. The problem formulation itself implies that some form of measurement is needed to determine such similarity or dissimilarity. There are many state-of-the-art clustering approaches that do not employ any specific form of measurement, for instance, probabilistic model-based methods [9], non-negative matrix factorization [10], information-theoretic co-clustering [11] and so on. In this paper, though, we primarily focus on methods that indeed do utilize a specific measure. In the literature, Euclidean distance is one of the most popular measures:

$$\mathrm{Dist}(d_i, d_j) = \|d_i - d_j\| \qquad (1)$$

It is used in the traditional k-means algorithm. The objective of k-means is to minimize the Euclidean distance between objects of a cluster and that cluster's centroid:

$$\min \sum_{r=1}^{k} \sum_{d_i \in S_r} \|d_i - C_r\|^2 \qquad (2)$$

However, for data in a sparse and high-dimensional space, such as that in document clustering, cosine similarity is more widely used. It is also a popular similarity score in text mining and information retrieval [12]. Particularly, the similarity of two document vectors di and dj, Sim(di, dj), is defined as the cosine of the angle between them. For unit vectors, this equals their inner product:

$$\mathrm{Sim}(d_i, d_j) = \cos(d_i, d_j) = d_i^t d_j \qquad (3)$$

The cosine measure is used in a variant of k-means called spherical k-means [3]. While k-means aims to minimize Euclidean distance, spherical k-means intends to maximize the cosine similarity between documents in a cluster and that cluster's centroid:

$$\max \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{d_i^t C_r}{\|C_r\|} \qquad (4)$$
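For illustration, a small Python/numpy sketch of evaluating the objectives in Eq. (2) and Eq. (4) for a fixed partition follows; the function names and toy data are mine, not from the paper.

# Sketch: evaluating the k-means objective of Eq. (2) and the spherical
# k-means objective of Eq. (4) for a given partition of unit document vectors.
import numpy as np

def kmeans_objective(X, labels, k):
    """Sum of squared Euclidean distances to cluster centroids, Eq. (2)."""
    total = 0.0
    for r in range(k):
        Xr = X[labels == r]
        Cr = Xr.mean(axis=0)                              # centroid C_r = D_r / n_r
        total += np.sum(np.linalg.norm(Xr - Cr, axis=1) ** 2)
    return total

def spherical_kmeans_objective(X, labels, k):
    """Sum of cosine similarities to the normalized centroid, Eq. (4)."""
    total = 0.0
    for r in range(k):
        Xr = X[labels == r]
        Cr = Xr.mean(axis=0)
        total += np.sum(Xr @ Cr) / np.linalg.norm(Cr)     # d_i^t C_r / ||C_r||
    return total

rng = np.random.default_rng(0)
X = rng.random((6, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)             # unit-length rows, as in Table 1
labels = np.array([0, 0, 0, 1, 1, 1])
print(kmeans_objective(X, labels, 2), spherical_kmeans_objective(X, labels, 2))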

The major difference between Euclidean distance and cosine similarity, and therefore between k-means and spherical k-means, is that the former focuses on vector magnitudes, while the latter emphasizes vector directions. Besides its direct application in spherical k-means, the cosine of document vectors is also widely used in many other document clustering methods as a core similarity measurement. The min-max cut graph-based spectral method is an example [13]. In the graph partitioning approach, the document corpus is considered as a graph G = (V, E), where each document is a vertex in V and each edge in E has a weight equal to the similarity between a pair of vertices. The min-max cut algorithm tries to minimize the criterion function:

$$\min \sum_{r=1}^{k} \frac{\mathrm{Sim}(S_r, S \setminus S_r)}{\mathrm{Sim}(S_r, S_r)} \qquad (5)$$

$$\text{where } \mathrm{Sim}(S_q, S_r) = \sum_{d_i \in S_q,\, d_j \in S_r} \mathrm{Sim}(d_i, d_j), \quad 1 \le q, r \le k,$$

and when the cosine as in Eq. (3) is used, minimizing the criterion in Eq. (5) is equivalent to:

$$\min \sum_{r=1}^{k} \frac{D_r^t D}{\|D_r\|^2} \qquad (6)$$

There are many other graph partitioning methods with different cutting strategies and criterion functions, such as Average Weight [14] and Normalized Cut [15], all of which have been successfully applied for document clustering using cosine as the pairwise similarity score [16], [17]. In [18], an empirical study was conducted to compare a variety of criterion functions for document clustering.


Another popular graph-based clustering technique is implemented in a software package called CLUTO [19]. This method first models the documents with a nearest-neighbor graph, and then splits the graph into clusters using a min-cut algorithm. Besides the cosine measure, the extended Jaccard coefficient can also be used in this method to represent similarity between nearest documents. Given non-unit document vectors ui, uj (di = ui/||ui||, dj = uj/||uj||), their extended Jaccard coefficient is:

$$\mathrm{Sim}_{eJacc}(u_i, u_j) = \frac{u_i^t u_j}{\|u_i\|^2 + \|u_j\|^2 - u_i^t u_j} \qquad (7)$$

Compared with Euclidean distance and cosine similarity, the extended Jaccard coefficient takes into account both the magnitude and the direction of the document vectors. If the documents are instead represented by their corresponding unit vectors, this measure has the same effect as cosine similarity. In [20], Strehl et al. compared four measures: Euclidean, cosine, Pearson correlation and extended Jaccard, and concluded that cosine and extended Jaccard are the best ones on web documents.
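A quick numerical illustration of Eq. (7) follows (my own sketch): for unit vectors the extended Jaccard coefficient equals cos/(2 - cos), a monotonic function of the cosine, so rankings by the two measures coincide.

# Sketch: extended Jaccard coefficient on raw vectors, and its relation to the
# cosine once vectors are normalized to unit length.
import numpy as np

def extended_jaccard(u, v):
    dot = float(u @ v)
    return dot / (u @ u + v @ v - dot)      # u^t v / (||u||^2 + ||v||^2 - u^t v)

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([3.0, 0.0, 1.0])
v = np.array([1.0, 2.0, 1.0])
print(extended_jaccard(u, v), cosine(u, v))

# With unit vectors, eJaccard = cos / (2 - cos).
du, dv = u / np.linalg.norm(u), v / np.linalg.norm(v)
c = cosine(du, dv)
print(extended_jaccard(du, dv), c / (2.0 - c))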

In nearest-neighbor graph clustering methods, such as CLUTO's graph method above, the concept of similarity is somewhat different from the previously discussed methods. Two documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, there is no connection between them. In such a case, some context-based knowledge or notion of relatedness is already taken into account when considering similarity. Recently, Ahmad and Dey [21] proposed a method to compute distance between two categorical values of an attribute based on their relationship with all other attributes. Subsequently, Ienco et al. [22] introduced a similar context-based distance learning method for categorical data. However, for a given attribute, they only selected a relevant subset of attributes from the whole attribute set to use as the context for calculating the distance between its two values.

More related to text data, there are phrase-based and concept-based document similarities. Lakkaraju et al. [23] employed a conceptual tree-similarity measure to identify similar documents. This method requires representing documents as concept trees with the help of a classifier. For clustering, Chim and Deng [24] proposed a phrase-based document similarity by combining the suffix tree model and the vector space model. They then used a Hierarchical Agglomerative Clustering algorithm to perform the clustering task. However, a drawback of this approach is the high computational complexity due to the need to build the suffix tree and to calculate pairwise similarities explicitly before clustering. There are also measures designed specifically for capturing structural similarity among XML documents [25]. They are essentially different from the document-content measures that are discussed in this paper.

In general, cosine similarity still remains the most popular measure because of its simple interpretation and easy computation, though its effectiveness remains fairly limited. In the following sections, we propose a novel way to evaluate similarity between documents, and consequently formulate new criterion functions for document clustering.

3 MULTI-VIEWPOINT BASED SIMILARITY

3.1 Our novel similarity measure

The cosine similarity in Eq. (3) can be expressed in the following form without changing its meaning:

$$\mathrm{Sim}(d_i, d_j) = \cos(d_i - 0, d_j - 0) = (d_i - 0)^t (d_j - 0) \qquad (8)$$

where 0 is the zero vector that represents the origin point. According to this formula, the measure takes 0 as the one and only reference point. The similarity between two documents di and dj is determined w.r.t. the angle between the two points when looking from the origin.

To construct a new concept of similarity, it is possible to use more than just one point of reference. We may have a more accurate assessment of how close or distant a pair of points is if we look at them from many different viewpoints. From a third point dh, the directions and distances to di and dj are indicated respectively by the difference vectors (di - dh) and (dj - dh). By standing at various reference points dh to view di, dj and working on their difference vectors, we define similarity between the two documents as:

$$\mathrm{Sim}(d_i, d_j)_{\,d_i, d_j \in S_r} = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}(d_i - d_h,\, d_j - d_h) \qquad (9)$$

As described by the above equation, the similarity of two documents di and dj - given that they are in the same cluster - is defined as the average of similarities measured relatively from the views of all other documents outside that cluster. What is interesting is that the similarity here is defined in a close relation to the clustering problem. A presumption of cluster memberships has been made prior to the measure. The two objects to be measured must be in the same cluster, while the points from where to establish this measurement must be outside of the cluster. We call this proposal the Multi-Viewpoint based Similarity, or MVS. From this point onwards, we will denote the proposed similarity measure between two document vectors di and dj by MVS(di, dj | di, dj ∈ Sr), or occasionally MVS(di, dj) for short.

The final form of MVS in Eq. (9) depends on the particular formulation of the individual similarities within the sum. If the relative similarity is defined by the dot product of the difference vectors, we have:

$$\mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h)$$
$$= \frac{1}{n - n_r} \sum_{d_h} \cos(d_i - d_h, d_j - d_h)\, \|d_i - d_h\|\, \|d_j - d_h\| \qquad (10)$$
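A direct, unoptimized Python sketch of Eq. (10) follows (illustration only; variable names are mine).

# Sketch: multi-viewpoint similarity of two documents d_i, d_j in cluster S_r,
# averaged over every viewpoint d_h outside the cluster, as in Eq. (10).
import numpy as np

def mvs_pair(di, dj, viewpoints):
    """viewpoints: array of shape (n - n_r, m) holding all d_h in S \\ S_r."""
    diffs_i = di - viewpoints                       # rows are (d_i - d_h)
    diffs_j = dj - viewpoints                       # rows are (d_j - d_h)
    return float(np.mean(np.sum(diffs_i * diffs_j, axis=1)))

rng = np.random.default_rng(1)
docs = rng.random((8, 4))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)    # unit document vectors
in_cluster, outside = docs[:3], docs[3:]               # S_r and S \ S_r
print(mvs_pair(in_cluster[0], in_cluster[1], outside))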


The similarity between two points di and dj inside cluster Sr, viewed from a point dh outside this cluster, is equal to the product of the cosine of the angle between di and dj looking from dh and the Euclidean distances from dh to these two points. This definition is based on the assumption that dh is not in the same cluster with di and dj. The smaller the distances ||di - dh|| and ||dj - dh|| are, the higher the chance that dh is in fact in the same cluster with di and dj, and the similarity based on dh should also be small to reflect this potential. Therefore, through these distances, Eq. (10) also provides a measure of inter-cluster dissimilarity, given that points di and dj belong to cluster Sr, whereas dh belongs to another cluster. The overall similarity between di and dj is determined by taking the average over all the viewpoints not belonging to cluster Sr. It is possible to argue that while most of these viewpoints are useful, there may be some of them giving misleading information, just as may happen with the origin point. However, given a large enough number of viewpoints and their variety, it is reasonable to assume that the majority of them will be useful. Hence, the effect of misleading viewpoints is constrained and reduced by the averaging step. It can be seen that this method offers a more informative assessment of similarity than the single origin point based similarity measure.

3.2 Analysis and practical examples of MVS

In this section, we present an analytical study to show that the proposed MVS could be a very effective similarity measure for data clustering. In order to demonstrate its advantages, MVS is compared with cosine similarity (CS) on how well they reflect the true group structure in document collections. Firstly, exploring Eq. (10), we have:

$$\mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right)$$
$$= d_i^t d_j - \frac{1}{n - n_r}\, d_i^t \sum_{d_h} d_h - \frac{1}{n - n_r}\, d_j^t \sum_{d_h} d_h + 1, \quad \|d_h\| = 1$$
$$= d_i^t d_j - \frac{1}{n - n_r}\, d_i^t D_{S \setminus S_r} - \frac{1}{n - n_r}\, d_j^t D_{S \setminus S_r} + 1$$
$$= d_i^t d_j - d_i^t C_{S \setminus S_r} - d_j^t C_{S \setminus S_r} + 1 \qquad (11)$$

where $D_{S \setminus S_r} = \sum_{d_h \in S \setminus S_r} d_h$ is the composite vector of all the documents outside cluster r, called the outer composite w.r.t. cluster r, and $C_{S \setminus S_r} = D_{S \setminus S_r}/(n - n_r)$ is the outer centroid w.r.t. cluster r, ∀r = 1, ..., k. From Eq. (11), when comparing two pairwise similarities MVS(di, dj) and MVS(di, dl), document dj is more similar to document di than the other document dl is, if and only if:

$$d_i^t d_j - d_j^t C_{S \setminus S_r} > d_i^t d_l - d_l^t C_{S \setminus S_r}$$
$$\Leftrightarrow\; \cos(d_i, d_j) - \cos(d_j, C_{S \setminus S_r})\,\|C_{S \setminus S_r}\| \;>\; \cos(d_i, d_l) - \cos(d_l, C_{S \setminus S_r})\,\|C_{S \setminus S_r}\| \qquad (12)$$

1:  procedure BUILDMVSMATRIX(A)
2:    for r ← 1 : c do
3:      D_{S\Sr} ← Σ_{di∉Sr} di
4:      n_{S\Sr} ← |S \ Sr|
5:    end for
6:    for i ← 1 : n do
7:      r ← class of di
8:      for j ← 1 : n do
9:        if dj ∈ Sr then
10:         aij ← di^t dj − di^t D_{S\Sr} / n_{S\Sr} − dj^t D_{S\Sr} / n_{S\Sr} + 1
11:       else
12:         aij ← di^t dj − di^t (D_{S\Sr} − dj) / (n_{S\Sr} − 1) − dj^t (D_{S\Sr} − dj) / (n_{S\Sr} − 1) + 1
13:       end if
14:     end for
15:   end for
16:   return A = {aij}_{n×n}
17: end procedure

Fig. 1. Procedure: Build MVS similarity matrix.
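Below is a compact Python rendering of the BUILDMVSMATRIX procedure of Fig. 1 (a sketch for illustration, not the authors' Java implementation), written with the simplification of Eq. (11) in mind: each entry only needs the dot product di^t dj and the outer composite vector of the class.

# Sketch: building the MVS similarity matrix of Fig. 1.
import numpy as np

def build_mvs_matrix(X, labels):
    """X: (n, m) unit document vectors; labels: (n,) class index of each document."""
    n = X.shape[0]
    D = X.sum(axis=0)
    A = np.empty((n, n))
    for i in range(n):
        r = labels[i]
        D_out = D - X[labels == r].sum(axis=0)     # D_{S\S_r}, outer composite of class r
        n_out = int(np.sum(labels != r))           # n_{S\S_r} = n - n_r
        for j in range(n):
            if labels[j] == r:                      # same class: line 10 of Fig. 1
                Do, no = D_out, n_out
            else:                                   # dj assumed into d_i's class: line 12
                Do, no = D_out - X[j], n_out - 1
            A[i, j] = X[i] @ X[j] - (X[i] @ Do) / no - (X[j] @ Do) / no + 1.0
    return A

rng = np.random.default_rng(2)
X = rng.random((10, 6)); X /= np.linalg.norm(X, axis=1, keepdims=True)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
A = build_mvs_matrix(X, labels)
print(A.shape)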

From this condition, it is seen that even when dl is considered “closer” to di in terms of CS, i.e. cos(di, dj) ≤ cos(di, dl), dl can still possibly be regarded as less similar to di based on MVS if, on the contrary, it is sufficiently “closer” to the outer centroid $C_{S \setminus S_r}$ than dj is. This is intuitively reasonable, since the “closer” dl is to $C_{S \setminus S_r}$, the greater the chance it actually belongs to another cluster rather than Sr and is, therefore, less similar to di. For this reason, MVS brings to the table an additional useful measure compared with CS.

To further justify the above proposal and analysis, we carried out a validity test for MVS and CS. The purpose of this test is to check how much a similarity measure coincides with the true class labels. It is based on one principle: if a similarity measure is appropriate for the clustering problem, then for any document in the corpus, the documents that are closest to it based on this measure should be in the same cluster with it.

The validity test is designed as follows. For each type of similarity measure, a similarity matrix A = {aij}n×n is created. For CS, this is simple, as aij = di^t dj. The procedure for building the MVS matrix is described in Fig. 1. Firstly, the outer composite w.r.t. each class is determined. Then, for each row ai of A, i = 1, ..., n, if the pair of documents di and dj, j = 1, ..., n, are in the same class, aij is calculated as in line 10, Fig. 1. Otherwise, dj is assumed to be in di's class, and aij is calculated as in line 12, Fig. 1. After matrix A is formed, the procedure in Fig. 2 is used to get its validity score. For each document di corresponding to row ai of A, we select the qr documents closest to di. The value of qr is chosen relatively as a percentage of the size of the class r that contains di, where percentage ∈ (0, 1]. Then, validity w.r.t. di is calculated as the fraction of these qr documents having the same class label as di, as in line 12, Fig. 2. The final validity is determined by averaging over all the rows of A, as in line 14, Fig. 2.


Require: 0 < percentage ≤ 1
1:  procedure GETVALIDITY(validity, A, percentage)
2:    for r ← 1 : c do
3:      qr ← ⌊percentage × nr⌋
4:      if qr = 0 then            ▹ percentage too small
5:        qr ← 1
6:      end if
7:    end for
8:    for i ← 1 : n do
9:      {aiv[1], ..., aiv[n]} ← Sort {ai1, ..., ain}
10:       s.t. aiv[1] ≥ aiv[2] ≥ ... ≥ aiv[n], {v[1], ..., v[n]} ← permute {1, ..., n}
11:     r ← class of di
12:     validity(di) ← |{dv[1], ..., dv[qr]} ∩ Sr| / qr
13:   end for
14:   validity ← (Σ_{i=1}^{n} validity(di)) / n
15:   return validity
16: end procedure

Fig. 2. Procedure: Get validity score.
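A Python sketch of the validity-score procedure of Fig. 2 follows (illustration only; excluding a document from its own neighbor list is my choice, as Fig. 2 does not state it explicitly).

# Sketch: validity score from a precomputed similarity matrix A and class labels.
import numpy as np

def get_validity(A, labels, percentage):
    assert 0 < percentage <= 1
    n = A.shape[0]
    scores = np.empty(n)
    for i in range(n):
        r = labels[i]
        n_r = int(np.sum(labels == r))
        q_r = max(1, int(percentage * n_r))        # floor, but at least 1
        order = np.argsort(-A[i])                  # indices sorted by decreasing a_ij
        order = order[order != i][:q_r]            # q_r nearest neighbors of d_i
        scores[i] = np.mean(labels[order] == r)    # same-class fraction, line 12
    return float(scores.mean())                    # average over all documents, line 14

# Example with a cosine similarity matrix A = X X^t on unit vectors:
rng = np.random.default_rng(3)
X = rng.random((12, 5)); X /= np.linalg.norm(X, axis=1, keepdims=True)
labels = np.repeat(np.arange(3), 4)
print(get_validity(X @ X.T, labels, percentage=0.5))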

It is clear that the validity score is bounded between 0 and 1. The higher the validity score a similarity measure has, the more suitable it should be for the clustering task.

Two real-world document datasets are used as examples in this validity test. The first is reuters7, a subset of the famous collection, Reuters-21578 Distribution 1.0, of Reuters newswire articles1. Reuters-21578 is one of the most widely used test collections for text categorization. In our validity test, we selected 2,500 documents from the largest 7 categories: “acq”, “crude”, “interest”, “earn”, “money-fx”, “ship” and “trade” to form reuters7. Some of the documents may appear in more than one category. The second dataset is k1b, a collection of 2,340 web pages from the Yahoo! subject hierarchy, including 6 topics: “health”, “entertainment”, “sport”, “politics”, “tech” and “business”. It was created from a past study in information retrieval called WebAce [26], and is now available with the CLUTO toolkit [19].

The two datasets were preprocessed by stop-word removal and stemming. Moreover, we removed words that appear in less than two documents or in more than 99.5% of the total number of documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors. The full characteristics of reuters7 and k1b are presented in Fig. 3.

Fig. 4 shows the validity scores of CS and MVS on the two datasets relative to the parameter percentage. The value of percentage is set at 0.001, 0.01, 0.05, 0.1, 0.2, ..., 1.0. According to Fig. 4, MVS is clearly better than CS for both datasets in this validity test. For example, with the k1b dataset at percentage = 1.0, MVS's validity score is 0.80, while that of CS is only 0.67.

1. http://www.daviddlewis.com/resources/testcollections/reuters21578/

Fig. 3. Characteristics of the reuters7 and k1b datasets. reuters7 (7 classes, 2,500 documents, 4,977 words) class distribution: earn 43%, acq 29%, crude 8%, money-fx 7%, interest 5%, ship 4%, trade 4%. k1b (6 classes, 2,340 documents, 13,859 words) class distribution: entertainment 59%, health 21%, business 6%, sports 6%, politics 5%, tech 3%.

Fig. 4. CS and MVS validity test: validity score (0.50 to 1.00) versus percentage (0 to 1) for k1b-CS, k1b-MVS, reuters7-CS and reuters7-MVS.

This indicates that, on average, when we pick any document and consider its neighborhood of size equal to its true class size, only 67% of that document's neighbors based on CS actually belong to its class. If based on MVS, the number of valid neighbors increases to 80%. The validity test has illustrated the potential advantage of the new multi-viewpoint based similarity measure compared to the cosine measure.

4 MULTI-VIEWPOINT BASED CLUSTERING

4.1 Two clustering criterion functions IR and IV

Having defined our similarity measure, we now formulate our clustering criterion functions. The first function, called IR, is the cluster size-weighted sum of average pairwise similarities of documents in the same cluster. Firstly, let us express this sum in a general form by function F:

$$F = \sum_{r=1}^{k} n_r \left[ \frac{1}{n_r^2} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) \right] \qquad (13)$$

We would like to transform this objective function into some suitable form such that it could facilitate the optimization procedure to be performed in a simple, fast and effective way.


According to Eq. (10):

$$\sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) = \sum_{d_i, d_j \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h)$$
$$= \frac{1}{n - n_r} \sum_{d_i, d_j} \sum_{d_h} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right)$$

Since

$$\sum_{d_i \in S_r} d_i = \sum_{d_j \in S_r} d_j = D_r, \quad \sum_{d_h \in S \setminus S_r} d_h = D - D_r \quad \text{and} \quad \|d_h\| = 1,$$

we have

$$\sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) = \sum_{d_i, d_j \in S_r} d_i^t d_j - \frac{2 n_r}{n - n_r} \sum_{d_i \in S_r} d_i^t \sum_{d_h \in S \setminus S_r} d_h + n_r^2$$
$$= D_r^t D_r - \frac{2 n_r}{n - n_r} D_r^t (D - D_r) + n_r^2$$
$$= \frac{n + n_r}{n - n_r} \|D_r\|^2 - \frac{2 n_r}{n - n_r} D_r^t D + n_r^2$$

Substituting into Eq. (13) to get:

$$F = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r} \|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] + n$$

Because n is constant, maximizing F is equivalent to maximizing $\hat{F}$:

$$\hat{F} = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r} \|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] \qquad (14)$$

If we compare $\hat{F}$ with the min-max cut in Eq. (5), both functions contain the two terms $\|D_r\|^2$ (an intra-cluster similarity measure) and $D_r^t D$ (an inter-cluster similarity measure). Nonetheless, while the objective of min-max cut is to minimize the inverse ratio between these two terms, our aim here is to maximize their weighted difference. In $\hat{F}$, this difference term is determined for each cluster. They are weighted by the inverse of the cluster's size, before being summed up over all the clusters. One problem is that this formulation is expected to be quite sensitive to cluster size. From the formulation of COSA [27] - a widely known subspace clustering algorithm - we have learned that it is desirable to have a set of weight factors λ = {λr}^k_1 to regulate the distribution of these cluster sizes in clustering solutions. Hence, we integrate λ into the expression of $\hat{F}$ to have it become:

$$\hat{F}_{\lambda} = \sum_{r=1}^{k} \frac{\lambda_r}{n_r} \left[ \frac{n + n_r}{n - n_r} \|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] \qquad (15)$$

In common practice, $\{\lambda_r\}_1^k$ are often taken to be simple functions of the respective cluster sizes $\{n_r\}_1^k$ [28]. Let us use a parameter α, called the regulating factor, which has some constant value (α ∈ [0, 1]), and let $\lambda_r = n_r^{\alpha}$ in Eq. (15); the final form of our criterion function IR is:

$$I_R = \sum_{r=1}^{k} \frac{1}{n_r^{1-\alpha}} \left[ \frac{n + n_r}{n - n_r} \|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] \qquad (16)$$

In the empirical study of Section 5.4, it appears that IR's performance does not depend very critically on the value of α. The criterion function yields relatively good clustering results for α ∈ (0, 1).
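For reference, a small Python sketch of evaluating IR in Eq. (16) for a given partition follows (my formulation, not the authors' code); it uses only the cluster sizes nr and the composite vectors Dr.

# Sketch: the I_R criterion of Eq. (16) computed from a partition of unit vectors.
import numpy as np

def criterion_IR(X, labels, k, alpha=0.3):
    n = X.shape[0]
    D = X.sum(axis=0)
    total = 0.0
    for r in range(k):
        Xr = X[labels == r]
        n_r = Xr.shape[0]
        D_r = Xr.sum(axis=0)
        w = (n + n_r) / (n - n_r)
        total += (w * D_r @ D_r - (w - 1.0) * D_r @ D) / n_r ** (1.0 - alpha)
    return float(total)

rng = np.random.default_rng(4)
X = rng.random((20, 8)); X /= np.linalg.norm(X, axis=1, keepdims=True)
labels = np.repeat(np.arange(3), [7, 7, 6])
print(criterion_IR(X, labels, k=3, alpha=0.3))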

In the formulation of IR, cluster quality is measured by the average pairwise similarity between documents within that cluster. However, such an approach can lead to sensitivity to the size and tightness of the clusters. With CS, for example, the pairwise similarity of documents in a sparse cluster is usually smaller than that in a dense cluster. Though not as clearly as with CS, it is still possible that the same effect may hinder MVS-based clustering if pairwise similarity is used. To prevent this, an alternative approach is to consider the similarity between each document vector and its cluster's centroid instead. This is expressed in objective function G:

$$G = \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}\!\left( d_i - d_h,\; \frac{C_r}{\|C_r\|} - d_h \right)$$
$$= \sum_{r=1}^{k} \frac{1}{n - n_r} \sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\|C_r\|} - d_h \right) \qquad (17)$$

Similar to the formulation of IR, we would like to express this objective in a simple form that we could optimize more easily. Expanding the vector dot product, we get:

$$\sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\|C_r\|} - d_h \right) = \sum_{d_i} \sum_{d_h} \left( d_i^t \frac{C_r}{\|C_r\|} - d_i^t d_h - d_h^t \frac{C_r}{\|C_r\|} + 1 \right)$$
$$= (n - n_r)\, D_r^t \frac{D_r}{\|D_r\|} - D_r^t (D - D_r) - n_r (D - D_r)^t \frac{D_r}{\|D_r\|} + n_r (n - n_r), \quad \text{since } \frac{C_r}{\|C_r\|} = \frac{D_r}{\|D_r\|}$$
$$= (n + \|D_r\|)\, \|D_r\| - (n_r + \|D_r\|)\, \frac{D_r^t D}{\|D_r\|} + n_r (n - n_r)$$

Substituting the above into Eq. (17) to have:

$$G = \sum_{r=1}^{k} \left[ \frac{n + \|D_r\|}{n - n_r} \|D_r\| - \left( \frac{n + \|D_r\|}{n - n_r} - 1 \right) \frac{D_r^t D}{\|D_r\|} \right] + n$$

Again, we could eliminate n because it is a constant.Maximizing G is equivalent to maximizing IV below:

$$I_V = \sum_{r=1}^{k} \left[ \frac{n + \|D_r\|}{n - n_r} \|D_r\| - \left( \frac{n + \|D_r\|}{n - n_r} - 1 \right) \frac{D_r^t D}{\|D_r\|} \right] \qquad (18)$$

IV calculates the weighted difference between the two terms $\|D_r\|$ and $D_r^t D / \|D_r\|$, which again represent an intra-cluster similarity measure and an inter-cluster similarity measure, respectively.


1:  procedure INITIALIZATION
2:    Select k seeds s1, ..., sk randomly
3:    cluster[di] ← p = arg max_r {s_r^t di}, ∀i = 1, ..., n
4:    Dr ← Σ_{di∈Sr} di, nr ← |Sr|, ∀r = 1, ..., k
5:  end procedure
6:  procedure REFINEMENT
7:    repeat
8:      {v[1 : n]} ← random permutation of {1, ..., n}
9:      for j ← 1 : n do
10:       i ← v[j]
11:       p ← cluster[di]
12:       ΔIp ← I(np − 1, Dp − di) − I(np, Dp)
13:       q ← arg max_{r, r≠p} {I(nr + 1, Dr + di) − I(nr, Dr)}
14:       ΔIq ← I(nq + 1, Dq + di) − I(nq, Dq)
15:       if ΔIp + ΔIq > 0 then
16:         Move di to cluster q: cluster[di] ← q
17:         Update Dp, np, Dq, nq
18:       end if
19:     end for
20:   until no move for all n documents
21: end procedure

Fig. 5. Algorithm: Incremental clustering.

The first term is actually equivalent to an element of the sum in the spherical k-means objective function in Eq. (4); the second one is similar to an element of the sum in the min-max cut criterion in Eq. (6), but with $\|D_r\|$ as the scaling factor instead of $\|D_r\|^2$. We have presented our clustering criterion functions IR and IV in their simple forms. Next, we show how to perform clustering by using a greedy algorithm to optimize these functions.
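Before turning to the optimization algorithm, here is a short Python sketch of IV written in the per-cluster form of Eq. (19), IV = Σr Ir(nr, Dr); the helper names are mine, for illustration only.

# Sketch: the I_V criterion of Eq. (18) as a sum of per-cluster terms I_r(n_r, D_r).
import numpy as np

def Ir_value(n, D, n_r, D_r):
    """Objective contribution of one cluster, a function of (n_r, D_r) only."""
    norm_Dr = np.linalg.norm(D_r)
    w = (n + norm_Dr) / (n - n_r)
    return w * norm_Dr - (w - 1.0) * (D_r @ D) / norm_Dr

def criterion_IV(X, labels, k):
    n = X.shape[0]
    D = X.sum(axis=0)
    return float(sum(Ir_value(n, D, int(np.sum(labels == r)),
                              X[labels == r].sum(axis=0)) for r in range(k)))

rng = np.random.default_rng(5)
X = rng.random((20, 8)); X /= np.linalg.norm(X, axis=1, keepdims=True)
labels = np.repeat(np.arange(4), 5)
print(criterion_IV(X, labels, k=4))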

4.2 Optimization algorithm and complexity

We denote our clustering framework by MVSC, meaning Clustering with Multi-Viewpoint based Similarity. Subsequently, we have MVSC-IR and MVSC-IV, which are MVSC with criterion functions IR and IV respectively. The main goal is to perform document clustering by optimizing IR in Eq. (16) and IV in Eq. (18). For this purpose, the incremental k-way algorithm [18], [29] - a sequential version of k-means - is employed. Considering that the expression of IV in Eq. (18) depends only on nr and Dr, r = 1, ..., k, IV can be written in a general form:

$$I_V = \sum_{r=1}^{k} I_r(n_r, D_r) \qquad (19)$$

where Ir(nr, Dr) corresponds to the objective value of cluster r. The same applies to IR. With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is described in Fig. 5. At Initialization, k arbitrary documents are selected to be the seeds from which initial partitions are formed.

Refinement is a procedure that consists of a number of iterations. During each iteration, the n documents are visited one by one in a totally random order. Each document is checked to see whether moving it to another cluster results in an improvement of the objective function. If yes, the document is moved to the cluster that leads to the highest improvement. If no cluster is better than the current cluster, the document is not moved. The clustering process terminates when an iteration completes without any documents being moved to new clusters. Unlike the traditional k-means, this algorithm is a stepwise optimal procedure. While k-means only updates after all n documents have been re-assigned, the incremental clustering algorithm updates immediately whenever a document is moved to a new cluster. Since every move that happens increases the objective function value, convergence to a local optimum is guaranteed.
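A condensed Python sketch of this refinement loop follows (illustration only); the per-cluster term is the IV term of Eq. (18), and the guard for degenerate cluster sizes is my own convention, not something the paper specifies.

# Sketch: incremental refinement in the spirit of Fig. 5, with delta evaluation
# of a per-cluster objective I_r(n_r, D_r).
import numpy as np

def Ir(n, D, n_r, D_r):
    if n_r <= 0 or n_r >= n:
        return 0.0                                  # degenerate sizes: my convention
    norm_Dr = np.linalg.norm(D_r)
    w = (n + norm_Dr) / (n - n_r)
    return w * norm_Dr - (w - 1.0) * (D_r @ D) / norm_Dr

def refine(X, labels, k, max_iter=50, seed=0):
    n = X.shape[0]
    D = X.sum(axis=0)
    sizes = np.array([np.sum(labels == r) for r in range(k)])
    comps = np.stack([X[labels == r].sum(axis=0) for r in range(k)])
    rng = np.random.default_rng(seed)
    for _ in range(max_iter):
        moved = False
        for i in rng.permutation(n):                # visit documents in random order
            p = labels[i]
            d_p = Ir(n, D, sizes[p] - 1, comps[p] - X[i]) - Ir(n, D, sizes[p], comps[p])
            gains = [Ir(n, D, sizes[q] + 1, comps[q] + X[i]) - Ir(n, D, sizes[q], comps[q])
                     if q != p else -np.inf for q in range(k)]
            q = int(np.argmax(gains))
            if d_p + gains[q] > 0:                  # move only if the objective increases
                labels[i] = q
                sizes[p] -= 1; comps[p] -= X[i]
                sizes[q] += 1; comps[q] += X[i]
                moved = True
        if not moved:
            break
    return labels

rng = np.random.default_rng(6)
X = rng.random((30, 10)); X /= np.linalg.norm(X, axis=1, keepdims=True)
labels = rng.integers(0, 3, size=30)
print(np.bincount(refine(X, labels, k=3), minlength=3))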

During the optimization procedure, in each iteration,the main sources of computational cost are:

• Searching for optimum clusters to move individual documents to: O(nz · k).

• Updating composite vectors as a result of suchmoves: O(m · k).

where nz is the total number of non-zero entries in all document vectors. Our clustering approach is partitional and incremental; therefore, computing a similarity matrix is not needed at all. If τ denotes the number of iterations the algorithm takes, since nz is often several tens of times larger than m in the document domain, the computational complexity required for clustering with IR and IV is O(nz · k · τ).

5 PERFORMANCE EVALUATION OF MVSC

To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-IR and MVSC-IV with the existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include Euclidean distance, cosine similarity and the extended Jaccard coefficient.

5.1 Document collections

The data corpora that we used for experiments consist of twenty benchmark document datasets. Besides reuters7 and k1b, which have been described in detail earlier, we included another eighteen text collections so that the examination of the clustering methods is more thorough and exhaustive. Similar to k1b, these datasets are provided together with CLUTO by the toolkit's authors [19]. They had been used for experimental testing in previous papers, and their source and origin had also been described in detail [30], [31]. Table 2 summarizes their characteristics. The corpora present a diversity of size, number of classes and class balance. They were all preprocessed by standard procedures, including stop-word removal, stemming, removal of too rare as well as too frequent words, TF-IDF weighting and normalization.


TABLE 2
Document datasets

Data      Source               c    n      m       Balance
fbis      TREC                17   2,463   2,000   0.075
hitech    TREC                 6   2,301  13,170   0.192
k1a       WebACE              20   2,340  13,859   0.018
k1b       WebACE               6   2,340  13,859   0.043
la1       TREC                 6   3,204  17,273   0.290
la2       TREC                 6   3,075  15,211   0.274
re0       Reuters             13   1,504   2,886   0.018
re1       Reuters             25   1,657   3,758   0.027
tr31      TREC                 7     927  10,127   0.006
reviews   TREC                 5   4,069  23,220   0.099
wap       WebACE              20   1,560   8,440   0.015
classic   CACM/CISI/CRAN/MED   4   7,089  12,009   0.323
la12      TREC                 6   6,279  21,604   0.282
new3      TREC                44   9,558  36,306   0.149
sports    TREC                 7   8,580  18,324   0.036
tr11      TREC                 9     414   6,424   0.045
tr12      TREC                 8     313   5,799   0.097
tr23      TREC                 6     204   5,831   0.066
tr45      TREC                10     690   8,260   0.088
reuters7  Reuters              7   2,500   4,977   0.082

c: # of classes, n: # of documents, m: # of words
Balance = (smallest class size)/(largest class size)

5.2 Experimental setup and evaluation

To demonstrate how well MVSCs can perform, we compare them with five other clustering methods on the twenty datasets in Table 2. In summary, the seven clustering algorithms are:

• MVSC-IR: MVSC using criterion function IR
• MVSC-IV: MVSC using criterion function IV
• k-means: standard k-means with Euclidean distance
• Spkmeans: spherical k-means with CS
• graphCS: CLUTO's graph method with CS
• graphEJ: CLUTO's graph with extended Jaccard
• MMC: Spectral Min-Max Cut algorithm [13]

Our MVSC-IR and MVSC-IV programs are implemented in Java. The regulating factor α in IR is always set at 0.3 during the experiments. We observed that this is one of the most appropriate values. A study of MVSC-IR's performance relative to different α values is presented in a later section. The other algorithms are provided by the C library interface which is available freely with the CLUTO toolkit [19]. For each dataset, the cluster number is predefined to be equal to the number of true classes, i.e. k = c.

None of the above algorithms is guaranteed to find the global optimum, and all of them are initialization-dependent. Hence, for each method, we performed clustering a few times with randomly initialized values, and chose the best trial in terms of the corresponding objective function value. In all the experiments, each test run consisted of 10 trials. Moreover, the result reported here on each dataset by a particular clustering method is the average of 10 test runs.

After a test run, the clustering solution is evaluated by comparing the documents' assigned labels with their true labels provided by the corpus. Three types of external evaluation metric are used to assess clustering performance. They are the FScore, Normalized Mutual Information (NMI) and Accuracy. FScore is an equally weighted combination of the “precision” (P) and “recall” (R) values used in information retrieval. Given a clustering solution, FScore is determined as:

$$FScore = \sum_{i=1}^{k} \frac{n_i}{n} \max_j (F_{i,j}), \quad \text{where } F_{i,j} = \frac{2 \times P_{i,j} \times R_{i,j}}{P_{i,j} + R_{i,j}}; \quad P_{i,j} = \frac{n_{i,j}}{n_j}, \; R_{i,j} = \frac{n_{i,j}}{n_i}$$

where ni denotes the number of documents in class i, nj the number of documents assigned to cluster j, and ni,j the number of documents shared by class i and cluster j. From another aspect, NMI measures the information that the true class partition and the cluster assignment share. It measures how much knowing about the clusters helps us know about the classes:

$$NMI = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{i,j} \log\left( \frac{n \cdot n_{i,j}}{n_i n_j} \right)}{\sqrt{\left( \sum_{i=1}^{k} n_i \log \frac{n_i}{n} \right) \left( \sum_{j=1}^{k} n_j \log \frac{n_j}{n} \right)}}$$

Finally, Accuracy measures the fraction of documents that are correctly labeled, assuming a one-to-one correspondence between true classes and assigned clusters. Let q denote any possible permutation of the index set {1, ..., k}; Accuracy is calculated by:

$$Accuracy = \frac{1}{n} \max_q \sum_{i=1}^{k} n_{i, q(i)}$$

The best mapping q to determine Accuracy can be found by the Hungarian algorithm2. For all three metrics, the range is from 0 to 1, and a greater value indicates a better clustering solution.
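The following Python sketch computes the three metrics from a class-by-cluster contingency table (my implementation of the formulas above, not the authors' evaluation code); the Hungarian step uses scipy.optimize.linear_sum_assignment.

# Sketch: FScore, NMI and Accuracy from the counts n_i, n_j and n_{i,j}.
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(classes, clusters):
    c, k = classes.max() + 1, clusters.max() + 1
    M = np.zeros((c, k), dtype=float)               # M[i, j] = n_{i,j}
    for ci, cj in zip(classes, clusters):
        M[ci, cj] += 1
    return M

def fscore(M):
    n = M.sum()
    ni, nj = M.sum(axis=1), M.sum(axis=0)
    P = M / nj                                       # P_{i,j} = n_{i,j} / n_j
    R = M / ni[:, None]                              # R_{i,j} = n_{i,j} / n_i
    with np.errstate(divide="ignore", invalid="ignore"):
        F = np.where(P + R > 0, 2 * P * R / (P + R), 0.0)
    return float(np.sum(ni / n * F.max(axis=1)))

def nmi(M):
    n = M.sum()
    ni, nj = M.sum(axis=1), M.sum(axis=0)
    outer = np.outer(ni, nj)
    nz = M > 0
    num = np.sum(M[nz] * np.log(n * M[nz] / outer[nz]))
    den = np.sqrt(np.sum(ni * np.log(ni / n)) * np.sum(nj * np.log(nj / n)))
    return float(num / den)

def accuracy(M):
    row, col = linear_sum_assignment(-M)             # best one-to-one class/cluster mapping
    return float(M[row, col].sum() / M.sum())

classes = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
clusters = np.array([1, 1, 0, 0, 0, 0, 2, 2, 2])
M = contingency(classes, clusters)
print(fscore(M), nmi(M), accuracy(M))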

5.3 Results

Fig. 6 shows the Accuracy of the seven clustering algorithms on the twenty text collections. Presented in a different way, clustering results based on FScore and NMI are reported in Table 3 and Table 4 respectively. For each dataset in a row, the value in bold and underlined is the best result, while the value in bold only is the second best.

It can be observed that MVSC-IR and MVSC-IV perform consistently well. In Fig. 6, on 19 out of 20 datasets (all except reviews), either both or one of the MVSC approaches are in the top two algorithms. The next most consistent performer is Spkmeans. The other algorithms might work well on certain datasets. For example, graphEJ yields an outstanding result on classic; graphCS and MMC are good on reviews. But they do not fare very well on the rest of the collections.

2. http://en.wikipedia.org/wiki/Hungarian_algorithm


Fig. 6. Clustering results in Accuracy (two panels: fbis, hitech, k1a, k1b, la1, la2, re0, re1, tr31, reviews and wap, classic, la12, new3, sports, tr11, tr12, tr23, tr45, reuters7) for MVSC-IR, MVSC-IV, kmeans, Spkmeans, graphCS, graphEJ and MMC. Left-to-right in legend corresponds to left-to-right in the plot.

To have a statistical justification of the clustering performance comparisons, we also carried out statistical significance tests. Each of MVSC-IR and MVSC-IV was paired up with one of the remaining algorithms for a paired t-test [32]. Given two paired sets X and Y of N measured values, the null hypothesis of the test is that the differences between X and Y come from a population with mean 0. The alternative hypothesis is that the paired sets differ from each other in a significant way. In our experiment, these tests were done based on the evaluation values obtained on the twenty datasets. The typical 5% significance level was used. For example, considering the pair (MVSC-IR, k-means), from Table 3, it is seen that MVSC-IR dominates k-means w.r.t. FScore. If the paired t-test returns a p-value smaller than 0.05, we reject the null hypothesis and say that the dominance is significant. Otherwise, the null hypothesis cannot be rejected and the comparison is considered insignificant.

The outcomes of the paired t-tests are presented in Table 5. As the paired t-tests show, the advantage of MVSC-IR and MVSC-IV over the other methods is statistically significant. A special case is the graphEJ algorithm. On the one hand, MVSC-IR is not significantly better than graphEJ if based on FScore or NMI. On the other hand, even where MVSC-IR and MVSC-IV test as significantly better than graphEJ, the p-values can still be considered relatively large, although they are smaller than 0.05. The reason is that, as observed before, graphEJ's results on the classic dataset are very different from those of the other algorithms. While interesting, these values can be considered as outliers, and including them in the statistical tests would affect the outcomes greatly. Hence, we also report in Table 5 the tests where classic was excluded and only results on the other 19 datasets were used.
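A minimal sketch of such a paired test using scipy follows; the numbers are placeholders, not the values from Tables 3 and 4.

# Sketch: paired t-test at the 5% significance level for two methods measured
# on the same datasets (one score per dataset, paired by dataset).
import numpy as np
from scipy.stats import ttest_rel

method_a = np.array([0.65, 0.51, 0.62, 0.87, 0.72, 0.72, 0.46, 0.51])
method_b = np.array([0.58, 0.47, 0.50, 0.83, 0.57, 0.54, 0.42, 0.46])

t_stat, p_value = ttest_rel(method_a, method_b)
better = method_a.mean() > method_b.mean()
significant = p_value < 0.05                 # 5% significance level, as in the paper
print(t_stat, p_value, better, significant)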


TABLE 3
Clustering results in FScore

Data      MVSC-IR  MVSC-IV  k-means  Spkmeans  graphCS  graphEJ  MMC
fbis      .645     .613     .578     .584      .482     .503     .506
hitech    .512     .528     .467     .494      .492     .497     .468
k1a       .620     .592     .502     .545      .492     .517     .524
k1b       .873     .775     .825     .729      .740     .743     .707
la1       .719     .723     .565     .719      .689     .679     .693
la2       .721     .749     .538     .703      .689     .633     .698
re0       .460     .458     .421     .421      .468     .454     .390
re1       .514     .492     .456     .499      .487     .457     .443
tr31      .728     .780     .585     .679      .689     .698     .607
reviews   .734     .748     .644     .730      .759     .690     .749
wap       .610     .571     .516     .545      .513     .497     .513
classic   .658     .734     .713     .687      .708     .983     .657
la12      .719     .735     .559     .722      .706     .671     .693
new3      .548     .547     .500     .558      .510     .496     .482
sports    .803     .804     .499     .702      .689     .696     .650
tr11      .749     .728     .705     .719      .665     .658     .695
tr12      .743     .758     .699     .715      .642     .722     .700
tr23      .560     .553     .486     .523      .522     .531     .485
tr45      .787     .788     .692     .799      .778     .798     .720
reuters7  .774     .775     .658     .718      .651     .670     .687

TABLE 4
Clustering results in NMI

Data      MVSC-IR  MVSC-IV  k-means  Spkmeans  graphCS  graphEJ  MMC
fbis      .606     .595     .584     .593      .527     .524     .556
hitech    .323     .329     .270     .298      .279     .292     .283
k1a       .612     .594     .563     .596      .537     .571     .588
k1b       .739     .652     .629     .649      .635     .650     .645
la1       .569     .571     .397     .565      .490     .485     .553
la2       .568     .590     .381     .563      .496     .478     .566
re0       .399     .402     .388     .399      .367     .342     .414
re1       .591     .583     .532     .593      .581     .566     .515
tr31      .613     .658     .488     .594      .577     .580     .548
reviews   .584     .603     .460     .607      .570     .528     .639
wap       .611     .585     .568     .596      .557     .555     .575
classic   .574     .644     .579     .577      .558     .928     .543
la12      .574     .584     .378     .568      .496     .482     .558
new3      .621     .622     .578     .626      .580     .580     .577
sports    .669     .701     .445     .633      .578     .581     .591
tr11      .712     .674     .660     .671      .634     .594     .666
tr12      .686     .686     .647     .654      .578     .626     .640
tr23      .432     .434     .363     .413      .344     .380     .369
tr45      .734     .733     .640     .748      .726     .713     .667
reuters7  .633     .632     .512     .612      .503     .520     .591

Under this circumstance, both MVSC-IR and MVSC-IVoutperform graphEJ significantly with good p-values.

5.4 Effect of α on MVSC-IR's performance

It has been known that criterion function based partitional clustering methods can be sensitive to cluster size and balance. In the formulation of IR in Eq. (16), there exists a parameter α, called the regulating factor, α ∈ [0, 1]. To examine how the determination of α could affect MVSC-IR's performance, we evaluated MVSC-IR with different values of α from 0 to 1, in increments of 0.1. The assessment was done based on the clustering results in NMI, FScore and Accuracy, each averaged over all the twenty given datasets. Since the evaluation metrics for different datasets could be very different from each other, simply taking the average over all the datasets would not be very meaningful. Hence, we employed the method used in [18] to transform the metrics into relative metrics before averaging.


TABLE 5
Statistical significance of comparisons based on paired t-tests with 5% significance level

                      k-means     Spkmeans    graphCS     graphEJ*              MMC
FScore    MVSC-IR     ≫ 1.77E-5   ≫ 1.60E-3   ≫ 4.61E-4   > .056 (≫ 7.68E-6)    ≫ 3.27E-6
          MVSC-IV     ≫ 7.52E-5   ≫ 1.42E-4   ≫ 3.27E-5   ≫ .022 (≫ 1.50E-6)    ≫ 2.16E-7
NMI       MVSC-IR     ≫ 7.42E-6   ≫ .013      ≫ 2.39E-7   > .060 (≫ 1.65E-8)    ≫ 8.72E-5
          MVSC-IV     ≫ 4.27E-5   ≫ .013      ≫ 4.07E-7   ≫ .029 (≫ 4.36E-7)    ≫ 2.52E-4
Accuracy  MVSC-IR     ≫ 1.45E-6   ≫ 1.50E-4   ≫ 1.33E-4   ≫ .028 (≫ 3.29E-5)    ≫ 8.33E-7
          MVSC-IV     ≫ 1.74E-5   ≫ 1.82E-4   ≫ 4.19E-5   ≫ .014 (≫ 8.61E-6)    ≫ 9.80E-7

“≫” (or “≪”) indicates the algorithm in the row performs significantly better (or worse) than the one in the column; “>” (or “<”) indicates an insignificant comparison. The values following the symbols are the p-values of the t-tests.
* Column of graphEJ: entries in parentheses are statistics when the classic dataset is not included.

Fig. 7. MVSC-IR's performance with respect to α: average relative_FScore, relative_NMI and relative_Accuracy (roughly 0.9 to 1.2) plotted for α from 0 to 1.

On a particular document collection S, the relative FScore measure of MVSC-IR with α = αi is determined as follows:

$$\mathrm{relative\_FScore}(I_R; S, \alpha_i) = \frac{\max_{\alpha_j} \{ FScore(I_R; S, \alpha_j) \}}{FScore(I_R; S, \alpha_i)}$$

where αi, αj ∈ {0.0, 0.1, ..., 1.0} and FScore(IR; S, αi) is the FScore result on dataset S obtained by MVSC-IR with α = αi. The same transformation was applied to NMI and Accuracy to yield relative_NMI and relative_Accuracy respectively. MVSC-IR performs best with a particular αi if its relative measure has a value of 1. Otherwise, its relative measure is greater than 1; the larger this value is, the worse MVSC-IR with αi performs in comparison with other settings of α. Finally, the average relative measures were calculated over all the datasets to present the overall performance.
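A small sketch of this relative-metric transformation follows; the array contents are illustrative placeholders, not results from the paper.

# Sketch: best score over all alpha settings divided by the score at each alpha,
# then averaged across datasets.
import numpy as np

# scores[s, a] = FScore of MVSC-IR on dataset s with the a-th value of alpha
scores = np.array([[0.70, 0.74, 0.75, 0.73],
                   [0.58, 0.61, 0.62, 0.60],
                   [0.80, 0.83, 0.82, 0.81]])

relative = scores.max(axis=1, keepdims=True) / scores   # >= 1, equals 1 at the best alpha
average_relative = relative.mean(axis=0)                 # averaged over datasets
print(average_relative)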

Figure 7 shows the plot of average relative FScore, NMI and Accuracy w.r.t. different values of α. In a broad view, MVSC-IR performs the worst at the extreme values of α (0 and 1), and tends to get better when α is set at some intermediate value between 0 and 1. Based on our experimental study, MVSC-IR always produces results within 5% of the best case, for all three types of evaluation metric, with α from 0.2 to 0.8.

6 MVSC AS REFINEMENT FOR k-MEANS

From the analysis of Eq. (12) in Section 3.2, MVS provides an additional criterion for measuring the similarity among documents compared with CS. Alternatively, MVS can be considered as a refinement for CS, and consequently MVSC algorithms as refinements for spherical k-means, which uses CS. To further investigate the appropriateness and effectiveness of MVS and its clustering algorithms, we carried out another set of experiments in which solutions obtained by Spkmeans were further optimized by MVSC-IR and MVSC-IV. The rationale for doing so is that if the final solutions by MVSC-IR and MVSC-IV are better than the intermediate ones obtained by Spkmeans, MVS is indeed good for the clustering problem. These experiments would reveal more clearly whether MVS actually improves the clustering performance compared with CS.

In the previous section, the MVSC algorithms were compared against existing algorithms that are closely related to them, i.e. ones that also employ similarity measures and criterion functions. In this section, we make use of the extended experiments to further compare MVSC with a different type of clustering approach, the NMF methods [10], which do not use any form of explicitly defined similarity measure for documents.

6.1 TDT2 and Reuters-21578 collections

For variety and thoroughness, in this empirical study, we used two new document corpora described in Table 6: TDT2 and Reuters-21578. The original TDT2 corpus3, which consists of 11,201 documents in 96 topics (i.e. classes), has been one of the most standard sets for the document clustering task. We used a sub-collection of this corpus which contains 10,021 documents in the largest 56 topics. The Reuters-21578 Distribution 1.0 has been mentioned earlier in this paper. The original corpus consists of 21,578 documents in 135 topics. We used a sub-collection having 8,213 documents from the largest 41 topics.

3. http://www.nist.gov/speech/tests/tdt/tdt98/index.html




TABLE 6
TDT2 and Reuters-21578 document corpora

                             TDT2      Reuters-21578
Total number of documents    10,021    8,213
Total number of classes      56        41
Largest class size           1,844     3,713
Smallest class size          10        10

We used a sub-collection having 8,213 documents from the largest 41 topics. The same two document collections were used in the paper on the NMF methods [10]. Documents that appear in two or more topics were removed, and the remaining documents were preprocessed in the same way as in Section 5.1.

6.2 Experiments and results

The following clustering methods:

• Spkmeans: spherical k-means
• rMVSC-IR: refinement of Spkmeans by MVSC-IR
• rMVSC-IV: refinement of Spkmeans by MVSC-IV
• MVSC-IR: normal MVSC using criterion IR
• MVSC-IV: normal MVSC using criterion IV

and two new document clustering approaches that do not use any particular form of similarity measure:

• NMF: Non-negative Matrix Factorization method
• NMF-NCW: Normalized Cut Weighted NMF

were involved in the performance comparison. When used as refinements for Spkmeans, the algorithms rMVSC-IR and rMVSC-IV worked directly on the output solution of Spkmeans. The cluster assignment produced by Spkmeans was used as the initialization for both rMVSC-IR and rMVSC-IV. We also investigated the performance of the original MVSC-IR and MVSC-IV further on the new datasets. Besides, it would be interesting to see how they and their Spkmeans-initialized versions fare against each other. What is more, two well-known document clustering approaches based on non-negative matrix factorization, NMF and NMF-NCW [10], are also included for a comparison with our algorithms, which use the explicit MVS measure.
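Conceptually, the refinement is a two-stage pipeline: spherical k-means produces an intermediate partition, which then serves as the initialization of an MVSC run. A minimal sketch is given below, in which spherical_kmeans and mvsc_ir are hypothetical placeholders for the algorithms described earlier in the paper; their internal update rules are not reproduced here.

```python
# Sketch of the refinement protocol (rMVSC-IR): run spherical k-means first,
# then use its cluster assignment to initialize MVSC-IR. The functions
# spherical_kmeans and mvsc_ir are hypothetical placeholders.
def refine_with_mvsc_ir(X, k, spherical_kmeans, mvsc_ir, alpha=0.5):
    init_labels = spherical_kmeans(X, k)                  # intermediate CS-based solution
    refined_labels = mvsc_ir(X, k, init=init_labels, alpha=alpha)
    return init_labels, refined_labels
```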

During the experiments, each of the two corpora in Table 6 was used to create six different test cases, each of which corresponded to a distinct number of topics used (c = 5, . . . , 10). For each test case, c topics were randomly selected from the corpus and their documents were mixed together to form a test set. This selection was repeated 50 times so that each test case had 50 different test sets. The average performance of the clustering algorithms with k = c was calculated over these 50 test sets. This experimental set-up is inspired by the similar experiments conducted in the NMF paper [10]. Furthermore, similar to the experimental setup in Section 5.2, each algorithm (including NMF and NMF-NCW) performed 10 trials on every test set before using the solution with the best obtainable objective function value as its final output.
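The sampling and best-of-ten-trials protocol just described can be summarized by the following sketch; corpus_by_topic, cluster_fn and the result fields are hypothetical placeholders:

```python
# Sketch of the experimental protocol: for one test case with c topics,
# draw c random topics 50 times, and for each draw keep the best of 10 runs
# according to the algorithm's own objective value. All names are hypothetical.
import random

def run_test_case(corpus_by_topic, c, cluster_fn, n_sets=50, n_trials=10):
    accuracies = []
    for _ in range(n_sets):
        topics = random.sample(list(corpus_by_topic), c)          # c random topics
        docs = [d for t in topics for d in corpus_by_topic[t]]    # mixed test set
        trials = [cluster_fn(docs, k=c) for _ in range(n_trials)]
        best = max(trials, key=lambda r: r.objective)             # best objective
        accuracies.append(best.accuracy)
    return sum(accuracies) / len(accuracies)                      # average over 50 sets
```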

The clustering results on TDT2 and Reuters-21578 are shown in Tables 7 and 8, respectively.

TABLE 7
Clustering results on TDT2

                        NMI                                   Accuracy
Algorithms   k=5   k=6   k=7   k=8   k=9   k=10    k=5   k=6   k=7   k=8   k=9   k=10
Spkmeans    .690  .704  .700  .677  .681  .656    .708  .689  .668  .620  .605  .578
rMVSC-IR    .753  .777  .766  .749  .738  .699    .855  .846  .822  .802  .760  .722
rMVSC-IV    .740  .764  .742  .729  .718  .676    .839  .837  .801  .785  .736  .701
MVSC-IR     .749  .790  .797  .760  .764  .722    .884  .867  .875  .840  .832  .780
MVSC-IV     .775  .785  .779  .745  .755  .714    .886  .871  .870  .825  .818  .777
NMF         .621  .630  .607  .581  .593  .555    .697  .686  .642  .604  .578  .555
NMF-NCW     .713  .746  .723  .707  .702  .659    .788  .821  .764  .749  .725  .675

TABLE 8
Clustering results on Reuters-21578

                        NMI                                   Accuracy
Algorithms   k=5   k=6   k=7   k=8   k=9   k=10    k=5   k=6   k=7   k=8   k=9   k=10
Spkmeans    .370  .435  .389  .336  .348  .428    .512  .508  .454  .390  .380  .429
rMVSC-IR    .386  .448  .406  .347  .359  .433    .591  .592  .522  .445  .437  .485
rMVSC-IV    .395  .438  .408  .351  .361  .434    .591  .573  .529  .453  .448  .477
MVSC-IR     .377  .442  .418  .354  .356  .441    .582  .588  .538  .473  .477  .505
MVSC-IV     .375  .444  .416  .357  .369  .438    .589  .588  .552  .475  .482  .512
NMF         .321  .369  .341  .289  .278  .359    .553  .534  .479  .423  .388  .430
NMF-NCW     .355  .413  .387  .341  .344  .413    .608  .580  .535  .466  .432  .493




Fig. 8. Accuracies of Spkmeans, rMVSC-IR and rMVSC-IV on the 50 test sets (in sorted order of Spkmeans) in the test case k = 5: (a) TDT2, (b) Reuters-21578.

For each test case in a column, the value in bold and underlined is the best among the results returned by the algorithms, while the value in bold only is the second best. From the tables, several observations can be made. Firstly, MVSC-IR and MVSC-IV continue to show that they are good clustering algorithms by frequently outperforming the other methods. They are always the best in every test case of TDT2. Compared with NMF-NCW, they are better in almost all the cases; the only exception is Reuters-21578 with k = 5, where NMF-NCW is the best based on Accuracy.

The second observation, which is also the main objective of this empirical study, is that by applying MVSC to refine the output of spherical k-means, clustering solutions are improved significantly. Both rMVSC-IR and rMVSC-IV lead to higher NMIs and Accuracies than Spkmeans in all the cases. Interestingly, there are many circumstances where Spkmeans' result is worse than that of the NMF clustering methods, but after being refined by the MVSC algorithms, it becomes better. To have a more descriptive picture of the improvements, we can refer to the radar charts in Fig. 8. The figure shows details of a particular test case where k = 5. Recall that a test case consists of 50 different test sets. The charts display the result on each test set, including the accuracy obtained by Spkmeans and the results after refinement by MVSC, namely rMVSC-IR and rMVSC-IV. For effective visualization, they are sorted in ascending order of the accuracies by Spkmeans (clockwise). As the patterns in both Fig. 8(a) and Fig. 8(b) reveal, improvement in accuracy is most likely attainable by rMVSC-IR and rMVSC-IV. Many of the improvements are by a considerably large margin, especially when the original accuracy obtained by Spkmeans is low.

There are only a few exceptions where accuracy becomes worse after refinement. Nevertheless, the decreases in such cases are small.

Finally, it is also interesting to notice from Table 7 and Table 8 that MVSC preceded by spherical k-means does not necessarily yield better clustering results than MVSC with random initialization. There are only a small number of cases in the two tables in which rMVSC is found to be better than MVSC. This phenomenon, however, is understandable. Given a locally optimal solution returned by spherical k-means, the rMVSC algorithms, as refinement methods, are constrained by this local optimum itself and, hence, their search space may be restricted. The original MVSC algorithms, on the other hand, are not subject to this constraint and are able to follow the search trajectory of their objective function from the beginning. Hence, while the performance improvement obtained by refining spherical k-means' results with MVSC proves the appropriateness of MVS and its criterion functions for document clustering, this observation in fact only reaffirms its potential.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we propose a Multi-Viewpoint based Similarity measuring method, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, IR and IV, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced.




Compared with other state-of-the-art clustering methods that use different types of similarity measure, on a large number of document datasets and under different evaluation metrics, the proposed algorithms show that they could provide significantly improved clustering performance.

The key contribution of this paper is the fundamental concept of similarity measure from multiple viewpoints. Future methods could make use of the same principle but define alternative forms for the relative similarity in Eq. (10), or combine the relative similarities from the different viewpoints by methods other than averaging. Besides, this paper focuses on partitional clustering of documents. In the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. Finally, we have shown the application of MVS and its clustering algorithms to text data. It would be interesting to explore how they work on other types of sparse and high-dimensional data.

REFERENCES

[1] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowl. Inf. Syst., vol. 14, no. 1, pp. 1–37, 2007.
[2] I. Guyon, U. von Luxburg, and R. C. Williamson, "Clustering: Science or art?" NIPS'09 Workshop on Clustering Theory, 2009.
[3] I. Dhillon and D. Modha, "Concept decompositions for large sparse text data using clustering," Mach. Learn., vol. 42, no. 1-2, pp. 143–175, Jan 2001.
[4] S. Zhong, "Efficient online spherical k-means clustering," in IEEE IJCNN, 2005, pp. 3180–3185.
[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," J. Mach. Learn. Res., vol. 6, pp. 1705–1749, Oct 2005.
[6] E. Pekalska, A. Harol, R. P. W. Duin, B. Spillmann, and H. Bunke, "Non-Euclidean or non-metric measures can be informative," in Structural, Syntactic, and Statistical Pattern Recognition, ser. LNCS, vol. 4109, 2006, pp. 871–880.
[7] M. Pelillo, "What is a cluster? Perspectives from game theory," in Proc. of the NIPS Workshop on Clustering Theory, 2009.
[8] D. Lee and J. Lee, "Dynamic dissimilarity measure for support based clustering," IEEE Trans. on Knowl. and Data Eng., vol. 22, no. 6, pp. 900–905, 2010.
[9] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," J. Mach. Learn. Res., vol. 6, pp. 1345–1382, Sep 2005.
[10] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," in SIGIR, 2003, pp. 267–273.
[11] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in KDD, 2003, pp. 89–98.
[12] C. D. Manning, P. Raghavan, and H. Schutze, An Introduction to Information Retrieval. Cambridge University Press, 2009.
[13] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A min-max cut algorithm for graph partitioning and data clustering," in IEEE ICDM, 2001, pp. 107–114.
[14] H. Zha, X. He, C. H. Q. Ding, M. Gu, and H. D. Simon, "Spectral relaxation for k-means clustering," in NIPS, 2001, pp. 1057–1064.
[15] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 888–905, 2000.
[16] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in KDD, 2001, pp. 269–274.
[17] Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis. Springer-Verlag New York, Inc., 2007.
[18] Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Mach. Learn., vol. 55, no. 3, pp. 311–331, Jun 2004.
[19] G. Karypis, "CLUTO a clustering toolkit," Dept. of Computer Science, Uni. of Minnesota, Tech. Rep., 2003, http://glaros.dtc.umn.edu/gkhome/views/cluto.
[20] A. Strehl, J. Ghosh, and R. Mooney, "Impact of similarity measures on web-page clustering," in Proc. of the 17th National Conf. on Artif. Intell.: Workshop of Artif. Intell. for Web Search. AAAI, Jul. 2000, pp. 58–64.
[21] A. Ahmad and L. Dey, "A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set," Pattern Recognit. Lett., vol. 28, no. 1, pp. 110–118, 2007.
[22] D. Ienco, R. G. Pensa, and R. Meo, "Context-based distance learning for categorical data clustering," in Proc. of the 8th Int. Symp. IDA, 2009, pp. 83–94.
[23] P. Lakkaraju, S. Gauch, and M. Speretta, "Document similarity based on concept tree distance," in Proc. of the 19th ACM Conf. on Hypertext and Hypermedia, 2008, pp. 127–132.
[24] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Trans. on Knowl. and Data Eng., vol. 20, no. 9, pp. 1217–1229, 2008.
[25] S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese, "Fast detection of XML structural similarity," IEEE Trans. on Knowl. and Data Eng., vol. 17, no. 2, pp. 160–175, 2005.
[26] E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "WebACE: a web agent for document categorization and exploration," in AGENTS '98: Proc. of the 2nd ICAA, 1998, pp. 408–415.
[27] J. Friedman and J. Meulman, "Clustering objects on subsets of attributes," J. R. Stat. Soc. Series B Stat. Methodol., vol. 66, no. 4, pp. 815–839, 2004.
[28] L. Hubert, P. Arabie, and J. Meulman, Combinatorial Data Analysis: Optimization by Dynamic Programming. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2001.
[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley & Sons, 2001.
[30] S. Zhong and J. Ghosh, "A comparative study of generative models for document clustering," in SIAM Int. Conf. Data Mining Workshop on Clustering High Dimensional Data and Its Applications, 2003.
[31] Y. Zhao and G. Karypis, "Criterion functions for document clustering: Experiments and analysis," Dept. of Computer Science, Uni. of Minnesota, Tech. Rep., 2002.
[32] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.

Duc Thang Nguyen received the B.Eng. degree in Electrical & Electronic Engineering at Nanyang Technological University, Singapore, where he is also pursuing the Ph.D. degree in the Division of Information Engineering. Currently, he is an Operations Planning Analyst at PSA Corporation Ltd, Singapore. His research interests include algorithms, information retrieval, data mining, optimizations and operations research.

Lihui Chen received the B.Eng. in Computer Science & Engineering at Zhejiang University, China, and the PhD in Computational Science at University of St. Andrews, UK. Currently she is an Associate Professor in the Division of Information Engineering at Nanyang Technological University in Singapore. Her research interests include machine learning algorithms and applications, data mining and web intelligence. She has published more than seventy refereed papers in international journals and conferences in these areas. She is a senior member of the IEEE, and a member of the IEEE Computational Intelligence Society.




Chee Keong Chan received his B.Eng. from the National University of Singapore in Electrical and Electronic Engineering, MSc and DIC in Computing from Imperial College, University of London, and PhD from Nanyang Technological University. Upon graduation from NUS, he worked as an R&D Engineer in Philips Singapore for several years. Currently, he is an Associate Professor in the Information Engineering Division, lecturing in subjects related to computer systems, artificial intelligence, software engineering and cyber security. Through his years in NTU, he has published numerous research papers in conferences, journals, books and book chapters. He has also provided numerous consultations to the industries. His current research interest areas include data mining (text and solar radiation data mining), evolutionary algorithms (scheduling and games) and renewable energy.


