Page 1: Advanced Techniques in Computing Sciences and Software Engineering || Fuzzy Document Clustering Approach using WordNet Lexical Categories

Fuzzy Document Clustering Approach using WordNet Lexical Categories

Tarek F. Gharib

Faculty of Computer and Information Sciences, Ain Shams University

Cairo, Egypt

Mohammed M. Fouad

Akhbar El-Yom Academy

Cairo, Egypt

Mostafa M. Aref

Faculty of Computer and Information Sciences, Ain Shams University

Cairo, Egypt

Abstract- Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. The area is growing rapidly, mainly because of the strong need to analyse the huge amount of textual data residing on internal file systems and the Web. Text document clustering provides an effective navigation mechanism for organizing this large amount of data by grouping documents into a small number of meaningful classes. In this paper we propose a fuzzy text document clustering approach using WordNet lexical categories and the Fuzzy c-Means algorithm. Experiments are performed to compare the efficiency of the proposed approach with recently reported approaches. The experimental results show that fuzzy clustering leads to strong performance: the Fuzzy c-Means algorithm outperforms classical clustering algorithms such as k-means and bisecting k-means in both clustering quality and running-time efficiency.

I. INTRODUCTION

With the growth of the World Wide Web and the information society, more information is available and accessible. The main problem is finding the truly relevant data among these vast data sources. Most queries submitted to a search engine return a large number of irrelevant results and only a small number of relevant pages that match the keywords typed by the user. Text document clustering can be used to address this problem by organizing this large amount of retrieved results.

Text document clustering provides an effective navigation mechanism to organize this large amount of data by grouping documents into a small number of meaningful classes. Text document clustering can be defined as the process of grouping text documents into semantically related groups [16]. Most current methods for text clustering are based on the similarity between the text sources. The similarity measures work on the syntactic relationships between these sources and neglect the semantic information in them. They typically use the vector-space model, in which each document is represented as a vector or ‘bag of words’, i.e., by the words (terms) it contains and their weights, regardless of their order [3].

Many well-known methods of text clustering suffer from two problems. First, they do not consider semantically related words/terms (e.g., synonyms or hypernyms/hyponyms) in the document. For instance, they treat {Vehicle, Car, Automobile} as different terms even though all these words have very similar meanings. This may lead to a very low relevance score for relevant documents, because documents do not always contain the same forms of words/terms.

Second, in vector representations of documents based on the bag-of-words model, text clustering methods tend to use all the words/terms in the documents after removing the stopwords. This leads to thousands of dimensions in the vector representation of documents, a phenomenon known as the “Curse of Dimensionality”. However, it is well known that only a very small number of words/terms in documents have distinguishing power for clustering [19] and become the key elements of text summaries. Those words/terms are normally the concepts in the domain related to the documents.

Recent studies of fuzzy clustering algorithms [4, 21, 22, 23] have proposed new ways of using fuzzy clustering in the document clustering process. However, these studies neglect the lexical information that can be extracted from the text, which is what we exploit through WordNet lexical categories in our proposed approach.

In this paper, we propose a fuzzy text document clustering approach using WordNet lexical categories and the Fuzzy c-Means algorithm. The proposed approach uses WordNet lexical category information to reduce the size of the vector space and to capture semantic relationships between words. The generated document vectors are the input to the fuzzy c-means algorithm in the clustering process, increasing the clustering accuracy.

The rest of this paper is organized as follows. Section II presents the proposed fuzzy text clustering approach. In Section III a set of experiments is presented to compare the performance of the proposed approach with current text clustering methods. Related work is discussed in Section IV. Finally, conclusions and future work are given in Section V.

II. FUZZY TEXT DOCUMENT CLUSTERING

In this section we describe in detail the components of the proposed fuzzy text clustering approach. There are two main processes: the first is Feature Extraction, which generates output document vectors from the input text documents using WordNet [12] lexical information. The second process is Document Clustering, which applies the fuzzy c-means algorithm to the document vectors to obtain the output clusters, as illustrated in fig. 1.

K. Elleithy (ed.), Advanced Techniques in Computing Sciences and Software Engineering, DOI 10.1007/978-90-481-3660-5_31, © Springer Science+Business Media B.V. 2010

 Fig. 1. Fuzzy Text Documents Clustering Approach

A. Feature Extraction

The first step in the proposed approach is feature extraction, or document preprocessing, which aims to represent the corpus (the input document collection) in the vector-space model. In this model a set of words (terms) called the “bag-of-words” is extracted from the corpus, and each document is represented as a vector by the words (terms) it contains and their weights, regardless of their order. The document preprocessing step contains four sub-steps: PoS tagging, stopword removal, stemming, and WordNet lexical category mapping.

1. PoS Tagging

The first preprocessing step is to PoS-tag the corpus. The PoS tagger relies on the text structure and on morphological differences to determine the appropriate part of speech. This requires the words to be in their original order, so the process must be done before any other modifications to the corpora. For this reason, PoS tagging is the first step carried out on the corpus documents, as proposed in [16].

2. Stopword Removal

Stopwords, i.e. words thought not to convey any meaning, are removed from the text. In this work, the proposed approach combines a static list of stopwords with the PoS information about all tokens: the process removes all words that are not nouns, verbs or adjectives. For example, stopword removal will remove words like: he, all, his, from, is, an, of, your, and so on.
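The PoS-based filtering described above can be sketched as follows. This is a minimal Python sketch, not the paper's Java implementation; the token/tag pairs are assumed to come from a PoS tagger (step 1), and the tag names follow the Penn Treebank convention.

```python
# Keep only nouns, verbs and adjectives; everything else is treated
# as a stopword and removed. Penn Treebank tag prefixes:
# NN* = noun, VB* = verb, JJ* = adjective.
CONTENT_PREFIXES = ("NN", "VB", "JJ")

def remove_stopwords(tagged_tokens):
    """tagged_tokens: list of (word, pos_tag) pairs from a PoS tagger."""
    return [word for word, tag in tagged_tokens
            if tag.startswith(CONTENT_PREFIXES)]

# Example with hand-tagged tokens (a real run would use an actual tagger):
tokens = [("he", "PRP"), ("plays", "VBZ"), ("football", "NN"),
          ("from", "IN"), ("an", "DT"), ("early", "JJ"), ("age", "NN")]
print(remove_stopwords(tokens))  # ['plays', 'football', 'early', 'age']
```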

3. Stemming

The stem is the common root form of words with the same meaning that appear in various morphological forms (e.g. player, played, plays from the stem play). In the proposed approach, the morphology function provided with WordNet is used for the stemming process. Stemming finds the stems of the output terms to enhance the term-frequency counting process, because terms like “learners” and “learning” come down to the same stem “learn”. This process outputs all the stems of the extracted terms.

The frequency of each stemmed word across the corpus is counted, and every word occurring less often than a pre-specified threshold (called the minimum support) is pruned, i.e. removed from the word vector, to reduce the document vector dimension. In our implementation we set the minimum support to 10%, which means that words found in less than 10% of the input documents are removed from the output vector.
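The minimum-support pruning step can be sketched as follows (a minimal Python sketch under the assumption that each document has already been reduced to its set of stems; the function name is illustrative, not from the paper):

```python
def prune_by_min_support(doc_stems, min_support=0.10):
    """doc_stems: one set of stems per document.
    Keep only stems whose document frequency >= min_support."""
    n_docs = len(doc_stems)
    df = {}  # document frequency of each stem
    for stems in doc_stems:
        for s in set(stems):
            df[s] = df.get(s, 0) + 1
    # a stem survives only if it occurs in at least min_support of the docs
    return {s for s, count in df.items() if count / n_docs >= min_support}

# Toy corpus of four documents; with min_support=0.5 only "learn"
# (present in 3 of 4 documents) survives the pruning.
docs = [{"learn", "cluster"}, {"learn", "fuzzy"}, {"learn"}, {"rare"}]
print(prune_by_min_support(docs, min_support=0.5))  # {'learn'}
```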

4. WordNet Lexical Category Mapping

As proposed in [15], we use WordNet lexical categories to map all the stemmed words in all documents into their lexical categories. We use WordNet 2.1, which has 41 lexical categories for nouns and verbs. For example, the words “dog” and “cat” both belong to the same category “noun.animal”. Some words also have multiple categories; for example, “Washington” has 3 categories (noun.location, noun.group, noun.person) because it can be the name of an American president, a city, or a group in the sense of a capital.
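The mapping step can be sketched as follows. The mini-lexicon here is a hypothetical stand-in for WordNet's lexicographer files (a real implementation would query WordNet, e.g. through the JAWS library the paper uses); the point is that thousands of stem dimensions collapse into at most 41 category dimensions.

```python
# Hypothetical mini-lexicon standing in for WordNet's lexicographer files.
LEXICON = {
    "dog":        ["noun.animal"],
    "cat":        ["noun.animal"],
    "washington": ["noun.location", "noun.group", "noun.person"],
}

def map_to_categories(stems):
    """Replace each stem by its WordNet lexical categories, producing a
    low-dimensional category-frequency vector for the document."""
    freq = {}
    for stem in stems:
        for cat in LEXICON.get(stem.lower(), []):
            freq[cat] = freq.get(cat, 0) + 1
    return freq

print(map_to_categories(["dog", "cat", "washington"]))
# {'noun.animal': 2, 'noun.location': 1, 'noun.group': 1, 'noun.person': 1}
```

Note that “washington” contributes to three categories at once; this is exactly the noise that the disambiguation techniques discussed next are meant to remove.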

Some word sense disambiguation techniques are used to remove the noise added by the multiple-category mapping, namely disambiguation by context and by concept map, which are discussed in detail in [15].

B. Document Clustering

After generating the document vectors for all the input documents using the feature extraction process, we continue with the clustering process as shown in fig. 1.

The problem of document clustering is defined as follows. Given a set DS of n documents, DS is clustered into a user-defined number k of document clusters D1, D2, …, Dk (i.e. D1 ∪ D2 ∪ … ∪ Dk = DS) so that the documents within a cluster are similar to one another while documents from different clusters are dissimilar.

There are two main approaches to document clustering: hierarchical clustering (agglomerative and divisive) and partitioning clustering algorithms [17]. In this process we apply three different clustering algorithms: k-means (partitioning clustering), bisecting k-means (hierarchical clustering) and fuzzy c-means (fuzzy clustering).

1. K-means and Bisecting k-means

We have implemented the k-means and bisecting k-means algorithms as introduced in [17]. Bisecting k-means begins with all data in one cluster and then performs the following steps:

Step 1: Choose the largest cluster to split.
Step 2: Use k-means to split this cluster into two sub-clusters (the bisecting step).
Step 3: Repeat step 2 for some number of iterations (in our case 10 times) and choose the split with the highest overall clustering similarity.
Step 4: Go to step 1 again until the desired k clusters are obtained.
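The steps above can be sketched in Python as follows. This is a minimal sketch, not the Java implementation used in the paper: vectors are plain lists of floats, and “overall similarity” is approximated by negated intra-cluster squared error (higher cohesion = better split).

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Centroid of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def kmeans(points, k=2, iters=25):
    """Plain k-means; returns a list of (possibly fewer than k) clusters."""
    centers = random.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        # keep the old center if a cluster ever goes empty
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def cohesion(clusters):
    """Overall similarity proxy: negated intra-cluster squared error."""
    return -sum(dist2(p, mean(c)) for c in clusters for p in c)

def bisecting_kmeans(points, k, trials=10):
    clusters = [points]                       # all data starts as one cluster
    while len(clusters) < k:
        largest = max(clusters, key=len)      # Step 1: pick the largest cluster
        clusters.remove(largest)
        splits = [kmeans(largest, 2) for _ in range(trials)]  # Steps 2-3
        clusters.extend(max(splits, key=cohesion))  # keep the best of 10 splits
    return clusters                           # Step 4: repeat until k clusters
```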

[Fig. 1 pipeline: Text Documents → Feature Extraction (WordNet lexical categories) → Documents’ Vectors (D1, D2, …, Dm) → Document Clustering → Clustered Documents.]

GHARIB ET AL. 182


2. Fuzzy c-means

Fuzzy c-means is a data clustering technique wherein each data point belongs to every cluster to some degree specified by a membership grade, whereas classical clustering algorithms assign each data point to exactly one cluster. The technique was originally introduced by Bezdek [2] as an improvement on earlier clustering methods. It groups data points that populate some multidimensional space into a specified number of different clusters.

Most fuzzy clustering algorithms are objective-function based: they determine an optimal (fuzzy) partition of a given data set into c clusters by minimizing an objective function subject to some constraints [4].

In our proposed approach, we use the implementation of the fuzzy c-means algorithm in MATLAB (the fcm function). This function takes the document vectors as a matrix and the desired number of clusters, and outputs the cluster centers and the optimal objective function values.
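For readers without MATLAB, the standard fuzzy c-means updates can be sketched as follows. This is a minimal Python sketch of the textbook FCM iteration with fuzzifier m = 2, not the MATLAB fcm implementation used in the paper.

```python
import random

def euclid(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fcm(points, c, m=2.0, iters=50):
    """Fuzzy c-means: returns (centers, u), where u[i][k] is the degree to
    which point i belongs to cluster k (each row of u sums to 1)."""
    n, dims = len(points), len(points[0])
    # random initial membership matrix, rows normalised to sum to 1
    u = [[random.random() for _ in range(c)] for _ in range(n)]
    u = [[x / sum(row) for x in row] for row in u]
    centers = []
    for _ in range(iters):
        # cluster centers are means weighted by u^m
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / sum(w)
                            for d in range(dims)])
        # membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        for i in range(n):
            d = [max(euclid(points[i], centers[k]), 1e-12) for k in range(c)]
            for k in range(c):
                u[i][k] = 1.0 / sum((d[k] / d[j]) ** (2.0 / (m - 1.0))
                                    for j in range(c))
    return centers, u
```

Unlike k-means, every document receives a graded membership in every cluster, which is what makes the approach tolerant of the overlapping topics typical of text.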

3. Silhouette Coefficient (SC) for Clustering Evaluation

For clustering, two types of measures of cluster “goodness” or quality are used. One type allows us to compare different sets of clusters without reference to external knowledge and is called an internal quality measure. The other type lets us evaluate how well the clustering is working by comparing the groups produced by the clustering techniques to known classes, and is called an external quality measure [17].

In our application of document clustering, we do not know the document classes, so external quality measures cannot be used. We therefore use the silhouette coefficient (SC measure), one of the main internal quality measures.

To measure the similarity between two documents d1 and d2 we use the cosine of the angle between the two document vectors. This measure approximates the semantic closeness of documents through the size of the angle between the vectors associated with them, as in (1).

dist(d1, d2) = (d1 • d2) / (|d1| |d2|)     (1)

where (•) denotes the vector dot product and |d| is the length (norm) of the vector. A cosine measure of 0 means the two documents are unrelated, whereas a value close to 1 means that the documents are closely related [15].
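As a quick sketch, the cosine measure of (1) in Python:

```python
def cosine_measure(d1, d2):
    """Cosine of the angle between two document vectors, as in (1)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sum(a * a for a in d1) ** 0.5
    norm2 = sum(b * b for b in d2) ** 0.5
    return dot / (norm1 * norm2)

print(cosine_measure([1, 0, 1], [1, 0, 1]))  # close to 1 -> closely related
print(cosine_measure([1, 0, 0], [0, 1, 0]))  # 0 -> unrelated
```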

Let D^M = {D1, …, Dk} describe a clustering result, i.e. an exhaustive partitioning of the set of documents DS. The distance of a document d ∈ DS to a cluster Di ∈ D^M is given as in (2).

dist(d, Di) = ( Σ_{p ∈ Di} dist(d, p) ) / |Di|     (2)

Let further a(d, D^M) = dist(d, Di) be the distance of document d to its own cluster Di (d ∈ Di), and let b(d, D^M) = min_{Di : d ∉ Di} dist(d, Di) be the distance of document d to its nearest neighbouring cluster. The silhouette S(d, D^M) of a document d is then defined as in (3).

S(d, D^M) = ( b(d, D^M) − a(d, D^M) ) / max( a(d, D^M), b(d, D^M) )     (3)

The silhouette coefficient (SC measure) is defined as shown in (4).

SC(D^M) = ( Σ_{p ∈ DS} S(p, D^M) ) / |DS|     (4)

The silhouette coefficient is a measure for the clustering quality that is rather independent from the number of clusters. Experiences, such as documented in [11], show that values between 0.7 and 1.0 indicate clustering results with excellent separation between clusters, viz. data points are very close to the center of their cluster and remote from the next nearest cluster. For the range from 0.5 to 0.7 one finds that data points are clearly assigned to cluster centers. Values from 0.25 to 0.5 indicate that cluster centers can be found, though there is considerable "noise". Below a value of 0.25 it becomes practically impossible to find significant cluster centers and to definitely assign the majority of data points.
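Equations (2)–(4) can be sketched directly in Python. This is a minimal sketch that uses Euclidean distance between vectors for clarity; any pairwise distance (including one derived from the cosine measure in (1)) can be passed in instead. Note that, following the paper's formula (2), a document's distance to its own cluster averages over all members including itself.

```python
def euclid(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dist_to_cluster(d, cluster, dist=euclid):
    """Eq. (2): average distance of document d to a cluster's members."""
    return sum(dist(d, p) for p in cluster) / len(cluster)

def silhouette(d, own, others, dist=euclid):
    """Eq. (3): (b - a) / max(a, b) for a single document."""
    a = dist_to_cluster(d, own, dist)
    b = min(dist_to_cluster(d, c, dist) for c in others)
    return (b - a) / max(a, b)

def silhouette_coefficient(clusters, dist=euclid):
    """Eq. (4): mean silhouette over all documents in the partition."""
    total, n = 0.0, 0
    for i, cluster in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        for d in cluster:
            total += silhouette(d, cluster, others, dist)
            n += 1
    return total / n
```

On a well-separated toy partition such as [[[0.0], [0.2]], [[10.0], [10.2]]], the coefficient lands in the “excellent separation” band above 0.7.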

III. EXPERIMENTS AND DISCUSSION

Experiments are performed on real text documents to compare the performance of three text clustering algorithms: k-means, bisecting k-means and fuzzy c-means. Two main parameters are used to evaluate the performance of the proposed approach: clustering quality and running time.

The fuzzy c-means algorithm is implemented in MATLAB, and the k-means and bisecting k-means algorithms are implemented in Java. The feature extraction process is also implemented in Java, using NetBeans 5.5.1 and the Java API for WordNet Searching (JAWS library) to access WordNet 2.1.

All experiments were done on a Pentium 4 (3 GHz) machine with 1 GB main memory, running the Windows XP Professional® operating system; all times are reported in seconds.

A. Text Document Datasets

We evaluate the proposed semantic text document clustering approach on three text document datasets: the EMail1200, SCOTS and Reuters corpora.

The EMail1200 corpus contains test email documents for spam detection, with about 1,245 documents and about 550 words per document. The SCOTS corpus (Scottish Corpus Of Texts and Speech) contains over 1,100 written and spoken texts, with about 4 million words of running text; 80% of this total is written text and 20% is spoken text. The SCOTS dataset contains about 3,425 words per document. The Reuters corpus contains about 21,578 documents that appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories by



personnel from Reuters Ltd. and Carnegie Group, Inc. in 1987. All three datasets are used in text mining studies and are available for download at [24, 25, 26] respectively.

B. Results

First, we pass the three document datasets through the feature extraction process to generate the corresponding document vectors. The vectors are then used as input for each clustering algorithm. We used the values (2, 5, 10, 20, 50, 70, and 100) as the desired number of clusters for each algorithm. The output clusters of each algorithm are measured using the silhouette coefficient (SC measure), and the total running time of the whole process is reported.

1. Clustering Quality

Figs. 2, 3, and 4 show the silhouette coefficient values for the three datasets respectively. In all experiments the fuzzy c-means algorithm outperforms bisecting k-means and k-means in overall clustering quality as measured by the silhouette coefficient. This experiment suggests that fuzzy clustering is more suitable for the unstructured nature of the text document clustering process itself.

 Fig. 2. Silhouette values comparing all clustering algorithms – EMail1200

 Fig. 3. Silhouette values comparing all clustering algorithms – SCOTS

 Fig. 4. Silhouette values comparing all clustering algorithms – Reuters

Figs. 5 and 6 compare fuzzy c-means clustering on the SCOTS and Reuters datasets with and without WordNet lexical categories. This experiment shows that using WordNet lexical categories in the feature extraction process improves the overall clustering quality of the input documents.

 Fig. 5. WordNet improves fuzzy clustering results using SCOTS dataset

Fig. 6. WordNet improves fuzzy clustering results using Reuters dataset



2. Running Time

The Reuters dataset, as mentioned earlier in this section, contains about 21,578 documents. This is a real challenge for any clustering approach because of scalability: some clustering techniques that are helpful for small data sets can be overwhelmed by large data sets to the point that they are no longer helpful.

For that reason we test the scalability of our proposed approach with the different algorithms using the Reuters dataset. This experiment shows that fuzzy c-means achieves a substantial running-time improvement compared with the other two algorithms. Also, given the huge size of the Reuters dataset, the proposed approach shows very good scalability with respect to document count.

Fig. 7 depicts the running time of the different clustering algorithms using Reuters dataset with respect to different values of desired clusters.

 Fig. 7. Scalability of all clustering algorithms on Reuters dataset

IV. RELATED WORK

In recent years, text document clustering has been introduced as an efficient method for navigating and browsing large document collections and for organizing the results returned by search engines in response to user queries [20]. Many clustering techniques have been proposed, such as k-secting k-means [13], bisecting k-means [17], and FTC and HFTC [1], among many others. The experiments performed in [17] show that bisecting k-means outperforms all these algorithms, although FTC and HFTC allow reducing the dimensionality of the data when working with large datasets.

WordNet is used by Green [8, 9] to construct lexical chains from the occurrences of terms in a document: WordNet senses that are related receive higher weights than senses that appear in isolation from others in the same document. The senses with the best weights are selected, and the corresponding weighted term frequencies constitute a base vector representation of the document.

Other works [14, 18] have explored the possibility of using WordNet for retrieving documents by carefully choosing search keywords. Dave and Lawrence [7] use WordNet synsets as features for document representation and subsequent clustering, but without word sense disambiguation; their experiments show that the WordNet synsets decrease clustering performance in all cases. Hotho et al. [10] use WordNet in an unsupervised scenario, taking into account the WordNet ontology and lexicon and a strategy for word sense disambiguation, and achieve improvements in the clustering results.

A technique for feature selection that uses WordNet to discover synonymous terms based on cross-referencing is introduced in [6]. First, terms with overlapping word senses that co-occur in a category are selected. A signature for a sense is a synset containing its synonyms. Then the list of noun synsets is checked across all senses for signature similarity. The semantic context of a category is aggregated by overlapping the synsets of different term senses. The original terms from the category that belong to the similar synsets are finally added as features for the category representation.

In [16] the authors explore the benefits of partial disambiguation of words by their PoS and the inclusion of WordNet concepts; they show that taking into account synonyms and hypernyms, disambiguated only by PoS tags, does not improve clustering effectiveness because of the noise produced by all the incorrect senses extracted from WordNet. Adding all synonyms and all hypernyms into the document vectors seems to increase the noise.

Reforgiato [15] presented a new unsupervised method for document clustering using WordNet lexical and conceptual relations. In this work, Reforgiato uses WordNet lexical categories and the WordNet ontology to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step he chose the bisecting k-means and Multipole tree algorithms for their accuracy and speed.

Friedman et al. [21] introduced the FDCM algorithm for clustering documents represented by vectors of variable size. The algorithm uses fuzzy logic to construct the cluster centers and introduces a fuzzy-based similarity measure, which provided reasonably good results in the area of web document monitoring.

Rodrigues and Sacks [23] modified the fuzzy c-means algorithm for clustering text documents based on the cosine similarity coefficient rather than the Euclidean distance. The modified algorithm works with normalized k-dimensional data vectors that lie on a hypersphere of unit radius, and has hence been named Hyper-spherical Fuzzy c-means (H-FCM). They also proposed a hierarchical fuzzy clustering algorithm (H2-FCM) to discover relationships between information resources based on their textual content, and to represent knowledge through the association of topics covered by those resources [22].

The main problem with these studies is that they neglect the lexical information of the textual data in the documents. This information helps improve clustering quality, as shown in figs. 5 and 6.



V. CONCLUSION AND FUTURE WORK

In this paper we proposed a fuzzy document clustering approach based on WordNet lexical categories and the fuzzy c-means clustering algorithm. The proposed approach generates document vectors using the WordNet lexical category mapping after preprocessing the input documents. We apply three different clustering algorithms, k-means, bisecting k-means and fuzzy c-means, to the generated document vectors to test the performance of fuzzy c-means.

Several findings emerged from this work:

• Using a word sense disambiguation technique reduces the noise in the generated document vectors and achieves higher clustering quality.

• The fuzzy c-means clustering algorithm achieves higher clustering quality than classical clustering algorithms such as k-means (partitioning clustering) and bisecting k-means (hierarchical clustering).

• Using WordNet lexical categories in the feature extraction process for text documents improves the overall clustering quality.

The experimental results also show that the proposed approach scales well to the huge number of documents in the Reuters dataset across different numbers of desired clusters.

For our future work there are two points to investigate:

• Using the WordNet ontology for generating feature vectors [15] for text documents, together with fuzzy clustering, may improve the overall clustering quality.

• Applying the proposed approach to web documents, to address the problem of web content mining as discussed in [5].

REFERENCES

[1] F. Beil, M. Ester, and X. Xu, “Frequent term-based text clustering”, KDD ’02, 2002, pp. 436–442.

[2] J.C. Bezdek, “Pattern Recognition with Fuzzy Objective Function Algorithms”, Plenum Press, New York, 1981.

[3] M. Lan, C.L. Tan, H.B. Low, and S.Y. Sung, “A Comprehensive Comparative Study on Term Weighting Schemes”, 14th International World Wide Web Conference (WWW2005), Japan, 2005.

[4] C. Borgelt and A. Nurnberger, “Fast Fuzzy Clustering of Web Page Collections”, PKDD Workshop on Statistical Approaches for Web Mining, 2004.

[5] S. Chakrabarti, “Mining the Web: Discovering Knowledge from Hypertext Data”, Morgan Kaufmann Publishers, 2002.

[6] S. Chua and N. Kulathuramaiyer, “Semantic feature selection using WordNet”, Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, 2004, pp. 166–172.

[7] D.M.P.K. Dave and S. Lawrence, “Mining the peanut gallery: Opinion extraction and semantic classification of product reviews”, WWW ’03, ACM, 2003, pp. 519–528.

[8] S.J. Green, “Building hypertext links in newspaper articles using semantic similarity”, NLDB ’97, 1997, pp. 178–190.

[9] S.J. Green, “Building hypertext links by computing semantic similarity”, TKDE, 11(5), 1999, pp. 50–57.

[10] A. Hotho, S. Staab, and G. Stumme, “WordNet improves text document clustering”, ACM SIGIR Workshop on Semantic Web, 2003.

[11] L. Kaufman and P.J. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Analysis”, John Wiley & Sons, 1999.

[12] WordNet project, available at: http://wordnet.princeton.edu/

[13] B. Larsen and C. Aone, “Fast and effective text mining using linear-time document clustering”, 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 16–22.

[14] D.I. Moldovan and R. Mihalcea, “Using WordNet and lexical operators to improve internet searches”, IEEE Internet Computing, 4(1), 2000, pp. 34–43.

[15] D. Reforgiato, “A new unsupervised method for document clustering by using WordNet lexical and conceptual relations”, Journal of Information Retrieval, Vol. 10, 2007, pp. 563–579.

[16] J. Sedding and D. Kazakov, “WordNet-based Text Document Clustering”, COLING 3rd Workshop on Robust Methods in Analysis of Natural Language Data, 2004.

[17] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clustering Techniques”, Technical Report #00-034, Department of Computer Science and Engineering, University of Minnesota, 2000.

[18] E.M. Voorhees, “Query expansion using lexical-semantic relations”, Proceedings of ACM SIGIR, 1994, pp. 61–69.

[19] B.B. Wang, R.I. McKay, H.A. Abbass, and M. Barlow, “Learning text classifier using the domain concept hierarchy”, Proceedings of the International Conference on Communications, Circuits and Systems, China, 2002.

[20] O. Zamir, O. Etzioni, O. Madani, and R.M. Karp, “Fast and intuitive clustering of web documents”, KDD ’97, 1997, pp. 287–290.

[21] M. Friedman, A. Kandel, M. Schneider, M. Last, B. Shapka, Y. Elovici, and O. Zaafrany, “A Fuzzy-Based Algorithm for Web Document Clustering”, Fuzzy Information Processing, NAFIPS ’04, IEEE, 2004.

[22] M.E.S. Mendes Rodrigues and L. Sacks, “A Scalable Hierarchical Fuzzy Clustering Algorithm for Text Mining”, Proceedings of the 5th International Conference on Recent Advances in Soft Computing, 2004.

[23] M.E.S. Mendes Rodrigues and L. Sacks, “Evaluating fuzzy clustering for relevance-based information access”, Proceedings of the 12th IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2003.

[24] EMail1200 dataset, available at: http://boole.cs.iastate.edu/book/acad/bag/data/lingspam

[25] SCOTS dataset, available at: http://www.scottishcorpus.ac.uk/

[26] Reuters dataset, available at: http://www.daviddlewis.com/resources/testcollections/reuters21578/


