
Topic Extraction with AGAPE

Julien Velcin and Jean-Gabriel Ganascia

Université de Paris 6 – LIP6, 104 avenue du Président Kennedy, 75016 Paris

{Julien.Velcin, Jean-Gabriel.Ganascia}@lip6.fr

Abstract. This paper uses an optimization approach to address the problem of conceptual clustering. The aim of AGAPE, which is based on the tabu-search meta-heuristic using split, merge and a special "k-means" move, is to extract concepts by optimizing a global quality function. It is deterministic and uses no a priori knowledge about the number of clusters. Experiments carried out in topic extraction show very promising results on both artificial and real datasets.

Keywords: conceptual clustering, global optimization, tabu search, topic extraction.

1 Introduction

Conceptual clustering is an unsupervised learning problem the aim of which is to extract concepts from a dataset [1,2]. This paper focuses on concept extraction using optimization techniques. The concepts are descriptions that label the dataset clusters. They are automatically discovered from the internal structure of the data and not known a priori.

AGAPE is a general approach that solves the clustering problem by optimizing a global function [3]. This quality function q must be given and computes the correspondence between the given dataset E and the concept set C. Hence, the objective is to find the solution C* that optimizes q. The function q is rarely convex and the usual methods are trapped in local optima. AGAPE therefore uses the meta-heuristic of tabu search, which improves the basic local search [4]. The main contribution of AGAPE is to discover better solutions in a deterministic way by escaping from local optima. In addition, an original operator inspired by the classical k-means algorithm is proposed. Note that the number of concepts to be found during the search process is not fixed in advance.

This approach has been implemented for the problem of topic extraction from textual datasets [5]. These datasets are binary (presence or absence of words), high dimensional (more than 10,000 variables) and sparse (each description contains only a few words), and require a specific computation of the tabu-search meta-heuristic. Experiments have been done using this kind of dataset to demonstrate AGAPE's validity.

The paper is organized as follows: section 2 presents the general framework of AGAPE and its implementation for the topic extraction problem; section 3 details the experiments based on both artificial and real datasets. Artificial datasets are used to compare the results to the score of the "ideal" solution. The experiments show the effectiveness of our approach compared to other clustering algorithms. The real dataset is based on French news and very convincing topics have been extracted. The conclusion proposes a number of lines of future research.

2 AGAPE Framework

2.1 Logical Framework

The aim of the AGAPE framework is to extract concept sets from the dataset E. Each example e of this dataset is associated to a description δ(e). Let D be the set of all the possible descriptions. Let us define the concept set C as a set of elements of D including the empty description d∅. The latter is the most general possible description, which is used to cover the examples uncovered by the other concepts of C.

In the AGAPE approach, the function that evaluates the quality of the concept set is given. Let q be this function, which uses as input the concept set C and the dataset E. This function provides a score q(C, E) ∈ [0, 1] which has to be optimized. The greater the value of q(C, E), the better the concept set corresponds to the dataset. For our experiments, we have defined the following quality function q (note that q is similar to the classical "squared-error criterion"; it is just an example used for the experiments presented in this paper):

q(C, E) = \frac{1}{|E|} \sum_{e \in E} \mathrm{sim}(\delta(e), R_C(e)) \qquad (1)

where sim is a similarity measure, δ(e) is the description of the example e in D and R_C(e) is the function that relates the example e to the closest concept c in C (relative to the similarity measure sim). The similarity measure sim can be of any kind and will be defined in our experiments (see section 3.1). The ideal objective of the clustering algorithm is to find the best concept set C* such that:

C^*(E) = \arg\max_{C \in H} q(C, E) \qquad (2)

where H is the hypothesis space, i.e. the set of all the possible concept sets. This task is known to be NP-complete and it is very difficult to find a perfect solution. A good approximation C(E) of C*(E) is often enough and can be discovered through optimization techniques, which is why we have chosen a local search-based algorithm to solve this clustering problem. The algorithm uses a meta-heuristic called "tabu search" to escape from local optima.
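To make these definitions concrete, here is a minimal Python sketch of the quality function q of equation (1), assuming descriptions are represented as sets of words; the function and variable names are ours, not the authors' implementation, and the similarity measure sim is passed in as a parameter.

```python
from typing import Callable, Set

Description = Set[str]
Similarity = Callable[[Description, Description], float]

def closest_concept(d: Description, concepts: list, sim: Similarity) -> Description:
    """R_C(e): the concept of C most similar to the description d."""
    return max(concepts, key=lambda c: sim(d, c))

def quality(concepts: list, examples: list, sim: Similarity) -> float:
    """q(C, E) of eq. (1): mean similarity of each example to its closest concept."""
    return sum(sim(d, closest_concept(d, concepts, sim))
               for d in examples) / len(examples)
```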

2.2 Tabu-Based Algorithm

This global optimization approach for conceptual clustering is based on tabu search [6,7], which extends the classical local search to go beyond the local optima. The objective is to optimize the function q in order to find a solution close to the optimal one. Remember that each concept set C ∈ H is a potential solution to the clustering problem and that the best set C* optimizes q. The hypothesis space is explored by computing at each step a neighborhood V of the current solution. The neighborhood contains new solutions computed from the current solution using moves, i.e. operators that change one solution into another. The solution of V optimizing q becomes the new current solution and the process is iterated.

To create these new potential solutions, three kinds of moves are considered: (1) splitting one existing concept into p (p > 1) new concepts, (2) merging two existing concepts to form one new concept, (3) performing a k-means step. Split and merge moves can be seen as similar to those used in the COBWEB framework [8]. The original "k-means move" is inspired by the classical k-means algorithm [9]. In fact, it is really important to "recenter" the search in order to better fit the dataset, and this requires two steps: an allocation step followed by an updating step. The allocation step associates each example e to the closest concept in C relative to sim and builds clusters around the different set items. The updating step computes a new description for each concept in line with the examples covered.
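The following Python sketch illustrates the two steps of the k-means move under the same set-of-words representation. The update rule (keep the words present in at least half of the covered examples) is an assumption made for illustration; the paper does not commit to a particular re-description rule, and the one-topic-per-word constraint of section 2.3 is ignored here.

```python
from collections import Counter

def kmeans_move(concepts, examples, sim, threshold=0.5):
    """One k-means move: an allocation step followed by an updating step."""
    # Allocation step: build a cluster around each concept.
    clusters = [[] for _ in concepts]
    for d in examples:
        best = max(range(len(concepts)), key=lambda i: sim(d, concepts[i]))
        clusters[best].append(d)
    # Updating step: re-describe each concept from the examples it covers.
    new_concepts = []
    for concept, covered in zip(concepts, clusters):
        if not covered:
            new_concepts.append(concept)   # concept covers nothing: keep it as is
            continue
        counts = Counter(w for d in covered for w in d)
        new_concepts.append({w for w, n in counts.items()
                             if n / len(covered) >= threshold})
    return new_concepts
```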

Tabu search enhances the performance of a basic local search method by using memory structures. Short-term memory implements a tabu list T that records the most recently chosen moves and therefore prevents the search process from backtracking. If the move that led to the new current solution has an opposite move (e.g. splitting a concept after merging it), the latter is marked tabu. This move remains in T for a fixed number of iterations (called the tabu tenure). The moves in T are temporarily forbidden when going through the hypothesis space. Hence, the search process avoids backtracking to previously-visited solutions. The search is stopped when a predefined iteration number MAX_iter is reached.

The k-means move needs special attention because, contrary to split and merge moves, it has no opposite move. Furthermore, our preliminary experiments showed that it cannot simply be marked tabu once it has been used. We have therefore chosen to mark it tabu as soon as the current solution can no longer be improved. In a way, the k-means move is its "own" opposite move.

2.3 AGAPE for Topic Extraction

The general AGAPE framework has been implemented for the problem of topic extraction from textual datasets. Here, a topic is just a set of words or phrases (similarly to the "bag-of-words" representation) that labels a cluster of texts covering the same theme. Let L be the word dictionary and a description d be a set of elements of L. Therefore, the description set D corresponds to all the possible sets of elements of L. A constraint is added: one word or phrase cannot belong to more than one topic at the same time. This is similar to the co-clustering approach [10] and is particularly useful since it makes it possible to restrain the hypothesis space and to obtain more intelligible solutions. Here is a minimalistic example of a topic set: { { ball, football player }, { presidential election, France }, {} }. The whole clustering algorithm is detailed below (a schematic sketch in code follows the list):

1. An initial solution S_0 is given, which is also the current solution S_i (i = 0). We choose the topic set S_0 = {d_T, d∅}, where d_T is the description that contains all the words or phrases of L. The tabu list T is initially empty.

2. The neighborhood V is computed from the current solution S_i with the assistance of the moves authorized relative to T. These moves are split, merge and k-means moves.

3. The best topic set B is chosen from V, i.e. the solution that optimizes the quality function q.

4. The aspiration criterion is applied: if the k-means move leads to a solution A better than all previously discovered solutions, then B is replaced by A.

5. The new current solution S_{i+1} is replaced by B.

6. If the chosen move has an opposite move (e.g. splitting a topic after merging it), the latter is marked tabu. This move remains in the tabu list T for a fixed number of iterations (called the tabu tenure). The moves in T are temporarily forbidden when going through the hypothesis space. Hence, the search process avoids backtracking to previously-visited solutions.

7. The tabu list T is updated: the oldest tabu moves are removed according to the tabu tenure t_T.

8. If a local optimum is found, then an intensification process is executed temporarily without taking the tabu list into account. This means that the neighborhood V is built with all the possible moves (even if they are marked tabu) and that the best quality solution is chosen until a local optimum is reached. Otherwise, better solutions could be missed because of the tabu list.

9. The best solution discovered up to now, B, is recorded.

10. The search is stopped when a predefined iteration number MAX_iter is reached and B is returned.
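A schematic Python sketch of this loop is given below. It compresses steps 1 to 10; the aspiration and intensification steps (4 and 8) are left out for brevity, and the interface of moves (triples of a move label, the label of its opposite move, and the resulting solution) is our assumption, not the authors' code.

```python
def tabu_search(initial, examples, q, moves, tabu_tenure=20, max_iter=100):
    """Schematic AGAPE loop. moves(solution) is assumed to yield triples
    (label, opposite_label, new_solution) for split, merge and k-means moves."""
    current = initial                                  # step 1, e.g. {d_T, d_empty}
    best, best_score = current, q(current, examples)
    tabu = {}                                          # move label -> expiry iteration
    for it in range(max_iter):
        # Step 2: neighborhood restricted to the non-tabu moves.
        neighbors = [m for m in moves(current) if tabu.get(m[0], -1) < it]
        if not neighbors:
            break
        # Step 3: the best neighbor with respect to the quality function q.
        _label, opposite, current = max(neighbors, key=lambda m: q(m[2], examples))
        if opposite is not None:                       # steps 6-7: tabu-list update
            tabu[opposite] = it + tabu_tenure
        score = q(current, examples)
        if score > best_score:                         # step 9: record the best so far
            best, best_score = current, score
    return best                                        # step 10
```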

Note that the moves used to compute the neighborhood, especially split and merge moves, must be adapted to high-dimensional data. These moves are not the focus of this paper and are therefore not detailed here.

The overall complexity can be estimated as O(|L| × |E| × max_δ × max_C² × MAX_iter), where max_δ is the maximum length of an example description and max_C is the maximum number of topics. Note that this algorithm is linear relative to the number of examples |E| and almost linear relative to the number of words |L|: the factor |L| × max_δ is greater than |L| alone, but far lower than the squared term |L|². The term max_C² is very much overestimated because only a low proportion of topic combinations is considered in practice (see Fig. 6). The common drawback of local search is the factor MAX_iter, which inevitably increases runtime, but this extra time is the price to pay for better solutions, as we will see below.

3 Experiments and Results

This section presents the experiments carried out using AGAPE on topic extraction. The first subsection describes the whole methodology, including the evaluation measure and the datasets chosen. The second subsection concerns the experiments carried out on artificial datasets, where an ideal topic set can be used in the evaluation. The third subsection presents the results obtained from a real news dataset.

3.1 Evaluation Methodology

Quality Function. As said in the introduction, the clustering problem is solved using a given quality function q. In fact, the algorithm is general and can use any kind of function as input, its purpose primarily being to find a solution that optimizes this function as well as possible. The function we propose here was already defined in equation 1 and computes the average homogeneity between the topics of C and the associated clusters of E. The similarity measure s_α we use is inspired by works on adaptive distances [11] and is defined as follows:

s_\alpha(d_1, d_2) = \frac{a}{a + b + \alpha \cdot c} \qquad (3)

where a is the number of common words (or phrases) in the descriptions d_1 and d_2, b is the number of words appearing only in d_1 and c is the number of words appearing only in d_2. This measure is not symmetrical and gives as output a non-negative real number. The value 0 means that no word is shared by the two descriptions, whereas the value 1 means two equal descriptions. d_1 stands for the example description δ(e), e ∈ E, and d_2 stands for the topic t ∈ C. Note that α = 1 corresponds to the classical Jaccard measure used for binary data. After investigation, it was decided to set α to 10%, which corresponds to a good trade-off for our experiments.
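Equation (3) translates directly into code; a minimal sketch under the set-of-words representation (variable names are ours):

```python
def s_alpha(d1: set, d2: set, alpha: float = 0.1) -> float:
    """Asymmetric similarity of eq. (3): a / (a + b + alpha * c)."""
    a = len(d1 & d2)    # words common to both descriptions
    b = len(d1 - d2)    # words appearing only in d1 (the example description)
    c = len(d2 - d1)    # words appearing only in d2 (the topic)
    denom = a + b + alpha * c
    return a / denom if denom > 0 else 0.0

# alpha = 1 recovers the classical Jaccard measure on binary data;
# the authors set alpha to 0.1 (10%) in their experiments.
```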

Internal Criteria. The evaluation is mainly based on internal criteria [12] and on the number of topics extracted. The internal criteria are based on a compactness-separation trade-off, where compactness computes the within-class homogeneity while separation computes the dissimilarity between the different classes. The compactness measure is that of [13] and relies on the correspondence (Diday et al. (1979) use the term "adequation") between the topic set and the example descriptions covered. A low value means a better average homogeneity within the clusters. A new separation measure adapted to topic extraction is proposed below:

\sigma(C, E) = \frac{1}{|E|} \sum_{\substack{e \in E \\ s_\alpha(\delta(e), R_C(e)) > 0}} \frac{s_\alpha(\delta(e), D_C(e))}{s_\alpha(\delta(e), R_C(e))} \qquad (4)

where D_C(e) relates the example e to the second best topic in C. This new measure is better adapted to topic-set search from binary datasets than the usual separation measure [13]. A low value means a better separation in E: each example is associated to one topic with no ambiguity. A high value means that the examples could more easily be associated to other topics.
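A sketch of the separation measure of equation (4), reusing s_alpha above; retrieving the best and second-best topics (R_C(e) and D_C(e)) with a sort is our implementation choice:

```python
def separation(topics, examples, sim):
    """sigma(C, E) of eq. (4): ratio of second-best to best topic similarity,
    summed over the covered examples and normalized by |E|."""
    total = 0.0
    for d in examples:
        scores = sorted((sim(d, t) for t in topics), reverse=True)
        best = scores[0]
        second = scores[1] if len(scores) > 1 else 0.0
        if best > 0:            # sum only over examples covered by some topic
            total += second / best
    return total / len(examples)
```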


Attention must also be paid to the number of topics extracted, which may be very high (several hundred) depending on the dataset considered. This is the reason why we divide the topic set into three types of topic, depending on the number of examples they cover: main topics (over 1% of the examples covered), weak topics (between 2 examples and 1% of the dataset) and outliers (only 1 example covered). The 1% parameter may appear arbitrary and will be discussed in section 3.3.
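This three-way split is straightforward to express; a small sketch, where cluster_sizes maps each topic to the number of examples it covers (names are illustrative):

```python
def categorize_topics(cluster_sizes, n_examples, main_cut=0.01):
    """Split topics into main (> 1% of E), weak (2 examples to 1%) and outliers (1)."""
    main, weak, outliers = [], [], []
    for topic, size in cluster_sizes.items():
        if size > main_cut * n_examples:
            main.append(topic)
        elif size >= 2:
            weak.append(topic)
        elif size == 1:
            outliers.append(topic)
    return main, weak, outliers
```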

General Methodology. The methodology requires two steps. First, experiments are carried out on artificial datasets, which were generated from an original topic set INI in H. Each topic in INI is the seed of examples with a similar theme. These examples are mixed and form the new dataset E, which can be analyzed using clustering algorithms. Then, it is possible to compare the results obtained using these algorithms to the "optimal" solution INI. The second step compares the topic sets extracted from a real dataset containing French news available online between the 1st and the 8th September 2006. The process is described in terms of q-score evolution in order to highlight the advantages of tabu search.

3.2 Experiments on Artificial Datasets

Artificial Datasets. 40 datasets were generated artificially, corresponding to 5 datasets for each of 8 different noise levels from 0% to 70%. The generation process is as follows:

• k topics are generated over L with a homogeneous distribution. These k descriptions and d∅ stand for the initial topic set INI.

• Each topic t ∈ INI is the seed of dup example descriptions. μ% of the words or phrases are drawn from these seeds to create a dataset E of k × dup examples. This corresponds to the sparseness of real textual data.

• E is degraded by a percentage ν of noise (between 0% and 70%), which is done by swapping the location of two well-placed words or phrases.

In our experiments, the number k of initial topics was set to 5, the duplication dup to 100 (E therefore contains 500 examples) and μ to 99%. The number of elements in the dictionary L was set to 30,000, which is not that much higher than the dictionary size of our real dataset (17,459). The results could thus be averaged over 5 similar datasets for each noise level.
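A possible reading of this generation process in Python follows. The paper gives only the outline, so the details are assumptions: we read μ as the fraction of seed words dropped per example (to obtain sparse descriptions), and we implement the noise step as replacing a well-placed word by a random dictionary word.

```python
import random

def generate_dataset(dictionary, k=5, dup=100, mu=0.99, noise=0.0, seed_size=200):
    """Sketch: k seed topics over the dictionary, dup examples per topic, plus noise."""
    words = list(dictionary)
    topics = [set(random.sample(words, seed_size)) for _ in range(k)]
    examples = []
    for topic in topics:
        for _ in range(dup):
            # Keep only a small fraction of the seed words (sparse descriptions).
            examples.append({w for w in topic if random.random() > mu})
    # Degrade E: each well-placed word is swapped for a random one with prob. noise.
    for d in examples:
        for w in list(d):
            if random.random() < noise:
                d.discard(w)
                d.add(random.choice(words))
    return topics, examples
```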

Comparison with Two Algorithms. AGAPE is compared with the initial topic set INI and with two other clustering algorithms. The first of these algorithms is the classical k-means, adapted to binary co-clustering, and is very similar to the k-modes algorithm [14]. The best topic set was chosen from 30 runs with three values of k (respectively 5, 30 and 300). The second algorithm is a variation called bisecting k-means, known to obtain very good results in text (or document) clustering [15]. It was implemented in order to be adapted to binary co-clustering. 30 k-means were executed at each step to discover the best splitting of the biggest cluster. The process was stopped when the quality function showed no more improvement.

Figure 1 shows the evolution of the q-score and the topic number relative to a noise level between 0% and 70%. The tabu tenure t_T was set to 20 and the maximum number of iterations MAX_iter to 100. Even if AGAPE does not always reach the maximum score, it seems to be the most stable algorithm. Furthermore, its results are the best for noise over 30% and the number of main topics discovered remains close to the 5 original topics. Note that the k-means algorithm can find fewer than k topics because the empty clusters are automatically removed.

Fig. 1. Compared evolution of the q-scores and of the main topic number

Figure 2 shows the dual evolution of compactness and separation. Lower scores entail more compact and well-separated clusters. The compactness curves reflect the q-score exactly because the two formulas rely on the similarity between the topics and the example descriptions. The separation score obtained by AGAPE is very good, even if the initial topic set INI is better. It is interesting to note that as noise increases the topic sets discovered are more compact but less separated than the initial topic set. Hence, the q-score shows that the added noise makes it impossible to find the initial solution using the function q.

Fig. 2. Compared evolution of the compactness-separation scores

These results show the effectiveness of the AGAPE-based algorithm, whatever the rate of noise added to the datasets. Runtime is not very long (between 90 and 400 seconds on a Pentium II with 2 GB of memory) and similar to the time taken by the bisecting k-means algorithm. Although the simple k-means algorithm is much faster (between 12 and 50 seconds), its results are much worse. We decided not to consider the 5-means algorithm in the following experiments.

3.3 Experiments on AFP News

AFP News. This dataset was automatically extracted from the French web news by the research team under F. Chateauraynaud. The news was available on a public site linked to the French press agency AFP and was collected between the 1st and the 8th September 2006. It contains 1566 news items described using a vocabulary of 17,459 words or phrases. These words (or word sequences) were identified by a semi-automatic indexation tool and are the basis of a French dictionary supervised by researchers. In this work, only nouns and noun phrases were considered (e.g. "president", "presidential election", "information society"). For more details about the word extraction technique, see [16]. If you wish to receive the complete dataset, please contact us by email.

Results. Table 1 compares the topic sets obtained using three clustering algorithms on the French news dataset. The 5-means algorithm was not considered because the results on the artificial datasets were poor. The tabu tenure t_T was set to 20 and the maximum number of iterations MAX_iter to 500. Here, the results are even clearer than in the previous experiments: using the AGAPE-based algorithm it is possible to extract extremely compact and well-separated topics.

Table 1. Comparisons with the AFP dataset

               AGAPE    BKM      KM30     KM300
q-score        0.3122   0.2083   0.1832   0.2738
compactness    0.7171   0.824    0.8496   0.7583
separation     0.1511   0.1827   0.2464   0.2519
main topics    14       27       29       23
weak topics    186      3        1        157
outliers       301      0        0        57

The price to pay is a higher (but still tractable) runtime: AGAPE needs 6654 seconds, which is ten times more than with the BKM algorithm (694 seconds).

The number of topics extracted, 501, may seem rather high, though they can be subdivided into 14 main topics, 186 weak topics and 301 outliers. The boundary between main and weak topics is not a hard one, and Figure 3 shows that a small variation of the cut parameter (1% here) has little impact on this partition. Besides, remember that the quality function to optimize is given; q could thus take into account the number of topics as a penalty, similarly to what is done in the BIC criterion [17].
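A penalized criterion of the kind suggested here could look as follows; the exact form of the penalty is our illustration, not something the paper defines.

```python
import math

def penalized_quality(q_score, n_topics, n_examples, weight=1.0):
    """Quality penalized by the number of topics, in the spirit of a BIC-style
    criterion: larger topic sets must earn their extra complexity."""
    return q_score - weight * n_topics * math.log(n_examples) / n_examples
```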

Fig. 3. Example distribution through the topics

Fig. 4 describes the most frequent words and phrases of the 7 main topics of the collection. Their meaning was clear and corresponded to newspaper headlines. Figure 5 gives term clarifications for non-French readers.

Figure 6 presents the joint evolution of the q-score and of the neighborhood size |V|. The neighborhood size is computed both with (grey bars) and without (dark bars) the tabu list. Note that the iterations without dark bars correspond to intensification steps (step 8 in our algorithm). The temporary degradation of the current solution (see (1) in the figure) leads to better solutions (see (2)).

t1 (101 news): students, education, de Robien, academic, children, teachers, minister, parents, academy, high school . . .
t2 (93 news): AFP, Royal, UMP, Sarkozy, PS, president, campaigner, French, Paris, party, debate, presidential election . . .
t3 (73 news): project, merger, French deputy, bill, GDF, privatization, government, cluster, Suez . . .
t4 (67 news): nuclear, UN, sanction, USA, Iran, Russia, security council, China, countries, uranium, case, enrichment . . .
t5 (49 news): rise, market, drop, Euros, analyst, index . . .
t6 (48 news): tennis, Flushing Meadows, US Open, chelem, year, tournament . . .
t7 (41 news): health minister, health insurance, health, statement, doctors, Bertrand, patients . . .

Fig. 4. The first 7 main topics of the AFP dataset

- Sarkozy, Royal: candidates for the French presidential election in 2007; they were the heads of the UMP and PS parties (respectively).
- De Robien, Bertrand: members of the French government (minister of Education and minister of Health, respectively).
- GDF, Suez: two big French companies.

Fig. 5. Clarification of some French terms

Fig. 6. Joint evolution of the neighborhood size and of the q-score

4 Conclusion and Future Work

This paper presents a general framework for concept extraction. The learning problem is centered on concepts and addressed using optimization techniques. The AGAPE-based algorithm can be adapted to any kind of function q and similarity measure sim. In addition, the number or nature of the concepts can be constrained. This approach was successfully applied to topic extraction from textual data.


Textual data are ubiquitous, especially on the internet (blogs, forums, RSS feeds, emails). It seems natural to develop algorithms capable of summarizing such data, which is why we chose to implement AGAPE to extract topics from high-dimensional data. The results on both artificial datasets and the real AFP news dataset seem to be better than those obtained using classical text clustering algorithms, and the search over the whole hypothesis space gives better solutions, at the cost of slightly longer runtime.

The AGAPE framework offers future research possibilities. Different quality functions [18] and similarity measures can be compared thanks to the generality of our approach. Furthermore, the different optima discovered during the search process can be recorded and used to return not just one, but several good solutions. These solutions can either be redundant or offer specific interesting features, which can be computed using a measure such as mutual information [19]. They can then generate ensembles [20] which can be used in robust clustering [21].

References

1. Michalski, R.S., Stepp, R.E., Diday, E.: A recent advance in data analysis: clustering objects into classes characterized by conjunctive concepts. In: Pattern Recognition, vol. 1, pp. 33–55 (1981)

2. Mishra, N., Ron, D., Swaminathan, R.: A New Conceptual Clustering Framework. Machine Learning 56(1–3), 115–151 (2004)

3. Sherali, H.D., Desai, J.: A Global Optimization RLT-based Approach for Solving the Hard Clustering Problem. Journal of Global Optimization 32(2), 281–306 (2005)

4. Glover, F., Laguna, M.: Tabu Search. Kluwer Academic Publishers, Dordrecht (1997)

5. Newman, D.J., Block, S.: Probabilistic Topic Decomposition of an Eighteenth-Century American Newspaper. Journal of the American Society for Information Science and Technology 57(6), 753–767 (2006)

6. Ng, M.K., Wong, J.C.: Clustering categorical data sets using tabu search techniques. Pattern Recognition 35(12), 2783–2790 (2002)

7. Velcin, J., Ganascia, J.-G.: Stereotype Extraction with Default Clustering. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, Edinburgh, Scotland (2005)

8. Fisher, D.H.: Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning 2, 139–172 (1987)

9. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press, Berkeley, California (1967)

10. Dhillon, I.S., Mallela, S., Modha, D.S.: Information-Theoretic Co-Clustering. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003), pp. 89–98. ACM Press, New York (2003)

11. Aggarwal, C.: Re-designing distance functions and distance-based applications for high dimensional data. ACM SIGMOD Record 30(1), 13–18 (2001)

12. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Cluster Validity Methods: Part I – Part II. ACM SIGMOD Record (2002)

13. He, J., Tan, A.-H., Tan, C.-L., Sung, S.-Y.: On Qualitative Evaluation of Clustering Systems. In: Information Retrieval and Clustering. Kluwer Academic Publishers, Dordrecht (2002)

14. Huang, Z.: A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. In: DMKD, vol. 8 (1997)

15. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: Proceedings of the KDD Workshop on Text Mining (2000)

16. Chateauraynaud, F.: Prospéro: une technologie littéraire pour les sciences humaines. CNRS Éditions (2003)

17. Kass, R.E., Raftery, A.E.: Bayes factors. Journal of the American Statistical Association 90, 773–795 (1995)

18. Zhao, Y., Karypis, G.: Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering. Machine Learning 55, 311–331 (2004)

19. Gondek, D., Hofmann, T.: Non-redundant clustering with conditional ensembles. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, pp. 70–77 (2005)

20. Dimitriadou, E., Weingessel, A., Hornik, K.: A cluster ensembles framework. In: Design and Application of Hybrid Intelligent Systems, pp. 528–534. IOS Press, Amsterdam (2003)

21. Fred, A., Jain, A.: Robust data clustering. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 128–133. IEEE Computer Society Press, Los Alamitos (2003)

