
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng WWW 07.

Page 1

A New Suffix Tree Similarity Measure for Document Clustering

Hung Chim, Xiaotie Deng

WWW 07

Page 2

1. Document Clustering

• Agglomerative Hierarchical Clustering (AHC)

Page 3

• Suffix Tree Clustering (STC)

- commonly used for clustering search results

Page 4

2-1. Suffix Tree Clustering

Ex: 3 documents

• cat ate cheese

• cat ate mouse too

• mouse ate cheese too
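The three example documents can be used to sketch which phrases a word-level suffix tree would expose as shared nodes. The code below is a simplified stand-in, not a real suffix tree: it enumerates every word phrase directly (quadratic rather than linear time) and drops a phrase when a one-word extension covers exactly the same documents, which approximates the branching nodes of a compact tree. The function names `phrase_index` and `shared_phrases` are illustrative, and document ids are 0-based.

```python
from collections import defaultdict


def phrase_index(docs):
    """Map each word-level phrase to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for start in range(len(words)):                   # every suffix
            for end in range(start + 1, len(words) + 1):  # every prefix of it
                index[tuple(words[start:end])].add(doc_id)
    return index


def shared_phrases(docs):
    """Return phrases shared by at least two documents.

    Assumption: a phrase is skipped when a one-word extension of it
    occurs in exactly the same documents, since a compact suffix tree
    would fold it into the longer node label.
    """
    index = phrase_index(docs)
    result = {}
    for phrase, doc_ids in index.items():
        if len(doc_ids) < 2:
            continue
        folded = any(
            len(other) == len(phrase) + 1
            and other[:len(phrase)] == phrase
            and index[other] == doc_ids
            for other in index
        )
        if not folded:
            result[" ".join(phrase)] = set(doc_ids)
    return result


docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]
for phrase, doc_ids in sorted(shared_phrases(docs).items()):
    print(phrase, sorted(doc_ids))
```

Each surviving phrase, together with its document set, plays the role of a base cluster in the following slides.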

Pages 5-8

cat ate cheese

Page 9

2-2. Base Cluster

score(B) = |B| · f(|P|)

• |P|: number of words in phrase P, not counting stopwords or words that appear in <= 3 documents or in > 40% of the documents

• f: penalizes single-word phrases and is constant for |P| > 6
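The base cluster score can be sketched as follows. The slide gives only the shape of f (penalty for single words, constant beyond six words), so the concrete values used here (0.5 for one word, linear from two to six) are assumptions for illustration, as are the stopword list and the helper names `effective_length`, `f` and `score`.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and"}


def effective_length(phrase_words, doc_freq, n_docs,
                     stopwords=STOPWORDS, min_df=3, max_df_ratio=0.4):
    """Count the words of P that contribute to |P|.

    A word is ignored if it is a stopword, appears in <= min_df
    documents, or appears in more than max_df_ratio of all documents.
    """
    count = 0
    for word in phrase_words:
        df = doc_freq.get(word, 0)
        if word in stopwords or df <= min_df or df > max_df_ratio * n_docs:
            continue
        count += 1
    return count


def f(length):
    """Assumed shape of f: penalize one word, cap at six words."""
    if length == 0:
        return 0.0
    if length == 1:
        return 0.5  # assumed penalty value, not given on the slide
    return float(min(length, 6))


def score(cluster_size, phrase_words, doc_freq, n_docs):
    """score(B) = |B| * f(|P|)."""
    return cluster_size * f(effective_length(phrase_words, doc_freq, n_docs))


doc_freq = {"suffix": 10, "tree": 12, "the": 90, "rare": 2}
print(score(5, ["the", "suffix", "tree"], doc_freq, n_docs=100))
```

In this usage example "the" is dropped as a stopword, so the effective phrase length is 2 and the score is 5 times f(2).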

Page 10

2-3. Combining Base Clusters

• Keep the top k (= 500) base clusters

• Merge base clusters with high overlap: merge Bi and Bj iff

|Bi ∩ Bj| / |Bi| > 0.5 and

|Bi ∩ Bj| / |Bj| > 0.5
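The merge rule can be sketched in a few lines. The slide states only the pairwise condition; treating the final clusters as connected components of the resulting "similar" graph is an assumption here, and the names `similar` and `merge_base_clusters` are illustrative. Base clusters are represented as non-empty sets of document ids.

```python
def similar(b_i, b_j, threshold=0.5):
    """True iff the overlap exceeds the threshold relative to both clusters."""
    common = len(b_i & b_j)
    return common / len(b_i) > threshold and common / len(b_j) > threshold


def merge_base_clusters(base_clusters, threshold=0.5):
    """Union all base clusters linked by the pairwise merge condition.

    Assumption: merging is transitive, i.e. final clusters are the
    connected components of the graph whose edges are similar pairs.
    """
    n = len(base_clusters)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if similar(base_clusters[i], base_clusters[j], threshold):
                parent[find(i)] = find(j)

    merged = {}
    for i, docs in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())


print(merge_base_clusters([{1, 2, 3}, {2, 3, 4}, {7, 8}]))
```

The pairwise loop is quadratic in the number of base clusters, which is one reason to keep only the top k before merging.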

Page 11

2-4. Advantage

• High precision even when using only snippets

• Incremental and linear time

• Order independent

• No magic k

- but what about the top k base clusters and the 0.5 threshold?

Page 13

3. New Suffix Tree Clustering

di^T = [tfidf(n1, di), tfidf(n2, di), …]

Group-average AHC (GAHC)
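The document vector over suffix tree nodes can be sketched as below. The slide says only "tfidf", so the specific weighting (1 + log tf) · log(1 + N / df), the use of cosine similarity, and the names `tfidf_vectors`, `cosine` and `group_average` are assumptions for illustration. Each document is given as a mapping from node phrase to its frequency in that document.

```python
import math


def tfidf_vectors(doc_node_counts):
    """Build one tf-idf vector per document over all suffix tree nodes.

    doc_node_counts: list of dicts, node phrase -> term frequency.
    Assumed weighting: (1 + log tf) * log(1 + N / df).
    """
    n_docs = len(doc_node_counts)
    nodes = sorted({n for counts in doc_node_counts for n in counts})
    df = {n: sum(1 for counts in doc_node_counts if n in counts) for n in nodes}
    vectors = []
    for counts in doc_node_counts:
        vec = []
        for node in nodes:
            tf = counts.get(node, 0)
            if tf > 0:
                vec.append((1 + math.log(tf)) * math.log(1 + n_docs / df[node]))
            else:
                vec.append(0.0)
        vectors.append(vec)
    return nodes, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_average(cluster_a, cluster_b, vectors):
    """Mean pairwise similarity between two clusters of document ids,
    the quantity a group-average AHC step would compare."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


counts = [
    {"cat ate": 1, "ate": 1, "cheese": 1},
    {"cat ate": 1, "ate": 1, "mouse": 1},
    {"zebra": 1},
]
nodes, vectors = tfidf_vectors(counts)
print(round(cosine(vectors[0], vectors[1]), 3))
```

A GAHC run would repeatedly merge the pair of clusters with the highest `group_average` value until a stopping criterion is met.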

Page 14

4. Evaluation

• Use F-measure

precision(Ci, Gj) = |Ci ∩ Gj| / |Ci|

recall(Ci, Gj) = |Ci ∩ Gj| / |Gj|
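The precision and recall definitions above combine into an F-measure as sketched here. The harmonic-mean form F = 2PR / (P + R) is the standard one; the overall score, which weights each ground-truth class Gj by its size and takes its best-matching cluster, is an assumption about how the per-pair values are aggregated. Clusters and classes are non-empty sets of document ids.

```python
def precision(c, g):
    """precision(Ci, Gj) = |Ci ∩ Gj| / |Ci|"""
    return len(c & g) / len(c)


def recall(c, g):
    """recall(Ci, Gj) = |Ci ∩ Gj| / |Gj|"""
    return len(c & g) / len(g)


def f_measure(c, g):
    """Harmonic mean of precision and recall for one cluster/class pair."""
    p, r = precision(c, g), recall(c, g)
    return 2 * p * r / (p + r) if p + r else 0.0


def overall_f(clusters, classes):
    """Assumed aggregation: size-weighted best F per ground-truth class."""
    total = sum(len(g) for g in classes)
    return sum(
        len(g) / total * max(f_measure(c, g) for c in clusters)
        for g in classes
    )


cluster = {1, 2, 3, 4}
truth = {3, 4, 5, 6, 7, 8}
print(precision(cluster, truth), recall(cluster, truth), f_measure(cluster, truth))
```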

Page 15

• OHSUMED Document Collection

- MeSH indexing terms

• RCV1 Document Collection

- categories

Page 17

5. Comparison

• STC: seldom generates large clusters

• NSTC: not incremental

