Matjaž Juršič, Vid Podpečan, Nada Lavrač

FUZZY CLUSTERING OF DOCUMENTS

http://kt.jis.si

OVERVIEW

BASIC CONCEPTS
- Clustering
- Fuzzy Clustering
- Clustering of Documents

PROBLEM DOMAIN
- Conference Papers Clustering (Phase 1)
- Combining Constraint-Based & Fuzzy Clustering
- Conference Papers Clustering (Phase 2)

FUZZY CLUSTERING OF DOCUMENTS
- C-Means Algorithm
- Distance Measure
- Comparison of Crisp & Fuzzy Clustering
- Time Complexity

FURTHER WORK

CLUSTERING

Important unsupervised learning problem that deals with finding a structure in a collection of unlabeled data.

Dividing data into groups (clusters) such that:
- “similar” objects are in the same cluster,
- “dissimilar” objects are in different clusters.

Problems:
- choosing an appropriate similarity/distance function between objects,
- evaluating clustering results.

FUZZY CLUSTERING

- No sharp boundaries between clusters.
- Each data object can belong to more than one cluster (with a certain degree of membership).

e.g. membership of the “red square” data object:
- 70% in the “red” cluster
- 30% in the “green” cluster
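
The difference from a crisp assignment can be illustrated with a few lines of Python. This is only a sketch: the dictionary below reuses the 70%/30% figures from the example, and the variable names are made up for illustration.

```python
# Membership degrees of one data object, taken from the "red square" example.
membership = {"red": 0.70, "green": 0.30}

# In fuzzy clustering the degrees of one object sum to 1 across all clusters.
total = sum(membership.values())

# A crisp clustering keeps only the strongest cluster and discards the rest.
crisp_cluster = max(membership, key=membership.get)

print(total, crisp_cluster)
```

The fuzzy result keeps the information that the object is also partly "green", which a crisp result would lose.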

CLUSTERING OF DOCUMENTS

BAG OF WORDS & VECTOR SPACE MODEL
- text represented as an unordered collection of words
- using tf-idf (term frequency–inverse document frequency)
- document = one vector in a high-dimensional space
- similarity = cosine similarity between vectors
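
A minimal sketch of this representation in plain Python follows. It assumes one common tf-idf variant (raw term frequency times log(N/df)); the weighting actually used in Text-Garden may differ, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised documents into sparse tf-idf vectors (word -> weight)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # bag of words: word order is ignored
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

docs = [
    "fuzzy clustering of documents".split(),
    "fuzzy clustering of conference papers".split(),
    "web crawling tools".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents sharing words
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents sharing no words
sim_00 = cosine_similarity(vecs[0], vecs[0])  # a document with itself
```

Because tf-idf weights are non-negative, the cosine similarity of two documents always lies between 0 and 1.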

TEXT-GARDEN SOFTWARE LIBRARY (www.textmining.net)
- collection of text-mining software tools (text analysis, model generation, document classification/clustering, web crawling, ...)
- C++ library
- developed at JSI

CONFERENCE PAPERS CLUSTERING (PHASE 1)

PROBLEM
Grouping conference papers by content into a predefined sessions schedule.

EXAMPLE
[Figure: papers are assigned by constraint-based clustering to a sessions schedule consisting of Session A (3 papers), Session B (4 papers), Session C (4 papers) and Session D (3 papers), separated by coffee and lunch breaks; each session is given a title.]

COMBINING CONSTRAINT-BASED & FUZZY CLUSTERING

PHASE 1 SOLUTION
- constraint-based clustering (CBC)

DIFFICULTIES
- CBC can get stuck in a local minimum
- often a low-quality result (created schedule)
- user interaction needed to repair the schedule

PHASE 2 NEEDED
- run fuzzy clustering (FC) with initial clusters from CBC
- if the output clusters of FC differ from those of CBC, repeat everything
- if the clusters of FC equal those of CBC, show the new information to the user
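
The Phase 2 control loop could be sketched roughly as below. Everything here is an assumption made for illustration: `run_cbc` and `run_fc` are hypothetical callables standing in for the two clustering steps, the comparison is done on the strongest membership of each paper, and `max_rounds` is an added safety limit that is not part of the slides.

```python
def crisp_assignment(memberships):
    """Map each paper to the session in which its membership is highest."""
    return {paper: max(weights, key=weights.get)
            for paper, weights in memberships.items()}

def schedule_with_feedback(papers, run_cbc, run_fc, max_rounds=10):
    """Repeat CBC + FC until the fuzzy result agrees with the CBC schedule."""
    for _ in range(max_rounds):
        schedule = run_cbc(papers)              # paper -> session
        memberships = run_fc(papers, schedule)  # paper -> {session: weight}
        if crisp_assignment(memberships) == schedule:
            return schedule, memberships        # stable: show to the user
    return None, None                           # no agreement reached

# Toy stand-ins (hypothetical) so that the sketch can be run.
def toy_cbc(papers):
    return {"p1": "A", "p2": "B"}

def toy_fc(papers, schedule):
    return {"p1": {"A": 0.8, "B": 0.2}, "p2": {"A": 0.4, "B": 0.6}}

schedule, memberships = schedule_with_feedback(["p1", "p2"], toy_cbc, toy_fc)
```

Repeating only makes sense if CBC can return a different schedule on each run, for example because of random initialisation; that is assumed here.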

CONFERENCE PAPERS CLUSTERING (PHASE 2)

RUN FUZZY CLUSTERING ON PHASE 1 RESULTS
- insight into result quality
- identify problematic papers
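
One possible way to identify problematic papers is to flag every paper whose membership in its scheduled session is low. The sketch below assumes this criterion; the 0.5 threshold, the function name and the sample values are hypothetical and not taken from the presented system.

```python
def problematic_papers(schedule, memberships, threshold=0.5):
    """Return papers whose membership in their own session is below threshold.

    Each result is (paper, scheduled session, membership there, best session),
    sorted so that the weakest fits come first.
    """
    flagged = []
    for paper, session in schedule.items():
        weight = memberships[paper][session]
        if weight < threshold:
            best = max(memberships[paper], key=memberships[paper].get)
            flagged.append((paper, session, weight, best))
    return sorted(flagged, key=lambda item: item[2])

# Hypothetical Phase 1 schedule and fuzzy memberships for three papers.
schedule = {"paper1": "A", "paper2": "A", "paper3": "B"}
memberships = {
    "paper1": {"A": 0.81, "B": 0.19},
    "paper2": {"A": 0.13, "B": 0.87},
    "paper3": {"A": 0.58, "B": 0.42},
}
flagged = problematic_papers(schedule, memberships)
for paper, session, weight, best in flagged:
    print(f"{paper}: {weight:.0%} in session {session}, better fit: {best}")
```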

EXAMPLE
[Figure: the sessions schedule from Phase 1 (Sessions A to D with coffee and lunch breaks), annotated with fuzzy membership percentages such as 13% and 42%.]

C-MEANS ALGORITHM

generate initial (random) cluster centres
repeat
    for each example, calculate membership weights
    for each cluster, recompute the new centre
until the difference of the clusters between two iterations drops under some threshold

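$$u_{ij} = \frac{1}{\sum_{k} \left( \frac{\mathrm{dist}(center_i, t_j)}{\mathrm{dist}(center_k, t_j)} \right)^{\frac{2}{m-1}}}$$

$$center_i = \frac{\sum_{j} u_{ij}^{m} \, t_j}{\sum_{j} u_{ij}^{m}}$$

where dist = distance, m = fuzziness, center_i = centre of cluster i, t_j = example j, u_ij = membership weight of example j in cluster i.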
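
The algorithm can be sketched in plain Python as follows. This is a minimal illustration of standard fuzzy c-means, not the authors' implementation: it assumes the usual update rules with fuzziness m, uses Euclidean distance as a placeholder (a cosine-based distance would be passed as `dist` for documents), and all function and parameter names are made up.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def membership_weights(examples, centers, m, dist):
    """u[j][i]: membership of example j in cluster i; each row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for t in examples:
        d = [dist(c, t) for c in centers]
        if any(x == 0.0 for x in d):
            # Example coincides with a centre: share membership among those.
            zeros = [1.0 if x == 0.0 else 0.0 for x in d]
            u.append([z / sum(zeros) for z in zeros])
        else:
            u.append([1.0 / sum((d_i / d_k) ** exponent for d_k in d)
                      for d_i in d])
    return u

def recompute_centers(examples, u, k, m):
    """Each centre is the mean of all examples weighted by u ** m."""
    dims = len(examples[0])
    centers = []
    for i in range(k):
        w = [row[i] ** m for row in u]
        total = sum(w)
        centers.append([sum(wj * t[d] for wj, t in zip(w, examples)) / total
                        for d in range(dims)])
    return centers

def fuzzy_c_means(examples, k, m=2.0, dist=euclidean, threshold=1e-6,
                  max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initial (random) centres: k distinct examples.
    centers = [list(t) for t in rng.sample(examples, k)]
    for _ in range(max_iter):
        u = membership_weights(examples, centers, m, dist)
        new_centers = recompute_centers(examples, u, k, m)
        # Stop when no centre moved more than the threshold.
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    return centers, membership_weights(examples, centers, m, dist)

# Two well-separated groups of 2-D points (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fuzzy_c_means(points, k=2)
labels = [row.index(max(row)) for row in u]
```

Taking the strongest membership of each example, as in `labels`, turns the fuzzy result back into a crisp clustering.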

DISTANCE MEASURE

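VECTOR SPACE
- Usual similarity measure: cosine similarity

$$\mathrm{sim}(x_1, x_2) = \cos\Theta \in [0, 1]$$

C-MEANS EXPLICITLY NEEDS DISTANCE (DISSIMILARITY), NOT SIMILARITY:
- There are many possibilities:

$$\mathrm{dist}(x_1, x_2) = 1 - \cos\Theta \in [0, 1], \quad \mathrm{dist}(x_1, x_2) = \sin\Theta = \sqrt{1 - \cos^2\Theta} \in [0, 1], \quad \mathrm{dist}(x_1, x_2) = \frac{1}{\cos\Theta} \in [1, \infty)$$

- None has ideal properties.
- Experimental evaluation shows no significant difference.
- We used $\sin\Theta$.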
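
Two common ways of turning cosine similarity into a distance, 1 − cos Θ and sin Θ, can be compared with a short sketch. The function names are hypothetical, and the vectors are assumed to be non-zero with non-negative components, as tf-idf vectors are.

```python
import math

def cosine(x1, x2):
    """Cosine of the angle between two dense, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm = math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2))
    return dot / norm

def dist_one_minus_cos(x1, x2):
    return 1.0 - cosine(x1, x2)

def dist_sin(x1, x2):
    c = min(1.0, cosine(x1, x2))  # guard against rounding slightly above 1
    return math.sqrt(1.0 - c * c)

same = ((1.0, 0.0), (1.0, 0.0))        # angle 0
diagonal = ((1.0, 0.0), (1.0, 1.0))    # angle 45 degrees
orthogonal = ((1.0, 0.0), (0.0, 1.0))  # angle 90 degrees

for name, (a, b) in [("same", same), ("diagonal", diagonal),
                     ("orthogonal", orthogonal)]:
    print(name, round(dist_one_minus_cos(a, b), 4), round(dist_sin(a, b), 4))
```

Both distances are 0 for identical directions and 1 for orthogonal vectors; they differ in between, where sin Θ grows faster for small angles.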

COMPARISON OF CRISP & FUZZY CLUSTERING

TIME COMPLEXITY

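If the dimensionality of the vectors is much higher than the number of clusters, the time complexity of c-means is comparable to that of k-means (this holds for document clustering).

k-means: $O(i \cdot n \cdot k \cdot v)$
c-means: $O(i \cdot n \cdot k \cdot (v + k))$

if $O(k) \ll O(v)$ then $O(i \cdot n \cdot k \cdot (v + k)) \approx O(i \cdot n \cdot k \cdot v)$

where i = number of iterations, n = number of vectors, k = number of clusters, v = dimensionality of a vector.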
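
A quick back-of-the-envelope check makes the comparison concrete. The sketch assumes operation counts of i·n·k·v for k-means and i·n·k·(v+k) for c-means; the sample numbers (500 papers, 10 sessions, a 20,000-word vocabulary, 20 iterations) are hypothetical.

```python
def kmeans_cost(i, n, k, v):
    """Estimated operation count for k-means."""
    return i * n * k * v

def cmeans_cost(i, n, k, v):
    """Estimated operation count for fuzzy c-means."""
    return i * n * k * (v + k)

# Hypothetical document-clustering setting: v is much larger than k.
i, n, k, v = 20, 500, 10, 20000
ratio = cmeans_cost(i, n, k, v) / kmeans_cost(i, n, k, v)
print(ratio)  # equals (v + k) / v
```

The ratio reduces to (v + k) / v, so with thousands of dimensions and a handful of clusters the extra cost of c-means is negligible.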

FURTHER WORK

EVALUATION
- Test scenarios
- Benchmarks
- Using data from past conferences

USER INTERFACE
- Web interface for semi-automatic conference schedule creation

ALGORITHMS FINE-TUNING

DISCUSSION

CONTACTS
matjaz.jursic@ijs.si, vid.podpecan@ijs.si, nada.lavrac@ijs.si

THANK YOU FOR YOUR ATTENTION