Post on 27-Dec-2015
transcript
Matjaž Juršič, Vid Podpečan, Nada Lavrač
FUZZY CLUSTERING OF DOCUMENTS
http://kt.jis.si
OVERVIEW
BASIC CONCEPTS
- Clustering
- Fuzzy Clustering
- Clustering of Documents
PROBLEM DOMAIN
- Conference Papers Clustering (Phase 1)
- Combining Constraint-Based & Fuzzy Clustering
- Conference Papers Clustering (Phase 2)
FUZZY CLUSTERING OF DOCUMENTS
- C-Means Algorithm
- Distance Measure
- Comparison of Crisp & Fuzzy Clustering
- Time Complexity
FURTHER WORK
CLUSTERING
Important unsupervised learning problem that deals with finding a structure in a collection of unlabeled data.
Dividing data into groups (clusters) such that:
- “similar” objects are in the same cluster,
- “dissimilar” objects are in different clusters.
Problems:
- choosing a correct similarity/distance function between objects,
- evaluating clustering results.
FUZZY CLUSTERING
• No sharp boundaries between clusters.
• Each data object can belong to more than one cluster (with a certain probability).
e.g. membership of the “red square” data object:
- 70% in the “red” cluster
- 30% in the “green” cluster
CLUSTERING OF DOCUMENTS
BAG OF WORDS & VECTOR SPACE MODEL
- text represented as an unordered collection of words
- using tf-idf (term frequency–inverse document frequency)
- document = one vector in high-dimensional space
- similarity = cosine similarity between vectors
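The representation above can be sketched in a few lines of Python. This is only an illustration, not the Text-Garden implementation: the whitespace tokenizer, the plain `log(N / df)` idf variant, and the function names `tfidf_vectors` and `cosine_similarity` are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    # Bag of words: word order is discarded, only counts are kept.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed idf variant: log(N / df), one of several common choices.
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "fuzzy clustering of documents",
    "fuzzy clustering of conference papers",
    "web crawling software tools",
]
vecs = tfidf_vectors(docs)
sim_close = cosine_similarity(vecs[0], vecs[1])  # documents sharing several terms
sim_far = cosine_similarity(vecs[0], vecs[2])    # documents sharing no terms
```

Documents with no terms in common get similarity 0, and the similarity grows as the weighted overlap grows.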
TEXT-GARDEN SOFTWARE LIBRARY (www.textmining.net)
- collection of text-mining software tools (text analysis, model generation, document classification/clustering, web crawling, ...)
- C++ library
- developed at JSI
CONFERENCE PAPERS CLUSTERING (PHASE 1)
PROBLEM
Grouping conference papers by content into a predefined session schedule.
EXAMPLE
[Figure: Papers -> Constraint-based clustering -> Sessions schedule]
Sessions schedule:
- Session A (3 papers)
- Coffee break
- Session B (4 papers)
- Lunch break
- Session C (4 papers)
- Coffee break
- Session D (3 papers)
Result: Session A – Title, Session B – Title, Session C – Title, Session D – Title
COMBINING CONSTRAINT-BASED & FUZZY CLUSTERING
PHASE 1 SOLUTION
- constraint-based clustering (CBC)
DIFFICULTIES
- CBC can get stuck in a local minimum
- often a low-quality result (created schedule)
- user interaction needed to repair the schedule
PHASE 2 NEEDED
- run fuzzy clustering (FC) with initial clusters from CBC
- if the output clusters of FC differ from those of CBC, repeat everything
- if the clusters of FC equal those of CBC, show the new information to the user
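One possible reading of the Phase 2 loop is sketched below. The slide does not say what "repeat everything" restarts from, so feeding the FC result back into CBC is an assumption here. The callables `run_cbc` and `run_fc`, the `max_rounds` limit, and the toy stand-ins are hypothetical and only make the sketch runnable.

```python
def schedule_with_feedback(papers, run_cbc, run_fc, max_rounds=10):
    """Alternate CBC and FC until their clusters agree or max_rounds is hit.

    run_cbc(papers, initial) -> list of session labels, one per paper
    run_fc(papers, initial)  -> list of dicts {session: membership}, one per paper
    Both callables are hypothetical placeholders for the real algorithms.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    initial = None
    for round_no in range(1, max_rounds + 1):
        cbc_sessions = run_cbc(papers, initial)
        memberships = run_fc(papers, cbc_sessions)
        # Crisp view of the fuzzy result: the session with the highest membership.
        fc_sessions = [max(m, key=m.get) for m in memberships]
        if fc_sessions == cbc_sessions:
            # Clusters agree: the memberships are the new info for the user.
            return cbc_sessions, memberships, round_no
        # Assumption: the next CBC run starts from the FC clusters.
        initial = fc_sessions
    return cbc_sessions, memberships, max_rounds


# Toy stand-ins so the sketch can be executed; they ignore the papers.
def toy_cbc(papers, initial):
    return list(initial) if initial is not None else ["A", "A", "B"]


def toy_fc(papers, initial):
    return [{"A": 0.9, "B": 0.1}, {"A": 0.4, "B": 0.6}, {"A": 0.2, "B": 0.8}]


sessions, memberships, rounds = schedule_with_feedback(
    ["p1", "p2", "p3"], toy_cbc, toy_fc)
```

With these toy functions the first round disagrees on the second paper, and the second round agrees, so the loop stops after two rounds.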
CONFERENCE PAPERS CLUSTERING (PHASE 2)
RUN FUZZY CLUSTERING ON PHASE 1 RESULTS
- insight into result quality
- identify problematic papers
EXAMPLE
[Figure: sessions schedule from Phase 1 (Session A – Title to Session D – Title, separated by coffee, lunch, and coffee breaks), annotated with the percentages 25%, 13%, 42%, 10%, 37%]
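A minimal sketch of how problematic papers could be identified from fuzzy memberships: flag every paper whose membership in its assigned session falls below a cut-off. The function name `problematic_papers`, the 0.5 threshold, and the membership values are made up for illustration and are not taken from the slides.

```python
def problematic_papers(memberships, assigned, threshold=0.5):
    """Return (paper index, assigned session, membership) for weakly placed papers."""
    flagged = []
    for idx, (weights, session) in enumerate(zip(memberships, assigned)):
        # A low membership in the assigned session suggests a poor fit.
        if weights[session] < threshold:
            flagged.append((idx, session, weights[session]))
    return flagged


# Illustrative memberships for three papers and two sessions.
memberships = [
    {"A": 0.80, "B": 0.20},
    {"A": 0.35, "B": 0.65},
    {"A": 0.10, "B": 0.90},
]
assigned = ["A", "A", "B"]  # sessions chosen in Phase 1
flagged = problematic_papers(memberships, assigned)
```

Here only the second paper is flagged: it sits in session A but belongs more strongly to session B.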
C-MEANS ALGORITHM
generate initial (random) cluster centres
repeat
    for each example calculate membership weights
    for each cluster recompute new centre
until the difference of the clusters between two iterations drops under some threshold

\[
u_k(t) = \frac{1}{\sum_{j} \left( \frac{dist(center_k, t)}{dist(center_j, t)} \right)^{\frac{2}{m-1}}}
\qquad
center_k = \frac{\sum_{t} u_k(t)^m \, t}{\sum_{t} u_k(t)^m}
\]

d (dist): distance, m: fuzziness, k: cluster, t: example
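The loop and the two update formulas can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation. Euclidean distance stands in for the document distance, and the handling of an example that lies exactly on a centre, the `max_iter` cap, and the optional `initial_centers` argument are assumptions added to keep the example robust and deterministic.

```python
import math
import random


def euclidean(a, b):
    """Stand-in distance; any function dist(center, example) can be passed."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def membership_weights(example, centers, m, dist):
    """u_k(t) for one example t and every cluster k (requires m > 1)."""
    d = [dist(c, example) for c in centers]
    # Assumption: an example lying exactly on a centre belongs fully to it;
    # this avoids a division by zero in the formula.
    on_centre = [k for k, dk in enumerate(d) if dk == 0.0]
    if on_centre:
        share = 1.0 / len(on_centre)
        return [share if k in on_centre else 0.0 for k in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dj) ** exponent for dj in d) for dk in d]


def fuzzy_c_means(examples, n_clusters, m=2.0, dist=euclidean,
                  threshold=1e-6, max_iter=100, initial_centers=None, seed=0):
    if initial_centers is None:
        # Random initial centres, drawn from the examples themselves.
        initial_centers = random.Random(seed).sample(examples, n_clusters)
    centers = [list(c) for c in initial_centers]
    n_dims = len(examples[0])
    for _ in range(max_iter):
        # For each example calculate membership weights.
        u = [membership_weights(t, centers, m, dist) for t in examples]
        # For each cluster recompute the centre as a weighted mean.
        new_centers = []
        for k in range(n_clusters):
            w = [u_t[k] ** m for u_t in u]
            total = sum(w)
            new_centers.append(
                [sum(w_t * t[dim] for w_t, t in zip(w, examples)) / total
                 for dim in range(n_dims)])
        # Stop when no centre moved more than the threshold.
        shift = max(dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    u = [membership_weights(t, centers, m, dist) for t in examples]
    return centers, u


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fuzzy_c_means(points, n_clusters=2,
                           initial_centers=[(0.0, 0.0), (10.0, 10.0)])
```

Each row of `u` sums to 1. For these two well-separated groups, every point ends up with a membership close to 1 in its own group's cluster.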
DISTANCE MEASURE
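VECTOR SPACE
- Usual similarity measure: cosine similarity

\[
sim(x_1, x_2) = \cos\Theta = \frac{x_1 \cdot x_2}{\lVert x_1 \rVert \, \lVert x_2 \rVert} \in [0, 1]
\]

C-MEANS EXPLICITLY NEEDS DISTANCE (DISSIMILARITY), NOT SIMILARITY:
- There are many possibilities:

\[
\begin{aligned}
dist(x_1, x_2) &= 1 - \cos\Theta \in [0, 1], \\
dist(x_1, x_2) &= \sin\Theta = \sqrt{1 - \cos^2\Theta} \in [0, 1], \\
dist(x_1, x_2) &= \frac{1}{\cos\Theta} \in [1, \infty).
\end{aligned}
\]

- None has ideal properties.
- Experimental evaluation shows no significant difference.
- We used dist(x_1, x_2) = \sin\Theta.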
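The three candidate distances can be computed directly from the cosine similarity, as in the sketch below. The function names are hypothetical, and returning infinity for orthogonal vectors in the `1 / cos` variant is an assumption of this example.

```python
import math


def cosine(x1, x2):
    """Cosine similarity of two dense, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm = math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2))
    return dot / norm


def dist_one_minus_cos(x1, x2):
    return 1.0 - cosine(x1, x2)


def dist_sin(x1, x2):
    c = cosine(x1, x2)
    # Clamp guards against rounding that pushes cos^2 slightly above 1.
    return math.sqrt(max(0.0, 1.0 - c * c))


def dist_inverse_cos(x1, x2):
    c = cosine(x1, x2)
    # Assumption: orthogonal vectors are treated as infinitely far apart.
    return math.inf if c == 0.0 else 1.0 / c


a, b = (1.0, 0.0), (1.0, 1.0)  # vectors 45 degrees apart
d1 = dist_one_minus_cos(a, b)
d2 = dist_sin(a, b)
d3 = dist_inverse_cos(a, b)
```

For identical directions the first two distances are 0 while the third is 1, which is one way in which none of them has ideal properties.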
COMPARISON OF CRISP & FUZZY CLUSTERING
TIME COMPLEXITY
If the dimensionality of the vectors is much higher than the number of clusters, the time complexity of c-means is comparable to that of k-means (this holds for document clustering).
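\[
\begin{aligned}
\text{k-means:}\quad & O(i_k \, n \, k \, v) \\
\text{c-means:}\quad & O(i_c \, n \, k \, (v + k)) \\
\text{if } O(v) \gg O(k) \text{ then}\quad & O(i_c \, n \, k \, (v + k)) \approx O(i_c \, n \, k \, v)
\end{aligned}
\]

i: number of iterations (i_k for k-means, i_c for c-means), n: number of vectors, k: number of clusters, v: dimensionality of vector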
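A quick arithmetic check of the two cost expressions shows how small the extra factor is when v is much larger than k. The numbers are made up for illustration, and the same iteration count is assumed for both algorithms, which the formulas do not guarantee.

```python
def kmeans_cost(i, n, k, v):
    """Operation count proportional to i * n * k * v."""
    return i * n * k * v


def cmeans_cost(i, n, k, v):
    """Operation count proportional to i * n * k * (v + k)."""
    return i * n * k * (v + k)


# Illustrative sizes: 100 documents, 10 clusters, 10000-dimensional vectors.
i, n, k, v = 20, 100, 10, 10000
ratio = cmeans_cost(i, n, k, v) / kmeans_cost(i, n, k, v)  # equals 1 + k / v
```

With these sizes the ratio is 1 + 10/10000, so the per-iteration overhead of c-means is about 0.1%.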
FURTHER WORK
EVALUATION
- Test scenarios
- Benchmarks
- Using data from past conferences
USER INTERFACE
- Web interface for semi-automatic conference schedule creation
ALGORITHMS FINE-TUNING
…
DISCUSSION
CONTACTS
matjaz.jursic@ijs.si, vid.podpecan@ijs.si, nada.lavrac@ijs.si
THANK YOU FOR YOUR ATTENTION