Clustering
Supervised vs. Unsupervised Learning
Examples of clustering in Web IR
Characteristics of clustering
Clustering algorithms
Cluster Labeling
Supervised vs. Unsupervised Learning
Supervised Learning
Goal: a program that performs a task as well as humans.
- TASK: well defined (the target function)
- EXPERIENCE: training data provided by a human
- PERFORMANCE: error/accuracy on the task

Unsupervised Learning
Goal: find some kind of structure in the data.
- TASK: vaguely defined
- No EXPERIENCE
- No PERFORMANCE (but there are some evaluation metrics)
What is Clustering?
Clustering is the most common form of Unsupervised Learning
Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects
It can be used in IR:
- To improve recall in search applications
- For better navigation of search results
Example 1: Improving Recall
Cluster hypothesis: documents with similar text are related. Thus, when a query matches a document D, also return the other documents in the cluster containing D.
Example 2: Better Navigation
Clustering Characteristics
Flat versus Hierarchical Clustering
- Flat: divide objects into groups (clusters)
- Hierarchical: organize clusters into a subsuming hierarchy
Evaluating Clustering
- Internal criteria: the intra-cluster similarity is high (tightness); the inter-cluster similarity is low (separateness)
- External criteria: did we discover the hidden classes? (we need gold-standard data for this evaluation)
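The external criterion is often made concrete as purity: give each cluster its majority gold label and measure the fraction of instances that match it. A minimal sketch; the function name and the index-based cluster representation are my own illustrative choices, not from the slides:

```python
from collections import Counter

def purity(clusters, labels):
    """External evaluation: fraction of instances whose cluster's
    majority gold label matches their own gold label.
    clusters: lists of instance indices; labels: gold label per index."""
    correct = 0
    total = 0
    for cluster in clusters:
        counts = Counter(labels[i] for i in cluster)
        correct += counts.most_common(1)[0][1]  # size of the majority label
        total += len(cluster)
    return correct / total
```

A purity of 1.0 means every cluster is label-homogeneous; note that purity is trivially maximized by singleton clusters, so it is usually reported alongside the number of clusters.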
Clustering for Web IR
Representation for clustering
- Document representation: vector space? normalization?
- We need a notion of similarity/distance

How many clusters?
- Fixed a priori? Completely data driven?
- Avoid “trivial” clusters - too large or too small
Recall documents as vectors
Each doc j is a vector of tf-idf values, one component for each term. We can normalize the vectors to unit length.

So we have a vector space:
- terms are axes (aka features)
- n docs live in this space
- even with stemming, we may have 20,000+ dimensions
$$\hat{d}_j = \frac{\vec{d}_j}{\lVert \vec{d}_j \rVert} = \frac{\vec{d}_j}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}}, \quad \text{where } w_{i,j} = tf_{i,j} \cdot idf_i$$
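The weighting and normalization above can be sketched in a few lines of Python. The toy corpus and all names (`vocab`, `tfidf_vector`, etc.) are illustrative assumptions, not from the slides:

```python
import math

# Toy corpus: each document is a list of (already stemmed) terms.
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]

vocab = sorted({t for d in docs for t in d})  # terms are the axes
n_docs = len(docs)

# idf_i = log(N / df_i), where df_i counts the docs containing term i.
df = {t: sum(t in d for d in docs) for t in vocab}
idf = {t: math.log(n_docs / df[t]) for t in vocab}

def tfidf_vector(doc):
    """w_{i,j} = tf_{i,j} * idf_i, then normalize to unit length."""
    w = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w] if norm > 0 else w
```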
What makes documents related?
Ideal: semantic similarity. Practical: statistical similarity.

We will use cosine similarity, with documents as vectors. We will describe algorithms in terms of cosine similarity.
Cosine similarity of normalized vectors $\vec{d}_j$ and $\vec{d}_k$:

$$sim(\vec{d}_j, \vec{d}_k) = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}$$
This is known as the normalized inner product.
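As a sketch, the normalized inner product is a one-liner (the function name is my own choice):

```python
def cosine_sim(dj, dk):
    """Inner product of two vectors; this equals the cosine of the angle
    between them when both vectors are unit length."""
    return sum(wj * wk for wj, wk in zip(dj, dk))
```

For unnormalized vectors, divide the result by the product of the two vector norms.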
Intuition for relatedness
[Figure: documents D1-D4 plotted in a two-term vector space with axes t1 and t2]
Documents that are “close together” in vector space talk about the same things.
Clustering Algorithms
Partitioning “flat” algorithms
- Usually start with a random (partial) partitioning and refine it iteratively
- k-means clustering
- Model-based clustering (we will not cover it)

Hierarchical algorithms
- Bottom-up, agglomerative
- Top-down, divisive (we will not cover it)
Partitioning “flat” algorithms
Partitioning method: Construct a partition of n documents into a set of k clusters
Given: a set of documents and the number k
Find: a partition of k clusters that optimizes the chosen partitioning criterion
K-means
Assumes documents are real-valued vectors. Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c:
$$\mu(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, ..., sk} as seeds.
Until clustering converges or another stopping criterion is met:
  For each instance xi:
    Assign xi to the cluster cj such that d(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster:)
  For each cluster cj:
    sj = μ(cj)
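The pseudocode above translates almost line for line into Python. The names (`k_means`, `centroid`, `dist`) and the choice of Euclidean distance are mine; the slides leave the distance measure d abstract:

```python
import random

def dist(a, b):
    """Euclidean distance between two vectors (one illustrative choice of d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(points):
    """mu(c): the component-wise mean of the vectors in a cluster."""
    return [sum(p[i] for p in points) / len(points) for i in range(len(points[0]))]

def k_means(instances, k, max_iters=100, seed=0):
    """Select k random instances as seeds, then alternate assignment
    and centroid updates until the centroids stop moving."""
    rng = random.Random(seed)
    seeds = rng.sample(instances, k)
    clusters = []
    for _ in range(max_iters):
        # Assign each instance to the cluster with the nearest seed.
        clusters = [[] for _ in range(k)]
        for x in instances:
            j = min(range(k), key=lambda j: dist(x, seeds[j]))
            clusters[j].append(x)
        # Update each seed to the centroid of its cluster (keep old
        # seed if a cluster went empty).
        new_seeds = [centroid(c) if c else seeds[j] for j, c in enumerate(clusters)]
        if new_seeds == seeds:  # centroids did not change: converged
            break
        seeds = new_seeds
    return clusters, seeds
```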
K-means: Different Issues
When to stop?
- When a fixed number of iterations is reached
- When centroid positions do not change

Seed Choice
- Results can vary based on random seed selection.
- Try out multiple starting points.
Example showing sensitivity to seeds
[Figure: six example points A-F in the plane]

If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}. If you start with D and F, you converge to {A,B,D,E} and {C,F}.
Hierarchical clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
Example dendrogram (a taxonomy of animals):

animal
├── vertebrate: fish, reptile, amphib., mammal
└── invertebrate: worm, insect, crustacean
Hierarchical Agglomerative Clustering
We assume there is a similarity function that determines the similarity of two instances.
Algorithm:
  Start with all instances in their own cluster.
  Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj.
What is the most similar cluster?
Single-link
- Similarity of the closest points, i.e. the most cosine-similar pair

Complete-link
- Similarity of the “furthest” points, i.e. the least cosine-similar pair

Group-average agglomerative clustering
- Average cosine between pairs of elements

Centroid clustering
- Similarity of the clusters’ centroids
Single link clustering
1) Use the maximum similarity of pairs:

$$sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)$$

2) After merging $c_i$ and $c_j$, the similarity of the resulting cluster to another cluster $c_k$ is:

$$sim(c_i \cup c_j, c_k) = \max\big(sim(c_i, c_k),\, sim(c_j, c_k)\big)$$
Complete link clustering
1) Use the minimum similarity of pairs:

$$sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$$

2) After merging $c_i$ and $c_j$, the similarity of the resulting cluster to another cluster $c_k$ is:

$$sim(c_i \cup c_j, c_k) = \min\big(sim(c_i, c_k),\, sim(c_j, c_k)\big)$$
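Both linkage criteria fit into one agglomerative loop that differs only in how pairwise similarities are aggregated: `max` over pairs gives single link, `min` gives complete link. A small sketch written for clarity rather than speed (it recomputes pair similarities at every step); the function and parameter names are my own:

```python
def hac(instances, sim, linkage=max):
    """Agglomerative clustering: repeatedly merge the two most similar
    clusters. linkage=max -> single link; linkage=min -> complete link."""
    clusters = [[x] for x in instances]  # every instance starts alone
    merges = []                          # record of merged clusters, in order
    while len(clusters) > 1:
        best = None                      # (similarity, i, j) of the best pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(sim(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return merges
```

For document clustering, `sim` would be the cosine similarity of tf-idf vectors; here any symmetric similarity function works.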
Major issue - labeling
After the clustering algorithm finds clusters, how can they be made useful to the end user?

We need a concise label for each cluster:
- In search results, say “Animal” or “Car” in the jaguar example.
- In topic trees (Yahoo), we need navigational cues.

Labeling is often done by hand, a posteriori.
How to Label Clusters
Show titles of typical documents
- Titles are easy to scan
- Authors create them for quick scanning!
- But you can only show a few titles, which may not fully represent the cluster

Show words/phrases prominent in the cluster
- More likely to fully represent the cluster
- Use distinguishing words/phrases
- But harder to scan
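One simple way to pick distinguishing words, sketched below: score each term by its frequency inside the cluster, discounted by its frequency elsewhere in the collection. The scoring formula and names are illustrative assumptions, not a method prescribed by the slides:

```python
from collections import Counter

def label_cluster(cluster_docs, all_docs, top_n=3):
    """Return top_n candidate label terms: frequent inside the cluster,
    rare in the rest of the collection (a simple frequency contrast)."""
    in_counts = Counter(t for d in cluster_docs for t in d)
    out_docs = [d for d in all_docs if d not in cluster_docs]
    out_counts = Counter(t for d in out_docs for t in d)
    # Discount by outside frequency; +1 avoids division by zero.
    score = {t: c / (1 + out_counts[t]) for t, c in in_counts.items()}
    return sorted(score, key=score.get, reverse=True)[:top_n]
```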
Not covered in this lecture
Complexity: clustering is computationally expensive; implementations need a careful balancing of needs.

How do we decide how many clusters are best?

Evaluating the “goodness” of clustering: there are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clustering.