1
I256
Applied Natural Language Processing
Fall 2009
Lecture 15
(Text) clustering
Barbara Rosario
2
Outline
• Motivation and applications for text clustering
• Hard vs. soft clustering
• Flat vs. hierarchical clustering
• Similarity measures
• Flat
– K-means
• Hierarchical
– Agglomerative Clustering
3
Text Clustering
• Finds overall similarities among groups of documents
• Finds overall similarities among groups of tokens (words, adjectives…)
• Goal is to place similar objects in the same groups and to assign dissimilar objects to different groups
4
Motivation
• Smoothing for statistical language models
– Generalization
• Forming bins (by inducing the bins from the data)
From Michael Collins’s slides (MIT 6.864 NLP course)
5
Motivation
• Aid for Question-Answering and Information Retrieval
From Michael Collins’s slides (MIT 6.864 NLP course)
6
Word Similarity
Find semantically related words by combining similarity
evidence from multiple indicators
From Michael Collins’s slides (MIT 6.864 NLP course)
7
Word clustering
From Michael Collins’s slides (MIT 6.864 NLP course)
8
Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93
Clustering of nouns
9
Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93
10
Clustering of adjectives
• Cluster adjectives based on the nouns they modify
• Multiple syntactic clues for modification
Predicting the semantic orientation of adjectives,
V Hatzivassiloglou, KR McKeown, EACL 1997
11
Document clustering
Classification
12
Scatter/Gather: Clustering a Large Text Collection
Cutting, Pedersen, Tukey & Karger 92, 93
Hearst & Pedersen 95
• Cluster sets of documents into general “themes”, like a table of contents
• Display the contents of the clusters by showing topical terms and typical titles
• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”
13
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
14
S/G Example: query on “star”
Encyclopedia text
• 14 sports
• 8 symbols
• 47 film, tv
• 68 film, tv (p)
• 7 music
• 97 astrophysics
• 67 astronomy (p)
• 12 stellar phenomena
• 10 flora/fauna
• 49 galaxies, stars
• 29 constellations
• 7 miscellaneous
Clustering and re-clustering is entirely automated
15
16
17
18
Motivation: Visualization & EDA
• Exploratory data analysis (EDA) (related to visualization)
– Get a feeling for what the data look like
– Try to find overall trends or patterns in text collections
19
Visualization
• Use clustering to map the entire high-dimensional document space into a large number of small clusters.
• “Project” these onto a 2D graphical representation
• Looks neat, but difficult to detect patterns
– Usefulness debatable
20
Motivation: Clustering for Information Retrieval
• The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval.
– Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs.
• Tends to place similar docs together
21
Search result clustering
• Instead of ranked lists, cluster the search results so that similar documents appear together.
• It is often easier to scan a few coherent groups than many individual documents.
– Particularly useful if a search term has different word senses.
– Vivísimo search engine (http://vivisimo.com)
22
23
Motivation: unsupervised classification
• Classification when labeled data is not available
– Also called unsupervised classification
– Results of clustering depend only on the natural division in the data, not on any pre-existing categorization scheme
24
Classification
Class1
Class2
25
Clustering
26
Clustering
27
Methods
• Hard/soft clustering
• Flat/hierarchical clustering
• Similarity measures
• Merging methods
28
Text Clustering
Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw
(Scatter plot of objects along two dimensions, Term 1 and Term 2)
29
Text Clustering
(The same scatter plot along Term 1 and Term 2, with the groups found by clustering)
Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw
30
Hard/soft Clustering
– Hard Clustering -- Each object belongs to a single cluster
– Soft Clustering -- Each object is probabilistically assigned to clusters
31
Soft clustering
• A variation of many clustering methods
• Instead of assigning each data sample to one and only one cluster, it calculates probabilities of membership for all clusters
– A sample might belong to cluster A with probability 0.4 and to cluster B with probability 0.6
• More appropriate for NLP tasks
32
Flat Vs. Hierarchical
• Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other
• Hierarchical clustering produces a hierarchy of nodes
– Leaves are the single objects of the clustered set
– Each node represents the cluster that contains all the objects of its descendants
33
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
34
Flat Vs. Hierarchical
• Flat
– Preferable if efficiency is a consideration or data sets are very large
– K-means is a very simple method that should probably be used first on a new data set because its results are often sufficient
– K-means assumes a simple Euclidean representation space, so it cannot be used for many data sets, for example nominal data like colors
– In such cases use EM (expectation-maximization)
35
Flat Vs. Hierarchical
• Hierarchical
– Preferable for detailed data analysis
– Provides more information than flat clustering
– Does not require us to pre-specify the number of clusters
– Less efficient: the most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents, compared to the linear complexity of most flat clustering methods
36
Clustering issues
• Two main issues
• Similarity measure
• How to cluster data points together (or not)
– Clustering algorithms
– Merging criteria
37
Similarity
• Vector-space representation and similarity computation
• Select important distributional properties of a word
• Create a vector of length n for each word to be classified
• Viewing the n-dimensional vector as a point in an n-dimensional space, cluster points that are near one another
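A minimal sketch of the first two steps in Python, using context-window co-occurrence counts as the “distributional properties” (an illustrative choice, not the only one; the function name is hypothetical):

```python
from collections import Counter, defaultdict

def word_vectors(sentences, window=2):
    # One count vector per word: how often each context word appears
    # within +/- `window` positions of it.
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vectors[w][sent[j]] += 1
    return vectors

sents = [["the", "cat", "drinks", "milk"],
         ["the", "dog", "drinks", "water"]]
vecs = word_vectors(sents)
print(vecs["cat"])  # shared contexts ("the", "drinks") make "cat" and "dog" similar
```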
38
Similarity
From Michael Collins’s slides (MIT 6.864 NLP course)
39
Similarity
From Michael Collins’s slides (MIT 6.864 NLP course)
40
Similarity
From Michael Collins’s slides (MIT 6.864 NLP course)
41
Similarity
From Michael Collins’s slides (MIT 6.864 NLP course)
42
Similarity
From Michael Collins’s slides (MIT 6.864 NLP course)
43
Pair-wise Document Similarity
     nova  galaxy  heat  h’wood  film  role  diet  fur
A     1      3      1
B     5      2
C                           2      1     5
D                           4      1
How to compute document similarity?
44
Pair-wise Document Similarity (no normalization for simplicity)

     nova  galaxy  heat  h’wood  film  role  diet  fur
A     1      3      1
B     5      2
C                           2      1     5
D                           4      1

D_1 = (w_{11}, w_{12}, \ldots, w_{1t})
D_2 = (w_{21}, w_{22}, \ldots, w_{2t})

sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}

sim(A,B) = (1 \cdot 5) + (3 \cdot 2) = 11
sim(A,C) = 0
sim(A,D) = 0
sim(B,C) = 0
sim(B,D) = 0
sim(C,D) = (2 \cdot 4) + (1 \cdot 1) = 9
45
Pair-wise Document Similarity (cosine normalization)

D_1 = (w_{11}, w_{12}, \ldots, w_{1t})
D_2 = (w_{21}, w_{22}, \ldots, w_{2t})

unnormalized:
sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}

cosine normalized:
sim(D_1, D_2) = \frac{\sum_{i=1}^{t} w_{1i} \cdot w_{2i}}{\sqrt{\sum_{i=1}^{t} w_{1i}^2} \cdot \sqrt{\sum_{i=1}^{t} w_{2i}^2}}
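These two formulas translate directly into a small Python sketch (a minimal illustration; the function names are mine, and the vectors A and B reuse the term weights from the example table):

```python
import math

def dot_sim(d1, d2):
    # Unnormalized similarity: sum of products of matching term weights
    return sum(w * d2.get(term, 0.0) for term, w in d1.items())

def cosine_sim(d1, d2):
    # Cosine normalization: divide the dot product by the two vector lengths
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_sim(d1, d2) / (norm1 * norm2)

# Term-weight vectors for documents A and B from the example table
A = {"nova": 1, "galaxy": 3, "heat": 1}
B = {"nova": 5, "galaxy": 2}

print(dot_sim(A, B))     # 11.0, matching sim(A,B) = 11 on the previous slide
print(cosine_sim(A, B))  # about 0.62 after cosine normalization
```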
46
Document/Document Matrix
        D_1    D_2    ...   D_n
D_1     d_11   d_12   ...   d_1n
D_2     d_21   d_22   ...   d_2n
...     ...    ...    ...   ...
D_n     d_n1   d_n2   ...   d_nn

d_ij = similarity of D_i to D_j
47
Similarity
• And many other similarity measures!
48
Flat Clustering: K-means
• K-means is the most important flat clustering algorithm.
• Objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean or centroid \vec{\mu} of the documents in a cluster \omega:
\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}
49
K-Means Clustering
• Decide on a pair-wise similarity measure
1. Compute K centroids
2. Assign each document to the nearest centroid, forming new clusters
3. Unless the termination condition is met, repeat steps 1-2
(a minimal code sketch of this loop follows below)
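A minimal K-means sketch in pure Python (illustrative only: random seeds, plain lists, squared Euclidean distance; a real implementation would add smarter seeding and sparse document vectors):

```python
import random

def kmeans(points, k, iters=100):
    # points: list of equal-length numeric vectors (e.g. document vectors)
    centroids = random.sample(points, k)                  # pick k initial centroids
    clusters = []
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: recompute each centroid as the mean of its cluster
        new_centroids = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append([sum(dim) / len(cluster) for dim in zip(*cluster)])
            else:
                new_centroids.append(centroids[j])        # keep an empty cluster's old centroid
        if new_centroids == centroids:                    # stop once the centroids are stable
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups; the exact cluster order depends on the random seeds
print(kmeans([[1, 1], [1, 2], [8, 8], [9, 8]], k=2))
```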
50
K-means algorithm
A K-means example for K = 2 in R²
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
51
K-means algorithm
• Convergence of the position of the two centroids
From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
52
K-means
• Residual sum of squares (RSS): a measure of how well the centroids represent the members of their clusters
– RSS: the squared distance of each vector from its centroid, summed over all vectors
– RSS is the objective function in K-means, and our goal is to minimize it
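In the notation of the IR-book chapter cited on these slides, with clusters \omega_1, \ldots, \omega_K and centroids \vec{\mu}(\omega_k), this objective is:

RSS_k = \sum_{\vec{x} \in \omega_k} \left| \vec{x} - \vec{\mu}(\omega_k) \right|^2 \qquad RSS = \sum_{k=1}^{K} RSS_k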
53
Model-based clustering
• Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. (Flat)
• The model that we recover from the data then defines clusters and an assignment of documents to clusters.
• EM (expectation-maximization)
54
Hierarchical Clustering
• Agglomerative or bottom-up (a code sketch of this loop follows below):
– Initialization: Start with each sample in its own cluster
– Each iteration: Find the two most similar (closest) clusters and merge them
– Termination: All the objects are in the same cluster
• Divisive or top-down:
– Start with all elements in one cluster
– Partition one of the current clusters in two
– Repeat until all samples are in singleton clusters
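A minimal sketch of the agglomerative loop in Python (illustrative; `similarity` is any pairwise similarity function, e.g. cosine, and the cluster-to-cluster score used here is average-link):

```python
def agglomerative(items, similarity, target_clusters=1):
    # Initialization: every item starts in its own singleton cluster
    clusters = [[x] for x in items]
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the highest (average-link) similarity
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [similarity(a, b) for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        # Merge the two most similar clusters and drop the second one
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Tiny example: 1-D points, similarity = negative distance
print(agglomerative([1, 2, 9, 10], similarity=lambda a, b: -abs(a - b), target_clusters=2))
# -> [[1, 2], [9, 10]]
```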
55
Agglomerative Clustering
A B C D E F G H I
56
Agglomerative Clustering
A B C D E F G H I
57
Agglomerative Clustering
A B C D E F G H I
58
Merging nodes/Clustering function
• Each node is a combination of the documents combined below it
• We represent the merged nodes as a vector of term weights
• This vector is referred to as the cluster centroid
59
Clustering functions (aka merging criteria)
• Extending the distance measure from samples to sets of samples
• Similarity of the 2 most similar members (single-link)
• Similarity of the 2 least similar members (complete-link)
• Average similarity between members (average-link)
(these three criteria are sketched as code below)
From Michael Collins’s slides (MIT 6.864 NLP course)
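The three merging criteria above correspond directly to single-link, complete-link and group-average cluster similarity. A minimal sketch (illustrative function names; `sim` is any pairwise similarity, e.g. cosine):

```python
def single_link(c1, c2, sim):
    # Similarity of the two MOST similar members of the two clusters
    return max(sim(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, sim):
    # Similarity of the two LEAST similar members of the two clusters
    return min(sim(a, b) for a in c1 for b in c2)

def average_link(c1, c2, sim):
    # Average similarity over all cross-cluster member pairs
    sims = [sim(a, b) for a in c1 for b in c2]
    return sum(sims) / len(sims)
```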
60
Single-link merging criteria
Merge the closest pair of clusters.
Single-link: clusters are close if any of their points are close
dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B
Initially, each word type is a single-point cluster; then repeatedly merge.
61
Fast, but tends to produce long, stringy, meandering clusters ...
Bottom-Up Clustering – Single-Link
62
Bottom-Up Clustering – Complete-Link
Again, merge the closest pair of clusters.
Complete-link: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B
distance between clusters
63
Bottom-Up Clustering – Complete-Link
distance between clusters
Slow to find closest pair – need quadratically many distances
64
Choosing k
• How to select an appropriate level of granularity?
• Too small, and clusters provide insufficient generalization
• Too large, and they are inappropriately generalized
65
Choosing k
• In both hierarchical and k-means/medians, we need to be told where to stop, i.e., how many clusters to form
• This is partially alleviated by visual inspection of the hierarchical tree (the dendrogram)
• It would be nice if we could find an optimal k from the data
• We can do this by trying different values of k and seeing which produces the best separation among the resulting clusters (a small sketch of this idea follows below).
• And there are some theoretical measures
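One common way to try different values of k, sketched here with scikit-learn (an external library, not part of the original slides), is to run K-means for each candidate k and keep the value with the best silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, k_values):
    # Higher silhouette = tighter, better-separated clusters
    best_k, best_score = None, -1.0
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Toy data with two obvious groups; the expected answer is k = 2
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
print(choose_k(X, k_values=[2, 3, 4, 5]))
```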
66
How to evaluate clusters?
• In practice, it’s hard to do
– Different algorithms’ results look good and bad in different ways
– It’s difficult to distinguish their outcomes
• In theory, define an evaluation function
– Typically choose something easy to measure (e.g., the sum of the average distance in each class)
67
How to evaluate clusters?
• Perform task-based evaluation
• Test the resulting clusters intuitively, i.e., inspect them and see if they make sense. Not advisable.
• Have an expert generate clusters manually, and test the automatically generated ones against them.
• Test the clusters against a predefined classification if there is one
From Michael Collins’s slides (MIT 6.864 NLP course)
68
Resources
• FCLUSTER - A tool for fuzzy cluster analysis
• LNKnet Pattern Classification Software
• Principal Direction Divisive Partitioning
• k-means clustering
• Text Clustering
– http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf (Chapters 16 and 17)