I256 Applied Natural Language Processing, Fall 2009, Lecture 15: (Text) clustering, Barbara Rosario
Transcript
Page 1

I256

Applied Natural Language Processing

Fall 2009

Lecture 15

(Text) clustering

Barbara Rosario

Page 2

Outline

• Motivation and applications for text clustering
• Hard vs. soft clustering
• Flat vs. hierarchical clustering
• Similarity measures
• Flat
  – K-means
• Hierarchical
  – Agglomerative Clustering

Page 3

Text Clustering

• Finds overall similarities among groups of documents

• Finds overall similarities among groups of tokens (words, adjectives…)

• Goal is to place similar objects in the same groups and to assign dissimilar objects to different groups

Page 4

Motivation

• Smoothing for statistical language models
  – Generalization
• Forming bins (by inducing the bins from the data)

From Michael Collins’s slides (MIT 6.864 NLP course)

Page 5

Motivation

• Aid for Question-Answering and Information Retrieval

From Michael Collins’s slides (MIT 6.864 NLP course)

Page 6

Word Similarity

Find semantically related words by combining similarity evidence from multiple indicators

From Michael Collins’s slides (MIT 6.864 NLP course)

Page 7

Word clustering

From Michael Collins’s slides (MIT 6.864 NLP course)

Page 8

Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93

Clustering of nouns

Page 9

Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93

Page 10

Clustering of adjectives

• Cluster adjectives based on the nouns they modify
• Multiple syntactic clues for modification

Predicting the semantic orientation of adjectives, V. Hatzivassiloglou and K. R. McKeown, EACL 1997

Page 11

Document clustering

Classification

Page 12

Scatter/Gather: Clustering a Large Text Collection

Cutting, Pedersen, Tukey & Karger 92, 93

Hearst & Pedersen 95

• Cluster sets of documents into general “themes”, like a table of contents

• Display the contents of the clusters by showing topical terms and typical titles

• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”

Page 14

S/G Example: query on “star”

Encyclopedia text

14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous

Clustering and re-clustering is entirely automated

Page 15

Page 16

Page 17

Page 18

Motivation: Visualization & EDA

• Exploratory data analysis (EDA) (related to visualization)
  – Get a feeling for what the data look like
  – Try to find overall trends or patterns in text collections

Page 19

Visualization

• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.

• “Project” these onto a 2D graphical representation

• Looks neat, but difficult to detect patterns
  – Usefulness debatable

Page 20

Motivation: Clustering for Information Retrieval

• The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval.
  – Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs.

• Tends to place similar docs together

Page 21

Search result clustering

• Instead of lists, cluster the search results so that similar documents appear together.

• It is often easier to scan a few coherent groups than many individual documents.
  – Particularly useful if a search term has different word senses.
  – Vivísimo search engine (http://vivisimo.com)

Page 22

Page 23

Motivation: unsupervised classification

• Classification when labeled data is not available
  – Also called unsupervised classification
  – Results of clustering depend only on the natural division in the data, not on any pre-existing categorization scheme

Page 24

Classification

Class1

Class2

Page 25

Clustering

Page 26

Clustering

Page 27

Methods

• Hard/soft clustering

• Flat/hierarchical clustering

• Similarity measures

• Merging methods

Page 28

Text Clustering

Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw

Term 1

Term 2

Page 29

Text Clustering

Term 1

Term 2

Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw

Page 30

Hard/soft Clustering

– Hard Clustering -- Each object belongs to a single cluster

– Soft Clustering -- Each object is probabilistically assigned to clusters

Page 31

Soft clustering

• A variation of many clustering methods

• Instead of assigning each data sample to one and only one cluster, it calculates probabilities of membership for all clusters
  – A sample might belong to cluster A with probability 0.4 and to cluster B with probability 0.6

• More appropriate for NLP tasks

Page 32

Flat Vs. Hierarchical

• Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other

• Hierarchical clustering produces a hierarchy of nodes
  – Leaves are the single objects of the clustered set
  – Each node represents the cluster that contains all the objects of its descendants

Page 34

Flat Vs. Hierarchical

• Flat
  – Preferable if efficiency is a consideration or data sets are very large
  – K-means is a very simple method that should probably be used first on a new data set because its results are often sufficient
  – K-means assumes a simple Euclidean representation space, so it cannot be used for many data sets, for example nominal data like colors
  – In such cases, use EM (expectation maximization)

Page 35

Flat Vs. Hierarchical

• Hierarchical
  – Preferable for detailed data analysis
  – Provides more information than flat clustering
  – Does not require us to pre-specify the number of clusters
  – Less efficient: the most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents, compared to the linear complexity of most flat clustering methods

Page 36

Clustering issues

• Two main issues

• Similarity measure

• How to cluster data points together (or not)
  – Clustering algorithms
  – Merging criteria

Page 37

Similarity

• Vector-space representation and similarity computation

• Select important distributional properties of a word

• Create a vector of length n for each word to be classified

• Viewing the n-dimensional vector as a point in an n-dimensional space, cluster points that are near one another
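To make the vector-space idea concrete, here is a minimal sketch (not from the slides; the context words and counts are invented for illustration): each word becomes a vector of co-occurrence counts, and points that are near one another are candidates for the same cluster.

```python
import numpy as np

# Each word is described by counts of how often it co-occurs with a few
# hand-picked context words (its "distributional properties").
context = ["drink", "eat", "drive", "park"]
vectors = {
    "wine":  np.array([8.0, 2.0, 0.0, 0.0]),
    "beer":  np.array([9.0, 1.0, 0.0, 1.0]),
    "truck": np.array([0.0, 0.0, 7.0, 5.0]),
}

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["wine"], vectors["beer"]))   # high: similar distributions
print(cosine(vectors["wine"], vectors["truck"]))  # 0.0: no shared contexts
```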

Page 38

Similarity

From Michael Collins’s slides (MIT 6.864 NLP course)

Page 39

Similarity

From Michael Collins’s slides (MIT 6.864 NLP course)

Page 40

Similarity

From Michael Collins’s slides (MIT 6.864 NLP course)

Page 41

Similarity

From Michael Collins’s slides (MIT 6.864 NLP course)

Page 42

Similarity

From Michael Collins’s slides (MIT 6.864 NLP course)

Page 43

Pair-wise Document Similarity

      nova  galaxy  heat  h'wood  film  role  diet  fur
A      1      3      1
B      5      2
C                            2      1            5
D                            4      1

How to compute document similarity?

Page 44

Pair-wise Document Similarity (no normalization for simplicity)

      nova  galaxy  heat  h'wood  film  role  diet  fur
A      1      3      1
B      5      2
C                            2      1            5
D                            4      1

$$D_1 = (w_{11}, w_{12}, \ldots, w_{1t}), \qquad D_2 = (w_{21}, w_{22}, \ldots, w_{2t})$$

$$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}$$

sim(A, B) = (1·5) + (3·2) = 11
sim(A, C) = 0
sim(A, D) = 0
sim(B, C) = 0
sim(B, D) = 0
sim(C, D) = (2·4) + (1·1) = 9

Page 45

Pair-wise Document Similarity (cosine normalization)

$$D_1 = (w_{11}, w_{12}, \ldots, w_{1t}), \qquad D_2 = (w_{21}, w_{22}, \ldots, w_{2t})$$

$$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i} \quad \text{(unnormalized)}$$

$$sim(D_1, D_2) = \frac{\sum_{i=1}^{t} w_{1i} \cdot w_{2i}}{\sqrt{\sum_{i=1}^{t} w_{1i}^2} \cdot \sqrt{\sum_{i=1}^{t} w_{2i}^2}} \quad \text{(cosine normalized)}$$
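A small NumPy check of the two formulas above on the example vectors (the exact column placement for documents C and D is inferred from the slide's similarity values, so treat it as an assumption):

```python
import numpy as np

# Term order: nova, galaxy, heat, h'wood, film, role, diet, fur
docs = {
    "A": np.array([1, 3, 1, 0, 0, 0, 0, 0]),
    "B": np.array([5, 2, 0, 0, 0, 0, 0, 0]),
    "C": np.array([0, 0, 0, 2, 1, 0, 5, 0]),
    "D": np.array([0, 0, 0, 4, 1, 0, 0, 0]),
}

def sim(d1, d2):
    """Unnormalized similarity: dot product of term-weight vectors."""
    return int(d1 @ d2)

def cos_sim(d1, d2):
    """Cosine-normalized similarity."""
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

print(sim(docs["A"], docs["B"]))      # 11, as on the slide
print(sim(docs["C"], docs["D"]))      # 9
print(cos_sim(docs["A"], docs["B"]))  # same pair, rescaled to [0, 1]
```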

Page 46

Document/Document Matrix

$$\begin{array}{c|cccc}
 & D_1 & D_2 & \cdots & D_n \\ \hline
D_1 & d_{11} & d_{12} & \cdots & d_{1n} \\
D_2 & d_{21} & d_{22} & \cdots & d_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
D_n & d_{n1} & d_{n2} & \cdots & d_{nn} \\
\end{array}$$

where $d_{ij}$ = similarity of $D_i$ to $D_j$

Page 47

Similarity

• And many other similarity measures!

Page 48

Flat Clustering: K-means

• K-means is the most important flat clustering algorithm.

• Objective is to minimize the average squared Euclidean distance of documents from their cluster centers where a cluster center is defined as the mean or centroid μ of the documents in a cluster ω:
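The centroid formula itself did not survive the transcript; following the standard definition in the IR book cited on the next slides, it is:

$$\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$$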

Page 49

K-Means Clustering

• Decide on a pair-wise similarity measure

1. Compute K centroids

2. Assign each document to nearest center, forming new clusters

3. Repeat steps 1–2 until the termination condition is met
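A minimal NumPy sketch of these steps (illustrative only; the initialization scheme, Euclidean distance, and convergence test are standard choices, not taken from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: X is an (n_docs, n_terms) array of term weights."""
    rng = np.random.default_rng(seed)
    # 1. Compute K centroids (here: K randomly chosen documents)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each document to the nearest centroid, forming new clusters
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned documents
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 3. Terminate when the centroids stop moving; otherwise repeat 1-2
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```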

Page 50

K-means algorithm

A K-means example for K = 2 in R²

From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Page 51

K-means algorithm

• Convergence of the position of the two centroids

From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Page 52

K-means

• Residual sum of squares, or RSS: a measure of how well the centroids represent the members of their clusters
  – RSS: the squared distance of each vector from its centroid, summed over all vectors
  – RSS is the objective function in K-means and our goal is to minimize it
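Written out, with $\omega_k$ the k-th cluster and $\vec{\mu}(\omega_k)$ its centroid as defined above:

$$RSS_k = \sum_{\vec{x} \in \omega_k} \left|\vec{x} - \vec{\mu}(\omega_k)\right|^2, \qquad RSS = \sum_{k=1}^{K} RSS_k$$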

Page 53

Model-based clustering

• Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. (Flat)

• The model that we recover from the data then defines clusters and an assignment of documents to clusters.

• EM (expectation-maximization)
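As one concrete illustration (not from the slides), scikit-learn's GaussianMixture fits such a generative model with EM and exposes exactly the soft memberships discussed earlier; the data matrix here is a random stand-in:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 20)             # stand-in for a documents-by-terms matrix
gm = GaussianMixture(n_components=3, random_state=0).fit(X)  # fit by EM

hard_labels = gm.predict(X)              # hard assignment: most likely cluster
soft_memberships = gm.predict_proba(X)   # soft assignment: P(cluster | document)
```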

Page 54

Hierarchical Clustering

• Agglomerative or bottom-up:
  – Initialization: start with each sample in its own cluster
  – Merge the two closest clusters
  – Each iteration: find the two most similar clusters and merge them
  – Termination: all the objects are in the same cluster
• Divisive or top-down:
  – Start with all elements in one cluster
  – Partition one of the current clusters in two
  – Repeat until all samples are in singleton clusters
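For the agglomerative (bottom-up) procedure above, SciPy's hierarchy module implements the merge loop directly; a minimal sketch with a random stand-in data matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(8, 5)                             # 8 samples, 5 features
Z = linkage(X, method="single", metric="euclidean")  # bottom-up merges (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")      # cut the tree into 3 clusters
```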

Page 55

Agglomerative Clustering

A B C D E F G H I

Page 56

Agglomerative Clustering

A B C D E F G H I

Page 57

Agglomerative Clustering

A B C D E F G H I

Page 58

Merging nodes/Clustering function

• Each node is a combination of the documents combined below it

• We represent the merged nodes as a vector of term weights

• This vector is referred to as the cluster centroid

Page 59

Clustering functions (aka merging criteria)

• Extending the distance measure from samples to sets of samples

• Similarity of the 2 most similar members
• Similarity of the 2 least similar members
• Average similarity between members

From Michael Collins’s slides (MIT 6.864 NLP course)
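A minimal sketch of these three criteria as Python functions, written in terms of distances rather than similarities (the dist argument, any pairwise distance function between two samples, is an assumption):

```python
def single_link(A, B, dist):
    """Distance between the two closest (most similar) members."""
    return min(dist(a, b) for a in A for b in B)

def complete_link(A, B, dist):
    """Distance between the two farthest (least similar) members."""
    return max(dist(a, b) for a in A for b in B)

def average_link(A, B, dist):
    """Average distance over all cross-cluster pairs of members."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))
```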

Page 60

Single-link merging criteria

Merge the closest pair of clusters:
Single-link: clusters are close if any of their points are close
dist(A, B) = min dist(a, b) for a ∈ A, b ∈ B

each word type starts as a single-point cluster

Page 61

Bottom-Up Clustering – Single-Link

Fast, but tends to get long, stringy, meandering clusters...

Page 62

Bottom-Up Clustering – Complete-Link

Again, merge the closest pair of clusters:
Complete-link: dist(A, B) = max dist(a, b) for a ∈ A, b ∈ B

distance between clusters

Page 63

Bottom-Up Clustering – Complete-Link

distance between clusters

Slow to find closest pair – need quadratically many distances

Page 64

Choosing k

• How to select an appropriate level of granularity?

• Too small, and clusters provide insufficient generalization

• Too large, and they are inappropriately generalized

Page 65

Choosing k

• In both hierarchical and k-means/medians, we need to be told where to stop, i.e., how many clusters to form

• This is partially alleviated by visual inspection of the hierarchical tree (the dendrogram)

• It would be nice if we could find an optimal k from the data

• We can do this by trying different values of k and seeing which produces the best separation among the resulting clusters.

• And there are some theoretical measures
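One simple way to "try different values of k" is to fit K-means for a range of k and compare the RSS of each solution (the data matrix and range of k below are placeholders; scikit-learn's inertia_ attribute is the RSS defined earlier):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 50)    # stand-in for a documents-by-terms matrix
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)       # RSS for this k; look for an "elbow" as k grows
```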

Page 66

How to evaluate clusters?

• In practice, it's hard to do
  – Different algorithms' results look good and bad in different ways
  – It's difficult to distinguish their outcomes
• In theory, define an evaluation function
  – Typically choose something easy to measure (e.g., the sum of the average distance in each class)

Page 67

How to evaluate clusters?

• Perform task-based evaluation
• Test the resulting clusters intuitively, i.e., inspect them and see if they make sense. Not advisable.
• Have an expert generate clusters manually, and test the automatically generated ones against them.
• Test the clusters against a predefined classification if there is one

From Michael Collins’s slides (MIT 6.864 NLP course)

