
EACL 2006 Pedersen

Page 1: Eacl 2006 Pedersen

Language Independent Methods of Clustering Similar Contexts (with applications)

Ted Pedersen
University of Minnesota, Duluth

http://www.d.umn.edu/~tpederse
[email protected]

Page 2: Eacl 2006 Pedersen

Language Independent Methods

Do not utilize syntactic information
No parsers, part of speech taggers, etc. required
Do not utilize dictionaries or other manually created lexical resources
Based on lexical features selected from corpora
Assumption: word segmentation can be done by looking for white spaces between strings
No manually annotated data of any kind; the methods are completely unsupervised in the strictest sense

Page 3: Eacl 2006 Pedersen

Clustering Similar Contexts

A context is a short unit of text, often a phrase to a paragraph in length, although it can be longer
Input: N contexts
Output: K clusters
Where the members of a cluster are more similar to each other than to the contexts found in other clusters

Page 4: Eacl 2006 Pedersen

Applications

Headed contexts (contain target word)
Name Discrimination
Word Sense Discrimination
Headless contexts
Email Organization
Document Clustering
Paraphrase identification
Clustering Sets of Related Words

Page 5: Eacl 2006 Pedersen

Tutorial Outline

Identifying lexical features
Measures of association & tests of significance
Context representations
First & second order
Dimensionality reduction
Singular Value Decomposition
Clustering
Partitional techniques
Cluster stopping
Cluster labeling
Hands On Exercises

Page 6: Eacl 2006 Pedersen

General Info

Please fill out short survey
Break from 4:00-4:30pm
Finish at 6pm
Reception tonight at 7pm at Castle (?)
Slides and video from the tutorial will be posted (I will send you email when that is ready)
Questions are welcome
Now, or via email to me or the SenseClusters list
Comments, observations, criticisms are all welcome
The Knoppix CD will give you Linux and SenseClusters when the computer is booted from the CD

Page 7: Eacl 2006 Pedersen

SenseClusters

A package for clustering contexts
http://senseclusters.sourceforge.net
SenseClusters Live! (Knoppix CD)
Integrates with various other tools
Ngram Statistics Package
CLUTO
SVDPACKC

Page 8: Eacl 2006 Pedersen

Many thanks…

Amruta Purandare (M.S., 2004)
Founding developer of SenseClusters (2002-2004)
Now PhD student in Intelligent Systems at the University of Pittsburgh
http://www.cs.pitt.edu/~amruta/
Anagha Kulkarni (M.S., 2006, expected)
Enhancing SenseClusters since Fall 2004!
http://www.d.umn.edu/~kulka020/
National Science Foundation (USA) for supporting Amruta, Anagha and me via CAREER award #0092784

Page 9: Eacl 2006 Pedersen

Background and Motivations

Page 10: Eacl 2006 Pedersen

Headed and Headless Contexts

A headed context includes a target word
Our goal is to cluster the target words based on their surrounding contexts
The target word is the center of the context and our attention
A headless context has no target word
Our goal is to cluster the contexts based on their similarity to each other
The focus is on the context as a whole

Page 11: Eacl 2006 Pedersen

Headed Contexts (input)

I can hear the ocean in that shell.
My operating system shell is bash.
The shells on the shore are lovely.
The shell command line is flexible.
The oyster shell is very hard and black.

Page 12: Eacl 2006 Pedersen

Headed Contexts (output)

Cluster 1:
My operating system shell is bash.
The shell command line is flexible.

Cluster 2:
The shells on the shore are lovely.
The oyster shell is very hard and black.
I can hear the ocean in that shell.

Page 13: Eacl 2006 Pedersen

Headless Contexts (input)

The new version of Linux is more stable and has better support for cameras.
My Chevy Malibu has had some front end troubles.
Osborne made one of the first personal computers.
The brakes went out, and the car flew into the house.
With the price of gasoline, I think I’ll be taking the bus more often!

Page 14: Eacl 2006 Pedersen

Headless Contexts (output)

Cluster 1:
The new version of Linux is more stable and has better support for cameras.
Osborne made one of the first personal computers.

Cluster 2:
My Chevy Malibu has had some front end troubles.
The brakes went out, and the car flew into the house.
With the price of gasoline, I think I’ll be taking the bus more often!

Page 15: Eacl 2006 Pedersen

Web Search as Application

Web search results are headed contexts
The search term is the target word (found in snippets)
Web search results are often disorganized – two people sharing the same name, two organizations sharing the same abbreviation, etc. often have their pages “mixed up”
If you click on search results or follow links in pages found, you will encounter headless contexts too…

Page 16: Eacl 2006 Pedersen

Name Discrimination

Page 17: Eacl 2006 Pedersen

George Millers!


Page 23: Eacl 2006 Pedersen

Email Foldering as Application

Email (public or private) is made up of headless contexts
Short, usually focused…
Cluster similar email messages together
Automatic email foldering
Take all messages from the sent-mail file or inbox and organize them into categories


Page 26: Eacl 2006 Pedersen

Clustering News as Application

News articles are headless contexts
Entire article or first paragraph
Short, usually focused
Cluster similar articles together


Page 30: Eacl 2006 Pedersen

What is it to be “similar”?

You shall know a word by the company it keeps
Firth, 1957 (Studies in Linguistic Analysis)
Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis)
Harris, 1968 (Mathematical Structures of Language)
Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis)
Miller and Charles, 1991 (Language and Cognitive Processes)
Various extensions…
Similar contexts will have similar meanings, etc.
Names that occur in similar contexts will refer to the same underlying person, etc.

Page 31: Eacl 2006 Pedersen

General Methodology

Represent contexts to be clustered using first or second order feature vectors
Lexical features
Reduce dimensionality to make vectors more tractable and/or understandable
Singular value decomposition
Cluster the context vectors
Find the number of clusters
Label the clusters
Evaluate and/or use the contexts!

Page 32: Eacl 2006 Pedersen

Identifying Lexical Features

Measures of Association and Tests of Significance

Page 33: Eacl 2006 Pedersen

What are features?

Features represent the (hopefully) salient characteristics of the contexts to be clustered
Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features
Vectors/contexts that include many of the same features will be similar to each other

Page 34: Eacl 2006 Pedersen

Where do features come from?

In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered
This is not cheating, since the data to be clustered does not have any labeled classes that can be used to assist feature selection
It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step
Email or news articles

Page 35: Eacl 2006 Pedersen

Feature Selection

“Test” data – the contexts to be clustered
Assume that the feature selection data is the same as the test data, unless otherwise indicated
“Training” data – a separate corpus of held out feature selection data (that will not be clustered)
May need to use if you have a small number of contexts to cluster (e.g., web search results)
This sense of “training” is due to Schütze (1998)

Page 36: Eacl 2006 Pedersen

Lexical Features

Unigram – a single word that occurs more than a given number of times
Bigram – an ordered pair of words that occur together more often than expected by chance
Consecutive or may have intervening words
Co-occurrence – an unordered bigram
Target Co-occurrence – a co-occurrence where one of the words is the target word

Page 37: Eacl 2006 Pedersen

Bigrams

fine wine (window size of 2)
baseball bat
house of representatives (window size of 3)
president of the republic (window size of 4)
apple orchard
Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?)

Page 38: Eacl 2006 Pedersen

Co-occurrences

tropics water
boat fish
law president
train travel
Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations

Page 39: Eacl 2006 Pedersen

Bigrams and Co-occurrences

Pairs of words tend to be much less ambiguous than unigrams
“bank” versus “river bank” and “bank card”
“dot” versus “dot com” and “dot product”
Trigrams and beyond occur much less frequently (Ngrams are very Zipfian)
Unigrams are noisy, but bountiful

Page 40: Eacl 2006 Pedersen

“occur together more often than expected by chance…”

Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix
Throw out bigrams that include one or two stop words
Expected values are calculated, based on the model of independence and the observed values
How often would you expect these words to occur together, if they only occurred together by chance?
If two words occur “significantly” more often than the expected value, then the words do not occur together by chance.
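As a worked instance of the expected value calculation described above, using the marginal totals from the contingency table on the next few slides:

\mathrm{expected}(w_i, w_j) = \frac{\text{row total}(w_i) \times \text{column total}(w_j)}{N},
\qquad
\mathrm{expected}(\text{Artificial}, \text{Intelligence}) = \frac{400 \times 300}{100{,}000} = 1.2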

Page 41: Eacl 2006 Pedersen

2x2 Contingency Table

              Intelligence   !Intelligence
Artificial         100              ?             400
!Artificial         ?               ?               ?
                   300              ?         100,000

(only the joint count and the marginal totals are given here; the remaining cells appear on the next slide)

Page 42: Eacl 2006 Pedersen

2x2 Contingency Table

              Intelligence   !Intelligence
Artificial         100             300            400
!Artificial        200          99,400          99,600
                   300          99,700         100,000

Page 43: Eacl 2006 Pedersen

2x2 Contingency Table

              Intelligence        !Intelligence
Artificial    100.0 (1.2)         300.0 (398.8)           400
!Artificial   200.0 (298.8)       99,400.0 (99,301.2)     99,600
              300                 99,700                  100,000

(observed counts, with expected values under independence in parentheses)

Page 44: Eacl 2006 Pedersen

Measures of Association

G^2 = 2 \sum_{i,j=1}^{2} \mathrm{observed}(w_i, w_j) \, \log \frac{\mathrm{observed}(w_i, w_j)}{\mathrm{expected}(w_i, w_j)}

X^2 = \sum_{i,j=1}^{2} \frac{\left[ \mathrm{observed}(w_i, w_j) - \mathrm{expected}(w_i, w_j) \right]^2}{\mathrm{expected}(w_i, w_j)}
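A minimal Python sketch of these two formulas (my illustration; the tutorial itself relies on the Perl-based Ngram Statistics Package), applied to the Artificial/Intelligence table from the previous slides:

import math

# Observed 2x2 counts for (Artificial, Intelligence) from the slides above.
observed = [[100.0, 300.0],
            [200.0, 99400.0]]

n = sum(sum(row) for row in observed)                      # 100,000
row_totals = [sum(row) for row in observed]                # 400 and 99,600
col_totals = [sum(col) for col in zip(*observed)]          # 300 and 99,700

# Expected counts under the model of independence.
expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
            for i in range(2)]                             # 1.2, 398.8, 298.8, 99,301.2

g2 = 2 * sum(observed[i][j] * math.log(observed[i][j] / expected[i][j])
             for i in range(2) for j in range(2))
x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))

print(round(g2, 2), round(x2, 2))   # roughly 750.88 and 8191.78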

Page 45: Eacl 2006 Pedersen

Measures of Association

G^2 = 750.88
X^2 = 8191.78

Page 46: Eacl 2006 Pedersen

Interpreting the Scores…

G^2 and X^2 are asymptotically approximated by the chi-squared distribution…
This means… if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get …


Page 48: Eacl 2006 Pedersen

Interpreting the Scores…

Values above a certain level of significance can be considered grounds for rejecting the null hypothesis
H0: the words in the bigram are independent
3.841 is associated with 95% confidence that the null hypothesis should be rejected

Page 49: Eacl 2006 Pedersen

Measures of Association

There are numerous measures of association that can be used to identify bigram and co-occurrence features
Many of these are supported in the Ngram Statistics Package (NSP)
http://www.d.umn.edu/~tpederse/nsp.html

Page 50: Eacl 2006 Pedersen

Measures Supported in NSP

Log-likelihood Ratio (ll)
True Mutual Information (tmi)
Pearson’s Chi-squared Test (x2)
Pointwise Mutual Information (pmi)
Phi coefficient (phi)
T-test (tscore)
Fisher’s Exact Test (leftFisher, rightFisher)
Dice Coefficient (dice)
Odds Ratio (odds)

Page 51: Eacl 2006 Pedersen

Summary

Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data
Language independent
Unigrams are usually only selected by frequency
Remember, there is no labeled data from which to learn, so unigrams are somewhat less effective as features than in the supervised case
Bigrams and co-occurrences can also be selected by frequency, or better yet by measures of association
Bigrams and co-occurrences need not be consecutive
Stop words should be eliminated
Frequency thresholds are helpful (e.g., a unigram/bigram that occurs once may be too rare to be useful)

Page 52: Eacl 2006 Pedersen

Related Work

Moore, 2004 (EMNLP): follow-up to Dunning and Pedersen on log-likelihood and exact tests
http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf
Pedersen, 1996 (SCSUG): explanation of exact tests, and comparison to log-likelihood
http://arxiv.org/abs/cmp-lg/9608010
(also see Pedersen, Kayaalp, and Bruce, AAAI-1996)
Dunning, 1993 (Computational Linguistics): introduces the log-likelihood ratio for collocation identification
http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf

Page 53: Eacl 2006 Pedersen

Context Representations

First and Second Order Methods

Page 54: Eacl 2006 Pedersen

Once features selected…

We will have a set of unigrams, bigrams, co-occurrences or target co-occurrences that we believe are somehow interesting and useful
We also have any frequency and measure of association scores that were used in their selection
Convert the contexts to be clustered into a vector representation based on these features

Page 55: Eacl 2006 Pedersen

First Order Representation

Each context is represented by a vector with M dimensions, each of which indicates whether or not a particular feature occurred in that context
Value may be binary, a frequency count, or an association score
Context by Feature representation
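A minimal sketch (mine, not SenseClusters' own code) that builds binary first order vectors for the example contexts and unigram features shown on the next few slides:

# Unigram features (from the example feature set a few slides ahead).
features = ["island", "black", "curse", "magic", "child"]

contexts = [
    "There was an island curse of black magic cast by that voodoo child.",
    "Harold, a known voodoo child, was gifted in the arts of black magic.",
    "Despite their military might, it was a serious error to attack.",
    "Military might is no defense against a voodoo child or an island curse.",
]

def first_order_vector(context, features):
    """Binary context-by-feature vector: 1 if the feature occurs in the context."""
    tokens = context.lower().replace(",", " ").replace(".", " ").split()
    return [1 if f in tokens else 0 for f in features]

for c in contexts:
    print(first_order_vector(c, features))
# Reproduces the "First Order Vectors of Unigrams" table:
# [1, 1, 1, 1, 1], [0, 1, 0, 1, 1], [0, 0, 0, 0, 0], [1, 0, 1, 0, 1]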

Page 56: Eacl 2006 Pedersen

Contexts

Cxt1: There was an island curse of black magic cast by that voodoo child.
Cxt2: Harold, a known voodoo child, was gifted in the arts of black magic.
Cxt3: Despite their military might, it was a serious error to attack.
Cxt4: Military might is no defense against a voodoo child or an island curse.

Page 57: Eacl 2006 Pedersen

Unigram Feature Set

island 1000
black 700
curse 500
magic 400
child 200

(assume these are frequency counts obtained from some corpus…)

Page 58: Eacl 2006 Pedersen

First Order Vectors of Unigrams

        island  black  curse  magic  child
Cxt1      1       1      1      1      1
Cxt2      0       1      0      1      1
Cxt3      0       0      0      0      0
Cxt4      1       0      1      0      1

Page 59: Eacl 2006 Pedersen

Bigram Feature Set

island curse 189.2
black magic 123.5
voodoo child 120.0
military might 100.3
serious error 89.2
island child 73.2
voodoo might 69.4
military error 54.9
black child 43.2
serious curse 21.2

(assume these are log-likelihood scores based on frequency counts from some corpus)

Page 60: Eacl 2006 Pedersen

First Order Vectors of Bigrams

        black magic  island curse  military might  serious error  voodoo child
Cxt1         1             1              0               0              1
Cxt2         1             0              0               0              1
Cxt3         0             0              1               1              0
Cxt4         0             1              1               0              1

Page 61: Eacl 2006 Pedersen

First Order Vectors

Can have binary values or weights associated with frequency, etc.
Forms a context by feature matrix
May optionally be smoothed/reduced with Singular Value Decomposition
More on that later…
The contexts are ready for clustering…
More on that later…

Page 62: Eacl 2006 Pedersen

Second Order Features

First order features encode the occurrence of a feature in a context
Feature occurrence represented by a binary value
Second order features encode something ‘extra’ about a feature that occurs in a context
Feature occurrence represented by word co-occurrences
Feature occurrence represented by context occurrences

Page 63: Eacl 2006 Pedersen

Second Order Representation

First, build a word by word matrix from the features
Based on bigrams or co-occurrences
First word is the row, second word is the column, the cell is the score
(optionally) reduce dimensionality with SVD
Each row forms a vector of first order co-occurrences
Second, replace each word in a context with its row/vector as found in the word by word matrix
Average all the word vectors in the context to create the second order representation
Due to Schütze (1998), related to LSI/LSA

Page 64: Eacl 2006 Pedersen

Word by Word Matrix

          magic   curse   might   error   child
black     123.5     0       0       0      43.2
island      0     189.2     0       0      73.2
military    0       0     100.3    54.9     0
serious     0      21.2     0      89.2     0
voodoo      0       0      69.4     0     120.0

Page 65: Eacl 2006 Pedersen

Word by Word Matrix

… can also be used to identify sets of related words
In the case of bigrams, rows represent the first word in a bigram and columns represent the second word
Matrix is asymmetric
In the case of co-occurrences, rows and columns are equivalent
Matrix is symmetric
The vector (row) for each word represents a set of first order features for that word
Each word in a context to be clustered for which a vector exists (in the word by word matrix) is replaced by that vector in that context

Page 66: Eacl 2006 Pedersen

There was an island curse of black magic cast by that voodoo child.

          magic   curse   might   error   child
black     123.5     0       0       0      43.2
island      0     189.2     0       0      73.2
voodoo      0       0      69.4     0     120.0

Page 67: Eacl 2006 Pedersen

Second Order Co-Occurrences

Word vectors for “black” and “island” show similarity as both occur with “child”
“black” and “island” are second order co-occurrences of each other, since both occur with “child” but not with each other (i.e., “black island” is not observed)

Page 68: Eacl 2006 Pedersen

Second Order Representation

There was an [curse, child] curse of [magic, child] magic cast by that [might, child] child

[curse, child] + [magic, child] + [might, child]

Page 69: Eacl 2006 Pedersen

There was an island curse of black magic cast by that voodoo child.

        magic   curse   might   error   child
Cxt1    41.2    63.1    24.4      0     78.8

Page 70: Eacl 2006 Pedersen

Second Order Representation

Results in a Context by Feature (Word) Representation
Cell values do not indicate whether the feature occurred in the context. Rather, they show the strength of association of that feature with other words that occur with a word in the context.

Page 71: Eacl 2006 Pedersen

Summary

First order representations are intuitive, but…
Can suffer from sparsity
Contexts are represented based on the features that occur in those contexts
Second order representations are harder to visualize, but…
Allow a word to be represented by the words it co-occurs with (i.e., the company it keeps)
Allow a context to be represented by the words that occur with the words in the context
Helps combat sparsity…

Page 72: Eacl 2006 Pedersen

Related Work

Pedersen and Bruce, 1997 (EMNLP): presented a first order method of discrimination
http://acl.ldc.upenn.edu/W/W97/W97-0322.pdf
Schütze, 1998 (Computational Linguistics): introduced the second order method
http://acl.ldc.upenn.edu/J/J98/J98-1004.pdf
Purandare and Pedersen, 2004 (CoNLL): compared first and second order methods
http://acl.ldc.upenn.edu/hlt-naacl2004/conll04/pdf/purandare.pdf
First order better if you have lots of data
Second order better with smaller amounts of data

Page 73: Eacl 2006 Pedersen

Dimensionality Reduction

Singular Value Decomposition

Page 74: Eacl 2006 Pedersen

Motivation

First order matrices are very sparse
Context by feature
Word by word
NLP data is noisy
No stemming performed
Synonyms

Page 75: Eacl 2006 Pedersen

Many Methods

Singular Value Decomposition (SVD)
SVDPACKC http://www.netlib.org/svdpack/
Multi-Dimensional Scaling (MDS)
Principal Components Analysis (PCA)
Independent Components Analysis (ICA)
Linear Discriminant Analysis (LDA)
etc…

Page 76: Eacl 2006 Pedersen

Effect of SVD

SVD reduces a matrix to a given number of dimensions. This may convert a word level space into a semantic or conceptual space
If “dog”, “collie” and “wolf” are dimensions/columns in a word co-occurrence matrix, after SVD they may be a single dimension that represents “canines”

Page 77: Eacl 2006 Pedersen

Effect of SVD

The dimensions of the matrix after SVD are principal components that represent the meaning of concepts
Similar columns are grouped together
SVD is a way of smoothing a very sparse matrix, so that there are very few zero valued cells after SVD

Page 78: Eacl 2006 Pedersen

How can SVD be used?

SVD on first order contexts will reduce a context by feature representation down to a smaller number of features
Latent Semantic Analysis typically performs SVD on a feature by context representation, where the contexts are reduced
SVD is used in creating second order context representations
Reduce the word by word matrix

Page 79: Eacl 2006 Pedersen

Word by Word Matrix

         apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
pc         2      0      0     1    3     1     0        0        0       0      0
body       0      3      0     0    0     0     2        0        0       2      1
disk       1      0      0     2    0     3     0        1        2       0      0
petri      0      2      1     0    0     0     2        0        1       0      1
lab        0      0      3     0    2     0     2        0        2       1      3
sales      0      0      0     2    3     0     0        1        2       0      0
linux      2      0      0     1    3     2     0        1        1       0      0
debt       0      0      0     2    3     4     0        2        0       0      0

Page 80: Eacl 2006 Pedersen

Singular Value Decomposition

A = UDV'

Page 81: Eacl 2006 Pedersen

U

 .35   .09  -.2    .52  -.09   .40   .02   .63   .20  -.00  -.02
 .05  -.49   .59   .44   .08  -.09  -.44  -.04  -.6   -.02  -.01
 .35   .13   .39  -.60   .31   .41  -.22   .20  -.39   .00   .03
 .08  -.45   .25  -.02   .17   .09   .83   .05  -.26  -.01   .00
 .29  -.68  -.45  -.34  -.31   .02  -.21   .01   .43  -.02  -.07
 .37  -.01  -.31   .09   .72  -.48  -.04   .03   .31  -.00   .08
 .46   .11  -.08   .24  -.01   .39   .05   .08   .08  -.00  -.01
 .56   .25   .30  -.07  -.49  -.52   .14  -.3   -.30   .00  -.07

Page 82: Eacl 2006 Pedersen

D

9.19
6.36
3.99
3.25
2.52
2.30
1.26
0.66
0.00
0.00
0.00

Page 83: Eacl 2006 Pedersen

V

 .21   .08  -.04   .28   .04   .86  -.05  -.05  -.31  -.12   .03
 .04  -.37   .57   .39   .23  -.04   .26  -.02   .03   .25   .44
 .11  -.39  -.27  -.32  -.30   .06   .17   .15  -.41   .58   .07
 .37   .15   .12  -.12   .39  -.17  -.13   .71  -.31  -.12   .03
 .63  -.01  -.45   .52  -.09  -.26   .08  -.06   .21   .08  -.02
 .49   .27   .50  -.32  -.45   .13   .02  -.01   .31   .12  -.03
 .09  -.51   .20   .05  -.05   .02   .29   .08  -.04  -.31  -.71
 .25   .11   .15  -.12   .02  -.32   .05  -.59  -.62  -.23   .07
 .28  -.23  -.14  -.45   .64   .17  -.04  -.32   .31   .12  -.03
 .04  -.26   .19   .17  -.06  -.07  -.87  -.10  -.07   .22  -.20
 .11  -.47  -.12  -.18  -.27   .03  -.18   .09   .12  -.58   .50

Page 84: Eacl 2006 Pedersen

Word by Word Matrix After SVD

         apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
pc        .73    .00    .11   1.3   2.0    .01      .86      .77     .00    .09
body      .00    1.2    1.3   .00   .33    1.6      .00      .85     .84    1.5
disk      .76    .00    .01   1.3   2.1    .00      .91      .72     .00    .00
germ      .00    1.1    1.2   .00   .49    1.5      .00      .86     .77    1.4
lab       .21    1.7    2.0   .35   1.7    2.5      .18      1.7     1.2    2.3
sales     .73    .15    .39   1.3   2.2    .35      .85      .98     .17    .41
linux     .96    .00    .16   1.7   2.7    .03     1.1      1.0      .00    .13
debt      1.2    .00    .00   2.1   3.2    .00     1.5      1.1      .00    .00

Page 85: Eacl 2006 Pedersen

Second Order Representation

• I got a new disk today!
• What do you think of linux?

         apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
disk      .76    .00    .01   1.3   2.1    .00      .91      .72     .00    .00
linux     .96    .00    .16   1.7   2.7    .03     1.1      1.0      .00    .13

These two contexts share no words in common, yet they are similar! disk and linux both occur with “Apple”, “IBM”, “data”, “graphics”, and “memory”
The two contexts are similar because they share many second order co-occurrences

Page 86: Eacl 2006 Pedersen

Relationship to LSA

Latent Semantic Analysis uses a feature by context first order representation
Indicates all the contexts in which a feature occurs
Use SVD to reduce dimensions (contexts)
Cluster features based on the similarity of the contexts in which they occur
Represent sentences using an average of feature vectors

Page 87: Eacl 2006 Pedersen

Feature by Context Representation

                 Cxt1  Cxt2  Cxt3  Cxt4
black magic        1     1     0     1
island curse       1     0     0     1
military might     0     0     1     0
serious error      0     0     1     0
voodoo child       1     1     0     1

Page 88: Eacl 2006 Pedersen

References

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, vol. 41, 1990
Landauer, T. and Dumais, S., A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge, Psychological Review, vol. 104, 1997
Schütze, H., Automatic Word Sense Discrimination, Computational Linguistics, vol. 24, 1998
Berry, M.W., Drmac, Z., and Jessup, E.R., Matrices, Vector Spaces, and Information Retrieval, SIAM Review, vol. 41, 1999

Page 89: Eacl 2006 Pedersen

Clustering

Partitional Methods
Cluster Stopping
Cluster Labeling

Page 90: Eacl 2006 Pedersen

Many many methods…

Cluto supports a wide range of different clustering methods
Agglomerative
Average, single, complete link…
Partitional
K-means (Direct)
Hybrid
Repeated bisections
SenseClusters integrates with Cluto
http://www-users.cs.umn.edu/~karypis/cluto/

Page 91: Eacl 2006 Pedersen

General Methodology

Represent contexts to be clustered in first or second order vectors
Cluster the context vectors directly
vcluster
… or convert to a similarity matrix and then cluster
scluster

Page 92: Eacl 2006 Pedersen

Partitional Methods

Randomly create centroids equal to the number of clusters you wish to find
Assign each context to the nearest centroid
After all contexts are assigned, re-compute the centroids
“best” location decided by a criterion function
Repeat until stable clusters are found
Centroids don't shift from iteration to iteration
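A minimal sketch (mine; SenseClusters delegates clustering to CLUTO) of the partitional loop just described:

import numpy as np

def kmeans(vectors, k, iterations=100, seed=0):
    """Basic partitional clustering: random centroids, assign, re-compute, repeat."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each context vector to its nearest centroid (Euclidean distance here;
        # cosine similarity is the more usual choice for context vectors).
        distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Re-compute each centroid as the mean of the vectors assigned to it.
        new_centroids = np.array([vectors[labels == c].mean(axis=0) if np.any(labels == c)
                                  else centroids[c] for c in range(k)])
        if np.allclose(new_centroids, centroids):   # stable: centroids stopped shifting
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: two obvious groups of 2-d "context vectors".
data = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
print(kmeans(data, k=2)[0])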

Page 93: Eacl 2006 Pedersen

Partitional Methods

Advantages: fast
Disadvantages:
Results can be dependent on the initial placement of centroids
Must specify the number of clusters ahead of time
maybe not…

Page 94: Eacl 2006 Pedersen

Vectors to be clustered

Page 95: Eacl 2006 Pedersen

Random Initial Centroids (k=2)

Page 96: Eacl 2006 Pedersen

Assignment of Clusters

Page 97: Eacl 2006 Pedersen

Recalculation of Centroids

Page 98: Eacl 2006 Pedersen

Reassignment of Clusters

Page 99: Eacl 2006 Pedersen

Recalculation of Centroid

Page 100: Eacl 2006 Pedersen

Reassignment of Clusters

Page 101: Eacl 2006 Pedersen

Partitional Criterion Functions

Intra-Cluster (Internal) similarity/distance
How close together are members of a cluster?
Closer together is better
Inter-Cluster (External) similarity/distance
How far apart are the different clusters?
Further apart is better

Page 102: Eacl 2006 Pedersen

Intra Cluster Similarity

Ball of String (I1)
How far is each member from each other member
Flower (I2)
How far is each member of the cluster from the centroid

Page 103: Eacl 2006 Pedersen

Contexts to be Clustered

Page 104: Eacl 2006 Pedersen

Ball of String (I1 Internal Criterion Function)

Page 105: Eacl 2006 Pedersen

Flower (I2 Internal Criterion Function)

Page 106: Eacl 2006 Pedersen

Inter Cluster Similarity

The Fan (E1)
How far is each centroid from the centroid of the entire collection of contexts
Maximize that distance

Page 107: Eacl 2006 Pedersen

The Fan (E1 External Criterion Function)

Page 108: Eacl 2006 Pedersen

Hybrid Criterion Functions

Balance internal and external similarity
H1 = I1/E1
H2 = I2/E1
Want internal similarity to increase, while external similarity decreases
Want internal distances to decrease, while external distances increase
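A loose sketch (my addition, illustrating the internal/external idea above rather than CLUTO's exact I1/I2/E1 definitions) of computing an internal and an external quantity for a clustering:

import numpy as np

def criterion_scores(vectors, labels):
    """Illustrative internal (I2-like) and external (E1-like) quantities.
    Internal: how close members are to their own centroid (larger cosine = tighter).
    External: how close cluster centroids are to the collection centroid."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    collection_centroid = vectors.mean(axis=0)
    internal, external = 0.0, 0.0
    for c in np.unique(labels):
        members = vectors[labels == c]
        centroid = members.mean(axis=0)
        internal += sum(cos(m, centroid) for m in members)              # I2-like
        external += len(members) * cos(centroid, collection_centroid)   # E1-like
    return internal, external

data = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])
labels = np.array([0, 0, 1, 1])
i2, e1 = criterion_scores(data, labels)
print(round(i2, 3), round(e1, 3), round(i2 / e1, 3))   # an H2-like ratio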

Page 109: Eacl 2006 Pedersen

Cluster Stopping

Page 110: Eacl 2006 Pedersen

Cluster Stopping

Many clustering algorithms require that the user specify the number of clusters prior to clustering
But the user often doesn't know the number of clusters, and in fact finding that out might be the goal of clustering

Page 111: Eacl 2006 Pedersen

Criterion Functions Can Help

Run a partitional algorithm for k = 1 to deltaK
deltaK is a user estimated or automatically determined upper bound for the number of clusters
Find the value of k at which the criterion function does not significantly increase at k+1
Clustering can stop at this value, since no further improvement in the solution is apparent with additional clusters (increases in k)

Page 112: Eacl 2006 Pedersen

SenseClusters' Approach to Cluster Stopping

Will be the subject of a Demo at EACL
Demo Session 2
5th April, 14:30-16:00
Ted Pedersen and Anagha Kulkarni: Selecting the "Right" Number of Senses Based on Clustering Criterion Functions

Page 113: Eacl 2006 Pedersen

H2 versus k
T. Blair – V. Putin – S. Hussein

Page 114: Eacl 2006 Pedersen

PK2

Based on Hartigan, 1975
When the ratio approaches 1, clustering is at a plateau
Select the value of k which is closest to but outside of the standard deviation interval

PK2(k) = \frac{H2(k)}{H2(k-1)}

Page 115: Eacl 2006 Pedersen

PK2 predicts 3 senses
T. Blair – V. Putin – S. Hussein

Page 116: Eacl 2006 Pedersen

PK3

Related to Salvador and Chan, 2004
Inspired by the Dice Coefficient
Values close to 1 mean clustering is improving…
Select the value of k which is closest to but outside of the standard deviation interval

PK3(k) = \frac{2 \times H2(k)}{H2(k-1) + H2(k+1)}
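A minimal sketch (mine) of computing PK2 and PK3 from a sequence of H2 values and applying the selection rule described above; the H2 numbers are made up for illustration:

import numpy as np

# Hypothetical H2 criterion values for k = 1..7 clusters (illustration only).
h2 = {1: 0.40, 2: 0.55, 3: 0.70, 4: 0.72, 5: 0.73, 6: 0.74, 7: 0.745}

# PK2(k) = H2(k) / H2(k-1): ratios near 1 mean the criterion has reached a plateau.
pk2 = {k: h2[k] / h2[k - 1] for k in range(2, max(h2) + 1)}

# PK3(k) = 2 * H2(k) / (H2(k-1) + H2(k+1)).
pk3 = {k: 2 * h2[k] / (h2[k - 1] + h2[k + 1]) for k in range(2, max(h2))}

def predict_k(scores):
    """Select the k whose score is outside the mean +/- one standard deviation
    interval but closest to it (the rule stated on the slides above)."""
    vals = np.array(list(scores.values()))
    lo, hi = vals.mean() - vals.std(), vals.mean() + vals.std()
    outside = {k: v for k, v in scores.items() if not (lo <= v <= hi)}
    if not outside:
        return None
    return min(outside, key=lambda k: min(abs(scores[k] - lo), abs(scores[k] - hi)))

print(predict_k(pk2), predict_k(pk3))   # both pick k = 3 for this made-up sequence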

Page 117: Eacl 2006 Pedersen

PK3 predicts 3 senses
T. Blair – V. Putin – S. Hussein

Page 118: Eacl 2006 Pedersen

References

Hartigan, J., Clustering Algorithms, Wiley, 1975
Basis for SenseClusters stopping method PK2
Mojena, R., Hierarchical Grouping Methods and Stopping Rules: An Evaluation, The Computer Journal, vol. 20, 1977
Basis for SenseClusters stopping method PK1
Milligan, G. and Cooper, M., An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, vol. 50, 1985
Very extensive comparison of cluster stopping methods
Tibshirani, R., Walther, G., and Hastie, T., Estimating the Number of Clusters in a Dataset via the Gap Statistic, Journal of the Royal Statistical Society (Series B), 2001
Pedersen, T. and Kulkarni, A., Selecting the "Right" Number of Senses Based on Clustering Criterion Functions, Proceedings of the Posters and Demo Program of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, 2006
Describes the SenseClusters stopping methods

Page 119: Eacl 2006 Pedersen

Cluster Labeling

Page 120: Eacl 2006 Pedersen

Cluster Labeling

Once a cluster is discovered, how can you generate a description of the contexts of that cluster automatically?
In the case of contexts, you might be able to identify significant lexical features from the contents of the clusters, and use those as a preliminary label

Page 121: Eacl 2006 Pedersen

Results of Clustering

Each cluster consists of some number of contexts
Each context is a short unit of text
Apply measures of association to the contents of each cluster to determine the N most significant bigrams
Use those bigrams as a label for the cluster
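A minimal sketch (mine) of the labeling idea: rank the bigrams inside a cluster by a simple observed versus expected association score (a stand-in for the NSP measures) and keep the top N as a descriptive label:

import math
from collections import Counter
from itertools import tee

def bigrams(tokens):
    a, b = tee(tokens)
    next(b, None)
    return zip(a, b)

def descriptive_label(cluster_contexts, n=3, stop=("the", "a", "an", "of", "is")):
    """Top-n bigrams from one cluster, ranked by observed vs. expected frequency."""
    unigram_counts, bigram_counts, total = Counter(), Counter(), 0
    for context in cluster_contexts:
        tokens = [t for t in context.lower().split() if t not in stop]
        unigram_counts.update(tokens)
        bigram_counts.update(bigrams(tokens))
        total += len(tokens)

    def score(pair):
        observed = bigram_counts[pair]
        expected = unigram_counts[pair[0]] * unigram_counts[pair[1]] / total
        return observed * math.log(observed / expected)   # one observed/expected term

    ranked = sorted(bigram_counts, key=score, reverse=True)
    return [" ".join(pair) for pair in ranked[:n]]

cluster = ["my operating system shell is bash",
           "the shell command line is flexible"]
print(descriptive_label(cluster))   # three highest scoring bigrams in the cluster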

Page 122: Eacl 2006 Pedersen

Label Types

The N most significant bigrams for each cluster will act as a descriptive label
The M most significant bigrams that are unique to each cluster will act as a discriminating label

Page 123: Eacl 2006 Pedersen

Baseline Algorithm

Baseline Algorithm – group all instances into one cluster; this will reach “accuracy” equal to the majority classifier
What if the clustering said everything should be in the same cluster?

Page 124: Eacl 2006 Pedersen

Baseline Performance

          S1   S2   S3   Totals
C1         0    0    0      0
C2         0    0    0      0
C3        80   35   55    170
Totals    80   35   55    170

          S3   S2   S1   Totals
C1         0    0    0      0
C2         0    0    0      0
C3        55   35   80    170
Totals    55   35   80    170

(0+0+55)/170 = .32 if C3 is S3
(0+0+80)/170 = .47 if C3 is S1

Page 125: Eacl 2006 Pedersen

Things to Try

Feature Identification
Type of Feature
Measures of association
Context Representation (1st or 2nd order)
Automatic Stopping (or not)
SVD (or not)
Clustering Algorithm and Criterion Function
Evaluation
Labeling

Page 126: Eacl 2006 Pedersen

Thank you!

Questions or comments on the tutorial or SenseClusters are welcome at any time
[email protected]
SenseClusters is freely available via Live CD, the Web, and in source code form
http://senseclusters.sourceforge.net
SenseClusters papers are available at:
http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

