Word Meaning and Similarity
Word Senses and Word Relations
Slides are adapted from Dan Jurafsky
Reminder: lemma and wordform
• A lemma or citation form
  • Same stem, part of speech, rough semantics
• A wordform
  • The “inflected” word as it appears in text
Wordform    Lemma
banks       bank
sung        sing
duermes     dormir
Lemmas have senses
• One lemma “bank” can have many meanings:
  • Sense 1: “…a bank can hold the investments in a custodial account…”
  • Sense 2: “…as agriculture burgeons on the east bank the river will shrink even more”
• Sense (or word sense)
  • A discrete representation of an aspect of a word’s meaning.
• The lemma bank here has two senses
Homonymy
Homonyms: words that share a form but have unrelated, distinct meanings:
• bank1: financial institution, bank2: sloping land
• bat1: club for hitting a ball, bat2: nocturnal flying mammal
1. Homographs (bank/bank, bat/bat)
2. Homophones:
   1. write and right
   2. piece and peace
Homonymy causes problems for NLP applications
• Information retrieval
  • “bat care”
• Machine Translation
  • bat: murciélago (animal) or bate (for baseball)
• Text-to-Speech
  • bass (stringed instrument) vs. bass (fish)
Polysemy
• 1. The bank was constructed in 1875 out of local red brick.
• 2. I withdrew the money from the bank
• Are those the same sense?
  • Sense 2: “A financial institution”
  • Sense 1: “The building belonging to a financial institution”
• A polysemous word has related meanings
  • Most non-rare words have multiple meanings
Metonymy or Systematic Polysemy: A systematic relationship between senses
• Lots of types of polysemy are systematic
  • School, university, hospital
  • All can mean the institution or the building.
• A systematic relationship:
  • Building ↔ Organization
• Other such kinds of systematic polysemy:
  • Author (Jane Austen wrote Emma) ↔ Works of Author (I love Jane Austen)
  • Tree (Plums have beautiful blossoms) ↔ Fruit (I ate a preserved plum)
How do we know when a word has more than one sense?
• The “zeugma” test: Two senses of serve?
  • Which flights serve breakfast?
  • Does Lufthansa serve Philadelphia?
  • ?Does Lufthansa serve breakfast and San Jose?
• Since this conjunction sounds weird, we say that these are two different senses of “serve”
Synonyms
• Words that have the same meaning in some or all contexts.
  • filbert / hazelnut
  • couch / sofa
  • big / large
  • automobile / car
  • vomit / throw up
  • water / H2O
• Two lexemes are synonyms
  • if they can be substituted for each other in all situations
  • If so they have the same propositional meaning
Synonyms
• But there are few (or no) examples of perfect synonymy.
  • Even if many aspects of meaning are identical
  • Still may not preserve the acceptability based on notions of politeness, slang, register, genre, etc.
• Example:
  • water / H2O
  • big / large
  • brave / courageous
Synonymy is a relation between senses rather than words
• Consider the words big and large
• Are they synonyms?
  • How big is that plane?
  • Would I be flying on a large or small plane?
• How about here:
  • Miss Nelson became a kind of big sister to Benjamin.
  • ?Miss Nelson became a kind of large sister to Benjamin.
• Why?
  • big has a sense that means being older, or grown up
  • large lacks this sense
Antonyms
• Senses that are opposites with respect to one feature of meaning
• Otherwise, they are very similar!
  dark/light   short/long   fast/slow   rise/fall
  hot/cold   up/down   in/out
• More formally: antonyms can
  • define a binary opposition or be at opposite ends of a scale
    • long/short, fast/slow
  • be reversives:
    • rise/fall, up/down
Hyponymy and Hypernymy
• One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
  • car is a hyponym of vehicle
  • mango is a hyponym of fruit
• Conversely hypernym/superordinate (“hyper is super”)
  • vehicle is a hypernym of car
  • fruit is a hypernym of mango
Superordinate/hyper    vehicle   fruit   furniture
Subordinate/hyponym    car       mango   chair
Hyponymy more formally
• Extensional:
  • The class denoted by the superordinate extensionally includes the class denoted by the hyponym
• Entailment:
  • A sense A is a hyponym of sense B if being an A entails being a B
• Hyponymy is usually transitive
  • (A hypo B and B hypo C entails A hypo C)
• Another name: the IS-A hierarchy
  • A IS-A B (or A ISA B)
  • B subsumes A
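The transitivity of hyponymy can be sketched in a few lines of Python. This is only an illustration: car/vehicle and mango/fruit come from the slides, while the remaining nodes and the helper name `is_hyponym` are assumptions made for the example.

```python
# Toy IS-A hierarchy: each sense maps to its direct hypernym (None marks the root).
# Only car -> vehicle and mango -> fruit come from the slides; the rest is assumed.
HYPERNYM = {
    "car": "vehicle",
    "vehicle": "artifact",
    "mango": "fruit",
    "fruit": "food",
    "artifact": "entity",
    "food": "entity",
    "entity": None,
}


def is_hyponym(a, b):
    """Return True if sense a IS-A sense b, following hypernym links transitively."""
    node = HYPERNYM.get(a)
    while node is not None:
        if node == b:
            return True
        node = HYPERNYM.get(node)
    return False


print(is_hyponym("car", "vehicle"))  # direct hypernym link
print(is_hyponym("car", "entity"))   # holds only by transitivity
print(is_hyponym("vehicle", "car"))  # the relation is not symmetric
```

Storing only the direct hypernym of each sense and walking upward is one simple design; it assumes each sense has a single hypernym, which real thesauri do not guarantee.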
Hyponyms and Instances
• WordNet has both classes and instances.
• An instance is an individual, a proper noun that is a unique entity
  • San Francisco is an instance of city
  • But city is a class
  • city is a hyponym of municipality...location...
WordNet and other Online Thesauri
Applications of Thesauri and Ontologies
• Information Extraction
• Information Retrieval
• Question Answering
• Bioinformatics and Medical Informatics
• Machine Translation
WordNet 3.0
• A hierarchically organized lexical database
• On-line thesaurus + aspects of a dictionary
  • Some other languages available or under development
  • (Arabic, Finnish, German, Portuguese…)
Category     Unique Strings
Noun         117,798
Verb         11,529
Adjective    22,479
Adverb       4,481
Senses of “bass” in WordNet
How is “sense” defined in WordNet?
• The synset (synonym set), the set of near-synonyms, instantiates a sense or concept, with a gloss
• Example: chump as a noun with the gloss: “a person who is gullible and easy to take advantage of”
• This sense of “chump” is shared by 9 words: chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1, soft touch1, mug2
• Each of these senses has this same gloss
  • (Not every sense; sense 2 of gull is the aquatic bird)
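A synset can be pictured as a small record: a set of (lemma, sense number) pairs plus one shared gloss. The sketch below uses the chump example from the slide; the class and function names are hypothetical and do not mirror any particular WordNet library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    """A sense/concept: a set of word senses that share one gloss."""
    members: frozenset  # (lemma, sense number) pairs
    gloss: str


# The nine word senses listed on the slide for this sense of "chump".
chump = Synset(
    members=frozenset({
        ("chump", 1), ("fool", 2), ("gull", 1), ("mark", 9), ("patsy", 1),
        ("fall guy", 1), ("sucker", 1), ("soft touch", 1), ("mug", 2),
    }),
    gloss="a person who is gullible and easy to take advantage of",
)


def shares_sense(synset, lemma, sense_number):
    """True if this particular sense of the lemma belongs to the synset."""
    return (lemma, sense_number) in synset.members


print(len(chump.members))               # 9
print(shares_sense(chump, "gull", 1))   # True: the gullible-person sense
print(shares_sense(chump, "gull", 2))   # False: sense 2 is the aquatic bird
```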
WordNet Hypernym Hierarchy for “bass”
WordNet Noun Relations
WordNet 3.0
• Where it is:
  • http://wordnetweb.princeton.edu/perl/webwn
• Libraries
  • Python: WordNet from NLTK
    • http://www.nltk.org/Home
  • Java:
    • JWNL, extJWNL on SourceForge
MeSH: Medical Subject Headings thesaurus from the National Library of Medicine
• MeSH (Medical Subject Headings)
  • 177,000 entry terms that correspond to 26,142 biomedical “headings”
• Hemoglobins
  • Entry Terms (the synset): Eryhem, Ferrous Hemoglobin, Hemoglobin
  • Definition: The oxygen-carrying proteins of ERYTHROCYTES. They are found in all vertebrates and some invertebrates. The number of globin subunits in the hemoglobin quaternary structure differs between species. Structures range from monomeric to a variety of multimeric arrangements
The MeSH Hierarchy
Uses of the MeSH Ontology
• Provide synonyms (“entry terms”)
  • E.g., glucose and dextrose
• Provide hypernyms (from the hierarchy)
  • E.g., glucose ISA monosaccharide
• Indexing in MEDLINE/PubMed database
  • NLM’s bibliographic database:
    • 20 million journal articles
    • Each article hand-assigned 10-20 MeSH terms
Word Similarity: Thesaurus Methods
Word Similarity
• Synonymy: a binary relation
  • Two words are either synonymous or not
• Similarity (or distance): a looser metric
  • Two words are more similar if they share more features of meaning
• Similarity is properly a relation between senses
  • The word “bank” is not similar to the word “slope”
  • bank1 is similar to fund3
  • bank2 is similar to slope5
• But we’ll compute similarity over both words and senses
Why word similarity
• Information retrieval
• Question answering
• Machine translation
• Natural language generation
• Language modeling
• Automatic essay grading
• Plagiarism detection
• Document clustering
Word similarity and word relatedness
• We often distinguish word similarity from word relatedness
  • Similar words: near-synonyms
  • Related words: can be related in any way
    • car, bicycle: similar
    • car, gasoline: related, not similar
Two classes of similarity algorithms
• Thesaurus-based algorithms
  • Are words “nearby” in hypernym hierarchy?
  • Do words have similar glosses (definitions)?
• Distributional algorithms
  • Do words have similar distributional contexts?
Path-based similarity
• Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
  • = have a short path between them
  • concepts have path 1 to themselves
Refinements to path-based similarity
• pathlen(c1,c2) = 1 + number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2
• $\mathrm{sim}_{path}(c_1,c_2) = \frac{1}{\mathrm{pathlen}(c_1,c_2)}$
  • ranges from 0 to 1 (identity)
• $\mathrm{wordsim}(w_1,w_2) = \max_{c_1 \in \mathrm{senses}(w_1),\, c_2 \in \mathrm{senses}(w_2)} \mathrm{sim}_{path}(c_1,c_2)$
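The word-level score is the maximum over all sense pairs. A minimal sketch follows, assuming a hypothetical sense inventory and made-up sense-level scores; only the sense labels bank1, bank2, fund3 and slope5 come from the slides, and the numbers are for illustration only.

```python
from itertools import product

# Hypothetical sense inventory; sense labels follow the slides.
SENSES = {
    "bank": ["bank1", "bank2"],
    "fund": ["fund3"],
    "slope": ["slope5"],
}

# Made-up sense-level similarity scores, for illustration only.
SENSE_SIM = {
    frozenset({"bank1", "fund3"}): 0.5,
    frozenset({"bank2", "fund3"}): 0.1,
    frozenset({"bank1", "slope5"}): 0.1,
    frozenset({"bank2", "slope5"}): 0.5,
}


def sense_sim(c1, c2):
    """Look up the similarity of two senses (1.0 for identical senses, 0.0 if unlisted)."""
    if c1 == c2:
        return 1.0
    return SENSE_SIM.get(frozenset({c1, c2}), 0.0)


def wordsim(w1, w2):
    """Maximum sense-level similarity over every pair of senses of w1 and w2."""
    return max(sense_sim(c1, c2) for c1, c2 in product(SENSES[w1], SENSES[w2]))


print(wordsim("bank", "fund"))   # driven by the bank1/fund3 pair
print(wordsim("bank", "slope"))  # driven by the bank2/slope5 pair
```

Taking the maximum means a word pair scores as high as its best-matching senses, so an ambiguous word like bank can look similar to both fund and slope.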
Example: path-based similarity
simpath(c1,c2) = 1/pathlen(c1,c2)
simpath(nickel, coin) = 1/2 = .5
simpath(fund, budget) = 1/2 = .5
simpath(nickel, currency) = 1/4 = .25
simpath(nickel, money) = 1/6 = .17
simpath(coinage, Richter scale) = 1/6 = .17
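The example values can be reproduced with a breadth-first search over a small hypernym graph. The edges below are an assumption: they were chosen so that the path lengths match the numbers on the slide, and the figure itself is not part of this text.

```python
from collections import deque

# Hypernym edges (child -> parent). Assumed so that the resulting path lengths
# agree with the slide's example values; not taken from the original figure.
PARENT = {
    "nickel": "coin",
    "dime": "coin",
    "coin": "coinage",
    "coinage": "currency",
    "currency": "medium of exchange",
    "money": "medium of exchange",
    "fund": "money",
    "budget": "fund",
    "medium of exchange": "standard",
    "scale": "standard",
    "Richter scale": "scale",
}


def neighbors(node):
    """Nodes one edge away from node: its children and its parent."""
    result = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        result.append(PARENT[node])
    return result


def pathlen(c1, c2):
    """1 + number of edges on the shortest path between c1 and c2."""
    seen = {c1}
    queue = deque([(c1, 0)])
    while queue:
        node, edges = queue.popleft()
        if node == c2:
            return 1 + edges
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, edges + 1))
    raise ValueError(f"no path between {c1!r} and {c2!r}")


def simpath(c1, c2):
    """Path-based similarity: the reciprocal of pathlen."""
    return 1 / pathlen(c1, c2)


for a, b in [("nickel", "coin"), ("fund", "budget"), ("nickel", "currency"),
             ("nickel", "money"), ("coinage", "Richter scale")]:
    print(f"simpath({a}, {b}) = {simpath(a, b):.2f}")
```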
Problem with basic path-based similarity
• Assumes each link represents a uniform distance
  • But nickel to money seems to us to be closer than nickel to standard
  • Nodes high in the hierarchy are very abstract
• We instead want a metric that
  • Represents the cost of each edge independently
  • Words connected only through abstract nodes are less similar
Information content similarity metrics
• Let’s define P(c) as:
  • The probability that a randomly selected word in a corpus is an instance of concept c
  • Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
  • for a given concept, each observed noun is either
    • a member of that concept with probability P(c)
    • not a member of that concept with probability 1-P(c)
• All words are members of the root node (Entity)
  • P(root) = 1
• The lower a node in hierarchy, the lower its probability
Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI
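Information content similarity
• Train by counting in a corpus
  • Each instance of hill counts toward frequency of natural elevation, geological formation, entity, etc.
  • Let words(c) be the set of all words that are children (descendants) of node c
    • words(“geo-formation”) = {hill, ridge, grotto, coast, cave, shore, natural elevation}
    • words(“natural elevation”) = {hill, ridge}
$P(c) = \frac{\sum_{w \in \mathrm{words}(c)} \mathrm{count}(w)}{N}$
  • N: total number of word tokens counted in the corpus, so that P(root) = 1
[Figure: hierarchy fragment with nodes entity, …, geological-formation, natural elevation, shore, cave, hill, ridge, grotto, coast]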
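Counting can be sketched as follows: every occurrence of a word adds to its own node and to every node above it. The hierarchy edges and all counts here are hypothetical (only hill and ridge under natural elevation are stated on the slide), and the function name is made up for the example.

```python
# Hypothetical hierarchy fragment (child -> parent). Only hill/ridge under
# "natural elevation" is stated on the slide; the other edges are assumed.
PARENT = {
    "hill": "natural elevation",
    "ridge": "natural elevation",
    "natural elevation": "geological-formation",
    "grotto": "cave",
    "cave": "geological-formation",
    "coast": "shore",
    "shore": "geological-formation",
    "geological-formation": "entity",
}

# Made-up corpus counts for each word, for illustration only.
COUNT = {
    "entity": 50, "geological-formation": 5, "natural elevation": 2,
    "cave": 8, "shore": 6, "hill": 12, "ridge": 4, "grotto": 1, "coast": 12,
}


def concept_probabilities(parent, count):
    """P(c) = (count of c and of every word below c) / N, with N the total count."""
    total = {}
    for word, n in count.items():
        node = word
        while node is not None:  # credit the word and all of its ancestors
            total[node] = total.get(node, 0) + n
            node = parent.get(node)
    n_tokens = sum(count.values())
    return {c: t / n_tokens for c, t in total.items()}


P = concept_probabilities(PARENT, COUNT)
for concept in ["entity", "geological-formation", "natural elevation", "hill"]:
    print(concept, P[concept])
```

With these counts the root gets probability 1 and probabilities shrink on the way down the hierarchy, as the slides describe.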
Information content similarity
• WordNet hierarchy augmented with probabilities P(c)
D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998
Information content: definitions
• Information content: IC(c) = -log P(c)
• Most informative subsumer (lowest common subsumer): LCS(c1,c2) = the most informative (lowest) node in the hierarchy subsuming both c1 and c2
Using information content for similarity: the Resnik method
• The similarity between two words is related to their common information
• The more two words have in common, the more similar they are
• Resnik: measure common information as:
  • The information content of the most informative (lowest) subsumer (MIS/LCS) of the two nodes
  • simresnik(c1,c2) = -log P( LCS(c1,c2) )
Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
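A minimal sketch of the Resnik measure: collect the subsumers of both concepts, pick the common one with the highest information content, and return that IC. The probabilities for hill, coast and geological-formation are the ones used in the Lin example on the slides; the values for natural elevation and shore, and the coast-under-shore edge, are assumptions for illustration.

```python
import math

# Hierarchy fragment (child -> parent); coast under shore is an assumption.
PARENT = {
    "hill": "natural elevation",
    "natural elevation": "geological-formation",
    "coast": "shore",
    "shore": "geological-formation",
    "geological-formation": "entity",
}

P = {
    "entity": 1.0,                     # P(root) = 1
    "geological-formation": 0.00176,   # from the slides
    "natural elevation": 0.000113,     # assumed for illustration
    "shore": 0.0000836,                # assumed for illustration
    "hill": 0.0000189,                 # from the slides
    "coast": 0.0000216,                # from the slides
}


def ic(c):
    """Information content: IC(c) = -log P(c)."""
    return -math.log(P[c])


def subsumers(c):
    """The concept itself plus all of its ancestors."""
    out = []
    while c is not None:
        out.append(c)
        c = PARENT.get(c)
    return out


def lcs(c1, c2):
    """Most informative (lowest) node that subsumes both c1 and c2."""
    common = set(subsumers(c1)) & set(subsumers(c2))
    return max(common, key=ic)


def sim_resnik(c1, c2):
    """Resnik similarity: the information content of the LCS."""
    return ic(lcs(c1, c2))


print(lcs("hill", "coast"))
print(f"{sim_resnik('hill', 'coast'):.2f}")
```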
Dekang Lin method
• Intuition: Similarity between A and B is not just what they have in common
• The more differences between A and B, the less similar they are:
  • Commonality: the more A and B have in common, the more similar they are
  • Difference: the more differences between A and B, the less similar
• Commonality: IC(common(A,B))
• Difference: IC(description(A,B)) - IC(common(A,B))
Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML
Dekang Lin similarity theorem
• The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are
$\mathrm{sim}_{Lin}(A,B) \propto \frac{IC(\mathrm{common}(A,B))}{IC(\mathrm{description}(A,B))}$
• Lin (altering Resnik) defines IC(common(A,B)) as 2 x information of the LCS
$\mathrm{sim}_{Lin}(c_1,c_2) = \frac{2 \log P(\mathrm{LCS}(c_1,c_2))}{\log P(c_1) + \log P(c_2)}$
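Lin similarity function
$\mathrm{sim}_{Lin}(c_1,c_2) = \frac{2 \log P(\mathrm{LCS}(c_1,c_2))}{\log P(c_1) + \log P(c_2)}$
$\mathrm{sim}_{Lin}(\text{hill},\text{coast}) = \frac{2 \log P(\text{geological-formation})}{\log P(\text{hill}) + \log P(\text{coast})} = \frac{2 \ln 0.00176}{\ln 0.0000189 + \ln 0.0000216} = .59$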
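The hill/coast arithmetic can be checked directly. The function below takes the three probabilities as arguments; its name and signature are chosen for this sketch only.

```python
import math


def sim_lin(p_c1, p_c2, p_lcs):
    """Lin similarity: 2 log P(LCS) / (log P(c1) + log P(c2))."""
    return 2 * math.log(p_lcs) / (math.log(p_c1) + math.log(p_c2))


# P(hill), P(coast) and P(geological-formation) as given on the slide.
hill_coast = sim_lin(0.0000189, 0.0000216, 0.00176)
print(round(hill_coast, 2))  # 0.59
```

When both concepts coincide with their LCS the ratio is exactly 1, which is the upper bound of the measure.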
The (extended) Lesk Algorithm
• A thesaurus-based measure that looks at glosses
• Two concepts are similar if their glosses contain similar words
  • Drawing paper: paper that is specially prepared for use in drafting
  • Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
• For each n-word phrase that’s in both glosses
  • Add a score of n^2
  • “paper” and “specially prepared”: 1 + 2^2 = 5
• Compute overlap also for other relations
  • glosses of hypernyms and hyponyms
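The overlap score for one pair of glosses can be sketched by repeatedly finding the longest shared phrase, adding n^2, and blanking it out. This is a simplified illustration: it splits on whitespace, does no stop-word filtering, and covers only a single gloss pair rather than the sum over related synsets. The function names are made up for the example.

```python
def longest_common_phrase(a, b):
    """Return (length, start_a, start_b) of the longest run of words shared by a and b."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # prev[j]: shared run ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # None marks a word already consumed by an earlier match.
            if a[i - 1] is not None and a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def overlap(gloss1, gloss2):
    """Sum of n^2 over shared n-word phrases, longest phrases first."""
    a = gloss1.lower().split()
    b = gloss2.lower().split()
    score = 0
    while True:
        n, i, j = longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        # Blank out the matched words so they are not counted again.
        a[i:i + n] = [None] * n
        b[j:j + n] = [None] * n


drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
print(overlap(drawing_paper, decal))  # 5: "specially prepared" (4) + "paper" (1)
```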
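Summary: thesaurus-based similarity
$\mathrm{sim}_{path}(c_1,c_2) = \frac{1}{\mathrm{pathlen}(c_1,c_2)}$
$\mathrm{sim}_{resnik}(c_1,c_2) = -\log P(\mathrm{LCS}(c_1,c_2))$
$\mathrm{sim}_{lin}(c_1,c_2) = \frac{2 \log P(\mathrm{LCS}(c_1,c_2))}{\log P(c_1) + \log P(c_2)}$
$\mathrm{sim}_{jiangconrath}(c_1,c_2) = \frac{1}{2 \log P(\mathrm{LCS}(c_1,c_2)) - (\log P(c_1) + \log P(c_2))}$
$\mathrm{sim}_{eLesk}(c_1,c_2) = \sum_{r,q \in \mathrm{RELS}} \mathrm{overlap}(\mathrm{gloss}(r(c_1)), \mathrm{gloss}(q(c_2)))$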
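For the Jiang-Conrath measure, a small sketch written in terms of IC(c) = -log P(c), so that the distance IC(c1) + IC(c2) - 2 IC(LCS) is non-negative and the similarity is its reciprocal. The function name and the handling of zero distance are choices made for this example; the probabilities are the hill/coast values from the Lin slide.

```python
import math


def sim_jiang_conrath(p_c1, p_c2, p_lcs):
    """1 / (IC(c1) + IC(c2) - 2 IC(LCS)), with IC(c) = -log P(c)."""
    def ic(p):
        return -math.log(p)

    distance = ic(p_c1) + ic(p_c2) - 2 * ic(p_lcs)
    if distance == 0:
        # Both concepts coincide with their LCS; treated here as infinitely similar.
        return math.inf
    return 1 / distance


# P(hill), P(coast), P(geological-formation) as used in the Lin example.
print(f"{sim_jiang_conrath(0.0000189, 0.0000216, 0.00176):.3f}")
```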
Libraries for computing thesaurus-based similarity
• NLTK
  • http://nltk.github.com/api/nltk.corpus.reader.html?highlight=similarity-nltk.corpus.reader.WordNetCorpusReader.res_similarity
• WordNet::Similarity
  • http://wn-similarity.sourceforge.net/
  • Web-based interface:
    • http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi
Evaluating similarity
• Extrinsic (task-based, end-to-end) Evaluation:
  • Question Answering
  • Spell Checking
  • Essay grading
• Intrinsic Evaluation:
  • Correlation between algorithm and human word similarity ratings
    • Wordsim353: 353 noun pairs rated 0-10. sim(plane, car) = 5.77
  • Taking TOEFL multiple-choice vocabulary tests
    • Levied is closest in meaning to: imposed, believed, requested, correlated
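Intrinsic evaluation by correlation can be sketched with a rank correlation between system scores and human ratings. Spearman's coefficient is used here as one common choice; the slides do not say which correlation is meant. Apart from the 5.77 rating for (plane, car), all numbers below are hypothetical.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Human ratings on a 0-10 scale; only 5.77 (plane, car) is from the slides.
human = [5.77, 9.0, 1.5, 7.2]
# Hypothetical scores from a similarity algorithm for the same four pairs.
system = [0.25, 0.33, 0.10, 0.50]
print(spearman(human, system))
```

A rank correlation only asks whether the algorithm orders the pairs the way people do, so the two score scales do not need to match.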