Post on 11-May-2015
Computational Lexical Semantics
Om Damani, IIT Bombay
Study of Word Meaning
Word Sense Disambiguation
Word Similarity
WordNet Relations
Do we really know the meaning of meaning?
We will just take the dictionary definition as the meaning.
Word Sense Disambiguation (WSD)
WSD Applications: Search, _____, ______
Sense Inventory
Wordnet, Dictionary etc.
Plant in English Wordnet (#senses ??):
Noun Senses:
plant, works, industrial plant (buildings for carrying on industrial labor) "they built a large plant to manufacture automobiles"
plant, flora, plant life ((botany) a living organism lacking the power of locomotion)
plant (an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience)
plant (something planted secretly for discovery by another) "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant"
Sense Inventory ..
Plant (Verb Senses):
plant, set (put or set (seeds, seedlings, or plants) into the ground) "Let's plant flowers in the garden"
implant, engraft, embed, imbed, plant (fix or set securely or deeply) "He planted a knee in the back of his opponent"; "The dentist implanted a tooth in the gum"
establish, found, plant, constitute, institute (set up or lay the groundwork for) "establish a new department"
plant (place into a river) "plant fish"
plant (place something or someone in a certain position in order to secretly observe or deceive) "Plant a spy in Moscow"; "plant bugs in the dissident's apartment"
plant, implant (put firmly in the mind) "Plant a thought in the students' minds"
How many Senses of सच्चा (sachchaa)?
Noun:
satyavaadii, sachchaa, satyabhaashii, satyavaktaa (one who speaks the truth) "Even in modern society there is no shortage of truthful people"; "Because he is a realist, many people have become Shyam's enemies"
Adjective (6):
satyavaadii, sachchaa, satyabhaashii, satyavaktaa (who speaks the truth) "Yudhishthira was a truthful person"
iimaandaar, chhalahiin, nishkapat, riju, sachchaa, and others (having good intentions at heart; not stealing or deceiving) "An honest person deserves respect"
vaastavik, yathaarth, sachchaa, sahii, asalii, and others (which really exists or has really happened, or is exactly right) "I have just heard an unbelievable but real incident"
sachchaa, asalii (not false or artificial) "He is a true son of Mother India"
kharaa, chokhaa, sachchaa (based on honesty, impartiality, justice, etc.) "We should make a fair deal"
kharaa, sachchaa, siidhaa (without any excuse or compromise, i.e. direct) "He is not as straightforward as he appears"
How do you know these are different senses?
Hint: think translation
How many Senses of आदमी (aadmii)?
aadmii, purush, mard, nar (a human being of the male sex) "The physical structures of man and woman are different"
maanav, aadmii, insaan, manushya, nar, and others (the two-legged creature that is superior to all other creatures because of its intellect, and that includes you, me and everyone) "Man is superior to all creatures because of his intellect"
vyakti, aadmii, shakhs, jan, bandaa, and others (any one member of the human race or of a group) "Only two people can sit in this car"
naukar, sevak, daas, anuchar, khaadim, mulaazim, aadmii, and others (one who serves) "My man has gone home for a week"
pati, mard, shauhar, gharwaalaa, miyaan, aadmii, khasam, swaamii, and others (a woman's married partner, from her point of view) "Sheela's husband supports the family by farming"
How do you know these are different senses?
Hint: think translation
WSD: Problem Statement
Given a string of words (sentence, phrase, set of key-words), and a set of senses for each word, decide the appropriate sense for each word.
Example: Translate 'Where can I get spare parts for textile plant?' to Hindi
Solution: ??
Solution Approaches
The solution depends on what resources you have:
Definition, gloss
Topic/category label for each sense definition
Selectional preference for each sense
Sense-marked corpora
Parallel sense-marked corpora
Combinatorial Explosion Problem
I saw a man who is 98 years old and can still walk and tell jokes
See(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3)
43,929,600 sense combinations
Solution: Viterbi ??
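The count above is just the product of the per-word sense counts. A minimal check, using only the numbers on the slide:

```python
import math

# Sense counts per content word, copied from the slide.
sense_counts = {
    "see": 26, "man": 11, "year": 4, "old": 8, "can": 5,
    "still": 4, "walk": 10, "tell": 8, "joke": 3,
}

# Every word picks one sense independently, so the counts multiply.
total = math.prod(sense_counts.values())
print(f"{total:,}")
```

Enumerating all of these assignments is impractical, which is why the slide asks about a dynamic-programming search such as Viterbi.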
Dictionary-Based WSD
The bank did not give loan to him though he offered to mortgage his boat.
bank (sense 1)
  Gloss: a financial institution that accepts deposits and gives loan
  Example: "he cashed a check at the bank", "that bank holds the mortgage on my home"
bank (sense 2)
  Gloss: the slope beside a body of water
  Example: "they pulled the boat up on the bank", "he watched the currents from the river bank"
How to improve LESK further?
Give an example where the algorithm fails, say for bank:
"The bank did not give loan to him though he offered his boat as collateral."
Problem: collateral is related to the bank, but the relation does not come out clearly.
Solution: See if the definition of bank and the definition of collateral share a term.
Collateral: security pledged for loan repayment
Problem: Can you give an example where the new algorithm fails too?
LESK Algorithm
Function Lesk (word, sentence) returns best sense of word
  context := set of words in sentence;
  for each sense in senses of word do
    sense.signature := GetSignature (sense);
    sense.relevance := ComputeRelevance (sense.signature, context);
  end
  best-sense := MaxRelevantSense ();
  if (best-sense.relevance == 0)
    best-sense := GetDefaultSense (word);
  return best-sense;
GetSignature (sense): get all words in example and gloss of sense
ComputeRelevance (signature, context): number of common words
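The pseudocode can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the sense labels (bank#finance, bank#river), the stopword list, and the decision to drop the target word from both sets are assumptions; the glosses and examples are the ones from the bank slides.

```python
import re

# Assumed stopword list, just large enough for the slide examples.
STOPWORDS = {"a", "an", "the", "to", "of", "on", "at", "up", "as", "and", "that",
             "he", "his", "him", "my", "they", "from", "did", "not", "though"}

# Gloss plus examples per sense, copied from the slides (labels are hypothetical).
SENSES = {
    "bank": {
        "bank#finance": "a financial institution that accepts deposits and gives loan "
                        "he cashed a check at the bank "
                        "that bank holds the mortgage on my home",
        "bank#river": "the slope beside a body of water "
                      "they pulled the boat up on the bank "
                      "he watched the currents from the river bank",
    }
}

def content_words(text, exclude=()):
    """Lowercased word set minus stopwords and any excluded words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words - STOPWORDS - set(exclude)

def lesk(word, sentence, default=None):
    """Return (best sense, per-sense overlap scores) for word in sentence."""
    context = content_words(sentence, exclude=[word])
    scores = {}
    for sense, text in SENSES[word].items():
        signature = content_words(text, exclude=[word])      # GetSignature
        scores[sense] = len(signature & context)             # ComputeRelevance
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return default, scores                               # GetDefaultSense
    return best, scores

best1, scores1 = lesk("bank", "The bank did not give loan to him though "
                              "he offered to mortgage his boat.")
best2, scores2 = lesk("bank", "The bank did not give loan to him though "
                              "he offered his boat as collateral.")
print(best1, scores1)
print(scores2)
```

For the mortgage sentence the finance sense overlaps on "loan" and "mortgage" while the river sense overlaps only on "boat". For the collateral sentence both senses score 1, so exact word overlap alone cannot separate them.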
GetSignature ( sense )
All words in example and gloss of sense
All words in gloss of sense
All words in gloss of all words in the gloss of the given sense
All words in gloss of all words in gloss of all words in gloss
...
Problem: Including the right sense of each word in the gloss needs WSD
Including all senses of all words in the gloss will lead to sense drift
Possible Solution: All context words in a sense-marked corpus
Ideal Signature
For each word, get a vector over all the words in the language
Work with a |V| x |V| matrix
Iterate over it till it converges
ComputeRelevance ( signature, context )
Number of common words: favors longer definitions
| Set-Intersection | / | Set-Union |
Define relevance between two words
  Synonyms
  Specialization and generalization have to be accounted for: canoe and boat
Sum of relevance between all word pairs
Weigh different terms differently, maybe based on TF-IDF score
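A small sketch of the first two options, the raw overlap count and the intersection-over-union ratio. The word sets below are made up for illustration and do not come from any dictionary.

```python
def overlap_count(signature, context):
    """Number of common words; grows with signature size."""
    return len(signature & context)

def jaccard(signature, context):
    """|intersection| / |union|; normalises for signature size."""
    union = signature | context
    return len(signature & context) / len(union) if union else 0.0

# Hypothetical data: a short signature and a much longer one.
context = {"loan", "mortgage", "boat"}
short_sig = {"loan", "deposit"}
long_sig = {"loan", "boat", "river", "water", "slope", "current", "shore", "edge"}

for name, sig in [("short", short_sig), ("long", long_sig)]:
    print(name, overlap_count(sig, context), round(jaccard(sig, context), 3))
```

The raw count prefers the long signature (2 vs 1), while the ratio prefers the short one (0.25 vs about 0.222), which is the "favors longer definitions" effect in miniature.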
GetDefaultSense ( word )
The most frequent sense
The most frequent sense in a given domain
The most frequent sense as per the topic of the document
Power of the LESK Schema
Signature can even be a topic/domain code: finance, poetry, geo-physics
All variations of ComputeRelevance function are still applicable
Possible Improvements
LESK gives equal weightage to all senses; the 'right' sense should be given more weight
Iterative fashion: one at a time, most certain first
PageRank-like algorithm
Give more weightage to gloss than to example
Page-Rank-LESK
Fundamental Limitation of Dictionary Based Methods
Depends too much on the exact word
Another dictionary may use a different gloss and example
Use the context words from a tagged corpus as signature
Supervised Learning
Lesk-like methods depend too much on the exact word
Another dictionary may use a different gloss and example
Use a sense-tagged corpus
Employ a machine learning algorithm
Supervised Learning
Machine can only learn what we ask it to
Collocation feature
  Relative position (2 words to the left)
  Words and POS
  "An electric guitar and bass player stand off to one side, not really part of the scene, ..."
  $[w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}]$
  [guitar, NN, and, CC, player, NN, stand, VB]
Bag-of-words feature
  [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
  [0,0,0,1,0,0,0,0,0,0,1,0]
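Both feature types can be extracted with a few lines of Python. This is a sketch under assumptions: the sentence is supplied already tokenised and POS-tagged by hand (the tags for words outside the slide's window are my own guesses), and the bag-of-words window is taken to be the whole sentence fragment.

```python
def collocation_features(tagged, i, window=2):
    """Words and POS tags at fixed offsets around position i (target excluded)."""
    feats = []
    for offset in list(range(-window, 0)) + list(range(1, window + 1)):
        j = i + offset
        word, pos = tagged[j] if 0 <= j < len(tagged) else ("<pad>", "<pad>")
        feats += [word, pos]
    return feats

def bag_of_words(tokens, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs anywhere in tokens."""
    present = set(tokens)
    return [1 if v in present else 0 for v in vocabulary]

# Hand-tagged fragment of the slide sentence (tags are assumptions).
tagged = [("an", "DT"), ("electric", "JJ"), ("guitar", "NN"), ("and", "CC"),
          ("bass", "NN"), ("player", "NN"), ("stand", "VB"), ("off", "RB"),
          ("to", "TO"), ("one", "CD"), ("side", "NN")]
vocabulary = ["fishing", "big", "sound", "player", "fly", "rod",
              "pound", "double", "runs", "playing", "guitar", "band"]

target = [w for w, _ in tagged].index("bass")
colloc = collocation_features(tagged, target)
bow = bag_of_words([w for w, _ in tagged], vocabulary)
print(colloc)
print(bow)
```

The two printed lists reproduce the collocation vector and the bag-of-words vector shown on the slide.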
Still the data sparsity problem remains
A simple binary bag-of-words vector defined over a vocabulary of 20 words would have --- possible feature vectors.
Assumption: features are conditionally independent given the word sense
Naïve Bayes Classifier
Computing Naïve Bayes Probabilities
If a collocational feature such as $[w_{i-2} = \text{guitar}]$ occurred 3 times for sense bass1, and sense bass1 itself occurred 60 times in training, the MLE estimate is $P(f_j \mid s) = 3/60 = 0.05$.
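A minimal sketch of how these estimates feed a Naïve Bayes decision. The training set is fabricated so that the counts match the example (3 of 60 bass1 instances carry the guitar feature); the other feature names, the bass2 counts, and the add-alpha smoothing in classify are assumptions added for illustration.

```python
import math
from collections import Counter, defaultdict

def train(instances):
    """instances: list of (features, sense). Returns sense and feature counts."""
    sense_count = Counter()
    feat_count = defaultdict(Counter)
    for feats, sense in instances:
        sense_count[sense] += 1
        for f in set(feats):
            feat_count[sense][f] += 1
    return sense_count, feat_count

def p_feature(f, sense, sense_count, feat_count):
    """Unsmoothed MLE: count(f, s) / count(s)."""
    return feat_count[sense][f] / sense_count[sense]

def classify(feats, sense_count, feat_count, alpha=1.0):
    """argmax_s log P(s) + sum_j log P(f_j | s), with add-alpha smoothing (assumed)."""
    total = sum(sense_count.values())
    best, best_lp = None, -math.inf
    for s, n in sense_count.items():
        lp = math.log(n / total)
        for f in feats:
            lp += math.log((feat_count[s][f] + alpha) / (n + 2 * alpha))
        if lp > best_lp:
            best, best_lp = s, lp
    return best

# Fabricated training data that reproduces the 3-out-of-60 count.
data = ([(["w-2=guitar"], "bass1")] * 3 + [(["w-2=striped"], "bass1")] * 57 +
        [(["w-2=guitar"], "bass2")] * 30 + [(["w-2=the"], "bass2")] * 10)
sense_count, feat_count = train(data)
mle = p_feature("w-2=guitar", "bass1", sense_count, feat_count)
print(mle, classify(["w-2=guitar"], sense_count, feat_count))
```

Smoothing is used only in classify so that an unseen feature does not zero out a sense; the MLE figure itself is left unsmoothed, as in the example.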
It's hard for humans to examine Naïve Bayes's workings and understand its decisions. Hence use decision lists.
Decision Lists
Rule ⇒ Sense
fish within window ⇒ bass1
striped bass ⇒ bass1
guitar within window ⇒ bass2
bass player ⇒ bass2
piano within window ⇒ bass2
tenor within window ⇒ bass2
sea bass ⇒ bass1
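Applying a decision list means testing the rules in order and returning the sense of the first one that fires. A sketch, assuming a window of 5 tokens on each side and exact token matching (neither is specified on the slide):

```python
# (kind, word, sense) in list order; "left"/"right" mean immediately adjacent.
RULES = [
    ("window", "fish", "bass1"),
    ("left", "striped", "bass1"),
    ("window", "guitar", "bass2"),
    ("right", "player", "bass2"),
    ("window", "piano", "bass2"),
    ("window", "tenor", "bass2"),
    ("left", "sea", "bass1"),
]

def matches(kind, word, tokens, i, window=5):
    """Does the rule fire for the target at position i?"""
    if kind == "left":
        return i > 0 and tokens[i - 1] == word
    if kind == "right":
        return i + 1 < len(tokens) and tokens[i + 1] == word
    lo = max(0, i - window)
    return word in tokens[lo:i] + tokens[i + 1:i + 1 + window]

def classify(tokens, target="bass", default=None):
    """Return the sense of the first matching rule, else the default."""
    i = tokens.index(target)
    for kind, word, sense in RULES:
        if matches(kind, word, tokens, i):
            return sense
    return default

s1 = "an electric guitar and bass player stand off to one side".split()
s2 = "fishermen decided the striped bass in lake mead were too skinny".split()
print(classify(s1), classify(s2))
```

Because matching is exact, "fishermen" does not trigger the "fish" rule in the second sentence; the "striped bass" rule decides it instead.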
How to Create Decision Lists
Which feature has the most discriminating power?
Seems the same as max P(Sense | f)
Need decision trees
Selectional Restrictions and Selectional Preferences
"In our house, everybody has a career and none of them includes washing dishes," he says.
In her kitchen, Ms. Chen works efficiently, cooking several simple dishes.
Wash [+WASHABLE], cook [+EDIBLE]
Used more often for elimination than selection
Problem: Gold-rush fell apart in 1931, perhaps because people realized you can't eat gold for lunch if you're hungry.
Solution: Use these preferences as features/probabilities
Selectional Preference Strength
eat ⇒ edible. be ⇒ ??
Strength: P(c) vs P(c|v)
Kullback-Leibler divergence (relative entropy)
Selectional association: contribution of that class to the general selectional preference of the verb
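One common way to make these two quantities concrete is Resnik's formulation: strength S(v) is the KL divergence between P(c|v) and P(c), and the association of a class c is its term in that sum divided by S(v). The sketch below assumes that formulation; the class names and probabilities are invented for illustration.

```python
import math

def preference_strength(p_c, p_c_given_v):
    """S(v) = sum_c P(c|v) * log2(P(c|v) / P(c)), the KL divergence in bits."""
    return sum(p * math.log2(p / p_c[c]) for c, p in p_c_given_v.items() if p > 0)

def selectional_association(c, p_c, p_c_given_v):
    """Share of S(v) contributed by class c (0 if the verb has no preference)."""
    s = preference_strength(p_c, p_c_given_v)
    p = p_c_given_v.get(c, 0.0)
    if s == 0 or p == 0:
        return 0.0
    return p * math.log2(p / p_c[c]) / s

# Hypothetical distributions over three argument classes.
p_c = {"FOOD": 0.2, "PERSON": 0.5, "ARTIFACT": 0.3}
p_eat = {"FOOD": 0.9, "PERSON": 0.05, "ARTIFACT": 0.05}   # picky verb
p_be = dict(p_c)                                          # no preference at all

s_eat = preference_strength(p_c, p_eat)
s_be = preference_strength(p_c, p_be)
assoc_eat = {c: selectional_association(c, p_c, p_eat) for c in p_c}
print(round(s_eat, 3), s_be, assoc_eat)
```

A verb like "be", whose arguments follow the prior, gets strength 0, while "eat" gets a large strength dominated by the FOOD class; classes the verb disprefers come out with negative association.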
Selection Association
a probabilistic measure of the strength of association between a predicate and a class dominating the argument to the predicate
Verb   Semantic Class  Assoc  Semantic Class  Assoc
read   WRITING         6.80   ACTIVITY        -0.20
write  WRITING         7.26   COMMERCE         0
see    ENTITY          5.79   METHOD          -0.01
How do we use Selection Association for WSD?
Use as a relatedness model
Select the sense with the highest selectional association between one of its ancestor hypernyms and the predicate.
Minimally Supervised WSD
Supervised: needs sense-tagged corpora
Dictionary based: needs large examples and glosses
Supervised approaches do better but are much more expensive
Can we get the best of both worlds?
Bootstrapping
Seed-set L0 of labeled instances, a much larger unlabeled corpus V0
Train a decision-list classifier on seed-set L0
Use this classifier to label the corpus V0
Add to the training set the examples in V0 that it is confident about
Iterate { retrain decision-list classifier }
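The loop can be sketched as below. This is a toy illustration under stated assumptions: instances are bags of content words, each word becomes a rule scored by a smoothed log-likelihood ratio between the two senses, and an instance counts as "confident" when its strongest rule scores at least 1.0. The data, the smoothing constant and the threshold are all invented.

```python
import math
from collections import Counter, defaultdict

def train_decision_list(labeled, alpha=0.1):
    """Map each word to (preferred sense, |log-likelihood ratio|)."""
    counts = defaultdict(Counter)
    for words, sense in labeled:
        for w in set(words):
            counts[w][sense] += 1
    rules = {}
    for w, c in counts.items():
        a, b = c["bass1"] + alpha, c["bass2"] + alpha
        rules[w] = ("bass1" if a > b else "bass2", abs(math.log(a / b)))
    return rules

def classify(words, rules):
    """Apply the single strongest matching rule; (None, 0.0) if none fires."""
    best = (None, 0.0)
    for w in set(words):
        if w in rules and rules[w][1] > best[1]:
            best = rules[w]
    return best

def bootstrap(seed, unlabeled, threshold=1.0, max_iter=10):
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_iter):
        rules = train_decision_list(labeled)          # retrain on current labels
        confident, rest = [], []
        for words in pool:
            sense, score = classify(words, rules)
            (confident if score >= threshold else rest).append((words, sense))
        if not confident:
            break
        labeled += confident                           # grow the training set
        pool = [words for words, _ in rest]
    return labeled, pool

seed = [(["play", "bass", "guitar"], "bass2"), (["fish", "bass", "lake"], "bass1")]
unlabeled = [["striped", "bass", "lake"], ["bass", "player", "guitar"],
             ["striped", "bass", "caught"]]
labeled, pool = bootstrap(seed, unlabeled)
print(labeled[2:], pool)
```

In the first round only the instances sharing "lake" or "guitar" with the seeds are labeled; "striped" becomes a bass1 rule as a result, which lets the third instance be labeled in the second round.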
Bootstrapping Success Depends On
Choosing the initial seed-set
  One sense per collocation
  One sense per discourse
  Samples of bass sentences extracted from the WSJ using the simple correlates play and fish:
  "We need more good teachers – right now, there are only a half a dozen who can play the free bass with ease."
  "And it all started when fishermen decided the striped bass in Lake Mead were too skinny."
Choosing the 'confidence' criterion
WSD: Summary
It is a hard problem
In part because it is not a well-defined problem
Or it cannot be well-defined
Because making sense of 'Sense' is hard
Hindi Wordnet
Wordnet: a lexical database
Inspired by the English WordNet
Built conceptually
Synset (synonym set) is the basic building block.
Entry in Hindi Wordnet
Synset
{गाय, गऊ, गैया, धेनु} {gaaya, gauu, gaiyaa, dhenu}, Cow
Gloss (text definition)
सींगवाला एक शाकाहारी मादा चौपाया (siingwaalaa eka shaakaahaarii maadaa choupaayaa)
(a horned, herbivorous, four-legged female animal)
Example sentence
हिन्दू लोग गाय को गो माता कहते हैं एवं उसकी पूजा करते हैं। (hinduu loga gaaya ko go maataa kahate hain evam usakii puujaa karate hain)
(Hindus consider the cow a mother and worship it.)
Subgraph for Noun
[Figure: relations around the synset गाय, गऊ (gaaya, gauu), Cow]
Nodes:
चौपाया, पशु (chaupaayaa, pashu): four-legged animal
सींगवाला एक शाकाहारी मादा चौपाया (siingwaalaa eka shaakaahaarii maadaa choupaayaa): a horned, herbivorous, four-legged female animal
पगुराना (paguraanaa): ruminate
बैल (baila): ox
कामधेनु (kaamadhenu): a kind of cow
मैनी गाय (mainii gaaya): a kind of cow
थन (thana): udder
पूँछ (puunchh): tail
शाकाहारी (shaakaahaarii): herbivorous
Edge labels: Hypernym, Attribute, Hyponym, Gloss, Ability Verb, Meronym, Antonym
Subgraph for Verb
[Figure: relations around the synset रोना, रुदन करना (ronaa, rudan karanaa), to weep]
Nodes:
भावाभिव्यक्ति करना (bhaavaabhivyakti karanaa): to express
हँसना (hansanaa): to laugh
आँसू बहाना (aansuu bahaanaa): to weep
सिसकना (sisakanaa): to sob
रुलाना (rulaanaa): to make cry
Edge labels: Hypernym, Antonym, Gloss, Troponym, Causative Verb, Entailment
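Marathi Wordnet (Noun)
[Figure: relations around the synset झाड, वृक्ष, तरू (tree)]
Nodes: खोड (trunk), रान (forest), बाग (garden), आंबा (mango), लिंबू (lemon), मूळ (root), वनस्पती (plant)
Gloss: मुळे, खोड, फांद्या, पाने इत्यादींनी युक्त असा वनस्पतिविशेष: "झाडे पर्यावरण शुद्ध करण्याचे काम करतात" (a kind of plant having roots, a trunk, branches, leaves, etc.: "Trees do the work of purifying the environment")
Edge labels: MERONYMY, HOLONYMY, HYPERNYMY, HYPONYMY, GLOSS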
Word Similarity
In Lesk and other algorithms, we need to measure how related two words are
Simplest measure: pathlen(c1, c2) = number of edges in the shortest path between sense nodes c1 and c2
$\mathrm{sim}(c_1, c_2) = -\log \mathrm{pathlen}(c_1, c_2)$
$\mathrm{wordsim}(w_1, w_2) = \max_{c_1 \in \mathrm{senses}(w_1),\ c_2 \in \mathrm{senses}(w_2)} \mathrm{sim}(c_1, c_2)$
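A sketch of both formulas over a tiny hand-made hierarchy. The graph, its node names, and the sense inventory for the two toy words are assumptions for illustration, not WordNet data.

```python
import math
from collections import deque

# Hypothetical fragment of a hypernym hierarchy, stored as undirected edges.
EDGES = [("standard", "medium_of_exchange"), ("medium_of_exchange", "currency"),
         ("medium_of_exchange", "money"), ("currency", "coinage"),
         ("coinage", "coin"), ("coin", "nickel"), ("coin", "dime"),
         ("money", "fund"), ("fund", "budget")]

GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

def pathlen(c1, c2):
    """Number of edges on the shortest path between two sense nodes (BFS)."""
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in GRAPH[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    raise ValueError("no path between nodes")

def sim(c1, c2):
    """-log pathlen; undefined when c1 == c2, since pathlen is then 0."""
    return -math.log(pathlen(c1, c2))

def wordsim(w1, w2, senses):
    """Best similarity over all sense pairs of the two words."""
    return max(sim(c1, c2) for c1 in senses[w1] for c2 in senses[w2])

# Hypothetical sense inventory for two ambiguous words.
senses = {"change": ["coin", "budget"], "cash": ["money"]}
print(pathlen("nickel", "dime"), pathlen("nickel", "money"))
print(wordsim("change", "cash", senses))
```

With this definition, identical nodes have path length 0 and the logarithm is undefined, so a practical version would need to special-case that or add 1 to the path length.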
Path Length: Limitations
Not all edges are equal
Compare medium of exchange and standard with coin and nickel
Need a distance measure on edges
Information Content Word Similarity
LCS(c1, c2) = the lowest common subsumer, i.e., the lowest node in the hierarchy that subsumes (is a hypernym of) both c1 and c2
$\mathrm{sim}(c_1, c_2) = -\log P(\mathrm{LCS}(c_1, c_2))$
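A sketch of this measure on a toy hierarchy. The parent links and the concept probabilities are invented; in practice P(c) would be estimated from corpus counts, with each concept's count including those of its descendants.

```python
import math

# Hypothetical hypernym links (child -> parent); the root has no parent.
PARENT = {"nickel": "coin", "dime": "coin", "coin": "coinage",
          "coinage": "currency", "currency": "medium_of_exchange",
          "money": "medium_of_exchange", "medium_of_exchange": "standard",
          "standard": None}

# Hypothetical probabilities; they must not decrease going up the hierarchy.
P = {"standard": 1.0, "medium_of_exchange": 0.5, "money": 0.25,
     "currency": 0.2, "coinage": 0.1, "coin": 0.05, "nickel": 0.01, "dime": 0.01}

def ancestors(c):
    """The concept itself followed by its hypernyms up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def lcs(c1, c2):
    """Lowest node that subsumes both concepts."""
    up1 = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in up1)

def sim_ic(c1, c2):
    """Information content of the lowest common subsumer."""
    return -math.log(P[lcs(c1, c2)])

print(lcs("nickel", "dime"), round(sim_ic("nickel", "dime"), 3))
print(lcs("nickel", "money"), round(sim_ic("nickel", "money"), 3))
print(round(sim_ic("coin", "coin"), 3), round(sim_ic("nickel", "nickel"), 3))
```

Note that the score a concept gets with itself depends on how rare the concept is, so there is no single value that means "identical".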
IC Similarity: Limitations
A concept is not similar to itself using the previous definition
Word similarity is not about information content. It is about commonality vs differences.
Overlap based Similarity
Previous methods may not work for words belonging to different classes: car and petrol
similarity(A,B) = overlap(gloss(A), gloss(B))
                + overlap(gloss(hypo(A)), gloss(hypo(B)))
                + overlap(gloss(A), gloss(hypo(B)))
                + overlap(gloss(hypo(A)), gloss(B))
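The four-term sum can be sketched as follows. The glosses here are short invented stand-ins, not dictionary text, and overlap is taken to be a plain count of shared content words; scoring schemes that reward multi-word phrase matches are not modelled.

```python
STOP = {"a", "the", "of", "for", "with", "that"}

# Hypothetical glosses for the two words and for one hyponym of each.
GLOSS = {
    "car": "motor vehicle with engine that burns fuel",
    "petrol": "liquid fuel for engine of motor vehicle",
}
HYPO_GLOSS = {
    "car": "car for hire with driver",
    "petrol": "petrol with lead additive",
}

def overlap(g1, g2):
    """Number of content words shared by two gloss strings."""
    return len((set(g1.split()) & set(g2.split())) - STOP)

def similarity(a, b):
    """Sum of the four gloss/hyponym-gloss overlaps."""
    return (overlap(GLOSS[a], GLOSS[b])
            + overlap(HYPO_GLOSS[a], HYPO_GLOSS[b])
            + overlap(GLOSS[a], HYPO_GLOSS[b])
            + overlap(HYPO_GLOSS[a], GLOSS[b]))

score = similarity("car", "petrol")
print(score)
```

Car and petrol sit in different parts of a hypernym hierarchy, yet their glosses share words such as "engine" and "fuel", which is what this measure picks up.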
WORD SIMILARITY: DISTRIBUTIONAL METHODS
Pointwise mutual information
Similarity using Feature Vectors
Cosine Distance
Dot product favors long vectors
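A short sketch of why the raw dot product favors long vectors and how the cosine corrects for it. The three vectors are arbitrary illustrative counts.

```python
import math

def dot(u, v):
    """Plain dot product; grows with the magnitude of either vector."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product divided by both vector lengths; depends only on direction."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [1, 2, 0]
v = [2, 4, 0]   # same direction as u, twice as long
w = [0, 0, 3]   # orthogonal to u

print(dot(u, u), dot(u, v))          # doubling the vector doubles the score
print(cosine(u, v), cosine(u, w))    # direction is all that matters
```

A frequent word has large counts everywhere, so its dot product with anything is large; normalising by length removes that frequency effect.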
Conclusion
A lot of care is needed in defining similarity measures
Impressive results can be obtained once similarity is carefully defined
Backup
OrgLESK: Taking Signature of Context Words into Account for Relatedness
Disambiguating "pine cone"
Neither 'pine' nor 'cone' appears in the other's definitions
pine
  1 an evergreen tree with needle-shaped leaves and solid wood
  2 waste away through sorrow or illness
cone
  1 solid body which narrows to a point
  2 something of this shape whether solid or hollow
  3 fruit of certain evergreen trees
Does the Improvement Really Work?
Problem: Collateral has not one but many senses:
Noun: collateral (a security pledged for the repayment of a loan)
Adjective:
S: (adj) collateral, indirect (descended from a common ancestor but through different lines) "cousins are collateral relatives"; "an indirect descendant of the Stuarts"
S: (adj) collateral, confirmative, confirming, confirmatory, corroborative, corroboratory, substantiating, substantiative, validating, validatory, verificatory, verifying (serving to support or corroborate) "collateral evidence"
S: (adj) collateral (accompany, concomitant) "collateral target damage from a bombing run"
S: (adj) collateral (situated or running side by side) "collateral ridges of mountains"
Solution ??