TOPIC 7: WORD SENSE DISAMBIGUATION (WSD)
NATURAL LANGUAGE PROCESSING (NLP)
CS-724
Wondwossen Mulugeta (PhD) email: [email protected]
Topics
Topic 7: Word Sense Disambiguation (WSD)
Subtopics: Practical Problem, Word Sense Relations, Disambiguation Approaches, Knowledge-Based WSD, Machine-Readable Dictionary, Lesk Algorithm, Selectional Preference for WSD
Sense Inventory
What is a “sense” of a word?
Homonyms (disconnected meanings)
duck: to move downwards quickly
duck: a small web-footed bird
Polysemes (related meanings with a joint origin in the word's history)
bank: a financial institution as a corporation
bank: a building housing such an institution
Where can we find the senses of a word?
Dictionaries
Lexical databases
Word Senses
Words can mean many things…
Word sense is the intended meaning of a word in a given context.
With respect to a dictionary
chair = a seat for one person, with a support for the back;
chair = the position of professor;
With respect to the translation in a second language
chair = ወንበር (a seat)
chair = ሊቀመንበር (a chairperson)
With respect to the context where it occurs (discrimination)
“Sit on a chair” “Take a seat on this chair”
“The chair of the Math Department” “The chair of the meeting”
Lexical Ambiguity
Most words in natural languages have multiple possible meanings.
“pen” (noun)
The dog is in the pen.
The ink is in the pen.
“take” (verb)
Take one pill every morning.
Take the first right past the stoplight.
Syntax helps distinguish meanings for different parts of speech of an ambiguous word.
“conduct” (noun or verb)
John’s conduct in class is unacceptable. (conduct as noun)
John will conduct the orchestra on Thursday. (conduct as verb)
Conceptual Model of WSD
Sense knowledge can either be lexical knowledge defined in dictionaries, or world knowledge learned from training corpora.
Knowledge Source for WSD
Lexical Knowledge
Lexical knowledge is usually drawn from a dictionary. It can be either symbolic or empirical. It is the foundation of unsupervised WSD approaches.
Learned World Knowledge
World knowledge is too complex to be verbalized completely, so a practical strategy is to acquire it automatically, on demand, from the context of training corpora using machine learning techniques.
Trend (Hybrid)
Use the interaction of multiple knowledge sources to approach WSD.
Motivation for Word Sense Disambiguation (WSD)
Understanding the exact meaning of words is important in many NLP application areas:
Question Answering
Information Retrieval
Machine Translation
Text Mining
Phone Help Systems
The issue is understanding how people do the disambiguation task.
How do you do it as a native speaker of any of the Ethiopian languages?
Approaches to WSD
Knowledge-Based Disambiguation
uses external lexical resources such as dictionaries and thesauri
exploits discourse properties (relationships between words)
Supervised Disambiguation
based on a labeled training set
the learning system has a training set of feature-encoded inputs AND their appropriate sense labels (categories)
Unsupervised Disambiguation
based on unlabeled corpora
the learning system has a training set of feature-encoded inputs BUT NOT their appropriate sense labels (categories)
WordNet
A detailed database of semantic relationships between words.
This can be built for any language
There are WordNets for English, French, German, and many more languages.
The English WordNet was developed by cognitive psychologist George Miller and a team at Princeton University.
About 144,000 English words.
Nouns, adjectives, verbs, and adverbs grouped into about 109,000 synonym sets called synsets.
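To make the sense inventory concrete, here is a minimal sketch that lists the senses of “chair” through NLTK's WordNet interface (assuming NLTK is installed and its WordNet data has been downloaded):

```python
# Minimal sketch: listing the WordNet senses of "chair" with NLTK.
# Assumes: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("chair"):
    # Each synset is one sense: a set of synonymous lemmas plus a gloss.
    print(synset.name(), "-", synset.definition())
```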
WordNet Synset Relationships
Antonym: front ↔ back
Attribute: benevolence ↔ good (noun to adjective)
Pertainym: alphabetical → alphabet (adjective to noun)
Similar: unquestioning ↔ absolute
Cause: kill → die
Holonym: chapter → text (part to whole)
Meronym: computer → cpu (whole to part)
Hyponym: plant → tree (specialization)
Hypernym: apple → fruit (generalization)
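These relations can be traversed programmatically; a minimal sketch, again assuming NLTK's WordNet data (the printed examples are typical outputs, not guarantees):

```python
# Minimal sketch: navigating synset relations in NLTK's WordNet.
from nltk.corpus import wordnet as wn

apple = wn.synset("apple.n.01")
print(apple.hypernyms())       # generalizations of apple, e.g. edible fruit

tree = wn.synset("tree.n.01")
print(tree.part_meronyms())    # parts of a tree, e.g. trunk, limb

good = wn.lemma("good.a.01.good")
print(good.antonyms())         # the opposite lemma, e.g. bad
```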
Learning for WSD
Assume the part-of-speech (POS) of the target word, e.g. noun, verb, or adjective, has already been determined.
Treat WSD as a classification problem whose categories are the potential senses of the target word given its POS.
Encode context using a set of features to be used for disambiguation.
Train a classifier on labeled data encoded using these features.
Use the trained classifier to disambiguate future instances of the target word given their contextual features.
What could be a Contextual Feature?
Contextual Features
1. Surrounding bag of words.
2. POS of neighboring words
3. Local collocations
4. Syntactic relations
Experimental evaluations indicate that all of these features are useful, and the best results come from integrating all of these cues in the disambiguation process.
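As a concrete illustration of this pipeline, here is a minimal sketch of a supervised classifier for the noun “pen”; the labeled examples and the feature template are toy assumptions, not data from the course:

```python
# Minimal sketch: supervised WSD as classification with scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(tokens, i):
    """Encode the context of the ambiguous token tokens[i]."""
    feats = {"bow=" + w: 1 for j, w in enumerate(tokens) if j != i}  # bag of words
    if i > 0:
        feats["prev=" + tokens[i - 1]] = 1   # local collocation C-1,-1
    if i + 1 < len(tokens):
        feats["next=" + tokens[i + 1]] = 1   # local collocation C1,1
    return feats

# Toy training set: (tokens, index of "pen", sense label).
train = [("the dog is in the pen".split(), 5, "enclosure"),
         ("the sheep escaped from the pen".split(), 5, "enclosure"),
         ("the ink is in the pen".split(), 5, "writing"),
         ("she signed it with a pen".split(), 5, "writing")]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit([features(t, i) for t, i, _ in train], [s for _, _, s in train])

test = "the dog sleeps in the pen".split()
print(clf.predict([features(test, 5)]))  # likely 'enclosure': shares "dog"
```

In practice the feature dictionary would combine all four feature types detailed on the following slides.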
Surrounding Bag of Words
Unordered individual words near the ambiguous word.
Words in the same sentence.
May include words in the previous sentence or surrounding paragraph.
Gives general topical cues of the context.
May use feature selection to determine a smaller set of words that help discriminate possible senses.
May just remove common “stop words” such as articles, prepositions, etc.
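A minimal sketch of this feature, with an illustrative stop-word list:

```python
# Minimal sketch: unordered bag-of-words context minus stop words.
STOP = {"a", "an", "the", "in", "on", "of", "is", "to"}  # illustrative list

def bag_of_words(tokens, i, window=10):
    """Unordered words within `window` positions of the target tokens[i]."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return {w.lower() for j, w in enumerate(tokens[lo:hi], start=lo)
            if j != i and w.lower() not in STOP}

print(bag_of_words("The ink is in the pen".split(), 5))  # -> {'ink'}
```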
POS of Neighboring Words
Use part-of-speech of immediately neighboring words.
Provides evidence of local syntactic context.
The concept of n-gram applies here.
Typical to include features for:
P-3, P-2, P-1, P0, P1, P2, P3 (the POS of the target word, P0, and of the three words on either side)
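A minimal sketch using NLTK's off-the-shelf tagger (which tagger is used is an assumption; any POS tagger fits):

```python
# Minimal sketch: POS tags of the target word and its neighbors.
# Assumes nltk.download("averaged_perceptron_tagger") has been run once.
import nltk

def pos_features(tokens, i, window=3):
    """Features P-3 ... P3: POS of the target and of its close neighbors."""
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {f"P{offset}": tags[i + offset]
            for offset in range(-window, window + 1)
            if 0 <= i + offset < len(tags)}

print(pos_features("Take one pill every morning".split(), 0))
```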
Local Collocations
Review the specific lexical context immediately adjacent to the word.
For example, to determine whether “interest” as a noun refers to “readiness to give attention” or “money paid for the use of money”, the following collocations are useful:
“in the interest of”
“an interest in”
“interest rate”
“accrued interest”
Typical to include:
Single-word context: C-1,-1, C1,1, C-2,-2, C2,2
Two-word context: C-2,-1, C-1,1, C1,2
Three-word context: C-3,-1, C-2,1, C-1,2, C1,3
(where Ci,j denotes the word sequence from offset i to offset j relative to the target word)
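A minimal sketch of extracting such features, following the Ci,j notation above:

```python
# Minimal sketch: collocation feature C_{i,j}, the word sequence
# spanning offsets i..j relative to the target token tokens[t].
def collocation(tokens, t, i, j):
    lo, hi = t + i, t + j + 1
    if lo < 0 or hi > len(tokens):
        return None  # window falls outside the sentence
    return " ".join(tokens[lo:hi]).lower()

sent = "the bank raised its interest rate again".split()
t = sent.index("interest")            # t = 4
print(collocation(sent, t, 1, 1))     # C_{1,1}   -> 'rate'
print(collocation(sent, t, -2, -1))   # C_{-2,-1} -> 'raised its'
```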
Syntactic Relations
Ambiguous Verbs
For an ambiguous verb, it is very useful to know its direct object.
It may also be useful to know its subject.
Ambiguous Nouns
For an ambiguous noun, it is useful to know what verb it is the object of.
It may also be useful to know what verb it is the subject of.
Ambiguous Adjectives
For an ambiguous adjective, it is useful to know the noun it is modifying.
Lesk Algorithm
The Lesk algorithm was developed by Michael Lesk (1986) to identify the senses of words in context using definition overlap.
Algorithm:
1. Retrieve from a machine-readable dictionary (MRD) all sense definitions of the words to be disambiguated
2. Determine the definition overlap for all possible sense combinations
3. Choose the senses that lead to the highest overlap
Example: disambiguate PINE CONE
• PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
• CONE
1. solid body which narrows to a point
2. something of this shape, whether solid or hollow
3. fruit of certain evergreen trees
Pine#1 ∩ Cone#1 = 0   Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1   Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2   Pine#2 ∩ Cone#3 = 0
→ Pine#1 and Cone#3 give the highest overlap.
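A minimal sketch of this pairwise procedure, hard-coding the glosses above as a toy MRD; it uses plain word overlap, so individual counts can differ slightly from the slide, but the winning pair is the same:

```python
# Minimal sketch: original (pairwise) Lesk on the PINE CONE example.
from itertools import product

MRD = {  # toy machine-readable dictionary
    "pine": ["kinds of evergreen tree with needle-shaped leaves",
             "waste away through sorrow or illness"],
    "cone": ["solid body which narrows to a point",
             "something of this shape whether solid or hollow",
             "fruit of certain evergreen trees"],
}

def overlap(def1, def2):
    """Number of words shared by two sense definitions."""
    return len(set(def1.split()) & set(def2.split()))

def lesk_pair(w1, w2):
    """Sense pair (as 0-based indices) with the highest definition overlap."""
    pairs = product(range(len(MRD[w1])), range(len(MRD[w2])))
    return max(pairs, key=lambda p: overlap(MRD[w1][p[0]], MRD[w2][p[1]]))

print(lesk_pair("pine", "cone"))  # -> (0, 2), i.e. Pine#1 and Cone#3
```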
Lesk Algorithm for More than Two Words?
I saw a man who is 98 years old and can still walk and tell jokes
Nine open-class words (number of senses in parentheses): see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3)
43,929,600 sense combinations! How do we find the optimal sense combination?
Simulated annealing
Define a function E over the combination of word senses in a given text; find the combination of senses that leads to the highest definition overlap (redundancy).
1. Start with the most frequent sense for each word
2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E
3. Stop iterating when there is no change in the configuration of senses
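A minimal sketch of this search; the cooling schedule and the acceptance rule are standard simulated-annealing choices assumed here, since the slide fixes only the overall loop:

```python
# Minimal sketch: simulated annealing over sense assignments.
import math, random

def total_overlap(assignment, glosses):
    """E: summed pairwise word overlap between the chosen sense glosses."""
    defs = [set(glosses[w][s].split()) for w, s in assignment.items()]
    return sum(len(a & b) for k, a in enumerate(defs) for b in defs[k + 1:])

def anneal(glosses, steps=1000, temp=1.0, cooling=0.995):
    """glosses: {word: [gloss of sense 0, gloss of sense 1, ...]},
    with sense 0 assumed to be the most frequent sense."""
    words = list(glosses)
    assign = {w: 0 for w in words}        # start with the most frequent sense
    score = total_overlap(assign, glosses)
    for _ in range(steps):
        w = random.choice(words)          # perturb one word's sense
        old = assign[w]
        assign[w] = random.randrange(len(glosses[w]))
        new_score = total_overlap(assign, glosses)
        delta = new_score - score
        # keep improvements; keep worsenings with probability e^(delta/temp)
        if delta >= 0 or random.random() < math.exp(delta / temp):
            score = new_score
        else:
            assign[w] = old               # undo the move
        temp *= cooling                   # cool down
    return assign, score
```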
Lesk Algorithm: A Simplified Version
Original Lesk definition: measure the overlap between the sense definitions of all words in context
Identifies the correct senses of all words in context simultaneously
Simplified Lesk (Kilgarriff & Rosenzweig, 2000): measure the overlap between the sense definitions of a word and its current context
Identifies the correct sense of one word at a time
Search space significantly reduced
Lesk Algorithm: A Simplified Version
Example: disambiguate PINE in
“Pine cones hanging in a tree”
• PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
Pine#1 ∩ Sentence = 1
Pine#2 ∩ Sentence = 0
• Algorithm for simplified Lesk:
1. Retrieve from MRD all sense definitions of the word to be disambiguated
2. Determine the overlap between each sense definition and the current context
3. Choose the sense that leads to the highest overlap
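A minimal sketch of simplified Lesk against NLTK's WordNet, using each sense's gloss (plus its example sentences) as the signature; note that NLTK also ships a ready-made implementation, nltk.wsd.lesk:

```python
# Minimal sketch: simplified Lesk over WordNet glosses.
from nltk.corpus import wordnet as wn

def simplified_lesk(word, sentence):
    context = set(sentence.lower().split())
    best, best_overlap = None, 0
    for sense in wn.synsets(word):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():   # enrich signature with examples
            signature |= set(example.lower().split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(simplified_lesk("pine", "pine cones hanging in a tree"))
```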
Selectional Preferences I
Most verbs prefer arguments of a particular type (e.g., the things that bark are dogs). Such regularities are called selectional preferences or selectional restrictions.
Selectional preferences are useful for a couple of reasons:
If a word is missing from our machine-readable dictionary, aspects of its meaning can be inferred from its selectional restrictions.
Selectional preferences can be used to rank different parses of a sentence.
Selectional Preferences II
Selectional preferences are a mechanism to capture or constrain the possible meanings of words in a given context.
E.g. “Wash a dish” vs. “Cook a dish”
WASH-OBJECT vs. COOK-FOOD
They capture information about possible relations between semantic classes, where words are associated with these classes.
Acquiring Selectional Preferences
Preferences could be acquired from annotated corpora
Circular relationship with the WSD problem
Need WSD to build the annotated corpus
Need selectional preferences to derive WSD
Preferences could also be built from raw corpora
Frequency counts
Information theory measures
Class-to-class relations
Learning Word-to-Word Relations
An indication of the semantic fit between two words
1. Frequency counts
Pairs of words connected by a syntactic relation R: Count(W1, W2, R)
2. Conditional probabilities
Condition on one of the words:
P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)
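A minimal sketch of both quantities over a toy list of parser-extracted (verb, object) pairs:

```python
# Minimal sketch: word-to-word fit from verb-object co-occurrence counts.
from collections import Counter

# Toy (verb, object) pairs for the syntactic relation R = direct object.
pairs = [("drink", "tea"), ("drink", "beer"), ("drink", "wine"),
         ("drink", "tea"), ("cook", "dish"), ("wash", "dish")]

count_w1_w2 = Counter(pairs)                  # Count(W1, W2, R)
count_w2 = Counter(obj for _, obj in pairs)   # Count(W2, R)

def p_verb_given_object(verb, obj):
    """P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)."""
    return count_w1_w2[(verb, obj)] / count_w2[obj]

print(p_verb_given_object("cook", "dish"))  # -> 0.5
```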
Learning Selectional Preferences
Determine the contribution of a word sense based on the assumption of equal sense distributions:
e.g. “plant” has two senses, so 50% of its occurrences are counted as sense 1 and 50% as sense 2.
Example: learning restrictions for the verb “to drink”
Find high-scoring verb-object pairs
Co-occ score Verb Object
11.75 drink tea
11.75 drink Pepsi
11.75 drink champagne
10.53 drink liquid
10.2 drink beer
9.34 drink wine
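A minimal sketch of the equal-distribution assumption, spreading each observed noun's count uniformly over its senses; the two-sense inventory is a toy assumption:

```python
# Minimal sketch: credit each observed object equally across its senses.
from collections import defaultdict

SENSES = {  # toy sense inventory
    "plant": ["plant.factory", "plant.living"],
    "tea":   ["tea.beverage"],
}

def sense_counts(pairs):
    """Distribute each (verb, noun) count uniformly over the noun's senses."""
    counts = defaultdict(float)
    for verb, noun in pairs:
        senses = SENSES.get(noun, [])
        for s in senses:
            counts[(verb, s)] += 1.0 / len(senses)  # equal sense distribution
    return counts

print(sense_counts([("water", "plant"), ("drink", "tea")]))
# -> 0.5 for each sense of "plant", 1.0 for the single sense of "tea"
```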
Other Approaches to WSD
Unsupervised sense clustering
Cluster words based on their neighboring words
Semi-supervised learning
Bootstrap from a small number of labeled examples to exploit unlabeled data
Dictionary-based methods
Lesk algorithm
Evaluation Metrics
Fixed training and test sets, the same for each system.
A system can decline to provide a sense tag for a word if it is sufficiently uncertain.
Measured quantities:
A: number of words assigned senses
C: number of words assigned correct senses
T: total number of test words
Metrics:
Precision = C/A
Recall = C/T
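For example, if a system assigns senses to A = 80 of T = 100 test words and C = 60 of those are correct, then Precision = 60/80 = 0.75 and Recall = 60/100 = 0.60; declining to tag hard words can raise precision at the cost of recall.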
End of Topic 7