TOPIC 7: WORD SENSE DISAMBIGUATION (WSD)
NATURAL LANGUAGE PROCESSING (NLP)
CS-724
Wondwossen Mulugeta (PhD) email: [email protected]
Topics
Topic 7: Word Sense Disambiguation (WSD)
Subtopics: Practical Problem, Word Sense Relations, Disambiguation Approaches, Knowledge-Based WSD, Machine-Readable Dictionary, Lesk Algorithm, Selectional Preference for WSD
Sense Inventory
What is a “sense” of a word?
Homonyms (disconnected meanings)
duck: to move downwards quickly
duck: a small web-footed bird
Polysemes (related meanings with a joint origin in the word's history)
bank: a financial institution as a corporation
bank: a building housing such an institution
Where can we find the senses of a word?
Dictionaries
Lexical databases
Word Senses
Words can mean many things…
Word sense is the intended meaning of a word in a given context.
With respect to a dictionary
chair = a seat for one person, with a support for the back;
chair = the position of professor;
With respect to the translation in a second language
chair = ወንበር (a seat)
chair = ሊቀመንበር (a chairperson)
With respect to the context where it occurs (discrimination)
“Sit on a chair” “Take a seat on this chair”
“The chair of the Math Department” “The chair of the meeting”
Lexical Ambiguity
Most words in natural languages have multiple possible meanings.
“pen” (noun)
The dog is in the pen.
The ink is in the pen.
“take” (verb)
Take one pill every morning.
Take the first right past the stoplight.
Syntax helps distinguish meanings for different parts of speech of an ambiguous word.
“conduct” (noun or verb)
John’s conduct in class is unacceptable. (conduct as noun)
John will conduct the orchestra on Thursday. (conduct as verb)
Conceptual Model of WSD
Sense knowledge can either be lexical knowledge defined in dictionaries, or world knowledge learned from training corpora.
Knowledge Source for WSD
Lexical Knowledge
Lexical knowledge is usually drawn from a dictionary. It can be either symbolic or empirical. It is the foundation of unsupervised WSD approaches.
Learned World Knowledge
World knowledge is too complex to be verbalized completely, so a practical strategy is to acquire it automatically, on demand, from the context of training corpora using machine learning techniques.
Trend (Hybrid)
Use the interaction of multiple knowledge sources to approach WSD.
Motivation for Word Sense Disambiguation (WSD)
Understanding the exact meaning of words is important in many NLP application areas:
Question Answering
Information Retrieval
Machine Translation
Text Mining
Phone Help Systems
The issue is understanding how people do the disambiguation task.
How do you do it as a native speaker of any of the Ethiopian languages?
Approaches to WSD
Knowledge-Based Disambiguation
uses external lexical resources such as dictionaries and thesauri
exploits discourse properties (relationships between words)
Supervised Disambiguation
based on a labeled training set
the learning system has a training set of feature-encoded inputs AND their appropriate sense labels (categories)
Unsupervised Disambiguation
based on unlabeled corpora
the learning system has a training set of feature-encoded inputs BUT NOT their appropriate sense labels (categories)
WordNet
A detailed database of semantic relationships between words.
This can be built for any language
There are WordNets for English, French, German, and many more languages.
The English WordNet was developed by cognitive psychologist George Miller and a team at Princeton University.
About 144,000 English words.
Nouns, adjectives, verbs, and adverbs grouped into about 109,000 synonym sets called synsets.
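To make the sense inventory concrete, here is a minimal sketch that lists the senses of “chair” through NLTK's WordNet interface (assuming NLTK is installed and its WordNet data has been downloaded):

```python
# Minimal sketch: listing the WordNet senses of "chair" with NLTK.
# Assumes: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("chair"):
    # Each synset is one sense: a set of synonymous lemmas plus a gloss.
    print(synset.name(), "-", synset.definition())
```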
WordNet Synset Relationships
Antonym: front ↔ back
Attribute: benevolence ↔ good (noun to adjective)
Pertainym: alphabetical → alphabet (adjective to noun)
Similar: unquestioning ↔ absolute
Cause: kill → die
Holonym: chapter → text (part to whole)
Meronym: computer → cpu (whole to part)
Hyponym: plant → tree (specialization)
Hypernym: apple → fruit (generalization)
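These relations can be traversed programmatically; a minimal sketch, again assuming NLTK's WordNet data (the printed examples are typical outputs, not guarantees):

```python
# Minimal sketch: navigating synset relations in NLTK's WordNet.
from nltk.corpus import wordnet as wn

apple = wn.synset("apple.n.01")
print(apple.hypernyms())       # generalizations of apple, e.g. edible fruit

tree = wn.synset("tree.n.01")
print(tree.part_meronyms())    # parts of a tree, e.g. trunk, limb

good = wn.lemma("good.a.01.good")
print(good.antonyms())         # the opposite lemma, e.g. bad
```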
Learning for WSD
Assume the part-of-speech (POS) of the target word, e.g. noun, verb, or adjective, has already been determined.
Treat WSD as a classification problem whose categories are the potential senses of the target word given its POS.
Encode context using a set of features to be used for disambiguation.
Train a classifier on labeled data encoded using these features.
Use the trained classifier to disambiguate future instances of the target word given their contextual features.
What could be a Contextual Feature?
Contextual Features
1. Surrounding bag of words.
2. POS of neighboring words
3. Local collocations
4. Syntactic relations
Experimental evaluations indicate that all of these features are useful, and the best results come from integrating all of these cues in the disambiguation process.
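As a concrete illustration of this pipeline, here is a minimal sketch of a supervised classifier for the noun “pen”; the labeled examples and the feature template are toy assumptions, not data from the course:

```python
# Minimal sketch: supervised WSD as classification with scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(tokens, i):
    """Encode the context of the ambiguous token tokens[i]."""
    feats = {"bow=" + w: 1 for j, w in enumerate(tokens) if j != i}  # bag of words
    if i > 0:
        feats["prev=" + tokens[i - 1]] = 1   # local collocation C-1,-1
    if i + 1 < len(tokens):
        feats["next=" + tokens[i + 1]] = 1   # local collocation C1,1
    return feats

# Toy training set: (tokens, index of "pen", sense label).
train = [("the dog is in the pen".split(), 5, "enclosure"),
         ("the sheep escaped from the pen".split(), 5, "enclosure"),
         ("the ink is in the pen".split(), 5, "writing"),
         ("she signed it with a pen".split(), 5, "writing")]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit([features(t, i) for t, i, _ in train], [s for _, _, s in train])

test = "the dog sleeps in the pen".split()
print(clf.predict([features(test, 5)]))  # likely 'enclosure': shares "dog"
```

In practice the feature dictionary would combine all four feature types detailed on the following slides.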
Surrounding Bag of Words
Unordered individual words near the ambiguous word.
Words in the same sentence.
May include words in the previous sentence or surrounding paragraph.
Gives general topical cues of the context.
May use feature selection to determine a smaller set of words that help discriminate possible senses.
May just remove common “stop words” such as articles, prepositions, etc.
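A minimal sketch of this feature, with an illustrative stop-word list:

```python
# Minimal sketch: unordered bag-of-words context minus stop words.
STOP = {"a", "an", "the", "in", "on", "of", "is", "to"}  # illustrative list

def bag_of_words(tokens, i, window=10):
    """Unordered words within `window` positions of the target tokens[i]."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return {w.lower() for j, w in enumerate(tokens[lo:hi], start=lo)
            if j != i and w.lower() not in STOP}

print(bag_of_words("The ink is in the pen".split(), 5))  # -> {'ink'}
```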
POS of Neighboring Words
Use part-of-speech of immediately neighboring words.
Provides evidence of local syntactic context.
The concept of n-gram applies here.
Typical to include features for:
P-3, P-2, P-1, P0, P1, P2, P3 (the POS of the target word, P0, and of the three words on either side)
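A minimal sketch using NLTK's off-the-shelf tagger (which tagger is used is an assumption; any POS tagger fits):

```python
# Minimal sketch: POS tags of the target word and its neighbors.
# Assumes nltk.download("averaged_perceptron_tagger") has been run once.
import nltk

def pos_features(tokens, i, window=3):
    """Features P-3 ... P3: POS of the target and of its close neighbors."""
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {f"P{offset}": tags[i + offset]
            for offset in range(-window, window + 1)
            if 0 <= i + offset < len(tags)}

print(pos_features("Take one pill every morning".split(), 0))
```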
Local Collocations
Review the specific lexical context immediately adjacent to the word.
For example, to determine whether “interest” as a noun refers to “readiness to give attention” or “money paid for the use of money”, the following collocations are useful:
“in the interest of”
“an interest in”
“interest rate”
“accrued interest”
Typical to include:
Single-word context: C-1,-1, C1,1, C-2,-2, C2,2
Two-word context: C-2,-1, C-1,1, C1,2
Three-word context: C-3,-1, C-2,1, C-1,2, C1,3
(where Ci,j denotes the word sequence from offset i to offset j relative to the target word)
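A minimal sketch of extracting such features, following the Ci,j notation above:

```python
# Minimal sketch: collocation feature C_{i,j}, the word sequence
# spanning offsets i..j relative to the target token tokens[t].
def collocation(tokens, t, i, j):
    lo, hi = t + i, t + j + 1
    if lo < 0 or hi > len(tokens):
        return None  # window falls outside the sentence
    return " ".join(tokens[lo:hi]).lower()

sent = "the bank raised its interest rate again".split()
t = sent.index("interest")            # t = 4
print(collocation(sent, t, 1, 1))     # C_{1,1}   -> 'rate'
print(collocation(sent, t, -2, -1))   # C_{-2,-1} -> 'raised its'
```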
Syntactic Relations
Ambiguous Verbs
For an ambiguous verb, it is very useful to know its direct object.
It may also be useful to know its subject.
Ambiguous Nouns
For an ambiguous noun, it is useful to know what verb it is the object of.
It may also be useful to know what verb it is the subject of.
Ambiguous Adjectives
For an ambiguous adjective, it is useful to know the noun it is modifying.
Lesk Algorithm
The Lesk algorithm was developed by Michael Lesk (1986) to identify the senses of words in context using definition overlap.
Algorithm:
1. Retrieve from a machine-readable dictionary (MRD) all sense definitions of the words to be disambiguated
2. Determine the definition overlap for all possible sense combinations
3. Choose the senses that lead to the highest overlap
Example: disambiguate PINE CONE
• PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
• CONE
1. solid body which narrows to a point
2. something of this shape, whether solid or hollow
3. fruit of certain evergreen trees
Pine#1 ∩ Cone#1 = 0   Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1   Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2   Pine#2 ∩ Cone#3 = 0
→ Pine#1 and Cone#3 give the highest overlap.
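A minimal sketch of this pairwise procedure, hard-coding the glosses above as a toy MRD; it uses plain word overlap, so individual counts can differ slightly from the slide, but the winning pair is the same:

```python
# Minimal sketch: original (pairwise) Lesk on the PINE CONE example.
from itertools import product

MRD = {  # toy machine-readable dictionary
    "pine": ["kinds of evergreen tree with needle-shaped leaves",
             "waste away through sorrow or illness"],
    "cone": ["solid body which narrows to a point",
             "something of this shape whether solid or hollow",
             "fruit of certain evergreen trees"],
}

def overlap(def1, def2):
    """Number of words shared by two sense definitions."""
    return len(set(def1.split()) & set(def2.split()))

def lesk_pair(w1, w2):
    """Sense pair (as 0-based indices) with the highest definition overlap."""
    pairs = product(range(len(MRD[w1])), range(len(MRD[w2])))
    return max(pairs, key=lambda p: overlap(MRD[w1][p[0]], MRD[w2][p[1]]))

print(lesk_pair("pine", "cone"))  # -> (0, 2), i.e. Pine#1 and Cone#3
```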
Lesk Algorithm for More than Two Words?
I saw a man who is 98 years old and can still walk and tell jokes
Nine open-class words (number of senses in parentheses): see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3)
43,929,600 sense combinations! How do we find the optimal sense combination?
Simulated annealing
Define a function E over the combination of word senses in a given text; find the combination of senses that leads to the highest definition overlap (redundancy).
1. Start with the most frequent sense for each word
2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E
3. Stop iterating when there is no change in the configuration of senses
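A minimal sketch of this search; the cooling schedule and the acceptance rule are standard simulated-annealing choices assumed here, since the slide fixes only the overall loop:

```python
# Minimal sketch: simulated annealing over sense assignments.
import math, random

def total_overlap(assignment, glosses):
    """E: summed pairwise word overlap between the chosen sense glosses."""
    defs = [set(glosses[w][s].split()) for w, s in assignment.items()]
    return sum(len(a & b) for k, a in enumerate(defs) for b in defs[k + 1:])

def anneal(glosses, steps=1000, temp=1.0, cooling=0.995):
    """glosses: {word: [gloss of sense 0, gloss of sense 1, ...]},
    with sense 0 assumed to be the most frequent sense."""
    words = list(glosses)
    assign = {w: 0 for w in words}        # start with the most frequent sense
    score = total_overlap(assign, glosses)
    for _ in range(steps):
        w = random.choice(words)          # perturb one word's sense
        old = assign[w]
        assign[w] = random.randrange(len(glosses[w]))
        new_score = total_overlap(assign, glosses)
        delta = new_score - score
        # keep improvements; keep worsenings with probability e^(delta/temp)
        if delta >= 0 or random.random() < math.exp(delta / temp):
            score = new_score
        else:
            assign[w] = old               # undo the move
        temp *= cooling                   # cool down
    return assign, score
```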
Lesk Algorithm: A Simplified Version
Original Lesk definition: measure the overlap between the sense definitions of all words in context
Identifies the correct senses of all words in context simultaneously
Simplified Lesk (Kilgarriff & Rosenzweig, 2000): measure the overlap between the sense definitions of a word and its current context
Identifies the correct sense of one word at a time
Search space significantly reduced
Lesk Algorithm: A Simplified Version
Example: disambiguate PINE in
“Pine cones hanging in a tree”
• PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
Pine#1 ∩ Sentence = 1
Pine#2 ∩ Sentence = 0
• Algorithm for simplified Lesk:
1. Retrieve from MRD all sense definitions of the word to be disambiguated
2. Determine the overlap between each sense definition and the current context
3. Choose the sense that leads to the highest overlap
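A minimal sketch of simplified Lesk against NLTK's WordNet, using each sense's gloss (plus its example sentences) as the signature; note that NLTK also ships a ready-made implementation, nltk.wsd.lesk:

```python
# Minimal sketch: simplified Lesk over WordNet glosses.
from nltk.corpus import wordnet as wn

def simplified_lesk(word, sentence):
    context = set(sentence.lower().split())
    best, best_overlap = None, 0
    for sense in wn.synsets(word):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():   # enrich signature with examples
            signature |= set(example.lower().split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(simplified_lesk("pine", "pine cones hanging in a tree"))
```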
Selectional Preferences I
Most verbs prefer arguments of a particular type (e.g., the things that bark are dogs). Such regularities are called selectional preferences or selectional restrictions.
Selectional preferences are useful for a couple of reasons:
If a word is missing from our machine-readable dictionary, aspects of its meaning can be inferred from its selectional restrictions.
Selectional preferences can be used to rank different parses of a sentence.
Selectional Preferences II
Selectional preferences are a mechanism to capture or constrain the possible meanings of words in a given context.
E.g. “Wash a dish” vs. “Cook a dish”
WASH-OBJECT vs. COOK-FOOD
They capture information about possible relations between semantic classes, where words are associated with these classes.
Acquiring Selectional Preferences
Preferences could be acquired from annotated corpora
Circular relationship with the WSD problem
Need WSD to build the annotated corpus
Need selectional preferences to derive WSD
Preferences could also be built from raw corpora
Frequency counts
Information theory measures
Class-to-class relations
Learning Word-to-Word Relations
An indication of the semantic fit between two words
1. Frequency counts
Pairs of words connected by a syntactic relation R: Count(W1, W2, R)
2. Conditional probabilities
Condition on one of the words:
P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)
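A minimal sketch of both quantities over a toy list of parser-extracted (verb, object) pairs:

```python
# Minimal sketch: word-to-word fit from verb-object co-occurrence counts.
from collections import Counter

# Toy (verb, object) pairs for the syntactic relation R = direct object.
pairs = [("drink", "tea"), ("drink", "beer"), ("drink", "wine"),
         ("drink", "tea"), ("cook", "dish"), ("wash", "dish")]

count_w1_w2 = Counter(pairs)                  # Count(W1, W2, R)
count_w2 = Counter(obj for _, obj in pairs)   # Count(W2, R)

def p_verb_given_object(verb, obj):
    """P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)."""
    return count_w1_w2[(verb, obj)] / count_w2[obj]

print(p_verb_given_object("cook", "dish"))  # -> 0.5
```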
Learning Selectional Preferences
Determine the contribution of a word sense based on the assumption of equal sense distributions:
e.g. “plant” has two senses, so 50% of its occurrences are counted as sense 1 and 50% as sense 2.
Example: learning restrictions for the verb “to drink”
Find high-scoring verb-object pairs
Co-occ score Verb Object
11.75 drink tea
11.75 drink Pepsi
11.75 drink champagne
10.53 drink liquid
10.2 drink beer
9.34 drink wine
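A minimal sketch of the equal-distribution assumption, spreading each observed noun's count uniformly over its senses; the two-sense inventory is a toy assumption:

```python
# Minimal sketch: credit each observed object equally across its senses.
from collections import defaultdict

SENSES = {  # toy sense inventory
    "plant": ["plant.factory", "plant.living"],
    "tea":   ["tea.beverage"],
}

def sense_counts(pairs):
    """Distribute each (verb, noun) count uniformly over the noun's senses."""
    counts = defaultdict(float)
    for verb, noun in pairs:
        senses = SENSES.get(noun, [])
        for s in senses:
            counts[(verb, s)] += 1.0 / len(senses)  # equal sense distribution
    return counts

print(sense_counts([("water", "plant"), ("drink", "tea")]))
# -> 0.5 for each sense of "plant", 1.0 for the single sense of "tea"
```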
Other Approaches to WSD
Unsupervised sense clustering
Cluster words based on their neighboring words
Semi-supervised learning
Bootstrap from a small number of labeled examples to exploit unlabeled data
Dictionary-based methods
Lesk algorithm
Evaluation Metrics
Fixed training and test sets, the same for each system.
A system can decline to provide a sense tag for a word if it is sufficiently uncertain.
Measured quantities:
A: number of words assigned senses
C: number of words assigned correct senses
T: total number of test words
Metrics:
Precision = C/A
Recall = C/T
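For example, if a system assigns senses to A = 80 of T = 100 test words and C = 60 of those are correct, then Precision = 60/80 = 0.75 and Recall = 60/100 = 0.60; declining to tag hard words can raise precision at the cost of recall.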
End of Topic 7