Post on 01-Apr-2015

USING WORDNET TO RETRIEVE WORDS FROM THEIR MEANINGS

İlknur Durgar El-Kahlout and Kemal Oflazer

Sabancı University, İstanbul, Turkey

Problem

For a given definition, find the appropriate word (or words)

A traditional dictionary is of no use
From a dictionary, find an appropriate word that has a “similar” definition

Examples

User definition:
Akımı ölçmek için kullanılan alet
(A device that is used to measure the current)

In the dictionary:
akımölçer: elektrik akımının şiddetini ölçmeye yarayan araç, ampermetre
(ammeter: a device that measures the intensity of electrical current, amperemeter)

Applications

Computer-assisted language learning
Solving crossword puzzles
Reverse dictionary

Outline

Problem statement
Meaning-to-Word System (MTW)
Our Approach
Methods
Results
Result Summary
Conclusion

Problem Statement

Find the “similarity” between two definitions

Akımı ölçmek için kullanılan alet
(A device that is used to measure the current)

Elektrik akımının şiddetini ölçmeye yarayan araç, ampermetre
(a device that measures the intensity of electrical current, amperemeter)

Meaning-to-Word (MTW)

Addresses the problem of finding the appropriate word (or words) whose meaning “matches” the given definition

Two subproblems:
finding words whose definitions are “similar” to the query in some sense
ranking the candidate words in a variety of ways

Information Flow in MTW

User Definition → (query) → Search in Dictionary → (candidates) → Rank Candidates → List of words

Available Resources

Turkish Monolingual Dictionary: about 50,000 entries
Turkish WordNet: about 11,000 synsets

Normalization

Tokenization
Stemming
Stop Word Elimination
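The three normalization steps can be sketched in Python as follows. This is a minimal illustration, not the system's implementation: the stop-word list and the `TOY_STEMS` lookup table are hypothetical stand-ins for a real stop list and a Turkish morphological analyzer.

```python
import re

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"daha", "hiç", "ve", "bir"}

# Hypothetical stem lookup standing in for a Turkish morphological analyzer.
TOY_STEMS = {"evlenmemiş": "evlen"}


def turkish_lower(text):
    """Lowercase with Turkish casing rules (I -> ı, İ -> i)."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def normalize(definition):
    """Tokenize, stem, and remove stop words from a definition."""
    tokens = re.findall(r"\w+", turkish_lower(definition))
    stems = [TOY_STEMS.get(tok, tok) for tok in tokens]
    return [s for s in stems if s not in STOP_WORDS]


print(normalize("Daha önce hiç evlenmemiş kişi"))
```

With these toy resources the query is reduced to the three stems önce, evlen, kişi.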

Query Processing

Subset Generation
Search with different sets of words
Select informative words from the user’s query

Query: daha önce hiç evlenmemiş kişi (a person who has never been married)

{önce, evlen, kişi} (before, marry, person)
{evlen, kişi}, {önce, kişi}, {önce, evlen} (marry, person), (before, person), (before, marry)
{evlen}, {önce}, {kişi} (marry), (before), (person)
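Subset generation over the normalized query stems can be sketched with the standard library. The function name `generate_subsets` is hypothetical, and the sketch simply enumerates every non-empty subset.

```python
from itertools import combinations


def generate_subsets(stems):
    """Return all non-empty subsets of the query stems, largest first."""
    subsets = []
    for size in range(len(stems), 0, -1):
        subsets.extend(combinations(stems, size))
    return subsets


for subset in generate_subsets(["önce", "evlen", "kişi"]):
    print(subset)
```

A query with n stems has 2^n - 1 non-empty subsets, so full enumeration is only practical for short queries. How the system limits the subsets for long queries is not stated in the slides.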

Query Processing

Subset Sorting
An unordered list of subsets is insufficient
Rank the generated subsets:

1) By the number of words
{önce, evlen, kişi} (before, marry, person)
{evlen, kişi} (marry, person)

2) By the sum of the logarithms of the word frequencies
{evlen, kişi} (marry, person)
{önce, kişi} (before, person)
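The two sorting criteria can be combined into a single sort key. The sketch below rests on two assumptions that the slides do not confirm: the frequencies in `FREQ` are invented for illustration, and a lower sum of log frequencies (rarer words) is treated as more informative and therefore ranked first.

```python
import math

# Hypothetical corpus frequencies, for illustration only.
FREQ = {"önce": 5000, "evlen": 300, "kişi": 4000}


def sort_subsets(subsets, freq):
    """Sort by size (largest first), then by sum of log frequencies (lowest first)."""
    def key(subset):
        return (-len(subset), sum(math.log(freq[w]) for w in subset))
    return sorted(subsets, key=key)


subsets = [("önce", "kişi"), ("evlen",), ("evlen", "kişi"), ("önce", "evlen", "kişi")]
for subset in sort_subsets(subsets, FREQ):
    print(subset)
```

Under these assumed frequencies the three-word subset comes first, and {evlen, kişi} is ranked ahead of {önce, kişi}, the same order as in the example above.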

Searching for Meanings

Two methods:
Stem Matching
Query Expansion (using WordNet)

Stem Matching

Morphological normalization of words
Find meanings that contain morphological variants of the words in the original definition

Stem Matching (Ex.)

{ akımı ölçmek için kullanılan alet }
(A device that is used to measure the current)

Candidate stems for each query word:
akımı: ak (white), akım (current), akı (flux)
ölçmek: ölç (measure)
için: için (to), iç (drink)
kullanılan: kullan (use), kul (slave)
alet: alet (device)

On the original slide, the matching stems are shown in color

Stem Matching

(A device that is used to measure the current)

akımı ölçmek için kullanılan alet

elektrik akımının şiddetini ölçmeye yarayan araç, ampermetre

(a device that measures the intensity of electrical current, amperemeter)
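Stem matching between the query and a dictionary definition can be sketched as a set intersection. The query stems are the candidates listed in the example. The stems for the dictionary definition are hand-written for illustration and are not the output of the system's analyzer.

```python
# Candidate stems per query word, as listed in the stem matching example.
QUERY_STEMS = {
    "akımı": {"ak", "akım", "akı"},
    "ölçmek": {"ölç"},
    "için": {"için", "iç"},
    "kullanılan": {"kullan", "kul"},
    "alet": {"alet"},
}

# Hand-written, illustrative stems for the dictionary definition of "akımölçer".
DEFINITION_STEMS = {
    "elektrik", "ak", "akım", "akı", "şiddet", "ölç", "yara", "araç", "ampermetre",
}


def matched_words(query_stems, definition_stems):
    """Return the query words that share at least one stem with the definition."""
    return [w for w, stems in query_stems.items() if stems & definition_stems]


print(matched_words(QUERY_STEMS, DEFINITION_STEMS))
```

Only akımı and ölçmek match here. The query word alet and the definition word araç both mean "device" but share no stem, so stem matching alone cannot relate them.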

Stem Matching: Drawbacks

Generates noisy stems
ilim (science, my city) → ilim (science), il (city)

Conflates two words with very different meanings to the same stem
ilim (science, my city), ilde (in the city) → il (city)

Cannot find relations between similar words
kimse (someone), kişi (person)
bölüm (part), kısım (portion)

Using Query Expansion

Two different approaches:
Expand the query with relations (synonyms, specializations, generalizations)
Expand the query with the relevant answers of the unexpanded query

WordNet synonyms are used in MTW
{besin, gıda} (food, nourishment)
{iyileş, düzel} (to get better) / {iyileş, geliş} (to improve)

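Query Expansion (Ex.)

{ akımı ölçmek için kullanılan alet }
(A device that is used to measure the current)

Candidate stems for each query word:
akımı: ak (white), akım (current), akı (flux)
ölçmek: ölç (measure)
için: için (to), iç (drink)
kullanılan: kullan (use), kul (slave)
alet: alet (device)

Synonyms added by the expansion:
ak: beyaz (white)
akım, akı: debi (flow rate), akış (flow)
kullan: faydalan, yararlan (make use of)
kul: köle (slave)
alet: araç (tool), gereç (equipment)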
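Expanding a set of query stems with synonyms can be sketched as a union over matching synonym sets. The `SYNSETS` list below is a toy stand-in for the Turkish WordNet. Its words are taken from the slides, but grouping them into these particular sets is an assumption.

```python
# Toy synonym sets standing in for Turkish WordNet synsets (grouping is assumed).
SYNSETS = [
    {"besin", "gıda"},
    {"alet", "araç", "gereç"},
    {"kullan", "faydalan", "yararlan"},
    {"kul", "köle"},
    {"ak", "beyaz"},
]


def expand(stems, synsets):
    """Add every synonym of every stem that appears in some synonym set."""
    expanded = set(stems)
    for synset in synsets:
        if synset & expanded:
            expanded |= synset
    return expanded


print(sorted(expand({"alet", "ölç"}, SYNSETS)))
```

After expansion the query contains araç, so it can match a dictionary definition that uses araç instead of alet.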

Query Expansion (Ex.)

(A device that is used to measure the current)

akımı ölçmek için kullanılan alet

elektrik akımının şiddetini ölçmeye yarayan araç, ampermetre

(a device that measures the intensity of electrical current, amperemeter)

Ranking

A very important part of MTW
Having the right answer in the retrieved set is not enough
The aim is to have the right answer near the top of the retrieved set (e.g., within the top 50 answers)

Ranking

Simple but effective methods:
Number of matched words
Subset informativeness: frequency of the words in the subset
Ratio of the number of matched words to the number of words in the candidate dictionary definition
Longest Common Subsequence: order of the matched words
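Three of the ranking signals above can be computed directly from the normalized query and a candidate definition. The sketch below is illustrative. The function names and the example stems are hypothetical, subset informativeness is left out because it needs corpus frequencies, and the slides do not say how the signals are combined into a final score.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ranking_features(query, definition):
    """Matched-word count, match ratio, and LCS length for one candidate."""
    matched = len(set(query) & set(definition))
    return {
        "matched": matched,
        "ratio": matched / len(definition) if definition else 0.0,
        "lcs": lcs_length(query, definition),
    }


# Hand-written stems for illustration.
query = ["akım", "ölç", "kullan", "araç"]
definition = ["elektrik", "akım", "şiddet", "ölç", "yara", "araç", "ampermetre"]
print(ranking_features(query, definition))
```

Here three query stems match and appear in the same order in the definition, so both the matched count and the LCS length are 3, and the ratio is 3/7.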

Some Statistics

Training sets: 50 queries from users, 50 queries from a dictionary
Test sets: 50 queries from users, 50 queries from a separate dictionary

                         Test set 1 (user)   Training set 1   Test set 2 (dict.)   Training set 2
# of queries             50                  50               50                   50
Avg. # of query words    5.66                4.64             9.24                 13.98
Max. # of query words    17                  12               23                   45
Min. # of query words    2                   1                1                    6

Stem Matching: all stems included

Rank        Test set 1   Training set 1   Test set 2   Training set 2
1-10        13 (26%)     18 (36%)         45 (90%)     41 (82%)
11-50       7 (14%)      12 (24%)         2 (4%)       5 (10%)
>50         19 (38%)     10 (20%)         3 (6%)       4 (8%)
Not found   11 (22%)     10 (20%)         0 (0%)       0 (0%)

Low percentage in the top 10 for user queries, but very high results for dictionary queries

Stem Matching: longest stem included (heuristics)

Rank        Test set 1   Training set 1   Test set 2   Training set 2
1-10        14 (28%)     21 (42%)         46 (92%)     43 (86%)
11-50       5 (10%)      9 (18%)          1 (2%)       5 (10%)
>50         18 (36%)     9 (18%)          3 (6%)       2 (4%)
Not found   13 (26%)     11 (22%)         0 (0%)       0 (0%)

Improvement in user queries, slightly better performance in dictionary queries
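The share of queries answered within the first 50 results follows from the first two rows of a results table. A small sketch of that arithmetic, using the counts of the "longest stem included" stem matching table (the helper name `top50_rate` is hypothetical):

```python
# Rank-bucket counts copied from the "longest stem included" stem matching table.
COUNTS = {
    "Test set 1": {"1-10": 14, "11-50": 5, ">50": 18, "Not found": 13},
    "Training set 1": {"1-10": 21, "11-50": 9, ">50": 9, "Not found": 11},
    "Test set 2": {"1-10": 46, "11-50": 1, ">50": 3, "Not found": 0},
    "Training set 2": {"1-10": 43, "11-50": 5, ">50": 2, "Not found": 0},
}


def top50_rate(buckets):
    """Percentage of queries whose answer is ranked within the first 50."""
    total = sum(buckets.values())
    return 100 * (buckets["1-10"] + buckets["11-50"]) / total


for name, buckets in COUNTS.items():
    print(f"{name}: {top50_rate(buckets):.0f}%")
```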

Query Expansion (WordNet): all stems included

Rank        Test set 1   Training set 1   Test set 2   Training set 2
1-10        14 (28%)     24 (48%)         45 (90%)     41 (82%)
11-50       9 (18%)      9 (18%)          2 (4%)       5 (10%)
>50         18 (36%)     12 (24%)         3 (6%)       4 (8%)
Not found   9 (18%)      5 (10%)          0 (0%)       0 (0%)

Better results in user queries, no change in dictionary queries

Query Expansion (WordNet): longest stem included (heuristics)

Rank        Test set 1   Training set 1   Test set 2   Training set 2
1-10        14 (28%)     24 (48%)         41 (82%)     39 (78%)
11-50       6 (12%)      8 (16%)          5 (10%)      6 (12%)
>50         21 (42%)     13 (26%)         1 (2%)       5 (10%)
Not found   9 (18%)      5 (10%)          0 (0%)       0 (0%)

Better performance than ‘longest stem matching’ in user queries, but worse performance in dictionary queries

Result Summary

Stem Matching (longest stem included)
60% success in real user queries
96% success in dictionary queries

Query Expansion (all stems included)
68% success in real user queries
92% success in dictionary queries

Conclusion

We have implemented a ‘Meaning to Word’ system for Turkish
Results on unseen data are rather satisfactory
Query expansion is better, although it cannot find the words for all queries
68% of real user queries and 90% of dictionary queries are found in the first 50 results

THANK YOU !