An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information
Yong-Gu Lee, 2007-08-17, yonggulee@hotmail

Transcript
Page 1

An Effective Word Sense Disambiguation Model Using Automatic Sense

Tagging Based on Dictionary Information

Yong-Gu Lee

[email protected]

Page 2

Contents

Introduction Related Works Research Goals Effective Word Sense Disambiguation

Model and Evaluation Conclusion

Page 3

Introduction

Word Sense Disambiguation (WSD): the problem of selecting a sense for a word from a set of predefined possibilities.
An "intermediate task" that is not an end in itself, but is necessary at one level or another.
Essential for language-understanding applications:
Machine translation
Information retrieval and hypertext navigation
Content and thematic analysis
Speech processing and text processing

Page 4

Related Works (1/3)

Approaches to WSD:
Knowledge-based disambiguation: use of external lexical resources such as dictionaries and thesauri; discourse properties
Corpus-based disambiguation
Hybrid disambiguation

Page 5

Related Works (2/3)

Corpus-based disambiguation
Supervised disambiguation: based on a labeled training set; the learning system has a training set of feature-encoded inputs AND their appropriate sense labels (categories).
Unsupervised disambiguation: based on unlabeled corpora; the learning system has a training set of feature-encoded inputs BUT NOT their appropriate sense labels (categories).

Page 6

Related Works (3/3)

Lexical Resources for WSD
Machine-readable formats
Machine-Readable Dictionaries (MRDs): Longman, Oxford, etc.
Thesauri and semantic networks: Roget's Thesaurus, WordNet, etc.
Sense-tagged data
Senseval-1, 2, 3 (www.senseval.org): provides sense-annotated data for many languages and several tasks
Languages: English, Romanian, Chinese, Basque, Spanish, etc.
Tasks: Lexical Sample, All Words, etc.
SemCor, Hector, etc.

Page 7

Research Motivation

Manual sense tagging is labor-intensive and costly.
Limited availability of sense-tagged corpora: except for English, most languages have few corpora for WSD.
Limited coverage of sense-tagged words: some corpora tag the senses of only one or a few words (the "line" corpus, the "interest" corpus, etc.).
With a supervised disambiguation method, only the words that appear in the sense-tagged corpus can be disambiguated.

Page 8

Research Goals

Minimize or eliminate the cost of manual labeling: automatic sense tagging using an MRD and heuristic rules.
Improve the performance of word sense disambiguation: supervised disambiguation with a Naïve Bayes classifier.

Page 9

Effective Word Sense Disambiguation Model

Automatic Tagging Technique
Experimental Environment
Evaluation of Automatic Tagging Technique
Evaluation of Sense Classification
Evaluation of Fusion Method

Page 10

An Outline Diagram for the Proposed Research

[Diagram: the proposed pipeline. A dictionary feeds an auto sense-tagging module (keyword extraction, collection collocation extraction, sense tagging) that produces a training set; contexts of the target word are extracted and a Naïve Bayes classifier is trained to classify word senses; test contexts are then used for evaluation. Phases: automatic sense tagging and training, followed by sense classification.]

Page 11

Automatic Tagging Technique

Dictionary Information-based Method
Collocation Overlap-based Method
Data Fusion Method: Dictionary Information-based Method + Collocation Overlap-based Method

Page 12

Dictionary Information-based Method (1/2)

Extract the necessary information from the dictionary.
Heuristic 1: One sense per collocation / one sense per discourse (e.g. "telephone line"; 景氣展望/Gyeonggi-jeonmang, economic prospect)
Heuristic 2: Use of corresponding Chinese characters (e.g. 감자/Gamja: 柑子, potato / 減資, reduction of capital)
Heuristic 3: Co-occurrence of synonyms, antonyms, and related terms
Heuristic 4: Occurrence of derived words
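Heuristic 1 above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the study's code: the function name `tag_by_collocation`, the window size, and the English seed list are all assumptions (the study itself tags Korean words).

```python
def tag_by_collocation(tokens, target, seed_collocations, window=2):
    """One-sense-per-collocation sketch: if a seed collocate appears
    within `window` tokens of the target word, assign its sense."""
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        nearby = tokens[max(0, i - window):i + window + 1]
        for collocate, sense in seed_collocations.items():
            if collocate in nearby:
                return sense
    return None  # no seed collocation found; leave the occurrence untagged

# Illustrative English seeds for "line"
seeds = {"telephone": "line/cable", "wait": "line/queue"}
print(tag_by_collocation(["the", "telephone", "line", "was", "busy"],
                         "line", seeds))  # → line/cable
```

Occurrences with no matching seed stay untagged, which is what makes this rule high-precision but low-coverage.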

Page 13

Dictionary Information-based Method(2/2)

Heuristic 5: Co-occurrence of key features extracted from the definition of the target word's entry, as in Lesk (1986).

Algorithm:
1. Retrieve from the MRD all sense definitions of the word to be disambiguated.
2. Determine the overlap between each sense definition and the current context.
3. Choose the sense with the highest overlap.
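A minimal sketch of this Lesk-style overlap, assuming simple whitespace tokenization; the tiny sense inventory for English "plant" is invented for illustration and is not from the MRD used in the study.

```python
def lesk(context_words, sense_definitions):
    """Pick the sense whose definition overlaps most with the context."""
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        overlap = len(context & set(definition.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

senses = {
    "plant/factory": "industrial building where goods are made",
    "plant/flora": "living organism such as a tree or flower",
}
print(lesk(["the", "tree", "is", "a", "living", "organism"], senses))
# → plant/flora
```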

Page 14

Collocation Overlap-based Method

Semantic similarity metric using collocation overlap.

Algorithm:
1. Retrieve keywords from all MRD sense definitions of the word to be disambiguated.
2. Extract collocation words of the keywords from the test collection, up to a threshold.
3. Extract collocation words of the target word from the test collection.
4. Determine the overlap between the collocation word sets from steps 2 and 3.
5. Choose the sense with the highest overlap.
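Steps 2 through 5 can be sketched as follows. The toy corpus, the sense keywords, and the window/threshold parameters are all assumptions for illustration; the study extracts collocations from the full newspaper collection by a frequency threshold.

```python
from collections import Counter

def collocations(word, corpus_sentences, window=2, top_n=3):
    """Top-n words co-occurring with `word` within a +/- window."""
    counts = Counter()
    for sent in corpus_sentences:
        toks = sent.split()
        for i, t in enumerate(toks):
            if t == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(w for w in toks[lo:hi] if w != word)
    return {w for w, _ in counts.most_common(top_n)}

def collocation_overlap_sense(target, sense_keywords, corpus):
    """Choose the sense whose keyword collocations overlap most
    with the target word's own collocations (steps 2-5)."""
    target_cols = collocations(target, corpus)
    best, best_score = None, -1
    for sense, keywords in sense_keywords.items():
        sense_cols = set()
        for kw in keywords:
            sense_cols |= collocations(kw, corpus)
        score = len(target_cols & sense_cols)
        if score > best_score:
            best, best_score = sense, score
    return best

corpus = [
    "bank river water flows",
    "money bank loan interest",
    "river water fish swim",
    "loan interest money credit",
]
print(collocation_overlap_sense(
    "bank", {"bank/river": ["river"], "bank/finance": ["money"]}, corpus))
```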

Page 15

Feature Selection

By document frequency:
docDF: document frequency in the test collection
dicDF: document frequency with the dictionary definitions treated as documents
Keep keywords with docDF <= 5000 and dicDF <= 300
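This document-frequency filter can be sketched directly, with the slide's thresholds as defaults; the sample frequency tables below are invented for illustration.

```python
def select_features(doc_df, dic_df, max_doc_df=5000, max_dic_df=300):
    """Keep candidate keywords whose document frequency in the test
    collection (docDF) and in the dictionary definitions treated as
    documents (dicDF) both fall under the thresholds, filtering out
    overly common words."""
    return {w for w in doc_df
            if doc_df[w] <= max_doc_df and dic_df.get(w, 0) <= max_dic_df}

doc_df = {"economy": 12000, "stadium": 800, "growth": 4200}
dic_df = {"economy": 150, "stadium": 20, "growth": 450}
print(sorted(select_features(doc_df, dic_df)))  # → ['stadium']
```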

Page 16

Sense Classification : Naïve Bayes Classifier*

Algorithm: [the Naïve Bayes formula shown on the original slide is not reproduced in this transcript]

* Source: Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing.
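A compact sketch of a Naïve Bayes sense classifier with add-one smoothing, in the spirit of Manning and Schütze (1999); the class name and the toy training contexts are illustrative, not from the study.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes sense classifier with add-one smoothing."""

    def fit(self, contexts, senses):
        self.sense_counts = Counter(senses)          # sense priors
        self.word_counts = defaultdict(Counter)      # per-sense word counts
        self.vocab = set()
        for ctx, s in zip(contexts, senses):
            self.word_counts[s].update(ctx)
            self.vocab.update(ctx)
        return self

    def predict(self, context):
        total = sum(self.sense_counts.values())
        best, best_lp = None, float("-inf")
        for s, c in self.sense_counts.items():
            lp = math.log(c / total)                 # log prior
            denom = sum(self.word_counts[s].values()) + len(self.vocab)
            for w in context:                        # log likelihoods
                lp += math.log((self.word_counts[s][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = s, lp
        return best

clf = NaiveBayesWSD().fit(
    [["economy", "growth"], ["match", "stadium"], ["economy", "market"]],
    ["경기/economy", "경기/game", "경기/economy"])
print(clf.predict(["market", "growth"]))  # → 경기/economy
```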

Page 17

Experimental Environment(1/2)

Test collection: all 127,641 articles from three Korean daily newspapers for the year 2004, processed with a part-of-speech tagger and lexical analysis.
Evaluation metric: accuracy.

Page 18

Target Words for WSD

Word            No. of Senses  No. of Articles  Total Frequency
감자/Gamja                  2              622            1,115
경기/Gyeonggi               4           18,484           37,763
기간/Gigan                  2           11,255           15,803
신병/Sinbyeong              3              360              469
신장/Sinjang                4              703              952
연기/Yongi                  4            3,227            5,147
인도/Indo                   5            2,022            2,750
지구/Jigu                   2            4,017            9,372
지원/Jiwon                  3           12,577           21,320

Page 19

Evaluation of Automatic Sense Tagging

Dictionary Information-based Method, by rule (all target words):

No  Information Type      Total   Correct  Accuracy
1   Collocation           3,229     2,931    0.9077
2   Chinese characters       74        74    1.0000
3   Synonym               2,107     1,598    0.7584
    Antonym                 237       195    0.8228
    Related terms           846       791    0.9350
4   Derived words         1,078     1,071    0.9935
5   Definitions         128,520    60,810    0.5091
    SUM                 136,091    67,470    0.4958

Page 20

Results of Feature Selection, by word (all information types):

Word             Total   Correct  Accuracy
감자/Gamja         802       800    0.9975
경기/Gyeonggi    6,200     4,833    0.7795
기간/Gigan       2,128     1,271    0.5973
신병/Sinbyeong     299       265    0.8863
신장/Sinjang       653       471    0.7213
연기/Yongi       4,732     4,169    0.8810
인도/Indo        3,956     2,274    0.5748
지구/Jigu        2,207     2,124    0.9624
지원/Jiwon       4,826     1,870    0.3875
SUM             25,803    18,077    0.7006

Page 21

Results of Feature Selection, by rule (all target words):

No  Information Type      Total   Correct  Accuracy
1   Collocation           1,603     1,548    0.9657
2   Chinese characters       74        74    1.0000
3   Synonym               1,650     1,556    0.9430
    Antonym                 237       195    0.8228
    Related terms           846       791    0.9350
4   Derived words         1,078     1,071    0.9935
5   Definitions          20,315    12,842    0.6321
    SUM                  25,803    18,077    0.7006

Page 22

[Chart: auto-tagging accuracy by information type (Collocation, Chinese characters, Synonym, Antonym, Related terms, Derived words, Definitions), with and without feature selection.]

Page 23

Evaluation of Automatic Sense Tagging

Collocation Overlap-based Method, performance by threshold:

Rank      Total   Correct  Accuracy
Top 10    6,155     3,727    0.6055
Top 30    9,258     5,215    0.5633
Top 50   11,544     6,264    0.5426
Top 100  13,432     6,751    0.5026
All      19,436     7,796    0.4009

Page 24

[Chart: number of training instances produced by auto-tagging (Total, Correct) and accuracy, by rank threshold (Top 10, Top 30, Top 50, Top 100, All).]

Page 25

Auto-Tagging Results at Top 30, by target word:

Word            Main Source                  Total   Correct  Accuracy
감자/Gamja       Definitions                    273       251    0.9194
경기/Gyeonggi    Definitions                  3,540     2,951    0.8336
기간/Gigan       Synonym, Definitions         1,205       365    0.3029
신병/Sinbyeong   Definitions                    112        67    0.5982
신장/Sinjang     Definitions                    101        77    0.7624
연기/Yongi       Definitions                    520       435    0.8365
인도/Indo        Antonym, Definitions           277       195    0.7040
지구/Jigu        Definitions                    609       546    0.8966
지원/Jiwon       Related words, Definitions   2,621       328    0.1251
SUM                                           9,258     5,215    0.5633

Page 26

Auto-Tagging Results at Top 30, by information type (all target words):

Information Type   Total   Correct  Accuracy
Synonym              544       402    0.7390
Antonym              129       119    0.9225
Related terms        230       166    0.7217
Definitions        8,355     4,528    0.5420
SUM                9,258     5,215    0.5633

Page 27

Comparison of Two Auto Tagging Methods

[Chart: auto-tagging accuracy by target word, Dictionary Information-based Method vs. Collocation Overlap-based Method.]

Page 28

Build a Classifier

Training set: 600 instances; test set: the rest. Window size: 50-byte length.
Rule for building the training set: the automatic sense tagging contains errors, so to reduce them and improve the tagging accuracy of the training set, the information types with the highest accuracy are used first.

Page 29

Sense Classification, Dictionary Information-based Method (all information types):

Word             Total   Correct  Accuracy
감자/Gamja       1,270     1,139    0.8966
경기/Gyeonggi   37,763    30,897    0.8182
기간/Gigan      15,803     9,278    0.5871
신병/Sinbyeong     469       386    0.8230
신장/Sinjang       953       671    0.7043
연기/Yongi       5,147     4,302    0.8359
인도/Indo        2,750     1,212    0.4408
지구/Jigu        9,375     8,373    0.8932
지원/Jiwon      21,321     8,345    0.3914
SUM             94,851    64,604    0.6811

Page 30

Sense Classification, Collocation Overlap-based Method, by rank:

Rank      Total   Correct  Accuracy
Top 10   94,851    57,201    0.6031
Top 30   94,851    58,891    0.6209
Top 50   94,851    56,871    0.5996
Top 100  94,851    56,916    0.6001
All      94,851    53,218    0.5611

Page 31

Sense Classification, Collocation Overlap-based Method, by target word:

Word             Total   Correct  Accuracy
감자/Gamja       1,270     1,016    0.8000
경기/Gyeonggi   37,763    30,787    0.8153
기간/Gigan      15,803    10,239    0.6479
신병/Sinbyeong     469       290    0.6183
신장/Sinjang       953       622    0.6527
연기/Yongi       5,147     2,834    0.5507
인도/Indo        2,750     1,683    0.6118
지구/Jigu        9,375     8,499    0.9065
지원/Jiwon      21,321     2,922    0.1371
SUM             94,851    58,891    0.6209

Page 32

Comparison of Two Sense Classifications

[Chart: sense-classification accuracy by target word, Dictionary Information-based Method vs. Collocation Overlap-based Method.]

Page 33

Data Fusion of the Two Auto-Tagging Methods

Dictionary Information-based Method: uses all information types except definitions.
Collocation Overlap-based Method: uses only the Top 10 threshold.
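The slides do not spell out how the two taggers' outputs are merged when they disagree; the sketch below assumes the higher-accuracy dictionary-based tag takes priority and the collocation-overlap tag fills in occurrences the first method missed. The occurrence ids and sense names are hypothetical.

```python
def fuse_tags(dict_tags, colloc_tags):
    """Data-fusion sketch: merge sense tags from the two auto-tagging
    methods. Both inputs map occurrence ids to senses; on conflict,
    the dictionary-based tag wins (assumed priority, not stated in
    the original slides)."""
    fused = dict(colloc_tags)   # start with collocation-overlap tags
    fused.update(dict_tags)     # dictionary-based tags take priority
    return fused

dict_tags = {1: "sense_a", 3: "sense_b"}
colloc_tags = {2: "sense_a", 3: "sense_c"}
print(fuse_tags(dict_tags, colloc_tags))
```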

Page 34

Results of the Auto-Tagging Method in Data Fusion, by word:

Word             Total   Correct  Accuracy
감자/Gamja         503       485    0.9642
경기/Gyeonggi    3,086     2,582    0.8367
기간/Gigan       2,189     1,507    0.6884
신병/Sinbyeong      96        56    0.5833
신장/Sinjang       305       271    0.8885
연기/Yongi         950       888    0.9347
인도/Indo          367       273    0.7439
지구/Jigu          939       918    0.9776
지원/Jiwon       2,917     1,657    0.5680
SUM             11,352     8,637    0.7608

Page 35

Results of the Auto-Tagging Method in Data Fusion, by information type (all target words):

No  Information Type      Total   Correct  Accuracy
1   Collocation           1,603     1,548    0.9657
2   Chinese characters       74        74    1.0000
3   Synonym               2,064     1,856    0.8992
    Antonym                 336       290    0.8631
    Related terms           978       907    0.9274
4   Derived words         1,078     1,071    0.9935
5   Definitions           5,219     2,891    0.5539
    SUM                  11,352     8,637    0.7608

Page 36

Comparison of the Three Auto-Tagging Methods

Auto-Tagging Method                    Total   Correct  Accuracy
Dictionary Information-based Method   25,803    18,077    0.7006
Collocation Overlap-based Method       9,258     5,215    0.5633
Fusion Method                         11,352     8,637    0.7608

Page 37

Sense Classification in Data Fusion, by word:

Word             Total   Correct  Accuracy
감자/Gamja       1,270     1,087    0.8559
경기/Gyeonggi   37,763    32,128    0.8508
기간/Gigan      15,803    13,055    0.8261
신병/Sinbyeong     469       121    0.2580
신장/Sinjang       953       702    0.7366
연기/Yongi       5,147     4,437    0.8621
인도/Indo        2,750     1,251    0.4547
지구/Jigu        9,375     8,205    0.8752
지원/Jiwon      21,321    11,251    0.5277
SUM             94,851    72,237    0.7616

Page 38

Comparison of the Three WSD Methods

WSD Method                             Total   Correct  Accuracy  Improvement (%)
Fusion Method                         94,851    72,237    0.7616        -
Dictionary Information-based Method   94,851    64,604    0.6811       11.82
Collocation Overlap-based Method      94,851    58,891    0.6209       22.66

Page 39

Conclusion (1/2)

The performance of the automatic tagging technique differs depending on the type of information source in the dictionary.
For frequently occurring keywords extracted from the dictionary, a feature selection method needs to be applied.

Page 40

Conclusion (2/2)

The word sense disambiguation model using the automatic tagging method based on dictionary information showed performance comparable to a supervised learning method using manual tagging.
The WSD model using a data fusion technique combining the two automatic tagging methods outperforms either single tagging method.

Page 41

Q&A

