An Effective Word Sense Disambiguation Model Using Automatic Sense
Tagging Based on Dictionary Information
Yong-Gu Lee
Contents
- Introduction
- Related Works
- Research Goals
- Effective Word Sense Disambiguation Model and Evaluation
- Conclusion
Introduction
Word Sense Disambiguation (WSD): the problem of selecting a sense for a word from a set of predefined possibilities. WSD is an "intermediate task": not an end in itself, but necessary at one level or another. It is clearly essential for language understanding applications:
- Machine translation
- Information retrieval and hypertext navigation
- Content and thematic analysis
- Speech processing and text processing
Related Works (1/3)
Approaches to WSD:
- Knowledge-based disambiguation: use of external lexical resources such as dictionaries and thesauri; discourse properties
- Corpus-based disambiguation
- Hybrid disambiguation
Related Works (2/3)
Corpus-based disambiguation:
- Supervised disambiguation: based on a labeled training set. The learning system has a training set of feature-encoded inputs AND their appropriate sense labels (categories).
- Unsupervised disambiguation: based on unlabeled corpora. The learning system has a training set of feature-encoded inputs BUT NOT their appropriate sense labels (categories).
Related Works (3/3)
Lexical resources for WSD:
- Machine-readable format
  - Machine-Readable Dictionaries (MRDs): Longman, Oxford, etc.
  - Thesauri and semantic networks: Roget's Thesaurus, WordNet, etc.
- Sense-tagged data
  - Senseval-1, 2, 3 (www.senseval.org): provides sense-annotated data for many languages and several tasks. Languages: English, Romanian, Chinese, Basque, Spanish, etc. Tasks: lexical sample, all words, etc.
  - SemCor, Hector, etc.
Research Motivation
- Manual sense tagging is labor-intensive and high-cost.
- Limited availability of sense-tagged corpora: except for English, other languages have few corpora for WSD.
- Limited coverage of sense-tagged words: some corpora tag the senses of only one or a few words (the "line" corpus, the "interest" corpus, etc.).
- With a supervised disambiguation method, only the words that appear in the sense-tagged corpus can be disambiguated.
Research Goals
- Minimize or eliminate the cost of manual labeling: automatic sense tagging using an MRD and heuristic rules.
- Improve the performance of word sense disambiguation: using supervised disambiguation with a Naïve Bayes classifier.
Effective Word Sense Disambiguation Model
- Automatic Tagging Technique
- Experimental Environment
- Evaluation of Automatic Tagging Technique
- Evaluation of Sense Classification
- Evaluation of Fusion Method
An Outline Diagram for the Proposed Research
[Diagram, two stages. (1) Automatic sense tagging and training: the dictionary feeds key word extraction and sense tagging in the auto sense tagging module; collocations are extracted from the collection; the context of the target word is extracted to build the training set. (2) Sense classification: a Naïve Bayes classifier trained on that set classifies word senses on a test context for evaluation.]
Automatic Tagging Technique
- Dictionary Information-based Method
- Collocation Overlap-based Method
- Data Fusion Method: Dictionary Information-based Method + Collocation Overlap-based Method
Dictionary Information-based Method (1/2)
Extract the necessary information from the dictionary.
- Heuristic 1: One sense per collocation / one sense per discourse. Examples: "telephone line"; 景氣展望/Gyeonggi-jeonmang (economic prospect).
- Heuristic 2: Use of corresponding Chinese characters. Example: 감자/Gamja: 柑子 (potato) / 減資 (reduction of capital).
- Heuristic 3: Co-occurrence of synonyms, antonyms, and related terms.
- Heuristic 4: Occurrence of derived words.
Dictionary Information-based Method (2/2)
- Heuristic 5: Co-occurrence of key features extracted from the definition of the target word's dictionary entry, as in Lesk (1986).
Algorithm:
1. Retrieve from the MRD all sense definitions of the word to be disambiguated.
2. Determine the overlap between each sense definition and the current context.
3. Choose the sense with the highest overlap.
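Heuristic 5 is essentially the simplified Lesk algorithm. A minimal sketch in Python (the sense names and definitions below are hypothetical English stand-ins, not the actual Korean MRD data):

```python
def lesk(context_words, sense_definitions):
    """Pick the sense whose definition overlaps most with the context.

    context_words: iterable of words around the target word.
    sense_definitions: dict mapping sense id -> definition text.
    """
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        overlap = len(context & set(definition.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical stand-in for an MRD entry of "line"
senses = {
    "line.cord":  "a thin rope cord wire used for telephone communication",
    "line.queue": "a row of people waiting one behind another",
}
print(lesk("she waited in a row of people".split(), senses))
```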
Collocation Overlap-based Method
A semantic similarity metric using collocation overlap.
Algorithm:
1. Retrieve keywords from all MRD sense definitions of the word to be disambiguated.
2. Extract the collocation words of those keywords from the test collection, cut off at a threshold.
3. Extract the collocation words of the target word from the test collection.
4. Determine the overlap between the collocation sets from steps 2 and 3.
5. Choose the sense with the highest overlap.
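The steps above can be sketched as follows, assuming the co-occurrence counts for the target word and for each sense's keywords have already been collected from the test collection (function and variable names are illustrative):

```python
from collections import Counter

def top_collocates(cooc_counts, threshold):
    """Keep the `threshold` most frequent co-occurring words (e.g. Top30)."""
    return {w for w, _ in Counter(cooc_counts).most_common(threshold)}

def collocation_overlap_sense(target_cooc, sense_keyword_cooc, threshold=30):
    """Choose the sense whose keywords' collocates overlap most with the
    target word's collocates (steps 2-5 above).

    target_cooc: dict word -> co-occurrence count with the target word.
    sense_keyword_cooc: dict sense id -> dict of co-occurrence counts
                        for that sense's definition keywords.
    """
    target_set = top_collocates(target_cooc, threshold)
    scores = {
        sense: len(target_set & top_collocates(cooc, threshold))
        for sense, cooc in sense_keyword_cooc.items()
    }
    return max(scores, key=scores.get)
```

The threshold corresponds to the Top10/Top30/... cutoffs evaluated later in the slides.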
Feature Selection
By document frequency: document frequency in the test collection (docDF) and in the dictionary definitions treated as documents (dicDF). Keep only features with docDF <= 5000 and dicDF <= 300.
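A sketch of this document-frequency cutoff, with the thresholds from above (the docDF/dicDF tables are represented as plain dicts):

```python
def select_features(terms, doc_df, dic_df, max_doc_df=5000, max_dic_df=300):
    """Drop overly frequent terms: keep a term only if its document
    frequency in the test collection (docDF) and in the dictionary
    definitions treated as documents (dicDF) are both within the cutoffs."""
    return [
        t for t in terms
        if doc_df.get(t, 0) <= max_doc_df and dic_df.get(t, 0) <= max_dic_df
    ]
```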
Sense Classification: Naïve Bayes Classifier*
Decision rule (Manning and Schütze, 1999): choose the sense

s' = argmax_{s_k} [ log P(s_k) + Σ_{v_j in context} log P(v_j | s_k) ]

where the prior P(s_k) and the feature likelihoods P(v_j | s_k) are estimated from the automatically tagged training set.
* Source: Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing.
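A compact illustration of such a classifier over bag-of-words contexts, with add-one smoothing (a common choice; the slides do not specify the smoothing used):

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes sense classifier over bag-of-words contexts
    (Manning & Schutze 1999 style), with add-one smoothing."""

    def fit(self, contexts, sense_labels):
        self.sense_counts = Counter(sense_labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for ctx, s in zip(contexts, sense_labels):
            self.word_counts[s].update(ctx)
            self.vocab.update(ctx)
        return self

    def predict(self, context):
        n = sum(self.sense_counts.values())
        v = len(self.vocab)

        def log_score(s):
            total = sum(self.word_counts[s].values())
            logp = math.log(self.sense_counts[s] / n)  # log prior
            for w in context:                          # log likelihoods
                logp += math.log((self.word_counts[s][w] + 1) / (total + v))
            return logp

        return max(self.sense_counts, key=log_score)
```

Working in log space avoids underflow when many context words are multiplied together.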
Experimental Environment (1/2)
Test collection: all 127,641 articles from three Korean daily newspapers for the year 2004. A part-of-speech tagger and lexical analysis are applied.
Evaluation measure: accuracy.
Target Words for WSD

Word             No. of Senses   No. of Articles   Total Frequency
감자 /Gamja             2               622             1,115
경기 /Gyeonggi          4            18,484            37,763
기간 /Gigan             2            11,255            15,803
신병 /Sinbyeong         3               360               469
신장 /Sinjang           4               703               952
연기 /Yongi             4             3,227             5,147
인도 /Indo              5             2,022             2,750
지구 /Jigu              2             4,017             9,372
지원 /Jiwon             3            12,577            21,320
Evaluation of Automatic Sense Tagging
Dictionary Information-based Method, by rule (all target words):

No   Information Type       Total     Correct   Accuracy
1    Collocation            3,229       2,931     0.9077
2    Chinese characters        74          74     1.0000
3    Synonym                2,107       1,598     0.7584
     Antonym                  237         195     0.8228
     Related terms            846         791     0.9350
4    Derived words          1,078       1,071     0.9935
5    Definitions          128,520      60,810     0.5091
     Sum                  136,091      67,470     0.4958
Results of Feature Selection - Words (all information types):

Word             Total     Correct   Accuracy
감자 /Gamja         802        800     0.9975
경기 /Gyeonggi     6,200      4,833     0.7795
기간 /Gigan        2,128      1,271     0.5973
신병 /Sinbyeong      299        265     0.8863
신장 /Sinjang        653        471     0.7213
연기 /Yongi        4,732      4,169     0.8810
인도 /Indo         3,956      2,274     0.5748
지구 /Jigu         2,207      2,124     0.9624
지원 /Jiwon        4,826      1,870     0.3875
Sum               25,803     18,077     0.7006
Results of Feature Selection - Rule (all target words):

No   Information Type      Total     Correct   Accuracy
1    Collocation           1,603      1,548     0.9657
2    Chinese characters       74         74     1.0000
3    Synonym               1,650      1,556     0.9430
     Antonym                 237        195     0.8228
     Related terms           846        791     0.9350
4    Derived words         1,078      1,071     0.9935
5    Definitions          20,315     12,842     0.6321
     Sum                  25,803     18,077     0.7006
[Chart: auto-tagging accuracy by information type (collocation, Chinese characters, synonym, antonym, related terms, derived words, definitions), with and without feature selection.]
Evaluation of Automatic Sense Tagging
Collocation Overlap-based Method, performance by threshold:

Rank      Total     Correct   Accuracy
Top10     6,155      3,727     0.6055
Top30     9,258      5,215     0.5633
Top50    11,544      6,264     0.5426
Top100   13,432      6,751     0.5026
All      19,436      7,796     0.4009
[Chart: number of training instances produced by auto-tagging, and total/correct/accuracy, by rank cutoff (Top10, Top30, Top50, Top100, All).]
Auto Tagging Result of Top 30
By target word:

Word             Main Source                  Total    Correct   Accuracy
감자 /Gamja       Definitions                    273      251      0.9194
경기 /Gyeonggi    Definitions                  3,540    2,951      0.8336
기간 /Gigan       Synonym, Definitions         1,205      365      0.3029
신병 /Sinbyeong   Definitions                    112       67      0.5982
신장 /Sinjang     Definitions                    101       77      0.7624
연기 /Yongi       Definitions                    520      435      0.8365
인도 /Indo        Antonym, Definitions           277      195      0.7040
지구 /Jigu        Definitions                    609      546      0.8966
지원 /Jiwon       Related words, Definitions   2,621      328      0.1251
Sum                                            9,258    5,215      0.5633
Auto Tagging Result of Top 30
By information type (all target words):

Information Type   Total    Correct   Accuracy
Synonym              544      402      0.7390
Antonym              129      119      0.9225
Related terms        230      166      0.7217
Definitions        8,355    4,528      0.5420
Sum                9,258    5,215      0.5633
Comparison of Two Auto Tagging Methods
[Chart: auto-tagging accuracy by target word for the Dictionary Information-based Method and the Collocation Overlap-based Method.]
Build a Classifier
Training set: 600 instances; test set: the rest. Window size: 50 bytes.
Rule for building the training set: the automatic sense tagging contains errors, so to reduce them and improve the tagging accuracy of the training set, information types with higher accuracy are used first.
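One way to implement this rule is to fill the training set from the most accurate information type downward until the size limit is reached. A sketch under that assumption (the priority ordering would come from the accuracy tables above):

```python
def build_training_set(tagged_by_type, priority, limit=600):
    """Fill the training set from the most reliable information type
    downward, stopping at `limit` instances (the slide's training size).

    tagged_by_type: dict info_type -> list of (context, sense) pairs
                    produced by the auto-tagger.
    priority: info types ordered by their measured tagging accuracy,
              best first.
    """
    train = []
    for info_type in priority:
        for example in tagged_by_type.get(info_type, []):
            if len(train) >= limit:
                return train
            train.append(example)
    return train
```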
Sense Classification - Dictionary Information-based Method (all information types):

Word             Total     Correct   Accuracy
감자 /Gamja        1,270      1,139     0.8966
경기 /Gyeonggi    37,763     30,897     0.8182
기간 /Gigan       15,803      9,278     0.5871
신병 /Sinbyeong      469        386     0.8230
신장 /Sinjang        953        671     0.7043
연기 /Yongi        5,147      4,302     0.8359
인도 /Indo         2,750      1,212     0.4408
지구 /Jigu         9,375      8,373     0.8932
지원 /Jiwon       21,321      8,345     0.3914
Sum               94,851     64,604     0.6811
Sense Classification - Collocation Overlap-based Method
By rank:

Rank      Total    Correct   Accuracy
Top10    94,851    57,201     0.6031
Top30    94,851    58,891     0.6209
Top50    94,851    56,871     0.5996
Top100   94,851    56,916     0.6001
All      94,851    53,218     0.5611
Sense Classification - Collocation Overlap-based Method
By target word:

Word             Total     Correct   Accuracy
감자 /Gamja        1,270      1,016     0.8000
경기 /Gyeonggi    37,763     30,787     0.8153
기간 /Gigan       15,803     10,239     0.6479
신병 /Sinbyeong      469        290     0.6183
신장 /Sinjang        953        622     0.6527
연기 /Yongi        5,147      2,834     0.5507
인도 /Indo         2,750      1,683     0.6118
지구 /Jigu         9,375      8,499     0.9065
지원 /Jiwon       21,321      2,922     0.1371
Sum               94,851     58,891     0.6209
Comparison of Two Sense Classifications
[Chart: sense-classification accuracy by target word for the Dictionary Information-based Method and the Collocation Overlap-based Method.]
Data Fusion of Two Auto Tagging Methods
- Dictionary Information-based Method: use all information types except definitions.
- Collocation Overlap-based Method: use only the Top10 cutoff.
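A sketch of how the two tag sources might be merged under these restrictions (the data shapes and the precedence of dictionary tags over collocation tags are assumptions, not specified on the slide):

```python
def fuse_tags(dict_tags, colloc_tags):
    """Combine the two auto-tagging outputs as described above:
    keep dictionary-based tags from every information type except
    'definitions', and collocation-overlap tags from the Top10 cutoff.

    dict_tags: list of (instance_id, sense, info_type) triples.
    colloc_tags: dict rank_cutoff -> list of (instance_id, sense) pairs.
    """
    fused = {}
    for inst, sense, info_type in dict_tags:
        if info_type != "definitions":
            fused[inst] = sense
    for inst, sense in colloc_tags.get("Top10", []):
        fused.setdefault(inst, sense)  # assumed: dictionary tags win on conflict
    return fused
```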
Results of the Auto Tagging Method in Data Fusion - Words:

Word             Total     Correct   Accuracy
감자 /Gamja          503        485     0.9642
경기 /Gyeonggi      3,086      2,582     0.8367
기간 /Gigan         2,189      1,507     0.6884
신병 /Sinbyeong        96         56     0.5833
신장 /Sinjang         305        271     0.8885
연기 /Yongi           950        888     0.9347
인도 /Indo            367        273     0.7439
지구 /Jigu            939        918     0.9776
지원 /Jiwon         2,917      1,657     0.5680
Sum                11,352      8,637     0.7608
Results of the Auto Tagging Method in Data Fusion - Information Type (all target words):

No   Information Type      Total     Correct   Accuracy
1    Collocation           1,603      1,548     0.9657
2    Chinese characters       74         74     1.0000
3    Synonym               2,064      1,856     0.8992
     Antonym                 336        290     0.8631
     Related terms           978        907     0.9274
4    Derived words         1,078      1,071     0.9935
5    Definitions           5,219      2,891     0.5539
     Sum                  11,352      8,637     0.7608
Comparison of the Three Auto Tagging Methods

Auto Tagging Method                    Total     Correct   Accuracy
Dictionary Information-based Method   25,803     18,077     0.7006
Collocation Overlap-based Method       9,258      5,215     0.5633
Fusion Method                         11,352      8,637     0.7608
Sense Classification in Data Fusion - Words:

Word             Total     Correct   Accuracy
감자 /Gamja        1,270      1,087     0.8559
경기 /Gyeonggi    37,763     32,128     0.8508
기간 /Gigan       15,803     13,055     0.8261
신병 /Sinbyeong      469        121     0.2580
신장 /Sinjang        953        702     0.7366
연기 /Yongi        5,147      4,437     0.8621
인도 /Indo         2,750      1,251     0.4547
지구 /Jigu         9,375      8,205     0.8752
지원 /Jiwon       21,321     11,251     0.5277
Sum               94,851     72,237     0.7616
Comparison of Three WSD Methods

WSD Method                            Total     Correct   Accuracy   Improvement (%)
Fusion Method                        94,851     72,237     0.7616         -
Dictionary Information-based Method  94,851     64,604     0.6811        11.82
Collocation Overlap-based Method     94,851     58,891     0.6209        22.66

(Improvement: relative gain of the Fusion Method over each baseline.)
Conclusion (1/2)
The performance of the automatic tagging technique differed depending on the type of information source in the dictionary. For frequently used keywords extracted from the dictionary, a feature selection method needs to be applied.
Conclusion (2/2)
The word sense disambiguation model using the automatic tagging method based on dictionary information showed performance comparable to a supervised learning method trained on manually tagged data.
The WSD model using a data fusion technique combining the two automatic tagging methods outperforms the models using a single tagging method.
Q&A