Date post: | 14-Apr-2017 |
Category: |
Science |
Upload: | seid-hassen |
1
AMHARIC WORD SENSE DISAMBIGUATION USING WORDNET
By: Segid Hassen Yesuf
Advisor: Yaregal Assabie (PhD)
March 2015
Addis Ababa University
School of Graduate Studies
College of Natural Sciences
Department of Computer Science
2
Presentation outline
• Introduction
• Literature Review
• Related Works
• Design of Amharic Word Sense Disambiguation (AWSD)
• Experiment
• Conclusion
• Recommendations
3
INTRODUCTION
• Natural language processing (NLP) facilitates the implementation of natural-language interfaces to computer systems.
• Word sense disambiguation (WSD) is the task of automatically identifying the correct meaning of a word that has multiple meanings; equivalently, it is the problem of selecting a sense for a word from a set of predefined possibilities.
• Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
• Words in the Amharic language often correspond to different meanings in different contexts.
4
INTRODUCTION
Statement of the Problem
The development of Amharic WSD is essential to the development of various NLP applications.
A few researchers have attempted to develop Amharic WSD using machine-learning approaches. However, the previous studies have the following limitations:
• The experiments were limited to five ambiguous words, training data was required, and no morphological processor was used.
• Available sense-annotated corpora are largely insufficient to cover all the senses of each of the target words.
• A corpus was used as the source of information for disambiguation, and corpus-based approaches suffer from the knowledge acquisition bottleneck.
• The Amharic WSD systems developed so far require manually labeled sense examples for every word sense.
5
INTRODUCTION (Cont’d)
Statement of the Problem
• In our study, Amharic WordNet is used as the source of information for disambiguation.
• The knowledge-based Amharic WSD method allows the system to disambiguate words in running text.
• AWSD avoids the need for sense-annotated data.
• It does not require any manually annotated training corpus but makes extensive use of the lexical database WordNet for disambiguation.
6
INTRODUCTION
Objectives
• General Objective:
- To design and develop a system for Amharic word sense disambiguation using Amharic WordNet.
• Specific Objectives:
- Conduct a literature review and study related works to understand the approaches to WSD.
- Collect data from an Amharic dictionary to develop Amharic WordNet.
- Identify ambiguous Amharic words and their contextual meanings.
7
INTRODUCTION (Cont’d)
Objectives
- Develop Amharic WordNet.
- Design an algorithm for Amharic word sense disambiguation.
- Develop a prototype of the system.
- Test and evaluate the performance of the developed system.
8
INTRODUCTION
Scope and Limitation of the Study
• Identifying the ambiguous word and its context in the given text.
• Retrieving the senses of ambiguous words from Amharic WordNet.
• The limitation of the study is that the system does not perform grammar or spelling correction and does not work for words that do not exist in Amharic WordNet.
9
LITERATURE REVIEW
• The task of WSD is to assign a sense to an instance of a
polysemous word in a particular context.
• WSD involves two steps.
– The first step is to determine all the different senses for every word relevant
to the text or discourse under consideration
– The second step involves a means to assign the appropriate sense to each
occurrence of a word in a context.
• WSD analyzes word tokens in context and specifies exactly which of a word's several senses is being used.
10
LITERATURE REVIEW
WSD Approaches
•Knowledge Based Approaches
– Rely on knowledge resources like WordNet, Thesaurus etc.
– May use grammar rules for disambiguation.
– May use hand coded rules for disambiguation
•Machine Learning Based Approaches
– Rely on corpus evidence.
– Train a model using a tagged or untagged corpus.
– Probabilistic/statistical models.
• Hybrid Approaches
– Use corpus evidence as well as semantic relations from WordNet.
11
WSD Approaches
Amharic Language
• There are six types of ambiguity in the Amharic language:
– Lexical ambiguity
– Semantic ambiguity
– Phonological ambiguity
– Structural ambiguity
– Referential ambiguity
– Orthographic ambiguity
• Not well studied for NLP.
12
RELATED WORKS
Author/Year | Approach | Performance | Algorithm
Teshome Kassie, 2008 | Cosine distance similarity measures from thesaurus | 58%-82% average precision and recall, respectively | kNN
Solomon Mekonnen, 2010 | Supervised | 70%-83% accuracy | Naïve Bayes
Solomon Assemu, 2011 | Unsupervised | 51.9%-79.4% | K-means, agglomerative, EM
Getahun Wassie, 2012 | Semi-supervised | 47%-83% | Naïve Bayes, EM, ADTrees, bagging
Tesfa Kebede, 2013 | Supervised | 79% accuracy | Naïve Bayes
13
RELATED WORKS
Proposed Solution
• The related works have shown that systems based on knowledge-based approaches perform better.
• A knowledge-based approach is preferable because it does not require training and is capable of disambiguating arbitrary text.
• The proposed solution resolves ambiguous words using contextual information found in the sentence.
14
Architecture
Input Sentence
Tokenization
Normalization
Stop Word Removal
Morphological Analysis
Root Word
Amharic WordNet
Ambiguous Word Identification
Context Selection
Sense Selection
Sense Retrieval
Disambiguated Word Sense
15
DESIGN OF AWSD
Amharic WordNet
• WordNet is a network of words linked by lexical and semantic relations.
• It is a combination of dictionary and thesaurus, created and maintained by the Cognitive Science Laboratory of Princeton University.
• WordNet groups words into sets of cognitive synonyms (synsets), each expressing a distinct concept.
• Synsets are interlinked by means of conceptual-semantic and lexical relations.
• The structure of the Amharic WordNet is based on the principles of the English WordNet.
16
DESIGN OF AWSD
Amharic WordNet
• Amharic WordNet is a lexical database that groups Amharic words into sets of synonyms called synsets.
• Words are grouped together according to the similarity of their meanings.
• It contains open-class words: nouns, verbs, adjectives, and adverbs.
• It is used as an input resource for building AWSD.
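The synset grouping described above can be pictured with a small in-memory structure. This is only a sketch under stated assumptions: the class names `Synset` and `ToyWordNet`, the field names, and the decision to treat ካራ and ሰንጢ (listed later in the deck under “ቢላዋ”) as members of one synset are illustrative choices, not the schema of the actual Amharic WordNet.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One sense: a set of synonymous words plus a gloss (definition)."""
    words: list[str]
    gloss: str
    pos: str = "noun"  # open-class categories only: noun, verb, adjective, adverb
    related: list[str] = field(default_factory=list)  # hypothetical related-word links


class ToyWordNet:
    """Hypothetical index from each word to the synsets that contain it."""

    def __init__(self):
        self._index: dict[str, list[Synset]] = {}

    def add(self, synset: Synset) -> None:
        # Register the synset under every one of its member words.
        for word in synset.words:
            self._index.setdefault(word, []).append(synset)

    def senses(self, word: str) -> list[Synset]:
        # Words missing from the database simply have no senses.
        return self._index.get(word, [])


wn = ToyWordNet()
wn.add(Synset(
    words=["ቢላዋ", "ካራ", "ሰንጢ"],  # grouping these three is an assumption
    gloss="በአንድ ጎኑ ስለት ያለው ለስጋ ለሽንከርት መቁረጫ ፥መክተፊያ የሚያገለግል የወጥ ቤትእቃ",
))
```

Looking up any member word returns the same synset object, which is the property the later components rely on when they count or retrieve senses.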
18
DESIGN OF AWSD
Preprocessing
• Preprocessing prepares the input sentence in a format suitable for morphological analysis.
• Tokenization: takes the input text supplied by the user and splits it into a sequence of tokens.
• Normalization: identifies Amharic characters that have the same use and pronunciation and replaces them with a single form.
• Stop word removal: filters out stop words before or after the processing of natural language data.
19
DESIGN OF AWSD
Preprocessing
• Consider the following sentence:
ልጁ ጉንፋን ስለያዘው ሳለ።
• After the preprocessing component is applied to the input sentence, the output is:
ልጁ,ጉንፋን,ስለያዘው,ሳለ
• The preprocessing component passes this output to the morphological analysis component.
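The three preprocessing steps can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the normalization map covers only a handful of base-form homophone characters, and the stop-word list is a two-word placeholder; both are assumptions made for the example.

```python
import re

# Illustrative subset of same-sounding characters mapped to one form (assumption).
NORMALIZATION_MAP = str.maketrans({"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ዐ": "አ", "ፀ": "ጸ"})

# Hypothetical stop-word list; the real list is not given in the slides.
STOP_WORDS = {"እና", "ነው"}

# Split on whitespace and common Ethiopic punctuation marks.
TOKEN_SEPARATORS = re.compile(r"[\s።፣፤፥፡]+")


def tokenize(text):
    """Split raw text into tokens, dropping empty strings."""
    return [t for t in TOKEN_SEPARATORS.split(text) if t]


def normalize(token):
    """Replace variant characters with their chosen canonical form."""
    return token.translate(NORMALIZATION_MAP)


def preprocess(text):
    """Tokenize, normalize, then remove stop words."""
    tokens = [normalize(t) for t in tokenize(text)]
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("ልጁ ጉንፋን ስለያዘው ሳለ።"))
# ['ልጁ', 'ጉንፋን', 'ስለያዘው', 'ሳለ']
```

For the example sentence, the result matches the token list shown on the slide, since none of its words are stop words or contain variant characters from the sample map.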
20
DESIGN OF AWSD
Morphological Analysis
• Morphological analysis is the segmentation of words into their component morphemes.
• It is used to reduce the various surface forms of a word to a single root word.
• Variants such as ሳለ, ተሳለ, መሳል, አሳለው, ሳል, ሳሉ, ሳለች, ተሳሉ, and አሳሳሉ are reduced to their root word ስ-እ-ል.
21
Morphological Analysis
• It is practically impossible to store all possible word forms in Amharic WordNet.
• Therefore, the morphological analyzer reduces the memory needed for storing words.
• From the tokens produced by the preprocessing component, morphological analysis produces a root word for each word, i.e. ልጅ,ጉንፋን,ይ-እ-ዝ,ስ-እ-ል
• Morphological analysis is important for morphologically complex languages.
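To make the data flow concrete, the sketch below stands in for the morphological analyzer with a plain lookup table. It is not a morphological analyzer and does not call HornMorpho; the table contains only the surface forms and roots that appear on the slides, and the names `ROOTS`, `to_root`, and `analyze` are hypothetical.

```python
# Lookup table built only from forms shown on the slides (a stand-in, not an analyzer).
ROOTS = {
    "ልጁ": "ልጅ",
    "ጉንፋን": "ጉንፋን",
    "ስለያዘው": "ይ-እ-ዝ",
}
for variant in ["ሳለ", "ተሳለ", "መሳል", "አሳለው", "ሳል", "ሳሉ", "ሳለች", "ተሳሉ", "አሳሳሉ"]:
    ROOTS[variant] = "ስ-እ-ል"


def to_root(token):
    """Return the root for a known surface form, else the token unchanged."""
    return ROOTS.get(token, token)


def analyze(tokens):
    """Map every preprocessed token to its root word."""
    return [to_root(t) for t in tokens]


print(analyze(["ልጁ", "ጉንፋን", "ስለያዘው", "ሳለ"]))
# ['ልጅ', 'ጉንፋን', 'ይ-እ-ዝ', 'ስ-እ-ል']
```

Storing one root instead of nine variants is the memory saving the slide refers to; a real analyzer derives the root by rule rather than by enumeration.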
22
DESIGN OF AWSD
Ambiguous Word Identification
• This component identifies the ambiguous word in the input sentence based on information provided by Amharic WordNet and morphological analysis.
• For each root word produced by the morphological analyzer (i.e. ልጅ,ጉንፋን,ይ-እ-ዝ,ስ-እ-ል), the number of its senses in Amharic WordNet is counted.
• The root word “ስ-እ-ል” has five senses in Amharic WordNet.
23
Ambiguous Word Identification
• Therefore, “ስ-እ-ል” is detected as an ambiguous word in the input sentence. Because “ስ-እ-ል” is the root of the word “ሳለ”, “ሳለ” is the ambiguous word.
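The identification rule amounts to counting senses per root and flagging any root with more than one. The sketch below assumes a simple dictionary of sense counts; the count of five for ስ-እ-ል comes from the slide, while the counts of one for the other roots are placeholder values chosen so that the example reproduces the slide's outcome.

```python
def find_ambiguous(roots, sense_counts):
    """Return the roots that have more than one sense in the lexicon."""
    return [r for r in roots if sense_counts.get(r, 0) > 1]


# Five senses for ስ-እ-ል as stated on the slide; the other counts are assumptions.
SENSE_COUNTS = {"ልጅ": 1, "ጉንፋን": 1, "ይ-እ-ዝ": 1, "ስ-እ-ል": 5}

print(find_ambiguous(["ልጅ", "ጉንፋን", "ይ-እ-ዝ", "ስ-እ-ል"], SENSE_COUNTS))
# ['ስ-እ-ል']
```

Roots absent from the dictionary get a count of zero and are never flagged, which mirrors the stated limitation that the system does not handle words missing from Amharic WordNet.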
24
DESIGN OF AWSD
Context Selection
• Context in WSD refers to the words surrounding the ambiguous word, which are used to decide its meaning.
• For example, the sentence “አበበ ቢላዋ ሳለ።” becomes “አበበ, ቢላዋ, ስ-እ-ል” after morphological analysis; the ambiguous word identification (AWI) component therefore identifies “ሳለ” as the ambiguous word.
• The context selection component selects “ቢላዋ” as the context of the ambiguous word “ሳለ” from the ontology-based related-word table in Amharic WordNet.
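One way to picture context selection is as a filter that keeps only the sentence words listed as related to the ambiguous root. The layout of the related-word table is not described in the slides, so the mapping below (ambiguous root to related word to sense index) is purely a hypothetical structure for illustration.

```python
# Hypothetical related-word table: ambiguous root -> {related word: sense index}.
RELATED_WORDS = {"ስ-እ-ል": {"ቢላዋ": 0}}


def select_context(roots, target_root, related_words):
    """Keep the sentence words that the table lists as related to the target root."""
    table = related_words.get(target_root, {})
    return [w for w in roots if w != target_root and w in table]


print(select_context(["አበበ", "ቢላዋ", "ስ-እ-ል"], "ስ-እ-ል", RELATED_WORDS))
# ['ቢላዋ']
```

Here “አበበ” is dropped because the sample table has no entry linking it to the target, leaving “ቢላዋ” as the only context word, as in the slide's example.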
25
DESIGN OF AWSD
Sense Selection
• This component identifies the possible senses of the ambiguous words in the given input sentence with the help of Amharic WordNet.
• It determines the intended sense of the ambiguous word.
• It takes the set of words output by the context selection (CS) and ambiguous word identification (AWI) components and identifies the sense of the ambiguous word using either sense overlap or ontology-based related words.
26
Sense Selection
• The senses of “ቢላዋ” are:
1. ካራ፣ሰንጢ
2. በአንድ ጎኑ ስለት ያለው ለስጋ ለሽንከርት መቁረጫ ፥መክተፊያ የሚያገለግል የወጥ ቤትእቃ
• The senses of “ሳለ” are:
1. ሞረ ደ ፥አሾለ፥መቁረጫ ጠርዝን አተባ ፥ ስለት አወጣ
2. ይህን ካደረክለኝ ይህን አደርግልሀለው በማለት ለአምላክ፡ለመላእክት የሚቀርብ ለመና፡ብፅኣት
3. ከጉሮሮ አየርን በሀይል ና በተደጋጋሚነት አስወጣ፡ ትክትክ አደረገ፥ኡህ ኡህ አለ
• Sense 1 of the word “ሳለ” and Sense 2 of the word “ቢላዋ” share two words, whereas the other senses have zero overlap, so the first sense is selected for the ambiguous word.
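The overlap count in this example can be reproduced with a simple word-set intersection over the glosses, in the spirit of the Lesk family of algorithms. The sketch below is an assumption about how the overlap might be computed (whitespace and punctuation splitting, no stemming, first sense wins ties); the glosses are copied from the slide as extracted.

```python
import re


def gloss_words(gloss):
    """Split a gloss into a set of words on whitespace and Ethiopic punctuation."""
    return {w for w in re.split(r"[\s።፣፤፥፡]+", gloss) if w}


def best_sense(target_glosses, context_glosses):
    """Return (index, score) of the target sense with the largest gloss overlap."""
    best_index, best_score = 0, -1
    for i, target_gloss in enumerate(target_glosses):
        target_set = gloss_words(target_gloss)
        # Score a sense by its best overlap with any sense of the context word.
        score = max((len(target_set & gloss_words(c)) for c in context_glosses), default=0)
        if score > best_score:  # strict comparison keeps the earliest sense on ties
            best_index, best_score = i, score
    return best_index, best_score


KNIFE_SENSES = [
    "ካራ፣ሰንጢ",
    "በአንድ ጎኑ ስለት ያለው ለስጋ ለሽንከርት መቁረጫ ፥መክተፊያ የሚያገለግል የወጥ ቤትእቃ",
]
TARGET_SENSES = [
    "ሞረ ደ ፥አሾለ፥መቁረጫ ጠርዝን አተባ ፥ ስለት አወጣ",
    "ይህን ካደረክለኝ ይህን አደርግልሀለው በማለት ለአምላክ፡ለመላእክት የሚቀርብ ለመና፡ብፅኣት",
    "ከጉሮሮ አየርን በሀይል ና በተደጋጋሚነት አስወጣ፡ ትክትክ አደረገ፥ኡህ ኡህ አለ",
]

print(best_sense(TARGET_SENSES, KNIFE_SENSES))
# (0, 2)
```

The two shared words are “ስለት” and “መቁረጫ”, so the first sense (index 0) wins with an overlap of two, in agreement with the slide.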
27
DESIGN OF AWSD
Sense Retrieval
• Sense retrieval is responsible for extracting the sense of the ambiguous word from Amharic WordNet.
• Therefore, the sense “ሞረ ደ ፥አሾለ፥መቁረጫ ጠርዝን አተባ ፥ ስለት አወጣ” of the ambiguous word is extracted from Amharic WordNet.
28
EXPERIMENT
Amharic WordNet
• Amharic WordNet contains 2000 words and 10,000 synsets.
• A test set of 200 randomly selected sentences containing ambiguous words from the knowledge base was created.
29
EXPERIMENT
Performance Evaluation Criteria
• Precision is the percentage of correctly disambiguated senses for the ambiguous words.
• It measures how much of the information the system returned is correct.
• Recall measures how much of the relevant information the system has extracted.
• It is the number of correctly disambiguated senses over the total number of senses.
• F1-measure is a single measure that combines recall and precision.
• Accuracy is defined as the proportion of instances that were disambiguated correctly.
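These criteria can be written as a short function. The sketch assumes the usual WSD convention that precision divides by the number of instances the system attempted, while recall divides by the total number of instances, and that F1 is the harmonic mean of the two; the counts in the usage line are made-up numbers for illustration, not results from the thesis.

```python
def precision_recall_f1(correct, attempted, total):
    """Compute precision, recall, and F1 from raw counts.

    correct   -- instances given the right sense
    attempted -- instances for which the system returned some sense (assumed denominator)
    total     -- all ambiguous instances in the test set
    """
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, for illustration only.
p, r, f1 = precision_recall_f1(correct=8, attempted=10, total=16)
print(round(p, 3), round(r, 3), round(f1, 3))
# 0.8 0.5 0.615
```

When the system attempts every instance, `attempted` equals `total`, and precision, recall, and accuracy coincide.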
30
EXPERIMENT
Morphological Analyzer
• A morphological analyzer improves the accuracy of AWSD.
• HornMorpho, the morphological analyzer developed by Gasser, does not work for all words. This affects the accuracy of AWSD.
• Gasser evaluated the morphological analyzer on 200 words and found that it failed on 2 of them.
• We do not know which words were used to evaluate the morphological analyzer.
32
EXPERIMENT
Test Results
• A small window size tends to result in high precision and low recall for all words.
• In terms of recall, if there are more words in the context, the chance that at least one of them is related to the ambiguous word is higher; hence an increased window size leads to higher recall.
• We achieved good results with a window size of 2.
33
Test Results
For example, in the sentence “አበበ ቢላዋ ሳለ” with window size = 2 and target word “ሳለ”, there is no word in the right context, so we assign the sense to the target word from the words on its left. This results in good precision for window size = 2.
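The window described here can be sketched as a slice around the target position. The function name `context_window` and its signature are illustrative assumptions; the window size counts words on each side of the target, as in the conclusion's “two-word window on each side”.

```python
def context_window(tokens, target_index, size):
    """Return up to `size` words on each side of the target, excluding the target."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left + right


print(context_window(["አበበ", "ቢላዋ", "ሳለ"], target_index=2, size=2))
# ['አበበ', 'ቢላዋ']
```

Because the target is the last word of the sentence, the right-hand slice is empty and the context consists of the two words on the left only.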
34
CONCLUSION AND RECOMMENDATIONS
Conclusion
• AWSD addresses the problem of automatically deciding the correct sense of an ambiguous word based on ontology-based related words and sense overlap.
• Knowledge-based methods rely on information that can be extracted from Amharic WordNet.
• Using a morphological analyzer together with Amharic WordNet enhances the accuracy of AWSD.
• There is no standard optimal context window size for Amharic; based on our experiment, we found that a two-word window on each side of the ambiguous word is enough for Amharic WSD.
35
Contribution Of The Thesis
• Proposing a new architecture for an Amharic word sense disambiguation system using a knowledge-based approach.
• Identifying the nature and patterns of Amharic ambiguous words and how they can be processed to be understood by a machine.
• Identifying and proposing the procedures, techniques, algorithms, and tools used for the development of Amharic word sense disambiguation.
36
CONCLUSION AND RECOMMENDATIONS (Cont’d)
Contribution Of The Thesis
• The development of Amharic WordNet.
• Preparing an encouraging environment for the development of other Amharic NLP studies that need WSD as a component in their work.
37
Recommendations
•Developing thesaurus and machine readable dictionaries
•Developing Amharic word Ontology
•Developing a hybrid approach
•A project to develop a full-fledged Amharic WordNet system
•A project to develop a full-fledged Amharic WSD system