
Amharic WSD using WordNet

Date post: 14-Apr-2017
Upload: seid-hassen

1

AMHARIC WORD SENSE DISAMBIGUATION USING WORDNET

By: Segid Hassen Yesuf

Advisor: Yaregal Assabie (PhD)

March 2015

Addis Ababa University

School of Graduate Studies

College of Natural Sciences

Department of Computer Science

2

Presentation outline

• Introduction

• Literature Review

• Related Works

• Design of Amharic Word Sense Disambiguation (AWSD)

• Experiment

• Conclusion

• Recommendations

3

INTRODUCTION

• NLP facilitates the implementation of natural-language-based interfaces to computer systems.

• Word sense disambiguation (WSD) is the task of automatically identifying the correct meaning of a word that has multiple meanings. Equivalently, it is the problem of selecting a sense for a word from a set of predefined possibilities.

• Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.

• Words in the Amharic language often correspond to different meanings in different contexts.

3

4

INTRODUCTION

Statement of the Problem

The development of Amharic WSD is essential to the development of various NLP applications.

A few researchers have attempted to develop Amharic WSD using machine-learning approaches. However, the previous studies have the following limitations:

• The study was limited to experiments on only five ambiguous words, required training data, and had no morphological processor.

• Available sense-annotated corpora are largely insufficient to cover all the senses of each of the target words.

• A corpus was used as the source of information for disambiguation, and corpus-based approaches suffer from the knowledge acquisition bottleneck.

• The Amharic WSD systems developed by researchers require manually labeled sense examples for every word sense.

5

INTRODUCTION (Cont’d)

Statement of the Problem

• In our study, Amharic WordNet is used as the source of information for disambiguation.

• The knowledge-based Amharic WSD method allows the system to disambiguate words in running text.

• AWSD avoids the need for sense-annotated data.

• It does not require any manually annotated training corpus but makes extensive use of the lexical database WordNet for disambiguation.

5

6

INTRODUCTION

Objectives

• General Objective:

- To design and develop a system for Amharic word sense disambiguation using Amharic WordNet.

• Specific Objectives:

- Conducting a literature review and a review of related works to understand the approaches to WSD

- Collecting data from an Amharic dictionary to develop Amharic WordNet

- Identifying ambiguous Amharic words and their contextual meanings

6

7

INTRODUCTION (Cont’d)

Objectives

-Developing Amharic WordNet

-Designing an algorithm for Amharic Word Sense Disambiguation

-Developing a prototype of the system.

-Testing and evaluating the performance of the developed system.

7


8

INTRODUCTION

Scope and Limitation of the Study

•Identifying the ambiguous word and its context in the given text.

•Retrieving senses of ambiguous words from Amharic WordNet.

• The limitation of the study is that it does not perform grammar or spelling correction and does not work for words that do not exist in Amharic WordNet.

8

9

LITERATURE REVIEW

• The task of WSD is to assign a sense to an instance of a

polysemous word in a particular context.

• WSD involves two steps.

– The first step is to determine all the different senses for every word relevant

to the text or discourse under consideration

– The second step involves a means to assign the appropriate sense to each

occurrence of a word in a context.

• WSD analyzes word tokens in context and specifies exactly which of a word's several senses is being used.

9

10

LITERATURE REVIEW

WSD Approaches

•Knowledge Based Approaches

– Rely on knowledge resources like WordNet, Thesaurus etc.

– May use grammar rules for disambiguation.

– May use hand coded rules for disambiguation

•Machine Learning Based Approaches

– Rely on corpus evidence.

– Train a model using tagged or untagged corpus.

– Probabilistic/Statistical models

• Hybrid Approaches

– Use corpus evidence as well as semantic relations from WordNet.

11

WSD Approaches

Amharic Language

• There are six types of ambiguity in the Amharic language:

– Lexical ambiguity

– Semantic ambiguity

– Phonological ambiguity

– Structural ambiguity

– Referential ambiguity

– Orthographic ambiguity

•Not well studied for NLP

11

12

RELATED WORKS

| Author/Year | Approach | Performance | Algorithm |
|---|---|---|---|
| Teshome Kassie, 2008 | Cosine distance similarity measures from thesaurus | 58%-82% average precision and recall, respectively | kNN |
| Solomon Mekonnen, 2010 | Supervised | 70%-83% accuracy | Naïve Bayes |
| Solomon Assemu, 2011 | Unsupervised | 51.9%-79.4% | K-means, agglomerative, EM |
| Getahun Wassie, 2012 | Semi-supervised | 47%-83% | Naïve Bayes, EM, ADTrees, bagging |
| Tesfa Kebede, 2013 | Supervised | 79% accuracy | Naïve Bayes |

12

13

RELATED WORKS

Proposed Solution

• The related works have shown that systems based on knowledge-based approaches perform better.

• A knowledge-based approach is preferable because it does not require training and is capable of disambiguating arbitrary text.

• The proposed solution resolves ambiguous words using contextual information found in the sentence.

13

14

Architecture

Input Sentence

Tokenization

Normalization

Stop Word Removal

Morphological Analysis

Root Word

Amharic WordNet

Ambiguous Word Identification

Context Selection

Sense Selection

Sense Retrieval

Disambiguated Word Sense

15

DESIGN OF AWSD

Amharic WordNet

• WordNet is a network of words linked by lexical and semantic relations.

• It is a combination of dictionary and thesaurus, created and maintained by the Cognitive Science Laboratory of Princeton University.

• WordNet groups words into sets of cognitive synonyms (synsets), each expressing a distinct concept.

• Synsets are interlinked by means of conceptual-semantic and lexical relations.

• The structure of the Amharic WordNet is based on the principles of the English WordNet.

15

16

DESIGN OF AWSD

Amharic WordNet

• Amharic WordNet is a lexical database that groups Amharic words into sets of synonyms called synsets.

• Words are grouped together according to the similarity of their meanings.

• It contains open-class words: nouns, verbs, adjectives, and adverbs.

• It can be used as input to build AWSD.

16

17

Amharic WordNet Database Schema

17

DESIGN OF AWSD

18

Preprocessing

• Preprocessing prepares the input sentence in a format suitable for morphological analysis.

• Tokenization: takes the input text supplied by a user and splits it into a sequence of tokens.

• Normalization: identifies Amharic characters that have the same use and pronunciation and replaces them with a single form.

• Stop word removal: filters out stop words, which are words removed before or after the processing of natural language data.

18

DESIGN OF AWSD

19

Preprocessing

•Consider the following Sentence:

ልጁ ጉንፋን ስለያዘው ሳለ።

• After the preprocessing component is applied to the given input sentence, the output will be:

ልጁ,ጉንፋን,ስለያዘው,ሳለ

• The preprocessing component passes this output to the morphological analysis component.
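The three preprocessing steps can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the homophone-folding rules (for example, mapping the ሠ series onto the ሰ series) and the two-word stop list are assumptions, since the slides do not list the actual rules or stop words, and all function names are hypothetical.

```python
import re

# Split on whitespace, Ethiopic punctuation (U+1361 to U+1368), and a few ASCII marks.
_TOKEN_SPLIT = re.compile(r"[\s\u1361-\u1368.,!?]+")


def _series_map(src_base, dst_base):
    """Map the seven orders of one Ethiopic character series onto another."""
    return {src_base + i: dst_base + i for i in range(7)}


# ASSUMPTION: fold commonly interchanged homophone series onto one form
# (ሐ and ኀ -> ሀ, ሠ -> ሰ, ዐ -> አ, ፀ -> ጸ). The slides do not give the rules used.
_NORMALIZE = {}
for _src, _dst in [(0x1210, 0x1200), (0x1280, 0x1200), (0x1220, 0x1230),
                   (0x12D0, 0x12A0), (0x1340, 0x1338)]:
    _NORMALIZE.update(_series_map(_src, _dst))

# ASSUMPTION: placeholder stop list; the real list is not shown in the slides.
STOP_WORDS = {"እና", "ነው"}


def tokenize(text):
    """Split raw text into tokens, dropping punctuation and empty strings."""
    return [t for t in _TOKEN_SPLIT.split(text) if t]


def normalize(token):
    """Replace homophone characters with their canonical form."""
    return token.translate(_NORMALIZE)


def preprocess(sentence):
    """Tokenize, normalize, then remove stop words."""
    tokens = [normalize(t) for t in tokenize(sentence)]
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(",".join(preprocess("ልጁ ጉንፋን ስለያዘው ሳለ።")))
```

Applied to the example sentence, the sketch yields the same four tokens shown on the slide, because none of them contains a folded character or a stop word.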

19

DESIGN OF AWSD

20

DESIGN OF AWSD

Morphological Analysis

• Morphological analysis is the segmentation of words into their component morphemes.

• It is used to reduce the various surface forms of a word to a single root word.

• Variants such as ሳለ, ተሳለ, መሳል, አሳለው, ሳል, ሳሉ, ሳለች, ተሳሉ, and አሳሳሉ are reduced to their root word ስ-እ-ል.

20

21

Morphological Analysis

•It is practically impossible to store all possible words in Amharic

WordNet

• Therefore, the morphological analyzer reduces the memory needed for storing words.

• From the tokens produced by the preprocessing component, morphological analysis produces the root word for each word, i.e., ልጅ, ጉንፋን, ይ-እ-ዝ, ስ-እ-ል.

• This is important for morphologically complex languages.

22

DESIGN OF AWSD

Ambiguous Word Identification

• It identifies the ambiguous word in the input sentence based on the information provided by Amharic WordNet and morphological analysis.

• For each root word produced by the morphological analyzer (i.e., ልጅ, ጉንፋን, ይ-እ-ዝ, ስ-እ-ል), the number of its senses in Amharic WordNet is counted.

• The root word “ስ-እ-ል” appears with its senses five times in Amharic WordNet.

22

23

Ambiguous Word Identification

• As a result, “ስ-እ-ል” is detected as an ambiguous word in the input sentence. “ስ-እ-ል” is the root word of “ሳለ”; therefore, “ሳለ” is an ambiguous word.

23
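The sense-counting idea behind ambiguous word identification can be sketched as below. The dictionary is a toy stand-in for Amharic WordNet: only the count of five entries for “ስ-እ-ል” comes from the slides, while the single-sense entries for the other roots, the sense identifiers, and the function names are assumptions made for illustration.

```python
# Toy stand-in for Amharic WordNet: root word -> list of sense identifiers.
# ASSUMPTION: only the five entries for the last root are stated in the slides;
# the other counts and all identifiers are placeholders.
TOY_WORDNET = {
    "ልጅ": ["s1"],
    "ጉንፋን": ["s1"],
    "ይ-እ-ዝ": ["s1"],
    "ስ-እ-ል": ["s1", "s2", "s3", "s4", "s5"],
}


def sense_count(root, wordnet):
    """Number of senses recorded for a root (0 if the root is unknown)."""
    return len(wordnet.get(root, []))


def find_ambiguous(roots, wordnet):
    """Return the roots that have more than one sense in the lexicon."""
    return [r for r in roots if sense_count(r, wordnet) > 1]


if __name__ == "__main__":
    roots = ["ልጅ", "ጉንፋን", "ይ-እ-ዝ", "ስ-እ-ል"]
    print(find_ambiguous(roots, TOY_WORDNET))
```

A root that is missing from the lexicon has zero senses and is never flagged, which mirrors the stated limitation that the system does not work for words absent from Amharic WordNet.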

24

DESIGN OF AWSD

Context Selection

• Context in WSD refers to the words surrounding the ambiguous word, which are used to decide its meaning.

• For example, the sentence “አበበ ቢላዋ ሳለ።” becomes “አበበ, ቢላዋ, ስ-እ-ል” after morphological analysis. The ambiguous word is therefore “ሳለ”, as identified by the Ambiguous Word Identification (AWI) component.

• The context selection component selects “ቢላዋ” as the context of the ambiguous word “ሳለ” from the ontology-based related-word table in Amharic WordNet.
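One way to picture context selection is a fixed-size window around the ambiguous word, filtered against a related-word table. The sketch below assumes this reading; the `RELATED_WORDS` table is a hypothetical one-entry stand-in for the ontology-based related-word table, and the function names are not from the thesis.

```python
def context_window(tokens, target_index, window_size=2):
    """Return up to window_size tokens on each side of the target token."""
    left = tokens[max(0, target_index - window_size):target_index]
    right = tokens[target_index + 1:target_index + 1 + window_size]
    return left + right


# ASSUMPTION: hypothetical stand-in for the ontology-based related-word table.
RELATED_WORDS = {"ሳለ": {"ቢላዋ"}}


def select_context(tokens, target_index, related, window_size=2):
    """Keep only the window words listed as related to the target word."""
    target = tokens[target_index]
    candidates = context_window(tokens, target_index, window_size)
    return [w for w in candidates if w in related.get(target, set())]


if __name__ == "__main__":
    sentence = ["አበበ", "ቢላዋ", "ሳለ"]
    print(context_window(sentence, 2))                 # both words left of the target
    print(select_context(sentence, 2, RELATED_WORDS))  # only the related word
```

Because the target is the last word of the example sentence, the right half of the window is empty and only left-context words are considered.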

24

25

DESIGN OF AWSD

Sense Selection

• It identifies the possible senses of ambiguous words in the given input sentence with the help of Amharic WordNet.

• It determines the intended sense of the ambiguous word.

• It takes the sets of words output by the Context Selection (CS) and AWI components and identifies the sense of the ambiguous word using either sense overlap or ontology-based related words.

25

26

Sense Selection

• The senses of “ቢላዋ” are:

1. ካራ፣ሰንጢ

2. በአንድ ጎኑ ስለት ያለው ለስጋ ለሽንከርት መቁረጫ ፥ መክተፊያ የሚያገለግል የወጥ ቤት እቃ

• The senses of “ሳለ” are:

1. ሞረደ ፥ አሾለ ፥ መቁረጫ ጠርዝን አተባ ፥ ስለት አወጣ

2. ይህን ካደረክለኝ ይህን አደርግልሀለው በማለት ለአምላክ፡ለመላእክት የሚቀርብ ለመና፡ብፅኣት

3. ከጉሮሮ አየርን በሀይል ና በተደጋጋሚነት አስወጣ፡ ትክትክ አደረገ፥ኡህ ኡህ አለ

• Sense 1 of “ሳለ” and Sense 2 of “ቢላዋ” share two overlapping words, whereas the other senses have zero overlap, so the first sense is selected for the ambiguous word.
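The overlap count in this example can be reproduced with a small gloss-overlap routine in the style of the Lesk algorithm. This is a sketch under assumptions: the slides do not say how glosses are tokenized or how ties are broken, so splitting on whitespace and Ethiopic punctuation, scoring by the best pairwise overlap, and keeping the earliest sense on ties are all choices made here for illustration.

```python
import re

# Split glosses on whitespace and Ethiopic punctuation (U+1361 to U+1368).
_SPLIT = re.compile(r"[\s\u1361-\u1368]+")

# Glosses copied from the slide.
BILAWA_SENSES = [
    "ካራ፣ሰንጢ",
    "በአንድ ጎኑ ስለት ያለው ለስጋ ለሽንከርት መቁረጫ ፥ መክተፊያ የሚያገለግል የወጥ ቤት እቃ",
]
SALE_SENSES = [
    "ሞረደ ፥ አሾለ ፥ መቁረጫ ጠርዝን አተባ ፥ ስለት አወጣ",
    "ይህን ካደረክለኝ ይህን አደርግልሀለው በማለት ለአምላክ፡ለመላእክት የሚቀርብ ለመና፡ብፅኣት",
    "ከጉሮሮ አየርን በሀይል ና በተደጋጋሚነት አስወጣ፡ ትክትክ አደረገ፥ኡህ ኡህ አለ",
]


def gloss_tokens(gloss):
    """Set of distinct words in a gloss."""
    return {t for t in _SPLIT.split(gloss) if t}


def select_sense(target_glosses, context_glosses):
    """Return (index, overlap) of the target sense sharing the most words
    with any single context gloss. Ties keep the earliest sense (assumption)."""
    best_index, best_overlap = 0, -1
    for i, target_gloss in enumerate(target_glosses):
        words = gloss_tokens(target_gloss)
        overlap = max(
            (len(words & gloss_tokens(c)) for c in context_glosses),
            default=0,
        )
        if overlap > best_overlap:
            best_index, best_overlap = i, overlap
    return best_index, best_overlap


if __name__ == "__main__":
    index, overlap = select_sense(SALE_SENSES, BILAWA_SENSES)
    print(f"selected sense {index + 1} with overlap {overlap}")
```

The function returns a zero-based index, so index 0 corresponds to Sense 1 on the slide; the two shared words are ስለት and መቁረጫ.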

26

27

DESIGN OF AWSD

Sense Retrieval

• Sense Retrieval is responsible for extracting the sense of ambiguous words from Amharic WordNet.

• Therefore, the sense of the ambiguous word, “ሞረደ ፥ አሾለ ፥ መቁረጫ ጠርዝን አተባ ፥ ስለት አወጣ”, is extracted from Amharic WordNet.

27

28

EXPERIMENT

Amharic WordNet

•Amharic WordNet contains 2000 words and 10,000 synsets

• A test set of 200 random sentences containing ambiguous words from the knowledge base was created.

28

29

EXPERIMENT

Performance Evaluation Criteria

• Precision is the percentage of correctly disambiguated senses for the ambiguous word.

• It measures how much of the information the system returned is correct.

• Recall measures how much of the relevant information the system has extracted.

• It is the number of correctly disambiguated senses over the total number of senses.

• F1-measure is a single measure that combines recall and precision.

• Accuracy is defined as the proportion of instances that were disambiguated correctly.
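These criteria can be written out as a short calculation. The sketch assumes the usual WSD reading, where precision divides by the number of instances the system attempted and recall divides by the total number of instances, and it takes F1 as the harmonic mean of the two. The counts in the usage lines are made-up illustrative numbers, not results from the thesis.

```python
def precision(correct, attempted):
    """Share of the senses returned by the system that are correct."""
    return correct / attempted if attempted else 0.0


def recall(correct, total):
    """Share of all test instances that were disambiguated correctly."""
    return correct / total if total else 0.0


def f1_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Illustrative counts only: 8 correct out of 10 attempted, 16 instances in total.
    p = precision(8, 10)
    r = recall(8, 16)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1_measure(p, r):.3f}")
```

When the system attempts every instance, the attempted and total counts coincide, so precision, recall, and accuracy take the same value.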

29

30

EXPERIMENT

Morphological Analyzer

• The morphological analyzer improves the accuracy of AWSD.

• HornMorpho, the morphological analyzer developed by Gasser, does not work for all words. This affects the accuracy of AWSD.

• Gasser tested the morphological analyzer on 200 words and found that it did not work for 2 of them.

• We do not know which words were used to test the morphological analyzer.

30

31

Test Results

31

EXPERIMENT

32

Test Results

• A small window size tends to result in high precision and low recall for all words.

• In terms of recall, if there are more words in the context, the chance that at least one of them is related to the ambiguous word is higher; hence, an increased window size leads to higher recall.

• We achieved good results with a window size of 2.

32

EXPERIMENT

33

Test Results

For example, in the sentence “አበበ ቢላዋ ሳለ”, with window size = 2 and the target word “ሳለ”, there is no word in the right context, so we assign the sense to the target word. This contributes to the good precision obtained for window size = 2.

33

34

CONCLUSION AND RECOMMENDATIONS

Conclusion

• AWSD addresses the problem of automatically deciding the correct sense of an ambiguous word based on ontology-based related words and sense overlap.

• Knowledge-based methods rely on information that can be extracted from Amharic WordNet.

• The morphological analyzer used with Amharic WordNet enhances the accuracy of AWSD.

• For Amharic, there is no standard optimal context window size. Based on our experiment, we found that a two-word window on each side of the ambiguous word is enough for Amharic WSD.

34

35

Contribution Of The Thesis

•Proposing a new architecture for Amharic Word Sense Disambiguation

system with Knowledge-Based Approach.

•Identifying the nature and patterns of Amharic ambiguous words and

how they can be processed to be understood by a machine.

•Identifying and proposing the procedures, techniques, algorithms and

tools used for the development of Amharic Word Sense Disambiguation

35

36

Contribution Of The Thesis

•The development of Amharic WordNet.

•Preparing an encouraging environment for the development of

other Amharic NLP studies that need WSD as a component in their

work.

36

CONCLUSION AND RECOMMENDATIONS (Cont’d)

37

Recommendations

•Developing thesaurus and machine readable dictionaries

•Developing Amharic word Ontology

•Developing a hybrid approach

•A project to develop a full-fledged Amharic WordNet system

•A project to develop a full-fledged Amharic WSD system

37

38

Thank You!!!

38

