Multi-Perspective Comparative Study:

Common Context Based knowledge integration

in Word Sense Disambiguation for Information

Retrieval

By

Vinay Hegde

Supervisor

Dr. T.M.Rangaswamy

A Thesis Submitted to Avinashilingam University for Women

Coimbatore – 43

In partial fulfilment of the requirements for the award of the degree of

Doctor of Philosophy

in

Computer Science and Engineering

October - 2012

CERTIFICATE

This is to certify that the dissertation entitled “Multi-Perspective Comparative

Study: Common Context Based knowledge integration in Word Sense

Disambiguation for Information Retrieval” submitted to the Department of Computer

Science and Engineering, Faculty of Engineering, Avinashilingam University for

Women, Coimbatore, for the award of Doctor of Philosophy in Computer Science

and Engineering is a record of original research work done by Vinay Hegde, during

the period of his study (April 2009 – October 2012) in the Department of Computer

Science and Engineering, Faculty of Engineering, Avinashilingam University for

Women, under my supervision and guidance and the thesis has not formed the basis

for the award of any Degree / Diploma / Associateship / Fellowship or any other

similar title to any candidate of any other University.

Signature of Guide

DECLARATION

I hereby declare that the matter embodied in the thesis entitled “Multi-

Perspective Comparative Study: Common Context Based knowledge integration

in Word Sense Disambiguation for Information Retrieval” submitted to

Department of Computer Science and Engineering, Faculty of Engineering,

Avinashilingam University for Women, Coimbatore, for the award of Doctor of

Philosophy in Computer Science and Engineering is a record of original research

work done by the undersigned under the supervision and guidance of Dr. T.M.

Rangaswamy, Associate Dean (P.G.Studies) and Professor, Industrial Engineering

and Management Department, R V College of Engineering and that it has not formed

the basis for the award of any Degree / Diploma / Associateship / Fellowship or any

other similar title to any other candidate of any other University.

Signature of Candidate

ACKNOWLEDGEMENT

First and foremost, I would like to thank my guide, Dr. T.M.Rangaswamy, for his warm

encouragement and thoughtful guidance to complete my PhD at Avinashilingam University for

Women, Coimbatore, as a full-time candidate.

I express my gratitude to the honorable Vice Chancellor of Avinashilingam University for Women,

Coimbatore, for giving me an opportunity to carry out the research work.

My sincere thanks go to the Controller of Examinations, Avinashilingam University for Women,

Coimbatore, for timely suggestions and help extended throughout the research work.

I express my gratitude to the Registrar and the Director of Research, Avinashilingam University for

Women, Coimbatore, for their support.

I am grateful to the Management of Rashtreeya Sikshana Samithi Trust (RSST), Bangalore for

giving me permission to carry out the research work, and providing logistics, financial support

and study leave to complete the research work.

I would like to express my gratitude to Dr. Satyanarayana B.S, Principal, Prof K.N. RajaRao,

Advisor, and Dr. Satyanarayana.S, Vice Principal, R V College of Engineering for their

encouragement and support to carry out my research work.

I would like to express my deep sense of gratitude to Dr.S.C.Sharma, honorable Vice

Chancellor, Tumkur University, Karnataka, India, for igniting the passion for research.

I would like to express my gratitude to Dr. N.K Srinath, Professor and Head, Department of

CSE, RVCE, and also to my colleagues in the Departments of Computer Science and Industrial

Engineering and Management, R V College of Engineering, for extending their support during

my research work.

I sincerely thank Dr. Krishna M, Professor, R&D Department, RVCE, and Dr. Ramakanth Kumar

P, Professor, Dept of Information Science, R V College of Engineering, for their continuous

evaluation and review of the thesis with respect to grammatical and technical corrections, which

helped me complete my work on time.

My special thanks go to Dr. Rajashree Shettar, Professor, Computer Science Department, Prof.

Manjunath A.E, Assistant Professor, CSE Department, Prof. Anjan K, Assistant Professor, Mr.

Sathyanarayana, Psychiatry Counselor, R&D Department, R. V. College of Engineering, and Mrs.

Vedavathy, Instructor, CSE Department, for their unconditional support in providing cognitive

inputs such as brain mapping and human psychology during my research work, and in designing and

typing my thesis with a quality focus.

I dedicate this thesis to my beloved parents, Sri. V.S. Hegde, Smt. Parvati Hegde, and Dr.

Aparna Hamsa. Special thanks to Mr. Narayanappa A.P and other family members for their

unconditional support and encouragement to pursue my interests.

I would like to thank all the people who have helped me directly or indirectly in carrying out my

research work and preparation of the thesis.

(Vinay Hegde)

LIST OF PUBLICATIONS

Papers Published in International and National Journals:

1. Vinay Hegde, Dr.T.M.Rangaswamy published a paper on “Efficient word sense

Disambiguation with context based searching”, International Journal of Advanced

Engineering Technology, E-ISSN 0976-3945, Volume II, Issue I, January-March (2011),

pp. 43-46.

2. Vinay Hegde, Dr.T.M.Rangaswamy published a paper on “Putting Corpus into dictionary

(WSD Approach)” at the International Conference / Journal on Computer Applications

[ICCA-2010], during 24th to 27th December 2010, held at Pondicherry, India, pp. 303-306.

The same is published in the journal, doi:10.3850/978-981-08-7304-2_1445

(http://www.rpsonline.com.sg/proceedings/9789810873042/html/978-981-08-7304-2_1445.xml).

3. Vinay Hegde, Jeevan HE, Prashanth PP, Punith Kumar S N & Dr.T.M.Rangaswamy

published a paper on “Web pages Clustering: A New Approach”, International Journal of

Innovative Technology and Creative Engineering (IJITCE), ISSN: 2045-8711, Volume

1, Number 4, April 2011, pp. 42-44.

CONFERENCE PUBLICATIONS

Papers Presented in International and National Conferences

1. Vinay Hegde, Dr.T.M.Rangaswamy, Dipti Shankar, Deepiks Sin D.V. Ashwini presented

a paper on “A simple graph based Information Retrieval system” at the National

Conference on Recent Trends in Computing Technology [RTCT 2011], during 29th to 30th

April 2011, held at the Department of Computer Science & Engineering, R.V.College of

Engineering, Bangalore, Karnataka, India. Volume 2, pp. 110-113.

2. Vinay Hegde, Dr.T.M.Rangaswamy presented a paper on “Word Sense Disambiguation- a

new approach” at International conference on Emerging Trends in Computer Science,

Communication and Information Technology [CCIT-2010], during 9th to 11th Jan 2010,

held at Nandel, Maharashtra, India.

3. Vinay Hegde, Dr.T.M.Rangaswamy presented a paper on “A Review on Word Sense

Disambiguation” at the National Conference on Computing Communication & Technology

[CCT-2010], during 12th to 13th Jan 2010, held at RVCE, Bangalore, pp. 51-53.

4. Vinay Hegde, Dr.T.M.Rangaswamy presented a paper on “Survey on Word Sense

Disambiguation” at the National Conference on Network Security and Information

Retrieval [NCNSIR09], on 14th December 2009, held at the Department of MCA, RVCE,

Bangalore, pp. 34-36.

Abstract

In natural language processing (NLP), knowledge discovery is an important area of research.

The knowledge discovery process must deal with both lexical and semantic ambiguity. It also

concentrates on developing and building intelligent open-domain systems to resolve issues in

language processing. Knowledge discrimination and sense disambiguation are two

important areas of research. Word sense disambiguation (WSD) is the problem of selecting the

correct sense for a word from a set of predefined possibilities. The sense inventory usually comes

from a dictionary or thesaurus. Knowledge-intensive methods, supervised learning, and

sometimes bootstrapping approaches are used to solve it.
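
As a rough illustration of "selecting a sense from a set of predefined possibilities", the sketch below scores each candidate sense by how many words its dictionary gloss shares with the surrounding context (a simplified Lesk-style overlap). It is not the method developed in this thesis. The sense inventory, the glosses, the stopword list, and the names `SENSE_INVENTORY`, `tokenize`, and `disambiguate` are all hypothetical and chosen only for the example.

```python
# Toy sense inventory: hypothetical glosses, not taken from WordNet or the thesis.
SENSE_INVENTORY = {
    "bank": {
        "bank_financial": "an institution that accepts deposits and lends money",
        "bank_river": "sloping land beside a body of water such as a river",
    }
}

# Small illustrative stopword list (an assumption for this example).
STOPWORDS = {"a", "an", "the", "of", "and", "that", "as", "such", "to", "on", "in", "at"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def disambiguate(word, context):
    """Return the sense whose gloss shares the most words with the context."""
    context_words = tokenize(context)
    scores = {
        sense: len(tokenize(gloss) & context_words)
        for sense, gloss in SENSE_INVENTORY[word].items()
    }
    return max(scores, key=scores.get), scores


best, scores = disambiguate(
    "bank", "She sat on the bank of the river and watched the water"
)
print(best, scores)
```

In this toy case the river sense wins because its gloss shares "river" and "water" with the context. A real system would draw glosses from a resource such as WordNet and would need a tie-breaking rule when overlaps are equal.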

This research work incorporates experiential learning styles: Converger for abstract

conceptualization and active experimentation, Diverger for concrete experience and reflective

observation, Assimilator for abstract conceptualization and reflective observation, and

Accommodator for concrete experience and active experimentation. Combining all strategies

yielded a precision of 86% and a recall of 81% for event instances. For fully

decorated events, precision was 62% and recall was 53%. Precision and recall are

two of the traditional performance measures for evaluating information retrieval systems.

Precision is the fraction of the retrieved result documents that are relevant to the user's

information need. Recall is the fraction of relevant documents in the entire collection

that are returned as result documents. Contextualized search, personal information management,

and information integration strategies are employed in this work. In addition to precision and

recall, cues, confidence, and rule accuracy are used to support the claims made in the research.
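
The two measures defined above can be computed directly from the set of retrieved documents and the set of relevant documents. The following is a minimal sketch. The document identifiers and the function name `precision_recall` are made up for illustration and do not come from the thesis experiments.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from retrieved and relevant document IDs.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 5 relevant in the collection,
# 2 of the returned documents are relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7"]

p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here precision is 2/4 and recall is 2/5. Returning 0.0 for an empty set is a convention chosen for this sketch; some evaluations treat that case as undefined.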

The main contribution of this research work is a new holistic, integrated

approach to the Word Sense Disambiguation (WSD) problem of natural language processing. It

combines an information access architecture with the 3C (content, context, and common-background)

knowledge integration architecture, and is developed through a multi-perspective comparative

study of various methods. The approach is context-sensitive, concept-based, and proactive.

TABLE OF CONTENTS

Chapter No. Particulars Page No.

Acknowledgement i

List of Publications iii

Abstract v

List of Figures xv

List of Tables xviii

Chapter-1 Introduction 01

1.1 Background: Achieving Word Sense Disambiguation through Configuration of an Information Retrieval (IR) system. 04

Core Information Retrieval(IR) concepts:

1.2 The Document and Query Representation 05

1.3 Relevance Feedback 06

1.4 Evaluation parameters used in WSD system 08

1.5 Stemming in WSD 09

1.6 Parts of speech tagging of text 10

1.7 Feature Extraction 10

1.8 Contextual Feature 10

1.9 Co training algorithm 10

1.10 State of art of research topic 13

1.11 Definition of the Problem 16

1.12 Overview of the problem 17

1.13 Applications of WSD 18

1.14 Research gap, Research objectives and scope of Research work 18, 19, 20

1.15 Summary of the thesis or Thesis organization 23

Chapter-2 Literature Survey 25

2.1 Reviews on Knowledge Based Word Sense

Disambiguation (WSD) 25

2.2 Reviews on Supervised WSD or Corpus Based WSD. 25

2.3 Review on Natural Language Processing(NLP) 26

2.4 Review on WSD by unsupervised methods. 27

2.5 Review on WSD by clustering. 30

2.6 Reviews on Information retrieval techniques. 31

2.7 Review on Predominant sense. 33

2.8 Review on indexing Techniques. 33

2.9 Review on Latent semantic indexing. 34

2.10 Review on Text Data Mining. 35

2.11 Review on Keyword based association analysis. 35

Chapter-3 Integrated Framework 3C (context, content and common knowledge) 36

3.1 Introduction to 3C Framework and NLP 42

3.1.1 Goals in NLP 43

3.1.2 Classification in NLP 43

3.2. Communication in NLP 43

3.2.1 Communication in NLP( for speaker) 43

3.2.2 Communication for the hearer (comprehension) 44

3.2.2 Subtasks of NLP(syntactic, semantic, pragmatic) 44

3.3 Modular comprehension (NLP). 54

3.3.1 Reasons for Language being ambiguous. 55

3.4 Multi perspective Domain Analysis of IR systems based

on WSD.

56

3.5 DFD or Functional View of WSD based Paradigm IR

system

57

3.6 Innovative Framework (WSD). 59

3.7 Knowledge Representation in Expert systems(WSD) 62

3.7.1 Requirements of expert system (WSD). 62

3.7.2 Comparison of Data processing and Knowledge

Engineering.

65

3.8 Advanced Text pre-processing and indexing in WSD

system

65

3.8.1 The most common filtering /processing operations 65

3.8.2 Indexing Large Data sets without sorting (Clustering): 66

3.9 Stemming Algorithms (WSD) 67

3.9.1 Conflation methods (WSD) 67

3.10 Advanced syntactic, semantic and pragmatic/discourse

Tasks

68

3.10.1 Word Segmentation 68

3.10.2 Morphological Analysis in NLP 69

3.11 Tokenizing, Tagging and Lemmatizing (TTL) 69

3.11.1 TTL Functionalities 70

3.12 Semantic parsing 71

3.12.1 Entailment in WSD. 71

3.13 Anaphora resolution or co-occurrence determination. 72

3.14 Enhancing the effectiveness of Re-training filters. 73

3.14.1 (Model details on effective Re-training strategy)

Enhancing the effectiveness.

75

3.15 Enhanced Natural Language Generator Architecture

(WSD).

76

Chapter 4 Research Methodology 77

4.0 Methodology, Tools and Techniques in comparative

experimental setup for word sense disambiguation.

77

4.1 User Requirements collection in developing efficient

learning system (WSD).

80

4.1.1 Resources or Data sources used. 80

4.1.2 Algorithms and Type of Data structures in Research

design.

82

4.1.3 Typical knowledge resources used in WSD. 82

4.2 Research Design: Developing Methods for Word Sense

Disambiguation (knowledge integrated approach):

82

4.3 Proposed Method [METHOD 1] 85

4.3.1 Information Extraction in cognitive approach to WSD. 86

4.3.2 Building Text Representation 88

4.4 Context Knowledge Acquisition and Representation

Modeling Process using corpus harvesting procedure and

IR databases. [METHOD-2]

93

4.4.1 Knowledge acquisition through Web Document Retrieval

and Cleaning.

97

4.4.2 Sentence segmentation and parsing. 97

4.4.3 Merging Dependency relations. 98

4.4.4 Handling unsupervised method to integrate the

Knowledge in WSD based on cognitive approach.

99

4.5 Proposed Method [METHOD-3]: Knowledge Extraction using Dynamical updating of Representation Model. 100

4.6 WordNet (Data source used in WSD Research). 103

4.7 Proposed Design of WSD Algorithm and program

implementation procedures.

106

Chapter 5 Results and Discussions. 120

5.0 Evaluation of WSD systems. 120

5.1 Assumptions or criteria set before Evaluation of

Integrated Framework of WSD systems:

120

5.2 Results: Experimental Setup for Evaluation of WSD or

Disambiguation System.

122

5.3 Error Analysis using Decision Lists form basic evaluation

procedure in WordNet.

122

5.3.1 Evaluation of senses from WordNet. 123

5.4 Determination of clue words by disambiguator 125

5.4.1 Knowledge gathering from contexts of seed words. 127

5.5 Establishing the upper and lower bounds of effectiveness 127

5.5.1 The similarity Measure 128

5.6 Knowledge gathering in WSD system – An evaluation. 132

5.6.1 WordNet Sense Tag Evaluation. 132

5.6.2 Multi perspective Analysis of Collocations in Knowledge

store (WordNet) Vs Senseval-II data

133

5.6.3 Evaluation by designing co–occurrence matching filter. 133

5.6.4 Algorithm of co-occurrence matching filter. 133

5.6.5 Performance of Integrated WSD co-occurrence Vs

SENSEVAL II data:

135

5.6.6 Inference with two plots of WSD co-occurrence. 136

5.7 Frequency Matching 136

5.8 Evaluation of Combined Multiple Knowledge source. 138

5.8.1 Overall WSD Evaluation. 139

5.9 Evaluation of WSD Task Definition in Supervised WSD

Method

140

5.9.1 Evaluation of sample sense tagged Text. 140

5.9.2 Evaluation Example of Two Bags of Words (Co-

occurrences in the “window of context”)

141

5.10 Evaluation of Simple supervised approach. 141

5.10.1 Evaluation of Supervised Methodology 141

5.10.2 Evaluation Example from text to feature vectors. 142

5.10.3 Supervised learning Algorithms applied to feature

vectors.

142

5.11 Evaluation by Naïve Bayesian Classifier (introduction) 143

5.11.1 Bayesian Inference 143

5.11.2 Evaluation of right sense with Naïve Bayesian Classifier 144

5.11.3 Evaluation procedures for Decision List in WSD. 145

5.11.4 Evaluating and Building the Decision List score: an example. 145

5.12 Evaluation of feature sets. 146

5.13 Evaluation of Ensemble Considerations. 146

5.14 Evaluation of bootstrapping of WSD classifier. 147

5.15 Evaluation of Co-training / Self-training WSD example. 147

5.16 Experiments with Co-training / Self-training for WSD. 148

5.17 Evaluation of Bootstrapping Algorithm. 148

5.18 Multi perspective Evaluation of senses. 149

5.18.1 Evaluation of unsupervised learning. 149

5.18.2 Example set for Evaluation of cluster. 149

5.18.3 Evaluation of unsupervised WSD. 150

5.18.4 Experimental Evaluation and results. 150

5.19 Evaluation of context representation. 151

5.19.1 Second Order Context Representation Example 151

5.20 The overall effectiveness of integrated approach involving

content, context and common knowledge in resolving

word sense disambiguation problems.

152

Chapter 6 Combining Multiple Knowledge sources 153

6.0 Knowledge characteristics. 153

6.1 Knowledge Management. 154

6.2 Knowledge creation or knowledge acquisition. 156

6.3 Knowledge Management – Integration framework helps

to build enterprise solutions.

159

6.4 Beneficiaries of Knowledge Integration (using WSD

cognitive approach).

160

6.5 Benefitting Managerial Issues by using integrated WSD

approach of Knowledge Management.

160

6.6 Knowledge-based Content Navigation in e-Learning

Applications (WSD approach).

162

6.7 Dynamic Knowledge Discovery and Representation. 163

6.8 Adaptive Knowledge-based Content Navigation. 164

6.9 Automatic extraction of knowledge from the web. 167

6.10 Instructional design and learning knowledge objects. 168

6.11 The Knowledge Puzzle integrated WSD Approach for the

Creation of Learning Knowledge Objects

170

Chapter 7 WSD Future Scope of research 172

7.1 Conclusion 176

References 179

Appendix

List of Figures

Figure No. Title Page No.

1.1 Overview of WSD Systems 03

1.2 Example of an IR system. 04

3.1a. Machine Learning approach (WSD). 37

3.1b. Knowledge Integration Framework 3C (content, context,

common knowledge) for WSD.

40

3.2 Goals In communication: Production (speaker side),

Comprehension (hearer side) View.

45

3.3 Modular Comprehension. 54

3.4 DFD / Functional View of WSD based Paradigm IR system 57

3.5 Conceptual representation of Domain Adaptation, Pre-

processing, supervised and Unsupervised methods of WSD.

60

3.6 Structure of an Expert System(WSD). 62

3.7 Advanced Text Pre-processing Model. 65

3.8 Steps involved in new indexing method (for large data sets). 67

3.9 Morphological Analysis 69

3.10 Co-occurrence Determination for “speaker” example 73

3.11 Effectiveness in Re-training Method. 75

3.12 Natural Language Generator architecture (WSD). 76

3.13 Conceptual view of Modern Retrieval system. 76

4.0 Methodology used in integrated framework for WSD. 78

4.1 Interaction between WSD Expert system with Language

Models.

79

4.2 Centralized data source in an Expert system (WSD). 81

4.3 Extension of IR systems to WSD Retrieval systems. 85

4.4 Architecture for information extraction and WSD. 85

4.5 Information extraction (cognitive approach) 87

4.6 Disambiguation Task Definition. 87

4.7 Sense Tagger Architecture in WSD (Interaction of knowledge sources in WSD). 89

4.8 Sense Learner, Learning Mechanism or Model. 90

4.9 WSD classifier. 90

4.10 Preparing the training corpus (cognitive approach). 91

4.11 Context Knowledge Acquisition Process. 93

4.12 Cognitive pattern Acquisition process. 95

4.13 Selection of best cognitive senses. 99

4.14 Knowledge Extraction using Dynamical updating

of Representation Model.

100

4.15 WordNet 2.0 entry for the noun. 103

4.16 Noun relations in WordNet. 104

4.17 Verb relations in WordNet. 104

4.18 Adjective and adverb relations in WordNet. 104

4.19 Hyponymy chain for the noun valley. 105

4.20 A fragment of WordNet. 106

4.21 Overview of disambiguation program implementation. 109

4.22 WSD algorithm - call graph. 110

4.23 Pseudocode of obtaining sense divisions in supervised way. 111

4.24 Pseudocode of Sense_Divisions. 111

4.25 Pseudocode of Training Accuracy. 112

4.26 Pseudocode for Pruning. 113

4.27 Creation of decision list pseudocode. 114

4.28 Pseudocode to find Accuracy. 115

4.29 Converge pseudocode. 115

4.30 correct_tagged_sets. 116

4.31 Pseudocode to disambiguate untagged_set. 116

4.32 Pseudocode for Best Match 116

4.33 Creation of decision list based on current sets. 118

4.34 pseudocode of Parts of speech tagging. 118

4.35 pseudocode of extract_features. 119

5.1a. Accuracy parameters. 120

5.1b. Multi perspective Evaluation of Integrated Framework of

WSD system.

122

5.2 Fragment of WordNet Hierarchy. 123

5.3 A sense of the word ‘assembly’ from the WordNet . 124

5.4 Accuracy of disambiguator against the no of clue words. 126

5.5 upper and lower bound plots based on retrieval effectiveness. 128

5.6 Sample training test data paragraph in XML format. 131

5.7 Performance of Integrated WSD Vs SENSEVAL II data. 136

5.8 WSD co-occurrence Vs All words participants. 137

5.9 WSD’s Frequency Vs SENSEVAL Data. 138

5.10 Frequency Matching in WSD Vs SENSEVAL of all words Data. 152

6.1 Example of – Knowledge Management Systems (WSD). 155

6.2 Multi perspective Knowledge integration framework. 159

6.3 Knowledge Representation Framework 165

6.4 Automatic Extraction of knowledge from web. 167

6.5 Combined Knowledge Platform Functional overview. 168

6.6 Bloom's taxonomy of knowledge verb representation. 168

6.7 Semantic Web Rule Language. 169

6.8 Mapping between KPCM and SCORM. 170

List of Tables

Table No. Title Page No.

3.1 Faceted classification of Information Retrieval systems. 56

3.2 Comparison of Data Processing and Knowledge Engineering. 65

3.3 Stemmers, confidence, score. 68

3.4 Textual Entailment comparison examples. 71

5.1 The accuracy of the disambiguator against two random selection and most common Strategies. 124

5.2 5 senses evaluated by disambiguator for word ‘assembly’. 124

5.3 Accuracy against the number of clue words to be used by

disambiguator.

125

5.4 Pseudo words used to find accuracy of disambiguator. 126

5.5 Evaluation of seed words in example contexts. 127

5.6 Sense, score (confidence determination). 128

5.7 Combining the output of three manual taggers. 128

5.8 Comparative analysis of two disambiguation systems. 129

5.9 Multi perspective analysis of integrated framework. 130

5.9a Three senses of word ‘speaker’ Evaluated by disambiguation

system compared with WordNet. Knowledge store.

130

5.10 Performance of Disambiguator-collocation with SENSEVAL

II data.

131

5.11 Performance of Integrated WSD co-occurrence Vs

SENSEVAL II data:

133

5.12 Performance WSD system’s frequency Vs SENSEVAL data 135

5.13 Sense tagged text. 137

5.14 Co-occurrences in the ‘window of context’ 140

5.15 Steps in Evaluation of Supervised methodology with example. 140

5.16 Text sense into Feature Vectors determination. 141

5.16a EXAMPLE sentence of text with different senses. 141

5.20 Example procedures to implement decision list 145

5.21 Decision List Score example. 146

5.23 steps in co-training /self training for WSD. 147

5.24 Bootstrapping Algorithm 148

5.25 Cluster this Data 148

5.22 Experimental evaluation and Results from WordNet 151

5.26 Evaluation of second order context representation 148

5. Overall effectiveness of integrated knowledge approach 151
