Multi-Perspective Comparative Study:
Common Context Based knowledge integration
in Word Sense Disambiguation for Information
Retrieval
By
Vinay Hegde
Supervisor
Dr. T.M.Rangaswamy
A Thesis Submitted to Avinashilingam University for Women
Coimbatore – 43
In partial fulfilment of the requirements for the award of the degree of
Doctor of Philosophy
in
Computer Science and Engineering
October - 2012
CERTIFICATE
This is to certify that the dissertation entitled “Multi-Perspective Comparative
Study: Common Context Based knowledge integration in Word Sense
Disambiguation for Information Retrieval” submitted to Department of Computer
Science and Engineering, Faculty of Engineering, Avinashilingam University for
Women, Coimbatore, for the award of Doctor of Philosophy in Computer Science
and Engineering is a record of original research work done by Vinay Hegde, during
the period of his study (April 2009 – October 2012) in the Department of Computer
Science and Engineering, Faculty of Engineering, Avinashilingam University for
Women, under my supervision and guidance and the thesis has not formed the basis
for the award of any Degree / Diploma / Associateship / Fellowship or any other
similar title to any candidate of any other University.
Signature of Guide
DECLARATION
I hereby declare that the matter embodied in the thesis entitled “Multi-
Perspective Comparative Study: Common Context Based knowledge integration
in Word Sense Disambiguation for Information Retrieval” submitted to
Department of Computer Science and Engineering, Faculty of Engineering,
Avinashilingam University for Women, Coimbatore, for the award of Doctor of
Philosophy in Computer Science and Engineering is a record of original research
work done by the undersigned under the supervision and guidance of Dr. T.M.
Rangaswamy, Associate Dean (P.G.Studies) and Professor, Industrial Engineering
and Management Department, R V College of Engineering and that it has not formed
the basis for the award of any Degree / Diploma / Associateship / Fellowship or any
other similar title to any other candidate of any other University.
Signature of Candidate
ACKNOWLEDGEMENT
First and foremost, I would like to thank my Guide, Dr. T.M. Rangaswamy, for his warm
encouragement and thoughtful guidance in completing my PhD at Avinashilingam University for
Women, Coimbatore, as a full-time candidate.
I express my gratitude to the honorable Vice Chancellor of Avinashilingam University for Women,
Coimbatore, for giving me the opportunity to carry out the research work.
My sincere thanks to the Controller of Examinations, Avinashilingam University for Women,
Coimbatore, for the timely suggestions and help extended throughout the research work.
I express my gratitude to the Registrar and the Director of Research, Avinashilingam University for
Women, Coimbatore, for their support.
I am grateful to the Management of Rashtreeya Sikshana Samithi Trust (RSST), Bangalore for
giving me permission to carry out the research work, and providing logistics, financial support
and study leave to complete the research work.
I would like to express my gratitude to Dr. Satyanarayana B.S, Principal, Prof K.N. RajaRao,
Advisor, and Dr. Satyanarayana.S, Vice Principal, R V College of Engineering for their
encouragement and support to carry out my research work.
I would like to express my deep sense of gratitude to Dr.S.C.Sharma, honorable Vice
Chancellor, Tumkur University, Karnataka, India, for igniting the passion for research.
I would like to express my gratitude to Dr. N.K. Srinath, Professor and Head, Department of
CSE, RVCE, and also to my colleagues in the Departments of Computer Science and Industrial
Engineering and Management, R V College of Engineering, for extending their support during
my research work.
I sincerely thank Dr. Krishna M, Professor, R&D Department, RVCE, and Dr. Ramakanth Kumar
P, Professor, Dept of Information Science, R V College of Engineering, for their continuous
evaluation and review of the thesis with respect to grammatical and technical corrections, which
helped me complete my work on time.
My special thanks to Dr. Rajashree Shettar, Professor, Computer Science Department, Prof.
Manjunath A.E, Assistant Professor, CSE Department, Prof. Anjan K, Assistant Professor, Mr.
Sathyanarayana, Psychiatry Counselor, R&D Department, R. V. College of Engineering, and Mrs.
Vedavathy, Instructor, CSE Department, for their unconditional support in providing cognitive
inputs such as brain mapping and human psychology during my research work, and in designing and
typing my thesis with a quality focus.
I dedicate this thesis to my beloved parents, Sri. V.S. Hegde, Smt. Parvati Hegde, and Dr.
Aparna Hamsa. Special thanks to Mr. Narayanappa A.P and other family members for their
unconditional support and encouragement to pursue my interests.
I would like to thank all the people who have helped me directly or indirectly in carrying out my
research work and preparation of the thesis.
(Vinay Hegde)
LIST OF PUBLICATIONS
Papers Published in International and National Journals:
1. Vinay Hegde, Dr. T.M. Rangaswamy published a paper on “Efficient word sense
Disambiguation with context based searching”, International Journal of Advanced
Engineering Technology, E-ISSN 0976-3945, Volume II, Issue I, January-March (2011),
pp. 43-46.
2. Vinay Hegde, Dr. T.M. Rangaswamy published a paper on “Putting Corpus into dictionary
(WSD Approach)” at International Conference / Journal on Computer Applications
[ICCA-2010], during 24th to 27th December 2010, held at Pondicherry, India, pp. 303-306.
The same is published in the journal, doi:10.3850/978-981-08-7304-2_1445
(http://www.rpsonline.com.sg/proceedings/9789810873042/html/978-981-08-7304-2_1445.xml).
3. Vinay Hegde, Jeevan HE, Prashanth PP, Punith Kumar S N & Dr. T.M. Rangaswamy
published a paper on “Web pages Clustering: A New Approach”, International Journal of
Innovative Technology and Creative Engineering (IJITCE), ISSN: 2045-8711, Volume
1, Number 4, April 2011, pp. 42-44.
CONFERENCE PUBLICATIONS
Papers Presented in International and National Conferences
1. Vinay Hegde, Dr. T.M. Rangaswamy, Dipti Shankar, Deepiks Sin D.V. Ashwini presented
a paper on “A simple graph based Information Retrieval system” at National
Conference on Recent Trends in Computing Technology [RTCT 2011], during 29th to 30th
April 2011, held at Department of Computer Science & Engineering, R.V. College of
Engineering, Bangalore, Karnataka, India, Volume 2, pp. 110-113.
2. Vinay Hegde, Dr. T.M. Rangaswamy presented a paper on “Word Sense Disambiguation- a
new approach” at International Conference on Emerging Trends in Computer Science,
Communication and Information Technology [CCIT-2010], during 9th to 11th Jan 2010,
held at Nandel, Maharashtra, India.
3. Vinay Hegde, Dr. T.M. Rangaswamy presented a paper on “A Review on Word Sense
Disambiguation” at National Conference on Computing Communication & Technology
[CCT-2010], during 12th to 13th Jan 2010, held at RVCE, Bangalore, pp. 51-53.
4. Vinay Hegde, Dr. T.M. Rangaswamy presented a paper on “Survey on Word Sense
Disambiguation” at National Conference on Network Security and Information
Retrieval [NCNSIR09], on 14th December 2009, held at Department of MCA, RVCE,
Bangalore, pp. 34-36.
Abstract
In natural language processing (NLP), knowledge discovery is an important area of research.
The knowledge discovery process must deal with lexical ambiguity and semantic ambiguity. Further, it
concentrates on developing and building intelligent open-domain systems to resolve the issues in
language processing. Knowledge discrimination and sense disambiguation are two
important areas of research. Word sense disambiguation (WSD) is the problem of selecting the
right sense for a word from a set of predefined possibilities. The sense inventory usually comes
from a dictionary or thesaurus. Knowledge-intensive methods, supervised learning, and
sometimes bootstrapping approaches are used.
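As a minimal illustration of selecting a sense from a predefined inventory, the sketch below scores each candidate sense by the number of words its dictionary gloss shares with the context (a simplified gloss-overlap heuristic in the style of Lesk). It is not the method developed in this thesis. The inventory `SENSE_INVENTORY`, its glosses, the stopword list, and the function names are hypothetical and written only for this example.

```python
# Toy sense inventory. The glosses are hypothetical, written for this
# illustration, and are not taken from WordNet or any real dictionary.
SENSE_INVENTORY = {
    "bank": {
        "bank_financial": "an institution that accepts deposits and lends money",
        "bank_river": "sloping land beside a river or other body of water",
    }
}

# Small assumed stopword list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "in", "on",
             "at", "by", "beside", "other"}


def tokens(text):
    """Lower-case, whitespace-split, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def disambiguate(word, context):
    """Return the sense of `word` whose gloss shares most words with `context`.

    Ties are resolved in favour of the sense listed first in the inventory.
    """
    context_words = tokens(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSE_INVENTORY[word].items():
        overlap = len(context_words & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "she went to the bank to deposit money"))
print(disambiguate("bank", "they sat on the bank of the river watching the water"))
```

With these toy glosses the first context selects the financial sense and the second selects the river sense. A realistic system would draw its glosses from a resource such as WordNet and combine the overlap score with other knowledge sources.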
This research work incorporates experiential learning styles: Converger for
abstract conceptualization and active experimentation, Diverger for concrete experience and reflective
observation, Assimilator for abstract conceptualization and reflective observation, and Accommodator
for concrete experience and active experimentation. Combining
all strategies yielded a precision of 86% and a recall of 81% for event instances. For fully
decorated events, precision was 62% and recall was 53%. Precision and recall are
two of the traditional performance measures for evaluating information retrieval systems.
Precision is the fraction of the retrieved result documents that are relevant to the user's
information need. Recall is the fraction of relevant documents in the entire collection
that are returned as result documents. Contextualized search, personal information management, and
information integration strategies are employed in the work to make a noteworthy contribution.
Further, cues, confidence, and rule accuracy have been deployed along with precision and recall
to provide more stability to the claims made in the research.
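The definitions of precision and recall given above can be made concrete with a short sketch. The function name `precision_recall` and the document identifiers are hypothetical and chosen only for illustration. The figures it prints are not results from this work.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids in the whole collection judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 relevant in the collection,
# and 2 documents common to both sets.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7"]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Here two of the four returned documents are relevant, giving a precision of 0.50, and two of the five relevant documents are returned, giving a recall of 0.40.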
The main contribution of this research work is a new holistic, integrated
approach that uses an information access architecture, the 3C (content, context, and common
background) knowledge integration architecture, arrived at through a multi-perspective comparative
study of various methods, to effectively resolve the Word Sense Disambiguation (WSD) problem of natural
language processing. This approach is context-sensitive, concept-based, and proactive.
TABLE OF CONTENTS
Chapter No. Particulars Page No.
Acknowledgement i
List of Publications iii
Abstract v
List of Figures xv
List of Tables xviii
Chapter-1 Introduction 01
1.1 Background: Achieving Word Sense Disambiguation through Configuration of an Information Retrieval (IR) system. 04
Core Information Retrieval (IR) concepts:
1.2 The Document and Query Representation 05
1.3 Relevance Feedback 06
1.4 Evaluation parameters used in WSD system 08
1.5 Stemming in WSD 09
1.6 Parts of speech tagging of text 10
1.7 Feature Extraction 10
1.8 Contextual Feature 10
1.9 Co-training algorithm 10
1.10 State of the art of research topic 13
1.11 Definition of the Problem 16
1.12 Overview of the problem 17
1.13 Applications of WSD 18
1.14 Research gap, Research objectives and scope of Research work 18, 19, 20
1.15 Summary of the thesis or Thesis organization 23
Chapter-2 Literature Survey 25
2.1 Reviews on Knowledge Based Word Sense
Disambiguation (WSD) 25
2.2 Reviews on Supervised WSD or Corpus Based WSD. 25
2.3 Review on Natural Language Processing(NLP) 26
2.4 Review on WSD by unsupervised methods. 27
2.5 Review on WSD by clustering. 30
2.6 Reviews on Information retrieval techniques. 31
2.7 Review on Predominant sense. 33
2.8 Review on indexing Techniques. 33
2.9 Review on Latent semantic indexing. 34
2.10 Review on Text Data Mining. 35
2.11 Review on Keyword based association analysis. 35
Chapter-3 Integrated Framework 3C (context, content and common knowledge) 36
3.1 Introduction to 3C Framework and NLP 42
3.1.1 Goals in NLP 43
3.1.2 Classification in NLP 43
3.2 Communication in NLP 43
3.2.1 Communication in NLP (for speaker) 43
3.2.2 Communication for the hearer (comprehension) 44
3.2.2 Subtasks of NLP (syntactic, semantic, pragmatic) 44
3.3 Modular comprehension (NLP). 54
3.3.1 Reasons for Language being ambiguous. 55
3.4 Multi perspective Domain Analysis of IR systems based on WSD. 56
3.5 DFD or Functional View of WSD based Paradigm IR system 57
3.6 Innovative Framework (WSD). 59
3.7 Knowledge Representation in Expert systems (WSD) 62
3.7.1 Requirements of expert system (WSD). 62
3.7.2 Comparison of Data processing and Knowledge Engineering. 65
3.8 Advanced Text pre-processing and indexing in WSD system 65
3.8.1 The most common filtering /processing operations 65
3.8.2 Indexing Large Data sets without sorting (Clustering): 66
3.9 Stemming Algorithms (WSD) 67
3.9.1 Conflation methods (WSD) 67
3.10 Advanced syntactic, semantic and pragmatic/discourse Tasks 68
3.10.1 Word Segmentation 68
3.10.2 Morphological Analysis in NLP 69
3.11 Tokenizing, Tagging and Lemmatizing (TTL) 69
3.11.1 TTL Functionalities 70
3.12 Semantic parsing 71
3.12.1 Entailment in WSD. 71
3.13 Anaphora resolution or co-occurrence determination. 72
3.14 Enhancing the effectiveness of Re-training filters. 73
3.14.1 (Model details on effective Re-training strategy) Enhancing the effectiveness. 75
3.15 Enhanced Natural Language Generator Architecture (WSD). 76
Chapter 4 Research Methodology 77
4.0 Methodology, Tools and Techniques in comparative experimental setup for word sense disambiguation. 77
4.1 User Requirements collection in developing efficient learning system (WSD). 80
4.1.1 Resources or Data sources used. 80
4.1.2 Algorithms and Type of Data structures in Research design. 82
4.1.3 Typical knowledge resources used in WSD. 82
4.2 Research Design: Developing Methods for Word Sense Disambiguation (knowledge integrated approach): 82
4.3 Proposed Method [METHOD 1] 85
4.3.1 Information Extraction in cognitive approach to WSD. 86
4.3.2 Building Text Representation 88
4.4 Context Knowledge Acquisition and Representation Modeling Process using corpus harvesting procedure and IR databases. [METHOD-2] 93
4.4.1 Knowledge acquisition through Web Document Retrieval and Cleaning. 97
4.4.2 Sentence segmentation and parsing. 97
4.4.3 Merging Dependency relations. 98
4.4.4 Handling unsupervised method to integrate the Knowledge in WSD based on cognitive approach. 99
4.5 Proposed Method [METHOD-3] : Knowledge Extraction using Dynamical updating of Representation Model. 100
4.6 WordNet (Data source used in WSD Research). 103
4.7 Proposed Design of WSD Algorithm and program implementation procedures. 106
Chapter 5 Results and Discussions. 120
5.0 Evaluation of WSD systems. 120
5.1 Assumptions or criteria aimed before Evaluation of Integrated Framework of WSD systems: 120
5.2 Results: Experimental Setup for Evaluation of WSD or Disambiguation System. 122
5.3 Error Analysis using Decision Lists form basic evaluation procedure in WordNet. 122
5.3.1 Evaluation of senses from WordNet. 123
5.4 Determination of clue words by disambiguator 125
5.4.1 Knowledge gathering from contexts of seed words. 127
5.5 Establishing the upper and lower bounds of effectiveness 127
5.5.1 The similarity Measure 128
5.6 Knowledge gathering in WSD system – An evaluation. 132
5.6.1 WordNet Sense Tag Evaluation. 132
5.6.2 Multi perspective Analysis of Collocations in Knowledge store (WordNet) Vs Senseval-II data 133
5.6.3 Evaluation by designing co–occurrence matching filter. 133
5.6.4 Algorithm of co-occurrence matching filter. 133
5.6.5 Performance of Integrated WSD co-occurrence Vs SENSEVAL II data: 135
5.6.6 Inference with two plots of WSD co-occurrence. 136
5.7 Frequency Matching 136
5.8 Evaluation of Combined Multiple Knowledge source. 138
5.8.1 Overall WSD Evaluation. 139
5.9 Evaluation of WSD Task Definition in Supervised WSD Method 140
5.9.1 Evaluation of sample sense tagged Text. 140
5.9.2 Evaluation Example of Two Bags of Words (Co-occurrences in the “window of context”) 141
5.10 Evaluation of Simple supervised approach. 141
5.10.1 Evaluation of Supervised Methodology 141
5.10.2 Evaluation Example from text to feature vectors. 142
5.10.3 Supervised learning Algorithms applied to feature vectors. 142
5.11 Evaluation by Naïve Bayesian Classifier (introduction) 143
5.11.1 Bayesian Inference 143
5.11.2 Evaluation of right sense with Naïve Bayesian Classifier 144
5.11.3 Evaluation procedures for Decision List in WSD. 145
5.11.4 Evaluating and Building the Decision List score: an example. 145
5.12 Evaluation of feature sets. 146
5.13 Evaluation of Ensemble Considerations. 146
5.14 Evaluation of bootstrapping of WSD classifier. 147
5.15 Evaluation of Co-training / Self-training WSD example. 147
5.16 Experiments with Co-training / Self-training for WSD. 148
5.17 Evaluation of Bootstrapping Algorithm. 148
5.18 Multi perspective Evaluation of senses. 149
5.18.1 Evaluation of unsupervised learning. 149
5.18.2 Example set for Evaluation of cluster. 149
5.18.3 Evaluation of unsupervised WSD. 150
5.18.4 Experimental Evaluation and results. 150
5.19 Evaluation of context representation. 151
5.19.1 Second Order Context Representation Example 151
5.20 The overall effectiveness of integrated approach involving content, context and common knowledge in resolving word sense disambiguation problems. 152
Chapter 6 Combining Multiple Knowledge sources 153
6.0 Knowledge characteristics. 153
6.1 Knowledge Management. 154
6.2 Knowledge creation or knowledge acquisition. 156
6.3 Knowledge Management – Integration framework helps to build enterprise solutions. 159
6.4 Beneficiaries of Knowledge Integration (using WSD cognitive approach). 160
6.5 Benefitting Managerial Issues by using integrated WSD approach of Knowledge Management. 160
6.6 Knowledge-based Content Navigation in e-Learning Applications (WSD approach). 162
6.7 Dynamic Knowledge Discovery and Representation. 163
6.8 Adaptive Knowledge-based Content Navigation. 164
6.9 Automatic extraction of knowledge from the web. 167
6.10 Instructional design and learning knowledge objects. 168
6.11 The Knowledge Puzzle integrated WSD Approach for the Creation of Learning Knowledge Objects 170
Chapter 7 WSD Future Scope of research 172
7.1 Conclusion 176
References 179
Appendix
List of Figures
Figure No. Title Page No.
1.1 Overview of WSD Systems 03
1.2 Example of an IR system. 04
3.1a. Machine Learning approach (WSD). 37
3.1b. Knowledge Integration Framework 3C (content, context, common knowledge) for WSD. 40
3.2 Goals In communication: Production (speaker side), Comprehension (hearer side) View. 45
3.3 Modular Comprehension. 54
3.4 DFD / Functional View of WSD based Paradigm IR system 57
3.5 Conceptual representation of Domain Adaptation, Pre-processing, supervised and Unsupervised methods of WSD. 60
3.6 Structure of an Expert System (WSD). 62
3.7 Advanced Text Pre-processing Model. 65
3.8 Steps involved in new indexing method (for large data sets). 67
3.9 Morphological Analysis 69
3.10 Co-occurrence Determination for “speaker” example 73
3.11 Effectiveness in Re-training Method. 75
3.12 Natural Language Generator architecture (WSD). 76
3.13 Conceptual view of Modern Retrieval system. 76
4.0 Methodology used in integrated framework for WSD. 78
4.1 Interaction between WSD Expert system with Language Models. 79
4.2 Centralized data source in an Expert system (WSD). 81
4.3 Extension of IR systems to WSD Retrieval systems. 85
4.4 Architecture for information extraction and WSD. 85
4.5 Information extraction (cognitive approach) 87
4.6 Disambiguation Task Definition. 87
4.7 Sense Tagger Architecture in WSD (Interaction of knowledge sources in WSD). 89
4.8 Sense Learner, Learning Mechanism or Model. 90
4.9 WSD classifier. 90
4.10 Preparing the training corpus (cognitive approach). 91
4.11 Context Knowledge Acquisition Process. 93
4.12 Cognitive pattern Acquisition process. 95
4.13 Selection of best cognitive senses. 99
4.14 Knowledge Extraction using Dynamical updating of Representation Model. 100
4.15 WordNet 2.0 entry for the noun. 103
4.16 Noun relations in WordNet. 104
4.17 Verb relations in WordNet. 104
4.18 Adjective and adverb relations in WordNet. 104
4.19 Hyponymy chain for the noun valley. 105
4.20 A fragment of WordNet. 106
4.21 Overview of disambiguation program implementation. 109
4.22 WSD algorithm - call graph. 110
4.23 Pseudocode of obtaining sense divisions in supervised way. 111
4.24 Pseudocode of Sense_Divisions. 111
4.25 Pseudocode of Training Accuracy. 112
4.26 Pseudocode for Pruning. 113
4.27 Creation of decision list pseudocode. 114
4.28 Pseudocode to find Accuracy. 115
4.29 converge pseudocode. 115
4.30 correct_tagged_sets. 116
4.31 Pseudocode to disambiguate untagged_set. 116
4.32 Pseudocode for Best Match 116
4.33 Creation of decision list based on current sets. 118
4.34 Pseudocode of Parts of speech tagging. 118
4.35 Pseudocode of extract_features. 119
5.1a. Accuracy parameters. 120
5.1b. Multi perspective Evaluation of Integrated Framework of WSD system. 122
5.2 Fragment of WordNet Hierarchy. 123
5.3 A sense of the word ‘assembly’ from WordNet. 124
5.4 Accuracy of disambiguator against the number of clue words. 126
5.5 Upper and lower bound plots based on retrieval effectiveness. 128
5.6 Sample training test data paragraph in XML format. 131
5.7 Performance of Integrated WSD Vs SENSEVALII data. 136
5.8 WSD co-occurrence Vs All words participants. 137
5.9 WSD’s Frequency Vs SENSEVAL Data. 138
5.10 Frequency Matching in WSD Vs SENSEVAL of all words Data. 152
6.1 Example of – Knowledge Management Systems (WSD). 155
6.2 Multi perspective Knowledge integration framework. 159
6.3 Knowledge Representation Framework 165
6.4 Automatic Extraction of knowledge from web. 167
6.5 Combined Knowledge Platform Functional overview. 168
6.6 Bloom's taxonomy of knowledge verb representation. 168
6.7 Semantic Web Rule Language. 169
6.8 Mapping between KPCM and SCORM. 170
List of Tables
Table No. Title Page No.
3.1 Faceted classification of Information Retrieval systems. 56
3.2 Comparison of Data Processing and Knowledge Engineering. 65
3.3 Stemmers, confidence, score. 68
3.4 Textual Entailment comparison examples. 71
5.1 The accuracy of the disambiguator against two random selection and most common Strategies. 124
5.2 5 senses evaluated by disambiguator for word ‘assembly’. 124
5.3 Accuracy against the number of clue words to be used by disambiguator. 125
5.4 Pseudo words used to find accuracy of disambiguator. 126
5.5 Evaluation of seed words in example contexts. 127
5.6 Sense, score (confidence determination). 128
5.7 Combining the output of three manual taggers. 128
5.8 Comparative analysis of two disambiguation systems. 129
5.9 Multi perspective analysis of integrated framework. 130
5.9a Three senses of word ‘speaker’ Evaluated by disambiguation system compared with WordNet Knowledge store. 130
5.10 Performance of Disambiguator-collocation with SENSEVAL II data. 131
5.11 Performance of Integrated WSD co-occurrence Vs SENSEVAL II data: 133
5.12 Performance WSD system’s frequency Vs SENSEVAL data 135
5.13 Sense tagged text. 137
5.14 Co-occurrences in the ‘window of context’ 140
5.15 Steps in Evaluation of Supervised methodology with example. 140
5.16 Text sense into Feature Vectors determination. 141
5.16a EXAMPLE sentence of text with different senses. 141
5.20 Example procedures to implement decision list 145
5.21 Decision List Score example. 146
5.23 Steps in co-training / self-training for WSD. 147
5.24 Bootstrapping Algorithm 148
5.25 Cluster this Data 148
5.22 Experimental evaluation and Results from WordNet 151
5.26 Evaluation of second order context representation 148
5. Overall effectiveness of integrated knowledge approach 151